The core funding from the Government of India



Date: 04.06.2017












  • The core funding from the Government of India.

  • All activities will be in a project mode.

  • Will attempt to leverage expertise already available to cut avoidable cost and delay.

  • All staff will be on contract.

  • All receipts and payments through internet gateways, or through conventional means, will go to the Consolidated Fund.

  • However, the Government will release grants to the Consortium as required. If need be, the support will be extended beyond the initial six-year period.

  • As the nodal agency, CIIL will further distribute the relevant funding for specific sub-components of the scheme to other academic institutions.

  • An annual progress report will be submitted to the government.






  • Establishing standards

  • Creating language resources

  • Annotating language data

  • Building systems/helping system building

  • Creating human resources

  • Coordinating language resource development activities




  • Creation of different kinds of corpora, including pathological speech and historical/inscriptional databases

  • Natural Language Processing

  • Speech Recognition and Synthesis

  • Character Recognition

  • By-products such as word finders, lexicons of different kinds, thesauri, usage compilations, etc.






  • Frequency analyzers for character, word, sentence.

  • KWIC and KWOC retrievers.

  • Tool for automatic transliteration from Indian language scripts to Roman and vice versa: Kannada, Tamil, Telugu, Assamese, Bengali, Manipuri, Malayalam, Punjabi, Oriya, Gujarati.

  • Parallel corpora tools for text alignment, including sentence alignment tool and chunk alignment tool as well as an interface for aligning corpora.

  • Tools for

      • Morphological analysis
      • POS tagging
      • Semantic tagging
      • Syntactic tree bank
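
The first two tool types in the list above, frequency analyzers and KWIC (keyword in context) retrievers, can be illustrated with a minimal Python sketch. This is not the LDC-IL implementation. The function names, the whitespace tokenization and the sample sentence are assumptions made for illustration. Characters are counted per Unicode code point, so a real tool for Indic scripts would likely need akshara-level (grapheme cluster) segmentation instead.

```python
from collections import Counter

PUNCT = ".,;:!?\"'()"  # assumed punctuation set, for illustration only


def char_frequencies(text):
    """Count every non-whitespace character (Unicode code point) in the text."""
    return Counter(ch for ch in text if not ch.isspace())


def word_frequencies(text):
    """Count whitespace-separated tokens after stripping surrounding punctuation."""
    tokens = (tok.strip(PUNCT) for tok in text.split())
    return Counter(tok.lower() for tok in tokens if tok)


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.strip(PUNCT).lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text
sample = "The corpus is large. The corpus is annotated, and the corpus is shared."
print(word_frequencies(sample).most_common(3))
for left, key, right in kwic(sample, "corpus", width=2):
    print(f"{left:>20} [{key}] {right}")
```

A KWOC (keyword out of context) listing would use the same matches but print the keyword as a separate heading before each full line rather than centring it within its context.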



  • Task 1: Hierarchical POS Tag set

  • Task 2: Dictionary - (a) closed class words and (b) open class words

  • Task 3: Morphological analyzer and generator

  • Task 4: Manual POS annotation and development of an automatic tagger

  • Task 5: Semantic tagging

  • Task 6: Chunker

  • Task 7: Tree banking

  • Task 8: Shallow parser, which will eventually turn into a deep parser
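
To make the idea of a hierarchical POS tag set (Task 1) concrete, the sketch below represents tags as paths from a coarse category to a finer subtype. The tag names, the dotted notation and the helper functions are hypothetical. They are not the LDC-IL tag set, which is not specified here.

```python
# Hypothetical two-level tag hierarchy; labels are illustrative only.
TAGSET = {
    "N": {"common": {}, "proper": {}},    # noun subtypes
    "V": {"main": {}, "auxiliary": {}},   # verb subtypes
}


def is_valid(tag):
    """Check that a dotted tag such as 'N.proper' is a path in TAGSET."""
    node = TAGSET
    for part in tag.split("."):
        if part not in node:
            return False
        node = node[part]
    return True


def coarsen(tag, depth=1):
    """Truncate a dotted tag to its first `depth` levels."""
    return ".".join(tag.split(".")[:depth])


# Hypothetical manually annotated tokens
tagged = [("Mysore", "N.proper"), ("hosts", "V.main"), ("linguists", "N.common")]
print(all(is_valid(tag) for _, tag in tagged))
print([(word, coarsen(tag)) for word, tag in tagged])
```

One reason to prefer a hierarchy of this kind is that data annotated with fine-grained tags can still be used to train or evaluate a tagger at a coarser level by truncating each tag.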




  • Lexical studies

  • Semantics

  • Pragmatics & Discourse analysis

  • Sociolinguistics

  • Dialectology & Variation studies

  • Stylistics

  • Language teaching

  • Historical linguistics

  • Psycholinguistics

  • Social psychology

  • Cultural studies




    • Develop tools that facilitate collection of high quality speech data
    • Collect data that can be used for building speech recognition and speech synthesis systems, and for providing speech-to-speech translation from one language to another language spoken in India (including Indian English).
    • As with text corpora, the main efforts for speech corpora are on the engineering side. Apart from data for these applications, efforts shall also be made to collect:
        • Child language corpora
        • Pathological speech/language data
        • Speech error data



  • Speech Recognition and Speech Synthesis

  • Speech to Speech translation for a pair of Indian languages

  • Command and control applications

  • Multimodal interfaces to the computer in Indian languages

  • E-mail readers over the telephone

  • Readers for the visually disadvantaged

  • Speech-enabled office suite, etc.




  • Phonetically Balanced Vocabulary

  • Phonetically Balanced Sentences

  • Connected Text created using phonetically balanced vocabulary

  • Date Format

  • Command and Control Words

  • Proper Nouns: 500 place names and 500 person names

  • Most Frequent Words: 1000

  • Form and Function Words

  • News domain: news, editorial, essay - each text not less than 500 words




  • Data will be collected from a minimum of 300 speakers (150 male and 150 female) of each language. In addition, natural discourse data from various domains shall be collected for Indian languages for research into spoken language.

  • Data for speech synthesis shall be collected from a limited number of speakers (3 male and 3 female) in a studio environment. The speakers shall have very good voice quality and be professional voice artists or media announcers.




  • Annotation of data:

  • Data to be used for speech recognition shall be annotated at the phoneme, syllable, word and sentence levels.

  • Data to be used for speech synthesis shall be annotated at the phone, phoneme, syllable, word and phrase levels.

  • Annotation tools:

  • Tools will be developed for semiautomatic annotation of speech data. These tools will also be useful for annotating speech synthesis databases.
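
Annotation at several levels, as listed above, is commonly stored as parallel tiers of time-aligned intervals. The sketch below shows one possible representation, with two checks that a semiautomatic annotation tool might run. The class, the function names, the timings and the syllable split are all hypothetical and are not taken from any LDC-IL tool.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording
    label: str


def tier_is_consistent(intervals):
    """True if every interval has positive length and none overlaps the previous one."""
    previous_end = 0.0
    for iv in intervals:
        if iv.start < previous_end or iv.end <= iv.start:
            return False
        previous_end = iv.end
    return True


def nested_in(child_tier, parent_tier):
    """True if every child interval lies inside some parent interval."""
    return all(
        any(p.start <= c.start and c.end <= p.end for p in parent_tier)
        for c in child_tier
    )


# Hypothetical annotation of a single word at two of the levels
annotation = {
    "word": [Interval(0.00, 0.40, "namaste")],
    "syllable": [
        Interval(0.00, 0.12, "na"),
        Interval(0.12, 0.26, "mas"),
        Interval(0.26, 0.40, "te"),
    ],
}
print(all(tier_is_consistent(tier) for tier in annotation.values()))
print(nested_in(annotation["syllable"], annotation["word"]))
```

Further tiers (phone, phoneme, phrase or sentence) would be added as extra keys, with the same nesting check applied between each pair of adjacent levels.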






  • Northern India: Delhi, 1st year

  • Southern India: Mysore, 2nd year

  • North-eastern India: Shillong, 3rd year

  • Western India: Ichalkaranji, 4th year

  • Eastern India: Kolkata, 5th year




  • Development of standards, tools and linguistic resources (datasets) for the fields of Online HWR, Offline HWR and OCR.

  • Promotion of development of these technologies.

  • Promotion of development of important and challenging applications of these technologies in the context of Indic languages and scripts.




  • Creation of frequency dictionaries - five per year

      • First year: Bengali, Hindi, Kannada, Manipuri, Urdu.
      • Second year: Bodo, Dogri, Maithili, Nepali, Konkani.
      • Third year: Assamese, Gujarati, Oriya, Punjabi, Tamil.
      • Fourth year: Kashmiri, Malayalam, Marathi, Sanskrit, Santali.
      • Fifth year: other languages.
  • Multilingual, multidirectional dictionary - an ongoing process

  • Aiding wordnet creation and collaborating with others for the same - an ongoing process






  • The data that LDC-IL creates and obtains has to be evaluated. For each kind of data, tool, etc., evaluation metrics have to be developed, along with benchmarks and good standards. Within one year, this shall be accomplished for the first set of tools. In the following year(s), the same shall be developed for the other data and tools.




  • Above all, and in addition to what is projected in its roadmap, LDC-IL will respond positively to the specific language data needs of individuals, institutions and industry by taking up their requests on a priority basis for licensing purposes. Initially, derivatives of the databases shall be licensed; once all licensing issues are resolved, the databases themselves shall also be licensed.

















          • Interns LDC-




