To overpass the disparity between theory and applications in language-related technology in the text as well as speech and several other areas, a well-designed and well-developed corpus is essential. Several problems and issues encountered while developing a corpus, especially for low resource languages. The Malayalam Speech Corpus (MSC) is one of the first open speech corpora for Automatic Speech Recognition (ASR) research to the best of our knowledge. It consists of 250 hours of Agricultural speech data. We are providing a transcription file, lexicon, and annotated speech along with the audio segment. It is available in future for public use upon request at "www.iiitmk.ac.in/vrclc/utilities/ml speechcorpus". This paper details the development and collection process in the domain of agricultural speech corpora in the Malayalam Language.
Figures - uploaded by Lekshmi K.R
Author content
All figure content in this area was uploaded by Lekshmi K.R
Content may be subject to copyright.
Discover the world's research
- 20+ million members
- 135+ million publications
- 700k+ research projects
Join for free
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation, pages 25–28
Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020
c
European Language Resources Association (ELRA), licensed under CC-BY-NC
25
Malayalam Speech Corpus: Design and Development for Dravidian Language
Lekshmi.K.R, Jithesh V S, Elizabeth Sherly
Research Scholar, Senior Linguist, Senior Professor
Bharathiar University, IIITM-K, IIITM-K
lekshmi.kr@iiitmk.ac.in, jithesh.vs@iiitmk.ac.in, sherly@iiitmk.ac.in
Abstract
To overpass the disparity between theory and applications in language-related technology in the text as well as speech and several other
areas, a well-designed and well-developed corpus is essential. Several problems and issues encountered while developing a corpus,
especially for low resource languages. The Malayalam Speech Corpus (MSC) is one of the first open speech corpora for Automatic
Speech Recognition (ASR) research to the best of our knowledge. It consists of 250 hours of Agricultural speech data. We are providing
a transcription file, lexicon and annotated speech along with the audio segment. It is available in future for public use upon request at
"www.iiitmk.ac.in/vrclc/utilities/ml speechcorpus". This paper details the development and collection process in the domain of agricul-
tural speech corpora in the Malayalam Language.
Keywords: Malayalam, ASR, Agricultural Speech corpus, Narrational and Interview Speech Corpora
1. Introduction
Malayalam is the official language of Kerala, Lakshad-
weep, and Mahe. From 1330 million people in India,
37 million people speak Malayalam ie; 2.88% of Indians.
(Wikipedia contributors, 2020). Malayalam is the youngest
of all languages in the Dravidian family. Four or five
decades were taken for Malayalam to emerge from Tamil.
The development of Malayalam is greatly influenced by
Sanskrit also.
In the Automatic Speech Recognition (ASR) area many
works are progressing in highly and low-resourced lan-
guages. The present speech recognition system has
achieved a 'Natural' degree of accuracy mainly in Standard
American English (Xiong et al., 2016). The accurate recog-
nition of speech exists only for highly resourced languages.
But it is still lagging for "non-native" speakers. To increase
the accuracy of such an ASR system the speech data for
low- resource language like Malayalam is to be increased.
To encourage the research on speech technology and its
related applications in Malayalam, a collection of speech
corpus is commissioned and named as Malayalam Speech
Corpus (MSC). The corpus consists of the following parts.
•200 hours of Narrational Speech named NS and
•50 hours of Interview Speech named IS
The raw speech data is collected from "Kissan Kr-
ishideepam" an agriculture-based program in Malayalam
by the Department of Agriculture, Government of Kerala.
The NS is created by making a script during the post-
production stage and dubbed with the help of people in dif-
ferent age groups and gender but they are amateur dubbing
artists. The speech data is thoughtfully designed - for var-
ious applications like code mixed language analysis, Auto-
matic Speech Recognition (ASR) related research, speaker
recognition – by considering sociolinguistic variables.
This paper represents the development of Narrational and
Interview Speech corpora (NS and IS) collected from na-
tive Malayalam speakers. The literature survey of different
speech corpora creation is detailed in section 2. Section 3
describes the design and demographics of speech data. The
section 4 continues with transcription and section 5 deals
with lexicon of the speech data and paper concludes with
section 6.
2. Literature Survey
Many languages have developed speech corpus and they
are open source too. The English read speech corpus is
freely available to download for research purposes (Koh et
al., 2019) (Panayotov et al., 2015). Similarly, a database
is made available with the collection of TED talks in the
English language (Hernandez et al., 2018). Databases are
available for Indian languages on free download and a pay-
ment basis also. For the Malayalam language-based emo-
tion recognition, a database is available (Rajan et al., 2019).
The corpus collection of low resourced languages is a good
initiative in the area of ASR. One of such work is done on
Latvian language (Pinnis et al., 2014). They created 100
hours of orthographically transcribed audio data and anno-
tated corpus also. In addition to that a four hours of phonet-
ically transcribed audio data is also available. The authors
presented the statistics of speech corpus along with criteria
for design of speech corpus.
South Africa has eleven official languages. An attempt is
made for the creation of speech corpora on these under re-
sourced languages (Barnard et al., 2014). A collection of
more than 50 hours of speech in each language is made
available. They validated the corpora by building acoustic
and language model using KALDI.
Similarly speech corpora for North-East Indian low-
resourced languages is also created (Hernandez et al.,
2018). The authors collected speech and text corpora on
Assamese, Bengali and Nepali. They conducted a statisti-
26
cal study of the corpora also.
3. The Speech Corpora
A recording studio is setup at our visual media lab with
a quiet and sound proof room. A standing microphone
is used for recording NS corpora. IS corpora is collected
directly from the farmers using recording portable Mic at
their place. Hundred speakers are involved in the recording
of NS and IS corpora.
3.1. Narrational and Interview Speech Corpora
The written agricultural script, which is phonetically bal-
anced and phonetically rich (up to triphone model), was
given to the speakers to record the Narrational Speech.
Scripts were different in content. An example script is pro-
vided in Fig:1. They were given enough time to record the
data. If any recording issues happened, after rectification
by the recording assistant it was rerecorded.
Figure 1: Example of script file for dubbing
The Narrational Speech is less expensive than Interview
Speech because it is difficult to get data for the ASR system.
The IS data is collected in a face-to-face interview style.
The literacy and the way to communicate information flu-
ently have given less focus. The interviewee with enough
experience in his field of cultivation is asked to speak about
his cultivation and its features. The interviewer should be
preferably a subject expert in the area of cultivation. Both
of them are given separate microphones for this purpose.
Few challenges were faced during the recording of the
speech corpus. There were lot of background noise like
sounds of vehicles, animals, birds, irrigation motor and
wind. Another main issue that happened during post pro-
duction is the difference in pronunciation styles in the In-
terview Speech corpora collection. This caused difficulty
during validation of the corpus. The recording used to ex-
tend up to 5-6 hours depending on speakers. The recorded
data is then given for post-production to clean unwanted
information from that.
3.2. Speaker Criteria
We have set a few criteria for recording the Narrational
Speech data.
•The speakers are at minimum age of 18
•They are citizens of India
•Speakers are residents of Kerala
•The mother tongue of the speaker should be Malay-
alam without any specific accents
3.3. Recording Specifications
Speech data is collected with two different microphones for
NS and IS. For Narrational Speech, Shure SM58-LC car-
dioid vocal microphone without cable is used. For IS, we
utilized Sennheiser XSW 1-ME2-wireless presentation mi-
crophone of range 548-572 MHz Steinberg Nuendo and Pro
Tools are used for the audio post-production process.
The audio is recorded in 48 kHz sampling frequency and
16 bit sampling rate for broadcasting and the same is down
sampled to 16 kHz sampling frequency and 16 bit sampling
rate for speech-related research purposes. The recordings
of speech corpora are saved in WAV files.
3.4. Demographics
MSC aims to present a good quality audio recording for
speech related research. The NS and IS corpus have both
male and female speakers. In NS, the male and female
speakers are made up with 75% and 25% respectively. IS
have more male speakers than females with 82% and 18%
of total speakers.The other demographics available from
the collected data are Community, Place of Cultivation and
Type of Cultivation.
Category NS (%) IS (%)
Hindu 85 51
Christian 10 35
Muslim 05 14
Total 100 100
Table 1: Demographic details of speakers by community
Table 2 and 3 contains the details of the place of cultivation
and the type of cultivation in Kerala.
27
Place of Cultivation (District wise) IS(%)
Thiruvananthapuram 26
Kollam 21
Pathanamthitta 02
Ernakulam 07
Alappuzha 08
Kottayam 08
Idukki 09
Thrissur 12
Wayanad 03
Kozhikode 02
Kannur 02
Total 100
Table 2: Demographic details of speakers by place of culti-
vation
Type of Cultivation IS (%)
Animal Husbandry 10
Apiculture 11
Diary 16
Fish and crab farming 05
Floriculture 07
Fruits and vegetables 22
Horticulture 04
Mixed farming 07
Organic farming 08
Poultry 07
Terrace farming 03
Total 100
Table 3: Demographic details of speakers by type of culti-
vation
4. Transcription
The NS and IS corpora are transcribed orthographically into
Malayalam text. The transcribers are provided with the au-
dio segments that the speaker read. Their task is to tran-
scribe the content of the audio into Malayalam and into
phonetic text. A sample of three transcribed data with de-
mographic details is shown below and the annotated speech
of first two sentences is depicted in Fig 2.
Figure 2: An example of Annotated Speech Corpora
Sample 1:Record Entry No : 180220 01 01
In the first sample a Narrational Speech is detailed. The
narrator is about 45 years old and he is describing the de-
tails about Palakkad a district in Kerala and a mango estate
there. Few sentences are displayed below.
Sentence 1:
<Without saying we can understand that it is Palakkad>
Sentence 2:
<Kerala's Castle door>
Sentence 3:
<Selam dharmapuri and krishnagiri are the birthplaces of
Malgova>
Sample 2: Record Entry 2: 180220 02 01
The sample shown below is an Interview Speech. The in-
terviewer is an agriculture officer of age 50 and interviewee
is the owner of farm about 55 years old.
Sentence 1:
<Do you think you could fulfill what you have wished
or envisioned from the desert, here in your homeland,
Kerala?>
Sentence 2:
28
<Definitely we could. What we have planted here by our-
selves blossomed, bore fruit, relished it and shared it with
our dear ones>
5. Lexicon
The pronunciation dictionary, called Lexicon contains a
collection of unique 4925 words. The audio collection pro-
cess is still going on which will increase the lexicon size.
The lexicon consists of word and its corresponding phone-
mic and syllabic representation as in the example shown in
Fig 3.
Figure 3: Example of the lexicon
6. Conclusion
Speech is the primary and natural mode of communication
than writing. It is possible to extract more linguistic in-
formation from speech than text like emotions and accent.
Speech related applications are more useful for illiterate
and old people. The articulatory and acoustic information
can be obtained from a good audio recording environment.
One of the important features of speech data is that, there
is less interference from a second party compared to textual
data.
To encourage the academic research in speech related appli-
cations, a good number of multilingual and multipurpose
speech corpora for Indian languages is required. The re-
sponsibility to develop such corpora still lies on the shoul-
der of the concerned researcher. Also the role of language
corpora is very significant to preserve and maintain the lin-
guistic heritage of our country.
The release of MSC will be one of the first speech cor-
pora of Malayalam, contributing 200 hours of Narrational
Speech and 50 hours of Interview Speech data for public
use. The lexicon and annotated speech is also made avail-
able with the data. Future work includes creation of corpora
related to tourism and entertainment domains and enhance-
ment of quality of speech by building an ASR using KALDI
toolkit. The updates on corpus will be accessible through
"www.iiitmk.ac.in/vrclc/utilities/ml speechcorpus".
Acknowledgements
This research is supported by the Kerala State Council for
Science, Technology and Environment (KSCSTE). I thank
KSCSTE for funding the project under the Back-to-lab
scheme. I also thank Agri team, Indian Institute of Informa-
tion Technology and Management-Kerala, Kissan Project
for collecting the audio data.
Bibliographical References
Barnard, E., Davel, M. H., van Heerden, C., De Wet, F., and
Badenhorst, J. (2014). The nchlt speech corpus of the
south african languages. In Workshop Spoken Language
Technologies for Under-resourced Languages (SLTU).
Hernandez, F., Nguyen, V., Ghannay, S., Tomashenko, N.,
and Est`
eve, Y. (2018). Ted-lium 3: twice as much data
and corpus repartition for experiments on speaker adap-
tation. In International Conference on Speech and Com-
puter, pages 198–208. Springer.
Koh, J. X., Mislan, A., Khoo, K., Ang, B., Ang, W., Ng, C.,
and Tan, Y.-Y. (2019). Building the singapore english
national speech corpus. Malay, 20(25.0):19–3.
Panayotov, V., Chen, G., Povey, D., and Khudanpur, S.
(2015). Librispeech: an asr corpus based on public do-
main audio books. In 2015 IEEE International Con-
ference on Acoustics, Speech and Signal Processing
(ICASSP), pages 5206–5210. IEEE.
Pinnis, M., Auzina, I., and Goba, K. (2014). Designing
the latvian speech recognition corpus. In LREC, pages
1547–1553.
Rajan, R., Haritha, U., Sujitha, A., and Rejisha, T. (2019).
Design and development of a multi-lingual speech cor-
pora (tamar-emodb) for emotion analysis. Proc. Inter-
speech 2019, pages 3267–3271.
Wikipedia contributors. (2020). Malayalam — Wikipedia,
the free encyclopedia. [Online; accessed 21-February-
2020].
Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M.,
Stolcke, A., Yu, D., and Zweig, G. (2016). Achiev-
ing human parity in conversational speech recognition.
arXiv preprint arXiv:1610.05256.
ResearchGate has not been able to resolve any citations for this publication.
The NCHLT speech corpus contains wide-band speech from approximately 200 speakers per language, in each of the eleven official languages of South Africa. We describe the design and development processes that were undertaken in order to develop the corpus, and report on associated materials such as orthographic transcriptions and pronunciation dictionaries that were released as part of the corpus. In order to benchmark speechrecognition performance on the corpus, we have also developed both phone-recognition and word-recognition systems for all eleven languages; we find that high accuracies can be achieved for these speaker-independent but vocabulary-dependent recognition tasks in all languages.
In this paper the authors present the first Latvian speech corpus designed specifically for speech recognition purposes. The paper outlines the decisions made in the corpus designing process through analysis of related work on speech corpora creation for different languages. The authors provide also guidelines that were used for the creation of the Latvian speech recognition corpus. The corpus creation guidelines are fairly general for them to be re-used by other researchers when working on different language speech recognition corpora. The corpus consists of two parts – an orthographically annotated corpus containing 100 hours of orthographically transcribed audio data and a phonetically annotated corpus containing 4 hours of phonetically transcribed audio data. Metadata files in XML format provide additional details about the speakers, noise levels, speech styles, etc. The speech recognition corpus is phonetically balanced and phonetically rich and the paper describes also the methodology how the phonetical balancedness has been assessed.
Ted-lium 3: twice as much data and corpus repartition for experiments on speaker adaptation
- F Hernandez
- V Nguyen
- S Ghannay
- N Tomashenko
- Y Estève
Hernandez, F., Nguyen, V., Ghannay, S., Tomashenko, N., and Estève, Y. (2018). Ted-lium 3: twice as much data and corpus repartition for experiments on speaker adaptation. In International Conference on Speech and Computer, pages 198-208. Springer.
- W Xiong
- J Droppo
- X Huang
- F Seide
- M Seltzer
- A Stolcke
- D Yu
- G Zweig
Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D., and Zweig, G. (2016). Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256.
Source: https://www.researchgate.net/publication/342260329_Malayalam_Speech_Corpus_Design_and_Development_for_Dravidian_Language
Posted by: paulpaulelberter0253153.blogspot.com