Investigating data sharing in speech recognition for an underresourced language: the case of algerian dialect - Department of Natural Language Processing & Knowledge Discovery
Conference paper, Year: 2021

Investigating data sharing in speech recognition for an underresourced language: the case of algerian dialect

Abstract

Arabic has many varieties: its standard form, Modern Standard Arabic (MSA), and its spoken forms, the dialects. These dialects are representative examples of under-resourced languages, for which automatic speech recognition remains an unresolved problem. To address this, we recorded several hours of spoken Algerian dialect and used them to train a baseline model. We then improved this model by exploiting other languages that influence the dialect: we integrated their data into one large corpus and investigated three approaches, namely multilingual training, multitask learning, and transfer learning. The best performance was achieved with a limited and balanced amount of acoustic data from each additional language, relative to the amount of data available for the studied dialect. This approach improved the word error rate by 3.8% compared with the baseline system trained only on the dialect data.
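
The abstract reports its result as a word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name `word_error_rate` and the sample strings are illustrative assumptions; this is not the authors' scoring tool, and real evaluations usually normalize text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Illustrative sketch only: splits on whitespace and applies no text
    normalization (casing, punctuation, orthographic variants).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up strings: one substitution and one deletion over four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.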
Main file
ArticleMENACER_SMAILI.pdf (639.07 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03137048, version 1 (10-02-2021)

Identifiers

  • HAL Id: hal-03137048, version 1

Cite

Mohamed Amine Menacer, Kamel Smaïli. Investigating data sharing in speech recognition for an underresourced language: the case of algerian dialect. 7th International Conference on Natural Language Processing - NATP 2021, Mar 2021, Vienna, Austria. ⟨hal-03137048⟩
