Investigating data sharing in speech recognition for an underresourced language: the case of algerian dialect

Mohamed Amine Menacer; Kamel Smaïli

Conference paper, Year: 2021

Mohamed Amine Menacer

Role: Author
PersonId : 14275
IdHAL : mohamed-amine-menacer
IdRef : 25240937X

Statistical Machine Translation and Speech Modelization and Text

Kamel Smaïli

Role: Author
PersonId : 2521
IdHAL : kamel-smaili
IdRef : 034429700

Statistical Machine Translation and Speech Modelization and Text

Abstract

Arabic has many varieties: its standard form, Modern Standard Arabic (MSA), and its spoken forms, the dialects. These dialects are representative examples of under-resourced languages, for which automatic speech recognition remains an unresolved problem. To address this, we recorded several hours of spoken Algerian dialect and used them to train a baseline model. We then improved this model by exploiting other languages that influence the dialect, both by merging their data into one large corpus and by investigating three approaches: multilingual training, multitask learning, and transfer learning. The best performance was achieved with a limited amount of acoustic data from each additional language, balanced against the amount of dialect data. This approach improved the word error rate by 3.8% over the baseline system trained only on the dialect data.
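The abstract reports its result as a word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the number of word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words. The function name `word_error_rate` and the toy strings are illustrative assumptions; this is not the authors' evaluation code, and the record does not describe their scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Illustrative helper (hypothetical name), using whitespace tokenization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Toy placeholder transcripts, not data from the paper.
wer = word_error_rate("a b c d", "a x c")  # one substitution and one deletion over 4 words
```

Because WER is a ratio, a difference between two systems can be quoted either in absolute points or relative to the baseline; the abstract does not say which convention the 3.8% figure uses.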

Keywords

Automatic speech recognition, Algerian dialect, MSA, Multilingual training, Multitask learning, Transfer learning

Domains

Computation and Language [cs.CL]

Main file

ArticleMENACER_SMAILI.pdf (639.07 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-03137048

Submitted on: Wednesday, February 10, 2021, 10:16:41

Last modified on: Monday, September 11, 2023, 17:41:19

Long-term archiving on: Tuesday, May 11, 2021, 18:22:02

Dates and versions

hal-03137048 , version 1 (10-02-2021)

Identifiers

HAL Id : hal-03137048 , version 1

Cite

Mohamed Amine Menacer, Kamel Smaïli. Investigating data sharing in speech recognition for an underresourced language: the case of algerian dialect. 7th International Conference on Natural Language Processing - NATP 2021, Mar 2021, Vienna, Austria. ⟨hal-03137048⟩

Collections

CNRS INRIA GRID5000 UNIV-LORRAINE LORIA LORIA-NLPKD SILECS
