Automatic diacritization of Arabic text using recurrent neural networks

This paper presents a sequence transcription approach for the automatic diacritization of Arabic text. A recurrent neural network is trained to transcribe undiacritized Arabic text into fully diacritized sentences. We use a deep bidirectional long short-term memory network that builds high-level linguistic abstractions of text and exploits long-range context in both input directions. This approach differs from previous approaches in that no lexical, morphological, or syntactical analysis is performed on the data before it is processed by the network. Nonetheless, when the network's output is post-processed with our error correction techniques, the approach achieves state-of-the-art performance, yielding average diacritic and word error rates of 2.09 % and 5.82 %, respectively, on samples from 11 books. For the LDC ATB3 benchmark, this approach reduces the diacritic error rate by 25 %, the word error rate by 20 %, and the last-letter diacritization error rate by 33 % compared with the best published results.
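
The abstract reports diacritic and word error rates without defining them here. The following minimal sketch shows one common way such metrics are computed, assuming the diacritic error rate is the fraction of letters whose diacritics differ from the reference and the word error rate is the fraction of words with at least one such letter. The function names and counting conventions are illustrative assumptions, not the authors' evaluation code, and the paper's exact conventions may differ.

```python
# Arabic diacritic marks U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_diacritics(word):
    """Return (letter, diacritics) pairs for one word (hypothetical helper)."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            # Attach the mark to the letter that precedes it.
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference, hypothesis):
    """Return (DER, WER) for two diacritized strings with identical letters.

    Assumed definitions: DER counts letters whose diacritics differ,
    WER counts words containing at least one such letter.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("inputs must contain the same non-zero number of words")

    letters = letter_errors = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_pairs = split_diacritics(ref_word)
        hyp_pairs = split_diacritics(hyp_word)
        if [l for l, _ in ref_pairs] != [l for l, _ in hyp_pairs]:
            raise ValueError("inputs must share the same undiacritized letters")
        # Sort the marks so that shadda+fatha equals fatha+shadda.
        wrong = sum(
            sorted(r) != sorted(h)
            for (_, r), (_, h) in zip(ref_pairs, hyp_pairs)
        )
        letters += len(ref_pairs)
        letter_errors += wrong
        word_errors += wrong > 0
    return letter_errors / letters, word_errors / len(ref_words)


if __name__ == "__main__":
    FATHA, DAMMA = "\u064E", "\u064F"
    KAF, TEH, BEH = "\u0643", "\u062A", "\u0628"
    reference = KAF + FATHA + TEH + FATHA + BEH + FATHA
    hypothesis = KAF + FATHA + TEH + FATHA + BEH + DAMMA
    der, wer = error_rates(reference, hypothesis)
    print(f"DER = {der:.2%}, WER = {wer:.2%}")
```

Under these assumed definitions a single wrong diacritic marks the whole word as an error, so the word error rate is never lower than the diacritic error rate.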

Author information

Authors and Affiliations

  1. Computer Engineering Department, University of Jordan, Amman, 11942, Jordan: Gheith A. Abandah, Balkees Al-Shagoor, Alaa Arabiyat and Majid Al-Taee
  2. Google DeepMind, London, UK: Alex Graves
  3. King Abdullah University of Science and Technology, Thuwal, Saudi Arabia: Fuad Jamour