Learning Word and Sub-word Vectors for Amharic (Less Resourced Language)
Keywords:
Amharic, word vectors, fastText, word2vec

Abstract
The availability of pre-trained word embedding models (also known as word vectors) has empowered many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient in the successful application of these distributed word representations is the existence of large curated corpora on which to train them, so that the pre-trained models can be used in downstream tasks. In this paper, we describe how we trained such quality word representations for Amharic, one of the less-resourced Ethiopian languages. We used several offline and online data sources and created 100-, 200-, and 300-dimensional word2vec and fastText word vectors. We also introduce a new word analogy dataset for evaluating Amharic word vectors. In addition, we created an Amharic SentencePiece model, which can be used to encode and decode words for subsequent NLP tasks. Using this SentencePiece model, we created Amharic sub-word word2vec embeddings with 25, 50, 100, 200, and 300 dimensions, trained over our large curated dataset. Finally, we evaluated our pre-trained word vectors on both an intrinsic word analogy task and an extrinsic downstream natural language processing task. The results show promising performance on both intrinsic and extrinsic evaluations compared to previously released models.
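To make the training pipeline concrete, the following is a minimal sketch (not the authors' exact training script) of how word2vec and fastText vectors and a SentencePiece model can be trained over a plain-text corpus using the gensim and sentencepiece libraries. The corpus path, vocabulary size, and hyperparameters are illustrative assumptions, not values reported in the paper.

```python
# A minimal sketch, assuming a plain-text Amharic corpus with one
# sentence per line at "amharic_corpus.txt" (hypothetical path).
from gensim.models import Word2Vec, FastText
import sentencepiece as spm

def read_corpus(path):
    """Stream whitespace-tokenized sentences from the corpus file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.strip().split()
            if tokens:
                yield tokens

sentences = list(read_corpus("amharic_corpus.txt"))

# Train 300-dimensional word2vec vectors (the paper also reports 100 and 200).
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
w2v.wv.save_word2vec_format("amharic_word2vec_300.vec")

# Train fastText vectors, which additionally model character n-gram sub-words.
ft = FastText(sentences, vector_size=300, window=5, min_count=5, workers=4)
ft.save("amharic_fasttext_300.model")

# Train a SentencePiece model for sub-word segmentation (vocab size assumed).
spm.SentencePieceTrainer.train(
    input="amharic_corpus.txt", model_prefix="amharic_sp", vocab_size=32000
)
sp = spm.SentencePieceProcessor(model_file="amharic_sp.model")
# Encoding a sentence into sub-word pieces; these pieces can then be fed to
# word2vec training in place of whole words to obtain sub-word embeddings.
pieces = sp.encode("example sentence", out_type=str)
```

Window size, minimum count, and worker count above are common gensim defaults used for illustration; the actual settings used for the released models are described in the body of the paper.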