  • Lugha-Llama: Adapting Large Language Models for African Languages

    By Happy Buzaaba, Alexander Wettig, David Ifeoluwa Adelani, and Christiane Fellbaum

    Low-resource African languages remain underrepresented in the training data of large language models (LLMs), and as a result LLMs struggle to understand these languages. We are releasing three African-centric Lugha-Llama models based on Llama-3.1-8B, which achieve the best performance among open-source models on IrokoBench, a challenging benchmark for African languages, and on AfriQA, a cross-lingual open-retrieval question answering dataset for African languages. (Lugha is the Kiswahili word for “language.”)

    All Lugha-Llama models are available on the Hugging Face Hub.
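
    The released models come from continued pre-training (no instruction tuning is described in this post), so they are best prompted as text to be continued. Below is a minimal usage sketch with the transformers library; the repository id is an assumption and should be checked against the released model cards on the Hub.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical repo id -- verify the exact name on the Hugging Face Hub.
    model_id = "Lugha-Llama/Lugha-Llama-8B-wura"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Base models continue text rather than follow instructions.
    prompt = "Habari ya leo:"  # Kiswahili
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))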

    In this blog post, we describe the training setup and data used to produce Lugha-Llama.

    What data is used to train Lugha-Llama?

    We rely on the openly available multilingual WURA corpus of African languages to train our models. The corpus comprises sixteen African languages and four high-resource languages commonly spoken on the African continent. It was collected by carefully inspecting and cleaning mC4 and by crawling African websites. It also includes three languages with non-Latin scripts: Amharic, Arabic, and Tigrinya.

    The WURA corpus

    One of the challenges of multilingual model pre-training is data imbalance. Given an imbalanced multilingual corpus like WURA, it is important to carefully control how many times each language's data is repeated during training, in order to avoid overfitting and memorization.

    To address this, we sample from the 19 languages in the WURA corpus using UniMax sampling, which samples as uniformly as possible across languages while controlling the extent to which any language's data is repeated. Rare languages were up-sampled by at most four epochs, an amount found to incur no discernible degradation during model training [Muennighoff et al.].

    The figure below shows the number of tokens per language in our training corpus and the sampling proportions obtained with UniMax sampling.
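
    For illustration, the snippet below gives a minimal sketch of the UniMax allocation logic (not the exact implementation used here): languages are visited from the smallest corpus to the largest, each receives an equal share of the remaining token budget, and any language whose share would exceed four epochs of its data is capped at that amount, with the leftover budget flowing to the larger languages.

    def unimax_proportions(tokens_per_lang, budget, max_epochs=4):
        """Distribute a token budget as uniformly as possible across languages,
        capping each language at max_epochs passes over its corpus."""
        # Visit languages from the smallest corpus to the largest.
        langs = sorted(tokens_per_lang, key=tokens_per_lang.get)
        alloc, remaining = {}, budget
        for i, lang in enumerate(langs):
            uniform_share = remaining / (len(langs) - i)
            cap = max_epochs * tokens_per_lang[lang]
            alloc[lang] = min(uniform_share, cap)  # small languages hit the epoch cap
            remaining -= alloc[lang]
        total = sum(alloc.values())
        return {lang: n / total for lang, n in alloc.items()}

    # Toy example with one large and two small languages (token counts are made up).
    print(unimax_proportions({"eng": 8e9, "swa": 1e9, "lin": 2e8}, budget=10e9))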

    Adding English language data

    We continued training Llama-3.1-8B, which was predominantly trained on English data. To prevent catastrophic forgetting of the model's pre-trained capabilities, we explore retaining English data in the continued pre-training mix. We experiment with two different sources of English data:

    1. FineWeb-Edu, which contains high-quality, knowledge-rich educational documents.
    2. OpenWebMath, a curated dataset of mathematical documents.

    In these experiments, we combine 40% English data with 60% WURA data, as sketched below. This kind of “replay” has been shown to be effective at preventing catastrophic forgetting during continued pre-training [Ibrahim et al.].
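
    As a concrete illustration, a 60/40 mixture of this kind could be assembled with the Hugging Face datasets library as in the sketch below. The dataset, configuration, and column names are assumptions to be checked against the Hub, and the actual training mixture samples all WURA languages according to the UniMax proportions, whereas the sketch uses a single language for brevity.

    from datasets import load_dataset, interleave_datasets

    # Dataset/config names are assumptions -- verify the exact Hub identifiers before use.
    wura_swa = load_dataset("castorini/wura", "swa", split="train", streaming=True)
    fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT",
                               split="train", streaming=True)

    # Keep a single shared column so the two schemas line up.
    wura_swa = wura_swa.select_columns(["text"])
    fineweb_edu = fineweb_edu.select_columns(["text"])

    # Draw 60% African-language documents and 40% English "replay" documents.
    mixed = interleave_datasets([wura_swa, fineweb_edu],
                                probabilities=[0.6, 0.4], seed=42)

    for doc in mixed.take(3):
        print(doc["text"][:80])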

    Lugha-Llama models

    In contrast to prior African language models, we continue pre-training a much larger model (8 billion parameters) on substantially more text (10 billion tokens). We open-source three models:

    1. Lugha-Llama-8B-wura: Trained exclusively on the WURA corpus, and therefore on the largest amount of African language data.
    2. Lugha-Llama-8B-wura_edu: Trained on a mix of African language data and educational English documents from FineWeb-Edu.
    3. Lugha-Llama-8B-wura_math: Trained on a mix of African language data and English mathematical data from OpenWebMath.

    All models are trained with a batch size of 4 million tokens and a maximum sequence length of 8,192 tokens for 2,400 steps.
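
    As a quick sanity check (assuming the 4 million tokens refer to tokens per optimizer step), these numbers recover the roughly 10 billion tokens of continued pre-training quoted above:

    batch_tokens = 4_000_000           # tokens per optimizer step
    seq_len = 8_192                    # maximum sequence length
    steps = 2_400

    print(batch_tokens // seq_len)     # ~488 sequences per batch
    print(batch_tokens * steps / 1e9)  # 9.6 -> roughly the 10B training tokens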

    Evaluation

    We make use of the LM Evaluation Harness to evaluate on AfriQA and on three tasks in IrokoBench: knowledge-based question answering (AfriMMLU), mathematical reasoning (AfriMGSM), and natural language inference (AfriXNLI).
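
    For reference, an evaluation along these lines can be launched through the harness's Python API, as in the sketch below. The task identifiers and the model repository id are assumptions; check the harness's task registry (lm_eval --tasks list) and the Hub for the exact names, and note that our reported numbers may use different evaluation settings.

    import lm_eval

    # Placeholder task names and model id -- verify before running.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=Lugha-Llama/Lugha-Llama-8B-wura,dtype=bfloat16",
        tasks=["afrimmlu", "afrimgsm", "afrixnli", "afriqa"],
        batch_size=8,
    )
    print(results["results"])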

    • All three Lugha-Llama models consistently achieve the best performance on AfriMMLU, AfriMGSM, AfriXNLI, and AfriQA among similarly sized baselines.
    • Including data from FineWeb-Edu (_edu) consistently boosts performance across languages on AfriMMLU, and including English data from OpenWebMath (_math) improves performance on AfriMGSM, suggesting some cross-lingual transfer of skills and knowledge.

    The table below shows a detailed per-language comparison of the Lugha-Llama models to the baselines for all three tasks in IrokoBench. Languages in italics are not present in the continued pre-training data. † indicates average values without English (eng) and French (fra).

    • Adapting the pre-trained model to African languages increases AfriMMLU scores by up to 8 percentage points compared to Llama-3.1-8B, with the largest improvement in Igbo (ibo).

    Conclusion and future directions

    We introduce three Lugha-Llama models, leveraging continued pre-training to advance research in low-resource African languages. Our models achieve state-of-the-art results across challenging IrokoBench tasks and the cross-lingual AfriQA dataset. Our findings show that combining African language pre-training data with carefully curated, high-quality English documents from FineWeb-Edu and OpenWebMath substantially improves downstream task performance. This work is a significant step toward greater representation of African languages in NLP, LLMs, and broader AI research.

    Looking ahead, promising future directions include a deeper investigation into linguistic transfer mechanisms and an assessment of whether integrating high-quality non-English data from FineWeb2 yields similar improvements.

    Citation

    If you use the Lugha-Llama models, please cite our work!

    @article{buzaaba2025lugha,
      title={Lugha-Llama: Adapting Large Language Models for African Languages},
      author={Buzaaba, Happy and Wettig, Alexander and Adelani, David Ifeoluwa and Fellbaum, Christiane},
      journal={arXiv preprint arXiv:2504.06536},
      year={2025}
    }