Open-source datasets

Our contribution to AI research

Cogite firmly believes that the languages and cultures of French-speaking Africa deserve better representation in global AI models. That's why we periodically publish open-source datasets that are free to use for academic research and open-source development.

Available datasets

cogite-fr-african-sentiments (upcoming)

Dataset of 20,000 African French sentences annotated for sentiment, covering Cameroonian, Ivorian, Senegalese and Congolese variants. Well suited to fine-tuning French-language sentiment analysis models.

License: CC-BY-SA 4.0 · Availability: Q3 2026

cogite-fr-mobile-money-ner (upcoming)

Named entity recognition (NER) dataset for the mobile money domain in African French, covering operators, transaction types, currencies and locations. 8,000 annotated sentences.

License: CC-BY 4.0 · Availability: Q4 2026

cogite-bilingual-codeswitch (upcoming)

Dataset of sentences mixing French and English (code-switching), a widespread linguistic phenomenon in Anglophone Africa and bilingual Cameroon. 12,000 sentences with token-level language annotation.

License: CC-BY 4.0 · Availability: Q1 2027

For the research community

If you're a researcher or doctoral student and would like early access to these datasets, or would like to propose a research partnership, contact our team. We grant early access to research projects that commit to publishing their results in open access.

Why these datasets?

Current AI models are trained overwhelmingly on English-language and Western data. The resulting biases (cultural, linguistic, economic) are documented but rarely corrected. By contributing these African French datasets, Cogite is taking part in a collective effort to make AI more inclusive, more representative, and therefore more useful to the 280 million French speakers of Africa.