A community handbook for building African-language datasets.
A practical, open guide to collecting, annotating, and releasing high-quality data for African languages — across text, speech, and vision — written and maintained by the community that speaks them.
The handbook walks through the whole pipeline: start with the foundations, follow the guide for your modality, get the quality and governance right, then document and release. Jump in wherever you are.
Why African-language data has to be built with its speakers, the core principles that run through every chapter, and how to read (and contribute to) the handbook.
Sourcing, collecting, and annotating each modality — from scraping and APIs for text, to recording and transcribing speech, to image and video data.
Inter-annotator agreement, quality control, ethics and bias, consent, ownership, and licensing — so the data you build can be trusted and reused.
Data statements, datasheets, contributor agreements, storage, and a sustainability plan — everything needed to release a dataset responsibly.
The Waraka Playbook is free to read online and open to contributions. Fixing an error, translating a page, or sharing what worked on a real project all count.