Community handbook

Waraka Playbook

A community handbook for building African-language datasets.

A practical, open guide to collecting, annotating, and releasing high-quality data for African languages — across text, speech, and vision — written and maintained by the community that speaks them.

Read online Download PDF

Three modalities: text, speech, vision

Open source & community-owned

What’s inside

From a blank page to a documented dataset

The handbook walks the whole pipeline — start with the foundations, follow the guide for your modality, get the quality and governance right, then document and release. Jump in wherever you are.

Foundations

Start with the principles

Why African-language data has to be built with its speakers, the core principles that run through every chapter, and how to read (and contribute to) the handbook.

Modality guides

Text, speech & vision

Sourcing, collecting, and annotating each modality — from scraping and APIs for text, to recording and transcribing speech, to image and video data.

Quality & governance

Get the data right

Inter-annotator agreement, quality control, ethics and bias, consent, ownership, and licensing — so the data you build can be trusted and reused.

Documentation

Document & release

Data statements, datasheets, contributor agreements, storage, and a sustainability plan — everything needed to release a dataset responsibly.

Open & community-owned

Read it, use it, help build it.

The Waraka Playbook is free to read online and open to contributions. Fixing an error, translating a page, or sharing what worked on a real project all count.
