Application Programming Interfaces (APIs)
Learn how API-based collection differs from scraping, what it's well suited for, and why API access is one of the least stable foundations a data collection plan can be built on.
How this differs from scraping
An API (Application Programming Interface) is a structured, sanctioned channel a platform provides for accessing its data — as opposed to extracting it from rendered web pages. Where scraping takes what's publicly visible and infers structure from a page's HTML, an API hands you already-structured data (a tweet object, a Wikipedia article's wikitext, a government dataset's rows) directly, usually with documentation, authentication, and explicit rate limits.
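The contrast can be made concrete with a small sketch. Everything in it is invented for illustration: the page markup, the JSON payload, the field names, and the `PostScraper` class are hypothetical and do not correspond to any real platform. The scraping path has to infer which text is the author and which is the post body from CSS class names, while the API path receives the fields already labelled.

```python
import json
from html.parser import HTMLParser

# Hypothetical page markup and API payload, invented for illustration only.
PAGE_HTML = """
<div class="post"><span class="author">amina</span>
<p class="text">Habari ya asubuhi</p></div>
"""
API_JSON = '{"id": "1", "author": "amina", "text": "Habari ya asubuhi", "lang": "sw"}'


class PostScraper(HTMLParser):
    """Infers fields from CSS class names; breaks if the page layout changes."""

    def __init__(self):
        super().__init__()
        self._field = None   # field we expect the next text node to fill
        self.record = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in ("author", "text"):
            self._field = css_class

    def handle_data(self, data):
        # Only keep non-empty text that follows a tag we recognised.
        if self._field and data.strip():
            self.record[self._field] = data.strip()
            self._field = None


def scrape_post(html):
    """Scraping path: structure is guessed from the rendered markup."""
    parser = PostScraper()
    parser.feed(html)
    return parser.record


def parse_api_post(payload):
    """API path: the platform has already structured and named the fields."""
    return json.loads(payload)


scraped = scrape_post(PAGE_HTML)
from_api = parse_api_post(API_JSON)
print(scraped)    # only the fields the scraper knew how to find
print(from_api)   # every field the API chose to expose
```

In this toy case the scraped record silently depends on two class names staying the same, whereas the API record also carries fields (an identifier, a language tag) that never appeared on the page at all.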
This makes APIs more reliable to build a pipeline around in the short term: the data is cleaner, the schema is documented, and you're not fighting page-layout changes. It does not make them more reliable in the long term, for reasons worth taking seriously before you commit a project's collection plan to one.
A cautionary, very real example
AfriSenti, the sentiment dataset behind the first Afrocentric SemEval shared task (SemEval-2023 Task 12), was built on more than 110,000 tweets across 14 African languages, gathered through Twitter's then-free API access for academic research (Muhammad et al., 2023). That access model no longer exists. In 2023, Twitter (now X) ended free academic API access and introduced paid tiers, with enterprise-level access priced at tens of thousands of dollars per month, far beyond what most academic projects can pay (Brown et al., 2024). A project planned today on the assumption that it could reproduce AfriSenti's collection process, on the same terms, simply could not.
This isn't a story about one platform behaving badly. It's the generic risk of any API: access is a privilege the platform grants, not a right you hold, and it can be repriced, restricted, or revoked with little notice, for reasons that have nothing to do with you or your research. Plan accordingly.
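One modest way to plan for this, sketched below, is to write the collection loop so that it tells a temporary rate limit apart from a loss of access and always returns what it has gathered so far. This is a minimal sketch under stated assumptions, not a recommended client: `collect` and the `fetch_page` callable are hypothetical, and the sketch assumes the platform signals rate limiting with HTTP 429 and denied access with codes such as 401, 402, or 403, which a given API may or may not do.

```python
import time


def collect(fetch_page, max_pages, max_retries=3, sleep=time.sleep):
    """Collect pages until done, rate-limited out, or access is refused.

    fetch_page is a hypothetical callable: fetch_page(page) returns
    (status_code, retry_after_seconds_or_None, records).
    Returns (records_collected_so_far, human-readable outcome).
    """
    collected = []
    for page in range(max_pages):
        for attempt in range(max_retries + 1):
            status, retry_after, records = fetch_page(page)
            if status == 200:
                collected.extend(records)
                break
            if status == 429 and attempt < max_retries:
                # Temporary: wait as instructed, else back off exponentially.
                sleep(retry_after or 2 ** attempt)
                continue
            # Access denied (e.g. 401/402/403) or retries exhausted:
            # stop, but keep everything gathered so far.
            return collected, f"stopped at page {page}: HTTP {status}"
    return collected, "complete"
```

A caller would persist the returned records immediately and log the outcome string, so that a revoked key or a new paywall ends the collection with a partial, documented dataset rather than with nothing.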