Data sources

Categories: data, corpora, r, api, download, repositories
A guide to popular data sources for text analysis.
Published: Spring 2024

This guide lists data sources for text analysis. The list is short and far from complete, but it offers a starting point for researchers looking for text data to analyze. It covers downloadable corpora (English, other languages, and domain-specific collections), R packages that wrap web APIs, and R packages that bundle or download datasets.

Warning

Web resources are always changing. Some of the corpora listed below may no longer be available. I will do my best to keep this list up to date, but I cannot guarantee that all of the links will work.

Some of the corpora listed below are open access, and others are restricted access (marked as such in the list). Open-access corpora are freely available to the public; restricted-access corpora require a subscription or a license.

Downloads

English

  • ANC (American National Corpus)
  • ENNTT (Europarl Corpus of Native, Non-native, and Translated Texts)
  • SBCSAE (Santa Barbara Corpus of Spoken American English)
  • SDAC (Switchboard Dialog Act Corpus, LDC open access)
  • ICE (International Corpus of English, restricted access)
  • BNC (British National Corpus, restricted access)
  • COCA (Corpus of Contemporary American English, restricted access)
  • … (suggest more!)

Other languages

  • LCMC (Lancaster Corpus of Mandarin Chinese)
  • SCC (Sheffield Corpus of Chinese)
  • ACTIV-ES (Film/TV dialog corpus for Argentine, Mexican, and Peninsular Spanish)

Domain specific

L2 learner corpora

  • Langsnap (Spanish, French learner corpora)
  • CEDEL2 (Corpus Escrito del Español como L2)

Translation corpora

  • OPUS (Open Parallel Corpus)

R APIs

  • tuber (YouTube)
  • rtweet (Twitter)
  • rtoot (Mastodon)
  • gutenbergr (Project Gutenberg)
  • TBDBr (TalkBank Database)
  • jstor (JSTOR Data for Research)
  • lingtypology (Linguistic typology data)
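
As a rough illustration of how one of these API packages is used, here is a minimal sketch with gutenbergr. It assumes the package is installed; the author filter is only an example. The metadata lookup runs against data bundled with the package, while the download step needs a network connection and a reachable Project Gutenberg mirror, so it is wrapped to fail quietly.

```r
library(gutenbergr)

# gutenberg_works() filters the metadata that ships with the package,
# so this step does not need a network connection.
austen <- gutenberg_works(author == "Austen, Jane")
austen[, c("gutenberg_id", "title")]

# Downloading text does need the network. The ID comes from the metadata
# above rather than being hard-coded; on failure we fall back to NULL.
first_work <- tryCatch(
  gutenberg_download(austen$gutenberg_id[1]),
  error = function(e) NULL
)
```

When the download succeeds, the result is a data frame with one row per line of text, which can be passed to a tokenizer of your choice.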

R packages

  • textdata (Downloadable text datasets and lexicons)
  • fivethirtyeight (Data from the FiveThirtyEight website)
  • quanteda.corpora (Corpora for the quanteda package)
  • corpora (Statistics and data sets for corpus frequency data)
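
Some data packages ship their datasets inside the package itself, so a quick way to see what an installed package offers is to list its bundled datasets. The sketch below uses only base R; `list_datasets` is a hypothetical helper written for this guide, not a function from any of the packages above. Note that packages which fetch data on demand may show few or no bundled datasets here.

```r
# Hypothetical helper: list the datasets bundled with an installed package.
list_datasets <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is not installed.")
  }
  # utils::data() returns a matrix of results with "Item" and "Title" columns.
  found <- utils::data(package = pkg)$results
  data.frame(
    dataset = found[, "Item"],
    title = found[, "Title"],
    stringsAsFactors = FALSE
  )
}

# The 'datasets' package ships with R, so this call needs no extra installs.
head(list_datasets("datasets"))
```

The same call with an installed package name from the list, for example `list_datasets("fivethirtyeight")`, shows what that package provides.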

Other repositories