Data sources
data
corpora
r
api
download
repositories
A guide to popular data sources for text analysis.
In this guide, I list some data sources for text analysis. The list is short and clearly incomplete, but it provides a starting point for researchers who are looking for text data to analyze. The list includes general corpora, language-specific corpora, domain-specific corpora, and R APIs for text analysis.
Some of the corpora listed below are open access and freely available to the public. Others are restricted access and require a subscription or a license.
Downloads
English
- ANC (American National Corpus)
- ENNTT (Europarl Corpus of Native, Non-native, and Translated Texts)
- SBCSAE (Santa Barbara Corpus of Spoken American English)
- SDAC (Switchboard Dialog Act Corpus, LDC open access)
- ICE (International Corpus of English, restricted access)
- BNC (British National Corpus, restricted access)
- COCA (Corpus of Contemporary American English, restricted access)
- … (suggest more!)
Other languages
Domain specific
L2 learner corpora
- Langsnap (Spanish, French learner corpora)
- CEDEL2 (Corpus Escrito del Español como L2)
- …
Translation corpora
- OPUS (Open Parallel Corpus)
- …
R APIs
- tuber (YouTube)
- rtweet (Twitter)
- rtoot (Mastodon)
- gutenbergr (Project Gutenberg)
- TBDBr (TalkBank Database)
- jstor (JSTOR Data for Research)
  - JSTOR Data for Research (DfR) provides access to the data behind the research on the JSTOR digital library.
- lingtypology (Linguistic typology data)
- …
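As an illustration of how one of these API packages is typically used, here is a minimal sketch with gutenbergr. It assumes the package is installed from CRAN and, for the download step, that an internet connection is available. The author filter and the work ID (1342, "Pride and Prejudice") are chosen only as examples.

```r
# install.packages("gutenbergr")
library(gutenbergr)

# Search the metadata that ships with the package (no network needed).
austen_works <- gutenberg_works(author == "Austen, Jane")
austen_works[, c("gutenberg_id", "title")]

# Download one work by its Project Gutenberg ID (needs an internet connection).
# 1342 is the ID for "Pride and Prejudice"; the ID is an example choice.
pride <- gutenberg_download(1342)
head(pride$text)
```

The downloaded text arrives as a data frame with one row per line of the book, which makes it straightforward to pass on to tokenization tools.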
R packages
- textdata (Text data for text analysis)
- fivethirtyeight (Data from the FiveThirtyEight website)
- quanteda.corpora (Corpora for the quanteda package)
- corpora (Corpora for the corpora package)
- …
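Data packages like these expose corpora as ordinary R objects once the package is loaded. The sketch below uses quanteda.corpora as an example. It assumes the package has been installed from GitHub (it is not distributed on CRAN) and that quanteda is installed too. The choice of `data_corpus_sotu` (US State of the Union addresses) is only illustrative.

```r
# quanteda.corpora is installed from GitHub rather than CRAN, for example:
# remotes::install_github("quanteda/quanteda.corpora")
library(quanteda)
library(quanteda.corpora)

# data_corpus_sotu is one of the corpora bundled with the package.
# Count its documents and peek at the document-level metadata.
ndoc(data_corpus_sotu)
head(docvars(data_corpus_sotu))
```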