“The data speaks for itself, but only if you are willing to listen.”
— Nate Silver
Mar 27, 2024




Simply put, it's counting tokens.
| Method | Description |
|---|---|
| Raw frequency | Number of occurrences of a token within a corpus |
| Dispersion | Distribution of a token across a corpus |
| Relative frequency | Proportion of a token relative to the total number of tokens in a corpus |
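These three counts can be sketched in a few lines of Python. The two-document corpus below is a hypothetical example, and "share of documents containing the token" is one simple way to operationalize dispersion (among several in corpus linguistics):

```python
from collections import Counter

# Toy corpus split into two documents (hypothetical example text)
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]
tokens = [t for doc in docs for t in doc]
counts = Counter(tokens)

# Raw frequency: occurrences of "the" across the whole corpus
raw = counts["the"]                    # 4

# Relative frequency: proportion of all tokens
rel = counts["the"] / len(tokens)      # 4 of 11 tokens

# Dispersion (simplest form): share of documents containing the token
dispersion = sum("the" in doc for doc in docs) / len(docs)   # 1.0
```

Relative frequency is what makes counts comparable across corpora of different sizes; raw frequency alone is misleading when corpora differ in length.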
Identify patterns of association between tokens
| Method | Description |
|---|---|
| n-grams | Sequence of n tokens |
| Collocation | Tokens that frequently co-occur |
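A minimal sketch of both methods: extract all n-grams from a token sequence, then treat the most frequent bigrams as collocation candidates. (Real collocation analysis usually adds an association measure such as PMI; raw bigram frequency is the simplest starting point.)

```python
from collections import Counter

tokens = "to be or not to be that is the question".split()

def ngrams(seq, n):
    """All contiguous sequences of n tokens."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# Count bigrams; frequent ones are collocation candidates
bigrams = Counter(ngrams(tokens, 2))
top = bigrams.most_common(1)[0]   # (('to', 'be'), 2)
```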
Patterns of association between tokens can signal multiword expressions, idioms, or semantic relatedness.
Bottom-up approach to grouping similar data points
| Method | Description |
|---|---|
| K-means | Partition data into k clusters |
| Hierarchical clustering | Build a tree of clusters |
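To make the k-means idea concrete, here is a minimal pure-Python sketch on one-dimensional toy data with fixed initial centers (a real analysis would use a library such as scikit-learn and multiple random initializations):

```python
# Minimal 1-D k-means sketch: alternate assignment and update steps
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]

def kmeans(points, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans(points, centers=[0.0, 5.0])
# Converges to centers 1.5 and 10.5, one per visible group
```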

An operation that reduces the number of variables in a dataset while preserving as much information as possible
| Method | Description |
|---|---|
| PCA | Linear transformation to reduce dimensionality |
| t-SNE | Non-linear transformation to visualize high-dimensional data |
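PCA can be sketched directly from its definition: center the data, eigendecompose the covariance matrix, and project onto the leading eigenvector. This assumes NumPy is available; the data here is a made-up 2-D example lying mostly along one direction:

```python
import numpy as np

# Toy 2-D data that is nearly 1-D (strongly correlated coordinates)
X = np.array([[2.0, 1.9], [1.0, 1.1], [3.0, 3.2], [4.0, 3.8], [0.0, 0.2]])

# Center the data, then eigendecompose the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

# Project onto the top principal component (largest eigenvalue: last column)
pc1 = eigvecs[:, -1]
projected = Xc @ pc1   # 1-D representation preserving most of the variance
```

The ratio of the largest eigenvalue to the eigenvalue total is the share of variance the single retained dimension preserves, which is the sense in which PCA keeps "as much information as possible."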
Distributed representations of words in a continuous vector space, where words with similar contextual distributions sit closer together
| Method | Description |
|---|---|
| Word2Vec | Popular word embedding model |
| GloVe | Global vectors for word representation |
Note: word embedding models are highly contingent on the size of the corpus, the algorithm used, and the parameters chosen.
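The distributional idea behind these models can be illustrated without training anything: give each word a vector of co-occurrence counts and compare vectors with cosine similarity. The counts below are hypothetical, not trained Word2Vec or GloVe embeddings, but the geometry is the same: similar contexts yield nearby vectors.

```python
from math import sqrt

# Toy co-occurrence vectors (made-up counts); each dimension is a
# context word, here ["purr", "bark", "run"]
vectors = {
    "cat":    [8.0, 0.0, 3.0],
    "dog":    [0.0, 9.0, 4.0],
    "kitten": [7.0, 1.0, 2.0],
}

def cosine(u, v):
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Words with similar contextual distributions score higher
sim_cat_kitten = cosine(vectors["cat"], vectors["kitten"])
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
```

Here "cat" and "kitten" share contexts ("purr"), so their similarity is far higher than that of "cat" and "dog", which is exactly the signal embedding models learn at scale.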


Explore | Quantitative Text Analysis | Wake Forest University