“The data speaks for itself, but only if you are willing to listen.”
— Nate Silver
Mar 27, 2024
Goals
When to use
How to use
Simply put, it's counting tokens.
Method | Description
---|---
Raw frequency | Number of occurrences of a token within a corpus
Dispersion | Distribution of a token across a corpus
Relative frequency | Proportion of a token relative to the total number of tokens in a corpus
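The three frequency measures above can be sketched in a few lines of Python. The toy corpus below is hypothetical, and dispersion is computed in its simplest form, as the number of documents a token appears in:

```python
from collections import Counter

# Toy corpus: a list of already-tokenized documents (an assumption;
# any tokenizer could produce these lists).
corpus = [
    ["the", "data", "speaks", "for", "itself"],
    ["listen", "to", "the", "data"],
    ["the", "model", "fits", "the", "data"],
]

tokens = [t for doc in corpus for t in doc]
raw = Counter(tokens)                                # raw frequency per token
total = len(tokens)
relative = {t: n / total for t, n in raw.items()}    # proportion of all tokens
# Dispersion (simplest form): in how many documents does a token appear?
dispersion = {t: sum(t in doc for doc in corpus) for t in raw}

print(raw["data"], round(relative["data"], 3), dispersion["data"])  # 3 0.214 3
```

"data" occurs 3 times out of 14 tokens and appears in all 3 documents, so it is both frequent and evenly dispersed.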
Identify patterns of association between tokens
Method | Description
---|---
n-grams | Sequence of n tokens
Collocation | Tokens that frequently co-occur
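Extracting n-grams is a sliding-window operation; counting the resulting bigrams is a simple first step toward finding collocations. A minimal sketch, using a hypothetical token list:

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of size n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "data", "speaks", "for", "itself"]
bigrams = ngrams(tokens, 2)
# [('the', 'data'), ('data', 'speaks'), ('speaks', 'for'), ('for', 'itself')]

# Counting bigrams over a larger corpus would surface candidate collocations.
counts = Counter(bigrams)
```

Real collocation analysis would go further, scoring pairs with an association measure (e.g. pointwise mutual information) rather than raw co-occurrence counts.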
Patterns of association between tokens can signal:
Bottom-up approach to grouping similar data points
Method | Description
---|---
K-means | Partition data into k clusters
Hierarchical clustering | Build a tree of clusters
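K-means alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points. A minimal 2-D sketch on hypothetical points (in practice the points would be document vectors, e.g. term frequencies):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Minimal 2-D k-means: pick k starting centers, then alternate
    # between assigning points to the nearest center and re-centering.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: (p[0] - centers[j][0]) ** 2
                                + (p[1] - centers[j][1]) ** 2)
            clusters[j].append(p)
        centers = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                   if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two obvious groups: three points near (0, 0), three near (10, 10).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, 2)
```

Note that k must be chosen in advance and results depend on initialization; hierarchical clustering avoids fixing k by building the full tree of merges instead.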
An operation that reduces the number of variables in a dataset while preserving as much information as possible
Method | Description
---|---
PCA | Linear transformation to reduce dimensionality
t-SNE | Non-linear transformation to visualize high-dimensional data
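PCA can be sketched directly from its definition: center the data, eigen-decompose the covariance matrix, and project onto the directions of greatest variance. The data matrix below is hypothetical (5 observations, 3 variables):

```python
import numpy as np

# Toy data: 5 observations of 3 variables (hypothetical numbers).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.6],
              [3.1, 3.0, 0.3]])

# 1. Center each variable at zero.
Xc = X - X.mean(axis=0)
# 2. Eigen-decompose the covariance matrix.
cov = np.cov(Xc, rowvar=False)
vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
order = np.argsort(vals)[::-1]       # re-sort descending by variance explained
vals, vecs = vals[order], vecs[:, order]
# 3. Project onto the top-2 principal components.
X2 = Xc @ vecs[:, :2]
print(X2.shape)  # (5, 2)
```

Unlike PCA, t-SNE has no closed-form solution; it iteratively optimizes a low-dimensional layout to preserve local neighborhoods, which makes it useful for visualization but not for reversible dimensionality reduction.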
Use of distributed representations of words in a continuous vector space, where words with similar contextual distributions are closer together
Method | Description
---|---
Word2Vec | Popular word embedding model
GloVe | Global vectors for word representation
Note: word embedding models are highly contingent on the size of the corpus, the algorithm used, and the parameters set.
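Once words are vectors, "closer together" is usually measured with cosine similarity. The 3-dimensional embeddings below are invented for illustration; a trained Word2Vec or GloVe model would supply vectors with hundreds of dimensions learned from a corpus:

```python
import math

# Hypothetical toy embeddings (not from a real trained model).
emb = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    # Cosine similarity: 1.0 means same direction, 0 means orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["apple"]))  # True
```

With real embeddings, the same comparison surfaces semantic neighbors: words that occur in similar contexts end up with similar vectors.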
Explore | Quantitative Text Analysis | Wake Forest University