Explore

“The data speaks for itself, but only if you are willing to listen.”
— Nate Silver”

Dr. Jerid Francom

Mar 27, 2024

Overview

  • Orientation
  • Descriptive analysis
  • Unsupervised learning

Process

The Big Picture

Orientation

Exploratory data analysis

Goals

  • Discover patterns
  • Describe emerging relationships
  • Posit hypotheses

When to use

  • Literature is scarce
  • Gap in knowledge is wide
  • New territory

How to use

  • Identify, Inspect, Interrogate, Interpret
  • Iterative process
    • Update: unit of observation, variables, methods

Descriptive analysis

Frequency analysis: what is it?

Simply put, its counting tokens.

Method Description
Raw frequency Number of occurrences of a token within a corpus

Dispersion Distribution of a token across a corpus

Relative frequency Proportion of a token in relation to the total number of tokens in a corpus

What can a frequency analysis tell us?

  • Varying usage and distributin of tokens can signal:
    • Different populations of speakers
    • Different topics, genres, registers, etc.
    • Cognitive processes (entrenchment, salience)

Things to consider

  • Tokenization
  • Operationalization
  • Zipf distribution
(a) Original
(b) Log-transformed
Figure 1: Zipf distribution (N = 100)

Co-occurrence analysis: what is it?

Identify patterns of association between tokens

Method Description
n-grams Sequence of n tokens

Collocation Tokens that frequently co-occur

What can a co-occurrence analysis tell us?

Patterns of association between tokens can signal:

  • Grammatical structures
  • Semantic relationships
  • Formulaic/ Idomatic expressions

Things to consider

  • Operationalization
  • Context window
  • Measures of association
    • Pointwise mutual information
    • Log-likelihood ratio

Unsupervised learning

Clustering: what is it?

Bottom-up approach to grouping similar data points

Method Description
K-means Partition data into k clusters

Hierarchical clustering Build a tree of clusters

What can clustering tell us?

  • Patterns of similarity/ dissimilarity
  • Number of groups
  • Homogeneity within groups
  • Heterogeneity between groups
(a) K-means
(b) Hierarchical clustering
Figure 2: Examples of K-means and hierarchical clustering

Things to consider

  • Previous knowledge
  • Number of clusters
  • Distance metric (Heirarchical clustering)

Dimensionality reduction: what is it?

Operation to reduce the number of variables in a dataset, while preserving as much information as possible

Method Description
PCA Linear transformation to reduce dimensionality
t-SNE Non-linear transformation to visualize high-dimensional data

What can dimensionality reduction tell us?

  • How variables are related
  • How redundant variables are
  • Identify dimensions that explain the most variance

Things to consider

  • Interpretability
  • Number of dimensions
  • Overfitting/ Underfitting
(a) Without PCA
(b) With PCA
Figure 3: K-means clustering with and without PCA

Word embeddings: what is it?

Use of distributed representations of words in a continuous vector space where words with similar contextual distributions are closer together

Method Description
Word2Vec Popular word embedding model
GloVe Global vectors for word representation

Note: word embedding models are highly contingent on the size of the corpus, the algorithm used, and parameters set.

What can word embeddings tell us?

  • Semantic relationships between words
  • Grammatical relationships between words
  • Analogies between words
  • Clustering of words based on context

Things to consider

  • Context window
  • Number of dimensions
  • Pre-trained models vs. training your own model

Wrap-up

Final thoughts

  • Exploratory data analysis covers a wide range of methods and techniques.
  • Of the three approaches we will cover, it requires the most creativity and flexibility.
  • The evaluative process is iterative and associative, and the results are often open to interpretation.

References

Kassambara, Alboukadel, and Fabian Mundt. 2020. Factoextra: Extract and Visualize the Results of Multivariate Data Analyses. https://CRAN.R-project.org/package=factoextra.
Meyer, David, and Christian Buchta. 2022. Proxy: Distance and Similarity Measures. https://CRAN.R-project.org/package=proxy.
Wijffels, Jan, and Kohei Watanabe. 2023. Word2vec: Distributed Representations of Words. https://CRAN.R-project.org/package=word2vec.