Analysis

Approaching statistical thinking for text analysis.

Dr. Jerid Francom

Feb 7, 2024

Overview

  • Describe
    • mixed in!
  • Analyze
  • Communicate

Describe

Descriptive methods

Summarize the data to understand its characteristics.

  • Central Tendency: What is the “typical” value?
  • Dispersion: How much do the values vary?
  • Distribution: What does the data look like?
  • Association: How do variables relate to one another?

Central Tendency

A single statistic that aims to represent a variable.

Mode

The most common value (0.89 in the example). Used mostly for categorical data.

Mean

The average value (1.98 in the example).

Median

The middle value (1.67 in the example).
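
As a minimal sketch of the three measures, assuming a hypothetical list of sentence lengths (the values are illustrative and are not the data behind the figures above), Python's standard `statistics` module can compute each one:

```python
import statistics

# Hypothetical sentence lengths (in words); illustrative values only.
lengths = [3, 5, 5, 6, 8, 9, 20]

mode_value = statistics.mode(lengths)      # most common value
mean_value = statistics.mean(lengths)      # arithmetic average
median_value = statistics.median(lengths)  # middle value of the sorted data

print(mode_value, mean_value, median_value)
```

In this sketch the single long sentence pulls the mean above the median, which is one reason the median is often reported for skewed data.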

Dispersion

A single statistic to represent the variability of a variable.

Standard Deviation

The typical spread of values around the mean (1.38 in the example).

IQR (Interquartile Range)

The difference between the 75\(^{th}\) and 25\(^{th}\) percentiles (1.69 in the example).
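
Both measures can be sketched with the standard `statistics` module. The word counts below are hypothetical, and the choice of the `"inclusive"` quantile method is an assumption made for illustration; other methods give slightly different quartiles on small samples.

```python
import statistics

# Hypothetical word counts for nine short texts; illustrative values only.
counts = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Sample standard deviation: typical distance of the values from the mean.
sd = statistics.stdev(counts)

# Quartiles; method="inclusive" treats the smallest and largest values
# as the 0th and 100th percentiles of the data.
q1, q2, q3 = statistics.quantiles(counts, n=4, method="inclusive")
iqr = q3 - q1  # spread of the middle 50% of the values

print(round(sd, 2), iqr)
```

Unlike the standard deviation, the IQR ignores the tails of the data, so a few extreme values affect it much less.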

Distribution

Normal distribution

Skewed distributions
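
One rough way to tell a roughly symmetric distribution from a skewed one is a moment-based skewness statistic, together with a comparison of the mean and median. The sketch below uses a hypothetical helper named `skewness` and made-up values; it illustrates the idea and is not the method used to draw the slides' plots.

```python
import statistics

def skewness(values):
    """Moment-based (population) skewness.

    Close to 0 for symmetric data, positive for a long right tail,
    negative for a long left tail.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Illustrative values only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 3, 4, 20]

print(skewness(symmetric), skewness(right_skewed))
# In right-skewed data the mean sits above the median.
print(statistics.mean(right_skewed), statistics.median(right_skewed))
```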

Association

Relationship between one variable and another

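|             | Categorical       | Ordinal                      | Numeric               |
|-------------|-------------------|------------------------------|-----------------------|
| Categorical | Contingency table | Contingency table / Bar plot | Pivot table / Boxplot |
| Ordinal     | -                 | Contingency table / Bar plot | Pivot table / Boxplot |
| Numeric     | -                 | -                            | Scatterplot           |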
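
Two of these pairings can be sketched with the standard library alone: a contingency table for categorical x categorical, and a pivot-style summary for categorical x numeric. The rows and the variable names (`genre`, `register`, `word_count`) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
import statistics

# Hypothetical observations: (genre, register, word_count); illustrative only.
rows = [
    ("fiction", "informal", 120),
    ("fiction", "formal", 150),
    ("news", "formal", 200),
    ("news", "formal", 220),
    ("fiction", "informal", 120),
]

# Categorical x categorical: contingency table of co-occurrence counts.
contingency = Counter((genre, register) for genre, register, _ in rows)

# Categorical x numeric: pivot-style summary (mean word count per genre).
by_genre = defaultdict(list)
for genre, _, word_count in rows:
    by_genre[genre].append(word_count)
mean_by_genre = {genre: statistics.mean(vals) for genre, vals in by_genre.items()}

print(contingency)
print(mean_by_genre)
```

A numeric x numeric pairing would normally be inspected with a scatterplot, which needs a plotting library and is left out of this sketch.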

Demo

Load necessary packages

Very Short Stories (vss_df) dataset

Statistical overview

Targeted statistics

Distribution

Association

Categorical x Categorical/ Ordinal

Categorical x Numeric (measure)

Categorical x Numeric

Numeric x Numeric

Analyze

Aims, approach, methods, and evaluation

Table 1: Exploratory Data Analysis

|            |                                                                     |
|------------|---------------------------------------------------------------------|
| Aims       | Explore: gain insight, open new avenues                             |
| Approach   | Inductive, data-driven, and iterative                               |
| Methods    | Descriptive, pattern detection with machine learning (unsupervised) |
| Evaluation | Associative                                                         |

Table 2: Predictive Data Analysis

|            |                                                        |
|------------|--------------------------------------------------------|
| Aims       | Examine: support and validate                          |
| Approach   | Semi-deductive, data/theory-driven, and iterative      |
| Methods    | Predictive modeling with machine learning (supervised) |
| Evaluation | Accuracy measures, associative                         |

Table 3: Inferential Data Analysis

|            |                                                        |
|------------|--------------------------------------------------------|
| Aims       | Extrapolate: generalize and explain                    |
| Approach   | Deductive, theory-driven, and non-iterative            |
| Methods    | Inferential statistics (theory- or simulation-based)   |
| Evaluation | Causal inference, associative                          |

Communicate

Report

Presentations, articles, and reports are the primary means of communicating results.

flowchart LR
  subgraph "Motivation"
    A[Literature review] --> B[Research question]
    A --> C[Hypothesis]
  end
  subgraph "Methods"
    C --> D["Data\n description"]
    B --> D
    D --> F["Data analysis\n description"]
  end
  subgraph "Analysis"
    F --> G["Descriptive statistics"]
    G --> H["Exploratory findings"]
    G --> I["Predictive modeling"]
    G --> J["Inferential estimates"]
    H --> K[Results]
    I --> K
    J --> K
  end
  subgraph "Discussion"
    K --> L[Interpretation]
    K --> N[Limitations]
    K --> M[Implications]
  end

Document

Summary

Upshot

  • Descriptive statistics are the first step in understanding data.
  • Statistical thinking is a process of asking questions and answering them with data.
  • The process of analysis depends on the aims of the research.
  • Communication and documentation are key to sharing results and understanding.

Looking ahead

References

Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2022. Skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.