Predict

“All models are wrong, but some are useful.”
— George E.P. Box

Dr. Jerid Francom

Apr 3, 2024

Overview

Orientation
Predictive modeling
- Workflow with tidymodels

Process

Orientation

Predictive Data Analysis

Goals

Prescribe actions
Examine outcome-predictor relationship
Assess hypotheses

When to use

To perform tasks
Specific knowledge gap
Alternative to inference

How to use

Identify, Inspect, Interrogate, Interpret
Iterative:
- Features, Model

Predictive modeling

Classification vs. Regression tasks

Classification: Predicting a categorical variable

Regression: Predicting a continuous variable

Features: tokenization

In text analysis, features are often linguistic units (tokens).

Features: metadata

But they can also be other types of variables such as metadata.

Features: text features

Or derived features.

Workflow with `tidymodels`

A. Identify

Variables
Splits
Recipe

B. Inspect

Features

C. Interrogate

Model
Tune
Fit
Evaluate

D. Interpret

Predict
Evaluate
Explore

Identify: Variables

Outcome variable: The variable you want to predict
Predictor variables: The variables you will use to make the prediction

Variable	Type	Description
`gender`	Outcome	Aim to predict ‘female’ or ‘male’
`text`	Predictor	Text data to predict `gender`

Identify: Splits

Training set: Used to train, tune, and evaluate the model
Testing set: Used to evaluate the final model
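A minimal sketch of the split using `rsample`. The toy data frame, the 75/25 proportion, and stratifying on `gender` are assumptions for illustration, not settings taken from the slides.

```r
library(rsample)

set.seed(123)

# Hypothetical toy data using the deck's `gender` and `text` variables
toy <- data.frame(
  gender = factor(rep(c("female", "male"), each = 50)),
  text   = paste("document", seq_len(100))
)

# Hold out roughly 25% for testing; stratify to keep the class balance
toy_split <- initial_split(toy, prop = 0.75, strata = gender)
toy_train <- training(toy_split)  # used to train, tune, and evaluate
toy_test  <- testing(toy_split)   # touched only for the final evaluation

nrow(toy_train)
nrow(toy_test)
```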

Identify: Recipe

Recipe: A blueprint for how to process the data

Identify: Recipe

Feature selection: Choosing the most relevant variables

Identify: Recipe

Feature engineering: Deriving new variables and transforming existing ones
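One way such a recipe might look with `textrecipes`, using word tokens, a 150-token filter, and tf-idf weights (the settings listed later in the deck). The toy training data is hypothetical, and because it has fewer than 150 distinct tokens, the filter keeps them all and may warn about it.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy training data with made-up tokens
toy_train <- data.frame(
  gender = factor(c("female", "male", "female", "male")),
  text   = c("alpha beta gamma", "delta omega sigma",
             "alpha gamma gamma", "omega sigma delta"),
  stringsAsFactors = FALSE
)

rec <- recipe(gender ~ text, data = toy_train) |>
  step_tokenize(text) |>                       # tokenization: words (default)
  step_tokenfilter(text, max_tokens = 150) |>  # feature selection: top 150 tokens
  step_tfidf(text)                             # feature engineering: tf-idf

# prep() estimates the steps on the training data; bake() applies them
baked <- rec |> prep() |> bake(new_data = NULL)
names(baked)
```

The recipe itself is only a blueprint: nothing is computed until `prep()` is called, which is what lets the same steps be re-estimated inside each resample later on.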

Interrogate: Model selection

Model specification: A blueprint for the model
Model family: The type of model to use (e.g., logistic regression, random forest)
Engine: The software that will fit the model (e.g., LiblineaR, ranger)
Hyperparameters: Settings that control the model’s behavior (e.g., number of trees in a random forest)

Model	Family	Engine
`logistic_reg()`	Logistic regression	`LiblineaR`
`decision_tree()`	Decision tree	`C5.0`
`rand_forest()`	Random forest	`ranger`
`svm_linear()`	Support vector machine	`LiblineaR`

Each model has hyperparameters that can be tuned to improve performance.

Interrogate: Model selection

Model specification: A blueprint for the model

The `logistic_reg()` model has a `penalty` hyperparameter that controls the amount of regularization applied to the coefficients. Tuning this parameter together with `max_tokens` in the token filter can help the model generalize better.
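A sketch of two model specifications in `parsnip`. In `parsnip`, `penalty` sets the amount of regularization; the concrete values here (`mixture = 1` for a lasso penalty, 500 trees) are illustrative assumptions.

```r
library(tidymodels)

# Logistic regression: family + engine; `penalty` is left as a placeholder to tune
lr_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("LiblineaR")

# Random forest: same pattern, different family, engine, and hyperparameter
rf_spec <- rand_forest(trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")

lr_spec
rf_spec
```

A specification does not touch any data; it only records what should be fit and by which software.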

Interrogate: Model selection

Create a workflow that combines the recipe and model specification.
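Combining the two pieces might look like this. The toy data and the fixed `penalty = 0.01` are assumptions made only so the block stands on its own.

```r
library(tidymodels)
library(textrecipes)

# Hypothetical toy training data
toy_train <- tibble(
  gender = factor(c("female", "male", "female", "male")),
  text   = c("alpha beta gamma", "delta omega sigma",
             "alpha gamma gamma", "omega sigma delta")
)

rec <- recipe(gender ~ text, data = toy_train) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = 150) |>
  step_tfidf(text)

spec <- logistic_reg(penalty = 0.01, mixture = 1) |>
  set_engine("LiblineaR")

# The workflow bundles preprocessing and model so they are fit together
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

wf
```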

Interrogate: Model tuning

Hyperparameter tuning: Finding the best settings for the model
Resampling: Using the training set to estimate how well the model will perform on new (slices of) data

Interrogate: Model tuning

Choose the best hyperparameters and finalize the workflow.
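A hedged sketch of tuning and finalizing, assuming the `LiblineaR` package is installed. The toy corpus, the grid ranges, the 5 folds, and the choice of accuracy as the selection metric are all illustrative assumptions.

```r
library(tidymodels)
library(textrecipes)

set.seed(123)

# Hypothetical toy corpus: each class favours its own made-up tokens
make_text <- function(own) {
  paste(sample(c(own, "the", "and", "of"), 8, replace = TRUE), collapse = " ")
}
toy_train <- tibble(
  gender = factor(rep(c("female", "male"), each = 60)),
  text   = c(replicate(60, make_text(c("alpha", "beta", "gamma"))),
             replicate(60, make_text(c("delta", "omega", "sigma"))))
)

# Mark the two settings to tune with tune()
rec <- recipe(gender ~ text, data = toy_train) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = tune()) |>
  step_tfidf(text)
spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("LiblineaR")
wf <- workflow() |> add_recipe(rec) |> add_model(spec)

# Resamples of the training set and a small grid of candidate values
folds <- vfold_cv(toy_train, v = 5, strata = gender)
grid  <- grid_regular(
  penalty(range = c(-3, 0)),         # log10 scale: 0.001 to 1
  max_tokens(range = c(3L, 9L)),
  levels = 3
)

tuned <- tune_grid(wf, resamples = folds, grid = grid,
                   metrics = metric_set(accuracy))

# Pick the best combination and lock it into the workflow
best     <- select_best(tuned, metric = "accuracy")
final_wf <- finalize_workflow(wf, best)
best
```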

Interrogate: Fit the model

Fit the model: Train the model on the training set
Cross-validation: Repeatedly train and evaluate the model on different slices of the training set
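To make the "slices" concrete, here is a small sketch of how `rsample` carves a training set into folds. The 100-row toy data and `v = 5` are assumptions.

```r
library(rsample)

set.seed(123)

# Hypothetical training set with an id column so rows can be tracked
toy_train <- data.frame(
  id     = 1:100,
  gender = factor(rep(c("female", "male"), 50))
)

folds <- vfold_cv(toy_train, v = 5)

# Each split has an analysis part (train on it) and an assessment part (evaluate on it)
first <- folds$splits[[1]]
nrow(analysis(first))    # 80 rows
nrow(assessment(first))  # 20 rows

# Across the 5 folds, every row is held out exactly once
held_out <- unlist(lapply(folds$splits, function(s) assessment(s)$id))
```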

Interrogate: Evaluate the model

Performance metrics: Measures of how well the model is doing

Classification

Confusion matrix: A table showing the model’s predictions versus the actual outcomes
ROC curve: A graph showing the trade-off between true positive rate and false positive rate
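Both classification measures can be computed with `yardstick`. The eight predictions below are invented for illustration, and the 0.5 cutoff for turning probabilities into classes is an assumption.

```r
library(yardstick)

lvls <- c("female", "male")

# Hypothetical predictions: true class and predicted probability of "female"
preds <- data.frame(
  truth = factor(rep(lvls, each = 4), levels = lvls),
  .pred_female = c(0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1)
)
preds$estimate <- factor(
  ifelse(preds$.pred_female > 0.5, "female", "male"),
  levels = lvls
)

# Confusion matrix: predicted classes against actual classes
cm  <- conf_mat(preds, truth = truth, estimate = estimate)
acc <- accuracy(preds, truth = truth, estimate = estimate)

# ROC curve and the area under it, based on the probabilities
curve <- roc_curve(preds, truth = truth, .pred_female)
auc   <- roc_auc(preds, truth = truth, .pred_female)

cm
acc$.estimate  # 0.75: six of eight predictions are correct
auc$.estimate  # 0.9375
```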

Regression

RMSE: Root mean squared error
Standard deviation of residuals: How much the model’s predictions deviate from the actual outcomes
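RMSE can be worked through by hand in base R. The four values are made up for illustration.

```r
# Hypothetical actual and predicted values for a regression task
actual    <- c(3, 5, 7, 9)
predicted <- c(2.5, 5.5, 6, 10)

# Residuals: how far each prediction is from the actual outcome
residuals <- actual - predicted   # 0.5, -0.5, 1, -1

# RMSE: square the residuals, average them, take the square root
rmse <- sqrt(mean(residuals^2))
rmse                              # about 0.79

# The sample standard deviation of the residuals is a close relative;
# it divides by n - 1 and centres on the mean residual
sd(residuals)
```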

Identify: Recipe (x2)

Our previous feature selection:

Tokenization: words
Feature engineering: tf-idf
Feature selection: 150 tokens

Interrogate: Model selection (x2)

Update the workflow with the new recipe.

Interrogate: Model tuning (x2)

Update the grid and resampling.

Interrogate: Fit the model (x2)

Fit the model: Train the model on the training set
Cross-validation: Repeatedly train and evaluate the model on different slices of the training set

Interrogate: Evaluate the model (x2)

Performance metrics: Measures of how well the model is doing

Identify: Recipe (x3)

Our previous feature selection:

Tokenization: words
Feature engineering: tf-idf
Feature selection: 150 tokens

Interrogate: Model selection (x3)

Update the workflow with the new recipe.

Interrogate: Model tuning (x3)

Update the grid and resampling.

Interpret: Predict

Predictions: Using the model to make predictions on new data (test set)
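A sketch of the final step with `last_fit()`, which fits on the training set and predicts on the test set in one call. It assumes the `LiblineaR` package is installed; the toy corpus, `penalty = 0.01`, and `max_tokens = 6` are illustrative stand-ins for tuned values.

```r
library(tidymodels)
library(textrecipes)

set.seed(123)

# Hypothetical toy corpus: each class favours its own made-up tokens
make_text <- function(own) {
  paste(sample(c(own, "the", "and", "of"), 8, replace = TRUE), collapse = " ")
}
toy <- tibble(
  gender = factor(rep(c("female", "male"), each = 100)),
  text   = c(replicate(100, make_text(c("alpha", "beta", "gamma"))),
             replicate(100, make_text(c("delta", "omega", "sigma"))))
)
toy_split <- initial_split(toy, prop = 0.75, strata = gender)

rec <- recipe(gender ~ text, data = training(toy_split)) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = 6) |>
  step_tfidf(text)
spec <- logistic_reg(penalty = 0.01, mixture = 1) |>
  set_engine("LiblineaR")
wf <- workflow() |> add_recipe(rec) |> add_model(spec)

# Fit on the training portion, then predict and score the test portion
final_fit  <- last_fit(wf, toy_split, metrics = metric_set(accuracy))
test_acc   <- collect_metrics(final_fit)
test_preds <- collect_predictions(final_fit)

test_acc
conf_mat(test_preds, truth = gender, estimate = .pred_class)
```

Comparing the test accuracy with the resampled accuracy from tuning is one way to check how well the model generalizes.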

Interpret: Evaluate

Generalization: How well the model performs on new data

Overfitting	Underfitting
When the model performs well on the training set but poorly on new data	When the model performs poorly on both the training set and new data

Interpret: Explore

Feature importance: Which variables are most important for the model’s predictions

For linear models we get coefficients, for tree-based models we get variable importance.

Interpret: Explore

We need to standardize the coefficients to compare them.
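A small base R simulation shows why. The two predictors below have the same effect per standard deviation but sit on very different scales; the data and effect sizes are invented for illustration, and `glm()` stands in for whatever linear model is being interpreted.

```r
set.seed(42)
n <- 2000

# Two hypothetical predictors with equal importance but different scales
x1 <- rnorm(n, mean = 0, sd = 1)
x2 <- rnorm(n, mean = 0, sd = 100)
y  <- rbinom(n, 1, plogis(1 * x1 + 0.01 * x2))

# Raw coefficients reflect the units of each predictor
raw <- glm(y ~ x1 + x2, family = binomial)

# After standardizing, coefficients are per standard deviation and comparable
std <- glm(y ~ scale(x1) + scale(x2), family = binomial)

raw_coefs <- unname(coef(raw)[-1])
std_coefs <- unname(coef(std)[-1])

raw_coefs  # very different magnitudes
std_coefs  # similar magnitudes
```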

Wrap-up

Final thoughts

Predictive modeling is a powerful tool for examining relationships in data: it can perform tasks (as in AI applications) or provide insight into which features matter for the outcome.
The tidymodels framework, a collection of packages, provides a consistent and flexible interface for building and evaluating models.

References

Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.
