“All models are wrong, but some are useful.”
- George E.P. Box
Apr 3, 2024
tidymodels
Goals
When to use
How to use
In text analysis, features are often linguistic units (tokens).
But they can also be other types of variables such as metadata.
Or derived features.
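The three kinds of features can be illustrated with a small base R sketch. The documents, the `source` metadata column, and the derived feature names below are hypothetical and only serve to show the distinction.

```r
# Hypothetical toy documents (not the course data)
docs <- data.frame(
  doc_id = 1:2,
  source = c("blog", "forum"),  # metadata feature
  text   = c("I love text analysis", "Models are wrong but useful"),
  stringsAsFactors = FALSE
)

# Linguistic units: lowercase word tokens, split on whitespace
tokens <- strsplit(tolower(docs$text), "\\s+")

# Derived features: token count and mean token length per document
docs$n_tokens <- lengths(tokens)
docs$mean_token_length <- vapply(tokens, function(x) mean(nchar(x)), numeric(1))

docs
```

In a tidymodels pipeline the same ideas would typically be expressed as recipe steps rather than computed by hand.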
tidymodels
A. Identify
B. Inspect
C. Interrogate
D. Interpret
Variable | Type | Description |
---|---|---|
gender | Outcome | Aim to predict “female” or “male” |
text | Predictor | Text data to predict gender |
Model | Family | Engine |
---|---|---|
logistic_reg() | Logistic regression | LiblineaR |
decision_tree() | Decision tree | C5.0 |
rand_forest() | Random forest | ranger |
svm_linear() | Support vector machine | LiblineaR |
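A minimal sketch of how these specifications might be written with parsnip. The argument values (`penalty = 0.1`, `trees = 500`) are placeholders chosen for illustration, not values from the course. Note that parsnip names its random forest specification `rand_forest()`.

```r
library(parsnip)

# Regularized logistic regression (mixture = 1 requests a lasso penalty)
log_spec <- logistic_reg(penalty = 0.1, mixture = 1) |>
  set_engine("LiblineaR") |>
  set_mode("classification")

# Decision tree
tree_spec <- decision_tree() |>
  set_engine("C5.0") |>
  set_mode("classification")

# Random forest
rf_spec <- rand_forest(trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")

# Linear support vector machine
svm_spec <- svm_linear() |>
  set_engine("LiblineaR") |>
  set_mode("classification")
```

A specification only declares the model family, engine, and mode. Nothing is estimated until the specification is fit, and the engine packages need to be installed at that point.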
Each model has hyperparameters that can be tuned to improve performance.
The logistic_reg() model has a penalty hyperparameter that controls the amount of regularization applied to the coefficients. Tuning this parameter together with the max_tokens filter can help the model generalize better.
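One way to mark both values for tuning is sketched below, assuming the textrecipes package supplies the tokenization steps. The toy data and the grid ranges are hypothetical.

```r
library(tidymodels)
library(textrecipes)

# Hypothetical toy data with the outcome and predictor named as in the slides
toy <- tibble(
  gender = factor(c("female", "male", "female", "male")),
  text   = c("one two three", "two three four", "three four five", "four five six")
)

# Recipe: tokenize, keep the most frequent tokens (number to be tuned), weight by tf-idf
rec <- recipe(gender ~ text, data = toy) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = tune()) |>
  step_tfidf(text)

# Model: the penalty (amount of regularization) is also left to be tuned
spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("LiblineaR")

# Candidate values: 3 penalties (log10 scale) x 3 token limits
grid <- grid_regular(
  penalty(range = c(-3, 0)),
  max_tokens(range = c(100L, 1000L)),
  levels = 3
)
```

A regular grid is only one option. Other grid designs from dials could be substituted without changing the recipe or the model specification.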
Create a workflow that combines the recipe and model specification.
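A minimal sketch of such a workflow, using hypothetical toy data and fixed placeholder values for the token limit and penalty:

```r
library(tidymodels)
library(textrecipes)

# Hypothetical toy data
toy <- tibble(
  gender = factor(c("female", "male", "female", "male")),
  text   = c("one two three", "two three four", "three four five", "four five six")
)

# Preprocessing recipe
rec <- recipe(gender ~ text, data = toy) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = 100) |>
  step_tfidf(text)

# Model specification
spec <- logistic_reg(penalty = 0.1, mixture = 1) |>
  set_engine("LiblineaR")

# The workflow bundles preprocessing and model into one object
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)
```

Bundling the two means the recipe is re-estimated on each training set during resampling, which helps avoid leaking information from held-out data.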
Choose the best hyperparameters and finalize the workflow.
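The selection and finalization step might look like the sketch below. Everything about the data is an assumption: the corpus is randomly generated from two made-up word pools, and the grid ranges, number of folds, and the choice of accuracy as the selection metric are illustrative only.

```r
library(tidymodels)
library(textrecipes)

set.seed(2024)

# Hypothetical toy corpus: two classes drawn from overlapping word pools
pool_a <- c("alpha", "beta", "gamma", "delta", "omega", "sigma")
pool_b <- c("delta", "omega", "sigma", "kappa", "lambda", "theta")
make_text <- function(pool) paste(sample(pool, 10, replace = TRUE), collapse = " ")

toy <- tibble(
  gender = factor(rep(c("female", "male"), each = 30)),
  text   = c(replicate(30, make_text(pool_a)), replicate(30, make_text(pool_b)))
)

rec <- recipe(gender ~ text, data = toy) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = tune()) |>
  step_tfidf(text)

spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("LiblineaR")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

# Resamples and candidate hyperparameter values
folds <- vfold_cv(toy, v = 3, strata = gender)
grid <- grid_regular(
  penalty(range = c(-2, 0)),
  max_tokens(range = c(3L, 9L)),
  levels = 3
)

# Evaluate every candidate on every fold
tuned <- tune_grid(wf, resamples = folds, grid = grid)

# Pick the best candidate and plug its values into the workflow
best <- select_best(tuned, metric = "accuracy")
final_wf <- finalize_workflow(wf, best)
```

The finalized workflow no longer contains `tune()` placeholders, so it can be fit on the training data and evaluated on the test set.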
Performance metrics: Measures of how well the model is doing
Classification
Regression
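As a small illustration, yardstick computes metrics of both kinds from a data frame of true and predicted values. The predictions below are invented, and accuracy and RMSE are just one common choice per task type.

```r
library(yardstick)

# Classification: hypothetical true and predicted classes
class_results <- data.frame(
  truth    = factor(c("female", "male", "female", "male", "female"),
                    levels = c("female", "male")),
  estimate = factor(c("female", "male", "male", "male", "female"),
                    levels = c("female", "male"))
)
acc <- accuracy(class_results, truth = truth, estimate = estimate)

# Regression: hypothetical true and predicted numeric values
reg_results <- data.frame(
  truth    = c(1, 2, 3, 4),
  estimate = c(1, 2, 3, 6)
)
err <- rmse(reg_results, truth = truth, estimate = estimate)

acc$.estimate  # proportion of correct class predictions
err$.estimate  # root mean squared error
```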
Our previous feature selection:
Update the workflow with the new recipe.
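A sketch of swapping the recipe while keeping the model, assuming the workflows function `update_recipe()`. The n-gram step in the new recipe is an illustrative change and may differ from the feature selection used in the course.

```r
library(tidymodels)
library(textrecipes)

# Hypothetical toy data
toy <- tibble(
  gender = factor(c("female", "male", "female", "male")),
  text   = c("one two three", "two three four", "three four five", "four five six")
)

# Previous feature selection: single-word tokens
rec_words <- recipe(gender ~ text, data = toy) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = 100) |>
  step_tfidf(text)

spec <- logistic_reg(penalty = 0.1, mixture = 1) |>
  set_engine("LiblineaR")

wf <- workflow() |>
  add_recipe(rec_words) |>
  add_model(spec)

# New feature selection (illustrative): unigrams plus bigrams
rec_ngrams <- recipe(gender ~ text, data = toy) |>
  step_tokenize(text) |>
  step_ngram(text, num_tokens = 2, min_num_tokens = 1) |>
  step_tokenfilter(text, max_tokens = 100) |>
  step_tfidf(text)

# Replace the recipe; the model specification is left untouched
wf_updated <- update_recipe(wf, rec_ngrams)
```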
Update the grid and resampling.
Our previous feature selection:
Update the workflow with the new recipe.
Update the grid and resampling.
Overfitting | Underfitting |
---|---|
When the model performs well on the training set but poorly on new data | When the model performs poorly on both the training set and new data |
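The contrast can be made concrete by comparing training and test error for a very rigid and a very flexible model. The base R sketch below uses simulated regression data rather than the text classification task, purely for illustration. The two models are assumptions chosen to exaggerate each behavior.

```r
set.seed(42)

# Simulated data: a smooth signal plus noise
n <- 60
x <- runif(n, -1, 1)
y <- sin(pi * x) + rnorm(n, sd = 0.3)

# Hold out a test set
train_idx <- sample(n, 40)
train <- data.frame(x = x[train_idx], y = y[train_idx])
test  <- data.frame(x = x[-train_idx], y = y[-train_idx])

rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

fit_rigid    <- lm(y ~ 1, data = train)            # intercept only: prone to underfitting
fit_flexible <- lm(y ~ poly(x, 15), data = train)  # degree-15 polynomial: prone to overfitting

comparison <- data.frame(
  model      = c("intercept only", "degree-15 polynomial"),
  train_rmse = c(rmse(fit_rigid, train), rmse(fit_flexible, train)),
  test_rmse  = c(rmse(fit_rigid, test), rmse(fit_flexible, test))
)
comparison
```

A large gap between `train_rmse` and `test_rmse` points toward overfitting, while similarly high values in both columns point toward underfitting.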
For linear models we get coefficients; for tree-based models we get variable importance.
We need to standardize the coefficients to compare them.
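A base R sketch of why standardization matters, using simulated predictors on very different scales. The data and the variable names are hypothetical.

```r
set.seed(1)

# Two predictors with very different spreads
d <- data.frame(
  x1 = rnorm(100, sd = 1),
  x2 = rnorm(100, sd = 100)
)
d$y <- 2 * d$x1 + 0.02 * d$x2 + rnorm(100)

# Raw coefficients are expressed per unit of each predictor, so they are not comparable
raw <- coef(lm(y ~ x1 + x2, data = d))

# Standardize the predictors (mean 0, sd 1) and refit
d_std <- d
d_std$x1 <- as.numeric(scale(d$x1))
d_std$x2 <- as.numeric(scale(d$x2))
std <- coef(lm(y ~ x1 + x2, data = d_std))

raw
std
```

After standardization each coefficient describes the change in the outcome per one standard deviation of the predictor, which puts them on a common scale. In a recipe, `step_normalize()` plays a similar role.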
The tidymodels package provides a consistent and flexible framework for building and evaluating models.
Predict | Quantitative Text Analysis | Wake Forest University