Transformation

Prepare and enrich datasets

Dr. Jerid Francom

Mar 20, 2024

Overview

Preparation

Normalization
Tokenization

Enrichment

Recoding
Generation
Integration

Process

Preparation

Normalization

Sanitize and standardize: removing artifacts, coding anomalies, and other inconsistencies.

Table 1: Characteristics of the Europarl Corpus dataset that may require normalization.

Description	Examples
Non-speech annotations	`(Abucheos)`, `(A4-0247/98)`, `(The sitting was opened at 09:00)`
Inconsistent whitespace	`5 % ,`, `Palacio' s`
Non-sentence punctuation	`-`
Abbreviations	`Mr.`, `Sr.`, `Mme.`, `Mr`, `Sr`, `Mme`, `Mister`, `Señor`, `Madam`
Text case	`The`, `the`, `White`, `white`
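
Several of the characteristics in Table 1 can be handled with a few regular-expression replacements. The following is a minimal sketch, not the procedure used for the Europarl dataset: the `lines` vector is hypothetical, built only to imitate the annotation and whitespace examples in the table, and the patterns are assumptions that would need checking against the real data.

```r
library(stringr)

# Hypothetical lines imitating issues listed in Table 1
lines <- c(
  "(Abucheos) Palacio' s report covers 5 % , roughly.",
  "(The sitting was opened at 09:00)"
)

normalized <- lines |>
  str_remove_all("\\([^)]*\\)") |>            # drop parenthesized non-speech annotations
  str_replace_all("\\s+([%,.;:])", "\\1") |>  # remove whitespace before % and punctuation
  str_replace_all("'\\s+s\\b", "'s") |>       # rejoin a detached possessive
  str_squish()                                # trim and collapse remaining whitespace

normalized
```

Lines that consist only of an annotation come out empty, so a follow-up step would typically filter them out.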

Normalizing: example

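library(dplyr)   # mutate()
library(stringr) # str_*() functions

# Preview the dataset
sent_df

# Inspect candidate issues before changing anything
str_view(sent_df$text, pattern = "\\s{2,}") # runs of two or more whitespace characters
str_view(sent_df$text, "\\w\\.\\b")         # a period directly followed by a word character
str_view(sent_df$text, "^\\[.*?\\]")        # a bracketed label at the start of the text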

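sent_df |>
  mutate(
    # Collapse runs of whitespace to a single space
    text = str_replace_all(text, "\\s{2,}", " "),
    # Replace a period wedged between two word characters with a space
    text = str_replace_all(text, "(\\w)\\.(\\w)", "\\1 \\2"),
    # Drop a leading "[label]: " prefix (lazy match, as in the preview above)
    text = str_remove(text, "^\\[.*?\\]:\\s"),
    # Convert to sentence case
    text = str_to_sentence(text)
  )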

Tokenization

Change linguistic unit: larger, smaller, or groupings.

It was the essence of life itself.

Table 2: Word tokens

Description	Examples
Unigrams	`It`, `was`, `the`, `essence`, `of`, `life`, `itself`
Bigrams	`It was`, `was the`, `the essence`, `essence of`, `of life`, `life itself`
Trigrams	`It was the`, `was the essence`, `the essence of`, `essence of life`, `of life itself`
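
The word tokens in Table 2 can be generated programmatically. The sketch below uses `tokenize_words()` and `tokenize_ngrams()` from the `tokenizers` package; setting `lowercase = FALSE` is a choice made here to keep the capitalization shown in the table, since both functions lowercase by default.

```r
library(tokenizers)

sentence <- "It was the essence of life itself."

# Each function returns a list with one element per input string
unigrams <- tokenize_words(sentence, lowercase = FALSE)[[1]]
bigrams  <- tokenize_ngrams(sentence, n = 2, lowercase = FALSE)[[1]]
trigrams <- tokenize_ngrams(sentence, n = 3, lowercase = FALSE)[[1]]

unigrams
bigrams
trigrams
```

Punctuation is stripped before the n-grams are built, so the final period does not appear in any token.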

Table 3: Character tokens

Description	Examples
Unigrams	`I`, `t`, `w`, `a`, `s`, `t`, `h`, `e`, `e`, `s`, `s`, `e`, `n`, `c`, `e`, `o`, `f`, `l`, `i`, `f`, `e`, `i`, `t`, `s`, `e`, `l`, `f`
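Bigrams	`It`, `tw`, `wa`, `as`, `st`, `th`, `he`, `ee`, `es`, `ss`, `se`, `en`, `nc`, `ce`, `eo`, `of`, `fl`, `li`, `if`, `fe`, `ei`, `it`, `ts`, `se`, `el`, `lf`
Trigrams	`Itw`, `twa`, `was`, `ast`, `sth`, `the`, `hee`, `ees`, `ess`, `sse`, `sen`, `enc`, `nce`, `ceo`, `eof`, `ofl`, `fli`, `lif`, `ife`, `fei`, `eit`, `its`, `tse`, `sel`, `elf`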

Note: It is also possible to reconstruct larger tokens from smaller ones (e.g., words from characters, sentences from words).
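
Character n-grams can be produced with a sliding window over the characters of the sentence. The helper `char_ngrams()` below is hypothetical and written in base R for illustration. It assumes that spaces and punctuation are dropped first, so windows run across word boundaries; keeping a boundary marker instead is an equally valid design and would yield different tokens.

```r
# Hypothetical helper: sliding-window character n-grams
char_ngrams <- function(x, n) {
  # Assumption: keep only letters and digits, so windows cross word boundaries
  chars <- strsplit(gsub("[^[:alnum:]]", "", x), "")[[1]]
  if (length(chars) < n) return(character(0))
  starts <- seq_len(length(chars) - n + 1)
  vapply(
    starts,
    function(i) paste(chars[i:(i + n - 1)], collapse = ""),
    character(1)
  )
}

sentence <- "It was the essence of life itself."

char_ngrams(sentence, 1) # unigrams
char_ngrams(sentence, 2) # bigrams
char_ngrams(sentence, 3) # trigrams
```

The `tokenizers` package offers `tokenize_character_shingles()` for the same purpose; by default it also lowercases the text and strips non-alphanumeric characters.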

Tokenization: case

Consider the following paragraph:

“As the sun dipped below the horizon, the sky was set ablaze with shades of orange-red, illuminating the landscape. It’s a sight Mr. Johnson, a long-time observer, never tired of. On the lakeside, he’d watch with friends, enjoying the ever-changing hues—especially those around 6:30 p.m.—and reflecting on nature’s grand display. Even in the half-light, the water’s glimmer, coupled with the echo of distant laughter, created a timeless scene. The so-called ‘magic hour’ was indeed magical, yet fleeting, like a well-crafted poem; it was the essence of life itself.”

What text conventions would pose issues for word tokenization based on a whitespace criterion?
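
One way to explore the question is to split on whitespace and look at what comes out. This sketch uses base R and, to keep the output short, only one sentence from the paragraph; that restriction is a choice made here, not part of the exercise.

```r
# One sentence from the paragraph above
sentence <- "It’s a sight Mr. Johnson, a long-time observer, never tired of."

# Split on runs of whitespace only
tokens <- strsplit(sentence, "\\s+")[[1]]
tokens

# Tokens that still carry punctuation at their right edge
trailing_punct <- tokens[grepl("[[:punct:]]$", tokens)]
trailing_punct
```

Commas and sentence-final periods stay attached to words, and the abbreviation `Mr.` cannot be told apart from a sentence ending on the basis of whitespace alone. Contractions and hyphenated compounds remain single tokens, which may or may not be what the analysis calls for.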

Tokenization: example

para_df

# `tokenizers` package
library(tokenizers)
args(tokenize_words)

tokenize_words(para_df$text)
tokenize_sentences(para_df$text)

# Manually nest/unnest (`mutate()` and `select()` from dplyr, `unnest()` from tidyr)
library(dplyr)
library(tidyr)
para_df |>
  mutate(
    sents = tokenize_sentences(text)
  ) |>
  unnest(cols = sents) |>
  select(doc_id, sents)

# `tidytext` package
library(tidytext)
args(unnest_tokens)

# Unnesting by words
para_df |>
  unnest_tokens(token, text) |>
  pull(token)

# Unnesting by sentences
para_df |>
  unnest_tokens(token, text, token = "sentences") |>
  pull(token)

Enrichment

Generation

Derive attributes: from implicit information in the dataset.

Lemmatization
Part-of-speech tagging
Morphological analysis
Named entity recognition
Sentiment analysis
Dependency parsing
…

Generation: example

Part-of-speech tagging, lemmatization, and morphological analysis


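# `udpipe` package; `en_mdl` is assumed to be an English model
# already loaded in the session
library(udpipe)

# Full annotation
sent_ann <-
  udpipe(x = sent_df, object = en_mdl) |>
  as_tibble()

# Part-of-speech tagging, lemmas, and morphological features
# (dependency parsing skipped)
sent_tok <-
  udpipe_annotate(
    object = en_mdl,
    x = sent_df$text,
    tagger = "default",
    parser = "none"
  ) |>
  as_tibble()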

Recoding

Recast values: to make explicit information more accessible.

Express a variable as a different grouping, scale, or measure:

Type: Numeric > ordinal > categorical
Scale:
- Logarithmic transformation
- Standardization
Measures: Results from a calculation
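
The three kinds of recoding listed above can be sketched on a small table. The data in `freq_df` is hypothetical, and the cut points for the ordinal bands are arbitrary values chosen for illustration.

```r
library(dplyr)

# Hypothetical word-frequency data
freq_df <- tibble(
  word = c("the", "essence", "life", "glimmer"),
  n    = c(1000, 100, 10, 1)
)

freq_recoded <- freq_df |>
  mutate(
    # Type: numeric to ordinal (arbitrary cut points)
    freq_band = cut(
      n,
      breaks = c(0, 5, 500, Inf),
      labels = c("low", "mid", "high"),
      ordered_result = TRUE
    ),
    # Scale: logarithmic transformation
    log_n = log10(n),
    # Scale: standardization (z-score)
    z_n = (n - mean(n)) / sd(n),
    # Measure: relative frequency per 1,000 tokens
    per_k = n / sum(n) * 1000
  )

freq_recoded
```

Each new column sits alongside the original `n`, so the recoding can be checked against the source values before any column is dropped.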

Recoding: example

# Recode `feats` to `tense`
sent_tok_tense <-
  sent_tok |>
  mutate(
    tense = case_when(
      str_detect(feats, "Tense=Past") ~ "Past",
      str_detect(feats, "Tense=Pres") ~ "Present",
      TRUE ~ "other"
    )
  )

Integration

Juxtapose datasets: to create a new dataset.

Join: to add columns or rows based on a common key.
Concatenate: to add rows to a common set of columns.

Joining: example

Sentiment lexicon
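
A minimal sketch of joining a tokenized dataset with a sentiment lexicon, using `left_join()` from `dplyr`. Both `tokens_df` and `lexicon_df` are hypothetical toy tables, and the sentiment labels are invented for illustration rather than taken from a published lexicon.

```r
library(dplyr)

# Hypothetical tokenized dataset: one row per token
tokens_df <- tibble(
  doc_id = c(1, 1, 1, 2, 2),
  token  = c("a", "timeless", "scene", "fleeting", "magic")
)

# Hypothetical sentiment lexicon (toy values)
lexicon_df <- tibble(
  token     = c("timeless", "fleeting", "magic"),
  sentiment = c("positive", "negative", "positive")
)

# Keep every token; add `sentiment` where the lexicon has a match
tokens_sent <- tokens_df |>
  left_join(lexicon_df, by = "token")

tokens_sent
```

Tokens missing from the lexicon get `NA` in `sentiment`. Using `inner_join()` instead would drop those rows, which changes the token counts per document.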

Concatenating: example

Two populations
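
A minimal sketch of concatenating datasets from two populations with `bind_rows()` from `dplyr`. The tables `learners_df` and `natives_df`, their contents, and the population labels are hypothetical; the only requirement illustrated is that both share the same columns.

```r
library(dplyr)

# Hypothetical datasets with the same columns, one per population
learners_df <- tibble(
  doc_id = c("L01", "L02"),
  text   = c("I am agree with this.", "She have two cats.")
)

natives_df <- tibble(
  doc_id = c("N01", "N02"),
  text   = c("I agree with this.", "She has two cats.")
)

# Stack the rows; `.id` records which dataset each row came from
combined_df <- bind_rows(
  learner = learners_df,
  native  = natives_df,
  .id = "population"
)

combined_df
```

Recording the source in a `population` column keeps the two groups distinguishable after they are stacked.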

Final thoughts

Transformation is a critical step in the data analysis process.
It builds on the curated dataset to create one or more datasets that are more in line with the analysis goals.
It is an iterative process.
Diagnostics and validation are important to apply as you go along.

References

Mullen, Lincoln A., Kenneth Benoit, Os Keyes, Dmitry Selivanov, and Jeffrey Arnold. 2018. “Fast, Consistent Tokenization of Natural Language Text.” Journal of Open Source Software 3: 655. https://doi.org/10.21105/joss.00655.

Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” Journal of Open Source Software 1 (3). https://doi.org/10.21105/joss.00037.

Wickham, Hadley. 2023. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.

Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wijffels, Jan. 2023. Udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit. https://CRAN.R-project.org/package=udpipe.