Prepare and enrich datasets
Mar 20, 2024
Preparation
Enrichment
Sanitize and standardize: removing artifacts, coding anomalies, and other inconsistencies.
Description | Examples |
---|---|
Non-speech annotations | (Abucheos) , (A4-0247/98) , (The sitting was opened at 09:00) |
Inconsistent whitespace | 5 % , , , Palacio' s |
Non-sentence punctuation | - |
Abbreviations | Mr. , Sr. , Mme. , Mr , Sr , Mme , Mister , Señor , Madam |
Text case | The , the , White , white |
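A minimal sketch of a few of these cleaning steps, using only Python's standard `re` module. The function name `sanitize`, the `lowercase` flag, and the sample line are illustrative assumptions, and the abbreviation rule covers only the "Mr." variants from the table.

```python
import re

def sanitize(text: str, lowercase: bool = False) -> str:
    """Apply a few illustrative cleaning steps to one line of text."""
    # Drop parenthesized non-speech annotations such as "(Abucheos)".
    text = re.sub(r"\([^)]*\)", " ", text)
    # Remove the stray space in "5 %" and in "Palacio' s".
    text = re.sub(r"(\d)\s+%", r"\1%", text)
    text = re.sub(r"(\w)'\s+s\b", r"\1's", text)
    # Map variants of one abbreviation to a single form (only "Mr." here).
    text = re.sub(r"\b(?:Mister|Mr)\b\.?", "Mr.", text)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Optionally fold case so "The" and "the" count as the same form.
    return text.lower() if lowercase else text

# Hypothetical input line combining several of the issues above.
raw = "(Abucheos) Mister  Palacio' s share rose 5 %"
print(sanitize(raw))                  # Mr. Palacio's share rose 5%
print(sanitize(raw, lowercase=True))  # mr. palacio's share rose 5%
```

Whether to lowercase, and which abbreviation form to standardize on, depends on the research question, so both are left as explicit choices here.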
Change linguistic unit: larger, smaller, or groupings.
It was the essence of life itself.
Word n-gram | Examples |
---|---|
Unigrams | It , was , the , essence , of , life , itself |
Bigrams | It was , was the , the essence , essence of , of life , life itself |
Trigrams | It was the , was the essence , the essence of , essence of life , of life itself |
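The word n-grams in this table can be produced with a sliding window over whitespace-separated tokens. This is a sketch only: `word_ngrams` is a hypothetical helper, and stripping the final period before splitting is a simplification that works for this one sentence.

```python
def word_ngrams(text: str, n: int) -> list[str]:
    """Return overlapping word n-grams from a single sentence."""
    # Simplification: drop the sentence-final period, then split on whitespace.
    tokens = text.rstrip(".").split()
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "It was the essence of life itself."
for n in (1, 2, 3):
    print(n, word_ngrams(sentence, n))
```

A sentence of 7 words yields 7 unigrams, 6 bigrams, and 5 trigrams, since each window of width n has `len(tokens) - n + 1` starting positions.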
Change linguistic unit: larger, smaller, or groupings.
It was the essence of life itself.
Character n-gram | Examples |
---|---|
Unigrams | I , t , w , a , s , t , h , e , e , s , s , e , n , c , e , o , f , l , i , f , e , i , t , s , e , l , f |
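Bigrams | It , tw , wa , as , st , th , he , ee , es , ss , se , en , nc , ce , eo , of , fl , li , if , fe , ei , it , ts , se , el , lf |
Trigrams | Itw , twa , was , ast , sth , the , hee , ees , ess , sse , sen , enc , nce , ceo , eof , ofl , fli , lif , ife , fei , eit , its , tse , sel , elf |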
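Character n-grams follow the same sliding-window idea, applied to letters instead of words. In this sketch, `char_ngrams` is a hypothetical helper. By default it drops spaces and punctuation, as the unigram row does. The `keep_spaces` option is an added assumption that marks each space with an underscore instead.

```python
def char_ngrams(text: str, n: int, keep_spaces: bool = False) -> list[str]:
    """Return overlapping character n-grams from a string."""
    # Keep letters only; optionally keep spaces, shown as underscores.
    chars = [
        "_" if c == " " else c
        for c in text
        if c.isalpha() or (keep_spaces and c == " ")
    ]
    # Slide a window of width n across the character list.
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

sentence = "It was the essence of life itself."
print(char_ngrams(sentence, 2)[:5])                    # first five bigrams
print(char_ngrams(sentence, 3, keep_spaces=True)[:3])  # trigrams with "_" for spaces
```

Keeping or dropping spaces changes what an n-gram can capture: with spaces kept, an n-gram can record a word boundary; with spaces dropped, n-grams run across word boundaries.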
Note: It is also possible to reconstruct larger tokens from smaller ones (e.g., words from characters, sentences from words).
Consider the following paragraph:
“As the sun dipped below the horizon, the sky was set ablaze with shades of orange-red, illuminating the landscape. It’s a sight Mr. Johnson, a long-time observer, never tired of. On the lakeside, he’d watch with friends, enjoying the ever-changing hues—especially those around 6:30 p.m.—and reflecting on nature’s grand display. Even in the half-light, the water’s glimmer, coupled with the echo of distant laughter, created a timeless scene. The so-called ‘magic hour’ was indeed magical, yet fleeting, like a well-crafted poem; it was the essence of life itself.”
What text conventions would pose issues for word tokenization based on a whitespace criterion?
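One way to explore the question is to split on whitespace and list the tokens that still carry non-word characters. The sketch below uses two excerpts from the paragraph. The name `flagged` and the regular expression are illustrative choices, not a complete tokenizer.

```python
import re

# Two excerpts from the paragraph; \u2019 is the curly apostrophe and
# \u2014 is the long dash used in the original text.
excerpts = [
    "It\u2019s a sight Mr. Johnson, a long-time observer, never tired of.",
    "enjoying the ever-changing hues\u2014especially those around 6:30 p.m.\u2014and reflecting",
]

# Whitespace criterion: every run of non-space characters is one token.
tokens = [token for excerpt in excerpts for token in excerpt.split()]

# Flag tokens containing anything other than letters, digits, or underscores.
flagged = [token for token in tokens if re.search(r"[^\w\s]", token)]

for token in flagged:
    print(token)
```

The flagged tokens fall into several groups: punctuation attached to words, contractions, abbreviations, hyphenated compounds, times, and words fused by a dash with no surrounding spaces.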
Derive attributes: from implicit information in the dataset.
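As a sketch of deriving attributes, the records below carry information that is present but not yet a column of its own: a year and language inside the document ID, and a token count inside the text. The records, the ID layout, and the function name `derive_attributes` are all hypothetical.

```python
# Hypothetical records; the ID layout "<source>-<year>-<language>-<number>"
# is an assumption made for this example.
records = [
    {"doc_id": "ep-1998-en-001", "text": "The sitting was opened at 09:00."},
    {"doc_id": "ep-1998-es-002", "text": "Se abre la sesión a las 09:00."},
]

def derive_attributes(record: dict) -> dict:
    """Return a copy of the record with attributes made explicit."""
    # Unpack the parts encoded in the ID.
    _source, year, language, _number = record["doc_id"].split("-")
    return {
        **record,
        "year": int(year),
        "language": language,
        # Whitespace token count as a simple derived measure.
        "n_tokens": len(record["text"].split()),
    }

enriched = [derive_attributes(record) for record in records]
for row in enriched:
    print(row["doc_id"], row["year"], row["language"], row["n_tokens"])
```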
Recast values: to make information more explicit or accessible.
a different grouping, scale, or measure
Type: Numeric > ordinal > categorical
Scale:
Measures: Results from a calculation
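A sketch of recasting, assuming a numeric token count as the starting point. The cut-offs, the labels, and the function names are hypothetical. The type-token ratio is used only as one familiar example of a calculated measure.

```python
def to_ordinal(n_tokens: int) -> str:
    """Recast a numeric count as an ordered level (cut-offs are arbitrary)."""
    if n_tokens < 10:
        return "short"
    if n_tokens < 25:
        return "medium"
    return "long"

def to_categorical(level: str) -> str:
    """Collapse the ordered levels into two unordered groups."""
    return "long" if level == "long" else "not long"

def type_token_ratio(tokens: list[str]) -> float:
    """A calculated measure: distinct word forms divided by total tokens."""
    return len(set(tokens)) / len(tokens)

for count in (7, 18, 40):
    level = to_ordinal(count)
    print(count, level, to_categorical(level))

print(type_token_ratio("the cat saw the dog".split()))  # 0.8
```

Each step down the chain (numeric to ordinal to categorical) discards information, so the recast value is usually kept alongside the original rather than replacing it.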
Juxtapose datasets: to create a new dataset.
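One common form of juxtaposition is joining two datasets on a shared key. The sketch below attaches speaker metadata to speech records. Both datasets, the key name `speaker_id`, and the helper `left_join` are hypothetical.

```python
# Hypothetical speech records and a separate speaker metadata table.
speeches = [
    {"speaker_id": 1, "text": "It was the essence of life itself."},
    {"speaker_id": 2, "text": "The sitting was opened at 09:00."},
    {"speaker_id": 3, "text": "No metadata exists for this speaker."},
]
speakers = {
    1: {"name": "Speaker A", "country": "ES"},
    2: {"name": "Speaker B", "country": "FR"},
}

def left_join(rows: list[dict], lookup: dict, key: str) -> list[dict]:
    """Keep every row; fill metadata fields with None when the key is missing."""
    empty = {"name": None, "country": None}
    return [{**row, **lookup.get(row[key], empty)} for row in rows]

joined = left_join(speeches, speakers, "speaker_id")
for row in joined:
    print(row["speaker_id"], row["name"], row["country"])
```

A left join keeps all three speeches, so unmatched rows stay visible as missing values instead of silently disappearing.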
Transformation | Quantitative Text Analysis | Wake Forest University