Source, acquisition, and documentation of data
Feb 23, 2024
Chapter | Topic | Lessons | Recipes/ Labs |
---|---|---|---|
0 | Preface | Intro to swirl | Quarto basics |
1 | Text analysis | Workspace, Vectors | Academic Quarto |
2 | Data | Objects, Packages and functions | Reading, writing, and inspecting datasets |
3 | Analysis | Summarizing data, Visual summaries | Descriptive assessments of datasets |
4 | Research | Project environment | Understanding the computing environment |
Published sources:
Unpublished sources:
Published data sources:
Manual approach
Some data sources require human intervention
Programmatic approach
Other data sources can be accessed programmatically
To clarify, archive files can be individual files or a folder structure that has been grouped together and compressed. This makes downloading and transferring files more efficient.
download.file() function
An extensible base R function for individual and archive files.
Archive files need to be ‘unarchived’ to access the individual files and folders. The function to use depends on the archive file type:
.zip files: unzip()
.tar and .tar.gz files: untar()
Download the archive file to the data/original/ directory.
Unarchive its contents in the data/original/ directory.
Add comments to describe the code and its steps.
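The download-and-unarchive steps can be sketched roughly as follows. This is a minimal illustration using only base R, not a definitive implementation: the function name `acquire_archive()` and its arguments are hypothetical, and any URL passed to it is assumed to point to a real .zip, .tar, or .tar.gz file.

```r
# Hypothetical helper: download an archive file and unarchive it.
# Assumes `url` points to a .zip, .tar, or .tar.gz file.
acquire_archive <- function(url, target_dir = "data/original",
                            type = c("zip", "tar")) {
  type <- match.arg(type)

  # Create the target directory if it does not exist yet
  dir.create(target_dir, recursive = TRUE, showWarnings = FALSE)

  # Download the archive to a temporary file ("wb" keeps binary data intact)
  archive_path <- tempfile(fileext = paste0(".", type))
  download.file(url, destfile = archive_path, mode = "wb", quiet = TRUE)

  # Unarchive with the function that matches the archive type
  if (type == "zip") {
    unzip(archive_path, exdir = target_dir)
  } else {
    untar(archive_path, exdir = target_dir) # also handles .tar.gz
  }

  # Remove the temporary archive and return the extracted file names
  unlink(archive_path)
  invisible(list.files(target_dir, recursive = TRUE))
}
```

A call might look like `acquire_archive("https://example.org/corpus.zip")`, where the URL is a placeholder and not a real data source.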
APIs are a way to programmatically interact with a web service. They are a set of rules and protocols that allow different software applications to communicate with each other.
The R community has developed packages to interact with various APIs.
Each API has its own set of functions and parameters to interact with the data. Furthermore, some APIs require authentication and sometimes a subscription.
gutenbergr package
The gutenbergr package (Johnston and Robinson 2023) provides access to the Project Gutenberg collection of public domain books.
Some key metadata objects:
gutenberg_metadata: Show metadata for all works
gutenberg_subjects: Show LCC subjects
gutenberg_authors: Get author information
Some key functions:
gutenberg_works(): Search for works
gutenberg_download(): Download a work
gutenbergr example
These metadata objects are useful for finding works and authors.
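As one possible illustration of combining these objects, the sketch below looks up works by their Library of Congress Classification (LCC) code. It assumes the gutenbergr package is installed; the helper name `find_works_by_lcc()` is hypothetical and not part of the package.

```r
# Assumes gutenbergr is installed: install.packages("gutenbergr")

# Hypothetical helper: find works catalogued under an LCC code
find_works_by_lcc <- function(lcc_code) {
  # gutenberg_subjects lists one subject per row, with its type
  subjects <- gutenbergr::gutenberg_subjects
  ids <- subjects$gutenberg_id[
    subjects$subject_type == "lcc" & subjects$subject == lcc_code
  ]

  # gutenberg_works() returns the metadata, filtered by its default settings
  works <- gutenbergr::gutenberg_works()

  # Keep only the works whose ID matches the requested LCC code
  works[works$gutenberg_id %in% ids, c("gutenberg_id", "title", "author")]
}

# Usage (downloading requires an internet connection):
# pr_works <- find_works_by_lcc("PR")
# texts <- gutenbergr::gutenberg_download(
#   pr_works$gutenberg_id[1:2],
#   meta_fields = "title"
# )
```

Searching the bundled metadata works offline; only `gutenberg_download()` contacts Project Gutenberg.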
Custom functions can be written to group a set of coding instructions into one process.
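In R, a function includes:
A name: do_this_thing <-
The function keyword: function()
Arguments, if any: function(arg_1, arg_2)
Default argument values, if any: function(arg_1, arg_2 = "default")
A body enclosed in braces: {}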
Add comments to describe the function and its steps
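To make these parts concrete, here is a small sketch of a custom function with a name, two required arguments, one default argument, and a commented body. The function `download_if_missing()` is a hypothetical example, not a function from base R or any package.

```r
# Hypothetical custom function: download a file only if it is not
# already on disk, so that rerunning a script does not repeat the download.
download_if_missing <- function(url, dest_file, force = FALSE) {
  # Skip the download when the file exists and `force` is not set
  if (file.exists(dest_file) && !force) {
    message("File already exists; skipping download: ", dest_file)
    return(invisible(FALSE))
  }

  # Make sure the destination directory exists
  dir.create(dirname(dest_file), recursive = TRUE, showWarnings = FALSE)

  # Download the file ("wb" keeps binary data intact)
  download.file(url, destfile = dest_file, mode = "wb", quiet = TRUE)

  # Signal that a download took place
  invisible(TRUE)
}
```

The default `force = FALSE` means the common case needs only two arguments, while `force = TRUE` allows a deliberate refresh.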
Data should be stored in a format that R can read and that offers the most flexibility for future use (and future users).
Common file formats:
We avoid proprietary or software-specific formats.
The file structure should be organized and well-documented.
Original data files are to be separate from derived and analysis files.
data/
├── analysis/
├── derived/
└── original/
    ├── corpus-name/
    │   ├── file1.csv
    │   ├── file2.csv
Data in original/ should be left untouched. Any changes or derived data should be stored in derived/.
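The directory layout can also be created from R, which keeps the project setup reproducible. The following is a minimal sketch in base R; the function name `create_data_dirs()` and its `root` argument are hypothetical.

```r
# Hypothetical helper: create the data/ directory structure under `root`
create_data_dirs <- function(root = ".") {
  # Build the three paths: data/analysis, data/derived, data/original
  dirs <- file.path(root, "data", c("analysis", "derived", "original"))

  # dir.create() takes one path at a time, so loop over the paths
  for (d in dirs) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }

  # Return the paths invisibly so they can be reused
  invisible(dirs)
}
```

Calling `create_data_dirs()` from the project root is safe to repeat, since existing directories are left as they are.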
The data doesn’t speak for itself. It is important to document the data to provide context and understanding.
Information | Description |
---|---|
Resource name | Name of the corpus resource. |
Data source | URL, DOI, etc. |
Data sampling frame | Language, language variety, modality, genre, etc. |
Data collection date(s) | The date or date range of the data collection. |
Data format | Plain text, XML, HTML, etc. |
Data schema | Relationships between data elements: files, folders, etc. |
License | CC BY, CC BY-NC, etc. |
Attribution | Citation information for the data source. |
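One way to record this information is a small CSV file stored alongside the data. The sketch below uses base R to write a template containing the fields from the table above. The function name `write_data_origin()`, the column names, and the example file name are assumptions for illustration, not a prescribed format.

```r
# Fields taken from the documentation table
origin_fields <- c(
  "Resource name", "Data source", "Data sampling frame",
  "Data collection date(s)", "Data format", "Data schema",
  "License", "Attribution"
)

# Hypothetical helper: write a data origin file as CSV.
# `descriptions` defaults to NA so the file can be filled in later.
write_data_origin <- function(file,
                              descriptions = rep(NA_character_,
                                                 length(origin_fields))) {
  # One description is expected per field
  stopifnot(length(descriptions) == length(origin_fields))

  origin <- data.frame(
    attribute = origin_fields,
    description = descriptions,
    stringsAsFactors = FALSE
  )

  # Create the destination directory if needed, then write the file
  dir.create(dirname(file), recursive = TRUE, showWarnings = FALSE)
  write.csv(origin, file, row.names = FALSE)
  invisible(origin)
}

# Usage (hypothetical file name):
# write_data_origin("data/original/corpus-name_do.csv")
```

Keeping the documentation in a plain-text format means it can be read and updated with the same tools as the data.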
Recipe 05: Collecting and documenting data
We will cover the following topics:
Lab 05: Collecting and documenting data
You will have a choice of data source to acquire data from. Before you start the lab, you should consider which data source you would like to use, what strategy you will use to acquire the data, and what data you will acquire. You should also consider the information you need to document the data collection process.
Skills to be covered
Acquire | Quantitative Text Analysis | Wake Forest University