Acquire

Source, acquisition, and documentation of data

Dr. Jerid Francom

Feb 23, 2024

Overview

Available data sources
Data acquisition
Data storage and documentation

Refresh

Topics touched upon so far…
Chapter	Topic	Lessons	Recipes/Labs
0	Preface	Intro to swirl	Quarto basics
1	Text analysis	Workspace, Vectors	Academic Quarto
2	Data	Objects, Packages and functions	Reading, writing, and inspecting datasets
3	Analysis	Summarizing data, Visual summaries	Descriptive assessments of datasets
4	Research	Project environment	Understanding the computing environment

Available data sources

Types

Published sources:

Repositories
Corpus and dataset pages

Unpublished sources:

APIs (Reddit, Project Gutenberg, etc.)
Web scraping (with permission)
Document scanning (OCR)

Corpora

Published data sources:

Identifying data and data sources guide.

WFU Guide

Linguistic datasets
- Linguistic Data Consortium

Data acquisition

Downloading files

Manual approach

Some data sources require human intervention

Web forms
Captchas
User authentication

Programmatic approach

Other data sources can be accessed programmatically

Open data repositories

File types

Individual files (e.g. .csv, .txt, .xlsx)
Archive files (e.g. .zip, .tar, .tar.gz)

To clarify, archive files can be individual files or a folder structure that has been grouped together and compressed. This makes downloading and transferring files more efficient.

`download.file()` function

A function from the utils package (included with base R) for downloading individual and archive files.

download.file(
  url = "", # link to the file on the web
  destfile = "" # path to the file on your computer (to write)
)

Access files

Archive files need to be ‘unarchived’ to access the individual files and folders. The function to use depends on the archive file type:

.zip files: unzip()
.tar and .tar.gz files: untar()

Stepwise view (basic)

Download the file
Unarchive the file to the data/original/ directory.

Stepwise view (improved)

If the file doesn’t exist…
Create temporary directory
Download the file to the temporary directory
Unarchive the file to the data/original/ directory.

If the data cannot be shared publicly, add its directory to the .gitignore file before pushing to GitHub.

data/original/

Code view

Add comments to describe the code and its steps
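
The improved stepwise view can be sketched in base R roughly as follows. This is a minimal sketch, not the course's own code: the names `demo_dir`, `file_url`, `target_dir`, and `temp_dir` are hypothetical, and so that the example runs without a network connection it first builds a small stand-in archive and points `file_url` at it with a `file://` address (Unix-like paths assumed). In a real project `file_url` would be the web address of the archive.

```r
# --- Stand-in for a remote archive (assumption, so the sketch runs offline) ---
demo_dir <- file.path(tempdir(), "acquire-demo")
dir.create(file.path(demo_dir, "corpus-name"), recursive = TRUE, showWarnings = FALSE)
writeLines("A sample line of text.", file.path(demo_dir, "corpus-name", "file1.txt"))
old_wd <- setwd(demo_dir)
tar("corpus-name.tar.gz", files = "corpus-name", compression = "gzip", tar = "internal")
setwd(old_wd)
file_url <- paste0("file://", normalizePath(file.path(demo_dir, "corpus-name.tar.gz")))

# --- Acquisition steps ---
target_dir <- file.path(demo_dir, "data", "original")

# Only download if the data is not already in place
if (!dir.exists(target_dir)) {
  # Create a temporary directory to hold the archive
  temp_dir <- tempfile(pattern = "download-")
  dir.create(temp_dir)
  archive_path <- file.path(temp_dir, basename(file_url))

  # Download the archive (binary mode keeps compressed files intact)
  download.file(url = file_url, destfile = archive_path, mode = "wb")

  # Unarchive into data/original/ (use unzip() instead for .zip files)
  untar(tarfile = archive_path, exdir = target_dir)

  # Remove the temporary directory and the archive inside it
  unlink(temp_dir, recursive = TRUE)
}

# Check what was extracted
list.files(target_dir, recursive = TRUE)
```

Because of the `dir.exists()` check, running the script a second time skips the download, which avoids repeated requests to the data source.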

API interaction

APIs are a way to programmatically interact with a web service. They are a set of rules and protocols that allow different software applications to communicate with each other.

The R community has developed packages to interact with various APIs.

Each API has its own set of functions and parameters to interact with the data. Furthermore, some APIs require authentication and sometimes a subscription.

`gutenbergr` package

The gutenbergr package (Johnston and Robinson 2023) provides access to the Project Gutenberg collection of public domain books.

Some key metadata objects:

gutenberg_metadata: Dataset with metadata for all works
gutenberg_subjects: Dataset with subject classifications (LCC and LCSH) for each work
gutenberg_authors: Dataset with author information

Some key functions:

gutenberg_works(): Search for works
gutenberg_download(): Download a work

`gutenbergr` example

library(gutenbergr)

# Download a work by its Project Gutenberg ID
gutenberg_download(
  gutenberg_id = 33, # argument assignment uses =, not ==
  meta_fields = c("title", "author", "gutenberg_id")
)

To run this code we need to identify the gutenberg_id or set of gutenberg_ids for the works we want to download.

`gutenbergr` example

These metadata objects are useful for finding works and authors.
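
One way to look up IDs is to filter the metadata with `gutenberg_works()`. The sketch below assumes the gutenbergr and dplyr packages are installed; the author chosen and the object name `austen_works` are only illustrative. The search runs against the metadata that ships with the package, so no download happens until `gutenberg_download()` is called.

```r
library(gutenbergr)
library(dplyr)

# Search the bundled metadata; filter conditions use == (a comparison)
austen_works <- gutenberg_works(author == "Austen, Jane")

# Inspect the IDs and titles of the matching works
austen_works |>
  select(gutenberg_id, title, author)

# The IDs can then be passed on to gutenberg_download(), for example:
# gutenberg_download(austen_works$gutenberg_id[1], meta_fields = "title")
```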

Custom functions

Custom functions can be written to group a set of coding instructions into one process.

In R, a function includes:

A name, assigned with <-: do_this_thing <-
The function keyword: function()
The names of arguments: function(arg_1, arg_2)
Optional arguments, which have default values: function(arg_1, arg_2 = "default")
The body of the function, enclosed in braces: {}

Example custom function

Add comments to describe the function and its steps
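
As an illustration of these parts, here is a minimal sketch of a custom function that downloads a single file only when it is not already on disk. The function name `get_file` and its arguments are hypothetical, chosen for this example; `force = FALSE` plays the role of the optional argument with a default value.

```r
# Hypothetical function: download a single file unless it is already on disk
get_file <- function(url, destfile, force = FALSE) {
  # Skip the download when the file exists and force is FALSE
  if (file.exists(destfile) && !force) {
    message("File already exists: ", destfile)
    return(invisible(destfile))
  }

  # Create the destination directory if needed
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)

  # Download the file in binary mode
  download.file(url = url, destfile = destfile, mode = "wb")

  # Return the path invisibly so the result can be assigned if wanted
  invisible(destfile)
}
```

A call then takes the form `get_file(url = "", destfile = "")`, with the link and the path filled in as for `download.file()`. Adding `force = TRUE` overrides the default and downloads the file again.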

Data storage and documentation

File formats

It is key to make sure that the data is in a format that can be read by R and that provides the most flexibility for future use (and future users).

Common file formats:

Plain text files (TXT)
Structured data (XML, JSON)
Plain text datasets (CSV, TSV)
Compressed files (ZIP, TAR, GZ, etc.)

We avoid proprietary or software-specific formats.

Data storage

The file structure should be organized and well-documented.

Keep original data files separate from derived and analysis files.

data/
  ├── analysis/
  ├── derived/
  └── original/
      ├── corpus-name/
      │   ├── file1.csv
      │   ├── file2.csv

Data in original/ should be left untouched. Any changes or derived data should be stored in derived/.
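
This layout can be created from R rather than by hand. The sketch below is one possible way to do it in base R; `project_dir` is a hypothetical stand-in (a temporary folder here) for the root of your own project.

```r
# Stand-in project root (assumption): replace with your project directory
project_dir <- file.path(tempdir(), "project")

# Paths for the three data subdirectories
data_dirs <- file.path(project_dir, "data", c("analysis", "derived", "original"))

# Create each directory, including parent directories, if it does not exist
for (dir_path in data_dirs) {
  dir.create(dir_path, recursive = TRUE, showWarnings = FALSE)
}

# Confirm the structure
list.dirs(file.path(project_dir, "data"), full.names = FALSE, recursive = FALSE)
```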

Data documentation

The data doesn’t speak for itself. It is important to document the data to provide context and understanding.

Data origin file information
Information	Description
Resource name	Name of the corpus resource.
Data source	URL, DOI, etc.
Data sampling frame	Language, language variety, modality, genre, etc.
Data collection date(s)	The date or date range of the data collection.
Data format	Plain text, XML, HTML, etc.
Data schema	Relationships between data elements: files, folders, etc.
License	CC BY, CC BY-NC, etc.
Attribution	Citation information for the data source.
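
A data origin file can be stored as a small plain text table alongside the original data. The following is a minimal sketch in base R, not a prescribed template: the file name `corpus-name_do.csv`, the column names, and `project_dir` are assumptions for illustration, and the descriptions are placeholders to be replaced with the details of your own resource.

```r
# Stand-in project root (assumption): replace with your project directory
project_dir <- file.path(tempdir(), "project")
origin_dir <- file.path(project_dir, "data", "original")
dir.create(origin_dir, recursive = TRUE, showWarnings = FALSE)

# One row per piece of information; descriptions are placeholders to fill in
data_origin <- data.frame(
  attribute = c(
    "Resource name", "Data source", "Data sampling frame",
    "Data collection date(s)", "Data format", "Data schema",
    "License", "Attribution"
  ),
  description = c(
    "Name of the corpus resource.",
    "URL, DOI, etc.",
    "Language, language variety, modality, genre, etc.",
    "The date or date range of the data collection.",
    "Plain text, XML, HTML, etc.",
    "Relationships between data elements: files, folders, etc.",
    "CC BY, CC BY-NC, etc.",
    "Citation information for the data source."
  ),
  stringsAsFactors = FALSE
)

# Write the file next to the original data (hypothetical file name)
origin_file <- file.path(origin_dir, "corpus-name_do.csv")
write.csv(data_origin, file = origin_file, row.names = FALSE)
```

Saving the documentation as CSV keeps it in an open format that both people and R can read.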

Looking ahead

Recipe 05: Collecting and documenting data
We will cover the following topics:
- Finding data sources
- Data collection strategies
- Data documentation
Lab 05: Collecting and documenting data
You will have a choice of data source to acquire data from. Before you start the lab, you should consider which data source you would like to use, what strategy you will use to acquire the data, and what data you will acquire. You should also consider the information you need to document the data collection process.

Skills to be covered

Identifying data sources
Acquiring data through manual and programmatic downloads and APIs
Creating a data acquisition plan
Documenting the data collection process
Using control statements and/or writing a custom function
Documenting the data source with a data origin file

References

Hester, Jim, Hadley Wickham, and Gábor Csárdi. 2023. Fs: Cross-Platform File System Operations Based on Libuv. https://fs.r-lib.org.

Johnston, Myfanwy, and David Robinson. 2023. Gutenbergr: Download and Process Public Domain Works from Project Gutenberg. https://docs.ropensci.org/gutenbergr/.
