Acquire

Source, acquisition, and documentation of data

Dr. Jerid Francom

Feb 23, 2024

Overview

  • Available data sources
  • Data acquisition
  • Data storage and documentation

Refresh

Topics touched upon so far…
Chapter  Topic          Lessons                             Recipes/Labs
0        Preface        Intro to swirl                      Quarto basics
1        Text analysis  Workspace, Vectors                  Academic Quarto
2        Data           Objects, Packages and functions     Reading, writing, and inspecting datasets
3        Analysis       Summarizing data, Visual summaries  Descriptive assessments of datasets
4        Research       Project environment                 Understanding the computing environment

Available data sources

Types

Published sources:

Unpublished sources:

Corpora

Published data sources:

Identifying data and data sources guide.

WFU Guide

Data acquisition

Downloading files

Manual approach

Some data sources require human intervention

  • Web forms
  • Captchas
  • User authentication

Programmatic approach

Other data sources can be accessed programmatically

  • Open data repositories

File types

  • Individual files (e.g. .csv, .txt, .xlsx)
  • Archive files (e.g. .zip, .tar, .tar.gz)

An archive file bundles one or more files, or an entire folder structure, into a single file that is usually compressed. This makes downloading and transferring files more efficient.

download.file() function

A function that ships with base R (in the utils package) and downloads both individual and archive files.

download.file(
  url = "", # link to the file on the web
  destfile = "" # path to the file on your computer (to write)
)

Access files

Archive files need to be ‘unarchived’ to access the individual files and folders. The function to use depends on the archive file type:

  • .zip files: unzip()
  • .tar and .tar.gz files: untar()

Stepwise view (basic)

  • Download the file
  • Unarchive the file to the data/original/ directory.

Stepwise view (improved)

  • If the file doesn’t exist…
  • Create temporary directory
  • Download the file to the temporary directory
  • Unarchive the file to the data/original/ directory.

If the data cannot be shared on the web, add it to the .gitignore file before pushing to GitHub.

data/original/

Code view

Add comments to describe the code and its steps
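
The improved steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: so that it runs offline, it first builds a small stand-in .tar.gz archive and points a file:// URL at it (this assumes a Unix-like file system). In a real project, url would be the web address of the archive and target_dir would be a folder under data/original/. The names url, target_dir, temp_dir, and the stand-in file sample.txt are assumptions made for the example.

```r
# --- Stand-in archive so the sketch runs offline (assumption) ---
source_dir <- tempfile("source-")
dir.create(source_dir)
writeLines("Sample text.", file.path(source_dir, "sample.txt"))
archive_file <- file.path(tempdir(), "corpus-name.tar.gz")
old_wd <- setwd(source_dir)
tar(archive_file, files = "sample.txt", compression = "gzip", tar = "internal")
setwd(old_wd)

# In a real project: a web URL and "data/original/corpus-name"
url <- paste0("file://", archive_file)
target_dir <- file.path(tempdir(), "data", "original", "corpus-name")

# If the data doesn't exist yet...
if (!dir.exists(target_dir)) {
  # Create a temporary directory to hold the downloaded archive
  temp_dir <- tempfile("download-")
  dir.create(temp_dir)

  # Download the archive to the temporary directory (binary mode)
  temp_file <- file.path(temp_dir, basename(url))
  download.file(url = url, destfile = temp_file, mode = "wb")

  # Unarchive the file into the target directory
  dir.create(target_dir, recursive = TRUE)
  untar(tarfile = temp_file, exdir = target_dir)

  # Remove the temporary directory and the archive in it
  unlink(temp_dir, recursive = TRUE)
}

# Inspect what was extracted
list.files(target_dir)
```

For a .zip archive, unzip(zipfile = temp_file, exdir = target_dir) would take the place of untar(). Because of the if statement, re-running the script does not download the data again.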

API interaction

APIs are a way to programmatically interact with a web service. They are a set of rules and protocols that allow different software applications to communicate with each other.

The R community has developed packages to interact with various APIs.

Each API has its own set of functions and parameters to interact with the data. Furthermore, some APIs require authentication and sometimes a subscription.

gutenbergr package

The gutenbergr package (Johnston and Robinson 2023) provides access to the Project Gutenberg collection of public domain books.

Some key metadata objects:

  • gutenberg_metadata: Metadata for all works
  • gutenberg_subjects: Subject headings (LCC and LCSH) for each work
  • gutenberg_authors: Author information

Some key functions:

  • gutenberg_works(): Search for works
  • gutenberg_download(): Download a work

gutenbergr example

# Download a work by its Project Gutenberg ID
library(gutenbergr)

gutenberg_download(
  gutenberg_id = 33, # ID of the work to download
  meta_fields = c("title", "author", "gutenberg_id")
)

To run this code we need to identify the gutenberg_id or set of gutenberg_ids for the works we want to download.
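
One way to find those IDs is to filter the package's metadata with gutenberg_works(). The sketch below assumes the gutenbergr package is installed. The search string "Hawthorne" and the object name hawthorne_works are chosen only for illustration.

```r
library(gutenbergr)

# Filter the metadata for works whose author field contains "Hawthorne"
# (the search string is an illustrative assumption)
hawthorne_works <- gutenberg_works(grepl("Hawthorne", author))

# Inspect the IDs and titles to choose what to download
hawthorne_works[, c("gutenberg_id", "title", "author")]
```

The values in the gutenberg_id column can then be passed, one at a time or as a vector, to gutenberg_download().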

gutenbergr example

These metadata objects are useful for finding works and authors.

Custom functions

Custom functions can be written to group a set of coding instructions into one process.

In R, a function includes:

  • A name: do_this_thing <-
  • The function keyword: function()
  • The names of the arguments: function(arg_1, arg_2)
  • Optional arguments with default values: function(arg_1, arg_2 = "default")
  • The body of the function: {}

Example custom function

Add comments to describe the function and its steps
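
As one possible illustration, the download-and-unarchive steps from earlier can be wrapped in a custom function. This is a sketch under stated assumptions, not the course's own function: the name get_archive_data and the arguments url, target_dir, and force are hypothetical. It uses only download.file(), unzip(), and untar().

```r
# Hypothetical helper: download an archive file and unpack it into target_dir.
# Name:               get_archive_data
# Required arguments: url, target_dir
# Optional argument:  force (default FALSE), re-download even if data exists
get_archive_data <- function(url, target_dir, force = FALSE) {
  # If the data already exists and we are not forcing, stop here
  if (dir.exists(target_dir) && !force) {
    message("Data already exists in ", target_dir, "; skipping download.")
    return(invisible(target_dir))
  }

  # Create a temporary directory to hold the downloaded archive
  temp_dir <- tempfile("download-")
  dir.create(temp_dir)
  temp_file <- file.path(temp_dir, basename(url))

  # Download the archive in binary mode
  download.file(url = url, destfile = temp_file, mode = "wb")

  # Unarchive with the function that matches the archive type
  dir.create(target_dir, recursive = TRUE, showWarnings = FALSE)
  if (grepl("\\.zip$", temp_file, ignore.case = TRUE)) {
    unzip(zipfile = temp_file, exdir = target_dir)
  } else {
    untar(tarfile = temp_file, exdir = target_dir)
  }

  # Clean up the temporary directory and return the target path invisibly
  unlink(temp_dir, recursive = TRUE)
  invisible(target_dir)
}
```

A call then takes the form get_archive_data(url = "", target_dir = "data/original/corpus-name"), with the url filled in for the resource at hand. Setting force = TRUE downloads the archive again even if the directory is already there.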

Data storage and documentation

File formats

It is key to make sure that the data is in a format that R can read and that provides the most flexibility for future use (and future users).

Common file formats:

  • Plain text files (TXT)
  • Structured data (XML, JSON)
  • Plain text datasets (CSV, TSV)
  • Compressed files (ZIP, TAR, GZ, etc.)

We avoid proprietary or software-specific formats.

Data storage

The file structure should be organized and well-documented.

Original data files should be kept separate from derived and analysis files.

data/
  ├── analysis/
  ├── derived/
  └── original/
      ├── corpus-name/
      │   ├── file1.csv
      │   ├── file2.csv

Data in original/ should be left untouched. Any changes or derived data should be stored in derived/.
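
A minimal sketch of this workflow in base R follows. To keep the example self-contained, the project root is a temporary directory and the contents of file1.csv are toy values invented for the illustration. The names project_dir, data_dir, and corpus-name-lower.csv are assumptions.

```r
# Stand-in project root (assumption); in practice this is your project folder
project_dir <- file.path(tempdir(), "project")
data_dir <- file.path(project_dir, "data")

# Create the three data subdirectories
for (sub_dir in c("analysis", "derived", "original")) {
  dir.create(file.path(data_dir, sub_dir), recursive = TRUE, showWarnings = FALSE)
}

# Toy original file, standing in for acquired data
original_dir <- file.path(data_dir, "original", "corpus-name")
dir.create(original_dir, showWarnings = FALSE)
original_file <- file.path(original_dir, "file1.csv")
write.csv(
  data.frame(id = 1:2, text = c("First text.", "Second text.")),
  original_file,
  row.names = FALSE
)

# Read the original, transform a copy, and write the result to derived/
corpus <- read.csv(original_file)
corpus$text <- tolower(corpus$text)
derived_file <- file.path(data_dir, "derived", "corpus-name-lower.csv")
write.csv(corpus, derived_file, row.names = FALSE)
```

The file in original/ is only ever read. The transformed version lives in derived/, so the acquisition step never has to be repeated to recover the source data.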

Data documentation

The data doesn’t speak for itself. It is important to document the data to provide context and understanding.

Data origin file information
Information              Description
Resource name            Name of the corpus resource.
Data source              URL, DOI, etc.
Data sampling frame      Language, language variety, modality, genre, etc.
Data collection date(s)  The date or date range of the data collection.
Data format              Plain text, XML, HTML, etc.
Data schema              Relationships between data elements: files, folders, etc.
License                  CC BY, CC BY-NC, etc.
Attribution              Citation information for the data source.
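
One simple way to create such a file is to write the fields above to a CSV template and then fill in the descriptions for the resource at hand. The sketch below uses only base R. The file name corpus-name_do.csv, the column names, and the temporary location are assumptions for illustration. The description column holds the prompts from the table above, not real values.

```r
# Template for a data origin file: one row per piece of information
data_origin <- data.frame(
  attribute = c(
    "Resource name", "Data source", "Data sampling frame",
    "Data collection date(s)", "Data format", "Data schema",
    "License", "Attribution"
  ),
  description = c(
    "Name of the corpus resource.",
    "URL, DOI, etc.",
    "Language, language variety, modality, genre, etc.",
    "The date or date range of the data collection.",
    "Plain text, XML, HTML, etc.",
    "Relationships between data elements: files, folders, etc.",
    "CC BY, CC BY-NC, etc.",
    "Citation information for the data source."
  )
)

# Write the template (a temporary location is used here; in a project it
# would typically sit alongside the data it documents)
origin_file <- file.path(tempdir(), "corpus-name_do.csv")
write.csv(data_origin, origin_file, row.names = FALSE)
```

Storing the documentation as a plain text CSV keeps it readable both by people and by R.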

Looking ahead

  • Recipe 05: Collecting and documenting data
    We will cover the following topics:

    • Finding data sources
    • Data collection strategies
    • Data documentation
  • Lab 05: Collecting and documenting data
    You will have a choice of data source to acquire data from. Before you start the lab, you should consider which data source you would like to use, what strategy you will use to acquire the data, and what data you will acquire. You should also consider the information you need to document the data collection process.

Skills to be covered

  • Identifying data sources
  • Acquiring data through manual and programmatic downloads and APIs
  • Creating a data acquisition plan
  • Documenting the data collection process
  • Using control statements and/or writing a custom function
  • Documenting the data source with a data origin file

References

Hester, Jim, Hadley Wickham, and Gábor Csárdi. 2023. Fs: Cross-Platform File System Operations Based on Libuv. https://fs.r-lib.org.
Johnston, Myfanwy, and David Robinson. 2023. Gutenbergr: Download and Process Public Domain Works from Project Gutenberg. https://docs.ropensci.org/gutenbergr/.