Reading, inspecting, and writing datasets

First approach at combining Quarto and R

Dr. Jerid Francom

Feb 2, 2024

Overview

  • Quarto + code blocks
  • Packages
  • Reading with readr
  • Inspecting with dplyr
  • Writing with readr
  • Lab 02: Dive into datasets

Quarto + code blocks

Code block options

Just as the front matter controls the behavior of the document, code block options control the behavior of the code.

If not specified, the default behavior is to show the code, evaluate it, and display the output. If there are any warnings or errors, they will be displayed as well.

Default behavior:

```{r}
#| echo: true
#| include: true
#| message: true

1 + 1
```
1 + 1
[1] 2

Has the same result as:

```{r}
1 + 1
```
1 + 1
[1] 2

Code block options

We can change these defaults, as needed.

No code

```{r}
#| echo: false

1 + 1
```
[1] 2

No code or output

```{r}
#| include: false

1 + 1
```

No messages

```{r}
#| message: false

1 + 1
```
1 + 1
[1] 2
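
The 1 + 1 example emits no messages, so message: false changes nothing visible there. As a minimal sketch of the difference between output and messages, the hypothetical function add_one() below (not part of the original slides) returns a value and also emits a message. Base R's suppressMessages() silences only the message, which is roughly what message: false does for a code block.

```r
# Hypothetical helper: emits a message and returns a value
add_one <- function(x) {
  message("Adding one to ", x)  # goes to the message stream, not the output
  x + 1
}

# The value is ordinary output; the message is reported separately
result <- add_one(1)
result

# suppressMessages() hides the message but keeps the value
quiet_result <- suppressMessages(add_one(1))
quiet_result
```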

Code block: example

```{r}
# Load libraries
library(tidyverse)
```
# Load libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

With messages suppressed:

```{r}
#| message: false
# Load libraries
library(tidyverse)
```
# Load libraries
library(tidyverse)

Packages

Loading packages

R packages that are installed in the system exist in a “library”. These packages can be loaded (checked out) using the library() function.

```{r}
#| label: load-packages
#| message: false

# Load packages
library(readr)     # for reading/writing datasets
library(dplyr)     # for data manipulation
```

Now the functions from the readr (Wickham, Hester, and Bryan 2024) and dplyr (Wickham et al. 2023) packages are available for use.
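
A package can only be loaded if it is already installed. As a small sketch (the object names pkgs, installed, and missing_pkgs are illustrative), base R's requireNamespace() can check for a package without attaching it:

```r
# Packages to check (the two used in these slides)
pkgs <- c("readr", "dplyr")

# requireNamespace() returns TRUE when a package is installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Report anything missing; install.packages() would add it to the library
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```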

Reading with readr

About readr

The readr package provides a set of functions for reading and writing data and datasets.

Data:

  • read_file_raw(): for reading in raw data
  • read_lines(): for reading in lines of a file
  • etc.

Datasets:

  • read_csv(): for reading comma-separated values
  • read_tsv(): for reading tab-separated values
  • etc.
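
To make the data/dataset distinction concrete, here is a sketch that reads the same file both ways. The file is a temporary one with made-up contents, created only for this illustration.

```r
library(readr)

# Write a tiny CSV to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("word,nchar", "hum,3", "alarm-clock,11"), tmp)

# read_lines(): one character string per line, no parsing
raw_lines <- read_lines(tmp)
raw_lines

# read_csv(): parses the same file into rows and columns
toy_df <- read_csv(tmp, show_col_types = FALSE)
toy_df
```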

Reading a dataset

Let’s read a CSV file. The file is called corpora-vss.csv and is located in the data folder.

project/
  ├── data/
  │   └── corpora-vss.csv
  └── my_file.qmd

Therefore, we use this ‘path’ to read the file:

```{r}
#| label: read-dataset
#| message: false

# Read dataset
vss_df <- read_csv("data/corpora-vss.csv")
```
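
The path is relative: R resolves it from the working directory, which is assumed here to be the project/ folder that holds my_file.qmd. A quick sketch for checking a path before reading (csv_path is an illustrative name):

```r
# Relative paths are resolved from the working directory
getwd()

# Build the path from its parts; file.path() inserts the separator
csv_path <- file.path("data", "corpora-vss.csv")
csv_path

# TRUE only if the file exists at that location
file.exists(csv_path)
```

If file.exists() returns FALSE, the usual suspects are a typo in the path or a working directory other than the one expected.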

R objects: Data frame

The read_csv() function returns a data frame: a rectangular R object (rows and columns) whose columns are vectors, each of which can store a different type of data.

To confirm that we have a data frame object, we can use the class() function.

```{r}
#| label: check-if-df

# Check the object class
class(vss_df)
```
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 

The output should contain data.frame.
It does! … and some other information…
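
For comparison, a sketch with a plain base R data frame (made-up values, not the slides' dataset) shows a single class and one vector type per column:

```r
# A small base R data frame (hypothetical values)
toy_df <- data.frame(word = c("hum", "sleep"), nchar = c(3, 5))

# A plain data frame has just one class
class(toy_df)

# is.data.frame() answers the question directly
is.data.frame(toy_df)

# Each column is a vector with its own type
sapply(toy_df, class)
```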

R objects: Tibble

The read_csv() function returns a tibble, which is a type of R object that is similar to a data frame, but has some nice user-friendly features.

  • Prints the dimensions of the data frame
  • Prints the vector type of each column (variable)
  • Only prints the first 10 rows
  • Only prints the columns that fit on the screen

vss_df
# A tibble: 8,043 × 6
   story sentence word        pos   lemma       nchar
   <chr>    <dbl> <chr>       <chr> <chr>       <dbl>
 1 264          1 The         DT    the             3
 2 264          1 constant    JJ    constant        8
 3 264          1 hum         NN    hum             3
 4 264          1 of          IN    of              2
 5 264          1 the         DT    the             3
 6 264          1 Toshiba     NP    toshiba         7
 7 264          1 DM-707/40   CD    dm-707/40       9
 8 264          1 fully       RB    fully           5
 9 264          1 integrated  VBN   integrate      10
10 264          1 alarm-clock NN    alarm-clock    11
# ℹ 8,033 more rows
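
The printing defaults can be overridden. In this sketch a made-up tibble (toy_tbl, built with the tibble() function that dplyr re-exports) stands in for the real dataset:

```r
library(dplyr)

# A hypothetical 30-row tibble
toy_tbl <- tibble(id = 1:30, word = rep(c("the", "hum", "of"), 10))

# Tibbles are data frames with extra classes
class(toy_tbl)

# Show 15 rows instead of the default 10
print(toy_tbl, n = 15)

# as.data.frame() drops the tibble classes when base behavior is needed
class(as.data.frame(toy_tbl))
```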

Inspecting with dplyr

About dplyr

The dplyr package provides a set of functions for data manipulation. We will look at a few of these functions.

  • glimpse(): for a compact summary of the data
  • slice_head(): for a preview of the first n rows
  • slice_tail(): for a preview of the last n rows
  • slice_sample(): for a random sample of n rows
  • arrange(): for sorting rows by column values
  • select(): for selecting columns (variables)
  • filter(): for filtering rows by column values

Quick summary

The glimpse() function provides a compact summary of the data.

# Preview dataset
glimpse(vss_df)
Rows: 8,043
Columns: 6
$ story    <chr> "264", "264", "264", "264", "264", "264", "264", "264", "264"…
$ sentence <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2…
$ word     <chr> "The", "constant", "hum", "of", "the", "Toshiba", "DM-707/40"…
$ pos      <chr> "DT", "JJ", "NN", "IN", "DT", "NP", "CD", "RB", "VBN", "NN", …
$ lemma    <chr> "the", "constant", "hum", "of", "the", "toshiba", "dm-707/40"…
$ nchar    <dbl> 3, 8, 3, 2, 3, 7, 9, 5, 10, 11, 11, 5, 2, 2, 5, 1, 7, 4, 2, 3…

This is particularly useful when the number of variables is large and the tibble gets truncated on the screen.

Sort and preview

The arrange() function sorts the rows of the data frame by the values of a column. The slice_head() function previews the first n rows of the data frame.

slice_head(arrange(vss_df, desc(nchar)), n = 5)
# A tibble: 5 × 6
  story                           sentence word                pos   lemma nchar
  <chr>                              <dbl> <chr>               <chr> <chr> <dbl>
1 264                                  156 potassium-nitrate-… JJ    pota…    23
2 264                                  156 copper-sulphate-bl… JJ    copp…    20
3 264                                    8 state-of-the-art    JJ    stat…    16
4 264                                   35 non-governmental    JJ    non-…    16
5 An Example of Idiomatic English      409 enthusiastically    RB    enth…    16

This code works, but is not very readable. We can use the pipe operator |> to make it more readable!

Sort and preview v.2

The pipe operator |> allows us to chain functions together in a more readable way.

vss_df |>             # data frame
  arrange(-nchar) |>  # sort by nchar (desc)
  slice_head(n = 5)   # preview first 5 rows
# A tibble: 5 × 6
  story                           sentence word                pos   lemma nchar
  <chr>                              <dbl> <chr>               <chr> <chr> <dbl>
1 264                                  156 potassium-nitrate-… JJ    pota…    23
2 264                                  156 copper-sulphate-bl… JJ    copp…    20
3 264                                    8 state-of-the-art    JJ    stat…    16
4 264                                   35 non-governmental    JJ    non-…    16
5 An Example of Idiomatic English      409 enthusiastically    RB    enth…    16
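
The nested and piped versions do the same work. A sketch on a made-up tibble confirms that the two results match (the native pipe |> requires R 4.1 or later):

```r
library(dplyr)

# Hypothetical data for illustration
toy_tbl <- tibble(
  word  = c("hum", "alarm-clock", "of"),
  nchar = c(3, 11, 2)
)

# Nested: read from the inside out
nested <- slice_head(arrange(toy_tbl, desc(nchar)), n = 2)

# Piped: read from top to bottom
piped <- toy_tbl |>
  arrange(desc(nchar)) |>
  slice_head(n = 2)

# Same result either way
identical(nested, piped)
```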

Select columns and/or rows

The select() function selects columns (variables) from the data frame. The filter() function selects rows based on the values of a column.

# Filter rows
vss_df |>
  filter(nchar > 10)
# A tibble: 160 × 6
   story sentence word             pos   lemma            nchar
   <chr>    <dbl> <chr>            <chr> <chr>            <dbl>
 1 264          1 alarm-clock      NN    alarm-clock         11
 2 264          1 mercilessly      RB    mercilessly         11
 3 264          2 resemblance      NN    resemblance         11
 4 264          3 alarm-clock      NN    alarm-clock         11
 5 264          7 communicated     VBD   communicate         12
 6 264          8 state-of-the-art JJ    state-of-the-art    16
 7 264          8 alarm-clock      NN    alarm-clock         11
 8 264         16 recognition      NN    recognition         11
 9 264         17 pronunciations   NNS   pronunciation       14
10 264         17 interspersed     VBN   intersperse         12
# ℹ 150 more rows

# Filter rows, then select columns
vss_df |>
  filter(pos == "NN") |>
  select(sentence, word)
# A tibble: 1,101 × 2
   sentence word       
      <dbl> <chr>      
 1        1 hum        
 2        1 alarm-clock
 3        1 sleep      
 4        2 side       
 5        2 sound      
 6        2 resemblance
 7        2 equivalent 
 8        3 alarm-clock
 9        3 volume     
10        3 hum        
# ℹ 1,091 more rows
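
Conditions and selections can be written in other ways too. This sketch uses a made-up tibble, and the object names are illustrative:

```r
library(dplyr)

# Hypothetical data for illustration
toy_tbl <- tibble(
  word  = c("hum", "pronunciations", "fully", "alarm-clock"),
  pos   = c("NN", "NNS", "RB", "NN"),
  nchar = c(3, 14, 5, 11)
)

# %in% matches any of several values
nouns <- toy_tbl |>
  filter(pos %in% c("NN", "NNS"))

# | keeps rows that meet either condition
long_or_adverb <- toy_tbl |>
  filter(nchar > 10 | pos == "RB")

# A minus sign drops a column instead of keeping it
no_pos <- toy_tbl |>
  select(-pos)
```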

Assigning output

If you want to save the output of a function, you can use the assignment operator <-.

# Filter rows
long_nn_vss <-                    # assign output (of the following...)
  vss_df |>                       # data frame
  filter(nchar > 10, pos == "NN") # filter rows (two conditions)

# Preview
slice_head(long_nn_vss, n = 5)
# A tibble: 5 × 6
  story sentence word        pos   lemma       nchar
  <chr>    <dbl> <chr>       <chr> <chr>       <dbl>
1 264          1 alarm-clock NN    alarm-clock    11
2 264          2 resemblance NN    resemblance    11
3 264          3 alarm-clock NN    alarm-clock    11
4 264          8 alarm-clock NN    alarm-clock    11
5 264         16 recognition NN    recognition    11
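
Separating conditions with a comma inside filter() means that all of them must hold, the same as joining them with &. A sketch on made-up data:

```r
library(dplyr)

# Hypothetical data for illustration
toy_tbl <- tibble(
  word  = c("hum", "alarm-clock", "mercilessly"),
  pos   = c("NN", "NN", "RB"),
  nchar = c(3, 11, 11)
)

# Comma-separated conditions
with_comma <- toy_tbl |>
  filter(nchar > 10, pos == "NN")

# The same conditions joined with &
with_and <- toy_tbl |>
  filter(nchar > 10 & pos == "NN")

identical(with_comma, with_and)
```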

Writing with readr

Writing a dataset

The readr package includes functions for writing datasets, similar to the functions for reading datasets. We will use the write_csv() function to write a data frame to a CSV file.

# Write dataset
write_csv(long_nn_vss, "data/long-nn-vss.csv")

Now the file long-nn-vss.csv is located in the data/ folder.

project/
  ├── data/
  │   ├── long-nn-vss.csv
  │   └── corpora-vss.csv
  └── my_file.qmd
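
One way to check a written file is to read it back in. This sketch uses a temporary folder and made-up data so that nothing in the project is touched; note that write_csv() expects the target folder to exist already.

```r
library(readr)
library(dplyr)

# Hypothetical data for illustration
toy_tbl <- tibble(
  word  = c("alarm-clock", "resemblance"),
  nchar = c(11, 11)
)

# Create a data/ folder inside the session's temporary directory
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE)
out_path <- file.path(out_dir, "toy-words.csv")

# Write, then confirm the file is there
write_csv(toy_tbl, out_path)
file.exists(out_path)

# Read it back and compare the columns
toy_again <- read_csv(out_path, show_col_types = FALSE)
identical(toy_tbl$word, toy_again$word)
identical(toy_tbl$nchar, toy_again$nchar)
```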

Lab 02: Dive into datasets

Setup

  • Clone the Lab 02 repository from GitHub
  • Open the project in RStudio
  • Follow the instructions in README.md

Looking ahead

Current tasks

  1. Lab 02: Dive into datasets

Next week

  1. Reading: Analysis
    • Annotate with Hypothes.is
  2. Lessons (Swirl): Summarizing data, Visual summaries

References

Evert, Stephanie. 2023. Corpora: Statistics and Data Sets for Corpus Frequency Data. https://CRAN.R-project.org/package=corpora.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2024. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.