First approach at combining Quarto and R
Feb 2, 2024
As the front-matter controls the behavior of the document, the code block options control the behavior of the code.
If not specified, the default behavior is to show the code, evaluate it, and display the output. If there are any warnings or errors, they will be displayed as well.
We can change these defaults, as needed.
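As a small illustration of changing those defaults, the sketch below puts Quarto's `#|` option comments at the top of an R code block. The label `quiet-sum` and the variable `total` are hypothetical names chosen for this example; in a `.qmd` file the block would open with `{r}` in braces, as in the other blocks in this lesson.

```r
#| label: quiet-sum
#| echo: false
#| message: false
#| warning: false

# echo: false hides this code in the rendered page; the result is still shown.
# message: false and warning: false suppress messages and warnings.
total <- 1 + 1
total
```

Options not listed keep their defaults, so this block is still evaluated and its output is still displayed.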
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
R packages that are installed in the system exist in a “library”. These packages can be loaded (checked out) using the `library()` function.
```{r}
#| label: load-packages
#| message: false
# Load packages
library(readr) # for reading/writing datasets
library(dplyr) # for data manipulation
```
Now the functions from the `readr` (Wickham, Hester, and Bryan 2024) and `dplyr` (Wickham et al. 2023) packages are available for use.
readr
The `readr` package provides a set of functions for reading and writing data and datasets.
Data:

- `read_file_raw()`: for reading in raw data
- `read_lines()`: for reading in lines of a file

Datasets:

- `read_csv()`: for reading comma-separated values
- `read_tsv()`: for reading tab-separated values

Let’s read a CSV file. The file is called `corpora-vss.csv` and is located in the `data` folder.
Therefore, we use this ‘path’ to read the file:
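The reading step itself is not shown here, so the following is only a sketch of what it might look like. To stay self-contained it first writes a three-row stand-in file to a temporary folder (rows copied from the preview further down); `tmp_dir` and `csv_path` are hypothetical names, and with the real file the path would presumably be `"data/corpora-vss.csv"`.

```r
library(readr)

# Stand-in for the lesson's file: write a tiny CSV to a temporary folder
tmp_dir <- tempfile("data")
dir.create(tmp_dir)
csv_path <- file.path(tmp_dir, "corpora-vss.csv")
writeLines(
  c(
    "story,sentence,word,pos,lemma,nchar",
    "264,1,The,DT,the,3",
    "264,1,constant,JJ,constant,8",
    "264,1,hum,NN,hum,3"
  ),
  csv_path
)

# Read the CSV into a data frame.
# With the real file this would likely be: read_csv("data/corpora-vss.csv")
vss_df <- read_csv(csv_path, show_col_types = FALSE)
vss_df
```

The `show_col_types = FALSE` argument only silences the column-type message that `read_csv()` prints by default.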
The `read_csv()` function returns a data frame, an R object that is rectangular (has rows and columns) and whose columns (vectors) can each store a different type of data.
To confirm that we have a data frame object, we can use the `class()` function.
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
The output should contain `data.frame`.
It does! … and some other information.
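The same check can also be done programmatically rather than by eye. This is a minimal sketch on a hypothetical two-row dataset `toy_df`; it assumes readr 2.x, where wrapping a string in `I()` makes `read_csv()` treat it as literal CSV text.

```r
library(readr)

# Hypothetical stand-in data, read from literal text instead of a file
toy_df <- read_csv(I("word,nchar\nhum,3\nconstant,8"), show_col_types = FALSE)

class(toy_df)                   # all class names of the object
inherits(toy_df, "data.frame")  # is "data.frame" one of them?
is.data.frame(toy_df)           # base R shortcut for the same question
```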
More precisely, the `read_csv()` function returns a tibble (the `tbl_df` class in the output above), which is a type of R object that is similar to a data frame but has some nice user-friendly features.
# A tibble: 8,043 × 6
story sentence word pos lemma nchar
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 264 1 The DT the 3
2 264 1 constant JJ constant 8
3 264 1 hum NN hum 3
4 264 1 of IN of 2
5 264 1 the DT the 3
6 264 1 Toshiba NP toshiba 7
7 264 1 DM-707/40 CD dm-707/40 9
8 264 1 fully RB fully 5
9 264 1 integrated VBN integrate 10
10 264 1 alarm-clock NN alarm-clock 11
# ℹ 8,033 more rows
dplyr
The `dplyr` package provides a set of functions for data manipulation. We will look at a few of these functions.
- `glimpse()`: for a compact summary of the data
- `slice_head()`: for a preview of the first `n =` rows
- `slice_tail()`: for a preview of the last `n =` rows
- `slice_sample()`: for a random sample of `n =` rows
- `arrange()`: for sorting rows by column values
- `select()`: for selecting columns (variables)
- `filter()`: for filtering rows by column values

The `glimpse()` function provides a compact summary of the data.
Rows: 8,043
Columns: 6
$ story <chr> "264", "264", "264", "264", "264", "264", "264", "264", "264"…
$ sentence <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2…
$ word <chr> "The", "constant", "hum", "of", "the", "Toshiba", "DM-707/40"…
$ pos <chr> "DT", "JJ", "NN", "IN", "DT", "NP", "CD", "RB", "VBN", "NN", …
$ lemma <chr> "the", "constant", "hum", "of", "the", "toshiba", "dm-707/40"…
$ nchar <dbl> 3, 8, 3, 2, 3, 7, 9, 5, 10, 11, 11, 5, 2, 2, 5, 1, 7, 4, 2, 3…
This is particularly useful when the number of variables is large and the tibble gets truncated on the screen.
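For reference, here is a hedged sketch of `glimpse()` alongside the three slicing functions from the list above. The tibble `toy_df` is a hypothetical stand-in built from a few rows of the preview, not the real dataset, and the result names are made up for the example.

```r
library(dplyr)

# Hypothetical stand-in for the dataset, using rows from the preview
toy_df <- tibble(
  word  = c("The", "constant", "hum", "of", "the", "Toshiba"),
  pos   = c("DT", "JJ", "NN", "IN", "DT", "NP"),
  nchar = c(3, 8, 3, 2, 3, 7)
)

glimpse(toy_df)  # one line per column: name, type, first values

first_two <- slice_head(toy_df, n = 2)    # first 2 rows
last_two  <- slice_tail(toy_df, n = 2)    # last 2 rows
random_3  <- slice_sample(toy_df, n = 3)  # 3 rows chosen at random
```

Because `slice_sample()` draws rows at random, its result changes from run to run unless a seed is set with `set.seed()`.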
The `arrange()` function sorts the rows of the data frame by the values of a column. The `slice_head()` function previews the first `n =` rows of the data frame.
# A tibble: 5 × 6
story sentence word pos lemma nchar
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 264 156 potassium-nitrate-… JJ pota… 23
2 264 156 copper-sulphate-bl… JJ copp… 20
3 264 8 state-of-the-art JJ stat… 16
4 264 35 non-governmental JJ non-… 16
5 An Example of Idiomatic English 409 enthusiastically RB enth… 16
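The call that produced this output is not shown; it was presumably a nested one, with `arrange()` wrapped inside `slice_head()`. A minimal sketch of that nested form, on a hypothetical three-row tibble `toy_df`:

```r
library(dplyr)

# Hypothetical stand-in data
toy_df <- tibble(
  word  = c("hum", "state-of-the-art", "constant"),
  nchar = c(3, 16, 8)
)

# Nested form: read from the inside out (sort descending, then take 2 rows)
top_nested <- slice_head(arrange(toy_df, -nchar), n = 2)

# desc() is an equivalent, more explicit way to request descending order
top_desc <- slice_head(arrange(toy_df, desc(nchar)), n = 2)
```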
This code works, but is not very readable. We can use the pipe operator `|>` to make it more readable!
The pipe operator `|>` allows us to chain functions together in a more readable way.
vss_df |> # data frame
  arrange(-nchar) |> # sort by nchar (desc)
  slice_head(n = 5) # preview first 5 rows
# A tibble: 5 × 6
story sentence word pos lemma nchar
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 264 156 potassium-nitrate-… JJ pota… 23
2 264 156 copper-sulphate-bl… JJ copp… 20
3 264 8 state-of-the-art JJ stat… 16
4 264 35 non-governmental JJ non-… 16
5 An Example of Idiomatic English 409 enthusiastically RB enth… 16
The `select()` function selects columns (variables) from the data frame. The `filter()` function selects rows based on the values of a column.
# A tibble: 160 × 6
story sentence word pos lemma nchar
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 264 1 alarm-clock NN alarm-clock 11
2 264 1 mercilessly RB mercilessly 11
3 264 2 resemblance NN resemblance 11
4 264 3 alarm-clock NN alarm-clock 11
5 264 7 communicated VBD communicate 12
6 264 8 state-of-the-art JJ state-of-the-art 16
7 264 8 alarm-clock NN alarm-clock 11
8 264 16 recognition NN recognition 11
9 264 17 pronunciations NNS pronunciation 14
10 264 17 interspersed VBN intersperse 12
# ℹ 150 more rows
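The code behind this preview is not shown. The sketch below illustrates both functions on a hypothetical four-row tibble `toy_df`; the threshold of 10 characters is an assumption, chosen because every word in the preview above is longer than that.

```r
library(dplyr)

# Hypothetical stand-in data, using words from the preview
toy_df <- tibble(
  word  = c("alarm-clock", "mercilessly", "hum", "communicated"),
  pos   = c("NN", "RB", "NN", "VBD"),
  nchar = c(11, 11, 3, 12)
)

# select(): keep only the named columns
words_only <- toy_df |> select(word, nchar)

# filter(): keep only the rows where the condition is TRUE
long_words <- toy_df |> filter(nchar > 10)
```

Note that `select()` changes the number of columns while `filter()` changes the number of rows; neither modifies `toy_df` itself.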
If you want to save the output of a function, you can use the assignment operator `<-`.
# Filter rows
long_nn_vss <- # assign output (of the following...)
  vss_df |> # data frame
  filter(nchar > 10, pos == "NN") # filter rows (two conditions)

# Preview
slice_head(long_nn_vss, n = 5)
# A tibble: 5 × 6
story sentence word pos lemma nchar
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 264 1 alarm-clock NN alarm-clock 11
2 264 2 resemblance NN resemblance 11
3 264 3 alarm-clock NN alarm-clock 11
4 264 8 alarm-clock NN alarm-clock 11
5 264 16 recognition NN recognition 11
readr
The `readr` package includes functions for writing datasets, similar to the functions for reading datasets. We will use the `write_csv()` function to write a data frame to a CSV file.
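The writing step is not shown, so this is a sketch under stated assumptions. It writes a hypothetical two-row tibble `long_nn_toy` to a temporary `data` folder so that it runs anywhere; with the lesson's objects the call would presumably be `write_csv(long_nn_vss, "data/long-nn-vss.csv")`.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for the filtered data frame
long_nn_toy <- tibble(
  word  = c("alarm-clock", "resemblance"),
  pos   = "NN",
  nchar = c(11, 11)
)

# Use a temporary "data" folder so the example does not touch the project
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE)
out_path <- file.path(out_dir, "long-nn-vss.csv")

# First argument: the data frame; second argument: where to write it
write_csv(long_nn_toy, out_path)

file.exists(out_path)  # confirm the file was created
```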
Now the file `long-nn-vss.csv` is located in the `data/` folder.
Reading, inspecting, and writing datasets | Quantitative Text Analysis | Wake Forest University