Curate

Data to information

Dr. Jerid Francom

Mar 1, 2024

Overview

  • Data to information
  • Curate data
  • Documenting datasets

Data to information

Data

  • Un/semi-structured
  • Non-tabular

Plain text

Sound is a vibration. Sound travels as a mechanical wave through a medium, and in space, there is no medium. So when my shuttle malfunctioned and the airlocks didn't keep the air in, I heard nothing. After the first whoosh of the air being sucked away, there was lightning, but no thunder. Eyes bulging in panic, but no screams. Quiet and peaceful, right? Such a relief to never again hear my crewmate Jesse natter about his girl back on Earth and that all-expenses-paid vacation-for-two she won last time he was on leave. I swore, if I ever had to see a photo of him in a skimpy bathing suit again, giving the camera a cheesy thumbs-up from a lounge chair on one of those white sandy beaches, I'd kiss a monkey. Metaphorically, of course.

XML

<document>
  <sent id="1">
    <word id="1" modality="written">Sound</word>
    <word id="2" modality="written">is</word>
    <word id="3" modality="written">a</word>
    <word id="4" modality="written">vibration</word>
    <word id="5" modality="written">.</word>
  </sent>
  <sent id="2">
    <word id="1" modality="written">Sound</word>
    <word id="2" modality="written">travels</word>
    <word id="3" modality="written">as</word>
    <word id="4" modality="written">a</word>
    <word id="5" modality="written">mechanical</word>
    <word id="6" modality="written">wave</word>
    <word id="7" modality="written">through</word>
    <word id="8" modality="written">a</word>
    <word id="9" modality="written">medium</word>
    <word id="10" modality="written">,</word>
    <word id="11" modality="written">and</word>
    <word id="12" modality="written">in</word>
    <word id="13" modality="written">space</word>
    <word id="14" modality="written">,</word>
    <word id="15" modality="written">there</word>
    <word id="16" modality="written">is</word>
    <word id="17" modality="written">no</word>
    <word id="18" modality="written">medium</word>
    <word id="19" modality="written">.</word>
  </sent>
</document>
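
As a rough sketch of how XML like this could become a tidy table, the code below uses the xml2 package (a format-specific package not named on these slides, so treat it as an assumption). It inlines a shortened copy of the sample and builds one row per word. The object names `xml_string` and `words_tbl` are illustrative only.

```r
library(xml2)

# Shortened copy of the XML sample, inlined so the sketch is self-contained
xml_string <- '<document>
  <sent id="1">
    <word id="1" modality="written">Sound</word>
    <word id="2" modality="written">is</word>
  </sent>
  <sent id="2">
    <word id="1" modality="written">Sound</word>
    <word id="2" modality="written">travels</word>
  </sent>
</document>'

doc   <- read_xml(xml_string)
sents <- xml_find_all(doc, ".//sent")

# For each sentence, collect its words and carry the sentence id along
rows <- lapply(seq_along(sents), function(i) {
  sent  <- sents[[i]]
  words <- xml_find_all(sent, "./word")
  data.frame(
    sent_id  = as.integer(xml_attr(sent, "id")),
    word_id  = as.integer(xml_attr(words, "id")),
    modality = xml_attr(words, "modality"),
    word     = xml_text(words),
    stringsAsFactors = FALSE
  )
})

# One row per word, one column per variable
words_tbl <- do.call(rbind, rows)
words_tbl
```

Each attribute and the element text become their own column, so the hierarchy is flattened without losing which sentence a word belongs to.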

JSON

{
  "document": {
    "sent": [
      {
        "id": "1",
        "word": [
          {
            "id": "1",
            "modality": "written",
            "word": "Sound"
          },
          {
            "id": "2",
            "modality": "written",
            "word": "is"
          },
          {
            "id": "3",
            "modality": "written",
            "word": "a"
          },
          {
            "id": "4",
            "modality": "written",
            "word": "vibration"
          },
          {
            "id": "5",
            "modality": "written",
            "word": "."
          }
        ]
      },
      {
        "id": "2",
        "word": [
          {
            "id": "1",
            "modality": "written",
            "word": "Sound"
          },
          {
            "id": "2",
            "modality": "written",
            "word": "travels"
          },
          {
            "id": "3",
            "modality": "written",
            "word": "as"
          }
        ]
      }
    ]
  }
}
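
One possible way to tidy JSON shaped like this is sketched below. It assumes the jsonlite package for parsing (not named on these slides) and uses `tidyr::unnest()` to expand the nested word records. The JSON is a shortened, inlined copy of the sample, and `json_string`, `sents`, and `words_tbl` are illustrative names.

```r
library(jsonlite)
library(tidyr)

# Shortened copy of the JSON sample, inlined so the sketch is self-contained
json_string <- '{
  "document": {
    "sent": [
      {"id": "1", "word": [
        {"id": "1", "modality": "written", "word": "Sound"},
        {"id": "2", "modality": "written", "word": "is"}
      ]},
      {"id": "2", "word": [
        {"id": "1", "modality": "written", "word": "Sound"},
        {"id": "2", "modality": "written", "word": "travels"}
      ]}
    ]
  }
}'

# fromJSON() simplifies the array of sentences into a data frame whose
# `word` column is a list of data frames (one per sentence)
sents <- fromJSON(json_string)$document$sent
names(sents)[names(sents) == "id"] <- "sent_id"

# Expand the list-column to one row per word; names_sep prefixes the inner
# columns with "word_" so they do not clash with the outer ones
words_tbl <- unnest(sents, word, names_sep = "_")
words_tbl
```

The result has the columns `sent_id`, `word_id`, `word_modality`, and `word_word`, which can be renamed to taste.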

Information

  • Physical
  • Semantic

Curate data

Common R packages

  • fs: File system operations
  • readr: Read plain text, csv, tsv, and other delimited files
  • stringr: String manipulation, find/replace (regular expressions)
  • tidyr: Tidy data, reshape data (more transformational)

Structured data

Characteristics

  • Tabular
  • Relational
  • Hierarchical

Already tidy!

Unstructured data

Characteristics

  • No structure
  • No schema

Requires tidying!

Unstructured data: tidy

Approaches

  • Regular expressions
  • String manipulation
  • File/directory manipulation
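
To illustrate the regular expression and string manipulation approaches, here is a minimal sketch with stringr. It splits a short excerpt of the plain-text sample into sentences and counts words per sentence. The splitting rule (break after `.`, `?`, or `!` followed by whitespace) is a simplifying assumption and would mis-split abbreviations such as "Dr.".

```r
library(stringr)

# Short excerpt of the plain-text sample
text <- paste(
  "Sound is a vibration.",
  "Sound travels as a mechanical wave through a medium, and in space, there is no medium.",
  "Quiet and peaceful, right?"
)

# Split on whitespace that follows sentence-final punctuation
sentences <- str_split(text, "(?<=[.?!])\\s+")[[1]]

# One row per sentence, with a simple whitespace-based word count
sentences_tbl <- data.frame(
  sent_id  = seq_along(sentences),
  sentence = sentences,
  n_words  = str_count(sentences, "\\S+")
)
sentences_tbl
```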

Semi-structured data

Characteristics

  • Non-tabular
  • A schema or structure

Requires tidying!

Semi-structured data: tidy

Approaches

  • Format-specific packages
  • Regular expressions
  • String manipulation
  • File/directory manipulation
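
File and directory manipulation often matters because metadata lives in file names. The sketch below assumes a hypothetical naming scheme, `doc-<id>_<modality>.txt`, creates two such files in a temporary directory, and then combines fs, stringr, and readr to recover the metadata alongside the text. The directory, file names, and column names are all made up for illustration.

```r
library(fs)
library(readr)
library(stringr)

# Hypothetical corpus directory with two small files
corpus_dir <- path(tempdir(), "corpus")
dir_create(corpus_dir)
write_lines("Sound is a vibration.", path(corpus_dir, "doc-01_written.txt"))
write_lines("There is no medium.", path(corpus_dir, "doc-02_spoken.txt"))

# List the files (dir_ls() returns them sorted by name)
files <- as.character(dir_ls(corpus_dir, glob = "*.txt"))

# Capture the id and modality from each file name
meta <- str_match(path_file(files), "^doc-(\\d+)_(\\w+)\\.txt$")

# One row per document: metadata from the name, text from the contents
corpus_tbl <- data.frame(
  doc_id   = as.integer(meta[, 2]),
  modality = meta[, 3],
  text     = str_trim(vapply(files, read_file, character(1), USE.NAMES = FALSE))
)
corpus_tbl
```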

Documenting datasets

Write the data

  • Quarto prose and code comments
  • Predictable file names
  • Logical directory structure
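
A minimal sketch of writing a curated dataset to a predictable location with fs and readr follows. The layout (`data/derived/`) and the file name `sample_curated.csv` are assumptions chosen for illustration, and a temporary directory stands in for the project root.

```r
library(fs)
library(readr)

# Toy curated dataset
curated <- data.frame(
  sent_id = c(1L, 1L),
  word_id = c(1L, 2L),
  word    = c("Sound", "is")
)

# Hypothetical project layout: curated output goes in data/derived/
project_dir <- path(tempdir(), "project")
derived_dir <- path(project_dir, "data", "derived")
dir_create(derived_dir)  # creates intermediate directories as needed

# Predictable name: <dataset>_<stage>.csv
out_file <- path(derived_dir, "sample_curated.csv")
write_csv(curated, out_file)

# Read it back to confirm the round trip
check <- read_csv(out_file, show_col_types = FALSE)
check
```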

Data dictionaries

  • Variable names
  • Variable friendly names
  • Variable descriptions
  • Variable types
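
A data dictionary with these four fields can be kept as a small table of its own. The sketch below builds one in base R for a toy dataset. The friendly names and descriptions are written by hand (they are illustrative, not taken from a real dataset), while the variable names and types are read from the data.

```r
# Toy curated dataset to document
curated <- data.frame(
  sent_id  = c(1L, 1L),
  word_id  = c(1L, 2L),
  modality = c("written", "written"),
  word     = c("Sound", "is")
)

# One row per variable: names and types come from the data,
# friendly names and descriptions are supplied by the curator
data_dictionary <- data.frame(
  variable    = names(curated),
  name        = c("Sentence ID", "Word ID", "Modality", "Word"),
  description = c(
    "Position of the sentence in the document",
    "Position of the word in the sentence",
    "Modality of the source (e.g. written)",
    "Word or punctuation token"
  ),
  type        = vapply(curated, function(col) class(col)[1], character(1),
                       USE.NAMES = FALSE),
  row.names   = NULL
)
data_dictionary
```

Saving this table next to the dataset, under a matching file name, keeps the documentation and the data together.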

Summary

  • Data to information
  • Types of data
  • Documenting datasets