Taming data
The process of curating data
Mar 6, 2024
Overview
- Setup
- Orientation
- Preparation
- Implementation
- Documentation
Setup
- Lab-06: Fork, clone, and create RStudio project
- Create 2_curate.qmd file in process/ directory
- Prepare front-matter
- Prepare sections
Orientation
- Data origin?
- Sampling frame, data collection, schema design
- Data type?
- Data format?
- Standardized (CSV, JSON, XML, etc.) or non-standardized
- Metadata?
- Inline, external, file/directory structure
Preparation
- Idealized structure
- What would be the ideal structured data?
- How many columns, what types, what names, etc.?
- Steps to achieve idealized structure
- Create an outline
- Add comments, notes, and questions
- Identify packages, strategies, and tools
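One way to make the idealized structure concrete is to write it down as an empty data frame before any curation code exists. This is only a sketch: the column names and types below are hypothetical and would be replaced by whatever the data at hand calls for.

```r
# Hypothetical idealized structure: one row per document.
# Column names and types are illustrative assumptions, not from a real dataset.
ideal <- data.frame(
  doc_id = character(),  # unique identifier for each document
  year   = integer(),    # year of publication
  genre  = character(),  # category label
  text   = character()   # the document text
)

# Inspect the target columns and types
str(ideal)
```

Keeping this target in view makes it easier to outline the steps that lead from the original data to it.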
Implementation
- Typical steps
- Read data (readr, readtext, etc.)
- Clean dataset (dplyr, tidyr, stringr, etc.)
- Organize dataset (dplyr, tidyr, stringr, etc.)
- Write dataset (readr)
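The typical steps can be sketched end to end as follows. This is a minimal illustration, not a prescribed workflow: the file names, the column names, and the small stand-in dataset are all hypothetical, and everything is written to a temporary directory rather than to a real project.

```r
library(readr)
library(dplyr)
library(stringr)

# Stand-in for a file in data/original/ (hypothetical contents)
original_file <- file.path(tempdir(), "original.csv")
write_lines(
  c(
    "ID,Text,Year",
    "1,  Hello   world ,2020",
    "2,Another  line,2021",
    "3,,2022"
  ),
  original_file
)

# Read data
raw <- read_csv(original_file, show_col_types = FALSE)

# Clean dataset: drop rows with missing text and normalize whitespace
cleaned <- raw |>
  filter(!is.na(Text)) |>
  mutate(Text = str_squish(Text))

# Organize dataset: rename and order columns to match the idealized structure
curated <- cleaned |>
  select(doc_id = ID, year = Year, text = Text)

# Write dataset (in a real project this would go to data/derived/)
derived_file <- file.path(tempdir(), "curated.csv")
write_csv(curated, derived_file)
```

Real curation tasks usually need more cleaning than this, and tidyr becomes relevant when the data must be reshaped, for example from wide to long.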
Documentation
Separate data/ datasets
- Read-only data (data/original/)
- Derived data (data/derived/)
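A short sketch of how the two directories might be set up from R. The project root here is a temporary directory chosen for illustration; in a real project the paths would be relative to the project root.

```r
# Temporary directory standing in for the project root (assumption for this sketch)
project_dir <- file.path(tempdir(), "project")

original_dir <- file.path(project_dir, "data", "original")
derived_dir  <- file.path(project_dir, "data", "derived")

# Create both directories, including any missing parent directories
dir.create(original_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(derived_dir, recursive = TRUE, showWarnings = FALSE)

# Convention: code reads from original_dir and writes only to derived_dir
list.dirs(file.path(project_dir, "data"), full.names = FALSE, recursive = FALSE)
```

The read-only status of data/original/ is a convention enforced by how the code is written, not by this sketch.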
Dataset documentation
- Quarto process and code block comments
- Data versioning/naming
- Data dictionary (qtalrkit)
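To show what a data dictionary contains, the sketch below builds one by hand in base R. It does not use qtalrkit, so nothing here should be read as that package's interface. The dataset, the dictionary columns, and the versioned file name are hypothetical.

```r
# Hypothetical curated dataset
curated <- data.frame(
  doc_id = c("d01", "d02"),
  year   = c(2020L, 2021L),
  text   = c("Hello world", "Another line")
)

# One row per variable: types are read from the data, descriptions are written by hand
data_dictionary <- data.frame(
  variable    = names(curated),
  type        = vapply(curated, function(col) class(col)[1], character(1),
                       USE.NAMES = FALSE),
  description = c(
    "Unique document identifier",
    "Year of publication",
    "Document text"
  )
)

# Hypothetical naming scheme: dataset name, version, and a suffix marking the dictionary
dictionary_file <- file.path(tempdir(), "curated_v1_dd.csv")
write.csv(data_dictionary, dictionary_file, row.names = FALSE)
```

Storing the dictionary next to the dataset it describes, with a matching name, keeps the two from drifting apart.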
Secure data sharing
- Data sharing (.gitignore file)
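As a sketch of how a .gitignore file can keep data out of a shared repository, the code below adds ignore patterns from R. The patterns and the temporary project directory are assumptions for illustration. Which files to exclude depends on the license and sensitivity of the data.

```r
# Temporary directory standing in for the project root (assumption for this sketch)
project_dir <- file.path(tempdir(), "sharing-demo")
dir.create(project_dir, showWarnings = FALSE)
gitignore_file <- file.path(project_dir, ".gitignore")

# Hypothetical patterns: keep data files out of the repository
ignore_patterns <- c("data/original/*", "data/derived/*")

# Keep any existing entries and add only the patterns not already listed
existing <- if (file.exists(gitignore_file)) readLines(gitignore_file) else character()
updated  <- c(existing, setdiff(ignore_patterns, existing))
writeLines(updated, gitignore_file)

readLines(gitignore_file)
```

Running the code a second time leaves the file unchanged, since patterns already present are skipped.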