Research Data Management with DataLad

SFB1451 workshop, day 2

Michał Szczepanik

1. Day 2 outline

  • Remote collaboration
  • Dataset management

2. Part 3: Remote collaboration

2.1. Recap

  • Before: basics of local version control
    • recording changes, interacting with dataset history
      • built a small dataset
      • have a record of what led to its current state
      • single location, single person

2.2. Introduction

  • Research data rarely lives just on a single computer.
  • Research projects aren't single-person affairs.
  • Want to:
    • synchronise with a remote location (backup/archival)
    • keep only a subset of files on your PC, rotating them as needed (save space)
    • send data to colleagues and ensure it stays up to date with version control
    • have them contribute to your dataset (add files, make changes)
    • publish to a repository

DataLad has tools to facilitate all that.

2.3. Plan

  • Publish our dataset from yesterday
  • Use GIN (G-Node Infrastructure): https://gin.g-node.org
    • Convenient integration with DataLad (all files, annexed or not)
    • DataLad supports many different scenarios (incl. separating Git history from annexed file storage)
    • Some quirks, but the steps shown for GIN will be similar elsewhere
  • Make changes to each other's datasets through GIN
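
The publish-and-contribute cycle above can be sketched with a few commands. This is a sketch, not a complete walkthrough: the URLs, user names, and repository names are placeholders, and it assumes the empty repository was already created on GIN and SSH keys are set up.

```shell
# Publish: register the (pre-created, empty) GIN repository as a sibling,
# then push all content -- Git history and annexed files alike.
datalad siblings add --dataset . --name gin \
    --url git@gin.g-node.org:/username/my-dataset.git
datalad push --to gin

# Contribute to a colleague's dataset: clone it, change it, push back
# (requires write access to their GIN repository).
datalad clone git@gin.g-node.org:/colleague/their-dataset.git
cd their-dataset
echo "one more observation" >> notes.txt
datalad save -m "Add an observation to the notes"
datalad push --to origin
```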

3. Part 4: Dataset management

3.1. Introduction

  • Analysis, simplified: collect inputs, produce outputs
    • same input can be used for multiple analyses
    • an output (transformed / preprocessed data) may become the input for the next one

3.1.1. Subdataset hierarchy

[Figure: dataset_modules.svg]

Figure 1: Dataset modules - from the DataLad Handbook

3.1.2. Reasons to use subdatasets

  • a logical need to make your data modular
    • e.g. raw data - preprocessing - analysis - paper
  • a technical need to divide your data
    • hundreds of thousands of files start hurting performance

3.2. Plan

  • Inspect a published nested DataLad dataset
  • Create a toy example from scratch

3.3. Data we will use

  • "Highspeed Analysis" DataLad dataset
  • Tabular data from Palmer Station Antarctica LTER
    • Gorman KB, Williams TD, Fraser WR, PLoS ONE 9(3):e90081 (2014)
    • see also: palmerpenguins R dataset, alternative to Iris

3.4. Published DataLad dataset: the plan

  • Obtain the dataset
  • Inspect its nested structure
  • Obtain a specific file
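
These three steps map onto a short command sequence. The clone URL is a placeholder for wherever the dataset is published, and the subdataset and file paths are illustrative:

```shell
# Obtain the top-level dataset only; subdatasets remain uninstalled stubs.
datalad clone <dataset-url> highspeed-analysis
cd highspeed-analysis

# Inspect the nested structure.
datalad subdatasets

# Install a subdataset without downloading its annexed file content...
datalad get --no-data data/bids

# ...then obtain one specific file on demand.
datalad get data/bids/participants.tsv
```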

3.5. Toy example: the plan

  • Investigate the relationship between flipper length and body mass in 3 penguin species.
    • Create a "penguin-report" dataset, with "inputs" subdataset
    • Populate the subdataset with data
    • Run an analysis, → figure in the main dataset
    • Write our "report" → document in the main dataset
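
On the command line, the plan above could look roughly like this (the download URL is a placeholder for the real source of the penguin tables, and `process.py` is added to the dataset separately):

```shell
# Create the main dataset with a nested "inputs" subdataset.
datalad create -c text2git penguin-report
cd penguin-report
datalad create -d . inputs

# Populate the subdataset, recording where the data came from.
# The URL is a placeholder; repeat for chinstrap and gentoo.
datalad download-url -d inputs -O inputs/adelie.csv <URL-of-adelie.csv>

# Run the analysis with provenance capture: inputs are fetched if
# needed, and the resulting figure is saved in the main dataset.
datalad run -m "Plot flipper length vs body mass" \
    -i "inputs/*.csv" -o "figures/lmplot.png" \
    "python process.py"
```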

3.5.1. Folder structure we're aiming for

penguin-report/
├── figures
│   └── lmplot.png
├── inputs
│   ├── adelie.csv
│   ├── chinstrap.csv
│   └── gentoo.csv
├── process.py
├── report.html
└── report.md
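The analysis script itself can stay small. Below is a sketch of what `process.py` might contain, assuming each `inputs/*.csv` table has `flipper_length_mm` and `body_mass_g` columns (column names as in the palmerpenguins data; adjust to the actual files):

```python
# Sketch of process.py: plot flipper length vs body mass per species.
# Column names are assumptions based on the palmerpenguins data.
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd


def make_figure(csv_paths, out_path):
    """Combine per-species tables and plot flipper length vs body mass."""
    frames = []
    for path in csv_paths:
        df = pd.read_csv(path)
        df["species"] = Path(path).stem  # species name taken from file name
        frames.append(df)
    data = pd.concat(frames, ignore_index=True)

    fig, ax = plt.subplots()
    for species, group in data.groupby("species"):
        ax.scatter(group["flipper_length_mm"], group["body_mass_g"],
                   s=12, label=species)
    ax.set_xlabel("Flipper length (mm)")
    ax.set_ylabel("Body mass (g)")
    ax.legend(title="Species")
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    fig.savefig(out_path)
    return data
```

In the dataset this would be invoked as `python process.py` (with a small `__main__` block calling `make_figure(sorted(Path("inputs").glob("*.csv")), "figures/lmplot.png")`), ideally through `datalad run` so the figure's provenance is recorded. A seaborn `lmplot` would additionally draw the per-species regression lines.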

4. Wrap-up

4.1. Where to next