Research Data Management with DataLad

Michał Szczepanik, Michael Hanke & the DataLad team

1. Day 1: topics

  • Content tracking with DataLad
  • Good practices in data management

2. Introduction: life of digital objects

  • Alice is a PhD student.
  • She works on a fairly typical research project: data collection & processing.
    • Exact kind of data not relevant for us
    • first sample → final result: cumulative process
  • When working locally, Alice likes to have an automated record of:
    • when a given file was last changed
    • where it came from
    • what input files were used to generate a given output
    • why some things were done.
  • Even without sharing, essential for her future self
  • Project is exploratory: often large changes to her analysis scripts
  • Enjoys comfort of being able to return to a previously recorded state

This is local version control.

  • Alice's work not confined to a single computer
    • laptop / desktop / remote server
    • automatic and efficient way to synchronize
  • Some data collected / analysed by colleagues from another team
    • all synchronize with centralized storage
    • preserving origin & authorship
    • combining simultaneous contributions

This is distributed version control.

  • Needs to work on a subset of data at a given time
    • all files are kept on a server
    • few files are rotated into and out of the laptop
  • Needs to publish data at project's end
    • raw data / outputs / both
    • completely or selectively

… all of these are typical data management issues, which we will touch upon during this workshop using DataLad as our primary tool.

3. Workshop plan

  • Day 1:
    • Content tracking with DataLad (local version control)
    • Good practices in data management
  • Day 2:
    • Distributed version control (publish / consume)
    • Linked datasets

4. Resources

5. Part 1: Content tracking with DataLad

  • Gradually build up an example dataset
    • Discover version control and basic DataLad concepts in the process.
    • Introduce basic DataLad commands: a technical foundation for all of the above
  • DataLad is agnostic about the kind of data it manages
    • Add photos and text as "data"
    • Convert photos to b/w as "data processing"
  • Add files → record origin → make changes → track changes → undo things
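
The add → change → track → undo part of this sequence could look roughly like the sketch below, written against DataLad's Python API. It is an illustration under assumptions, not the workshop's own material: the function name build_example_dataset, the file name notes.txt and the commit messages are made up for this example, the "record origin" step is left out because it needs a real download URL, and the undo step falls back on plain Git (a DataLad dataset is a Git repository). Running it requires DataLad, Git and git-annex to be installed, and a Git identity to be configured.

```python
from pathlib import Path


def build_example_dataset(path):
    """Walk through add -> change -> track -> undo in a new dataset.

    Hypothetical helper for illustration only; names and messages are
    assumptions, not part of the workshop material.
    """
    # Imported lazily so that defining this function does not itself
    # require DataLad to be installed.
    import datalad.api as dl

    # Create a dataset. The text2git procedure keeps text files in Git
    # rather than in the annex, so they stay directly editable after saving.
    ds = dl.create(path=str(path), cfg_proc="text2git")

    # Add a file and record it, with a message explaining why.
    notes = Path(ds.path) / "notes.txt"
    notes.write_text("First observation\n")
    ds.save(message="Add initial notes")

    # Make a change and record it as a new state.
    notes.write_text("First observation\nSecond observation\n")
    ds.save(message="Add second observation")

    # Track changes: the recorded states are ordinary Git commits.
    history = ds.repo.call_git(["log", "--oneline"])

    # Undo: revert the most recent commit using plain Git.
    ds.repo.call_git(["revert", "--no-edit", "HEAD"])

    # Return the log taken before the revert, plus the restored content.
    return history, notes.read_text()
```

Calling build_example_dataset on a new, empty directory should return the log with the two "notes" commits, along with the content of notes.txt as it was after the first save.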