Research data management with DataLad

@ IMPRS-MMFD

1 - Introduction to DataLad
Stephan Heunis
jsheunis @jsheunis@mas.to
Michał Szczepanik
mslw @doktorpanik@masto.ai

Psychoinformatics lab,
Institute of Neuroscience and Medicine (INM-7)
Research Center Jülich


Slides: https://psychoinformatics-de.github.io/imprs-mmfd-workshop/

Welcome!

Approximate workshop schedule

Session 1 (today, 14.00-15.30)
Logistics & Intro 🧑‍🏫,
Hands-on Terminal Basics 💻,
Demo of core functionality 🧑‍🏫💻

Session 2 (today, 16.00-18.00)
Hands-on DataLad Basics & Exercises 💻

Session 3 (tomorrow, 09.00-11.00)
Sharing and Collaboration 🧑‍🏫,
Hands-on Data publication 💻

Session 4 (tomorrow, 11.30-13.00)
Reusability and reproducibility 🧑‍🏫💻,
Outro 🧑‍🏫

Logistics and links

Interactivity



  • The workshop centers around DataLad for real-world research data management use cases
  • There are no stupid questions; ask anything any time
  • Something doesn't look right on your system? Put your hand up or stick a post-it to your screen. We'll take a look together
  • We're available outside of sessions, too. Chat about your use cases or questions over a coffee or meal
  • 4 sessions = time for more than a standard introduction.
  • Materials are available online and persistent, so we can be flexible and spontaneous if specific topics interest you

After the workshop

Audience response system

Use your phone to scan the QR code, or open the link in a new browser window

On a scale of rubber ducks...

Research data management

Common problems in science

You write a paper and stay up late to generate good-looking figures, but you have to tweak many parameters and display options. The next morning, you have no idea which parameters produced which figures, or which of the figures match what you report in the paper.
Illustration adapted from Scriberia and The Turing Way

Common problems in science

Your research project produces phenomenal results, but your laptop, the only place that stores the source code for the results, is stolen or breaks
https://co.pinterest.com/pin/551128073121451139/

Common problems in science

A graduate student complains that a research idea does not work. Their supervisor can't figure out what the student did and how, and the student can't sufficiently explain their approach (data, algorithms, software). Weeks of discussion and miscommunication ensue because the supervisor can't explore or use the student's project first-hand.
http://phdcomics.com/comics.php?f=1693

Common problems in science

You wrote a script during your PhD that applied a specific method to a dataset. Now, with new data and a new project, you try to reuse the script, but you have forgotten how it works.
http://phdcomics.com/comics.php?f=1693

Common problems in science

You try to recreate results from another lab's published paper. You base your re-implementation on everything reported in their paper, but the results you obtain look nothing like the original.
http://phdcomics.com/comics.php?f=1693

Common old problems in science

All these problems were paraphrased from Buckheit & Donoho, 1995
Let's do better!

What should we do about it???

The pipeline needs to become transparent


Digital Provenance = A complete description of how a digital file came to be (FAIR principles)

What should we do about it???

The pipeline needs to become automated

Thus: everything should be FAIR...

  • Findable
  • Accessible
  • Interoperable
  • Reusable


https://www.go-fair.org/fair-principles
Wilkinson et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

But what does FAIR really mean, practically?

  • Bench/bed/field-side researchers are an essential source of
    valid metadata, critical for FAIR data
  • Their resources are limited, and they need something in exchange, otherwise FAIR won't happen


Why not focus on enabling practical collaboration
(even if just with one's future self)?


Why not make the aspirational goal "FAIR data"
a by-product of enabling efficient research?

Be FAIR and immediately benefit from it yourself...

...while still working towards the greater good of FAIR data
  • Version-controlled
  • Actionable metadata
  • Modular
  • Portable


a.k.a. The DataLad approach

DataLad can help
with small or large-scale
data management
Free,
open source,
command line tool & Python API

  • A command-line tool, available for all major operating systems (Linux, macOS/OSX, Windows), MIT-licensed
  • Built on top of Git and git-annex
  • Allows...
  • ... version-controlling arbitrarily large content:
    version control data and software alongside code!
    ... transport mechanisms for sharing and obtaining data:
    consume and collaborate on data (analyses) like software
    ... (computationally) reproducible data analysis:
    track and share provenance of all digital objects
    ... and much more
  • Completely domain-agnostic

Acknowledgements

Software
  • Joey Hess (git-annex)
  • The DataLad team & contributors
Illustrations
  • The Turing Way
    project & Scriberia
Funders
Collaborators

Examples of what DataLad can be used for:

  • Creating and sharing reproducible, open science: Sharing data, software, code, and provenance
  • (Screen recording: cloning the REMODNAV paper dataset from GitHub)

Examples of what DataLad can be used for:

  • Central data management and archival system

Examples of what DataLad can be used for:

  • Scalable computing framework for reproducible science

Prerequisites: Terminal

  • DataLad can be used from the command line
  • datalad create mydataset
  • ... or with its Python API
  • import datalad.api as dl
    dl.create(path="mydataset")
  • ... and other programming languages can use it via system call
  • # in R
    > system("datalad create mydataset")
    


Prerequisites: Terminal

Useful: Unix terminal cheatsheet

💻Your turn💻 try a few commands at workshop-hub.datalad.org

pwd
ls
mkdir test_dir
cd test_dir
echo "my_words" >> my_file.txt
ls -la
cd ..
tree
cat test_dir/my_file.txt
pip list | grep datalad

Prerequisites: Installation and Configuration

  • Your installed version of DataLad should be 1.0.2
  • datalad --version
  • DataLad relies on Git to create a revision history with detailed information on what was changed, when, and how. Therefore, you should tell Git who you are and configure a Git identity (name and email). Find out if an identity is set by running either of:
  • $ git config --get user.name
    Stephan Heunis
    $ git config --get user.email
    s.heunis@fz-juelich.de

    $ datalad configuration get user.name user.email
    Stephan Heunis
    s.heunis@fz-juelich.de
    
  • 💻Your turn💻 set a Git identity using either of
    $ git config set --global \
      user.name "Stephan Heunis"
    $ git config set --global \
      user.email "s.heunis@fz-juelich.de"
    $ datalad configuration --scope global \
      set user.name="Stephan Heunis"
    $ datalad configuration --scope global \
      set user.email="s.heunis@fz-juelich.de"
  • Allow brand-new DataLad functionality:
    datalad configuration --scope global set datalad.extensions.load=next
  • Find installation and configuration instructions at handbook.datalad.org
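
The checks on this slide can also be scripted. Below is a minimal Python sketch, not part of DataLad itself: the helper names `parse_version` and `git_identity` are hypothetical, and the sketch assumes that `datalad --version` prints a line of the form `datalad 1.0.2`. It only reads the configuration and changes nothing.

```python
import re
import shutil
import subprocess


def parse_version(output):
    """Extract a (major, minor, patch) tuple from text like 'datalad 1.0.2'.

    Returns None when no version number is found.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    return tuple(int(part) for part in match.groups()) if match else None


def git_identity():
    """Return the configured Git user.name and user.email (None if unset)."""
    identity = {}
    for key in ("user.name", "user.email"):
        value = None
        if shutil.which("git"):  # skip the query if Git is not installed
            result = subprocess.run(
                ["git", "config", "--get", key],
                capture_output=True, text=True, check=False,
            )
            # 'git config --get' exits non-zero when the key is not set
            if result.returncode == 0:
                value = result.stdout.strip() or None
        identity[key] = value
    return identity


if __name__ == "__main__":
    if shutil.which("datalad"):
        out = subprocess.run(
            ["datalad", "--version"],
            capture_output=True, text=True, check=False,
        ).stdout
        print("DataLad version:", parse_version(out))
    else:
        print("datalad not found on PATH")
    for key, value in git_identity().items():
        print(key, "->", value if value else "NOT SET")
```

If either identity value is reported as not set, the commands under "Your turn" above configure it.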

Prerequisites: Using DataLad

  • Every DataLad command consists of a main command followed by a sub-command. The main and the sub-command can have options.
  • Example (main command, subcommand, several subcommand options):
    $ datalad save -m "Saving changes" --recursive 
  • Use --help to find out more about any (sub)command and its options, including detailed description and examples (q to close). Use -h to get a short overview of all options
    $ datalad save -h
    Usage: datalad save [-h] [-m MESSAGE] [-d DATASET] [-t ID] [-r] [-R LEVELS]
                        [-u] [-F MESSAGE_FILE] [--to-git] [-J NJOBS] [--amend]
                        [--version]
                        [PATH ...]
    
    Use '--help' to get more comprehensive information.
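
To make the "main command, subcommand, options" anatomy concrete, here is a small Python sketch that assembles such a command line without running it. The helper `build_command` is hypothetical and only for illustration; it assumes the long option names `--message` and `--recursive`, which correspond to `-m` and `-r` in the usage text above.

```python
import shlex


def build_command(subcommand, *paths, **options):
    """Assemble 'datalad <subcommand> [options] [paths]' as an argument list."""
    command = ["datalad", subcommand]          # main command, then subcommand
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:                      # boolean switch, e.g. --recursive
            command.append(flag)
        elif value is not False and value is not None:
            command.extend([flag, str(value)])  # option that takes a value
    command.extend(paths)                      # positional PATH arguments last
    return command


cmd = build_command("save", message="Saving changes", recursive=True)
print(shlex.join(cmd))
# prints: datalad save --message 'Saving changes' --recursive
```

The resulting list could be handed to `subprocess.run` to execute the command, which is the same "system call" route shown earlier for R.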
              

So how does it actually work??

focus now, you're going to be doing this soon ;-)


they install datalad...


datalad create --help

datalad status --help
datalad save --help

Data publishing

datalad siblings --help
datalad push --help

Data consumption

datalad clone --help
datalad get --help
datalad drop --help
datalad update --help

datalad subdatasets --help
datalad containers-add --help
datalad run --help
datalad containers-run --help

datalad rerun --help

Let's give it a go!

(after coffee :D)