Research data management with DataLad

@ IMPRS-MMFD

4 - Reusability and reproducibility
Stephan Heunis
jsheunis @jsheunis@mas.to
Michał Szczepanik
mslw @doktorpanik@masto.ai

Psychoinformatics lab,
Institute of Neuroscience and Medicine (INM-7)
Research Center Jülich


Slides: https://psychoinformatics-de.github.io/imprs-mmfd-workshop/

Reusability and Reproducibility

Why Modularity?

  • 3. Transparency

  • Original:
    
    /dataset
    ├── sample1
    │   └── a001.dat
    ├── sample2
    │   └── a001.dat
    ...
    
    Without modularity, after a transform is applied (preprocessing, analysis, ...):
    
    /dataset
    ├── sample1
    │   ├── ps34t.dat
    │   └── a001.dat
    ├── sample2
    │   ├── ps34t.dat
    │   └── a001.dat
    ...
    
    Without expert/domain knowledge, it is impossible to distinguish original from derived data.

Why Modularity?

  • 3. Transparency

  • Original:
    
    /raw_dataset
    ├── sample1
    │   └── a001.dat
    ├── sample2
    │   └── a001.dat
    ...
    
    With modularity, after a transform is applied (preprocessing, analysis, ...):
    
    /derived_dataset
    ├── sample1
    │   └── ps34t.dat
    ├── sample2
    │   └── ps34t.dat
    ├── ...
    └── inputs
        └── raw
            ├── sample1
            │   └── a001.dat
            ├── sample2
            │   └── a001.dat
            ...
    
    Clearer separation of semantics: a pristine version of the original dataset is nested inside a new, additional dataset that holds the outputs (see the sketch below).
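
A rough sketch of how such nesting can be set up with DataLad (dataset names are placeholders):

    # create a new dataset to hold derived outputs
    datalad create derived_dataset
    cd derived_dataset
    # register the pristine raw dataset as a subdataset under inputs/
    datalad clone -d . ../raw_dataset inputs/raw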

A machine-learning example

handbook.datalad.org/usecases/ml-analysis.html

Analysis layout

  • Prepare an input data set
  • Configure and setup an analysis dataset
  • Prepare data
  • Train models and evaluate them
  • Compare different models, repeat with updated data

Imagenette dataset

Prepare an input dataset

  • Create a stand-alone input dataset
  • Either add data and datalad save it, or use commands such as datalad download-url or datalad addurls to retrieve it from web sources (see the sketch below)
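
A minimal sketch of this step; the dataset name and URL are hypothetical placeholders:

    # create a stand-alone dataset for the input data
    datalad create raw_data
    cd raw_data
    # download a file; its source URL is recorded as provenance
    datalad download-url https://example.com/imagenette.tgz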

Configure and setup an analysis dataset

  • Given the purpose of an analysis dataset, configurations can make it easier to use:
    • -c yoda prepares a useful structure
    • -c text2git keeps text files such as scripts in Git
  • The input dataset is installed as a subdataset
  • Required software is containerized and added to the dataset (see the sketch after this list)
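
As shell commands, this setup could look as follows (dataset names are placeholders):

    # create the analysis dataset with helpful configurations
    datalad create -c yoda -c text2git ml-project
    cd ml-project
    # install the input dataset as a subdataset under inputs/raw
    datalad clone -d . ../raw_data inputs/raw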

Sharing software environments: Why and how

Science has many different building blocks: code, software, and data produce research outputs. The more of them you share, the more likely it is that others can reproduce your results.

Sharing software environments: Why and how

  • Software can be difficult or impossible for you or your collaborators to install (e.g., because of conflicts with existing software, or on HPC systems)
  • Different software versions/operating systems can produce different results: Glatard et al., doi.org/10.3389/fninf.2015.00012

Software containers


  • Put simply, a container is a cut-down virtual machine: a portable, shareable bundle of software libraries and their dependencies
  • Docker runs on all operating systems, but requires "sudo" (i.e., admin) privileges
  • Singularity can run on computational clusters (no "sudo"), but does not run (well) on non-Linux systems
  • Their container formats differ, but are interoperable - e.g., Singularity can use and build Docker images

The datalad-container extension

  • The datalad-container extension gives DataLad commands to add, track, retrieve, and execute Docker or Singularity containers.
  • pip/conda install datalad-container
  • With the extension installed, DataLad can register software containers as "just another file" in your dataset and run analyses inside them with datalad containers-run, capturing the software environment as additional provenance (sketch below)
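
A sketch of registering and using a container; the container name and image URL are hypothetical:

    # register a container image in the dataset
    datalad containers-add software --url shub://example/analysis-container
    # execute a command inside the registered container,
    # recording the call (and the container) as provenance
    datalad containers-run -n software -m "run analysis" "python code/analysis.py"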

Did you know...

    Helpful resources for working with software containers:
  • repo2docker can fetch a Git repository/DataLad dataset and build a container image from its configuration files
  • neurodocker can generate custom Dockerfiles and Singularity recipes for neuroimaging
  • The ReproNim container collection, a DataLad dataset that includes common neuroimaging software as configured Singularity containers
  • rocker - Docker containers for R users

Prepare data

  • Add a script for data preparation (it labels the training and validation images)
  • Execute it using datalad containers-run (see the sketch below)
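
For example, assuming a hypothetical preparation script in code/:

    # run the preparation script inside the registered container;
    # declared inputs are fetched and outputs saved automatically
    datalad containers-run -n software \
        -m "Prepare data: label train and validation images" \
        --input inputs/raw \
        --output data \
        "python code/prepare_data.py"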

Train models and evaluate them

  • Add scripts for training and evaluation. This dataset state can be tagged to make it easy to identify later
  • Execute the scripts using datalad containers-run
  • By dumping the trained model as a joblib object, the trained classifier stays reusable (see the sketch below)
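
One way these steps could look (script, tag, and file names are placeholders):

    # tag the current dataset state for easy later reference
    datalad save -m "add train/evaluate scripts" --version-tag ready4analysis
    # train inside the container; the model is dumped as a joblib file
    datalad containers-run -n software \
        -m "train classifier" \
        --input data \
        --output model.joblib \
        "python code/train.py"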

💻Your turn💻

Follow the hands-on tutorial steps at handbook.datalad.org/en/latest/usecases/ml-analysis.html to:

  1. Create an input dataset and add files to it using datalad download-url
  2. Create an analysis dataset using the yoda configuration
  3. Nest the input dataset inside the analysis dataset
  4. Add the software container as a dependency using datalad containers-add
  5. Save code to the analysis dataset for data preparation, training, and evaluation
  6. Run code with the software container using datalad containers-run
  7. (Re)run code with the software container, at specific tags and in specific branches, using datalad containers-run
  8. Inspect differences between the outputs of different runs using git diff
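
Steps 7 and 8 could look roughly like this (tag and branch names are placeholders):

    # re-execute the recorded runs on a new branch, starting from a tag
    datalad rerun --branch verify --onto ready4analysis --since ready4analysis
    # compare outputs between the original and re-executed runs
    git diff master..verify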

After the workshop