Research Data Management with DataLad: Summary of basic DataLad commands

Key Points

Content tracking with DataLad
  • With version control, lineage of all files is preserved

  • You can record and revert changes made to the dataset

  • DataLad can be used to version control a dataset and all its files

  • You can manually save changes with datalad save

  • You can use datalad download-url to preserve file origin

  • You can use datalad run to capture outputs of a command

  • “Large” files are annexed, and protected from accidental modifications
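
The points above can be tried out in a scratch directory. This is a minimal sketch, assuming DataLad and git-annex are installed; the dataset name, file names, and messages are made up for illustration, and the throwaway Git identity is only there so that commits succeed in a clean environment.

```shell
# Throwaway Git identity for this demo (assumption: none is configured yet).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com" \
       GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# Work in a scratch directory so nothing existing is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create a dataset; the text2git configuration keeps text files in Git
# and annexes everything else.
datalad create -c text2git my-dataset
cd my-dataset

# Add a file and record the change manually.
echo "First note" > notes.txt
datalad save -m "Add notes"

# Let DataLad execute a command and record its output in the history.
datalad run -m "Count lines in notes" --output line_count.txt \
    "wc -l < notes.txt > line_count.txt"

# Inspect the recorded history.
git log --oneline
```

datalad download-url is left out of the sketch because it needs network access; it takes a URL and records it as the origin of the downloaded file.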

Structuring data
  • Use filenames that are machine-readable, human-readable, and easy to sort and search

  • Avoid including identifying information in filenames from the get-go

  • Files can be categorized as text or binary

  • Lightweight text files can go a long way

  • A well thought-out directory structure simplifies computation

  • Be modular to facilitate reuse
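
A naming rule is easier to keep when it can be checked automatically. The sketch below assumes a hypothetical convention (lowercase letters, digits, hyphens, and underscores, followed by a single extension); both the convention and the function name is_tidy_name are inventions of this example, not rules from the lesson.

```shell
# Hypothetical convention: "<stem>.<extension>", where the stem uses only
# lowercase letters, digits, "-" and "_". Adjust the pattern to your project.
is_tidy_name() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9_-]+\.[a-z0-9]+$'
}

for name in "2023-01-15_sub-01_task-rest.csv" "Final data (new).XLSX"; do
    if is_tidy_name "$name"; then
        echo "ok:     $name"
    else
        echo "rename: $name"
    fi
done
```

The first name passes and the second is flagged, because it contains spaces, parentheses, and uppercase letters. A leading ISO date, as in the first name, also makes files sort chronologically.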

Remote collaboration
  • A dataset can be published with datalad push

  • A dataset can be cloned with datalad clone

  • The clone operation does not obtain annexed file content; that content can be obtained selectively afterwards

  • Annexed file contents can be removed (drop) and reobtained (get) as long as a copy exists somewhere

  • A dataset can be synchronised with its copy (sibling) with datalad update

  • GIN is one of the platforms with which DataLad can interact

  • GIN can serve as a store for both annexed and non-annexed contents
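
The clone, get, drop, and update cycle can be sketched locally, without any hosting service. This assumes DataLad and git-annex are installed and a DataLad version that supports the --how option of update; the dataset and file names are made up, and a local dataset stands in for a published one.

```shell
# Throwaway Git identity for this demo (assumption: none is configured yet).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com" \
       GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

workdir=$(mktemp -d)
cd "$workdir"

# A stand-in for a published dataset: one annexed file, saved.
datalad create source-dataset
echo "measurement data" > source-dataset/data.dat
datalad save -d source-dataset -m "Add data"

# Clone it; annexed file content is not transferred at this point.
datalad clone source-dataset my-clone
cd my-clone

# Obtain the content of one file, then free the space again.
datalad get data.dat
datalad drop data.dat     # allowed: the source still holds a copy

# Later, a change appears in the source ...
echo "more data" > ../source-dataset/more.dat
datalad save -d ../source-dataset -m "Add more data"

# ... and is merged into the clone from its sibling
# (the sibling is named "origin" after cloning).
datalad update -s origin --how merge
```

Publishing with datalad push is not shown, since it needs a sibling that accepts pushes, such as a repository on GIN.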

Dataset management
  • A dataset can contain other datasets

  • The super- and sub-datasets have separate histories.

  • The superdataset only contains a reference to a specific commit in the subdataset’s history
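
The separation of histories can be seen in a small nested setup. This is a sketch under the assumption that DataLad and git-annex are installed; the names project, inputs, and raw.txt are hypothetical.

```shell
# Throwaway Git identity for this demo (assumption: none is configured yet).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com" \
       GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

workdir=$(mktemp -d)
cd "$workdir"

datalad create project
cd project

# Create a dataset inside the dataset and register it as a subdataset.
datalad create -d . inputs

# Changes inside the subdataset are saved to its own history ...
echo "raw values" > inputs/raw.txt
datalad save -d inputs -m "Add raw values"

# ... and the superdataset then records the new subdataset commit.
datalad save -m "Update inputs subdataset"

# List registered subdatasets and compare the two histories.
datalad subdatasets
git log --oneline
git -C inputs log --oneline
```

The commit "Add raw values" appears only in the log of inputs; the superdataset log shows a commit that updates the recorded reference.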

Extras: The Basics of Branching
  • Your dataset contains branches. The default branch is usually called either main or master.

  • There’s no limit to the number of branches one can have, and each branch can become an alternative timeline with developments independent from the developments in other branches.

  • Branches can be merged to integrate the changes from one branch into another.

  • Using branches is fundamental in collaborative workflows where many collaborators start from a clean default branch and propose new changes to a central dataset sibling.

  • Typically, central datasets are hosted on services like GitHub, GitLab, or Gin, and if collaborators push their branches with new changes, these services help to create pull requests.
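
A DataLad dataset is a Git repository, so branches are handled with Git commands. The sketch below uses a plain Git repository as a stand-in for a dataset to keep the example short; in a real dataset, datalad save would replace the git add and git commit steps. The repository, branch, and file names are made up.

```shell
# Throwaway Git identity for this demo (assumption: none is configured yet).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com" \
       GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

workdir=$(mktemp -d)
cd "$workdir"
git init -q demo
cd demo

echo "baseline" > README.txt
git add README.txt
git commit -q -m "Initial commit"

# Remember the default branch, whatever it is called here (main or master).
default_branch=$(git rev-parse --abbrev-ref HEAD)

# Start an alternative timeline and commit a change there.
git checkout -q -b new-analysis
echo "analysis plan" > plan.txt
git add plan.txt
git commit -q -m "Add analysis plan"

# The default branch is unaffected until the branch is merged.
git checkout -q "$default_branch"
ls                          # only README.txt
git merge -q new-analysis
ls                          # README.txt and plan.txt
```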

Extras: Removing datasets and files
  • Your dataset keeps annexed data safe and will refuse to perform operations that could cause data loss

  • Removing files or datasets with known copies is easy; removing files or datasets without known copies requires bypassing safety checks

  • There are two ‘destructive’ commands: drop and remove

  • drop is the counterpart of get, and remove is the counterpart of clone

  • Both commands have a --reckless [MODE] parameter to override safety checks
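
The safety check and its override can be observed with a file that exists in one place only. This is a sketch, assuming DataLad and git-annex are installed and a DataLad version whose drop command accepts --reckless availability; the dataset and file names are hypothetical. Run it only on disposable data, since the last step destroys the file content.

```shell
# Throwaway Git identity for this demo (assumption: none is configured yet).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com" \
       GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

workdir=$(mktemp -d)
cd "$workdir"

datalad create only-copy
cd only-copy
echo "disposable demo content" > data.dat
datalad save -m "Add data"

# No other copy of data.dat is known, so the safety check refuses the drop.
datalad drop data.dat || echo "drop refused: no other copy known"

# Overriding the availability check discards the only copy for good.
datalad drop --reckless availability data.dat
```

The file name stays in the dataset after the drop, but its content is gone and cannot be reobtained with get.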

Summary of basic DataLad commands

Action         Description
create         Create a new dataset from scratch
save           Save the current state of a dataset
status         Report on the state of a dataset and/or its subdatasets
get            Get dataset content (files / directories / subdatasets)
clone          Install an existing dataset from a path / URL / open data collection
update         Update a dataset from a sibling
remove         Remove datasets and their contents, unregister from potential top-level datasets
unlock         Unlock file(s) of a dataset to enable editing their content
drop           Drop file content from a dataset (remove data, retain symlink)
siblings       Manage sibling configurations
push           Push a dataset to a known sibling
run            Run an arbitrary shell command and record its impact
rerun          Re-execute a previous run command identified by its hash, and save resulting modifications
run-procedure  Run prepared procedures (executables) on a dataset
download-url   Download, save, and record the origin of content from web sources

See the DataLad cheat sheet in the DataLad Handbook.

Glossary

absolute path
A path that refers to a particular location in a file system. Absolute paths are usually written with respect to the file system’s root directory, and begin with either “/” (on Unix) or “\” (on Microsoft Windows). See also: relative path.
current working directory
The directory that relative paths are calculated from; equivalently, the place where files referenced by name only are searched for. Every process has a current working directory. The current working directory is usually referred to using the shorthand notation . (pronounced “dot”).
file system
A set of files, directories, and I/O devices (such as keyboards and screens). A file system may be spread across many physical devices, or many file systems may be stored on a single physical device; the operating system manages access.
path
A description that specifies the location of a file or directory within a file system. See also: absolute path, relative path.
relative path
A path that specifies the location of a file or directory with respect to the current working directory. Any path that does not begin with a separator character (“/” or “\”) is a relative path. See also: absolute path.
root directory
The top-most directory in a file system. Its name is “/” on Unix (including Linux and macOS) and “\” on Microsoft Windows.
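
The difference between absolute and relative paths can be checked mechanically on Unix. The function name path_kind below is made up for this sketch, and it only handles Unix-style paths.

```shell
# Classify a Unix-style path by its first character.
path_kind() {
    case "$1" in
        /*) echo "absolute" ;;
        *)  echo "relative" ;;
    esac
}

path_kind "/home/alice/data.csv"   # prints "absolute"
path_kind "data/raw.csv"           # prints "relative"
path_kind "./notes.txt"            # prints "relative"

# A relative path is resolved against the current working directory,
# so prefixing it with that directory yields an absolute path.
echo "$(pwd)/data/raw.csv"
```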
