Summary of basic DataLad commands

Key Points

Content tracking with DataLad
- With version control, lineage of all files is preserved
- You can record and revert changes made to the dataset
- DataLad can be used to version control a dataset and all its files
- You can manually save changes with `datalad save`
- You can use `datalad download-url` to preserve file origin
- You can use `datalad run` to capture outputs of a command
- “Large” files are annexed, and protected from accidental modifications

Structuring data
- Use filenames which are machine-readable, human-readable, easy to sort and search
- Avoid including identifying information in filenames from the get-go
- Files can be categorized as text or binary
- Lightweight text files can go a long way
- A well thought-out directory structure simplifies computation
- Be modular to facilitate reuse

Remote collaboration
- A dataset can be published with `datalad push`
- A dataset can be cloned with `datalad clone`
- The clone operation does not obtain annexed file content; the contents can be obtained selectively
- Annexed file contents can be removed (`drop`) and reobtained (`get`) as long as a copy exists somewhere
- A dataset can be synchronised with its copy (sibling) with `datalad update`
- GIN is one of the platforms with which DataLad can interact
- GIN can serve as a store for both annexed and non-annexed contents

Dataset management
- A dataset can contain other datasets
- The super- and sub-datasets have separate histories
- The superdataset only contains a reference to a specific commit in the subdataset’s history

Extras: The Basics of Branching
- Your dataset contains branches. The default branch is usually called either `main` or `master`.
- There’s no limit to the number of branches one can have, and each branch can become an alternative timeline with developments independent from the developments in other branches.
- Branches can be `merged` to integrate the changes from one branch into another.
- Using branches is fundamental in collaborative workflows where many collaborators start from a clean default branch and propose new changes to a central dataset sibling.
- Typically, central datasets are hosted on services like GitHub, GitLab, or GIN, and if collaborators push their branches with new changes, these services help to create pull requests.

Extras: Removing datasets and files
- Your dataset keeps annexed data safe and will refuse to perform operations that could cause data loss
- Removing files or datasets with known copies is easy; removing files or datasets without known copies requires bypassing safety checks
- There are two ‘destructive’ commands: `drop` and `remove`
- `drop` is the antagonist command to `get`, and `remove` is the antagonist command to `clone`
- Both commands have a `--reckless [MODE]` parameter to override safety checks
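
The key points above map onto a small set of command lines. The sketch below only assembles and prints them; nothing is executed, so it runs without DataLad installed. All dataset names, paths, URLs, the sibling name `gin`, and the script `code/summarise.py` are hypothetical placeholders, and the option spellings are assumed from recent DataLad releases (check `datalad <command> --help` for your version).

```python
import shlex


def datalad(*args):
    """Build the argument list for one `datalad` call (nothing is run)."""
    return ["datalad", *args]


# Every name, path, and URL below is a hypothetical placeholder.
steps = {
    "content tracking": [
        datalad("create", "my-dataset"),
        datalad("save", "-m", "Add README"),
        datalad("download-url", "-m", "Add raw data",
                "--path", "inputs/raw.csv", "https://example.com/raw.csv"),
        datalad("run", "-m", "Summarise raw data",
                "--input", "inputs/raw.csv", "--output", "outputs/summary.csv",
                "python code/summarise.py"),
    ],
    "remote collaboration": [
        datalad("clone", "https://example.com/alice/my-dataset", "my-clone"),
        datalad("get", "inputs/raw.csv"),    # obtain annexed content selectively
        datalad("drop", "inputs/raw.csv"),   # free space; a copy must exist elsewhere
        datalad("push", "--to", "gin"),
        datalad("update", "-s", "gin", "--how", "merge"),
    ],
    "dataset management": [
        # Clone into the current dataset (-d .) to register a subdataset.
        datalad("clone", "-d", ".", "https://example.com/alice/inputs", "inputs/shared"),
    ],
    "removing content": [
        # --reckless overrides the safety check named by its mode.
        datalad("drop", "--reckless", "availability", "outputs/summary.csv"),
        datalad("remove", "-d", "inputs/shared"),
    ],
}

for topic, commands in steps.items():
    print(f"# {topic}")
    for argv in commands:
        print(shlex.join(argv))
```

To actually run one of these steps from Python, the same list could be passed to `subprocess.run(argv, check=True)` from inside the relevant dataset directory.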

Action	Description
create	Create a new dataset from scratch
save	Save the current state of a dataset
status	Report on the state of a dataset and / or its subdatasets
get	Get dataset content (files / directories / subdatasets)
clone	Install an existing dataset from path / url / open data collection
update	Update a dataset from a sibling
remove	Remove datasets + contents, unregister from potential top-level datasets
unlock	Unlock file(s) of a dataset to enable editing their content
drop	Drop file content from dataset (remove data, retain symlink)
siblings	Manage sibling configurations
push	Publish a dataset to a known sibling
run	Run arbitrary shell command and record its impact
rerun	Re-execute a previous run command identified by its hash, and save resulting modifications
run-procedure	Run prepared procedures (executables) on a dataset
download-url	Download, save, and record origin of content from web sources

See the DataLad cheat sheet in the DataLad Handbook.
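
Commands like these are also exposed as Python functions in the `datalad.api` module, with hyphens in the command name replaced by underscores. The sketch below prints that mapping and, only if DataLad happens to be installed, checks whether each function is present. The list is an assumption based on the table above, using `push` (the command named in the key points for publishing); the helper `api_name` is hypothetical and not part of DataLad.

```python
def api_name(cli_name):
    """Translate a command-line name into its Python function name."""
    return cli_name.replace("-", "_")


CLI_COMMANDS = [
    "create", "save", "status", "get", "clone", "update", "remove", "unlock",
    "drop", "siblings", "push", "run", "rerun", "run-procedure", "download-url",
]

try:
    import datalad.api as dl  # optional: only used to check availability
except ImportError:
    dl = None

for name in CLI_COMMANDS:
    func = api_name(name)
    if dl is None:
        note = "DataLad not installed, not checked"
    else:
        note = "available" if hasattr(dl, func) else "not found in this version"
    print(f"datalad {name:<14} -> datalad.api.{func:<14} ({note})")
```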

Glossary

absolute path: A path that refers to a particular location in a file system. Absolute paths are usually written with respect to the file system’s root directory, and begin with either “/” (on Unix) or “\” (on Microsoft Windows). See also: relative path.
current working directory: The directory that relative paths are calculated from; equivalently, the place where files referenced by name only are searched for. Every process has a current working directory. The current working directory is usually referred to using the shorthand notation “.” (pronounced “dot”).
file system: A set of files, directories, and I/O devices (such as keyboards and screens). A file system may be spread across many physical devices, or many file systems may be stored on a single physical device; the operating system manages access.
path: A description that specifies the location of a file or directory within a file system. See also: absolute path, relative path.
relative path: A path that specifies the location of a file or directory with respect to the current working directory. Any path that does not begin with a separator character (“/” or “\”) is a relative path. See also: absolute path.
root directory: The top-most directory in a file system. Its name is “/” on Unix (including Linux and macOS) and “\” on Microsoft Windows.
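
The path terms above can be tried out with Python's standard `pathlib` module. This is a minimal sketch using Unix-style paths only; the directory `/home/alice/my-dataset` and the file `inputs/raw.csv` are hypothetical examples.

```python
from pathlib import PurePosixPath

cwd = PurePosixPath("/home/alice/my-dataset")  # hypothetical working directory
relative = PurePosixPath("inputs/raw.csv")     # hypothetical file in the dataset

# A relative path does not start at the root directory; an absolute path does.
print(relative.is_absolute())  # False
print(cwd.is_absolute())       # True

# A relative path is interpreted with respect to the current working directory.
resolved = cwd / relative
print(resolved)                # /home/alice/my-dataset/inputs/raw.csv

# "." stands for the current working directory itself.
print(cwd / ".")               # /home/alice/my-dataset

# The root directory of a Unix path is "/".
print(resolved.anchor)         # /

# Going the other way: express an absolute path relative to a directory.
print(resolved.relative_to(cwd))  # inputs/raw.csv
```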

Research Data Management with DataLad: Summary of basic DataLad commands
