Content tracking with DataLad
|
With version control, the lineage of all files is preserved
You can record and revert changes made to the dataset
DataLad can be used to version control a dataset and all its files
You can manually save changes with datalad save
You can use datalad download-url to preserve file origin
You can use datalad run to capture outputs of a command
“Large” files are annexed and thereby protected from accidental modifications
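The three commands named above can be sketched as a short Python script that assembles the corresponding command lines and prints them, so it runs without DataLad installed. The helper names, the example URL, the file paths, and the commit messages are assumptions made for illustration; only the subcommands and their `-m`, `-O`, `--input`, and `--output` options belong to DataLad.

```python
import shlex


def save_cmd(message, path=None):
    """Record the current state of the dataset (optionally only one path)."""
    cmd = ["datalad", "save", "-m", message]
    if path is not None:
        cmd.append(path)
    return cmd


def download_url_cmd(url, target, message):
    """Download a file and record its origin URL in the dataset."""
    return ["datalad", "download-url", "-m", message, "-O", target, url]


def run_cmd(command, message, inputs=(), outputs=()):
    """Execute a command and save its outputs together with a run record."""
    cmd = ["datalad", "run", "-m", message]
    for path in inputs:
        cmd += ["--input", path]
    for path in outputs:
        cmd += ["--output", path]
    cmd.append(command)
    return cmd


# Hypothetical paths, URL, and messages, chosen only for illustration.
steps = [
    save_cmd("Add analysis notes", path="notes.md"),
    download_url_cmd("https://example.com/images/photo.png",
                     "inputs/photo.png", "Add example image"),
    run_cmd("python code/greyscale.py inputs/photo.png outputs/photo_grey.png",
            "Convert image to greyscale",
            inputs=["inputs/photo.png"], outputs=["outputs/photo_grey.png"]),
]

for step in steps:
    # Printed rather than executed; inside a dataset, a step could be passed
    # to subprocess.run(step, check=True) instead.
    print(shlex.join(step))
```

Declaring inputs and outputs for `datalad run` lets DataLad retrieve the inputs and unlock the outputs before the command executes.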
|
Structuring data
|
Use filenames that are machine-readable, human-readable, and easy to sort and search
Avoid including identifying information in filenames from the get-go
Files can be categorized as text or binary
Lightweight text files can go a long way
A well thought-out directory structure simplifies computation
Be modular to facilitate reuse
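One possible naming scheme that follows the points above is sketched below. The specific convention (ISO date first, underscores between fields, hyphens within a field, zero-padded numeric subject codes instead of names) is an assumption for illustration, not a rule imposed by DataLad.

```python
from datetime import date


def make_filename(acquired, subject_no, description, ext):
    """Build a name that sorts chronologically and splits cleanly into fields."""
    slug = "-".join(description.lower().split())   # hyphens inside a field
    # Underscores separate fields; the subject is a padded code, not a name.
    return f"{acquired.isoformat()}_sub-{subject_no:03d}_{slug}.{ext}"


def parse_filename(name):
    """Recover the fields from a name produced by make_filename."""
    stem, ext = name.rsplit(".", 1)
    acquired, subject, description = stem.split("_")
    return {"date": acquired, "subject": subject,
            "description": description, "ext": ext}


# Hypothetical recordings, listed out of order on purpose.
names = [
    make_filename(date(2021, 11, 3), 12, "resting state", "csv"),
    make_filename(date(2021, 2, 14), 3, "resting state", "csv"),
    make_filename(date(2021, 2, 14), 12, "finger tapping", "csv"),
]

# Plain alphabetical sorting is also chronological, then by subject.
for name in sorted(names):
    print(name, parse_filename(name))
```

Because every field has a fixed position and separator, the same names can be filtered with simple glob patterns or split programmatically without special cases.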
|
Remote collaboration
|
A dataset can be published with datalad push
A dataset can be cloned with datalad clone
The clone operation does not obtain annexed file content; the content can be obtained selectively afterwards
Annexed file content can be removed (drop) and reobtained (get) as long as a copy exists somewhere
A dataset can be synchronised with its copy (sibling) with datalad update
GIN is one of the platforms with which DataLad can interact
GIN can serve as a store for both annexed and non-annexed contents
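A typical round trip through the commands above might look like the following sketch, which only prints the command lines. The dataset URL, local path, file name, and sibling name are hypothetical, and `--how merge` assumes a recent DataLad release (older releases spelled this option `--merge`).

```python
import shlex


def collaboration_steps(source_url, path, annexed_file, sibling="origin"):
    """Ordered (description, command) pairs for working with a cloned dataset."""
    return [
        ("clone the dataset (annexed content is not downloaded yet)",
         ["datalad", "clone", source_url, path]),
        ("obtain the content of one annexed file",
         ["datalad", "get", annexed_file]),
        ("free local space again (works while another copy is known)",
         ["datalad", "drop", annexed_file]),
        ("fetch and merge changes from the sibling",
         ["datalad", "update", "-s", sibling, "--how", "merge"]),
        ("publish local changes to the sibling",
         ["datalad", "push", "--to", sibling]),
    ]


# Hypothetical GIN repository and file, for illustration only.
workflow = collaboration_steps(
    "https://gin.g-node.org/example-user/example-dataset",
    "example-dataset",
    "inputs/images/photo.png",
)

for description, cmd in workflow:
    # Every step after the clone is meant to be run from inside the clone.
    print(f"# {description}")
    print(shlex.join(cmd))
```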
|
Dataset management
|
A dataset can contain other datasets
The superdataset and its subdatasets have separate histories
The superdataset only contains a reference to a specific commit in the subdataset’s history
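The relationship between the two histories can be illustrated with a toy model. This is a deliberately simplified sketch, not DataLad's implementation: the class, its methods, and the commit identifiers are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    name: str
    history: list = field(default_factory=list)          # this dataset's own commits
    subdataset_refs: dict = field(default_factory=dict)  # subdataset name -> pinned commit

    def save(self, message):
        """Append a commit to this dataset's own history and return its id."""
        commit_id = f"{self.name}-{len(self.history) + 1}"
        self.history.append((commit_id, message))
        return commit_id

    def save_subdataset_state(self, sub):
        """Pin the subdataset's latest commit; this is itself a commit here."""
        latest = sub.history[-1][0]
        self.subdataset_refs[sub.name] = latest
        return self.save(f"Record {sub.name} at {latest}")


superds = ToyDataset("super")
raw = ToyDataset("raw")

raw.save("Add raw images")            # commit in the subdataset only
superds.save_subdataset_state(raw)    # superdataset pins that commit
raw.save("Fix image metadata")        # subdataset moves on independently

pinned_before = dict(superds.subdataset_refs)
print(pinned_before)                  # superdataset still points at the older commit

superds.save_subdataset_state(raw)    # explicit save updates the reference
print(superds.subdataset_refs)
```

The point of the model is that new commits in a subdataset do not change the superdataset until the superdataset itself records the new reference.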
|
Extras: The Basics of Branching
|
Your dataset contains branches. The default branch is usually called either main or master.
There’s no limit to the number of branches one can have, and each branch can become an alternative timeline with developments independent from the developments in other branches.
Branches can be merged to integrate the changes from one branch into another.
Using branches is fundamental in collaborative workflows where many collaborators start from a clean default branch and propose new changes to a central dataset sibling.
Typically, central datasets are hosted on services like GitHub, GitLab, or GIN. When collaborators push branches with new changes, these services help to create pull requests.
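The branch-based workflow described above could be sketched as the following command sequences, printed rather than executed. The branch name, commit message, and sibling name are hypothetical, and `git switch` assumes Git 2.23 or newer (`git checkout` is the older equivalent).

```python
import shlex


def propose_change_steps(branch, message, sibling="origin", default_branch="main"):
    """Commands for developing a change on its own branch and publishing it."""
    return [
        ["git", "switch", default_branch],      # start from the clean default branch
        ["git", "switch", "-c", branch],        # create and enter the new branch
        ["datalad", "save", "-m", message],     # record the changes on that branch
        ["datalad", "push", "--to", sibling],   # publish the branch to the sibling
    ]


def merge_steps(branch, default_branch="main"):
    """Commands for integrating a branch into the default branch locally."""
    return [
        ["git", "switch", default_branch],
        ["git", "merge", branch],
    ]


# Hypothetical branch and message, for illustration only.
for cmd in propose_change_steps("add-preprocessing", "Add preprocessing script"):
    print(shlex.join(cmd))

# On a hosting service the merge usually happens through a pull request;
# the local equivalent is shown here.
for cmd in merge_steps("add-preprocessing"):
    print(shlex.join(cmd))
```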
|
Extras: Removing datasets and files
|
Your dataset keeps annexed data safe and will refuse to perform operations that could cause data loss
Removing files or datasets with known copies is easy; removing files or datasets without known copies requires bypassing safety checks
There are two ‘destructive’ commands: drop and remove
drop is the opposite of get, and remove is the opposite of clone
Both commands have a --reckless [MODE] parameter to override safety checks
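The two destructive commands and their override can be sketched as follows. The helper names and paths are hypothetical, and the set of modes reflects what recent DataLad releases list for `--reckless`; treat it as an assumption and confirm it with `datalad drop --help` for the installed version.

```python
import shlex

# Assumed set of modes; verify against the installed DataLad version.
RECKLESS_MODES = {"modification", "availability", "undead", "kill"}


def _add_reckless(cmd, reckless):
    """Append --reckless MODE only when a known mode is requested explicitly."""
    if reckless is not None:
        if reckless not in RECKLESS_MODES:
            raise ValueError(f"unknown reckless mode: {reckless!r}")
        cmd += ["--reckless", reckless]
    return cmd


def drop_cmd(path, reckless=None):
    """Command line for removing local annexed content (the opposite of get)."""
    return _add_reckless(["datalad", "drop"], reckless) + [path]


def remove_cmd(dataset_path, reckless=None):
    """Command line for removing a whole dataset (the opposite of clone)."""
    return _add_reckless(["datalad", "remove"], reckless) + ["-d", dataset_path]


# Safe default: DataLad refuses if no other copy of the content is known.
print(shlex.join(drop_cmd("inputs/images/photo.png")))

# Explicit override: skips the availability check and may lose data.
print(shlex.join(drop_cmd("inputs/images/photo.png", reckless="availability")))

# Removing a hypothetical clone that is no longer needed.
print(shlex.join(remove_cmd("example-dataset")))
```

Keeping the override an explicit, validated argument mirrors the design of the commands themselves: the safe behaviour is the default, and data loss requires a deliberate choice.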
|