Content tracking with DataLad

Overview

Teaching: 30 min
Exercises: 60 min
Questions
  • What does version control mean for datasets?

  • How to create a DataLad dataset?

Objectives
  • Learn basics of version control

  • Work locally to create a dataset

  • Practice basic DataLad commands

Introduction

Alice is a PhD student. She works on a fairly typical research project, which involves collection and processing of data. The exact kind of data she works with is not relevant for us, what’s relevant is that getting from the first sample to the final result is a cumulative process.

When Alice is working locally, she likes to have an automated record of when a given file was last changed, where it came from, what input files were used to generate a given output, or why some things were done. Even if she won’t be sharing the data with anyone, these records might be essential for her future self, when she needs to return to the project after some time. Moreover, Alice’s project is exploratory, and she often makes large changes to her analysis scripts. She enjoys the comfort of being able to return all files to a previously recorded state if she makes a mistake or figures out a better solution. This is local version control.

Alice’s work is not confined to a single computer. She has a laptop and a desktop, and she uses a remote server to run some time-consuming analysis steps. She likes having an automatic and efficient way to synchronise her project files between these places. Moreover, some of the data within the project is collected or analysed by her colleagues, possibly from another team. She uses the same mechanism to synchronise the data with a centralized storage (e.g. network storage owned by her lab), preserving origin and authorship of files, and combining simultaneous contributions. This is distributed version control.

Finally, Alice wants to have a mechanism to publish, completely or selectively, her raw data, or outputs, or both. Or to work selectively with a large collection of files - keeping all of them on a server, and only fetching some to her laptop.

These are typical data management issues which we will touch upon during this workshop. From the technical point of view we will be using DataLad, a data management multi-tool that can assist you in handling the entire life cycle of digital objects. It is a command-line tool, free and open source, and available for all major operating systems. The first module will deal only with local version control. In the next one, we will set the technical details aside and talk about good practices in file management. Later during the workshop we will discuss distributed version control, publish a dataset, and see what it looks like from the perspective of data consumers. In the last module we will talk about more complex scenarios with linked datasets.

In this lesson we will gradually build up an example dataset, discovering version control and basic DataLad concepts in the process. Along the way, we will introduce basic DataLad commands - a technical foundation for all the operations outlined above. Since DataLad is agnostic about the kind of data it manages, we will use photographs and text files to represent our dataset content. We will add these files, record their origin, make changes, track these changes, and undo things we don’t want to keep.

Setting up

In order to code along, you should have a recent DataLad version. The workshop was developed based on DataLad version 0.16. Installation instructions are included in the Setup page. If you are unsure about your version of DataLad, you can check it using the following command:

datalad --version

You should also have a configured Git identity, consisting of your name and email (and the command above will display a complaint if you don’t). That identity will be used to identify you as the author of all dataset operations. If you are unsure whether you have already configured your Git identity, you can check whether your name and email are printed to the terminal when you run:

git config --get user.name
git config --get user.email

If nothing is returned (or the values are incorrect), you can set them with:

git config --global user.name "John Doe"
git config --global user.email johndoe@example.com

With the --global option, you only need to do this once on a given system, as the values will be stored for your user account. Of course, you can change or override them later.
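If you are curious how the scopes interact, here is a short throwaway demonstration (run in a temporary directory, not in your project; the name used here is made up):

```shell
# Throwaway repo to show that per-repository settings override --global ones
tmp=$(mktemp -d)
cd "$tmp"
git init -q
# Without --global, the value is stored in this repository's .git/config only
git config user.name "Project-Specific Name"
name=$(git config --get user.name)
echo "$name"   # prints: Project-Specific Name
```

Inside that repository, the local value wins; everywhere else, the --global value applies.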

Note for participants using their own computers: some examples used to illustrate data processing require Python with the Pillow library. If you are using a virtual environment, now is a good time to activate it (e.g. source ~/.venvs/rdm-workshop/bin/activate). You’ll find more details in the Setup page.

How to use DataLad

DataLad is a command line tool with a Python API. It can be operated from your terminal using the command line (as done above), or used in scripts such as shell scripts, Python scripts, Jupyter Notebooks, and so forth. We will only use the command line interface during the workshop.

The first important skill in using a program is asking for help. To do so, you can type:

datalad --help

This will display a help message, which you can scroll up and down using arrows and exit with q. The first line is a usage note:

Usage: datalad [global-opts] command [command-opts]

This means that to use DataLad you will need to type in the main command (datalad) followed by a sub-command. The (sub-)commands are listed in the help message. The most important for now are datalad create and datalad save, and we will explain them in detail during this lesson.

Both the main command and the sub-command can accept options. Options usually start with a dash (single letter, e.g. -m) or two dashes (longer names, e.g. --help which we have just used). Some commands will have both the long form and the short form.

You can also request help for a specific command, for example:

datalad create --help

Using the shorter -h flag instead of --help will return a concise overview of all subcommands or command options.

datalad create -h
Usage: datalad create [-h] [-f] [-D DESCRIPTION] [-d DATASET] [--no-annex]
                      [--fake-dates] [-c PROC] [--version]
                      [PATH] ...

Use '--help' to get more comprehensive information.

Getting started: create an empty dataset

All actions we do happen in or involve DataLad datasets. Creating a dataset from scratch is done with the datalad create command.

datalad create only needs a name, and it will subsequently create a new directory under this name and instruct DataLad to manage it. Here, we also pass the -c text2git option. With -c, datasets can be pre-configured in a certain way at the time of creation, and text2git is one of the available run procedures (later we’ll explain why we chose to use it in this example):

datalad create -c text2git my-dataset
[INFO   ] Creating a new annex repo at /home/bob/Documents/rdm-workshop/my-dataset
[INFO   ] Running procedure cfg_text2git
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
create(ok): /home/bob/Documents/rdm-workshop/my-dataset (dataset)

The last output line confirms that the create operation was successful. Now, my-dataset is a new directory, and you can change directories (cd) inside it:

cd my-dataset

Let’s inspect what happened, starting with a listing of all contents, including hidden ones (on UNIX-like systems, files or folders starting with a dot are treated as hidden):

ls -a
.  ..  .datalad  .git  .gitattributes

The . and .. represent current and parent directory, respectively. More interestingly, there are two hidden folders, .datalad and .git as well as a hidden .gitattributes file. They are essential for dataset functioning, but typically we have no need to touch them.

Next, we can invoke tig, a tool which we will use to view the dataset history. Tig displays a list of commits - a record of changes made to the document. Each commit has a date, author, and description, and is identified by a unique 40-character sequence (displayed at the bottom) called shasum or hash. You can move up and down the commit list using up and down arrows on your keyboard, use enter to display commit details, and q to close detail view or Tig itself.

We can see that DataLad has already created two commits on our behalf. They are shown with the most recent on top:

tig
2021-10-18 16:58 +0200 John Doe o [main] Instruct annex to add text files to Git
2021-10-18 16:58 +0200 John Doe I [DATALAD] new dataset
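If tig is not available on your system, plain git can display the same information. A throwaway demonstration (in a temporary repository, so your dataset stays untouched):

```shell
# Each commit records author, date, message, and a unique hash
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "John Doe"
git config user.email johndoe@example.com
echo "hello" > file.txt
git add file.txt
git commit -q -m "Add file"
summary=$(git log --format='%an %s')   # author name and subject line
echo "$summary"   # prints: John Doe Add file
```

In your dataset, git log (or git log --oneline) shows the same commits that tig does.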

Version control

Version controlling a file means recording its changes over time, associating those changes with an author, date, and identifier, creating a lineage of file content, and being able to revert changes or restore previous file versions. DataLad datasets can version control their contents, regardless of size. Let’s start small, and just create a README.md.

We will use a text editor called nano to work without leaving the command line. You can, of course, use an editor of your preference. Open the editor by typing nano and write the file content:

# Example dataset

This is an example datalad dataset.

Nano displays the available commands on the bottom. To save (Write Out) the file, hit Ctrl-O, type the file name (README.md), and hit enter. Then, use Ctrl-X to exit.

datalad status can report on the state of a dataset, and we will use it a lot. As we added a new file, README.md will show up as being untracked if you run datalad status:

datalad status
untracked: README.md (file)

In order to save a modification in a dataset, use the datalad save command. datalad save will save the current state of your dataset: both modifications to known files and yet-untracked files. The -m/--message option lets you attach a concise summary of your changes. Such a commit message makes it easier for others, and your future self, to understand a dataset’s history:

datalad save -m "Add a short README"

Let’s verify that it got recorded in history:

tig
2021-10-18 17:20 +0200 John Doe o [main] Add a short README
2021-10-18 16:58 +0200 John Doe o Instruct annex to add text files to Git
2021-10-18 16:58 +0200 John Doe I [DATALAD] new dataset

Let’s add some “image data”, represented here by jpeg images. For demonstration purposes, we will use photos available with a permissive license from Unsplash. Start by creating a directory for your data. Let’s call it inputs/images, to make it clear what it represents.

mkdir -p inputs/images

Then, let’s put a file in it. To avoid leaving the terminal, we will use the Linux wget command. This is just for convenience - the effect would be the same if we opened the link in the browser and saved the file from there. The -O option specifies the output file - since this is a photo of chinstrap penguins, and we may expect more of those, let’s name the file chinstrap_01.jpg. We are specifying the URL as a string (i.e. in quotation marks) to avoid confusing our shell with the ? character, which it could otherwise interpret as a placeholder for any single character.

wget -O inputs/images/chinstrap_01.jpg "https://unsplash.com/photos/3Xd5j9-drDA/download?force=true"

We can view the current file / folder structure by using the Linux tree command:

tree
.
├── inputs
│   └── images
│       └── chinstrap_01.jpg
└── README.md

While we’re at it, let’s open the README file (nano README.md) and make a note on how we organize the data. Note the unobtrusive markdown syntax for headers, monospace, and list items, which may be used for rendering by software or websites. With nano, save and exit with: Ctrl-O, enter, Ctrl-X:

# Example dataset

This is an example DataLad dataset.

Raw data is kept in `inputs` folder:
- penguin photos are in `inputs/images`

Okay, time to check the datalad status:

datalad status
untracked: inputs (directory)
 modified: README.md (file)

The inputs directory has some new contents, and it is shown as untracked. The README file now differs from its last known state, and it shows up as modified. This is a good moment to record these changes. Note that datalad save would save all modifications in the dataset at once! If you have several modified files, you can supply a path to the file or files you want to save. We will do it this way, and record two separate changes:

datalad save -m "Add first penguin image" inputs/images/chinstrap_01.jpg
datalad save -m "Update readme" README.md

We can see that these changes got recorded with tig.

For now, we have manually downloaded the file and saved it to the dataset. However, saving a file from a URL is a common scenario, whether we are using a public repository or local network storage. For that, DataLad has the datalad download-url command. Let’s use it to download another file (this command also provides the -O option to specify an output path, similar to wget):

datalad download-url -O inputs/images/chinstrap_02.jpg "https://unsplash.com/photos/8PxCm4HsPX8/download?force=true"

Afterwards, datalad status shows us that there is nothing to save. The download-url command not only downloaded the file, but also performed a datalad save on our behalf. We can use tig to inspect the commit message:

[DATALAD] Download URLs

URLs:
  https://unsplash.com/photos/8PxCm4HsPX8/download?force=true

This is a notable improvement compared to the previous image, because in addition to recording the addition of the picture we also stored its source. What’s more, DataLad is aware of that source, and has all the information needed to remove and reobtain the file on demand… but that’s another topic altogether.

To practice saving changes and to make our example dataset more similar to real-life datasets, let’s add some more files, this time in the form of sidecar metadata. Let’s suppose we want to store the picture author, the license under which the file is available, and, say, the number of penguins visible in the photo. For each image, we will create a yaml file (a simple text file following a set of rules for storing variables) with the same name but a different extension:

nano inputs/images/chinstrap_01.yaml
photographer: Derek Oyen
license: Unsplash License
penguin_count: 3
nano inputs/images/chinstrap_02.yaml
photographer: Derek Oyen
license: Unsplash License
penguin_count: 2

We can use the already familiar datalad save command to record these changes (addition of two files):

datalad save -m "Add sidecar metadata to photos"
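Since these sidecar files are plain text, they are also easy to read programmatically. A minimal Python sketch for our flat key: value layout (for real projects, the PyYAML library’s yaml.safe_load is the more robust choice):

```python
def read_sidecar(text):
    """Parse flat 'key: value' lines into a dictionary."""
    meta = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            meta[key.strip()] = value.strip()
    return meta

sidecar = """photographer: Derek Oyen
license: Unsplash License
penguin_count: 3"""

meta = read_sidecar(sidecar)
print(meta["penguin_count"])  # prints: 3 (as a string; a real yaml parser would give an int)
```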

Breaking things (and repairing them)

A huge appeal of version control lies in the ability to return to a previously recorded state, which enables experimentation without having to worry about breaking things. Let’s demonstrate by breaking things on purpose. Open the README.md file, remove most of its contents and save. You can use cat README.md to display the file contents and make sure that they are, indeed, gone. The datalad status reports that the file changed, but the change has not been saved in the dataset history:

datalad status
modified: README.md (file)

In this situation, you can restore the file to its previously recorded state by running:

git restore README.md

Note that git is the program used by DataLad under the hood for version control. While most dataset operations can be performed using datalad commands, some will require calling git directly. After running git restore, you can use datalad status to see that the dataset is now clean, and cat README.md to see that the original file contents are back as if nothing happened - disaster averted. Finally, check tig to see that the dataset history remained unaffected.
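To see git restore in isolation, here is a throwaway demonstration (git restore requires Git 2.23 or newer; in older versions, git checkout -- <file> does the same):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Test"
git config user.email test@example.com
echo "original content" > README.md
git add README.md
git commit -q -m "Add README"
echo "oops" > README.md        # an unsaved modification
git restore README.md          # discard it: back to the last committed state
content=$(cat README.md)
echo "$content"   # prints: original content
```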

Now, let’s take things one step further and actually datalad save some undesired changes. Open the README.md, wreak havoc, and save it:

nano README.md
# Example dataset

HAHA all description is gone

This time we are committing these changes to the dataset history:

datalad save -m "Break things"

The file was changed, and the changes have been committed. Luckily, git has a method for undoing such changes, git revert, which can work even if subsequent save operations have been performed on the dataset. To call it, we need to know the commit hash (unique identifier) of the change which we want to revert. It is displayed by tig at the bottom of the window and looks like this: 8ddaaad243344f38cd778b013e7e088a5b2aa11b (note: because of the algorithm used by git, yours will be different). Don’t worry, we only need the first couple characters. Find your commit hash and call git revert taking the first few characters (seven should be plenty):

git revert --no-edit 8ddaaad

With the --no-edit option, git revert creates a default commit message; without it, git revert would open your default editor and let you edit the commit message. As before, after reverting the changes, datalad status shows that there is nothing to save, and cat README.md proves that the removed file contents are back. This time, tig shows that git revert created a new commit that reverts the changes (note that recent commits can also be completely removed from history with git reset, but this is beyond the scope of this lesson).
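The whole break-and-revert sequence can be practiced safely in a throwaway repository (not your dataset; names and contents here are made up):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Test"
git config user.email test@example.com
echo "good content" > README.md
git add README.md
git commit -q -m "Add README"
echo "HAHA all gone" > README.md
git add README.md
git commit -q -m "Break things"
bad=$(git rev-parse --short HEAD)      # hash of the commit to undo
git revert --no-edit "$bad" >/dev/null # adds a new commit undoing it
content=$(cat README.md)
echo "$content"   # prints: good content
```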

Data processing

We have demonstrated building a dataset history by collecting data and changing it manually. Now it is time to demonstrate some script-based data processing. Let’s assume that our project requires us to convert the original files to greyscale. We can do this with a simple Python script. First, let’s create two new directories to keep code and outputs, i.e. processing results, in designated places:

mkdir code
mkdir -p outputs/images_greyscale

Now, let’s “write” our custom script. You can download it using wget (below), or copy its content from here and then save it as part of the dataset:

wget -O code/greyscale.py https://github.com/psychoinformatics-de/rdm-course/raw/gh-pages/data/greyscale.py
datalad save -m "Add an image processing script"

This script for greyscale conversion takes two arguments, input_file and output_file. You can check this with python code/greyscale.py --help. Let’s apply it to the first image, placing the output in the outputs/images_greyscale directory with a slightly changed name:

python code/greyscale.py inputs/images/chinstrap_01.jpg outputs/images_greyscale/chinstrap_01_grey.jpg

Note that our working directory is in the root of the dataset, and we are calling the script using relative paths (meaning that they are relative to the working directory, and do not contain the full path to any of the files). This is a good practice: the call looks the same regardless of where the dataset is on our drive.
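The conversion at the heart of such a script is a weighted average of the colour channels. A pure-Python sketch of the standard luma formula (the downloaded script itself relies on the Pillow library instead):

```python
def to_grey(r, g, b):
    # ITU-R BT.601 luma weights, the same convention Pillow's "L" mode uses
    return round(0.299 * r + 0.587 * g + 0.114 * b)

print(to_grey(255, 255, 255))  # white stays white: prints 255
print(to_grey(0, 0, 0))        # black stays black: prints 0
```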

You should be able to verify that the output file has been created and that the image is, indeed, converted to greyscale. Now all that remains is to save the change in the dataset:

datalad save -m "Convert the first image to greyscale"

Let’s take a look at our history with tig. It already looks pretty good: we have recorded all our operations. However, this record is only as good as our descriptions. We can take it one step further.

DataLad has the ability to record the exact command which was used, and all we have to do is prepend datalad run to our command. We can also provide a commit message to datalad run, just as we could with datalad save. Let’s try this on the other image:

datalad run -m "Convert the second image to greyscale" python code/greyscale.py inputs/images/chinstrap_02.jpg outputs/images_greyscale/chinstrap_02_grey.jpg

As we can see, datalad run executes the given command and follows that by automatically calling datalad save to store all changes resulting from this command in the dataset. Let’s take a look at the full commit message with tig (highlight the commit you want to see and press enter):

[DATALAD RUNCMD] Convert the second image to greyscale

=== Do not change lines below ===
{
"chain": [],
"cmd": "python code/greyscale.py inputs/images/chinstrap_02.jpg outputs/images_greyscale/chinstrap_02_grey.jpg",
"dsid": "b4ee3e2b-e132-4957-9987-ca8aad2d8dfc",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [],
"pwd": "."
}
^^^ Do not change lines above ^^^

There is some automatically generated text, and inside we can easily find the command that was executed (under the "cmd" key). The record is stored in JSON format, and as such can be read not just by us, but also by DataLad. This is very useful: we will now be able to rerun the exact command if, for example, the input data changes, the script changes, or we decide to remove the outputs. We won’t try that now, but the command to do so is datalad rerun.
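Because the record is plain JSON, any tool can recover the exact command. A small Python sketch (the record below mimics the one above):

```python
import json

# A run record like the one DataLad embeds between the "Do not change" markers
record = """{
 "chain": [],
 "cmd": "python code/greyscale.py inputs/images/chinstrap_02.jpg outputs/images_greyscale/chinstrap_02_grey.jpg",
 "exit": 0,
 "inputs": [],
 "outputs": [],
 "pwd": "."
}"""

info = json.loads(record)
print(info["cmd"].split()[0])  # prints: python
```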

Locking and unlocking

Let’s try something else: editing an image which already exists. We have done so with text files, so why should it be different?

Let’s try doing something nonsensical: using the first input image (chinstrap_01.jpg) and writing its greyscale version onto the second output image (chinstrap_02_grey.jpg). Of course the computer doesn’t know what makes sense - the only thing which might stop us is that we will be writing to a file which already exists. This time we will skip datalad run to avoid creating a record of our little mischief:

python code/greyscale.py inputs/images/chinstrap_01.jpg outputs/images_greyscale/chinstrap_02_grey.jpg
Traceback (most recent call last):
  File "/home/bob/Documents/rdm-warmup/example-dataset/code/greyscale.py", line 20, in <module>
    grey.save(args.output_file)
  File "/home/bob/Documents/rdm-temporary/venv/lib/python3.9/site-packages/PIL/Image.py", line 2232, in save
    fp = builtins.open(filename, "w+b")
PermissionError: [Errno 13] Permission denied: 'outputs/images_greyscale/chinstrap_02_grey.jpg'

Something went wrong: PermissionError: [Errno 13] Permission denied says the message. What happened? Why don’t we have the permission to change the existing output file? Why didn’t we run into the same problems when editing text files? To answer that question we have to introduce the concept of annexed files and go back to the moment when we created our dataset.

DataLad uses two mechanisms to control files: git and git-annex. This duality exists because it is not possible to efficiently store large files in git: while git is especially good at tracking text files (and can also handle files other than text), it would quickly run into performance issues with large binary content. We will refer to the files controlled by git-annex as annexed files. There are no exact rules for what counts as a large file, but a boundary between “regular” and annexed files has to be drawn somewhere.

Let’s look at the first two commit messages in tig. The second says:

o Instruct annex to add text files to Git

Remember how we created the dataset with datalad create -c text2git my-dataset? The -c text2git option defined the distinction in a particular way: text files are controlled with git, other (binary) files are annexed. By default (without text2git) all files would be annexed. There are also other predefined configuration options, and it’s easy to tweak the setting manually (however, we won’t do this in this tutorial). As a general rule you will probably want to hand some text files to git (code, descriptions), and annex others (especially those huge in size or number). In other words, while text2git works well for our example, you should not treat it as the default approach.
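Under the hood, the text2git procedure works by placing a rule in the .gitattributes file we saw earlier. On a dataset created with -c text2git, cat .gitattributes should show a line along these lines (the exact form may differ between DataLad versions):

```
* annex.largefiles=((mimeencoding=binary)and(largerthan=0))
```

It instructs git-annex to treat only binary content as “large” (annexed), leaving text files to git.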

One essential by-product of the above distinction is that annexed files are write-protected to prevent accidental modifications:

git vs git-annex

If we do want to edit the annexed file, we have to unlock it:

datalad unlock outputs/images_greyscale/chinstrap_02_grey.jpg

Now, the operation should succeed:

python code/greyscale.py inputs/images/chinstrap_01.jpg outputs/images_greyscale/chinstrap_02_grey.jpg

We can open the image to see that it changed, and check:

datalad status
modified: outputs/images_greyscale/chinstrap_02_grey.jpg (file)

The file will be locked again after running datalad save:

datalad save -m "Make a mess by overwriting"

We could revert the changes we just saved, but instead let’s overwrite the file using the correct inputs, to demonstrate another feature of datalad run. The sequence of actions we just performed (unlock - change - save) is not uncommon, and datalad run has provisions to make all three happen at once, without the explicit unlock call. What we need is the --output argument, which tells datalad to prepare the given file for writing (unlock it). Additionally, we will use the --input option (which tells datalad that this file is needed to run the command). Although --input is not necessary in the current example, we introduce it for future use. Finally, to avoid repetition, we will use the {inputs} and {outputs} placeholders in the run call.

datalad run \
    --input inputs/images/chinstrap_02.jpg \
    --output outputs/images_greyscale/chinstrap_02_grey.jpg \
    -m "Convert the second image again" \
    python code/greyscale.py {inputs} {outputs}
[INFO   ] Making sure inputs are available (this may take some time)
unlock(ok): outputs/images_greyscale/chinstrap_02_grey.jpg (file)
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
add(ok): outputs/images_greyscale/chinstrap_02_grey.jpg (file)

Success! Time to look at the images, and then check the dataset history with tig. The commit message contains the following:

[DATALAD RUNCMD] Convert the second image again

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python code/greyscale.py '{inputs}' '{outputs}'",
 "dsid": "b4ee3e2b-e132-4957-9987-ca8aad2d8dfc",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [
  "inputs/images/chinstrap_02.jpg"
 ],
 "outputs": [
  "outputs/images_greyscale/chinstrap_02_grey.jpg"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^

Making some more additions

Let’s make a few more changes to the dataset. We will return to it in the workshop module on remote collaboration. As an exercise, do the following steps using DataLad commands:

  • Download a third image into inputs/images, recording its origin ("https://unsplash.com/photos/8fmTByMm8wE/download?force=true")

  • Create a sidecar yaml file with its metadata, and save the change

  • Add image credits to the README, and save the change

Solution

Download file using download-url:

datalad download-url \
  -m "Add third image" \
  -O inputs/images/king_01.jpg \
  "https://unsplash.com/photos/8fmTByMm8wE/download?force=true"

Create the yaml file, e.g. using nano, and update the dataset:

nano inputs/images/king_01.yaml
# paste the contents and save
datalad save -m "Add a description to the third picture"

Edit the readme file, e.g. using nano, and update the dataset:

nano README.md
# paste the contents and save
datalad save -m "Add credit to README"

Key Points

  • With version control, lineage of all files is preserved

  • You can record and revert changes made to the dataset

  • DataLad can be used to version control a dataset and all its files

  • You can manually save changes with datalad save

  • You can use datalad download-url to preserve file origin

  • You can use datalad run to capture outputs of a command

  • “Large” files are annexed, and protected from accidental modifications