Extras: Removing datasets and files

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • How can I remove files or datasets?

Objectives
  • Learn how to remove dataset content

  • Remove unwanted datasets

This is an extra lesson!

Unlike the main Episodes of this course, this lesson focuses on only a specific set of commands independent of a larger storyline. It aims to provide a practical hands-on walk-through of removing annexed files or datasets and the different safety checks that can be in place to prevent data loss. A complete overview of file system operations, including renaming or moving files, is in handbook.datalad.org/basics/101-136-filesystem.html.

Prerequisites

This episode requires DataLad version 0.16 or higher due to minor API changes in drop and remove that were introduced with 0.16.
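If you are unsure which version you have installed, you can check it on the command line:

# print the installed DataLad version
datalad --version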

Drop and remove

Because we’re dealing with version control tools whose job it is to keep your data safe, normal file system operations to get rid of files will not yield the intended results with DataLad datasets. If you were to run “rm” on a file or <right-click> - <delete> it, the file would still be in your dataset’s history. If you were to run rm -rf on a dataset or <right-click> - <delete> it, your screen would likely be swamped with permission denied errors. This section specifically highlights how you can and can’t remove annexed contents or complete datasets using the commands datalad drop and datalad remove.

drop removes file contents from your local dataset annex, and remove wipes complete datasets from your system without leaving a trace of them. Even though these commands are potentially destructive, they have many legitimate use cases - drop is most useful to save disk space by removing the content of files you do not currently or permanently need, and remove is most useful to get rid of a dataset that you no longer need at all.
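In their most basic form, both commands take their target as an argument; the file and dataset names below are only placeholders:

# drop the content of an annexed file inside a dataset (placeholder file name)
datalad drop somefile.pdf
# remove a complete dataset, called from outside of it (placeholder dataset name)
datalad remove -d somedataset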

drop the easy way: file content with verified remote availability

By default, both commands make sure that you wouldn’t lose any data when they are run. To demonstrate dropping file content, let’s start by cloning a public dataset with a few smallish files (less than 80MB in total). This dataset is a collection of machine-learning books and is available from GitHub with the following clone command:

# make sure to run this command outside of any existing dataset
datalad clone https://github.com/datalad-datasets/machinelearning-books.git

The PDFs in this dataset are all annexed contents, and can be retrieved using datalad get:

cd machinelearning-books
datalad get A.Shashua-Introduction_to_Machine_Learning.pdf

Once you have obtained one file’s contents, they are known to exist both in their registered remote location and locally on your system:

git annex whereis A.Shashua-Introduction_to_Machine_Learning.pdf
whereis A.Shashua-Introduction_to_Machine_Learning.pdf (2 copies)
  00000000-0000-0000-0000-000000000001 -- web
  0d757ca9-ea20-4646-96cb-19dccd732f8c -- adina@muninn:/tmp/machinelearning-books [here]

  web: https://arxiv.org/pdf/0904.3664v1.pdf
ok

This output shows that the file contents of this book exist locally and “on the web”, more precisely under the URL arxiv.org/pdf/0904.3664v1.pdf.

A datalad drop <file> performs an internal check whether the to-be-dropped file content could be re-obtained before dropping it. When the file content exists in more locations than just your local system, and at least one of those other locations is successfully probed to have this content, dropping the file content succeeds. Here is a successful drop:

datalad drop A.Shashua-Introduction_to_Machine_Learning.pdf
drop(ok): A.Shashua-Introduction_to_Machine_Learning.pdf (file)

remove the easy way: datasets with remote availability

To demonstrate removing datasets, let’s remove the machinelearning-books dataset.

Because this dataset was cloned from a GitHub repository, it has a registered sibling:

datalad siblings
.: here(+) [git]
.: origin(-) [https://github.com/datalad-datasets/machinelearning-books.git (git)]

remove performs an internal check whether all changes in the local dataset are also available from a known sibling. If the dataset has a sibling, and that sibling has all commits that the local dataset has, remove succeeds.
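If you want to see for yourself whether any local commits are missing from a sibling, a plain Git query can help - a sketch, assuming the sibling is configured as a regular Git remote, as origin is here:

# list local commits that are not present on any configured remote
git log --branches --not --remotes --oneline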

Note that remove needs to be called from outside of the to-be-removed dataset - it cannot “pull the rug from underneath its own feet”. Make sure to use -d/--dataset to point to the correct dataset:

cd ../
datalad remove -d machinelearning-books
uninstall(ok): . (dataset)

drop with disabled availability checks

When drop’s check for continued file content availability from other locations fails, the drop command will fail, too. To demonstrate this, let’s create a new dataset, add a file into it, and save it.

datalad create local-dataset
cd local-dataset
echo "This file content will only exist locally" > local-file.txt
datalad save -m "Added a file without remote content availability"

Verify that its content only has 1 copy, and that it is only available from the local dataset:

git annex whereis local-file.txt
whereis local-file.txt (1 copy)
  0d757ca9-ea20-4646-96cb-19dccd732f8c -- adina@muninn:/tmp/local-dataset [here]
ok

Running datalad drop on this file will now fail:

datalad drop local-file.txt
drop(error): local-file.txt (file) [unsafe; Could only verify the existence of 0 out of 1 necessary copy; (Note that these git remotes have annex-ignore set: origin upstream); (Use --reckless availability to override this check, or adjust numcopies.)]

Impeded file availability despite multiple registered sources

The check for continued file content availability from other locations fails when no other location is registered, but it can also fail if a registered location is currently unavailable - for example because you have no internet connection, because the webserver hosting your content is offline for maintenance, or because the registered location is an external hard drive that is currently not plugged in.

The error message drop displays hints at two ways to drop the file content even though no other available copy could be verified: via a git-annex configuration (“adjust numcopies”, which by default requires that at least one other copy remains) or via DataLad’s --reckless parameter.
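The numcopies route is a git-annex setting rather than a DataLad one. A minimal sketch of inspecting and changing it - the value 2 below is only an illustration; raising numcopies makes the check stricter, while a lower value loosens it:

# show the current numcopies setting (git-annex's default is 1)
git annex numcopies
# require that at least two copies of each file content exist
git annex numcopies 2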

The --reckless parameter has multiple modes, and each mode disables one type of safety check. Because it is the “availability” check that fails here, --reckless availability will disable it:

datalad drop local-file.txt --reckless availability

The file content of local-file.txt is irretrievably gone now.
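You can convince yourself of this: in datasets that store annexed files as symlinks, the file now points to content that no longer exists, so reading it fails (the exact behaviour differs in adjusted or unlocked checkouts):

# the symlink points to dropped content, so reading the file fails
cat local-file.txt
# cat: local-file.txt: No such file or directory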

Dropping multiple versions of a file

If a file has been updated and saved multiple times, each saved version of its content is kept in the dataset’s annex, and a plain drop only removes the content of the most recent version. Consider the following example:

  datalad run -m "Make a first version" --output local-file-multiple-revision.txt "echo 'This is the first revision' > {outputs}"
  # overwrite the contents of the file 
  datalad run -m "Make a second version" --output local-file-multiple-revision.txt "echo 'This is the second revision' > {outputs}"

datalad drop --reckless availability local-file-multiple-revision.txt would drop the current revision of the file, but the first version (with the content “This is the first revision”) would still exist in the dataset’s history. To verify this, git checkout the first run commit and check the contents of the file.

  git checkout HEAD~1
  cat local-file-multiple-revision.txt 
  # outputs "This is the first revision"
  git checkout main     # or master, depending on your default branch

In order to remove all past versions of this file, too, you can run git annex unused to find unused data, and git annex dropunused (potentially with a --force flag) to drop it:

  git annex unused
  unused . (checking for unused data...) (checking origin/HEAD...) (checking update-book-2...) (checking update-book...) (checking main...)
  Some annexed data is no longer used by any files:
      NUMBER  KEY
      1       MD5E-s27--f90c649b1fe585564eb5cdfdd16ec345.txt
  (To see where this data was previously used, run: git annex whereused --historical --unused)

  To remove unwanted data: git-annex dropunused NUMBER

  ok
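Following the hint at the end of this output, the listed content can then be dropped by its NUMBER. As before, the availability check applies, so dropping content that has no other copy needs --force - a sketch for the listing above:

  # drop the unused content listed as number 1; --force bypasses the remaining-copies check
  git annex dropunused 1 --force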

remove with disabled availability checks

remove checks whether the dataset and all of its revisions are available from a known sibling. When this check fails - either because there is no known sibling, or because the local dataset has branches or commits that are not available at that sibling - remove fails, too. This is the case for the local-dataset:

cd ../
datalad remove -d local-dataset
uninstall(error): . (dataset) [to-be-dropped dataset has revisions that are not available at any known sibling. Use `datalad push --to ...` to push these before dropping the local dataset, or ignore via `--reckless availability`. Unique revisions: ['main']]

remove advises to either push the “unique revisions” to a different place (i.e., to create a sibling that hosts your version-controlled changes), or, as was done for drop, to disable the availability check with --reckless availability.
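The first route would look roughly like this - a sketch only; the sibling name and the SSH address are hypothetical, and any location you can push to works:

# create a sibling on a server you control (hypothetical address) and push everything to it
datalad create-sibling -d local-dataset -s backup myserver.example.org:/path/to/local-dataset
datalad push -d local-dataset --to backup

Here, the dataset and its history are expendable, so we take the second route and disable the check instead: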

datalad remove -d local-dataset --reckless availability
uninstall(ok): . (dataset)

remove gone wrong - how to fix an accidental rm -rf

Removing datasets without datalad remove is a bad idea - for one, because it skips the safety checks that aim to prevent data loss, and for another, because it doesn’t work well. If you were to remove a dataset using common Unix commands such as rm -rf or your operating system’s file browser, you would see write-protected annexed files refusing to be deleted. On the command line, it would look something like this:

rm: cannot remove 'machinelearning-books/.git/annex/objects/zw/WF/URL-s1787416--http&c%%www1.maths.leeds.ac.uk%,126charles%statlog%whole.pdf/URL-s1787416--http&c%%www1.maths.leeds.ac.uk%,126charles%statlog%whole.pdf': Permission denied
rm: cannot remove 'machinelearning-books/.git/annex/objects/jQ/GM/MD5E-s21322662--8689c3c26c3a1ceb60c1ba995d638677.pdf/MD5E-s21322662--8689c3c26c3a1ceb60c1ba995d638677.pdf': Permission denied
[...]

Afterwards, a few left-over but unusable dataset ruins remain. To remove those from your system, you need to make the remaining files writable:

# add write permissions to the directory
chmod -R +w <dataset>
# remove everything that is left
rm -rf <dataset>

Further reading

This overview only covers the surface of all possible file system operations, and not even all ways in which files or datasets can be removed. For a more complete overview, check out handbook.datalad.org/basics/101-136-filesystem.html - the navigation bar on the left of the page lists a number of useful file system operations.

Key Points

  • Your dataset keeps annexed data safe and will refuse to perform operations that could cause data loss

  • Removing files or datasets with known copies is easy; removing files or datasets without known copies requires bypassing safety checks

  • There are two ‘destructive’ commands: drop and remove

  • drop is the antagonist command to get, and remove is the antagonist command to clone

  • Both commands have a --reckless [MODE] parameter to override safety checks