Stephan Heunis (jsheunis)
Michał Szczepanik (mslw)
Psychoinformatics lab, Institute of Neuroscience and Medicine (INM-7), Research Center Jülich
sh -c 'tr "a-z" "A-Z" < inputpath > outputpath' converts the text in inputpath to upper case and writes the result to outputpath.
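As a minimal sketch of this conversion, the following creates a throwaway input file and runs the same tr call on it. The temporary directory and the sample sentence are assumptions made for illustration; only the tr invocation itself comes from the text above.

```shell
# Work in a scratch directory so no real dataset is touched
# (directory and sample text are illustrative assumptions).
workdir=$(mktemp -d)
cd "$workdir"

# Create a small input file.
printf 'version control is easy\n' > inputpath

# The conversion from the text: map every lowercase letter to its uppercase form.
sh -c 'tr "a-z" "A-Z" < inputpath > outputpath'

# Show the converted text.
cat outputpath
```

After the call, outputpath holds the same sentence in upper case, while inputpath is left unchanged.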
tr command made.
$ datalad create -c text2git my-dataset
Procedurally, version control is easy with DataLad!
datalad download-url is helpful
datalad create creates an empty dataset.
datalad save records the dataset or file state to the history.
datalad download-url obtains web content and records its origin.
datalad status reports the current state of the dataset.
text2git configuration
tig
datalad status
datalad save
tree
datalad download-url
git restore is a dangerous (!), but sometimes useful command:
git revert [hash] transparently undoes a past commit
git checkout
git rebase changes and git reset rewinds history without creating a commit about it (see Handbook chapter for examples).
git reflog
git restore and git clean.
git restore
datalad save
git revert
Git does not handle large files well.
And repository hosting services refuse to handle large files:


git-annex to the rescue! Let's take a look at how it works.
$ datalad create mydataset
[INFO] Creating a new annex repo at /home/adina/mydataset
create(ok): /home/adina/mydataset (dataset)
but datasets can also be installed from paths or from URLs:
$ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess HCP
install(ok): /tmp/HCP (dataset)
Hint: Did you know that you can get the Human Connectome Project Open Access Data as a Dataset?
Try it yourself with github.com/datalad-datasets/machinelearning-books.git
$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
install(ok): /tmp/studyforrest-data-phase2 (dataset)
$ cd studyforrest-data-phase2 && du -sh
18M .
$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
# eNKI dataset (1.5TB, 34k files):
$ du -sh
1.5G .
# HCP dataset (~200TB, >15 million files)
$ du -sh
48G .
| Git | git-annex |
| --- | --- |
| handles small files well (text, code) | handles all types and sizes of files well |
| file contents are in the Git history and will be shared upon git/datalad push | file contents are in the annex. Not necessarily shared |
| Shared with every dataset clone | Can be kept private on a per-file level when sharing the dataset |
| Useful: Small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files | Useful: Large files, private files |
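Which column of the table above a file ends up in can be steered with annex.largefiles rules in a .gitattributes file. The sketch below writes two such rules into a scratch directory; the 20kb threshold, the README file name, and the scratch location are illustrative assumptions, not settings taken from this dataset.

```shell
# Scratch directory standing in for a dataset root (an assumption for illustration).
attrdir=$(mktemp -d)

# Write example rules; lines starting with '#' are comments in .gitattributes.
cat > "$attrdir/.gitattributes" <<'EOF'
# Annex every file larger than 20 KB, keep smaller files in Git.
* annex.largefiles=(largerthan=20kb)
# Always keep the README in Git, whatever its size.
README annex.largefiles=nothing
EOF

# Show the resulting configuration file.
cat "$attrdir/.gitattributes"
```

Later rules take precedence for paths they match, so the README rule overrides the general size rule.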
datalad get from wherever it is stored.
text2git, yoda) or created and shared by users (Tutorial)
.gitattributes (e.g., based on file type, file/path name, size, ...; rules and examples)
datalad save --to-git)
$ git annex whereis inputs/images/chinstrap_02.jpg
whereis inputs/images/chinstrap_02.jpg (1 copy)
00000000-0000-0000-0000-000000000001 -- web
c1bfc615-8c2b-4921-ab33-2918c0cbfc18 -- adina@muninn:/tmp/my-dataset [here]
web: https://unsplash.com/photos/8PxCm4HsPX8/download?force=true
ok
$ datalad drop inputs/images/chinstrap_02.jpg
drop(ok): /home/my-dataset/inputs/images/chinstrap_02.jpg (file)
$ datalad get inputs/images/chinstrap_02.jpg
get(ok): inputs/images/chinstrap_02.jpg (file)
$ datalad drop inputs/images/chinstrap_01.jpg
drop(error): inputs/images/chinstrap_01.jpg (file)
[unsafe; Could only verify the existence of 0 out of 1 necessary copy;
(Use --reckless availability to override this check, or adjust numcopies.)]
Delineation and advantages of decentral versus central RDM: In defense of decentralized research data management
$ ls -l inputs/images/chinstrap_01.jpg
lrwxrwxrwx 1 adina adina 132 Apr 5 20:53 inputs/images/chinstrap_01.jpg -> ../../.git/annex/objects/1z/
xP/MD5E-s725496--2e043a5654cec96aadad554fda2a8b26.jpg/MD5E-s725496--2e043a5654cec96aadad554fda2a8b26.jpg
(PS: especially useful in datasets with many identical files)
$ md5sum inputs/images/chinstrap_01.jpg
2e043a5654cec96aadad554fda2a8b26 inputs/images/chinstrap_01.jpg
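The listing above shows that the file name is a symlink whose target is named after the MD5 checksum of the content. The following is a much simplified sketch of that idea, not git-annex's actual implementation: the objects directory, the MD5-<checksum> key format, and the sample files are hypothetical, and md5sum is assumed to be available as in the transcript.

```shell
# Scratch directory standing in for a dataset (an illustrative assumption).
demo=$(mktemp -d)
cd "$demo"
mkdir -p objects

# Two files with identical content.
printf 'same bytes\n' > a.txt
cp a.txt b.txt

for f in a.txt b.txt; do
    # Derive the storage key from the content, not from the file name.
    sum=$(md5sum "$f" | cut -d ' ' -f 1)
    key="MD5-$sum"
    if [ -e "objects/$key" ]; then
        # Content is already stored under this key: drop the duplicate.
        rm "$f"
    else
        # First occurrence: move the content into the object store.
        mv "$f" "objects/$key"
    fi
    # Replace the file with a symlink to the stored content.
    ln -s "objects/$key" "$f"
done

# Both names now point to the same single stored object.
ls -l a.txt b.txt
ls objects
```

Because the key depends only on the content, identical files share one stored object, which is why this layout pays off in datasets with many identical files.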
datalad run wraps a command execution and records its impact on a dataset.
commit 9fbc0c18133aa07b215d81b808b0a83bf01b1984 (HEAD -> main)
Author: Adina Wagner [adina.wagner@t-online.de]
Date: Mon Apr 18 12:31:47 2022 +0200
[DATALAD RUNCMD] Convert the second image to greyscale
=== Do not change lines below ===
{
"chain": [],
"cmd": "python code/greyscale.py inputs/images/chinstrap_02.jpg outputs/im>
"dsid": "418420aa-7ab7-4832-a8f0-21107ff8cc74",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [],
"pwd": "."
}
^^^ Do not change lines above ^^^
diff --git a/outputs/images_greyscale/chinstrap_02_grey.jpg b/outputs/images_gr>
new file mode 120000
index 0000000..5febc72
--- /dev/null
+++ b/outputs/images_greyscale/chinstrap_02_grey.jpg
@@ -0,0 +1 @@
+../../.git/annex/objects/19/mp/MD5E-s758168--8e840502b762b2e7a286fb5770f1ea69.>
\ No newline at end of file
The resulting commit's hash (or any other identifier) can be used to automatically re-execute a computation (more on this tomorrow)
Traceback (most recent call last):
File "/home/bob/Documents/rdm-warmup/example-dataset/code/greyscale.py", line 20, in <module>
grey.save(args.output_file)
File "/home/bob/Documents/rdm-temporary/venv/lib/python3.9/site-packages/PIL/Image.py", line 2232, in save
fp = builtins.open(filename, "w+b")
PermissionError: [Errno 13] Permission denied: 'outputs/images_greyscale/chinstrap_02_grey.jpg'
$ datalad unlock outputs/images_greyscale/chinstrap_02_grey.jpg
$ ls outputs/images_greyscale/chinstrap_02_grey.jpg
-rw-r--r-- 1 adina adina 758168 Apr 18 12:31 outputs/images_greyscale/chinstrap_02_grey.jpg
datalad run wraps a command execution and records its impact on a dataset.
In addition, it can take care of data retrieval and unlocking
datalad rerun is helpful to spare others and yourself
the short- or long-term memory task, or the forensic skills to figure
out how you performed an analysis
datalad rerun to rerun the script execution.
Find out if the output changed
Data/directories specified as --input are retrieved prior to command execution, data/directories specified as --output are unlocked.
datalad rerun can automatically re-execute run-records later.
datalad save
datalad run
datalad unlock
datalad save or datalad run
datalad drop and datalad remove
$ datalad clone https://github.com/datalad-datasets/machinelearning-books.git
$ cd machinelearning-books
$ datalad get A.Shashua-Introduction_to_Machine_Learning.pdf
$ datalad drop A.Shashua-Introduction_to_Machine_Learning.pdf
drop(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file)
[checking https://arxiv.org/pdf/0904.3664v1.pdf...]
$ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess.git
$ cd human-connectome-project-openaccess
$ datalad get -n HCP1200/996782
$ datalad drop --what all HCP1200/996782
# The command operates outside of the to-be-removed dataset!
$ datalad remove -d . machinelearning-books
uninstall(ok): /tmp/machinelearning-books (dataset)
$ datalad create local-dataset
$ cd local-dataset
$ echo "This file content will only exist locally" > local-file.txt
$ datalad save -m "Added a file without remote content availability"
$ datalad drop local-file.txt
drop(error): local-file.txt (file) [unsafe; Could only verify the existence of 0 out of 1 necessary copy;
(Note that these git remotes have annex-ignore set: origin upstream);
(Use --reckless availability to override this check, or adjust numcopies.)]
$ datalad drop local-file.txt --reckless availability
$ datalad remove -d local-dataset
uninstall(error): . (dataset) [to-be-dropped dataset has revisions that are not available at any known
sibling. Use `datalad push --to ...` to push these before dropping the local dataset,
or ignore via `--reckless availability`. Unique revisions: ['main']]
$ datalad remove -d local-dataset --reckless availability
$ datalad create local-dataset
$ cd local-dataset
$ echo "This file content will only exist locally" > local-file.txt
$ datalad save -m "Added a file without remote content availability"
$ rm -rf local-dataset
rm: cannot remove 'local-dataset/.git/annex/objects/Kj/44/MD5E-s42--8f008874ab52d0ff02a5bbd0174ac95e.txt/
MD5E-s42--8f008874ab52d0ff02a5bbd0174ac95e.txt': Permission denied
$ chmod +w -R local-dataset
$ rm -rf local-dataset
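The chmod step above is needed because the stored content and its directory are write-protected. The sketch below imitates that situation without DataLad: the scratch directory, the objects folder, and the file name are hypothetical stand-ins for the annex object store.

```shell
# Scratch location with a stand-in for a protected object store
# (all names here are illustrative assumptions).
scratch=$(mktemp -d)
target="$scratch/local-dataset"
mkdir -p "$target/objects"
printf 'This file content will only exist locally\n' > "$target/objects/content.txt"

# Remove write permission from the file and its directory.
chmod -R a-w "$target/objects"

# A plain rm -rf fails for unprivileged users; root bypasses permission checks.
if ! rm -rf "$target" 2>/dev/null; then
    echo "rm failed: content is write-protected"
    # Restore write permission for the owner recursively, then remove again.
    chmod -R u+w "$target"
    rm -rf "$target"
fi
```

Restoring write permission on the directories is what allows rm to unlink the files inside them.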
datalad clone
datalad get
git annex whereis
datalad drop
datalad siblings
datalad remove
datalad drop/remove --reckless availability