Stephan Heunis (jsheunis, @jsheunis@mas.to)
Michał Szczepanik (mslw, @doktorpanik@masto.ai)
Psychoinformatics lab, Institute of Neuroscience and Medicine (INM-7), Research Center Jülich
sh -c 'tr "a-z" "A-Z" < inputpath > outputpath'
can convert text in this fashion.
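To try the conversion on a concrete file, a minimal sketch could look like this. The file names `input.txt` and `output.txt` and the sample text are hypothetical stand-ins for `inputpath` and `outputpath`.

```shell
# Create a small input file (hypothetical name and content, for illustration only).
printf 'hello datalad\n' > input.txt

# Run the same tr call as above: read input.txt, write the upper-cased text to output.txt.
sh -c 'tr "a-z" "A-Z" < input.txt > output.txt'

# Show the converted text.
cat output.txt
```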
tr command made.
$ datalad create -c text2git my-dataset
Procedurally, version control is easy with DataLad!
datalad download-url is helpful
datalad create creates an empty dataset.
datalad save records the dataset or file state to the history.
datalad download-url obtains web content and records its origin.
datalad status reports the current state of the dataset.
text2git configuration
tig
datalad status
datalad save
tree
datalad download-url
git restore is a dangerous (!), but sometimes useful command:
git revert [hash] transparently undoes a past commit
git checkout
git rebase changes and git reset rewinds history without creating a commit about it (see Handbook chapter for examples).
git reflog
git restore and git clean.
git restore
datalad save
git revert
Git does not handle large files well.
And repository hosting services refuse to handle large files:
git-annex to the rescue! Let's take a look at how it works.
$ datalad create mydataset
[INFO] Creating a new annex repo at /home/adina/mydataset
create(ok): /home/adina/mydataset (dataset)
but datasets can also be installed from paths or from URLs:
$ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess HCP
install(ok): /tmp/HCP (dataset)
Hint: Did you know that you can get the Human Connectome Project Open Access Data as a Dataset?
$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
install(ok): /tmp/studyforrest-data-phase2 (dataset)
$ cd studyforrest-data-phase2 && du -sh
18M .
$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
# eNKI dataset (1.5TB, 34k files):
$ du -sh
1.5G .
# HCP dataset (~200TB, >15 million files)
$ du -sh
48G .
Git | git-annex |
handles small files well (text, code) | handles all types and sizes of files well |
file contents are in the Git history and will be shared upon git/datalad push | file contents are in the annex. Not necessarily shared |
Shared with every dataset clone | Can be kept private on a per-file level when sharing the dataset |
Useful: Small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files | Useful: Large files, private files |
datalad get from wherever it is stored.
text2git, yoda) or created and shared by users (Tutorial)
.gitattributes (e.g., based on file type, file/path name, size, ...; rules and examples)
datalad save --to-git)
$ git annex whereis inputs/images/chinstrap_02.jpg
whereis inputs/images/chinstrap_02.jpg (1 copy)
00000000-0000-0000-0000-000000000001 -- web
c1bfc615-8c2b-4921-ab33-2918c0cbfc18 -- adina@muninn:/tmp/my-dataset [here]
web: https://unsplash.com/photos/8PxCm4HsPX8/download?force=true
ok
$ datalad drop inputs/images/chinstrap_02.jpg
drop(ok): /home/my-dataset/inputs/images/chinstrap_02.jpg (file)
$ datalad get inputs/images/chinstrap_02.jpg
get(ok): inputs/images/chinstrap_02.jpg (file)
$ datalad drop inputs/images/chinstrap_01.jpg
drop(error): inputs/images/chinstrap_01.jpg (file)
[unsafe; Could only verify the existence of 0 out of 1 necessary copy;
(Use --reckless availability to override this check, or adjust numcopies.)]
Delineation and advantages of decentral versus central RDM: In defense of decentralized research data management
$ ls -l inputs/images/chinstrap_01.jpg
lrwxrwxrwx 1 adina adina 132 Apr 5 20:53 inputs/images/chinstrap_01.jpg -> ../../.git/annex/objects/1z/xP/MD5E-s725496--2e043a5654cec96aadad554fda2a8b26.jpg/MD5E-s725496--2e043a5654cec96aadad554fda2a8b26.jpg
(PS: especially useful in datasets with many identical files)
$ md5sum inputs/images/chinstrap_01.jpg
2e043a5654cec96aadad554fda2a8b26 inputs/images/chinstrap_01.jpg
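The listing above shows that the symlink target carries the file size and the MD5 checksum of the content in its name. The following is a toy imitation of that idea with plain coreutils, not git-annex's actual implementation. The directory `annex-demo`, the flat `objects/` layout, and the file names are assumptions made for illustration.

```shell
# Toy imitation of checksum-based storage; git-annex itself does more
# (for example hashed subdirectories such as 1z/xP and its own bookkeeping).
mkdir -p annex-demo/objects
cd annex-demo

printf 'penguin\n' > photo.txt

# Build a key from file size and MD5 checksum, mirroring the MD5E-s<size>--<md5>.<ext> pattern.
size=$(wc -c < photo.txt | tr -d ' ')
sum=$(md5sum photo.txt | cut -d ' ' -f 1)
key="MD5E-s${size}--${sum}.txt"

# Move the content into the object store, write-protect it, and leave a symlink behind.
mv photo.txt "objects/$key"
chmod a-w "objects/$key"
ln -s "objects/$key" photo.txt

# A file with identical content would yield the same key, so a second
# symlink to the same object is enough; no extra copy is stored.
ln -s "objects/$key" photo_copy.txt

ls -l photo.txt photo_copy.txt
cd ..
```

Because the key depends only on the content, identical files point to one stored object, which is why this scheme pays off in datasets with many identical files.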
datalad run wraps a command execution and records its impact on a dataset.
commit 9fbc0c18133aa07b215d81b808b0a83bf01b1984 (HEAD -> main)
Author: Adina Wagner [adina.wagner@t-online.de]
Date: Mon Apr 18 12:31:47 2022 +0200
[DATALAD RUNCMD] Convert the second image to greyscale
=== Do not change lines below ===
{
"chain": [],
"cmd": "python code/greyscale.py inputs/images/chinstrap_02.jpg outputs/im>
"dsid": "418420aa-7ab7-4832-a8f0-21107ff8cc74",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [],
"pwd": "."
}
^^^ Do not change lines above ^^^
diff --git a/outputs/images_greyscale/chinstrap_02_grey.jpg b/outputs/images_gr>
new file mode 120000
index 0000000..5febc72
--- /dev/null
+++ b/outputs/images_greyscale/chinstrap_02_grey.jpg
@@ -0,0 +1 @@
+../../.git/annex/objects/19/mp/MD5E-s758168--8e840502b762b2e7a286fb5770f1ea69.>
\ No newline at end of file
The resulting commit's hash (or any other identifier) can be used to automatically re-execute a computation (more on this tomorrow)
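The general idea of wrapping a command and keeping a record of it can be sketched in a few lines of shell. This is only a toy stand-in and not how datalad run works internally: the function name `record_run`, the log file `runlog.tsv`, and the example command are all made up for illustration.

```shell
# Toy stand-in for the idea of wrapping a command and recording it.
# NOT DataLad's implementation; record_run and runlog.tsv are hypothetical names.
record_run() {
    message=$1
    shift
    # Execute the wrapped command exactly as given.
    "$@"
    status=$?
    # Append message, exit code, and command line to a tab-separated log.
    printf '%s\texit=%s\t%s\n' "$message" "$status" "$*" >> runlog.tsv
    return "$status"
}

record_run "Convert greeting to upper case" sh -c 'echo hello | tr a-z A-Z > upper.txt'
cat runlog.tsv
```

In contrast to this sketch, the run record shown above is stored in the commit message together with the changes the command made, which is what allows a later re-execution from the commit hash.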
Traceback (most recent call last):
File "/home/bob/Documents/rdm-warmup/example-dataset/code/greyscale.py", line 20, in <module>
grey.save(args.output_file)
File "/home/bob/Documents/rdm-temporary/venv/lib/python3.9/site-packages/PIL/Image.py", line 2232, in save
fp = builtins.open(filename, "w+b")
PermissionError: [Errno 13] Permission denied: 'outputs/images_greyscale/chinstrap_02_grey.jpg'
$ datalad unlock outputs/images_greyscale/chinstrap_02_grey.jpg
$ ls -l outputs/images_greyscale/chinstrap_02_grey.jpg
-rw-r--r-- 1 adina adina 758168 Apr 18 12:31 outputs/images_greyscale/chinstrap_02_grey.jpg
datalad run wraps a command execution and records its impact on a dataset.
In addition, it can take care of data retrieval and unlocking.
datalad rerun is helpful to spare others and yourself the short- or long-term memory task, or the forensic skills to figure out how you performed an analysis.
datalad rerun to rerun the script execution.
Find out if the output changed
--input are retrieved prior to command execution, data/directories specified as --output unlocked.
datalad rerun can automatically re-execute run-records later.
datalad save
datalad run
datalad unlock
datalad save
or datalad run
datalad drop
and datalad remove
$ datalad clone https://github.com/datalad-datasets/machinelearning-books.git
$ cd machinelearning-books
$ datalad get A.Shashua-Introduction_to_Machine_Learning.pdf
$ datalad drop A.Shashua-Introduction_to_Machine_Learning.pdf
drop(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file)
[checking https://arxiv.org/pdf/0904.3664v1.pdf...]
$ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess.git
$ cd human-connectome-project-openaccess
$ datalad get -n HCP1200/996782
$ datalad drop --what all HCP1200/996782
# The command operates outside of the to-be-removed dataset!
$ datalad remove -d . machinelearning-books
uninstall(ok): /tmp/machinelearning-books (dataset)
$ datalad create local-dataset
$ cd local-dataset
$ echo "This file content will only exist locally" > local-file.txt
$ datalad save -m "Added a file without remote content availability"
$ datalad drop local-file.txt
drop(error): local-file.txt (file) [unsafe; Could only verify the existence of 0 out of 1 necessary copy;
(Note that these git remotes have annex-ignore set: origin upstream);
(Use --reckless availability to override this check, or adjust numcopies.)]
$ datalad drop local-file.txt --reckless availability
$ datalad remove -d local-dataset
uninstall(error): . (dataset) [to-be-dropped dataset has revisions that are not available at any known
sibling. Use `datalad push --to ...` to push these before dropping the local dataset,
or ignore via `--reckless availability`. Unique revisions: ['main']]
$ datalad remove -d local-dataset --reckless availability
$ datalad create local-dataset
$ cd local-dataset
$ echo "This file content will only exist locally" > local-file.txt
$ datalad save -m "Added a file without remote content availability"
$ rm -rf local-dataset
rm: cannot remove 'local-dataset/.git/annex/objects/Kj/44/MD5E-s42--8f008874ab52d0ff02a5bbd0174ac95e.txt/MD5E-s42--8f008874ab52d0ff02a5bbd0174ac95e.txt': Permission denied
$ chmod +w -R local-dataset
$ rm -rf local-dataset
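The "Permission denied" error above comes from write protection on the annexed content. The behaviour can be reproduced without DataLad in a small sketch; the directory `protected-demo` and its layout are hypothetical and only mimic a write-protected object store.

```shell
# Hypothetical directory layout that mimics a write-protected object store.
mkdir -p protected-demo/objects/key
printf 'content\n' > protected-demo/objects/key/file.txt

# Remove write permission recursively, similar in spirit to what protects annexed content.
chmod -R a-w protected-demo/objects

# For a regular (non-root) user, a plain `rm -rf protected-demo` would now
# report "Permission denied". Restoring write permission first lets the removal succeed.
chmod -R u+w protected-demo
rm -rf protected-demo
```

Where possible, datalad drop and datalad remove are the safer route, since they check content availability before anything is deleted.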
datalad clone
datalad get
git annex whereis
datalad drop
datalad siblings
datalad remove
datalad drop/remove --reckless availability