Michał Szczepanik
"What is a good filename?"
Based on the presentations "Naming Things" (CC0) by Jenny Bryan and "Project structure" by Danielle Navarro.
✅ reading01_shakespeare_romeo-and-juliet_act01.docx
✅ reading01_shakespeare_romeo-and-juliet_act02.docx
✅ reading01_shakespeare_romeo-and-juliet_act03.docx
✅ reading02_shakespeare_othello.docx
✅ reading19_plath_the-bell-jar.docx
versus
❌ Romeo and Juliet Act 1.docx
❌ Romeo and juliet act 2.docx
❌ Shakespeare RJ act3.docx
❌ shakespeare othello I think?.docx
❌ belljar plath (1).docx

✅ reading01_shakespeare_romeo-and-juliet_act03.docx
✅ reading02_shakespeare_othello.docx
✅ reading19_plath_the-bell-jar.docx
edit my file.txt → edit command & my, file.txt args
edit "my file.txt"
edit my\ file.txt
my\ file.txt → my\\ file.txt
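The splitting behaviour above can be reproduced with Python's standard `shlex` module, which follows POSIX shell quoting rules. This is only an illustrative sketch; `edit` is the placeholder command from the example, not a real program.

```python
import shlex

# Unquoted: the shell-style splitter sees a command and two separate arguments.
unquoted = shlex.split("edit my file.txt")

# Quoted or backslash-escaped: the file name survives as one argument.
quoted = shlex.split('edit "my file.txt"')
escaped = shlex.split(r"edit my\ file.txt")

# shlex.quote adds the quoting needed to pass a name safely to a POSIX shell.
safe = shlex.quote("my file.txt")

print(unquoted)  # ['edit', 'my', 'file.txt']
print(quoted)    # ['edit', 'my file.txt']
print(escaped)   # ['edit', 'my file.txt']
print(safe)      # 'my file.txt' (including the single quotes)
```

A name without spaces needs none of this, which is the argument for avoiding them in the first place.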
U+00E9 (latin small letter e with acute)
U+0065 U+0301 (the letter "e" plus a combining acute symbol)
'^.*?$+| (? may mean "match any character")
/ (directory separator)
<>:"/\|?*
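The two encodings of "é" listed above can be compared in Python. The sketch below uses a hypothetical file name, `café.txt`, and the standard `unicodedata` module; NFC is one reasonable normalisation form to pick, not the only one.

```python
import unicodedata

precomposed = "caf\u00e9.txt"   # é as the single code point U+00E9
decomposed = "cafe\u0301.txt"   # e (U+0065) followed by combining acute U+0301

# The two names render the same but compare as different strings.
print(precomposed == decomposed)  # False

# Normalising both to the same form (NFC here) makes them comparable.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)  # True
```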
apple and Apple are the same file?
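One way to spot names that would clash on a case-insensitive file system is to group them by a case-folded key. This is a rough sketch: `case_collisions` is a hypothetical helper, and `str.casefold` only approximates what any particular file system does.

```python
from collections import defaultdict

def case_collisions(names):
    """Group names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Keep only the groups where more than one spelling maps to the same key.
    return [group for group in groups.values() if len(group) > 1]

collisions = case_collisions(["apple", "Apple", "banana"])
print(collisions)  # [['apple', 'Apple']]
```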
Example pattern:
[category] [class] [author] [title] [section(optional)]
Basic variant:
reading_01_shakespeare_romeo-and-juliet_act01.docx
Key-value variant:
cat-reading_class-01_author-shakespeare_title-romeoandjuliet_act-01.docx
>>> import glob
>>> glob.glob('reading_01*')
['reading_01_shakespeare_romeo-and-juliet_act02.docx', 'reading_01_shakespeare_romeo-and-juliet_act01.docx']
(similar to search in GUI file browsers)
>>> file = 'reading_01_shakespeare_romeo-and-juliet_act01.docx'
>>> file.split('_')
['reading', '01', 'shakespeare', 'romeo-and-juliet', 'act01.docx']
>>> file = 'cat-reading_class-01_author-shakespeare_title-romeoandjuliet_act-01.docx'
>>> dict(part.split('-') for part in file.split('_'))
{'cat': 'reading', 'class': '01', 'author': 'shakespeare', 'title': 'romeoandjuliet', 'act': '01.docx'}
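In the output above the extension stays attached to the last value (`'01.docx'`). A slightly more careful sketch, assuming the same key-value naming pattern, strips the extension first and splits each part on its first hyphen only; `parse_key_value_name` is a hypothetical helper name.

```python
from pathlib import PurePath

def parse_key_value_name(filename):
    """Split a key-value file name into a dict of fields plus its extension."""
    path = PurePath(filename)
    fields = {}
    for part in path.stem.split("_"):
        # Split on the first hyphen only, so values may contain hyphens.
        key, value = part.split("-", 1)
        fields[key] = value
    return fields, path.suffix

fields, extension = parse_key_value_name(
    "cat-reading_class-01_author-shakespeare_title-romeoandjuliet_act-01.docx"
)
print(fields["act"])  # 01
print(extension)      # .docx
```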
YYYY-MM-DD format maintains chronology in alphabetical ordering
_v1, _v2, _v3_final
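The point about YYYY-MM-DD dates can be checked directly: sorting such names as plain strings gives the same order as sorting them by date. The file names below are made up for illustration.

```python
from datetime import date

# Hypothetical file names with an ISO 8601 (YYYY-MM-DD) date prefix.
names = [
    "2021-10-18_notes.txt",
    "2020-12-01_notes.txt",
    "2021-02-03_notes.txt",
]

# Plain string sort, as a file browser or `ls` would do.
alphabetical = sorted(names)

# Sort by the parsed date in the first ten characters.
chronological = sorted(names, key=lambda n: date.fromisoformat(n[:10]))

print(alphabetical == chronological)  # True
```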
git log will show file history
git blame (line-by-line for text files)

git log inputs/images/king_01.jpg
commit ff73ce9acb8ee4b99106fa7ae080cfcb08138d48
Author: John Doe <j.doe@example.com>
Date:   Mon Oct 18 16:09:27 2021 +0200

    Add third image

git blame --date human README.md
487b1267 (John Doe Oct 12 2021  1) # Example dataset
487b1267 (John Doe Oct 12 2021  2)
487b1267 (John Doe Oct 12 2021  3) This is an example datalad dataset.
487b1267 (John Doe Oct 12 2021  4)
7fef4b96 (John Doe Oct 12 2021  5) Raw data is kept in `inputs` folder:
7fef4b96 (John Doe Oct 12 2021  6) - penguin photos are in `inputs/images`
dbf4ad7e (John Doe Oct 13 2021  7)
dbf4ad7e (John Doe Oct 13 2021  8) ## Credit
dbf4ad7e (John Doe Oct 13 2021  9)
6e759623 (John Doe Oct 18 2021 10) Photos by [Derek Oyen](https://unsplash.com/@goosegrease) and ...
touch name-with-identifying-information.dat
datalad save
git mv name-with-identifying-information.dat a-new-name.dat
datalad save
git diff HEAD~1 HEAD
diff --git a/name-with-identifying-information.dat b/a-new-name.dat
similarity index 100%
rename from name-with-identifying-information.dat
rename to a-new-name.dat
"Rewriting history" possible, but: not easy, potentially destructive.
adelie_087.jpeg
adelie_087.yaml
Sidecar file content:
species: Adelie
island: Torgersen
date: 2021-09-12
penguin_count: 1
sex: MALE
photographer: John
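Because a sidecar shares its stem with the data file, its name can be derived rather than stored. A minimal sketch, assuming the `.yaml` convention shown here; `sidecar_for` is a hypothetical helper.

```python
from pathlib import PurePath

def sidecar_for(data_file, sidecar_suffix=".yaml"):
    """Return the sidecar name: same stem as the data file, different extension."""
    return PurePath(data_file).with_suffix(sidecar_suffix)

print(sidecar_for("adelie_087.jpeg"))  # adelie_087.yaml
```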
penguins.csv
penguins.yaml
Sidecar file content:
species:
  description: a factor denoting penguin species
  levels:
    Adélie: P. adeliae
    Chinstrap: P. antarctica
    Gentoo: P. papua
  termURL: https://www.wikidata.org/wiki/Q9147
bill_length_mm:
  description: a number denoting bill length
  units: mm
full:
/home/alice/Documents/project/figures/setup.png
/Users/bob/Documents/project/figures/setup.png
C:\Users\eve\Documents\project\figures\setup.png
relative:
figures/setup.png
Avoid hardcoding full paths: a project that uses relative paths is easier to move around.
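A short sketch of why this helps: the relative part is identical for every user, and only the project root changes. It uses the pure path classes from `pathlib`, so it runs the same on any operating system; the roots are the ones from the full-path examples.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The relative part is the same for everyone working on the project.
relative = "figures/setup.png"

# Only the project root differs between machines.
alice = PurePosixPath("/home/alice/Documents/project") / relative
bob = PurePosixPath("/Users/bob/Documents/project") / relative
eve = PureWindowsPath(r"C:\Users\eve\Documents\project") / relative

print(alice)  # /home/alice/Documents/project/figures/setup.png
print(bob)    # /Users/bob/Documents/project/figures/setup.png
print(eve)    # C:\Users\eve\Documents\project\figures\setup.png
```

Code that stores only `figures/setup.png` keeps working after the project folder is copied to another machine.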
Text | Binary |
---|---|
.txt | |
markup: .md, .rst, .org, .html | documents: .docx, .xlsx, .pdf |
source code: .py, .R, .m | compiled files: .pyc, .o, .exe |
text-serialised formats: .toml, .yaml, .json, .xml | binary-serialised formats: .pickle, .feather, .hdf |
delimited files: .tsv, .csv | domain-specific: .nii, .edf |
vector graphics: .svg | images: .jpg, .png, .tiff |
 | compressed: .zip, .gz, .7z |
Consider the following:
/dataset
├── sample1
│   └── a001.dat
├── sample2
│   └── a001.dat
...
After applying a transform (preprocessing, analysis, …) this becomes:
/dataset
├── sample1
│   ├── ps34t.dat
│   └── a001.dat
├── sample2
│   ├── ps34t.dat
│   └── a001.dat
...
Without expert / domain knowledge, no distinction between original and derived data is possible anymore.
Compare it to a case with a clearer separation of semantics:
/derived_dataset
├── sample1
│   └── ps34t.dat
├── sample2
│   └── ps34t.dat
├── ...
└── inputs
    └── raw
        ├── sample1
        │   └── a001.dat
        ├── sample2
        │   └── a001.dat
        ...
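With this layout, the location of a derived file follows mechanically from the location of its raw input. The sketch below assumes the `inputs/raw` prefix and the `ps34t.dat` name from the example; `derived_path` is a hypothetical helper, not part of any tool.

```python
from pathlib import PurePosixPath

# Assumed location of raw data, relative to the derived dataset root.
RAW_ROOT = PurePosixPath("inputs/raw")

def derived_path(raw_file, derived_name="ps34t.dat"):
    """Map inputs/raw/<sample>/<file> to <sample>/<derived_name>."""
    # relative_to raises ValueError if the file is not under inputs/raw.
    sample_dir = PurePosixPath(raw_file).relative_to(RAW_ROOT).parent
    return sample_dir / derived_name

print(derived_path("inputs/raw/sample1/a001.dat"))  # sample1/ps34t.dat
```

Anything under `inputs/` is an input and everything else is derived, so no domain knowledge is needed to tell them apart.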
compendium/
├── data
│   └── my_data.csv
├── analysis
│   └── my_script.R
├── DESCRIPTION
└── README.md
compendium/
├── CITATION              <- instructions on how to cite
├── code                  <- custom code for this project
│   ├── analyze_data.R
│   └── clean_data.R
├── data_clean            <- intermediate data that has been transformed
│   └── data_clean.csv
├── data_raw              <- raw, immutable data
│   ├── datapackage.json
│   └── data_raw.csv
├── Dockerfile            <- computing environment recipe
├── figures               <- figures
│   └── flow_chart.jpeg
├── LICENSE               <- terms for reuse
├── Makefile              <- steps to automatically generate the results
├── paper.Rmd             <- text and code combined
└── README.md             <- top-level description
Example from The Turing Way
(3) Structure study elements in modular components to facilitate reuse within or outside the context of the original study
Yoda's Organigram on Data Analysis, github.com/myyoda/myyoda
datalad create -c yoda "my_analysis"
code, changelog & README to be tracked by git (all else annexed).

├── CHANGELOG.md
├── code
│   └── README.md
└── README.md
├── ci/                         # continuous integration configuration
│   └── .travis.yml
├── code/                       # your code
│   ├── tests/                  # unit tests to test your code
│   │   └── test_myscript.py
│   └── myscript.py
├── docs                        # documentation about the project
│   ├── build/
│   └── source/
├── envs                        # computational environments
│   └── Singularity
├── inputs/                     # dedicated inputs/, will not be changed by an analysis
│   └── data/
│       ├── dataset1/           # one stand-alone data component
│       │   └── datafile_a
│       └── dataset2/
│           └── datafile_a
├── outputs/                    # outputs away from the input data
│   └── important_results/
│       └── figures/
├── CHANGELOG.md                # notes for fellow humans about your project
├── HOWTO.md
└── README.md
Example from Datalad Handbook
file naming, directory structure, metadata bids.neuroimaging.io
.
├── CHANGES
├── dataset_description.json
├── participants.tsv
├── README
├── sub-01
│   ├── anat
│   │   ├── sub-01_inplaneT2.nii.gz
│   │   └── sub-01_T1w.nii.gz
│   └── func
│       ├── sub-01_task-rhymejudgment_bold.nii.gz
│       └── sub-01_task-rhymejudgment_events.tsv
└── task-rhymejudgment_bold.json
sub-01_task-rhymejudgment_bold.nii.gz
sub-01_task-rhymejudgment_events.tsv
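These names follow the same key-value idea as earlier, so they can be split mechanically. The sketch below is a simplified reading of the pattern visible in the two file names (key-value entities, then a suffix, then an extension) and does not cover the full BIDS specification; `parse_bids_name` is a hypothetical helper.

```python
def parse_bids_name(filename):
    """Split a BIDS-style file name into entities, suffix and extension."""
    # Everything from the first dot onwards is the extension (e.g. ".nii.gz").
    stem, dot, ext = filename.partition(".")
    # The last underscore-separated part is the suffix; the rest are key-value pairs.
    *entity_parts, suffix = stem.split("_")
    entities = dict(part.split("-", 1) for part in entity_parts)
    return entities, suffix, dot + ext

entities, suffix, extension = parse_bids_name("sub-01_task-rhymejudgment_bold.nii.gz")
print(entities)   # {'sub': '01', 'task': 'rhymejudgment'}
print(suffix)     # bold
print(extension)  # .nii.gz
```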