Structuring data

Michał Szczepanik

1. Good practices on data organisation

"What is a good filename?"

  • Naming can go a long way in making your life easier
  • Topics:
    • file names
    • file types
    • project structure

Based on the presentations "Naming Things" (CC0) by Jenny Bryan and "Project structure" by Danielle Navarro.

2. How to name a file?

  • A file name exists to identify its content.
  • Opinions as to what exactly is a good file name differ.
  • Disclaimer:
    • Probably no gold standard exists, and
    • we do not claim to possess one, but
    • we will focus on identifying patterns in file naming which can make working with data easier.

2.1. Three principles:

  • be machine readable (automatically extract information)
  • be human readable (easy to know what's what)
  • make sorting and searching easy (easy to find things)

2.2. Why are some ways preferred?

  • The obvious
    • Need to know what's in a file
    • Need to find the file we need
  • Less obvious
    • Different uses (point & click, command line, automation)
    • Must survive transfer (also across OS)

2.3. Example: English literature class

✅ reading01_shakespeare_romeo-and-juliet_act01.docx
✅ reading01_shakespeare_romeo-and-juliet_act02.docx
✅ reading01_shakespeare_romeo-and-juliet_act03.docx
✅ reading02_shakespeare_othello.docx
✅ reading19_plath_the-bell-jar.docx

versus

❌ Romeo and Juliet Act 1.docx
❌ Romeo and juliet act 2.docx
❌ Shakespeare RJ act3.docx
❌ shakespeare othello I think?.docx
❌ belljar plath (1).docx

2.4. What's to like?

✅ reading01_shakespeare_romeo-and-juliet_act03.docx
✅ reading02_shakespeare_othello.docx
✅ reading19_plath_the-bell-jar.docx
  • uses consistent patterns
  • avoids "risky" characters
  • remains:
    • human readable
    • machine readable
    • easy to sort and search

3. Recommendations

3.1. Avoid white spaces

  • Command line: space separates arguments
    • edit my file.txt
    • parsed as: edit (command) with my and file.txt (two separate arguments)
  • Solve by:
    • Quoting: edit "my file.txt"
    • Escaping: edit my\ file.txt
  • Not ideal:
    • more typing / harder autocompletion
    • problems when passing around: my\ file.txt may become my\\ file.txt
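
The splitting, quoting, and escaping described above can be reproduced with Python's standard shlex module. This is only a sketch: shlex follows POSIX shell rules, and real shells may differ in details.

```python
import shlex

# Unquoted: the space splits the file name into two arguments.
unquoted = shlex.split('edit my file.txt')
# Quoted or escaped: the file name stays in one piece.
quoted = shlex.split('edit "my file.txt"')
escaped = shlex.split(r'edit my\ file.txt')
# shlex.quote() adds the quoting needed to pass a name safely to a shell.
safe = shlex.quote('my file.txt')

print(unquoted)  # ['edit', 'my', 'file.txt']
print(quoted)    # ['edit', 'my file.txt']
print(escaped)   # ['edit', 'my file.txt']
print(safe)      # 'my file.txt'
```

A name without spaces, such as my_file.txt, needs none of this.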

3.2. Use only letters, numbers, hyphens, and underscores

  • Encoding: e.g. even within Unicode, é in Adélie can be:
    • U+00E9 (latin small letter e with acute)
    • U+0065 U+0301 (the letter "e" plus a combining acute symbol)
  • Special meaning in command line ('^.*?$+|)
    • e.g. ? may mean "match any character"
  • Outright forbidden by some OS
    • illegal on Unix: / (directory separator)
    • illegal on Windows: <>:"/\|?*

3.3. Don't rely on letter case

  • Are apple and Apple the same file?
    • depends on file system
    • e.g. HFS+ preserves case on creation but not retrieval
  • Be consistent
  • Don't rely on just letter case to distinguish files
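
One way to spot names that would clash on a case-insensitive file system is to group them by their case-folded form. A minimal sketch (case_collisions is a hypothetical helper, not part of any tool mentioned here):

```python
from collections import defaultdict

def case_collisions(names):
    """Return groups of names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]

print(case_collisions(['apple', 'Apple', 'banana.txt']))  # [['apple', 'Apple']]
```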

3.4. Use separators in a meaningful way

  • "-" to join words into one entity.
  • "_" to separate entities.
  • Example pattern:

    [category] [class] [author] [title] [section(optional)]
    
  • Basic variant:

    reading_01_shakespeare_romeo-and-juliet_act01.docx
    
  • Key-value variant:

    cat-reading_class-01_author-shakespeare_title-romeoandjuliet_act-01.docx
    

3.4.1. Globbing

>>> import glob
>>> glob.glob('reading_01*')
['reading_01_shakespeare_romeo-and-juliet_act02.docx', 'reading_01_shakespeare_romeo-and-juliet_act01.docx']

(similar to search in GUI file browsers)

3.4.2. Splitting

>>> file = 'reading_01_shakespeare_romeo-and-juliet_act01.docx'
>>> file.split('_')
['reading', '01', 'shakespeare', 'romeo-and-juliet', 'act01.docx']
>>> file = 'cat-reading_class-01_author-shakespeare_title-romeoandjuliet_act-01.docx'
>>> dict(part.split('-') for part in file.split('_'))
{'cat': 'reading', 'class': '01', 'author': 'shakespeare', 'title': 'romeoandjuliet', 'act': '01.docx'}
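
In both outputs above, the last element still carries the file extension. A possible refinement, sketched with the standard pathlib module (parse_key_value is a hypothetical helper), removes the extension before splitting:

```python
from pathlib import PurePath

def parse_key_value(filename):
    """Split a key-value file name into (entities, extension)."""
    path = PurePath(filename)
    # partition('-') splits on the first hyphen only, so a value may itself
    # contain hyphens; [::2] keeps the key and the value, dropping the hyphen.
    entities = dict(part.partition('-')[::2] for part in path.stem.split('_'))
    return entities, path.suffix

file = 'cat-reading_class-01_author-shakespeare_title-romeoandjuliet_act-01.docx'
entities, extension = parse_key_value(file)
print(entities['act'])  # 01
print(extension)        # .docx
```

Note that path.stem removes only the last suffix, so a double extension such as .nii.gz would need separate handling.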

3.5. Follow ISO 8601 if using dates

  • YYYY-MM-DD format maintains chronology in alphabetical ordering
  • Dates aren't always preserved on transfer
  • Valid use cases:
    • Date is crucial (e.g. daily weather data)
    • Sorting by name is crucial (e.g. meeting notes)
  • Version control covers most other use cases:
    • Don't use dates or suffixes such as _v1, _v2, _v3_final to track versions
    • git log will show file history
    • so will git blame (line-by-line for text files)
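
The sorting claim for YYYY-MM-DD can be checked with a few made-up file names (the names and dates below are illustrative only):

```python
from datetime import date

iso_names = ['2021-10-18_notes.md', '2021-12-01_notes.md', '2020-02-03_notes.md']
dmy_names = ['18-10-2021_notes.md', '01-12-2021_notes.md', '03-02-2020_notes.md']

# ISO 8601 names: alphabetical order equals chronological order.
print(sorted(iso_names))
# ['2020-02-03_notes.md', '2021-10-18_notes.md', '2021-12-01_notes.md']

# Day-first names: alphabetical order scrambles the chronology.
print(sorted(dmy_names))
# ['01-12-2021_notes.md', '03-02-2020_notes.md', '18-10-2021_notes.md']

# date.isoformat() produces the YYYY-MM-DD form directly.
prefix = date(2021, 9, 12).isoformat()
print(f'{prefix}_notes.md')  # 2021-09-12_notes.md
```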

3.5.1. git log

git log inputs/images/king_01.jpg
commit ff73ce9acb8ee4b99106fa7ae080cfcb08138d48
Author: John Doe <j.doe@example.com>
Date:   Mon Oct 18 16:09:27 2021 +0200

    Add third image

3.5.2. git blame

git blame --date human README.md
487b1267 (John Doe Oct 12 2021       1) # Example dataset
487b1267 (John Doe Oct 12 2021       2) 
487b1267 (John Doe Oct 12 2021       3) This is an example datalad dataset.
487b1267 (John Doe Oct 12 2021       4) 
7fef4b96 (John Doe Oct 12 2021       5) Raw data is kept in `inputs` folder:
7fef4b96 (John Doe Oct 12 2021       6) - penguin photos are in `inputs/images`
dbf4ad7e (John Doe Oct 13 2021       7) 
dbf4ad7e (John Doe Oct 13 2021       8) ## Credit
dbf4ad7e (John Doe Oct 13 2021       9) 
6e759623 (John Doe Oct 18 2021      10) Photos by [Derek Oyen](https://unsplash.com/@goosegrease) and ...

3.6. Avoid leaking undesired information

touch name-with-identifying-information.dat
datalad save

git mv name-with-identifying-information.dat a-new-name.dat
datalad save

git diff HEAD~1 HEAD
diff --git a/name-with-identifying-information.dat b/a-new-name.dat
similarity index 100%
rename from name-with-identifying-information.dat
rename to a-new-name.dat

"Rewriting history" possible, but: not easy, potentially destructive.

4. Sidecar metadata strategy

4.0.1. Additional information in a text file

adelie_087.jpeg
adelie_087.yaml

Sidecar file content:

species: Adelie
island: Torgersen
date: 2021-09-12
penguin_count: 1
sex: MALE
photographer: John
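
Because the sidecar shares the data file's name and differs only in extension, its path can be derived mechanically. A minimal sketch with the standard pathlib module (sidecar_for is a hypothetical helper, and the images/ folder is only an illustration):

```python
from pathlib import PurePosixPath

def sidecar_for(data_file, suffix='.yaml'):
    """Return the sidecar path: same name, different extension."""
    return PurePosixPath(data_file).with_suffix(suffix)

print(sidecar_for('adelie_087.jpeg'))         # adelie_087.yaml
print(sidecar_for('images/adelie_087.jpeg'))  # images/adelie_087.yaml
```

Reading the sidecar content itself would need a YAML parser, which is not part of the Python standard library.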

4.0.2. Meticulous tabular data annotation

penguins.csv
penguins.yaml

Sidecar file content:

species:
  description: a factor denoting penguin species
  levels:
    Adélie: P. adeliae
    Chinstrap: P. antarctica
    Gentoo: P. papua
  termURL: https://www.wikidata.org/wiki/Q9147
bill_length_mm:
  description: a number denoting bill length
  units: mm

5. Important distinctions

5.1. File paths: full vs relative

full:

/home/alice/Documents/project/figures/setup.png
/Users/bob/Documents/project/figures/setup.png
C:\Users\eve\Documents\project\figures\setup.png

relative:

figures/setup.png

Avoid hardcoding full paths: relative paths keep a project easy to move around.
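
The point can be illustrated with the standard pathlib module: once the user-specific project root is removed, the two Unix-style full paths above reduce to the same relative path. A sketch (relative_to_project is a hypothetical helper):

```python
from pathlib import PurePosixPath

def relative_to_project(full_path, project_root):
    """Express full_path relative to project_root."""
    return PurePosixPath(full_path).relative_to(project_root)

alice = relative_to_project('/home/alice/Documents/project/figures/setup.png',
                            '/home/alice/Documents/project')
bob = relative_to_project('/Users/bob/Documents/project/figures/setup.png',
                          '/Users/bob/Documents/project')

print(alice)         # figures/setup.png
print(alice == bob)  # True
```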

5.2. Text vs binary files

  • A text file is a file structured as a sequence of lines containing text, composed of characters.
  • A binary file is anything other than a text file.
  • Choice will affect
    • DataLad behaviour (esp. with text2git - often suboptimal)
    • Simplicity of reading data

5.2.1. Examples

Text                                               | Binary
.txt                                               |
markup: .md, .rst, .org, .html                     | documents: .docx, .xlsx, .pdf
source code: .py, .R, .m                           | compiled files: .pyc, .o, .exe
text-serialised formats: .toml, .yaml, .json, .xml | binary-serialised formats: .pickle, .feather, .hdf
delimited files: .tsv, .csv                        | domain-specific: .nii, .edf
vector graphics: .svg                              | images: .jpg, .png, .tiff
                                                   | compressed: .zip, .gz, .7z

6. Folder structure

6.1. Keep inputs and outputs separate

Consider the following:

/dataset
├── sample1
│   └── a001.dat
├── sample2
│   └── a001.dat
...

After applying a transform (preprocessing, analysis, …) this becomes:

/dataset
├── sample1
│   ├── ps34t.dat
│   └── a001.dat
├── sample2
│   ├── ps34t.dat
│   └── a001.dat
...

Without expert / domain knowledge, no distinction between original and derived data is possible anymore.

Compare it to a case with a clearer separation of semantics:

/derived_dataset
├── sample1
│   └── ps34t.dat
├── sample2
│   └── ps34t.dat
├── ...
└── inputs
    └── raw
        ├── sample1
        │   └── a001.dat
        ├── sample2
        │   └── a001.dat
        ...
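
A processing script can enforce this separation by only reading from inputs/raw and only writing outside it. The sketch below rebuilds the layout above in a temporary folder; the derive function and its upper-casing "transform" are placeholders invented for illustration.

```python
import tempfile
from pathlib import Path

def derive(root):
    """Read inputs/raw/<sample>/a001.dat, write <sample>/ps34t.dat."""
    root = Path(root)
    written = []
    for source in sorted((root / 'inputs' / 'raw').glob('*/a001.dat')):
        target = root / source.parent.name / 'ps34t.dat'
        target.parent.mkdir(parents=True, exist_ok=True)
        # Placeholder transform (an assumption): upper-case the text content.
        target.write_text(source.read_text().upper())
        written.append(target.relative_to(root).as_posix())
    return written

with tempfile.TemporaryDirectory() as tmp:
    # Create the raw inputs for two samples.
    for sample in ('sample1', 'sample2'):
        raw_dir = Path(tmp) / 'inputs' / 'raw' / sample
        raw_dir.mkdir(parents=True)
        (raw_dir / 'a001.dat').write_text('raw values')
    result = derive(tmp)

print(result)  # ['sample1/ps34t.dat', 'sample2/ps34t.dat']
```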

6.2. Example: Research compendium

6.2.1. Minimal research compendium

compendium/
├── data
│   └── my_data.csv
├── analysis
│   └── my_script.R
├── DESCRIPTION
└── README.md
  • Data and methods separated into folders
  • Computational environment described in a designated file
  • A README document provides a landing page

research-compendium.science

6.2.2. More extensive research compendium

compendium/
├── CITATION              <- instructions on how to cite
├── code                  <- custom code for this project
│   ├── analyze_data.R
│   └── clean_data.R
├── data_clean            <- intermediate data that has been transformed
│   └── data_clean.csv
├── data_raw              <- raw, immutable data
│   ├── datapackage.json
│   └── data_raw.csv
├── Dockerfile            <- computing environment recipe
├── figures               <- figures
│   └── flow_chart.jpeg
├── LICENSE               <- terms for reuse
├── Makefile              <- steps to automatically generate the results
├── paper.Rmd             <- text and code combined
└── README.md             <- top-level description

Example from The Turing Way

6.3. Example: YODA principles

(3) Structure study elements in modular components to facilitate reuse within or outside the context of the original study

Yoda's Organigram on Data Analysis, github.com/myyoda/myyoda

6.3.1. Minimal example

  • datalad create -c yoda "my_analysis"
    • creates initial files
    • sets code, changelog & README to be tracked by git (all else annexed)
.
├── CHANGELOG.md
├── code
│   └── README.md
└── README.md

6.3.2. Extensive example

├── ci/                         # continuous integration configuration
│   └── .travis.yml
├── code/                       # your code
│   ├── tests/                  # unit tests to test your code
│   │   └── test_myscript.py
│   └── myscript.py
├── docs                        # documentation about the project
│   ├── build/
│   └── source/
├── envs                        # computational environments
│   └── Singularity
├── inputs/                     # dedicated inputs/, will not be changed by an analysis
│   └── data/
│       ├── dataset1/           # one stand-alone data component
│       │   └── datafile_a
│       └── dataset2/
│           └── datafile_a
├── outputs/                    # outputs away from the input data
│   └── important_results/
│       └── figures/
├── CHANGELOG.md                # notes for fellow humans about your project
├── HOWTO.md
└── README.md

Example from Datalad Handbook

6.4. Example: Brain Imaging Data Structure

file naming, directory structure, metadata bids.neuroimaging.io

6.4.1. Example

.
├── CHANGES
├── dataset_description.json
├── participants.tsv
├── README
├── sub-01
│   ├── anat
│   │   ├── sub-01_inplaneT2.nii.gz
│   │   └── sub-01_T1w.nii.gz
│   └── func
│       ├── sub-01_task-rhymejudgment_bold.nii.gz
│       └── sub-01_task-rhymejudgment_events.tsv
└── task-rhymejudgment_bold.json

6.4.2. Key elements

sub-01_task-rhymejudgment_bold.nii.gz
sub-01_task-rhymejudgment_events.tsv
  • Key-value naming, with underscores and dashes
  • Sidecar metadata strategy:
    • .nii.gz (compressed binary file with neuroimaging data)
    • .tsv (text file with event timings)
  • Text files where possible:
    • tsv files are used to store participant tables and event timings.
    • json files are used for metadata
  • Specification & extensions for different neuroscience domains
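
The key-value naming makes such file names easy to take apart. The function below is a simplified sketch for names like the ones shown here; it is not the official BIDS parser and ignores most of the specification. parse_bids_name is a hypothetical helper.

```python
def parse_bids_name(filename):
    """Split a BIDS-like name into (entities, suffix, extension)."""
    # Everything after the first dot is the extension, e.g. .nii.gz
    stem, _, extension = filename.partition('.')
    # The last underscore-separated part is the suffix (e.g. bold);
    # the parts before it are key-value pairs joined by a hyphen.
    *pairs, suffix = stem.split('_')
    entities = dict(pair.split('-', 1) for pair in pairs)
    return entities, suffix, '.' + extension

print(parse_bids_name('sub-01_task-rhymejudgment_bold.nii.gz'))
# ({'sub': '01', 'task': 'rhymejudgment'}, 'bold', '.nii.gz')
print(parse_bids_name('sub-01_task-rhymejudgment_events.tsv'))
# ({'sub': '01', 'task': 'rhymejudgment'}, 'events', '.tsv')
```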

7. Find out more