Setup – Research Data Management with DataLad

Participate without installation: Jupyter Hub

If you are participating in an organised workshop, the organisers may have provided you with access to a Jupyter Hub. In this case you will be working on a remote server, with all required software, through a web browser interface. This interface, called Jupyter Lab, gives you access to a command line, a basic file browser and a basic text editor. For workshop organisers/instructors, more information on setting up a cloud server with Jupyter Hub can be found here.

Participate without installation: use Binder

If you don’t have access to a premade environment (such as the Jupyter Hub above) and can’t or don’t want to install anything on your own machine, you can follow all exercises through Binder. The link opens a Jupyter Lab interface in your browser (see above). The Binder environment includes the most important software needed during the workshop. However, it has two limitations:

it is not persistent (all content will be removed after you close it)
it does not allow outgoing ssh connections (meaning that during the lesson about collaboration you won’t be able to publish all example data).

Participate with own computer: install software

If you want to follow the examples on your own machine, you will need to install DataLad and some additional software which we will use during the walkthrough. Note that Linux or macOS is strongly recommended for this workshop; although DataLad works on all major operating systems, there are some caveats on Windows which may complicate the presented workflow.

DataLad

For the installation of DataLad, follow the instructions from the DataLad handbook.

Tig

Tig (text mode interface for Git) is a small command line program which we will use to view dataset history. On Linux you can install it with your package manager (e.g. apt install tig on Debian and Ubuntu), and on macOS it’s best to install it through Homebrew (brew install tig). Detailed instructions for different systems are given here.

Python and modules

During the workshop, we will use photos and comma-separated (CSV) files to represent data, and custom Python scripts will serve as a model of data processing. In addition to Python you will need the following libraries:

pillow (processing images - examples in Modules 1 and 3)
pandas and seaborn (tabular data, plots - examples in Module 4)

We recommend creating a virtual environment and installing the packages there. One way to do this is with virtualenv and pip:

virtualenv --system-site-packages --python=python3 ~/.venvs/rdm-workshop
source ~/.venvs/rdm-workshop/bin/activate
pip install pillow
pip install pandas seaborn
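
To confirm that the environment is ready, a short check like the one below can be run with the virtual environment activated. This is a minimal sketch and not part of the workshop materials: the function name find_missing is hypothetical, and the only assumption about the packages is their import names (pillow is imported as PIL).

```python
import importlib.util

# Package name -> import name. Note that pillow is imported as "PIL".
REQUIRED_MODULES = {"pillow": "PIL", "pandas": "pandas", "seaborn": "seaborn"}


def find_missing(modules):
    """Return the package names whose import name cannot be found.

    `modules` maps a package name (as used with pip) to the top-level
    name used in an import statement. Nothing is actually imported.
    """
    return [
        package
        for package, import_name in modules.items()
        if importlib.util.find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required Python packages are available.")
```

If any package is reported as missing, make sure the virtual environment is activated and repeat the corresponding pip install command.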

Pandoc

Pandoc is a tool for converting files from one markup format into another. We will use it in one of the examples in Module 4. As with Tig, you can install it with your package manager on Linux (e.g. apt install pandoc) or with Homebrew on macOS (brew install pandoc), and you can read about all installation methods and systems here.
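
Once DataLad, Tig and Pandoc are installed, one way to confirm that they can be found on your PATH is a small Python check. This is an illustrative sketch and not part of the workshop materials: the helper name find_missing_tools is hypothetical, and it assumes the programs are invoked as datalad, tig and pandoc.

```python
import shutil

# Command names assumed for the tools described in this section.
REQUIRED_TOOLS = ["datalad", "tig", "pandoc"]


def find_missing_tools(tools):
    """Return the commands from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required command line tools were found.")
```

The check only tests whether each command can be located; it does not verify versions or that the tools run correctly.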

Register a GIN account

GIN is a data hosting / management platform of the German Neuroinformatics Node. In the module on remote collaboration we will be using GIN to demonstrate data publishing. If you want to follow the entire walkthrough, you will need to register a GIN account here. From the registration page:

For Registration we require only username, password, and a valid email address, but adding your name and affiliation is recommended. Please use an institutional email address for registration to benefit from the full set of features of GIN.
