Research Data Management with DataLad: For Instructors

To save time during a workshop, organizers and instructors can consider setting up a cloud-based server beforehand, with JupyterHub and all required software packages pre-installed for all users. This ensures that all users have a working setup that is accessible via the browser, and prevents user- and operating system-specific issues from hindering workshop progress.

Below we provide instructions on how to set up The Littlest JupyterHub (for short: TLJH) for a small number of users (0-100) on a single AWS cloud-computing server. This includes:

  1. Creating an EC2 instance on AWS
  2. Create a constant IP address and domain
  3. JupyterHub configuration
  4. Some notes and further options

Before you begin

  1. Sign up for an AWS account and verify it.
  2. Consider the costs involved in running a cloud-compute server. AWS provides resources for checking EC2 on-demand pricing and also for calculating cost estimates for your usage.
  3. If you want to serve your workshop content via a custom domain (e.g. datalad-hub.inm7.de), ensure that you have the required level of access that will allow you to associate the EC2-provided IP address with your chosen domain address.

AWS account usage can incur costs

Although Amazon offers Free Tier access to its services, you will be charged once your usage exceeds the Free Tier limits. Be sure to take note of these limits, or set up automatic tracking alerts so that you are notified before incurring unnecessary costs.

In 2022, our costs for a workshop consisting of two half-day sessions were about 15 € (we used a small AWS instance for setup, and manually started and stopped a large instance for the workshop sessions only, i.e. about 8 hours).

1. Creating an EC2 instance on AWS

Note: The content of this section has been adapted from The Littlest JupyterHub documentation.

JupyterHub can be installed on AWS’s Elastic Compute Cloud (EC2), resulting in a setup with admin users and a user environment with conda / pip packages. Please follow these instructions:

  1. In Services, find EC2, and click Launch Instances.
  2. Name and tags: enter the name that will show in your EC2 management console - something that will identify it for you, like datalad-workshop.
  3. Application and OS Images: choose Ubuntu and pick Ubuntu Server 22.04 LTS.
  4. Instance type: while The Littlest JupyterHub documentation recommends t3.small, we suggest t3a.medium. The type can easily be up- or downscaled at a later stage. See notes below.
  5. Key pair (login): create a new one or use one that you have created previously. Creating a new one will download it to your machine, and you should keep this safe. This key can be used for ssh access from your terminal, which will likely only be needed if things go wrong, since you should be able to do everything you need to via the JupyterHub browser interface.
  6. Network settings: click Edit and at Firewall (security groups) click Create security group, or otherwise Select existing security group if you have previously created one. When creating a new security group:
    • Edit the Security group name and Description
    • Use Add security group rule to ensure that HTTP (TCP port 80), HTTPS (TCP port 443) and SSH (TCP port 22) are all allowed.
    • If a new rule requires the Source to be set, use 0.0.0.0/0
  7. Configure storage: enter the amount of storage you require. This will vary depending on your workshop content and installed packages. According to AWS, “free tier eligible customers can get up to 30 GB of EBS General Purpose (SSD) or Magnetic storage”.
  8. Advanced details: expand this section, and at the very end in the text-field User data paste the following code (provided in The Littlest JupyterHub documentation):

    #!/bin/bash
    curl -L https://tljh.jupyter.org/bootstrap.py \
    | sudo python3 - \
        --admin <admin-username>
    

    Be sure to replace <admin-username> with your username of choice. And the shebang is important, don’t miss it!

    TLJH bootstrap script

    This bootstrap script is for convenience, so that the server is set up from the get-go. It ensures that you’ll only have to work through JupyterHub and not connect to the server via SSH. If you don’t paste this bit, you will have to SSH from your terminal and follow the steps described in Installing on your own server as if you worked on any other remote server.

  9. Finally, select Launch instance. Give it about 10 minutes to complete the installation.
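The User data text from step 8 can also be generated with a small shell function, which makes it harder to drop the shebang or to leave the placeholder in place. This is only a minimal sketch: the function name render_user_data and the example username workshop-admin are our own choices and not part of TLJH or AWS.

```shell
# Sketch: print the TLJH user-data script for a given admin username.
# render_user_data is a hypothetical helper name, not a TLJH or AWS tool.
render_user_data() {
    admin="$1"
    if [ -z "$admin" ]; then
        echo "usage: render_user_data <admin-username>" >&2
        return 1
    fi
    # The shebang must be the very first line of the user data.
    printf '%s\n' \
        '#!/bin/bash' \
        'curl -L https://tljh.jupyter.org/bootstrap.py \' \
        '| sudo python3 - \' \
        "    --admin $admin"
}

# Example with a placeholder username; paste the output into "User data".
render_user_data workshop-admin
```

The function refuses an empty username, so a forgotten argument shows up before the instance is launched rather than after the installation has finished.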

2. Create a constant IP address and domain

Once launched and running, the instance with JupyterHub environment will be accessible via a public IP address. You can verify this by navigating to Instances in the left-hand sidebar of the EC2 console, then selecting your newly launched instance, and clicking on open address next to the IP address in the information block. This should open a browser tab/window with a login interface for JupyterHub. At this moment, we only have http available (not https).

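Technically, this IP address can suffice as a point of entry for users, but there are caveats: the automatically assigned public IP address changes whenever the instance is stopped and started again, and a bare IP address is hard for participants to remember and type.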

For these reasons, we need an “Elastic IP”:

  1. Go to Elastic IPs under Network & Security in the left-hand sidebar in EC2 console.
  2. Click the orange Allocate Elastic IP address button.
  3. Keep Amazon's pool of IPv4 addresses selected, and click Allocate at the bottom.
  4. Next, select the IP address that was allocated, click Actions (next to the orange button), and select Associate Elastic IP address.
  5. Now, associate the elastic IP address with the instance that you’ve created. If you want to be able to reassociate the elastic IP address in future, check that box. Once done, the newly associated IP address will show up on the instance, and the instance will be accessible via this address.
  6. Lastly, ask your friendly system admin to associate this with a custom address likely within your institutional domain (such as datalad-hub.inm7.de). Once this is completed, your instance will be accessible via its own custom domain address.
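Once the domain has been set up, it is worth checking that it really resolves to the Elastic IP before you continue with the HTTPS configuration. The following is a minimal sketch, assuming a Linux machine where getent is available; resolve_ipv4 and check_domain are hypothetical helper names, and the arguments in the usage comment are placeholders.

```shell
# Sketch: check that a domain resolves to the expected Elastic IP.
# resolve_ipv4 and check_domain are hypothetical helper names.
resolve_ipv4() {
    # getent consults the system resolver; print the first IPv4 address.
    getent ahostsv4 "$1" | awk 'NR == 1 { print $1 }'
}

check_domain() {
    domain="$1"
    expected="$2"
    actual="$(resolve_ipv4 "$domain")"
    if [ "$actual" = "$expected" ]; then
        echo "ok: $domain resolves to $expected"
    else
        echo "mismatch: $domain resolves to '$actual', expected $expected" >&2
        return 1
    fi
}

# Usage (placeholders): check_domain datalad-hub.inm7.de <your-elastic-ip>
```

DNS changes can take a while to propagate, so a mismatch shortly after the change does not necessarily indicate a mistake.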

3. JupyterHub configuration

Once your instance is set up with JupyterHub installed and accessible via a custom domain, several steps remain in order to create an environment that is ready to use and intuitive for users. These include configuring HTTPS, installing required software tools, setting up the default shell, and user management.

3.1. Log in and explore the hub

Before doing anything else, navigate to your Elastic IP address or custom domain and log in to JupyterHub. Type the username that you entered in place of <admin-username> in the bootstrap script above. You can enter any password you prefer; this first login will save it as your admin account password. Once logged in, you will likely see the hub Home View at <your-hub-address>/hub/home. From here, you can:

3.2. TLJH configuration

You can configure The Littlest JupyterHub using tljh-config via a terminal window once logged in. For all configuration options, see the documentation. For now, we will only configure our hub to use JupyterLab as the default interface.

Navigate to your server, open up a Terminal and type and run the following command:

sudo tljh-config set user_environment.default_app jupyterlab

After modification, the configuration should be reloaded in order to take effect:

sudo tljh-config reload

In order to view the configuration, you can use:

sudo tljh-config show

3.3. Set up HTTPS

This can be done with Let’s Encrypt by following instructions from TLJH documentation:

  1. Enable HTTPS:
    sudo tljh-config set https.enabled true
    
  2. Set your email address for Let’s Encrypt:
    sudo tljh-config set https.letsencrypt.email <you@example.com>
    

    where <you@example.com> should be replaced by your email address.

  3. Add your domain:
    sudo tljh-config add-item https.letsencrypt.domains <your-hub-address>
    

    where <your-hub-address> should just be the custom part of the address excluding http..., e.g.: datalad-hub.inm7.de

  4. Check the updated configuration to make sure all details are correct.
    sudo tljh-config show
    

    You should see something like this:

    users:
       admin:
       - admin-username
    user_environment:
       default_app: jupyterlab
    https:
       enabled: true
       letsencrypt:
          email: you@example.com
          domains:
          - datalad-hub.inm7.de
    
  5. Finally, reload the proxy to load the new configuration:
    sudo tljh-config reload proxy
    

    This could also require you to refresh the page (with the address now containing https instead of http) and log in again.
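If you set up hubs regularly, the five steps above can be collected into one function. The sketch below only prints the commands by default (a dry run) so that you can review them first; run and configure_https are hypothetical helper names, DRY_RUN is a variable we introduce for this example, and the domain and email address are placeholders.

```shell
# Sketch: the HTTPS steps above as one function. By default it only prints
# the commands (dry run); set DRY_RUN=0 on the hub to execute them.
# run and configure_https are hypothetical helper names.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

configure_https() {
    domain="$1"   # e.g. datalad-hub.inm7.de, without http:// or https://
    email="$2"    # contact address passed to Let's Encrypt
    run sudo tljh-config set https.enabled true &&
    run sudo tljh-config set https.letsencrypt.email "$email" &&
    run sudo tljh-config add-item https.letsencrypt.domains "$domain" &&
    run sudo tljh-config show &&
    run sudo tljh-config reload proxy
}

# Dry run with placeholder values:
configure_https datalad-hub.inm7.de you@example.com
```

Chaining the steps with && stops the sequence at the first failing command, so the proxy is not reloaded with a half-written configuration.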

3.4. Increase cull timeout

JupyterHub will shut down inactive notebook servers to save resources. Although state is restored, having to click “restart my server” after a pause during the workshop may be irritating. To increase the cull timeout, run:

sudo tljh-config set services.cull.timeout <time in seconds>

If you plan to leave a low-resource instance running for users to explore, you may wish to increase the cull timeout only shortly before the workshop.
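Because the timeout is given in seconds, it can help to compute it from the length of a session. The sketch below converts hours to seconds and prints the resulting commands for review; hours_to_seconds is a hypothetical helper, the four-hour value is only an example, and the reload step is our assumption, following the rule from section 3.2 that configuration changes need a reload to take effect.

```shell
# Sketch: derive the cull timeout from a session length in hours.
# hours_to_seconds is a hypothetical helper name.
hours_to_seconds() {
    echo $(( $1 * 3600 ))
}

# Example value: four hours, i.e. one half-day session.
timeout="$(hours_to_seconds 4)"

# Print the commands to run on the hub.
echo "sudo tljh-config set services.cull.timeout $timeout"
echo "sudo tljh-config reload"
```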

3.5. Install required tools into the base environment

This ensures that you have the required tools for the rest of the configuration procedure as well as those required for the workshop.

Run the following lines one by one, typing y for “Yes” whenever you are prompted to continue:

sudo apt update
sudo apt upgrade
sudo apt install zsh tig tree

For the workshop itself, we want to install datalad and its dependencies. Run the following lines one by one:

sudo -E apt install git-annex
sudo -E -H pip3 install datalad

Depending on the content of your workshop, you might also want to install DataLad extensions or other packages. This would be a sensible time to do so. Remember to add sudo -E in front of the install command in order to make the installation apply to all users.
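After installing, a quick check that all commands are actually on the PATH can save time during the workshop. This is a minimal sketch: check_tools is a hypothetical helper name, and the list of tools simply mirrors the packages installed above (extend it with any extensions you add).

```shell
# Sketch: report which of the given commands are missing from PATH.
# check_tools is a hypothetical helper name.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# On the hub, after the installation steps:
check_tools zsh tig tree git git-annex datalad || echo "some tools are missing"
```

Running the same check from a regular (non-admin) user's terminal also confirms that the installation applies to all users.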

3.6. Set up the default shell

We will set up zsh as the default shell for the terminal. Currently, when you open a terminal and run echo $0, it should print /usr/bin/bash, indicating that the current default is bash.

In order to change the default, we need to import some dotfiles (we will use Pure) and tell the terminal and hub how to access them in order to set the default shell to zsh:

  1. Run the following lines one by one in order to import the dotfiles:
    mkdir -p "$HOME/.zsh"
    git clone https://github.com/sindresorhus/pure.git "$HOME/.zsh/pure"
    
  2. Create a .zshrc file in your HOME directory with a basic configuration taken, e.g., from this source:
    wget https://gist.githubusercontent.com/mslw/926f1191e61ef2d705fadab66d19b8ba/raw/.zshrc
    
  3. Lastly, we also need to make the hub aware of which shell it should use when launching a terminal. We do this with a configuration script. Run the following lines one by one:
    mkdir -p "$HOME/.jupyter"
    touch "$HOME/.jupyter/jupyter_notebook_config.py"
    echo 'c.NotebookApp.terminado_settings = {"shell_command": ["/usr/bin/zsh"]}' > "$HOME/.jupyter/jupyter_notebook_config.py"
    
  4. This finalizes the shell setup. You can now navigate to the hub’s admin panel and restart your server in order for the changes to take effect.

3.7. Create a global gitignore

When we work with git repositories and add content or code, we often want certain files or directories not to form part of the git history. We can achieve this by telling git to ignore certain files or directories, via the .gitignore configuration. JupyterHub stores its own metadata files under .ipynb_checkpoints and will create this directory whenever the hub’s text editor is used, or a notebook is opened. To prevent these directories from getting in users’ way when working with DataLad, we will create a global (user-level) gitignore configuration following this method:

echo ".ipynb_checkpoints" > ~/.gitignore_global
git config --global core.excludesfile "~/.gitignore_global"
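To confirm that the rule is active, you can ask git whether it would ignore an .ipynb_checkpoints directory in a scratch repository. This is a minimal sketch, assuming git is installed and the two commands above have been run; path_is_ignored is a hypothetical helper name.

```shell
# Sketch: check whether git ignores a path inside a repository.
# path_is_ignored is a hypothetical helper name.
path_is_ignored() {
    # $1: repository directory, $2: path relative to that directory.
    # check-ignore exits with 0 if the path is ignored and 1 if it is not.
    git -C "$1" check-ignore -q "$2"
}

# Try it in a throwaway repository.
scratch="$(mktemp -d)"
git init -q "$scratch"
if path_is_ignored "$scratch" .ipynb_checkpoints; then
    echo "global ignore rule is active"
else
    echo "not ignored: check core.excludesfile"
fi
rm -rf "$scratch"
```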

3.8. Set up default user settings and data

What we have done so far is set up the base environment that users encounter when they log in to the JupyterHub. Some tools and packages, such as datalad, have been installed for all users (as a result of using pip with sudo -E). Other settings, however, were applied only to the admin account and are not yet the default for newly created users. This includes the default shell setup as well as any data content. To make configuration files and data content available to every user, copy them to the /etc/skel directory: its contents are placed in the $HOME directory of each newly created user.

Run the following code to copy the zsh configuration files, the .gitconfig and .gitignore_global files, and the jupyter_notebook_config.py file:

sudo mkdir -p /etc/skel/.zsh
sudo cp -r "$HOME/.zsh/pure" /etc/skel/.zsh/
sudo cp "$HOME/.zshrc" /etc/skel/
 
sudo cp "$HOME/.gitconfig" /etc/skel/
sudo cp "$HOME/.gitignore_global" /etc/skel/ 

sudo mkdir -p /etc/skel/.jupyter
sudo cp "$HOME/.jupyter/jupyter_notebook_config.py" /etc/skel/.jupyter/
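Before creating any users, it can be useful to confirm that everything has arrived in the skeleton directory, since files added later are not copied to accounts that already exist. The sketch below lists the items copied above and reports any that are missing; check_skel is a hypothetical helper name.

```shell
# Sketch: list which of the expected items are present in a skeleton directory.
# check_skel is a hypothetical helper; the directory defaults to /etc/skel.
check_skel() {
    skel="${1:-/etc/skel}"
    rc=0
    for item in .zsh/pure .zshrc .gitconfig .gitignore_global \
                .jupyter/jupyter_notebook_config.py; do
        if [ -e "$skel/$item" ]; then
            echo "present: $item"
        else
            echo "MISSING: $item"
            rc=1
        fi
    done
    return "$rc"
}

check_skel /etc/skel || echo "copy the missing items before creating users"
```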

Now, at long last, you can create a new user from the JupyterHub admin panel. Using this user, log in from another device or using the browser’s incognito mode, and then check that everything functions as expected.

3.9. Finally, add users!

Depending on group size and logistics, we suggest creating users beforehand via the admin control panel (e.g. using their email addresses, or the username part of these), and letting users create their own passwords once they navigate to the hub URL for the first time. This is the default configuration.
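If you start from a list of participant email addresses, the usernames can be derived with standard text tools. This is a minimal sketch: emails_to_usernames is a hypothetical helper name, the addresses are made-up examples, and you should review the result (for example for name clashes between participants) before entering it in the admin panel.

```shell
# Sketch: derive hub usernames from participant email addresses.
# emails_to_usernames is a hypothetical helper; it reads one address per line.
emails_to_usernames() {
    # Keep the part before "@", lowercase it, drop blank lines and duplicates.
    cut -d@ -f1 | tr '[:upper:]' '[:lower:]' | grep -v '^[[:space:]]*$' | sort -u
}

# Example with made-up addresses:
printf '%s\n' 'Ada.Lovelace@example.org' 'grace@example.org' | emails_to_usernames
```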

Different authentication options are possible (e.g. admin can also authenticate users individually after they sign up, or authentication can happen via Google, GitHub or AWS). Refer to the documentation for further options.

4. Some notes and further options

EC2 instance notes

Troubleshooting