| 
        Content tracking with DataLad
       | 
      
        
        
        With version control, the lineage of all files is preserved 
 
        
        You can record and revert changes made to the dataset 
 
        
        DataLad can be used to version control a dataset and all its files 
 
        
        You can manually save changes with datalad save 
 
        
        You can use datalad download-url to preserve file origin 
 
        
        You can use datalad run to capture outputs of a command 
 
        
        “Large” files are annexed and protected from accidental modifications 
 
        
         
       | 
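The commands named in these points can be strung together into one short session. The following is a minimal sketch in Python that only assembles the command lines and prints them. The URL, the script path, and the output file are hypothetical placeholders, and the `execute` helper is an illustrative convenience, not part of DataLad.

```python
import shlex
import shutil
import subprocess


def tracking_session(url, script, output):
    """Build the DataLad calls for a small content-tracking session.

    All arguments are hypothetical placeholders chosen by the caller.
    """
    return [
        # Record a manual change (here: a newly written script).
        ["datalad", "save", "-m", "Add analysis script", script],
        # Download a file and record its origin URL in the history.
        ["datalad", "download-url", "-m", "Add input data", url],
        # Execute a command and capture its output with a run record.
        ["datalad", "run", "-m", "Run analysis", "--output", output,
         f"python {script}"],
    ]


def execute(commands, dataset_dir):
    """Run the commands inside dataset_dir; return False if datalad is missing."""
    if shutil.which("datalad") is None:
        return False
    for cmd in commands:
        subprocess.run(cmd, cwd=dataset_dir, check=True)
    return True


commands = tracking_session(
    url="https://example.com/data.csv",  # placeholder URL
    script="code/analysis.py",           # placeholder script path
    output="results/summary.txt",        # placeholder output file
)
for cmd in commands:
    print(shlex.join(cmd))
```

Calling `execute(commands, dataset_dir)` would run the same calls for real, assuming `dataset_dir` is an existing dataset (for example one made with `datalad create`) that already contains the script.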
    
  
  
  
    
      | 
        Structuring data
       | 
      
        
        
        Use filenames that are machine-readable, human-readable, and easy to sort and search 
 
        
        Avoid including identifying information in filenames from the get-go 
 
        
        Files can be categorized as text or binary 
 
        
        Lightweight text files can go a long way 
 
        
        A well thought-out directory structure simplifies computation 
 
        
        Be modular to facilitate reuse 
 
        
         
       | 
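One way to meet the filename recommendations is to generate names from a fixed pattern instead of typing them by hand. The sketch below assumes a key-value scheme with `sub-`, `ses-`, and `desc-` chunks; these keys and the `build_filename` helper are illustrative choices, not something the lesson prescribes. Numeric codes stand in for participant names, so no identifying information enters the filename.

```python
import re


def build_filename(subject, session, description, extension):
    """Compose a filename from key-value chunks joined by underscores.

    Numbers are zero-padded so alphabetical order matches numerical order.
    """
    # Keep only lowercase letters and digits in the free-text part.
    slug = re.sub(r"[^a-z0-9]+", "", description.lower())
    return f"sub-{subject:02d}_ses-{session:02d}_desc-{slug}.{extension}"


names = [build_filename(s, 1, "Resting State", "tsv") for s in (10, 2, 1)]
for name in sorted(names):
    print(name)
# sub-01_ses-01_desc-restingstate.tsv
# sub-02_ses-01_desc-restingstate.tsv
# sub-10_ses-01_desc-restingstate.tsv
```

Because underscores separate chunks and hyphens separate keys from values, such names can be split back into their parts with simple string operations.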
    
  
  
  
    
      | 
        Remote collaboration
       | 
      
        
        
        A dataset can be published with datalad push 
 
        
        A dataset can be cloned with datalad clone 
 
        
        The clone operation does not obtain annexed file content; the content can be obtained selectively 
 
        
        Annexed file contents can be removed (drop) and reobtained (get) as long as a copy exists somewhere 
 
        
        A dataset can be synchronised with its copy (sibling) with datalad update 
 
        
        GIN is one of the platforms with which DataLad can interact 
 
        
        GIN can serve as a store for both annexed and non-annexed contents 
 
        
         
       | 
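The round trip described in these points (clone, get, drop, update, push) can be sketched as a list of command lines. This is a minimal sketch that only prints the calls. The GIN URL, the directory name, and the file path are hypothetical placeholders, `origin` is assumed to be the name of the sibling the clone came from, and `--how merge` assumes a recent DataLad version.

```python
import shlex


def collaboration_session(source_url, clone_dir, annexed_file, sibling):
    """Build the DataLad calls for a clone / get / drop / update / push cycle.

    Every call after the first is meant to be run from inside clone_dir.
    """
    return [
        # Obtain the dataset; annexed file content is not downloaded yet.
        ["datalad", "clone", source_url, clone_dir],
        # Selectively obtain the content of one annexed file.
        ["datalad", "get", annexed_file],
        # Free local space; works as long as another copy is known to exist.
        ["datalad", "drop", annexed_file],
        # Bring in changes made in the sibling since cloning.
        ["datalad", "update", "--how", "merge", "-s", sibling],
        # Publish local changes back to the sibling.
        ["datalad", "push", "--to", sibling],
    ]


session = collaboration_session(
    source_url="https://gin.g-node.org/example-user/example-dataset",  # placeholder
    clone_dir="example-dataset",                                       # placeholder
    annexed_file="inputs/image.png",                                   # placeholder
    sibling="origin",
)
for cmd in session:
    print(shlex.join(cmd))
```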
    
  
  
  
    
      | 
        Dataset management
       | 
      
        
        
        A dataset can contain other datasets 
 
        
        The super- and sub-datasets have separate histories 
 
        
        The superdataset only contains a reference to a specific commit in the subdataset’s history 
 
        
         
       | 
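The relationship between the two histories can be illustrated with a toy model. The `ToyDataset` class below is purely hypothetical and does not reflect how DataLad or Git store anything internally; it only shows that new commits in a subdataset leave the reference recorded in the superdataset unchanged until the superdataset is saved again.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Toy stand-in for a dataset: a linear history of commit labels."""
    history: list = field(default_factory=list)
    # Maps a subdataset path to the subdataset commit recorded at the last save.
    recorded: dict = field(default_factory=dict)

    def commit(self, label):
        self.history.append(label)
        return label

    @property
    def head(self):
        return self.history[-1]


superds = ToyDataset()
subds = ToyDataset()

subds.commit("sub-a")
# Saving in the superdataset records which subdataset commit it refers to.
superds.recorded["inputs/raw"] = subds.head
superds.commit("register subdataset at sub-a")

# New work in the subdataset extends only the subdataset's own history.
subds.commit("sub-b")

print(superds.recorded["inputs/raw"])            # sub-a
print(len(superds.history), len(subds.history))  # 1 2
```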
    
  
  
  
    
      | 
        Extras: The Basics of Branching
       | 
      
        
        
        Your dataset contains branches. The default branch is usually called either main or master. 
 
        
        There’s no limit to the number of branches one can have, and each branch can become an alternative timeline with developments independent from the developments in other branches. 
 
        
        Branches can be merged to integrate the changes from one branch into another. 
 
        
        Using branches is fundamental in collaborative workflows where many collaborators start from a clean default branch and propose new changes to a central dataset sibling. 
 
        
        Typically, central datasets are hosted on services like GitHub, GitLab, or GIN; when collaborators push branches with new changes, these services help to create pull requests. 
 
        
         
       | 
    
  
  
  
    
      | 
        Extras: Removing datasets and files
       | 
      
        
        
        Your dataset keeps annexed data safe and will refuse to perform operations that could cause data loss 
 
        
        Removing files or datasets with known copies is easy; removing files or datasets without known copies requires bypassing safety checks 
 
        
        There are two ‘destructive’ commands: drop and remove 
 
        
        drop is the inverse of get, and remove is the inverse of clone 
 
        
        Both commands have a --reckless [MODE] parameter to override safety checks 
 
        
         
       |
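The two destructive commands and their override can be sketched as command lines. This is a minimal sketch that only builds and prints the calls. The `destructive_command` helper and the paths are hypothetical, and the mode names listed are the ones assumed to be accepted by `--reckless` in recent DataLad versions.

```python
import shlex

# Assumed values for --reckless in recent DataLad versions.
RECKLESS_MODES = {"modification", "availability", "undead", "kill"}


def destructive_command(action, path, reckless=None):
    """Build a `datalad drop` or `datalad remove` call for the given path."""
    if action not in {"drop", "remove"}:
        raise ValueError(f"unsupported action: {action}")
    cmd = ["datalad", action]
    if reckless is not None:
        if reckless not in RECKLESS_MODES:
            raise ValueError(f"unknown reckless mode: {reckless}")
        # Overrides a safety check; content may be lost for good.
        cmd += ["--reckless", reckless]
    cmd.append(path)
    return cmd


# Safe: DataLad refuses unless another copy of the content is known.
print(shlex.join(destructive_command("drop", "inputs/image.png")))
# Unsafe: skips the check for remaining copies.
print(shlex.join(destructive_command("drop", "inputs/image.png", "availability")))
```

Keeping the mode explicit in the call makes it visible in scripts and logs that a safety check was skipped on purpose.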