DataLad - Distributed Data Management

Published

May 7, 2024

Modified

October 18, 2024

DataLad is a free and open source distributed data management system that keeps track of your data, creates structure, ensures reproducibility, supports collaboration, and integrates with widely used data infrastructure.

What is DataLad used for?

Datasets

What is a Dataset?

  • …content agnostic collection of files/data …no custom data structure
  • …user works with directories & files on the local file-system
  • …anything in the dataset is version controlled
    • …reset to a previous state …revert changes
    • …rich history of changes to the dataset
  • …typically stored in a decentralized fashion, like Git repositories
  • Datasets support links & nesting…
    • …creates a hierarchy of datasets
    • …enables recursive operations throughout the hierarchy

datasets.datalad.org …super-dataset consisting of datasets from various portals

Installation

Install on Enterprise Linux & Fedora:

sudo dnf install -y git git-annex python3-pip
python3 -m pip install --user datalad

Free & open source (MIT license) …implemented in Python

  • …built on top of Git and git-annex
  • …files stored in git-annex are content-locked
  • …stored under .git/annex/objects (object-tree)
  • …original path preserved as a symlink (managed by Git)
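
The layout described above can be imitated with plain shell commands. This is only an illustration and does not call git-annex: the key name `KEY-data` and the file name `data.bin` are made up, and real git-annex uses content-derived key names inside hashed subdirectories.

```shell
# Illustration only: mimics the annex layout with plain coreutils.
# All file and key names below are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/.git/annex/objects"

# 1. A "large" file enters the dataset.
printf 'large file content\n' > "$demo/data.bin"

# 2. The content moves into the object tree and is made read-only
#    (a stand-in for the content lock).
mv "$demo/data.bin" "$demo/.git/annex/objects/KEY-data"
chmod a-w "$demo/.git/annex/objects/KEY-data"

# 3. The original path survives as a relative symlink into the object tree.
ln -s ".git/annex/objects/KEY-data" "$demo/data.bin"

readlink "$demo/data.bin"    # prints .git/annex/objects/KEY-data
cat "$demo/data.bin"         # content is still readable through the link
```

Git itself only has to version the small symlink, while the bulky content stays in the object tree.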

Usage

Functionality available through a single datalad command:

# display help text
datalad --help

# more comprehensive information on a sub-command
datalad $command --help

create

# create a new, empty dataset in specified path
>>> datalad create -c text2git $path
# …generates version control related data
>>> ls -a1 $path
.datalad/
.git/
.gitattributes

DataLad's core data type is called a dataset

  • …Git repository with data annex (to track large files)
  • …uses git init and git annex init

# transform an existing directory into a dataset
datalad create -f $path #...

Create a sibling dataset to…

  • …publish a dataset on a remote resource
  • …update a dataset from a remote resource

status & save

# list files in annex
datalad status --annex all

The following content states are distinguished:

  • ‘clean’
  • ‘added’
  • ‘modified’
  • ‘deleted’
  • ‘untracked’ …unknown content; run save to start tracking it

save commits modifications in the dataset

datalad save -m "some meaningful message"

  • There is no staging area in DataLad
  • …datalad save combines a git add and a git commit
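
As a rough analogue, the two Git steps that a single save replaces can be sketched with plain Git. This is an assumption-laden illustration: it uses a throwaway repository, a made-up file `notes.txt`, and a demo identity, and it leaves out git-annex, which handles large files in a real dataset.

```shell
# Rough plain-Git analogue of `datalad save -m "..."`.
# Assumption: a throwaway repository with one small text file.
demo=$(mktemp -d)
cd "$demo"
git init -q

echo "first draft" > notes.txt        # new, untracked content

# The two steps that a single `datalad save` combines:
git add notes.txt
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q -m "some meaningful message"

git log --oneline                     # one commit recorded
git status --porcelain                # empty output: working tree is clean
```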

clone & update

Access an existing dataset & stay up-to-date:

git clone $path
datalad clone $path

# update from a particular sibling and merge the changes
datalad update -s $sibling --how merge

siblings & push

List all known siblings by default

  • …one line per sibling …name and URL
  • + and - labels indicate the presence or absence of a remote data annex

datalad siblings

What is a sibling?

  • …dataset clone at another location (different local or remote path)
  • …changes can be retrieved/pushed between a dataset and its sibling
  • …equivalent of a remote in Git

datalad siblings add --name $sibling --url $dataset

datalad push --to $sibling

push publishes a dataset's content

  • …uses git push and git annex copy

get & drop

# get content to repository
datalad get . -r

# drop content from repository
datalad drop $path

  • get …obtain data from some source
    • …by default recursive (not for sub-datasets)
  • drop …counterpart of ‘get’
    • …safe-by-default …considers state of known siblings
    • --reckless availability disables the check for a minimum number of remote copies

unlock

Unlock files of a dataset in order to be able to edit the actual content…

datalad unlock $path

run

Machine-readable, re-executable provenance…

  • …document the origin of files downloaded from a web resource
  • …reference a containerized software environment

rerun re-executes previously recorded computations
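
A minimal sketch of recording and replaying a command might look as follows. It assumes DataLad is installed and that the functions are called inside a clean dataset; the function names, file arguments, and commit message are hypothetical, and only the `-m`, `--input`, and `--output` options of `datalad run` are used.

```shell
# Sketch only: requires DataLad and must be called inside a clean dataset.
# Function names, file names, and the message are hypothetical.
record_line_count() {
    # $1: input file tracked in the dataset, $2: output file to (re)create
    datalad run \
        -m "count lines in $1" \
        --input "$1" \
        --output "$2" \
        "wc -l < $1 > $2"
}

replay_last_run() {
    # Re-execute the command recorded in the most recent commit (HEAD).
    datalad rerun
}
```

Inside a dataset, `record_line_count data.csv count.txt` would save the result together with a run record in the commit, and `replay_last_run` would execute that recorded command again.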

Public Storage

Third-party integration with storage providers

  • …enables storage beyond local infrastructure like a workstation or a compute cluster
  • …for hosting datasets on public cloud infrastructure like GitHub, ownCloud, Google Drive, Amazon S3, etc.
  • …to allow seamless integration with existing storage infrastructure

# clone a repository from GitHub
datalad clone https://github.com/psychoinformatics-de/studyforrest-data-phase2.git

# change into the repository
cd studyforrest-data-phase2

# begin to download data
datalad get sub-01/ses-localizer/func/sub-01_ses-localizer_task-*