DataLad - Distributed Data Management

Published

May 7, 2024

Modified

October 18, 2024

DataLad ¹ is a free and open source distributed data management system that keeps track of your data, creates structure, ensures reproducibility, supports collaboration, and integrates with widely used data infrastructure.

What is DataLad used for?

…version control of arbitrarily large files
…data & software kept alongside each other in a dataset
…transport mechanisms to share & retrieve data
…computationally reproducible data analysis in scientific research
…however, the tools are completely domain-agnostic

Datasets

What is a Dataset?

…content agnostic collection of files/data …no custom data structure
…user works with directories & files on the local file-system
…anything in the dataset is version controlled
- …reset to a previous state …revert changes
- …rich history of changes to the dataset
…typically stored in a decentralized fashion, like Git repositories
Datasets support links & nesting…
- …creates a hierarchy of datasets
- …enables recursive operations throughout the hierarchy

datasets.datalad.org …super-dataset consisting of datasets from various portals

Installation

Install on Enterprise Linux & Fedora:

sudo dnf install -y git git-annex python3-pip
python3 -m pip install --user datalad

Free & open source (MIT license) …implemented in Python

…built on top of git and git-annex ²
…files stored in git-annex are content-locked
…stored under .git/annex/objects (object-tree)
…original path preserved as a symlink (managed by Git)
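
The symlink layout described here can be imitated with plain coreutils. This is only a simplified sketch of the idea, not git-annex's actual key format or directory layout; the `objects/` directory, the file name `data.bin`, and the use of a SHA-256 checksum as the object name are assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# work in a throw-away directory
workdir=$(mktemp -d)
cd "$workdir"

# a file whose content should be kept out of regular version control
printf 'large file content\n' > data.bin

# move the content into an object tree, named after its checksum
# (hypothetical layout; git-annex uses its own keys and hashed directories)
checksum=$(sha256sum data.bin | cut -d' ' -f1)
mkdir -p objects
mv data.bin "objects/$checksum"
chmod a-w "objects/$checksum"   # content-locked: remove write permission

# the original path survives as a symlink pointing into the object tree
ln -s "objects/$checksum" data.bin

link_target=$(readlink data.bin)
content=$(cat data.bin)
echo "data.bin -> $link_target"
```

Only the small symlink would need to be tracked by Git in such a scheme, while the content itself lives in the object tree and can be fetched or removed independently.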

Usage

Functionality available through a single datalad command:

# display help text
datalad --help

# more comprehensive information on a sub-command
datalad $command --help

`create`

# create a new, empty dataset in specified path
>>> datalad create -c text2git $path
# …generates version control related data
>>> ls -a1 $path
.datalad/
.git/
.gitattributes

DataLad's core data type is called a dataset…

…Git repository with data annex (to track large files)
…uses git init and git annex init

# transform an existing directory into a dataset
datalad create -f $path #...

Create a sibling dataset to…

…publish a dataset to a remote resource
…update a dataset from a remote resource

`status` & `save`

# report the dataset state, including information on annexed files
datalad status --annex all

The following content states are distinguished:

‘clean’
‘added’
‘modified’
‘deleted’
‘untracked’ …unknown content; run save to start tracking it

save commits modifications in the dataset

datalad save -m "some meaningful message"

There is no staging area in DataLad
…datalad save combines a git add and a git commit
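
In plain Git terms, the effect of save can be approximated as below. This is a minimal sketch, assuming a throw-away repository in a temporary directory, a hypothetical file `notes.txt`, and a placeholder committer identity; `datalad save` additionally takes care of annexed files and sub-datasets, which this sketch does not.

```shell
#!/usr/bin/env bash
set -euo pipefail

# throw-away Git repository
repo=$(mktemp -d)
cd "$repo"
git init -q .

printf 'first note\n' > notes.txt

# roughly what `datalad save -m "add notes"` does for a file kept in Git:
# stage the change and commit it in one go
git add notes.txt
git -c user.name="Example User" -c user.email="user@example.org" \
    commit -q -m "add notes"

# nothing is left modified or untracked afterwards
pending=$(git status --porcelain)
subject=$(git log -1 --format=%s)
echo "last commit: $subject"
```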

`clone` & `update`

Access an existing dataset & stay up-to-date:

git clone $path
datalad clone $path

# update from a particular sibling and merge the changes
datalad update -s $sibling --how merge

`siblings` & `push`

siblings lists all known siblings by default

…one line per sibling …name and URL
…+ and - labels indicate the presence or absence of a remote data annex

datalad siblings

What is a sibling?

…dataset clone at another location (different local or remote path)
…changes can be retrieved/pushed between a dataset and its sibling
…equivalent of a remote in Git

datalad siblings add --name $sibling --url $dataset

datalad push --to $sibling

push publishes dataset content to a sibling

…uses git push, and git annex copy

`get` & `drop`

# get content to repository
datalad get . -r

# drop content from repository
datalad drop $path

get …obtains dataset content from a known source
- …recursive within a dataset by default (use -r to descend into sub-datasets)
drop …counterpart of ‘get’
- …safe by default …considers the state of known siblings
- --reckless availability disables the check for a minimum number of remote copies

`unlock`

Unlock files of a dataset in order to be able to edit the actual content…

datalad unlock $path
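
A typical edit cycle for an annexed file might look like the following sketch. It assumes DataLad and git-annex are installed; the dataset location, the file name `data.dat`, the commit messages, and the Git identity are placeholders. The script does nothing if `datalad` is not available.

```shell
#!/usr/bin/env bash
set -euo pipefail

if command -v datalad >/dev/null 2>&1; then
    # placeholder identity, only relevant if Git is not configured yet
    export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.org"
    export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.org"

    ds=$(mktemp -d)
    datalad create "$ds"
    cd "$ds"

    printf 'v1\n' > data.dat
    datalad save -m "add data"      # content moves into the annex

    datalad unlock data.dat         # make the file content editable
    printf 'v2\n' >> data.dat
    datalad save -m "update data"   # record the modified content
    unlock_status="done"
else
    unlock_status="skipped"
fi
echo "unlock example: $unlock_status"
```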

`run`

Machine-readable, re-executable provenance…

…document the origin of files downloaded from a web resource
…reference a containerized software environment

rerun re-executes previously recorded computations
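
A round trip of recording and re-executing a command might look like the following sketch. It assumes DataLad and git-annex are installed; the dataset location, the file names `input.txt` and `output.txt`, the commit messages, and the Git identity are placeholders. The script does nothing if `datalad` is not available.

```shell
#!/usr/bin/env bash
set -euo pipefail

if command -v datalad >/dev/null 2>&1; then
    # placeholder identity, only relevant if Git is not configured yet
    export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.org"
    export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.org"

    ds=$(mktemp -d)
    datalad create -c text2git "$ds"
    cd "$ds"

    printf 'b\na\n' > input.txt
    datalad save -m "add input"

    # record the command together with its declared inputs and outputs
    datalad run -m "sort input" \
        --input input.txt --output output.txt \
        "sort input.txt > output.txt"

    # re-execute the command recorded in the most recent commit
    datalad rerun
    run_status="done"
else
    run_status="skipped"
fi
echo "run example: $run_status"
```

Declaring inputs and outputs is optional, but it lets DataLad retrieve the inputs and prepare the outputs for modification before the command is executed.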

Public Storage

Third-party integration with storage providers…

…enables storage beyond local infrastructure like a workstation or a compute cluster
…for hosting datasets on public cloud infrastructure like GitHub, ownCloud, Google Drive, Amazon S3, etc.
…to allow seamless integration with existing storage infrastructure

# clone a repository from GitHub
datalad clone https://github.com/psychoinformatics-de/studyforrest-data-phase2.git

# change into the repository 
cd studyforrest-data-phase2

# begin to download data
datalad get sub-01/ses-localizer/func/sub-01_ses-localizer_task-*

Footnotes

¹ DataLad Project
https://www.datalad.org
https://github.com/datalad
http://docs.datalad.org/en/stable/cmdline.html
https://handbook.datalad.org
https://github.com/datalad/tutorials
https://psychoinformatics-de.github.io/rdm-course
https://www.youtube.com/c/DataLad

² git-annex Project
https://git-annex.branchable.com
