Data Download from the Grid

Preparation

The Grid is a distributed infrastructure composed of many clusters and research organizations. Because of this decentralized nature, there is no central user management. To securely identify users and control access, the system relies on Grid certificates and Virtual Organizations (VOs).

Your digital identity starts with a private key — known only to you. You then obtain a Grid certificate from a trusted Certificate Authority (CA). This certificate contains your name and institution, confirming that the holder of the private key is indeed the identified person, certified by the CA.

In short:

  • the Grid certificate provides authentication (your identity, like a passport),
  • the Virtual Organization (VO) provides authorization (your right to access, like a visa).

Steps to Access the Grid

To gain access to the AGATA Grid, three main steps are required:

  1. Obtain a Grid certificate so that you can be authenticated on the Grid.
  2. Join the AGATA Virtual Organization (VO) to be authorized for Grid access.
  3. Use a proper Grid User Interface (UI) — either installed locally by your IT services or obtained from the official AGATA Grid Docker image.

This manual explains how to use the Docker image and how to download AGATA data from the Grid using the collaboration’s official Python script.


Getting a Grid Certificate

Grid certificates are issued by Certificate Authorities (CAs). The procedure differs from one institution to another (university, lab, etc.). If possible, ask local Grid users or IT specialists for guidance. Note that AGATA does not issue Grid certificates itself.

Once obtained, import your certificate into your browser — this allows you to request membership in the AGATA VO. You can then export it in .p12 format for later use.

Export procedure:

  • Firefox
      • Go to Edit → Preferences → Privacy & Security → Certificates → View Certificates.
      • Under Your Certificates, select your certificate and click Export.
        Save it as a .p12 file (you will be prompted to set a password).

  • Chrome
      • Go to chrome://settings/certificates
      • Under Your Certificates, select your certificate and click Export.
        Save it as a .p12 file and set a password.

The .p12 file will later be converted into the format required by the Grid UI. This conversion is handled automatically by a script inside the Docker image.

For more details, see the original documentation.


Joining the AGATA Virtual Organization

A Virtual Organization (VO) is a group of distributed users sharing common objectives and Grid resources. Each Grid user belongs to one or more VOs, which define the compute and storage resources they can access.

You can apply for AGATA VO membership only after obtaining and installing your Grid certificate in your browser.

Membership is managed via the INDIGO IAM server. The following steps describe how to join the AGATA VO:

Single Sign-On (SSO) Authentication

The first login must be done through an identity provider. Both EDUGAIN and ORCID are available.
Visit the INDIGO IAM server, choose one of these providers, and complete the registration form.

Once your registration is validated, you will have access to the IAM dashboard.

Linking Your Certificate

Next, link your Grid certificate by clicking the “Link Certificate” button. Once linked, you can connect to the IAM server either through SSO or using your certificate directly.

Your certificate will then appear on your dashboard.

Note: if you encounter issues or need assistance with AGATA VO registration, please contact agatadp@ip2i.in2p3.fr.


Understanding Grid Data Management

The dCache storage system used by AGATA consists of two tiers:

  • Magnetic tape storage (long-term archive)
  • Hard-disk storage (active workspace)

Files stored on tape must be copied to disk before use — this operation is called staging or bringing a file online.

  • NEARLINE: file stored on tape only.
  • ONLINE: file available on disk.

Before downloading data, you must perform a --bring_online operation to stage files from tape to disk. Each staged file has a pin lifetime (typically one week). Once this time expires, the system may automatically purge files to free disk space. You can always stage them again later if needed.

When staging pools are full, new requests wait until existing pin lifetimes expire. The AGATA script automatically unpins files after download, but if files were staged without being downloaded, you can manually release them using the --release command.
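
The staging rules above can be summarized as a small decision rule. The sketch below is illustrative only: the helper names `pin_expiry` and `next_action` and the returned strings are hypothetical, and only the ONLINE/NEARLINE states and the typical one-week pin lifetime come from the text.

```python
from datetime import datetime, timedelta

# Typical pin lifetime quoted in the text; the real value is set by the storage system.
PIN_LIFETIME = timedelta(days=7)


def pin_expiry(staged_at: datetime) -> datetime:
    """Return the time after which a staged file may be purged from disk."""
    return staged_at + PIN_LIFETIME


def next_action(locality: str) -> str:
    """Map a file locality to the step a user would take next (hypothetical helper)."""
    if locality == "ONLINE":
        return "download"      # the file is on disk and can be copied
    if locality == "NEARLINE":
        return "bring_online"  # the file is on tape only and must be staged first
    return "check_status"      # any other state: query the status again
```

For instance, a file staged on 17 October at noon would keep its pin until 24 October at noon under the one-week assumption.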


AGATA Grid Docker Image

Downloading the Docker Image

The AGATA collaboration provides a ready-to-use Docker image that includes the complete Grid User Interface (UI). It is built and maintained in the public AGATA repository:
gitlab.in2p3.fr/ip2igamma/docker_images – branch agata_grid_IAM

Requirement: Docker must be installed on your system.

To install the AGATA Grid Docker image, run:

docker pull gitlab-registry.in2p3.fr/ip2igamma/docker_images:agata_grid_IAM

Starting the Docker Image

Before launching the container, define two environment variables on your host system:

  • CERTIF_DIR: the directory on your computer that contains your .p12 Grid certificate. It is needed only once, to produce the .pem files required by the Grid UI. The folder is mounted inside the container at /root.

  • DATA_DIR: the directory where Grid data will be downloaded. It is mounted at /data inside the container.

Example setup:

export CERTIF_DIR=/path/to/my/certificates
export DATA_DIR=/path/to/data

Then start the container:

docker run -it --rm \
  -v "${CERTIF_DIR}":/root \
  -v "${DATA_DIR}":/data \
  gitlab-registry.in2p3.fr/ip2igamma/docker_images:agata_grid_IAM
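
When the container is started from a script, the same command can be assembled and checked programmatically. The following is a minimal sketch, not part of the AGATA tools: the function name `docker_run_command` is hypothetical, and it only builds the argument list without launching Docker.

```python
import shlex

IMAGE = "gitlab-registry.in2p3.fr/ip2igamma/docker_images:agata_grid_IAM"


def docker_run_command(env: dict) -> list:
    """Build the `docker run` argument list from CERTIF_DIR and DATA_DIR."""
    missing = [name for name in ("CERTIF_DIR", "DATA_DIR") if not env.get(name)]
    if missing:
        raise ValueError("undefined environment variable(s): " + ", ".join(missing))
    return [
        "docker", "run", "-it", "--rm",
        "-v", f"{env['CERTIF_DIR']}:/root",  # certificate folder, mounted at /root
        "-v", f"{env['DATA_DIR']}:/data",    # download folder, mounted at /data
        IMAGE,
    ]


example = docker_run_command(
    {"CERTIF_DIR": "/path/to/my/certificates", "DATA_DIR": "/path/to/data"}
)
print(shlex.join(example))
```

Passing `os.environ` instead of the literal dictionary would pick up the variables exported in the shell, and the explicit check fails early if one of them is not defined.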

Upon startup, the working directory should be /opt/AgataGrid, containing the following files:

root@:/opt/AgataGrid$ ls -l
total 40
-rw-r--r-- 1 root root   210 Oct 17 15:09 Grid.conf
-rwxr-xr-x 1 root root 29753 Oct 17 14:53 GridDataSync.py
-rwxr-xr-x 1 root root   929 Oct 17 14:53 do_grid_completion.sh
-rwxrwxrwx 1 root root  1274 Oct 17 14:53 gen_cert.sh

File overview:

  • Grid.conf — default script configuration file.
  • GridDataSync.py — main Python script for browsing and downloading AGATA data.
  • do_grid_completion.sh — shell auto-completion helper (enables tab-completion of options).
  • gen_cert.sh — helper script that converts your .p12 certificate into the .pem format required by the Grid UI.

Generating Certificate Files in Grid Format

This step is required only once.

It converts your .p12 certificate into two .pem files (usercert.pem and userkey.pem),
which are stored in the .globus directory inside the container.

If .globus already exists, delete it before running the script.

Execute:

root@:/opt/AgataGrid$ ./gen_cert.sh /root/DUDOUET_GRID_2023.p12
Enter Import Password:
MAC verified OK
Enter Import Password:
MAC verified OK
Enter PEM pass phrase:
Verifying - Enter PEM pass phrase:
The files usercert.pem and userkey.pem have been created and moved to /root/.globus

This script will:

  1. Ask twice for your .p12 certificate password (once per .pem file).
  2. Ask you to define a new passphrase for the .pem files.
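
The content of gen_cert.sh is not reproduced in this manual. As an assumption, the sketch below lists standard `openssl pkcs12` invocations that would lead to the same sequence of prompts (import password twice, then a new PEM pass phrase). The function name `p12_to_pem_commands` and the example file name are hypothetical, and the commands are only printed, not executed.

```python
import shlex


def p12_to_pem_commands(p12_path: str, out_dir: str = "/root/.globus") -> list:
    """Return the commands of a typical .p12 to .pem conversion (not executed here)."""
    cert = f"{out_dir}/usercert.pem"
    key = f"{out_dir}/userkey.pem"
    return [
        ["mkdir", "-p", out_dir],
        # Extract the user certificate only (asks for the .p12 import password).
        ["openssl", "pkcs12", "-in", p12_path, "-clcerts", "-nokeys", "-out", cert],
        # Extract the private key only (asks for the import password,
        # then for a new PEM pass phrase).
        ["openssl", "pkcs12", "-in", p12_path, "-nocerts", "-out", key],
        # Restrictive permissions, as commonly expected by Grid tools (assumption).
        ["chmod", "644", cert],
        ["chmod", "400", key],
    ]


for command in p12_to_pem_commands("/root/my_certificate.p12"):
    print(shlex.join(command))
```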

Testing the Certificate: Creating a New Proxy

Once your .pem certificates are created, you can test the setup by generating a new Grid proxy:

root@:/opt/AgataGrid$ GridDataSync.py --new_proxy

Enter GRID pass phrase for this identity:
Contacting voms-agata.ijclab.in2p3.fr:443 [/DC=org/DC=terena/DC=tcs/C=FR/L=Paris/O=Centre national de la recherche scientifique/CN=voms-agata.ijclab.in2p3.fr] "vo.agata.org"...
Remote VOMS server contacted succesfully.


Created proxy in /tmp/x509up_u0.

Your proxy is valid until Mon Oct 20 14:55:27 UTC 2025

If this message appears without error, your Docker environment and certificate are correctly configured.
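
If the proxy lifetime has to be checked from another program, the expiry line shown above can be parsed directly. This is a minimal sketch that assumes the line always has the format printed in the example and that the time zone is UTC; the helper names `proxy_expiry` and `hours_left` are hypothetical.

```python
from datetime import datetime, timezone

PREFIX = "Your proxy is valid until "


def proxy_expiry(line: str) -> datetime:
    """Parse the expiry time from the 'valid until' line printed by --new_proxy."""
    if not line.startswith(PREFIX):
        raise ValueError("not a proxy validity line")
    stamp = line[len(PREFIX):].strip()
    # Example stamp: 'Mon Oct 20 14:55:27 UTC 2025' (English day and month names assumed).
    parsed = datetime.strptime(stamp, "%a %b %d %H:%M:%S UTC %Y")
    return parsed.replace(tzinfo=timezone.utc)


def hours_left(expiry: datetime, now: datetime) -> float:
    """Remaining validity in hours (negative once the proxy has expired)."""
    return (expiry - now).total_seconds() / 3600.0
```

Calling `hours_left(expiry, datetime.now(timezone.utc))` gives the time remaining before a new proxy is needed.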

Script User Guide

Presentation of the Different Options in the Script

To show the available options in the terminal, use the --help option:

root@:/opt/AgataGrid$ GridDataSync.py --help

Usage: GridDataSync.py [options]

Browse and download AGATA data from the Grid

Options:
  -h, --help         show this help message and exit
  --new_proxy        create a new proxy
  --proxy_status     print the proxy status
  --from_LYON        download data from CC Lyon (default)
  --from_CNAF        download data from Bologna
  --show_conf        show the current configuration (paths, patterns)
  --ls_dir           list the content of the given folder
  --input_dir=path   copy grid data from distant path
  --output_dir=path  copy grid data into local path
  --exc=patt         exclude patterns separated by ":"; skip files containing these patterns (use none to reset)
  --inc=patt         include patterns separated by ":"; only include files matching these patterns (use none to reset)
                     (check https://regexone.com/references/python for Python regex format)
  --build_list       build the list of files to be downloaded (mandatory before start)
  --bring_online     move files from tape to disks (make the copy of files faster)
  --check_status     check the status of the files to be downloaded (locality, downloaded...)
  --verbose          increase verbosity
  --start            launch the download of the files from the Grid
  --force            force the download of offline files (much slower)
  --nochecksum       remove the checksum on each downloaded file
  --overwrite        overwrite already downloaded files
  --release          release all files from disk
  --threads=N        number of threads to use for parallel downloads (default: 4)
  --streams=N        number of simultaneous streams per file (default: 1)

The following subsections describe the different options in detail.

help:

Prints the help message above. It will also be printed if no option is given.

new_proxy:

Creates a new proxy, valid for 72 hours. You will be asked for the pass phrase protecting your .pem certificate files.

proxy_status:

Shows the status of your proxy, including its remaining validity time.

from_LYON / from_CNAF:

Selects the data source. By default, downloads are done from the Lyon Tier-1 site (--from_LYON).
Use --from_CNAF to download from the Bologna Tier-1 site.

show_conf:

Displays the current configuration. The configuration file (Grid.conf) is automatically created on the first execution and updated as needed.

root@:/opt/AgataGrid$ GridDataSync.py --show_conf
********************************
** GridDataSync configuration **
********************************

SERVER              : https://ccdavegee.in2p3.fr:2880/
BASE_DIR_ON_GRID    : agata/
OUTPUTDIR           : /data/

ls_dir:

Lists the content of a folder on the Grid. Without argument, it prints the base directory; with a subfolder name, it prints its content.
This command helps navigate the Grid directory tree.

root@:/opt/AgataGrid$ GridDataSync.py --ls_dir
********************************
** GridDataSync configuration **
********************************

SERVER              : https://ccdavegee.in2p3.fr:2880/
BASE_DIR_ON_GRID    : agata/
OUTPUTDIR           : /data/

 -- List content of folder: https://ccdavegee.in2p3.fr:2880/agata/
80.6 MB    31 Mar 2011 test_small_file_johan.tar
158.0 B    15 Oct 2013 host_adler32
Folder     02 Sep 2010 lschwarz
Folder     08 Sep 2010 pietro_test
Folder     21 Sep 2010 2010_week19
Folder     22 Oct 2010 Milano
Folder     25 Oct 2010 2010_week28
Folder     24 Mar 2011 2010_week48
Folder     01 Apr 2011 252Cf
Folder     22 Feb 2013 generated
Folder     26 Apr 2013 kaci-test
...

With a subfolder as argument:

root@:/opt/AgataGrid$ GridDataSync.py --ls_dir e680/e680
********************************
** GridDataSync configuration **
********************************

SERVER              : https://ccdavegee.in2p3.fr:2880/
BASE_DIR_ON_GRID    : agata/
OUTPUTDIR           : /data/

 -- List content of folder: https://ccdavegee.in2p3.fr:2880/agata/e680/e680
Folder     09 Jun 2015 ReplayFromAnalysisServer
Folder     09 Jun 2015 Vamos
Folder     09 Jun 2015 NarvalAGATAsolo
Folder     09 Jun 2015 Replay
Folder     09 Jun 2015 run_0001.dat.07-05-15_14h17m05s
Folder     09 Jun 2015 run_0002.dat.07-05-15_14h22m27s
Folder     09 Jun 2015 run_0003.dat.11-05-15_09h55m42s
Folder     09 Jun 2015 run_0004.dat.11-05-15_10h06m11s
...
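
The listings above suggest that the listed URL is the concatenation of SERVER, BASE_DIR_ON_GRID, and the optional subfolder. The sketch below reproduces that composition under this assumption; the function name `grid_url` is hypothetical and is not taken from GridDataSync.py.

```python
import posixpath

SERVER = "https://ccdavegee.in2p3.fr:2880/"
BASE_DIR_ON_GRID = "agata/"


def grid_url(subfolder: str = "") -> str:
    """Join SERVER, BASE_DIR_ON_GRID and an optional subfolder into one URL."""
    path = posixpath.join(BASE_DIR_ON_GRID, subfolder.strip("/"))
    return SERVER.rstrip("/") + "/" + path
```

With these values, `grid_url()` gives the base directory URL and `grid_url("e680/e680")` gives the URL shown in the second listing.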

input_dir / output_dir:

Defines the input (remote) and output (local) directories for data transfer.

exc / inc:

Defines exclude (--exc) and include (--inc) regular-expression filters. These determine which files are ignored or kept when building the list of files to download.

Use none to reset the filters.

Examples:

GridDataSync.py --exc '.*.tar:.*.cdat.*'

This command excludes all .tar and .cdat files.

GridDataSync.py --inc '.*/03./.*'

This command keeps only the files of the crystals whose name starts with 03 (03A, 03B, 03C).

More information on Python regular expressions is available at https://regexone.com/references/python, the reference given in the --help output.

The current include and exclude patterns are displayed by the --show_conf option.
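
To try a pattern before applying it, the filtering can be imitated in a few lines of Python. This is a sketch under assumptions: the manual does not say how GridDataSync.py applies the patterns, so `re.search` on the full path is used here, and the helper names and the shortened sample paths (including the .adf file) are made up for illustration.

```python
import re


def split_patterns(spec: str) -> list:
    """Split a ':'-separated pattern string; 'none' or '' means no filter."""
    if spec in ("", "none"):
        return []
    return spec.split(":")


def keep(path: str, inc: str = "none", exc: str = "none") -> bool:
    """Return True if `path` passes the include filter and is not excluded."""
    includes = split_patterns(inc)
    excludes = split_patterns(exc)
    if any(re.search(pattern, path) for pattern in excludes):
        return False
    # With no include pattern, every non-excluded file is kept.
    return not includes or any(re.search(pattern, path) for pattern in includes)


paths = [
    "run_0003/Data/03A/SRM_AGATA_event_mezzdata.cdat.0000",
    "run_0003/Data/04A/SRM_AGATA_event_mezzdata.cdat.0000",
    "run_0003/Data/03B/SRM_AGATA_small_files.tar",
    "run_0003/Out/Analysis/Tree_0000.adf",
]
selected = [p for p in paths if keep(p, inc=".*/03./.*", exc=".*.adf")]
print(selected)
```

Only the 03A and 03B entries are selected. Note that an unescaped dot matches any character, so `.*.adf` would also match a name such as `roadfile`; writing `.*\.adf` restricts the match to a literal `.adf`.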

build_list:

Builds the catalog (grid_summary.csv) listing all files, their location, and their download status.
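
Because the catalog is a CSV file, it can be inspected with standard tools. The layout of grid_summary.csv is not documented here, so the column names (`path`, `size_bytes`, `locality`, `downloaded`) and the sample rows in this sketch are hypothetical; adapt them to the actual header of the file.

```python
import csv
import io
from collections import Counter

# Hypothetical catalog excerpt: the real column names of grid_summary.csv may differ.
SAMPLE = """path,size_bytes,locality,downloaded
Conf/03A/SRM_AGATA_small_files.tar,80000,ONLINE,yes
Data/03A/SRM_AGATA_event_mezzdata.cdat.0000,1069547520,NEARLINE,no
Data/03A/SRM_AGATA_small_files.tar,67633152,ONLINE,no
"""


def summarize(catalog_text: str) -> dict:
    """Count downloaded and remaining files, and the locality of the remaining ones."""
    rows = list(csv.DictReader(io.StringIO(catalog_text)))
    remaining = [row for row in rows if row["downloaded"] != "yes"]
    return {
        "total": len(rows),
        "downloaded": len(rows) - len(remaining),
        "remaining": len(remaining),
        "remaining_status": dict(Counter(row["locality"] for row in remaining)),
        "remaining_bytes": sum(int(row["size_bytes"]) for row in remaining),
    }


print(summarize(SAMPLE))
```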

bring_online:

Stages files from tape to disk. This operation may take time and runs in batches of up to 1000 files to avoid overloading the namespace server.
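
Batching of this kind can be sketched as follows. The function name `batches` is hypothetical and the code is not taken from GridDataSync.py; only the limit of 1000 files per batch comes from the text.

```python
def batches(items: list, size: int = 1000):
    """Yield successive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 2500 made-up file names are split into batches of 1000, 1000 and 500.
file_names = [f"file_{index:04d}" for index in range(2500)]
print([len(batch) for batch in batches(file_names)])
```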

check_status:

Checks the current status of the files (downloaded, online, nearline, etc.) and updates the catalog.

Add the --verbose option to this command for more detailed output.

start:

Launches the download process.
By default, it performs checksum verification on each file. Use --nochecksum to disable it, or --overwrite to re-download existing files.
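
The manual does not state which checksum algorithm the script uses. The sketch below assumes Adler-32, which is widely used on dCache storage, and shows how a downloaded file could be verified independently; the helper names are hypothetical.

```python
import zlib


def adler32_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the Adler-32 checksum of a file, reading it chunk by chunk."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def is_intact(path: str, expected_hex: str) -> bool:
    """Compare the local checksum with the value published by the storage system."""
    return adler32_of_file(path) == expected_hex.lower().zfill(8)
```

Reading in chunks keeps memory usage constant, which matters for trace files of about 1 GB such as those in the example further below.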

⚙️ Parallel Download Mode
The script supports multi-threaded and multi-stream downloading.

  • The --threads option sets the number of files downloaded concurrently, one per thread.
  • The --streams option sets the number of simultaneous streams per file (useful for large datasets).

Optimal values depend on the available network and system bandwidth. Typically, --threads=4 and --streams=4 provide good performance without overloading the Grid servers.
Threaded transfers automatically synchronize and ensure safe concurrent access.
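
The general pattern behind a --threads option can be illustrated with Python's standard thread pool. This is a generic sketch, not the implementation used by GridDataSync.py: `download_all` is a hypothetical name and `fetch` stands in for the real transfer function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(files: list, fetch, threads: int = 4) -> dict:
    """Run `fetch(name)` for every file on a pool of worker threads.

    `fetch` is a placeholder for the real transfer function; it returns the
    number of bytes copied or raises an exception on failure.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(fetch, name): name for name in files}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = ("ok", future.result())
            except Exception as error:  # one failed file must not stop the others
                results[name] = ("failed", str(error))
    return results
```

Results are collected in the main thread as each transfer finishes, so the shared dictionary is never written from several threads at once.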

release:

Releases all files that were previously pinned to disk using --bring_online.
Use this option if you staged files but no longer need to download them.


Example: Downloading a Folder to a Local Computer

In this example, configuration files and trace (.cdat) files for crystals 03A–03C of run 0003 from the e680 experiment are downloaded from the Lyon Grid server.

Create a New Proxy (if needed)

GridDataSync.py --new_proxy

Then browse the Grid directory tree to locate the run to download:

root@:/opt/AgataGrid$ GridDataSync.py --ls_dir e680
********************************
** GridDataSync configuration **
********************************

SERVER              : https://ccdavegee.in2p3.fr:2880/
BASE_DIR_ON_GRID    : agata/
OUTPUTDIR           : /data/

 -- List content of folder: https://ccdavegee.in2p3.fr:2880/agata/e680
Folder     09 Jun 2015 e680
Folder     09 Jun 2015 ReadATCA
Folder     15 Jun 2015 e680_NoTraces

root@:/opt/AgataGrid$ GridDataSync.py --ls_dir e680/e680
********************************
** GridDataSync configuration **
********************************

SERVER              : https://ccdavegee.in2p3.fr:2880/
BASE_DIR_ON_GRID    : agata/
OUTPUTDIR           : /data/

 -- List content of folder: https://ccdavegee.in2p3.fr:2880/agata/e680/e680
Folder     09 Jun 2015 ReplayFromAnalysisServer
Folder     09 Jun 2015 Vamos
Folder     09 Jun 2015 NarvalAGATAsolo
Folder     09 Jun 2015 Replay
Folder     09 Jun 2015 run_0001.dat.07-05-15_14h17m05s
Folder     09 Jun 2015 run_0002.dat.07-05-15_14h22m27s
Folder     09 Jun 2015 run_0003.dat.11-05-15_09h55m42s
...

Define Input and Output Directories

GridDataSync.py --input_dir e680/e680/run_0003.dat.11-05-15_09h55m42s
GridDataSync.py --output_dir /data/

Define Include and Exclude Patterns

GridDataSync.py --exc '.*.adf'
GridDataSync.py --inc '.*/03./.*'

********************************
** GridDataSync configuration **
********************************

SERVER              : https://ccdavegee.in2p3.fr:2880/
BASE_DIR_ON_GRID    : agata/
INPUTDIR            : e680/e680/run_0003.dat.11-05-15_09h55m42s
OUTPUTDIR           : /data/
Include pattern     : .*/03./.*
Exclude pattern     : .*.adf

Here, .*.adf excludes the .adf files, and .*/03./.* keeps only the files of crystals 03A, 03B, and 03C.

Build the File List (Catalog Generation)

root@:/opt/AgataGrid$ GridDataSync.py --build_list
********************************
** GridDataSync configuration **
********************************

SERVER              : https://ccdavegee.in2p3.fr:2880/
BASE_DIR_ON_GRID    : agata/
INPUTDIR            : e680/e680/run_0003.dat.11-05-15_09h55m42s
OUTPUTDIR           : /data/
Include pattern     : .*/03./.*
Exclude pattern     : .*.adf

=> adding: 80.0 kB e680/e680/run_0003.dat.11-05-15_09h55m42s/Conf/03A/SRM_AGATA_small_files.tar
=> adding: 80.0 kB e680/e680/run_0003.dat.11-05-15_09h55m42s/Conf/03B/SRM_AGATA_small_files.tar
=> adding: 80.0 kB e680/e680/run_0003.dat.11-05-15_09h55m42s/Conf/03C/SRM_AGATA_small_files.tar
=> adding: 1020.0 MB e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03A/SRM_AGATA_event_mezzdata.cdat.0000
=> adding: 64.5 MB e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03A/SRM_AGATA_small_files.tar
=> adding: 895.0 MB e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03B/SRM_AGATA_event_mezzdata.cdat.0000
=> adding: 68.8 MB e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03B/SRM_AGATA_small_files.tar
=> adding: 890.0 MB e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03C/SRM_AGATA_event_mezzdata.cdat.0000
=> adding: 40.4 MB e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03C/SRM_AGATA_small_files.tar
[catalog] Catalog built from e680/e680/run_0003.dat.11-05-15_09h55m42s
[catalog] scope: all entries
[catalog] rows: 9
[catalog] counts: total: 9, downloaded: 0, remaining: 9
[catalog] remaining status: ONLINE=0, NEARLINE=9, OFFLINE=0, unknown=0
[catalog] sizes: total=2.9 GB, downloaded=0.0 B, remaining=2.9 GB

Stage the Files (Bring Online)

The --bring_online operation can take time. Once it has been launched, you can exit with CTRL+C; the staging of the files continues in the background.

root@:/opt/AgataGrid$ GridDataSync.py --bring_online
********************************
** GridDataSync configuration **
********************************

SERVER              : https://ccdavegee.in2p3.fr:2880/
BASE_DIR_ON_GRID    : agata/
INPUTDIR            : e680/e680/run_0003.dat.11-05-15_09h55m42s
OUTPUTDIR           : /data/
Include pattern     : .*/03./.*
Exclude pattern     : .*.adf

 ... Start staging files from tape to disk ...
     -> press CTRL+C to skip display (staging continues in background)
 Number of files to bring online: 9
 ** Staging launched for 9 file(s) **
 [\] ... 8 files ONLINE over 9, 1 remaining ...

The status of the --bring_online command can then be checked with the --check_status command.

Check File Status

Once the staging has been launched, wait a few minutes, then run --check_status repeatedly until the first files are reported as ONLINE (dCache locality ONLINE_AND_NEARLINE, meaning that a copy exists on both disk and tape).

root@:/opt/AgataGrid$ GridDataSync.py --check_status --verbose
********************************
** GridDataSync configuration **
********************************

SERVER              : https://ccdavegee.in2p3.fr:2880/
BASE_DIR_ON_GRID    : agata/
INPUTDIR            : e680/e680/run_0003.dat.11-05-15_09h55m42s
OUTPUTDIR           : /data/
Include pattern     : .*/03./.*
Exclude pattern     : .*.adf

ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Conf/03A/SRM_AGATA_small_files.tar
ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Conf/03B/SRM_AGATA_small_files.tar
ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Conf/03C/SRM_AGATA_small_files.tar
NEARLINE            : e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03A/SRM_AGATA_event_mezzdata.cdat.0000
ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03A/SRM_AGATA_small_files.tar
NEARLINE            : e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03B/SRM_AGATA_event_mezzdata.cdat.0000
ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03B/SRM_AGATA_small_files.tar
NEARLINE            : e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03C/SRM_AGATA_event_mezzdata.cdat.0000
ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03C/SRM_AGATA_small_files.tar
[catalog] Files status
[catalog] scope: all entries
[catalog] rows: 9
[catalog] counts: total: 9, downloaded: 0, remaining: 9
[catalog] remaining status: ONLINE=6, NEARLINE=3, OFFLINE=0, unknown=0
[catalog] sizes: total=2.9 GB, downloaded=0.0 B, remaining=2.9 GB

Here, 3 files still have to be copied to disk. Running the same command a few minutes later gives:

root@:/opt/AgataGrid$ GridDataSync.py --check_status --verbose
********************************
** GridDataSync configuration **
********************************

SERVER              : https://ccdavegee.in2p3.fr:2880/
BASE_DIR_ON_GRID    : agata/
INPUTDIR            : e680/e680/run_0003.dat.11-05-15_09h55m42s
OUTPUTDIR           : /data/
Include pattern     : .*/03./.*
Exclude pattern     : .*.adf

ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Conf/03A/SRM_AGATA_small_files.tar
ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Conf/03B/SRM_AGATA_small_files.tar
ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Conf/03C/SRM_AGATA_small_files.tar
ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03A/SRM_AGATA_event_mezzdata.cdat.0000
ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03A/SRM_AGATA_small_files.tar
ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03B/SRM_AGATA_event_mezzdata.cdat.0000
ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03B/SRM_AGATA_small_files.tar
ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03C/SRM_AGATA_event_mezzdata.cdat.0000
ONLINE              : e680/e680/run_0003.dat.11-05-15_09h55m42s/Data/03C/SRM_AGATA_small_files.tar
[catalog] Files status
[catalog] scope: all entries
[catalog] rows: 9
[catalog] counts: total: 9, downloaded: 0, remaining: 9
[catalog] remaining status: ONLINE=9, NEARLINE=0, OFFLINE=0, unknown=0
[catalog] sizes: total=2.9 GB, downloaded=0.0 B, remaining=2.9 GB

All the files are now on disk and ready to be downloaded. Note that you do not have to wait until all files are staged before starting the download: the catalog is updated, and restarting the download later fetches only the missing files.

Download Files

Once the files are ready, start the download with the --start option:

root@:/opt/AgataGrid$ GridDataSync.py --start
********************************
** GridDataSync configuration **
********************************

SERVER              : https://ccdavegee.in2p3.fr:2880/
BASE_DIR_ON_GRID    : agata/
INPUTDIR            : e680/e680/run_0003.dat.11-05-15_09h55m42s
OUTPUTDIR           : /data/
Include pattern     : .*/03./.*
Exclude pattern     : .*.adf

...starting to download the 9 requested files using 4 threads and 4 streams per file...
Copied files: 6/9, current: 40.4 MB, total: 173.9 MB/2.9 GB, rate=27.4 MB/s, ETA=1m42s

During the transfer, the script reports download progress, total throughput, and completion status per thread.
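
The ETA in the progress line follows from the remaining volume and the current rate. The sketch below shows that arithmetic; the function name `format_eta` is hypothetical, and the use of binary units (1 GB = 1024 MB) is an assumption. With the rounded values of the progress line above (2.9 GB in total, 173.9 MB copied, 27.4 MB/s), it gives 1m42s.

```python
def format_eta(total_mb: float, done_mb: float, rate_mb_s: float) -> str:
    """Estimate the remaining transfer time and format it as '<m>m<ss>s'."""
    if rate_mb_s <= 0:
        return "unknown"
    seconds = int(max(total_mb - done_mb, 0.0) / rate_mb_s)
    minutes, seconds = divmod(seconds, 60)
    return f"{minutes}m{seconds:02d}s"


# Values read from the progress line, with 1 GB taken as 1024 MB.
print(format_eta(2.9 * 1024, 173.9, 27.4))
```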