LabMeeting Maguelonne Feb2023
Maguelonne Roux
February 2023
[Team slide : Human Evolutionary Genetics (GEH) Dry Lab — Maguelonne Roux (IP), Maxime Rotival (IP), Gaspard Kerner, Tristan Woh, Oguzhan Parasayan, Marie-Therese Vicente ; expertise : genomics, functional genomics, bioinformatics, statistics, immunity, ancient DNA, DNA methylation, population genetics]
[Diagram : anatomy of a cluster — login via ssh to the head node ; scheduler ; computing nodes / partitions]
[Diagram : MAESTRO cluster — login via ssh maestro.pasteur.fr ; partition : common ; submission node : maestro-submit ; file systems : Zeus /projets, Zeus /homes, appa /scratch]
GEH project disks on Zeus /projets :
- ABCsel
- ADAD
- evoceania
- evo_immuno_pop
- Gdap
- LabExMI
- LifeChange
- MATAEA
- MORTUI
- IGSR : sharing of public resources (datasets & software)
Zeus snapshots (.snapshot) :
- Hourly (last 3 days)
- Nightly (last 9 nights)
- Weekly (last 4 weeks)
- Monthly (last 8 months)
[Diagram : TARS cluster — login via ssh gaia.pasteur.fr ; file systems on gaia : /projets/p01, /geh (Projects), /home, and the «Private» team space /projets/p02 (e.g. Vanuatu)]
GEH projects archived on gaia :
- ABCsel
- agrhum
- Enseignements
- gdap
- Geh_calcul
- LabExMI
- LabExMIRaw
- LifeChange
- mataea
- selink
- Vanuatu2
- IGSR
[Diagram : file-system visibility across clusters, with ✕ marking file systems that are not accessible from the other side — the gaia file systems (/geh, /home, /projets) are not mounted on MAESTRO, and the MAESTRO file systems (Zeus /projets, /homes ; appa /scratch) are not mounted on gaia]
File System / Data Storage
Where ?
- Project data : Zeus /projets (GEH project disks)
  * ABCsel
  * ADAD
  * evoceania
  * evo_immuno_pop
  * Gdap
  * LabExMI
  * LifeChange
  * MATAEA
  * MORTUI
- Data published by the scientific community without restricted access : Zeus /projets/p02/IGSR
- Temporary data : appa /scratch
  ⚠ No snapshot & No backup & Data can be removed by IT
How ?
[Diagram : recommended directory structure — each Project_l contains a README.txt, a Data/ directory (Dataset_m, …) and Analysis_n/ directories (Step_o, …), each analysis with its own README.txt]
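A minimal shell sketch of such a layout (the names Project1, Dataset1, Analysis1 and Step1 are placeholders, not prescribed by the slide) :

    # Illustrative only : create the skeleton of a new project and a first analysis
    mkdir -p Project1/Data/Dataset1
    mkdir -p Project1/Analysis1/Step1
    touch Project1/README.txt Project1/Analysis1/README.txt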
How ?
- Make sure that all the members of the group (members that do have access to the project disk) can access your directories :
  * chmod -R 770 directory/ for group rwx permissions
1. Verify that the raw and/or processed data do not already exist on GEH project disks ;
2. Anticipate the resources required to store the raw and/or processed data :
- Identify the data that are using the space : du -BGB --apparent-size ProjectDisk/
  Evaluate whether the space used can be decreased (e.g. by removing unnecessary files, by compressing files, by moving cold data to gaia...) and leave enough space for the new data (see the sketch below) ;
- Otherwise, ask IT - through Etienne - for more space ... without abuse ;)
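A short sketch of these two commands (ProjectDisk/ and new_dataset/ are placeholder paths) :

    # Check which subdirectories use the space (apparent size, reported in GB)
    du -BGB --apparent-size --max-depth=1 ProjectDisk/
    # Give the whole group rwx permissions on a newly created directory tree
    chmod -R 770 ProjectDisk/new_dataset/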
Job = calculation/data processing = allocation of resources for the execution of a program.
- Resources :
  * CPU : --cpus-per-task / -c
  * Memory : --mem or --mem-per-cpu
MAESTRO Computing Nodes / Partitions (submission node : maestro-submit)

Partition    Default QOS    Nb of nodes    Memory per node    Nb of CPUs per node    Time Limit
dedicated    fast           84             500000             96                     2h
dedicated    fast           12             2000000            96                     2h
common       normal         35             500000             96                     24h
common       normal         3              2000000            96                     24h
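For illustration, a minimal sbatch script sketch requesting resources on the common partition (the job name, CPU, memory and time values, and my_analysis.sh are arbitrary examples, not team defaults) :

    #!/bin/bash
    #SBATCH --job-name=example_job      # arbitrary example name
    #SBATCH --partition=common          # partition from the table above
    #SBATCH --qos=normal                # default QOS of the common partition
    #SBATCH --cpus-per-task=4           # CPU (-c)
    #SBATCH --mem=20G                   # memory for the whole job (--mem)
    #SBATCH --time=12:00:00             # must stay below the 24h time limit

    srun ./my_analysis.sh               # placeholder for the actual command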
- When salloc :
  * Give a name to your interactive session : salloc -J name , so you can easily remember the resources allocated and the analysis performed (e.g. salloc -J MethNorm_100G_1CPU_Geh) ;
  * Remember to end your interactive session ( exit ) when you are not using it anymore, to allow other users to benefit from the resources of your unused session ;
  * You can run salloc within a screen or tmux session to avoid your interactive session being killed/lost when you lose connection to the cluster (see the sketch below).
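A minimal sketch of this tmux + salloc workflow (the session name meth_norm is an arbitrary example ; the resource values are taken from the example name above) :

    # Start (or re-attach to) a named tmux session on the submission node
    tmux new -A -s meth_norm
    # Inside tmux, request a named interactive allocation
    salloc -J MethNorm_100G_1CPU_Geh --mem=100G -c 1
    # ... interactive work ...
    exit        # release the allocation when done
    # Detach from tmux with Ctrl-b d ; re-attach later with : tmux attach -t meth_norm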
- Type of Processing :
  * sbatch myjob.sh
  * Sequential (see the sketch after this list) :
    myjob1_ID=$(sbatch --parsable myjob1.sh)
    sbatch --dependency=after:${myjob1_ID} myjob2.sh
- Resources required :
  * CPU : --cpus-per-task / -c
  * Memory : --mem or --mem-per-cpu
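Putting the sequential pattern together, a short sketch (step1_preprocess.sh and step2_analysis.sh are placeholder scripts ; afterok is used here so that step 2 only starts once step 1 has finished without error, whereas after: only waits for step 1 to start) :

    #!/bin/bash
    # Submit step 1 and capture its job ID (--parsable prints only the ID)
    step1_ID=$(sbatch --parsable step1_preprocess.sh)
    # Submit step 2, held until step 1 has completed successfully
    sbatch --dependency=afterok:${step1_ID} step2_analysis.sh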
common partition : for jobs that last > 2h & < 24h, that each require a little to a lot of memory, within or without a pipeline.
- If I do run the 2 examples at the same time, I allocate the sum of the resources : 5.72 + 5.625 = 11.345 geh nodes (> 100 % of the geh partition), so one cluster user of the team can easily prevent another cluster user of the team from working...
- Limit the number of resources allocated at the same time (by limiting the number of jobs and/or the number of array tasks running at the same time), as in the sketch below.
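A minimal sketch of capping the number of array tasks running simultaneously (the array size, the %10 cap and process_sample.sh are arbitrary examples) :

    #!/bin/bash
    #SBATCH --job-name=array_example
    #SBATCH --array=1-200%10      # 200 tasks, but at most 10 running at the same time
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=10G

    # Placeholder command : each task processes one input, indexed by SLURM_ARRAY_TASK_ID
    srun ./process_sample.sh ${SLURM_ARRAY_TASK_ID}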
- sacct -j JOBID : displays information on jobs, job steps, status, and exit codes
- scancel JOBID : cancels a job
https://slurm.schedmd.com/
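For example (1234567 is an illustrative job ID ; the format fields are standard sacct fields) :

    # Summarize a running or finished job : state, elapsed time, peak memory
    sacct -j 1234567 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS
    # Cancel a job that is no longer needed
    scancel 1234567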
- The GEH lab is involved in several research projects that require the generation and handling of large-scale datasets. We have thus established, for each project, a Data Management Plan (DMP) that describes the management and life cycle of all the data that will be collected, generated or processed by the lab. This document, which is established at the start of the project, evolves with the scientific project and is updated when necessary to include newly generated data or accommodate new legal/technological requirements.
- For each project, the DMP defines the data that will be collected, generated or processed, including a detailed description for each dataset of its aim, nature, format, and the number and size of corresponding files. Similarly, the DMP also provides descriptions of existing data that will be reused for the project (these can either be public data published by the scientific community or data produced internally as part of previous scientific projects).
- For each project, the DMP defines the resources required for data management, including the persons in charge of data management during the research project, the computational and storage resources maintained by the Institut Pasteur required for data processing (Maestro HPC ; ZEUS storage) and archiving (GAIA storage ; magnetic tapes), and the budget and funding allocated by the GEH team to use and enlarge these resources.
- For each project, the DMP defines the data management during the project, which depends on the nature of the data ; demographic data collected on structured questionnaires (eCRFs) are stored in the secured web application REDCap, installed on secured servers behind institutional firewalls ; raw omics data are stored on a dedicated disk partition provided by the Institut Pasteur, to which only project participants have access using >12-character password authentication. Data processing and computations are performed by the persons in charge of data management, using IT System Department infrastructures (for computing and storage). The DMP defines quality checks that are required to control the quality and consistency of the data throughout the project. Where appropriate, it also defines specific file classification schemes and naming conventions to facilitate the identification and comprehension of the data. Daily commenting and versioning of the scripts using GitLab, along with documentation (README files), guarantee data reproducibility and/or reuse.
- For each project, the DMP defines the data/code sharing ; as soon as the results generated with the data are published, all the scripts and datasets necessary to reproduce the analyses are made accessible to the scientific community through DOIs. The scripts are deposited on GitHub, where they are openly accessible, whereas the pseudonymized datasets (genetic data, etc.) are deposited either on the European Genome-phenome Archive (EGA, maintained by EMBL-EBI) or, whenever possible, on OWEY (maintained by Pasteur), both of which allow data sharing under restricted access. To access these data, the Data Requester is asked to describe the objectives of their research to a dedicated Data Access Committee. If the research objectives comply with the informed consent of the research participants, the requester is asked to sign a Data Access Agreement and is granted data access. The Agreement describes data usage conditions, including data security measures.
- For each project, the DMP defines the long-term preservation of the data : after the publication of the results generated with the data, raw datasets are archived indefinitely both on the data sharing platforms (EGA/OWEY) and on the secured storage disk partitions (GAIA storage) or on magnetic tapes at the Institut Pasteur, depending on the data size.