Lab Meeting

Maguelonne Roux

February 2023

Lab Meeting - February 2023 1 / 47


Human Evolutionary Genetics Team

RESEARCH SCIENTISTS / ENGINEERS/TECHNICIANS / POST-DOCS / PhD STUDENTS

- Luis Quintana, CdF/IP : Population genetics
- Etienne Patin, CNRS : Population genetics
- Maxime Rotival, IP : Functional genomics
- Guillaume Laval, IP : Computational modeling
- Aurelie Bisiaux, IP : Functional genomics, Data generation
- Zhi Li, IP : Functional genomics, Data generation
- Maguelonne Roux, IP : Genomics, Bioinformatics, Statistics
- Anthony Jaquaniello, IP : Genomics, Data Manager
- Sara Niedbalski : Population genetics, Siberia
- Bjôrn Axel Olin : Systems immunology
- Gaspard Kerner : Ancient DNA, Immunity
- Dang Liu : Population genetics, Pacific
- Oguzhan Parasayan : Population genetics, Ancient DNA
- Yann Aquino : Functional genomics, Single cell
- Gaston Rijo : Population genetics, Pacific

INTERN

- Tristan Woh : DNA Methylation

ASSISTANT

- Marie-Therese Vicente


- The GEH dry lab does require computational & storage resources


The High Performance Computing Cluster of Institut Pasteur


HPC Cluster

- Cluster : Computing Nodes / Partitions
- Head Node : Scheduler
- Access : ssh login
- File System / Data Storage


HPC Cluster - Since 2021

- Cluster : MAESTRO (partition : common)
- Head node : maestro-submit
- Access : ssh maestro.pasteur.fr (login)
- Storage :
  * Zeus : /projets, /homes
  * appa : /scratch
- Project disks on Zeus /projets : ABCsel, ADAD, evoceania, evo_immuno_pop, Gdap, LabExMI, LifeChange, MATAEA, MORTUI


HPC Cluster - Since 2021
Shared Disk

- IGSR (Zeus /projets) : sharing of public resources (datasets & software)


HPC Cluster - Since 2021
Mappable File System

[Diagram : MAESTRO cluster (maestro-submit, ssh maestro.pasteur.fr) with Zeus (/projets, /homes) and appa (/scratch) storage]


HPC Cluster - Since 2021
Snapshot

- .snapshot (Zeus : /projets, /homes) :
  * Hourly (last 3 days)
  * Nightly (last 9 nights)
  * Weekly (last 4 weeks)
  * Monthly (last 8 months)


HPC Cluster - Before 2021

- Cluster : TARS
- Access : ssh gaia.pasteur.fr (login)
- Storage (gaia) :
  * /geh : Team
  * /home : «Private»
  * /projets/p01 (Projects) : ABCsel, agrhum, Enseignements, gdap, Geh_calcul, LabExMI, LabExMIRaw, LifeChange, mataea, selink, Vanuatu2, IGSR
  * /projets/p02 (Projects) : Vanuatu


HPC Cluster - Before 2021
File System Access

- From MAESTRO (maestro-submit, ssh maestro.pasteur.fr) :
  * Zeus : /homes, /projets
  * appa : /scratch
  * gaia : /home, /projets, /geh


HPC Cluster - Before 2021
File System Access

- Through ssh sftpcampus.pasteur.fr (login) or https://move.pasteur.fr/ (login)

[Diagram : file systems reachable through these two access points, among Zeus (/homes, /projets), appa (/scratch) and gaia (/home, /geh, /projets/p02)]


HPC Cluster - Before 2021
Mappable File Systems

[Diagram : TARS cluster (ssh gaia.pasteur.fr) with gaia storage : /geh, /home, /projets/p01, /projets/p02]


HPC Cluster - Before 2021
Backup

- gaia (/geh, /home, /projets/p01, /projets/p02) : backup on disks located in a different building


HPC Cluster for Analyses

- Analysis : Data, Scripts, SLURM, Results
- Cluster MAESTRO : Computing Nodes / Partitions (common)
- Head Node / Scheduler : maestro-submit
- Access : ssh maestro.pasteur.fr (login)
- File System / Data Storage : Zeus (/projets, /homes), appa (/scratch)


HPC Cluster for Analyses
Data Storage

Where ?

- Zeus /projets : ABCsel, ADAD, evoceania, evo_immuno_pop, Gdap, LabExMI, LifeChange, MATAEA, MORTUI


HPC Cluster for Analyses
Data Storage

Input Data : Data Reuse

- Data produced internally as part of previous scientific projects : Zeus /projets/p02
- Data published by the scientific community with restricted access : Zeus /projets/p02
- Data published by the scientific community without restricted access : Zeus /projets/p02/IGSR

Intermediate Files : Scratch

- Temporary Data : appa /scratch
  ! No snapshot & No backup & Data can be removed by IT


HPC Cluster for Analyses
Data Storage

How ?

Project Disk
  Project1
    Data
      Dataset1
      ...
      Datasetm
    Scripts
      Analysis1
        Step1
        ...
        Stepo
      ...
      Analysisn
    Results
      Analysis1
        Step1
        ...
        Stepo
      ...
      Analysisn
    Log Files
    README.txt
  ...
  Projectl
  README.txt
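
The layout above can be scaffolded with a few mkdir calls. This is only a minimal sketch of one reading of the slide: Project1, Dataset1, Analysis1/Step1 and LogFiles are placeholder names, the position of the log directory is an assumption, and PROJECT_DISK is a hypothetical variable that defaults to a temporary directory so the sketch can be tried safely.

```shell
# PROJECT_DISK is a hypothetical variable; on the cluster it would point to a
# project disk. It defaults to a temporary directory for a safe dry run.
project_disk="${PROJECT_DISK:-$(mktemp -d)}"
project="$project_disk/Project1"   # placeholder project name

# One sub-directory per dataset, and one per analysis step for scripts and results.
mkdir -p "$project/Data/Dataset1"
for analysis in Analysis1 Analysis2; do
  for step in Step1 Step2; do
    mkdir -p "$project/Scripts/$analysis/$step" "$project/Results/$analysis/$step"
  done
done
mkdir -p "$project/LogFiles"   # assumed location for the log files

# README files at the project level and at the disk level.
touch "$project/README.txt" "$project_disk/README.txt"

# Show the resulting tree.
find "$project_disk" -mindepth 1 | sort
```

Mirroring the Scripts and Results sub-trees makes it easy to find which script produced which output.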


HPC Cluster for Analyses
Data Storage

How ?

- Make sure that all the members of the group (i.e. the members who have access to the project disk) can access your directories :
  * chmod -R 770 directory/ for group rwx permissions
  * chmod -R 750 directory/ for group r-x permissions
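
The effect of the two chmod modes can be checked on a throw-away directory. A minimal sketch, assuming GNU coreutils (for stat -c); the directory names are hypothetical.

```shell
# Demonstration on a temporary directory (hypothetical names).
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/shared_rw/sub" "$demo_dir/shared_ro/sub"

chmod -R 770 "$demo_dir/shared_rw"   # owner rwx, group rwx, others none
chmod -R 750 "$demo_dir/shared_ro"   # owner rwx, group r-x, others none

# stat -c '%a' prints the permissions in octal (GNU coreutils).
perm_rw="$(stat -c '%a' "$demo_dir/shared_rw/sub")"
perm_ro="$(stat -c '%a' "$demo_dir/shared_ro/sub")"
echo "shared_rw/sub: $perm_rw, shared_ro/sub: $perm_ro"
```

Note that -R with a numeric mode also marks regular files as executable. If that is not wanted, a symbolic mode such as chmod -R u=rwX,g=rwX,o= directory/ adds the execute bit only to directories and to files that are already executable.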


! ! ! LIMITED DISK SPACE ! ! !

GAIA                                          Total     Used     Free    %Used  IT lend
/pasteur/gaia/projets/p01/LabExMIRaw         361.00   244.79   116.21   67.81%
/pasteur/gaia/projets/p01/IGSR               110.00    64.56    45.44   58.69%
/pasteur/gaia/projets/p01/mataea             110.00    96.67    13.33   87.88%     50.0
/pasteur/gaia/projets/p01/gdap                30.00     9.69    20.31   32.29%
/pasteur/gaia/projets/p02/Vanuatu             85.00    54.73    30.27   64.39%
/pasteur/gaia/projets/p01/evo_immuno_pop      79.00    58.09    20.91   73.53%     30.0
/pasteur/gaia/projets/p01/Geh_calcul          38.00    29.44     8.56   77.47%
/pasteur/gaia/projets/p01/LabExMI             29.00    25.50     3.50   87.93%
/pasteur/gaia/projets/p01/ABCsel              20.00    19.63     0.37   98.15%
/pasteur/gaia/projets/p01/Vanuatu2            15.00    12.03     2.97   80.21%
/pasteur/gaia/projets/p01/LifeChange          15.00    12.97     2.03   86.45%
/pasteur/gaia/projets/p01/agrhum              13.00    12.54     0.46   96.49%
/pasteur/gaia/projets/p01/selink              10.00     8.22     1.78   82.22%
Total                                        915.00   648.86   266.14   70.91%     80.0

ZEUS                                          Total     Used     Free    %Used  IT lend
/pasteur/zeus/projets/p02/LabExMI            170.00    97.96    72.04   57.62%     30.0
/pasteur/zeus/projets/p02/evo_immuno_pop      20.00    14.89     5.11   74.44%
/pasteur/zeus/projets/p02/IGSR                97.00    71.76    25.24   73.98%     90.0
/pasteur/zeus/projets/p02/gdap                38.00    37.15     0.85   97.76%     40.0
/pasteur/zeus/projets/p02/evoceania          287.00   262.86    24.14   91.59%    280.0
/pasteur/zeus/projets/p02/ABCsel               5.00     4.52     0.48   90.44%
/pasteur/zeus/projets/p02/LifeChange          35.00    28.71     6.29   82.02%     90.0
/pasteur/zeus/projets/p02/MATAEA             239.00   207.18    31.82   86.68%     70.0
/pasteur/zeus/projets/p02/ADAD                 4.00     3.19     0.81   79.87%
/pasteur/zeus/projets/p02/MORTUI              25.00     6.27    18.73   25.07%
Total                                        895.00   728.21   166.79   81.36%    600.0

Grand total                                 1810.00  1377.07   432.93   76.08%    680.0


! ! ! LIMITED DISK SPACE ! ! !
Good Practices

1. Verify that the raw and/or processed data do not already exist on GEH project disks ;
2. Anticipate the resources required to store the raw and/or processed data ;
3. Evaluate the space available on the disk of interest : df -BGB ProjectDisk/ ;
4. If the space left on the disk of interest is not enough :
   - Identify the data that are using the space : du -BGB --apparent-size ProjectDisk/
     Evaluate whether the space used can be decreased (e.g. by removing unnecessary files, by compressing files, by moving cold data to gaia...) so as to leave enough space for the new data ;
   - Otherwise, ask IT - through Etienne - for more space ... without abuse ;)
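
Steps 3 and 4 can be combined into a small helper that shows the free space and then the largest first-level directories. A minimal sketch, assuming GNU df, du and sort; largest_dirs is a hypothetical helper name, and the demo runs on a temporary directory rather than on a real project disk.

```shell
# largest_dirs DIR [N]: list the N largest entries of DIR (DIR itself plus its
# first-level sub-directories), largest first.
largest_dirs() {
  local dir="${1:-.}" n="${2:-10}"
  du -BGB --apparent-size --max-depth=1 "$dir" | sort -rn | head -n "$n"
}

# Example on a throw-away directory; on the cluster, pass a project disk path.
demo_disk="$(mktemp -d)"
mkdir -p "$demo_disk/Project1" "$demo_disk/Project2"

df -BGB "$demo_disk"        # space available on the file system
largest_dirs "$demo_disk"   # where the space is used
```

With -BGB, sizes are rounded up to whole gigabytes, so even tiny directories are reported as 1GB; the ranking is mostly informative for large directories.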


! ! ! LIMITED DISK SPACE ! ! !

- Prices per Year :

                   GAIA          ZEUS
  Replicated       180 € / TB    340 € / TB
  Non-Replicated    90 € / TB    250 € / TB
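
To compare storage options, the yearly cost can be computed from the price table. A minimal sketch: yearly_cost is a hypothetical helper, the prices are the ones listed above, and the 10 TB volume is only an example (whole TB only, since the shell does integer arithmetic).

```shell
# yearly_cost STORAGE MODE SIZE_TB
# STORAGE is GAIA or ZEUS, MODE is replicated or non-replicated.
# Prints the yearly price in euros, using the per-TB prices of the slide.
yearly_cost() {
  local storage="$1" mode="$2" tb="$3" price
  case "$storage:$mode" in
    GAIA:replicated)     price=180 ;;
    GAIA:non-replicated) price=90 ;;
    ZEUS:replicated)     price=340 ;;
    ZEUS:non-replicated) price=250 ;;
    *) echo "unknown storage/mode: $storage $mode" >&2; return 1 ;;
  esac
  echo $(( price * tb ))
}

# Example: 10 TB (hypothetical volume), replicated, on each storage.
echo "ZEUS: $(yearly_cost ZEUS replicated 10) euros / year"
echo "GAIA: $(yearly_cost GAIA replicated 10) euros / year"
```

For the same replicated volume, GAIA costs roughly half as much as ZEUS, which is one reason to move cold data to gaia.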


https://moocs.pasteur.fr/courses/course-v1:Institut_Pasteur+DSI_01+1/course/
https://slurm.schedmd.com/

Job
= calculation/data processing
= allocation of resources for the execution of a program.

- Resources :
  * CPU : --cpus-per-task/-c
  * Memory : --mem, --mem-per-cpu
  * (Time) Partition : --partition/-p & Quality of Service : --qos/-q


Job = allocation of resources for the execution of a program.

MAESTRO (maestro-submit) : Computing Nodes / Partitions

sinfo -e -O nodes,memory,cpus -p partition

Partition   Default QOS   Nb of nodes   Memory per node   Nb of CPUs per node   Time Limit
dedicated   fast          84            500000            96                    2h
                          12            2000000           96                    2h
common      normal        35            500000            96                    24h
                           3            2000000           96                    24h
geh         normal        11            500000            96                    ∅
gehbigmem   normal         1            2000000           96                    ∅


Job
= calculation/data processing
= allocation of resources for the execution of a program.

- srun / salloc to run a job interactively ;
- sbatch to run a job in batch mode ;


1. srun / salloc to run a job interactively

- srun myscript.sh ⇐⇒ salloc, then ./myscript.sh, then exit

- When using salloc :
  * Anticipate the resources required, so as to allocate the right --mem , -c & -p ;
  * Give a name to your interactive session with salloc -J name , so you can easily remember the resources allocated and the analysis performed (e.g. salloc -J MethNorm_100G_1CPU_Geh) ;
  * Remember to end your interactive session ( exit ) when you are not using it anymore, so that other users can benefit from the resources of your unused session ;
  * You can run salloc within a screen or tmux session to prevent your interactive session from being killed/lost when losing the connection to the cluster.


2. sbatch to run a job in batch mode

- Type of Processing :
  * Sequential :
      sbatch myjob.sh
      myjob1_ID=$(sbatch --parsable myjob1.sh)
      sbatch --dependency=afterok:$myjob1_ID myjob2.sh
  * Parallel :
      sbatch --array=<start>-<stop>%<running> myarray.sh
  * Parallel & Sequential :
      myarray1_ID=$(sbatch --parsable --array=<start>-<stop>%<running> myarray1.sh)
      sbatch --array=<start>-<stop>%<running> --dependency=afterok:$myarray1_ID myarray2.sh
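
The "Parallel & Sequential" case can be sketched as a small submission script. This is an illustration under assumptions, not the team's pipeline: the script names are the placeholders of the slide, the array size and throttle are arbitrary, and a stub replaces sbatch when the command is not available so that the logic can be dry-run off the cluster.

```shell
# Off the cluster, fall back to a stub so the sketch can be dry-run:
# the stub only prints a fake job ID (illustration only).
if ! command -v sbatch >/dev/null 2>&1; then
  sbatch() { echo "12345"; }
fi

# Step 1: array of 22 tasks, at most 5 running at the same time.
step1_id="$(sbatch --parsable --array=1-22%5 myarray1.sh)"
step1_id="${step1_id%%;*}"   # --parsable may append ";cluster" to the job ID

# Step 2: starts only once every task of step 1 has completed successfully.
step2_id="$(sbatch --parsable --dependency=afterok:"$step1_id" myjob2.sh)"
step2_id="${step2_id%%;*}"

echo "step 1: $step1_id, step 2: $step2_id"
```

The dependency type matters: after only waits for the first job to start, afterany waits for it to end whatever its exit status, and afterok waits for a successful end.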


- Resources required :
  * CPU : --cpus-per-task/-c
  * Memory : --mem, --mem-per-cpu
  * (Time) Partition : --partition/-p & Quality of Service : --qos/-q
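
In batch mode, these resources are usually requested at the top of the job script with #SBATCH lines. A minimal sketch: the job name, the resource values and the log file name are hypothetical, and the script body is a placeholder for the real analysis.

```shell
#!/usr/bin/env bash
# Hypothetical job name
#SBATCH --job-name=example_job
# Partition (-p) and quality of service (-q)
#SBATCH --partition=common
#SBATCH --qos=normal
# CPUs for the task (-c) and memory for the job
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
# Log file; %j is replaced by the job ID
#SBATCH --output=example_job_%j.log

# SLURM exports the number of CPUs requested with --cpus-per-task;
# default to 1 when the script is run outside SLURM.
threads="${SLURM_CPUS_PER_TASK:-1}"
echo "Job ${SLURM_JOB_ID:-not_in_slurm} running with ${threads} CPU(s)"
```

Saved as myjob.sh, it would be submitted with sbatch myjob.sh. Options given on the sbatch command line take precedence over the #SBATCH lines.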



Which partition for which job ?

Job                                Time                Memory                 slurm dependencies             Partition
One or several job(s) / array(s)   < 2h each           a little to a lot      without a pipeline             dedicated
One or several job(s) / array(s)   < 2h each           a little to a lot      within a pipeline              common
One or several job(s) / array(s)   > 2h & < 24h each   a little to a lot      within or without a pipeline   common
One or several job(s) / array(s)   > 24h each          a little to moderate   within or without a pipeline   geh
One or several job(s)              > 24h each          a lot                  within or without a pipeline   gehbigmem


geh nodes

- Are used by 13 members of the team and are limited
- so one cluster user of the team can easily prevent another cluster user of the team from working...


geh nodes
Examples of Usage

Partition   Default QOS   Nb of nodes   Memory per node   Nb of CPUs per node   Time Limit
geh         normal        11            500000            96                    ∅

- with sbatch --array=1-22 --mem 130G -c 1 -p geh myarray1.sh ,
  I use 22 × 130G / 500G = 5.72 geh nodes (>50% of geh partition)
- with sbatch --array=1-45 --mem 1G -c 12 -p geh myarray2.sh ,
  I use 45 × 12 / 96 CPUs = 5.625 geh nodes (>50% of geh partition)
- & if I run the 2 examples at the same time, I allocate the sum of the resources :
  5.72 + 5.625 = 11.345 geh nodes (>100% of geh partition)
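
The arithmetic of these examples can be reproduced with a short helper, which is handy for checking an array before submitting it. A minimal sketch: node_equivalents is a hypothetical helper name, and the result is a share of the partition's resources, not the number of nodes the scheduler will actually place tasks on.

```shell
# node_equivalents TASKS PER_TASK PER_NODE
# Number of nodes' worth of a resource allocated by TASKS simultaneous tasks.
node_equivalents() {
  awk -v n="$1" -v per_task="$2" -v per_node="$3" \
    'BEGIN { printf "%.3f\n", n * per_task / per_node }'
}

mem_nodes="$(node_equivalents 22 130 500)"   # 22 tasks x 130G on 500G nodes
cpu_nodes="$(node_equivalents 45 12 96)"     # 45 tasks x 12 CPUs on 96-CPU nodes
total_nodes="$(awk -v a="$mem_nodes" -v b="$cpu_nodes" \
  'BEGIN { printf "%.3f\n", a + b }')"

echo "memory-bound array: $mem_nodes nodes"
echo "CPU-bound array:    $cpu_nodes nodes"
echo "both together:      $total_nodes of 11 geh nodes"
```

A job can saturate a node through memory alone or through CPUs alone, so both ratios are worth checking.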


geh nodes
Good Practices

- Do not ask for more resources than necessary
- & limit the amount of resources allocated at the same time (by limiting the number of jobs and/or the number of array tasks running at the same time)
- so that your scientific project can progress while guaranteeing the progression of the projects of the other geh members.
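
One way to limit the resources allocated at the same time is the %<running> throttle of --array. A minimal sketch that derives a throttle from a resource budget: max_concurrent is a hypothetical helper, and the budget of 2 geh nodes' worth of memory is an arbitrary example, not a team rule.

```shell
# max_concurrent BUDGET_NODES PER_NODE PER_TASK
# Largest number of simultaneous tasks whose total request fits in BUDGET_NODES nodes.
max_concurrent() {
  echo $(( $1 * $2 / $3 ))   # integer division rounds down
}

# Hypothetical budget: 2 geh nodes' worth of memory (500G each) for 130G tasks.
throttle="$(max_concurrent 2 500 130)"

# Print the resulting submission command instead of running it.
echo "sbatch --array=1-22%${throttle} --mem 130G -c 1 -p geh myarray1.sh"
```

With these numbers, the 22 tasks still all run, but at most 7 at a time (7 × 130G = 910G), which leaves most of the geh partition available to others.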


How to monitor slurm jobs

- squeue -u username : displays information for running and pending jobs
- scontrol show job JOBID : shows detailed information about a job
- sacct -j JOBID : displays information on jobs, job steps, status, and exit codes
- seff JOBID : displays the resources used by a finished job
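
To check whether a job asked for more than it used, sacct can be restricted to a few columns with --format. A minimal sketch: job_report is a hypothetical helper, the field list is one possible selection of standard sacct fields, and 123456 is a made-up job ID.

```shell
# Fields to display: requested resources (AllocCPUS, ReqMem) next to
# what was actually used (Elapsed, MaxRSS) and how the job ended.
sacct_fields="JobID,JobName,Partition,State,Elapsed,AllocCPUS,ReqMem,MaxRSS,ExitCode"

# job_report JOBID: print the accounting summary of a job.
job_report() {
  local job_id="$1"
  if command -v sacct >/dev/null 2>&1; then
    sacct -j "$job_id" --format="$sacct_fields"
  else
    echo "sacct not found: run this on a machine with the SLURM client tools" >&2
    return 1
  fi
}

# Usage (123456 is a hypothetical job ID):
# job_report 123456
```

Comparing ReqMem with MaxRSS on a finished test job gives a realistic value for --mem on the following submissions.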


Cancel a Job

- scancel JOBID

Modify jobs or array tasks that are pending (PD)

- scontrol hold JOBID
- scontrol update JobId=JOBID MinMemoryCPU=<megabytes>
- scontrol update JobId=JOBID CPUsPerTask=<count>
- scontrol update JobId=JOBID ArrayTaskThrottle=<count>
- scontrol update JobId=JOBID Partition=<name> QOS=<name>
- scontrol release JOBID

https://slurm.schedmd.com/


Reproducibility

- Versioning of the scripts (Git & GitLab/GitHub)
- Package managers (e.g. Conda)
- Containers (e.g. Singularity, Apptainer)
- Workflow managers (e.g. Nextflow, Snakemake)


DMP

- The GEH lab is involved in several research projects that require the generation and
handling of large-scale datasets. We have thus established, for each project, a Data
Management Plan that describes the management and life cycle of all the data that will
be collected, generated or processed by the lab. This document, which is established at
the start of the project, evolves with the scientific project and is updated when necessary
to include newly generated data or accommodate new legal/technological requirements.

- For each project, the DMP defines the data that will be collected, generated or
processed, including a detailed description for each dataset of its aim, nature, format,
and the number and size of corresponding files. Similarly, the DMP also provides
descriptions of existing data that will be reused for the project (these can either be
public data published by the scientific community or data produced internally as part of
previous scientific projects).

- For each project, the DMP defines the resources required for the data management,
including the persons in charge of the data management during the research project, the
computational and storage resources maintained by the Institut Pasteur required for data
processing (Maestro HPC ; ZEUS storage) and archiving (GAIA storage ; magnetic
tapes), and the budget and funding allocated by the GEH team to use and enlarge these
resources.

- For each project, the DMP defines the data management during the project, which
depends on the nature of the data ; demographic data collected on structured
questionnaires (eCRFs) are stored in the secured web application REDCap, installed on
secured servers behind institutional firewalls ; raw -omics data are stored on a dedicated
disk partition provided by the Institut Pasteur, to which only project participants have
access using >12-character password authentication. Data processing and computations
are performed by the persons in charge of the data management, using IT System
Department infrastructures (for computing and storage). The DMP defines quality
checks that are required to control the quality and the consistency of the data throughout
the project. Where appropriate, it also defines specific file classification schemes and
naming conventions to facilitate the identification and comprehension of the data. The
commenting and the versioning of the scripts on a daily basis using GitLab, along with
documentation (README files), guarantee data reproducibility and/or reuse.

- For each project, the DMP defines the data/code sharing ; as soon as the results
generated with the data are published, all the scripts and datasets necessary to
reproduce the analyses are made accessible to the scientific community through DOIs.
The scripts are deposited on GitHub, where they are openly accessible, whereas the
pseudonymized datasets (genetic data, etc.) are deposited either on the European
Genome/Phenome Archive (EGA, maintained by EMBL-EBI) or, whenever possible, on
OWEY (maintained by Pasteur), which both allow data sharing under restricted access.
To access these data, the Data Requester is asked to describe the objectives of their
research to a dedicated Data Access Committee. If the research objectives comply with
the informed consent of research participants, the requester is asked to sign a Data
Access Agreement and is granted data access. The Agreement describes data usage
conditions, including data security measures.

- For each project, the DMP defines the long-term preservation of the data : after the
publication of the results generated with the data, raw datasets are archived indefinitely
on both the data sharing platforms (EGA/OWEY) and on the secured storage disk
partitions (GAIA storage) or on magnetic tapes at the Institut Pasteur, depending on the
data size.

Thank you :)
