Running a job with Matlab

The steps required to run a job with Matlab are described below.

The first step is to prepare everything needed to run the desired experiment: in this case a text file, esempio.m, containing the Matlab instructions that make up the experiment.
The file will contain the following instructions:

a = [1 2 3 4 6 4 3 4 5]
b = a+2
b
ver

As for any other job, a script (esempio_matlab.sh) for the queue manager must be prepared.
We can use the same one as in the QuickStart, but Matlab needs some specific options to run in batch mode and thus avoid any direct interaction with the user:

#!/bin/sh
#SBATCH --job-name=esempio_matlab                # name assigned to the job
#SBATCH --output=%x_%j.log                       # output file name; produces esempio_matlab_JOBID.log
#SBATCH --error=%x_%j.err                        # error file name; produces esempio_matlab_JOBID.err
#SBATCH --mem-per-cpu=1024M                      # amount of reserved memory
#SBATCH --time=0-00:02:00                        # estimated run time (days-hours:minutes:seconds)
#SBATCH -n 1                                     # number of processors
#SBATCH -N 1                                     # number of servers
#SBATCH --mail-user=utente.terastat2@uniroma1.it # your own email address
#SBATCH --mail-type=ALL

module load matlab/R2020a
matlab -nodisplay -nosplash -nodesktop -batch "run('$HOME/esempio.m');exit;"
sleep 60

The script can then be submitted with the sbatch command:

sbatch esempio_matlab.sh

Example of parallelization in R

Here we report an example of the use of the R library doParallel to parallelize part of an algorithm that estimates the value of pi.

The estimate uses the Monte Carlo method. Consider a square of side 1 with an inscribed circle: the ratio of the area of the circle to that of the square is pi/4. Points are then scattered uniformly at random over the square. For a large number of trials, the fraction of points that fall inside the inscribed circle approaches this ratio of areas, so multiplying it by 4 yields an estimate of pi.
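
As an illustration of this idea, a single Monte Carlo estimate can be written in a few lines of R (a minimal sketch, not part of the original scripts; it uses the quarter-circle check x*x + y*y <= 1 adopted by the scripts below, which is equivalent to the inscribed-circle formulation):

n <- 1000000
x <- runif(n)                    # random points in the unit square
y <- runif(n)
4 * mean(x*x + y*y <= 1)         # fraction of points inside the quarter circle, times 4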

A sequential implementation of this algorithm is:

library(ggplot2)
library(scales)

nCicli <- 6
nStime <- 64
nPunti <- 1

df <- data.frame( NumPunti=integer(), stima=double(), std=double(), time=double() )
for ( ciclo in 1:nCicli ){
    start_time <- Sys.time()
    nPunti <- nPunti * 10

    stime <- vector()
    for ( stima in 1:nStime ){
        nIn=0

        for ( step in 1:nPunti ) {
            x=runif(1,0,1)
            y=runif(1,0,1)
            if( x*x+y*y <= 1 ){
                nIn=nIn+1
            }
        }

        stime <- c( stime, 4*nIn/nPunti)
    }
    end_time <- Sys.time()
    df <- rbind( df, data.frame( NumPunti=nPunti, stima=mean(stime), std=sd(stime),
        time=end_time-start_time ) )
}
print(df)
p <- ggplot(df,aes(x = NumPunti, y= stima)) +
    geom_point() +
    geom_pointrange( aes( ymin=stima-std, ymax=stima+std))
p + scale_x_log10()

ggsave("PiPlot_seq.pdf")

In this implementation the script repeats, through nested loops, the estimate many times and with different numbers of random points (from 10 to 1 million); specifically, 64 estimates are computed for each of 6 different values of the number of points.

While this algorithm is particularly well suited to parallelization, since the individual estimates are independent of one another, one still has to identify sub-procedures that are substantial enough to justify the unavoidable overhead of managing the parallelization.

The library we will use for the parallelization (doParallel) allows defining a pool of worker processes to which the execution of the sub-procedures is delegated and, through special directives in R loops, makes it possible to indicate which parts of the program have to be delegated and parallelized. For example, with 10 workers and a loop of 100 steps, the library assigns the first 10 steps to the worker processes; as soon as a worker finishes its own step it returns the result to the parent process, which hands it the next step, until the whole loop is completed.
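
In its minimal form the pattern looks like this (a sketch with arbitrary numbers of workers and iterations, not taken from the scripts below):

library(doParallel)

registerDoParallel(cores = 10)                 # pool of 10 worker processes
results <- foreach(i = 1:100, .combine='c') %dopar% {
    i^2                                        # work of a single step, executed on a worker
}
stopImplicitCluster()                          # release the workers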

The first thing to do is therefore to modify the submission script, making sure enough resources are available to run the worker processes:

#!/bin/sh

#SBATCH --job-name=stima_Pi_par # job name
#SBATCH --output=%x_32_%j.log # output file name
#SBATCH --error=%x_32_%j.err # error file name
#SBATCH --mem-per-cpu=1G # requested memory
#SBATCH --time=0-00:40:00 # requested run time (days-hours:minutes:seconds)
#SBATCH -n 32 # number of processors to use
#SBATCH -N 1 # number of servers to use
#SBATCH --mail-user=utente@esempio.es # your own email address
#SBATCH --mail-type=END

module load R
R --slave -f stimaPi_par.R

where the -n 32 directive requests 32 cores.

The parallelized version of the procedure becomes:

library(doParallel)
library(ggplot2)
library(scales)

num_cores <- as.numeric(Sys.getenv("SLURM_CPUS_ON_NODE"))
registerDoParallel(cores = num_cores)

nCicli <- 6
nStime <- 64
nPunti <- 1

df <- data.frame( NumPunti=integer(), stima=double(), std=double(), time=double() )
for ( ciclo in 1:nCicli ){
    start_time <- Sys.time()
    nPunti <- nPunti * 10

    stime = foreach ( stima=1:nStime, .combine='c' ) %dopar% {
        nIn <- 0
        for( step in 1:nPunti) {
            x=runif(1,0,1)
            y=runif(1,0,1)
            if( x*x+y*y <= 1 ){
                nIn = nIn+1
            }
        }
        4*nIn/nPunti
    }
    end_time <- Sys.time()
    df <- rbind( df, data.frame( NumPunti=nPunti, stima=mean(stime), std=sd(stime),
            time=as.numeric(end_time-start_time,units="secs") ) )
}

print(df)
p <- ggplot(df,aes(x = NumPunti, y= stima)) +
    geom_point() +
    geom_pointrange( aes( ymin=stima-std, ymax=stima+std))
p + scale_x_log10()

ggsave(paste("PiPlot_",num_cores,".pdf", sep=""))

Besides loading the library chosen for the parallelization, note that the doParallel library is told how many cores are available; indeed the two lines

num_cores <- as.numeric(Sys.getenv("SLURM_CPUS_ON_NODE"))
registerDoParallel(cores = num_cores)

read from the execution environment the number of cores reserved on the node and use this information to initialize the pool of workers.

The foreach loop, whose body computes a single estimate at each iteration, tells the library through the %dopar% directive that the block of code has to be executed by one of the workers.

Workshop Case 2: Efficient multiple input handling using R (Marco Mingione)

Efficient multiple input handling using R

Author

Dott. Marco Mingione (marco.mingione@uniroma1.it)

Video recording

Problem Statement

In many real-world data applications, there is a need to process and analyze input data that is originally distributed across multiple (possibly huge) files.

The natural solution would be to read these files sequentially and merge them into a single data object for later analysis. However, such an approach is inefficient, as it requires an execution time that grows linearly with the size of the input.

 

Solutions

We exploit the distributed nature of TeraStat2 by developing a procedure in R that reads the content of several data files in parallel, using multiple processors. To this end we use several parallelization packages available for R and pre-installed on TeraStat 2, such as foreach, doParallel and multicore.

We consider as a case study a dataset reporting hourly data about wave height and direction in the South Mediterranean from January to October 2022.

Parallel solution

#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)

# Load packages
packs <- c("tidyverse", "magrittr", "lubridate", "doParallel", "foreach")
suppressMessages(suppressWarnings(sapply(packs, require, character.only = T)))

# Data
rdata_files <- list.files("data/", full.names = T, pattern = ".RData")
n_files <- as.numeric(args[1])
#n_files <- 10
Ntimes <- 24 # hourly data

# get number of cores
n_cores <- detectCores()
cl <- makeCluster(n_cores-2) # create cluster; it is good practice to leave some threads free
registerDoParallel(cl)       # register the cluster so that %dopar% actually uses it
t0 <- Sys.time() # save initial time
final_dat <- foreach(fl = 1:length(rdata_files[1:n_files]), .combine = bind_rows, .inorder = T, .packages = "tidyverse") %dopar% {
  # Load the ith RData file
  load(rdata_files[fl]) 
  # Create part of the output
  coords_hour <- expand_grid(lat = data$lat, lon = data$lon, date_time = data$time) %>% arrange(date_time, lat, lon)
  hm0 <- lapply(1:Ntimes, function(i) as.vector(data$hm0[,,i])) %>% reduce(c)
  mdr <- lapply(1:Ntimes, function(i) as.vector(data$mdr[,,i])) %>% reduce(c)
  oo <- coords_hour %>% mutate(hm0 = hm0, mdr = mdr)
  rm(coords_hour, hm0, mdr)
  oo # value returned by this iteration of the foreach loop
}
print("Final computation time is: ")
(t1 <- Sys.time() - t0) # save final time
stopCluster(cl) # stop cluster

print("Final object size is: ")
print(object.size(final_dat), units = "Mb")

# final_dat %>% filter(date_time == min(date_time)) %>% ggplot(aes(x = lon, y = lat, color = hm0)) + geom_point()
# 
# final_dat %>% drop_na %>% filter(lon == lon[1], lat == lat[1]) %>% ggplot(aes(x = date_time, y = hm0)) + geom_line()
# final_dat %>% drop_na %>% filter(lon %in% lon[1:100], lat %in% lat[1:100]) %>% 
#   ggplot(aes(x = date_time, y = hm0, group = interaction(lon,lat))) + geom_line()

(click here to download the source code)

 

We provide as well the code required to implement a traditional approach to this problem, by reading the different files sequentially. 

 

Sequential solution

#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)

# Load packages
packs <- c("tidyverse", "magrittr", "lubridate")
suppressMessages(suppressWarnings(sapply(packs, require, character.only = T)))

# Data
rdata_files <- list.files("data/", full.names = T, pattern = ".RData")
n_files <- as.numeric(args[1])
#n_files <- 10
Ntimes <- 24

final_dat <- list() # output list
t0 <- Sys.time() # save initial time
for(fl in rdata_files[1:n_files]){
  # Load the ith RData file
  load(fl)
  # Create part of the output
  coords_hour <- expand_grid(lat = data$lat, lon = data$lon, date_time = data$time) %>% arrange(date_time, lat, lon)
  hm0 <- lapply(1:Ntimes, function(i) as.vector(data$hm0[,,i])) %>% reduce(c)
  mdr <- lapply(1:Ntimes, function(i) as.vector(data$mdr[,,i])) %>% reduce(c)
  final_dat[[fl]] <- coords_hour %>% mutate(hm0 = hm0, mdr = mdr)
  rm(coords_hour, hm0, mdr)
}
final_dat %<>% reduce(bind_rows) 
print("Final computation time is: ")
(t1 <- Sys.time() - t0) # save final time

print("Final object size is: ")
print(object.size(final_dat), units = "Mb")

# final_dat %>% filter(date_time == min(date_time)) %>% ggplot(aes(x = lon, y = lat, color = hm0)) + geom_point()
# 
# final_dat %>% drop_na %>% filter(lon == lon[1], lat == lat[1]) %>% ggplot(aes(x = date_time, y = hm0)) + geom_line()
# final_dat %>% drop_na %>% filter(lon %in% lon[1:100], lat %in% lat[1:100]) %>% 
#   ggplot(aes(x = date_time, y = hm0, group = interaction(lon,lat))) + geom_line()

(click here to download the source code)

 

 

Data

Hourly data for wave height and direction in the South Mediterranean from January 1st to January 31st. One file for each day, for a total of 31 files.

Each file contains an R object called data, which includes:

  • 2 arrays, one for wave height and one for wave direction, each of dimension 361 × 192 × 24 (lon × lat × hour)
  • a vector of length 361 with the distinct longitudes
  • a vector of length 192 with the distinct latitudes
  • a vector of length 24 with the distinct hours

Click here to download the whole dataset (one single ZIP file of approx. 400MB)

Scripts

In the following we report some examples of script files useful for executing either the parallel or the sequential version of the code proposed above on the previously mentioned data. In both cases, we first load the TeraStat 2 module needed for using version 4.0.5 of R and then launch our experiment by specifying the R file to be executed together with its parameters.

Note: all the SBATCH directives (lines 2-10) should be changed at your convenience.

 

Script for executing the parallel version

#!/bin/sh
#SBATCH --job-name="parallel_30"              # jobname
#SBATCH --output=%x_%j.log            # output file
#SBATCH --error=%x_%j.err                # error file
#SBATCH --time=0-01:00:00                 # estimated duration
#SBATCH --mem-per-cpu=2GB               # amount of memory per cpu
#SBATCH -n 30                             # amount of cpu(s)
#SBATCH -N 1                                # amount of server(s)
#SBATCH --mail-user=john.doe@uniroma1.it
#SBATCH --mail-type=ALL

module load R/4.0.5_10gcc
Rscript --vanilla parallel.R 30

(click here to download the source code)

 

Script for executing the sequential version

#!/bin/sh
#SBATCH --job-name="sequential"              # jobname
#SBATCH --output=%x_%j.log            # output file
#SBATCH --error=%x_%j.err                # error file
#SBATCH --time=0-01:00:00                 # estimated duration
#SBATCH --mem-per-cpu=2GB               # amount of memory per cpu
#SBATCH -n 1                             # amount of cpu(s)
#SBATCH -N 1                                # amount of server(s)
#SBATCH --mail-user=john.doe@uniroma1.it
#SBATCH --mail-type=ALL

module load R/4.0.5_10gcc
Rscript --vanilla sequential.R 30

(click here to download the source code)

Workshop Case 4: Simulation and analysis of protein dynamics in unconventional solvents (Leonardo Guidoni)

Simulation and analysis of protein dynamics in unconventional solvents

Author

prof. Leonardo Guidoni

Video recording

Problem Statement

Gromacs is one of the most efficient and versatile programs for simulating the molecular dynamics of liquids, proteins and nucleic acids. In this short demonstration we see how to start and analyze a molecular dynamics simulation of a small protein embedded in an unconventional chemical solvent: a mixture of water and a Deep Eutectic Solvent (DES).

Solution

A molecular dynamics simulation needs three main categories of files:

  • A file (or more files) containing the topology and the force field. For the protein we used the standard Amber force field. For the DES molecules we used an adapted version of the GAFF force field. Gromacs tool used: pdb2gmx.
  • A file containing the starting coordinates. The starting coordinates of the simulation box have been obtained by solvating a lysozyme protein with a pre-equilibrated mixture of water and DES. Gromacs tools used: grompp, mdrun.
  • Input files for energy minimization and molecular dynamics. Two files are used: the first (run0.mdp) for the initial optimization, the second (run1.mdp) for the molecular dynamics run.

Data

Input files

Scripts

Script for execution of the parallel version:

#!/bin/sh
#SBATCH --job-name=test_gromacs # name of the job in the queues
#SBATCH --output=%x_%j.log # output file
#SBATCH --error=%x_%j.err # error file
#SBATCH --mem-per-cpu=1024M # reserved memory
#SBATCH --time=1-00:0:00 # reserved for 1 day
#SBATCH -n 32 # number of cores
#SBATCH -N 1 # number of nodes
#SBATCH --mail-user=
#SBATCH --mail-type=ALL

module load gromacs/2021.3_10gcc

gmx_mpi grompp -f run0.mdp -c starting_coordinates.gro -p topol.top -o run0.tpr
gmx_mpi mdrun -s run0.tpr -deffnm test_gromacs -maxh 24.0

(click here to download the source code)

Workshop Case 5: Parallel Human Genome Data Analysis (Luca Corda)

Parallel Human Genome Data Analysis

Author

Luca Corda

Video recording

Problem Statement

Biological samples are often acquired in batches, generating big data that need to be processed and analyzed simultaneously. When multiple files must go through the same multi-parameter analysis, the most time-efficient and reproducible approach is to run the same script on each file in parallel. In the Giunta Lab at Sapienza, we routinely generate large amounts of genomic data through sequencing experiments that require computational post-processing of samples through custom-made pipelines, as shown in the example below.

Solutions

The architecture of the TeraStat2 cluster allows us to perform our analysis in very few steps, directly using the modules already installed on the cluster. By developing a procedure in bash and using specific tools, it is possible to analyze genomic data in parallel using our custom pipelines. Here, I present, as a proof of principle, a basic pipeline to evaluate CENP-A enriched regions of the human genome starting from publicly available ChIP-seq data and the latest human reference genome assembly:

 

 

Data

A common type of sequencing-derived genomic data are fastq files. These are text files containing all the DNA sequence information. When these raw files are mapped to a reference sequence, which can be an entire genome or portions of it, they become BAM files. Here, we use sequencing data derived from a chromatin immunoprecipitation experiment (ChIP-seq) to investigate the binding of a protein to DNA and its enrichment on specific regions of the genome. We illustrate in this example the spread of Centromeric Protein A (CENP-A) on specific regions of the genome, and test whether our pipeline shows enriched peaks of this protein where we would expect them, namely at human centromeres.
INPUT: multiple human ChIP sequencing fastq files and a reference fasta file.

OUTPUT: for each input file, a binary alignment map (.bam) file, an index file (.bai) and a coverage track file (.bigwig) will be generated.

 

Scripts

This section reports some examples of bash script files useful for performing the ChIP-seq analysis in parallel.

Warning: Standard TeraStat 2 accounts come with 100GB of available disk space. In order to perform the following experiments, you will have to ask the system administrator for an upgrade of your disk quota, otherwise the experiments will fail for lack of disk space.

Download data from NCBI (sra toolkit):

This step obtains the public ChIP-seq data that we will use as input for the analysis. It takes advantage of the download.sh script shown below to download a collection of paired-end reads starting from its SRA id.

#!/bin/sh

module load sratoolkit/3.0.1 #load SRA Toolkit module

fastq-dump --split-files $1  #download the paired-end reads for the SRA id passed as first argument

(click here to download the script)

We used this script to download the paired-end fastq files for each of the 4 experiments:

 

 

Thus, we first created a text file called list.txt containing the ids of all the paired-end read collections to download:

 

SRR13278681
SRR13278682
SRR13278683
SRR13278684

(click here to download the file)

Then, we executed the following command from the command line to download all the collections listed in list.txt:

for i in `cat list.txt`; do bash download.sh $i; done

Note: when run on TeraStat 2, this script requires about 10 hours to download all the input files.

 

Quality check (fastqc)

Data quality check is one of the most important steps in human genomic data analysis. Here, I describe the way we use FastQC, the software that provides a quality report on our starting raw data and allows us to move on to the next step in the shortest possible time, thanks to the parallel running of these jobs. The output of the FastQC tool is an html report with various information about the reads, such as base sequence quality, sequence duplication levels, adapter content and others. In our case, we take advantage of the many cores available on TeraStat 2 by running FastQC with the help of the fastqc.sh script shown below:

#!/bin/sh
#SBATCH --job-name=fastqc                                   #job name                     
#SBATCH --output=%x_%j.log                                  #log file name
#SBATCH --error=%x_%j.err                                   #err file name
#SBATCH --mem-per-cpu=2G                                    #job memory
#SBATCH -n 128                                              #amount of CPU (s)
#SBATCH -N 1                                                #amount of server (s) 
#SBATCH --time=02:00:00                                     #estimated duration
#SBATCH --mail-user=john.doe@uniroma1.it 
#SBATCH --mail-type=ALL

module load fastqc/0.11.9  #load FastQC module

fastqc --threads 128 $1    #parallel quality control analysis

(click here to download the script)

The fastqc.sh script is executed, in turn, on all the previously downloaded fastq files using the following command line:

for i in *fastq; do sbatch fastqc.sh $i; done

Note: when run on TeraStat 2, this step requires about 10 minutes to analyze all the input files.

 

Genome index (bwa index)

The human reference genome we consider in these experiments, t2t-chm13-v2.0.fa.gz, can be retrieved from the following address:


http://t2t.gi.ucsc.edu/chm13/hub/t2t-chm13-v2.0/genome/?C=M;O=D

It can be downloaded in one shot by executing the following command:

wget http://t2t.gi.ucsc.edu/chm13/hub/t2t-chm13-v2.0/genome/t2t-chm13-v2.0.fa.gz

 

The downloaded genome is indexed using the bwa_index.sh script described below:

#!/bin/sh
#SBATCH --job-name=index                     #job name
#SBATCH --output=%x_%j.log                   #log file name
#SBATCH --error=%x_%j.err                    #err file name
#SBATCH --mem-per-cpu=2G                     #job memory
#SBATCH -n 128                               #amount of CPU (s)
#SBATCH -N 1                                 #amount of server (s)
#SBATCH --time=02:00:00                      #estimated duration
#SBATCH --mail-user=john.doe@uniroma1.it 
#SBATCH --mail-type=ALL

module load python/3.9.2_10gcc #load the python module
module load bwa/0.7.17         #load the BWA module

bwa index -a bwtsw -p chm13.v2.0 t2t-chm13-v2.0.fa.gz #index reference genome

(click here to download the script)

 

Note: when run on TeraStat 2, this step requires about 1 hour to analyze the input file.

 

Alignment (bwa mem and samtools)

Burrows-Wheeler Aligner (bwa) is a software package for mapping sequences against a large reference, such as the human genome. With this script it is possible to map all our fastq files in parallel against the reference sequence of our choice. In this case we map to the T2T human reference genome; this step takes between 0.5 and 2 hours to complete. Once completed, we will have our raw Binary Alignment Map (BAM) files.

#!/bin/sh
#SBATCH --job-name=alignment                                #job name          
#SBATCH --output=%x_%j.log                                  #log file name
#SBATCH --error=%x_%j.err                                   #err file name
#SBATCH --mem-per-cpu=2G                                    #job memory
#SBATCH -n 128                                              #amount of CPU (s)
#SBATCH -N 1                                                #amount of server (s)     
#SBATCH --time=90:00:00                                     #estimated duration
#SBATCH --mail-user=john.doe@uniroma1.it 
#SBATCH --mail-type=ALL

module load python/3.9.2_10gcc  #load the python module
module load bwa/0.7.17          #load the BWA module
module load samtools/1.12_10gcc #load the Samtools module

bwa mem chm13.v2.0 $1\_1.fastq $1\_2.fastq -t 128 | samtools view -bSh - > $1\.bam #alignment and SAM > BAM conversion

(click here to download the script)

 

 

Alignment is performed for all the collections of paired-end reads listed in list.txt by executing the following command:

for i in `cat list.txt`; do sbatch alignment.sh $i; done

 

Note: when run on TeraStat 2, this step requires about 2 hours to analyze all the input files.

 

Processing (samtools)

Processing the BAM files is useful to obtain more reliable data. The mapped reads can be filtered in different ways; in the following script, named filter.sort.sh, we remove the unmapped reads, the secondary alignments and the supplementary alignments (the -F 0x904 flag combines 0x4 unmapped, 0x100 secondary and 0x800 supplementary). Additionally, we sort the BAM files, which also reduces their size.

#!/bin/sh
#SBATCH --job-name=filter                                      #job name
#SBATCH --output=%x_%j.log                                     #log file name
#SBATCH --error=%x_%j.err                                      #err file name
#SBATCH --mem-per-cpu=2G                                       #job memory
#SBATCH -n 128                                                 #amount of CPU (s) 
#SBATCH -N 1                                                   #amount of server (s) 
#SBATCH --time=90:00:00                                        #estimated duration
#SBATCH --mail-user=john.doe@uniroma1.it 
#SBATCH --mail-type=ALL

module load samtools/1.12_10gcc #load samtools module

samtools view -h -F 0x904 $1 | samtools sort -@ 128 - -O BAM -o $1\.filter.sort.bam      #filter and sort by coordinate
samtools index $1\.filter.sort.bam                                                       #index bam file                                              

(click here to download the script)

The filter.sort.sh script is run on all bam files using the following command line:

for i in *bam; do sbatch filter.sort.sh $i; done

 

Note: when run on TeraStat 2, this step requires about 10 minutes to analyze all the input files.

 

Coverage track (deeptools)

This tool takes an alignment of reads or fragments (a BAM file) as input and generates a coverage track (bigWig or bedGraph) as output.
The coverage is calculated as the number of reads per bin, where bins are short consecutive counting windows of a defined size.

#!/bin/sh
#SBATCH --job-name=coverage                                    #job name
#SBATCH --output=%x_%j.log                                     #log file name
#SBATCH --error=%x_%j.err                                      #err file name
#SBATCH --mem-per-cpu=2G                                       #job memory
#SBATCH -n 128                                                 #amount of CPU (s)
#SBATCH -N 1                                                   #amount of server (s)
#SBATCH --time=05:00:00                                        #estimated duration
#SBATCH --mail-user=john.doe@uniroma1.it 
#SBATCH --mail-type=ALL

module load python/3.9.2_10gcc      #load the python module
module load deeptools/              #load the deepTools module

bamCoverage -b $1 -o $1\.bw --binSize 10 --normalizeUsing RPGC --effectiveGenomeSize 3000000000 #create track coverage files

(click here to download the script)

 

Type this command to run the script:

for i in *filter.sort.bam; do sbatch coverage.sh $i; done

 

Note: when run on TeraStat 2, this step requires about 40 minutes to analyze all the input files.

 

Conclusion

After the ChIP-seq data analysis, we can obtain a graphical visualization of the peak files, which show the coverage of mapped reads along the reference genome, using the Integrative Genomics Viewer (IGV) tool. After the alignment of raw reads to the genome and data filtering, it is now possible to identify enriched signal regions (peak calling) for our protein of interest. We can choose to visualize the spread of our protein along an entire chromosome or zoom in on our regions of interest. The figure below shows an example for the entire chromosome 5:

 

Workshop Case 6: Parallel Monte Carlo Markov Chain with JAGS and R (Gianmarco Caruso)

Parallel Monte Carlo Markov Chain with JAGS and R

Authors

Dott. Gianmarco Caruso (gianmarco.caruso@uniroma1.it)

 

Problem Statement

We consider a simple context in which we want to estimate the parameters (intercept, slope, and error variance) of a simple linear regression model. The parameters are estimated in a Bayesian framework exploiting Monte Carlo Markov Chains (MCMCs) via the well-known Bayesian program JAGS (which can be used from R). To evaluate the ability of the method to recover the true values of the model parameters, we consider a simulation study where K datasets are simulated using the same model specification and the true model is fitted to each of the K replicas.

 

Solutions

We show how parallelization over the K replicas provides a significant efficiency gain over the classical sequential solution. Moreover, parallelization is also possible within each single model fit, e.g., by running parallel chains. Finally, we provide some summaries to evaluate the performance of the model.
 
We work with simulated data. Details on the simulation setup can be found in the following simSetting.R script.

Parallel solution

fitParallel.R 

source("simSetting.R") #run the script with the simulation setting

require(R2jags)
require(doParallel)

# FIT A SIMPLE LINEAR MODEL ON EACH REPLICAS (PARALLEL SOLUTION)

ncores = detectCores() # get number of cores available on the machine
cl = makeCluster(ncores-2) # create cluster 
doParallel::registerDoParallel(cl)

start.time = Sys.time() # save initial time

fit_parallel = foreach(k = 1:K, .inorder = T, .packages = "R2jags") %dopar% {
  
  #jags.parallel parallelises over the number of chains
  
  jags.parallel(data = list("x"=sim_data[[k]]$x, 
                   "y"=sim_data[[k]]$y),
                inits = start.values,
                parameters.to.save = c("beta0","beta1","sigma"),
                model.file = "simplereg.txt",
                n.chains = nc,
                n.iter = niter,
                n.burnin = nburnin,
                n.thin = thin)
}

print("Final computation time is: ")
(parallel_time = Sys.time() - start.time)

stopCluster(cl) # stop cluster

(click here to download the source code)

 

We provide as well the code required to implement a traditional approach to this problem.

 

Sequential solution

fitSequential.R

source("simSetting.R") #run the script with the simulation setting

require(R2jags)

# FIT A SIMPLE LINEAR MODEL ON EACH REPLICAS (SEQUENTIAL SOLUTION)

start.time = Sys.time() # save initial time

fit_sequential = vector("list",K)

for(k in 1:K){
  
  #jags.parallel parallelises over the number of chains
  
  fit_sequential[[k]] = jags.parallel(data = list("x"=sim_data[[k]]$x, 
                                             "y"=sim_data[[k]]$y),
                                      inits = start.values,
                                      parameters.to.save = c("beta0","beta1","sigma"),
                                      model.file = "simplereg.txt",
                                      n.chains = nc,
                                      n.iter = niter,
                                      n.burnin = nburnin,
                                      n.thin = thin,
                                      export_obj_names = c("nc","niter","nburnin","thin"))
  
}


print("Final computation time is: ")
(sequential_time = Sys.time() - start.time)

(click here to download the source code)

 

The Bayesian model

simplereg.txt

model{

  #Priors
  
  beta0 ~ dnorm(0,0.001) #intercept
  beta1 ~ dnorm(0,0.001) #slope
  sigma ~ dunif(0,100) #standard deviation
  tau = 1/sigma^2 #precision 
  
  #Likelihood
  
  for( i in 1:length(y)){
    
    y[i] ~ dnorm(beta0 + beta1*x[i], tau) 
    
    #NB argument of dnorm() are the mean and the precision (i.e. precision=1/variance)
    
  }
  

}

(click here to download the source code)

 

Data

We use for this experiment simulated data, as returned by the simSetting.R code shown below.

#We consider a simple linear regression model:
#y =  beta0 + beta1*x + epsilon, where epsilon ~ Normal(0, sigma).
#Given n observations (x_i, y_i), we want to obtain a Bayesian estimation (via MCMC implemented in JAGS) of the model's parameters: beta0, beta1 and sigma.


# WRITE THE FUNCTION THAT SIMULATES VALUES FOR THE INDEPENDENT AND DEPENDENT VARIABLES ----

sim.slreg = function(n, beta0, beta1, sigma){
  
  x = rnorm(n, 175, 7)
  y = beta0 + beta1*x + rnorm(n,0,sigma)
  
  return(as.data.frame(cbind(x=x, y=y)))
}

# DETAILS OF THE EXPERIMENT ----

n = 100 #number of units in each dataset
beta0 = -50 #intercept
beta1 = 0.7 #slope
sigma = 3 #standard deviation of the error

# SIMULATING REPLICAS ----

K = 100 #number of replicated datasets

set.seed(123) #setting a seed for reproducibility
sim_data = replicate(K, 
                     simplify = FALSE,
                     sim.slreg(n, beta0, beta1, sigma))

# BUILD A FUNCTION WHICH SIMULATES A LIST OF STARTING VALUES FOR EACH CHAIN ----

start.values = function(){
  beta0 = runif(1,-200,200)
  beta1 = runif(1,-50,50)
  sigma = runif(1,0,10)
  list(beta0=beta0, beta1=beta1, sigma=sigma)
}

# DETAILS OF THE MCMC RUN ----

nc = 2 #number of chains
niter = 1e5 #total number of iterations for each chain
nburnin = niter/2 #the first half of the iterations is discarded
thin = 1 #no thinning

 

Scripts

In the following we report some examples of script files useful for executing either the parallel or the sequential version of the code proposed above on the previously mentioned data. In both cases, we first load the TeraStat 2 modules needed for using JAGS and R and then launch our experiment by specifying the R file to be executed.

Note: all the SBATCH directives (lines 2-8) should be changed at your convenience.

 

Script for executing the parallel version

#!/bin/sh
#SBATCH --job-name="parallel"              # jobname
#SBATCH --output=%x_%j.log            # output file
#SBATCH --error=%x_%j.err                # error file
#SBATCH --time=0-01:00:00                 # estimated duration
#SBATCH --mem-per-cpu=2GB               # amount of memory per cpu
#SBATCH -n 30                             # amount of cpu(s)
#SBATCH -N 1                                # amount of server(s)

module load jags
module load R
Rscript --vanilla fitParallel.R

(click here to download the script)

 

Script for executing the sequential version

#!/bin/sh
#SBATCH --job-name="sequential"              # jobname
#SBATCH --output=%x_%j.log            # output file
#SBATCH --error=%x_%j.err                # error file
#SBATCH --time=0-01:00:00                 # estimated duration
#SBATCH --mem-per-cpu=2GB               # amount of memory per cpu
#SBATCH -n 1                             # amount of cpu(s)
#SBATCH -N 1                                # amount of server(s)

module load jags
module load R
Rscript --vanilla fitSequential.R

(click here to download the script)

 

Output

Once the experiment has finished, it is possible to evaluate the performance of the model using the following code.
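
Note that the fitting scripts shown above do not explicitly save their results to disk; the analysis below assumes that the list of fitted models (and the timing) was saved into an RData file at the end of each run, for instance with a line like the following (a hypothetical addition, not present in fitParallel.R as listed above):

save(fit_parallel, parallel_time, file = "outParallel.RData")  # assumed final line of fitParallel.R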

load("outParallel.RData")
#load("outSequential.RData") #same content as outParallel.RData

require(HDInterval) #we load a package useful for computing HPD credible interval

# ANALYSING THE CONTENT OF THE FIRST REPLICA ----

fit_parallel[[1]]$BUGSoutput #posterior summaries for the first replica

beta0_d1 = fit_parallel[[1]]$BUGSoutput$sims.matrix[,"beta0"]
beta1_d1 = fit_parallel[[1]]$BUGSoutput$sims.matrix[,"beta1"]
sigma_d1 = fit_parallel[[1]]$BUGSoutput$sims.matrix[,"sigma"]

hdi(beta0_d1, credMass = 0.95)
hdi(beta1_d1, credMass = 0.95)
hdi(sigma_d1, credMass = 0.95)

par(mfrow=c(2,3))
#estimated posterior densities
hist(beta0_d1); abline(v=beta0, col=2)
hist(beta1_d1); abline(v=beta1, col=2)
hist(sigma_d1); abline(v=sigma, col=2)
#traceplots
plot(beta0_d1,type="l"); abline(h=beta0, col=2)
plot(beta1_d1,type="l"); abline(h=beta1, col=2)
plot(sigma_d1,type="l"); abline(h=sigma, col=2)

summary(lm(y~x, data = sim_data[[1]])) #comparing with the solution obtained in the frequentist framework

# BRIEF ANALYSIS OF THE PERFORMANCE OVER THE K REPLICAS ----

## Posterior estimates

#NB. We excluded the third column which refers to the chain of the 'deviance'

post_means = sapply(fit_parallel, 
                    function(x) colMeans(x$BUGSoutput$sims.matrix[,-3])) 

par(mfrow=c(1,3))
boxplot(post_means["beta0",], main=expression(hat(beta)[0])); abline(h=beta0, col=2)
boxplot(post_means["beta1",], main=expression(hat(beta)[1])); abline(h=beta1, col=2)
boxplot(post_means["sigma",], main=expression(hat(sigma))); abline(h=sigma, col=2)

## Credible intervals and coverage

CI95_beta0 = sapply(fit_parallel, 
                    function(x) hdi(x$BUGSoutput$sims.matrix[,"beta0"], credMass = 0.95))
CI95_beta1 = sapply(fit_parallel, 
                    function(x) hdi(x$BUGSoutput$sims.matrix[,"beta1"], credMass = 0.95))
CI95_sigma = sapply(fit_parallel, 
                    function(x) hdi(x$BUGSoutput$sims.matrix[,"sigma"], credMass = 0.95))

covCI95_beta0 = mean(apply(CI95_beta0, 2, function(x) beta0>x[1] & beta0<x[2]))
covCI95_beta1 = mean(apply(CI95_beta1, 2, function(x) beta1>x[1] & beta1<x[2]))
covCI95_sigma = mean(apply(CI95_sigma, 2, function(x) sigma>x[1] & sigma<x[2]))

par(mfrow=c(1,3))

plot(0,0, xlim=c(min(CI95_beta0), max(CI95_beta0)), ylim=c(0,K), xlab=expression(beta[0]), ylab="replica", main=paste("coverage=",covCI95_beta0))
for(i in 1:K) segments(CI95_beta0[1,i],i,CI95_beta0[2,i],i, col=3)
abline(v=beta0, col=2)

plot(0,0, xlim=c(min(CI95_beta1), max(CI95_beta1)), ylim=c(0,K), xlab=expression(beta[1]), ylab="replica", main=paste("coverage=",covCI95_beta1))
for(i in 1:K) segments(CI95_beta1[1,i],i,CI95_beta1[2,i],i, col=4)
abline(v=beta1, col=2)

plot(0,0, xlim=c(min(CI95_sigma), max(CI95_sigma)), ylim=c(0,K), xlab=expression(sigma), ylab="replica", main=paste("coverage=",covCI95_sigma))
for(i in 1:K) segments(CI95_sigma[1,i],i,CI95_sigma[2,i],i, col=6)
abline(v=sigma, col=2)

 

(click here to download the source code)

Workshop Case 7: Scaling Intensive Calculations in Matlab: Applications (Mathworks)

Scaling Intensive Calculations in MATLAB: Applications

Requirements  

Before trying to run these experiments, follow the directions provided in https://www.dss.uniroma1.it/it/HPCTeraStat2/IntegrazioneMatlab about configuring a Matlab installation to work with TeraStat 2.

Problem Statement

This workshop explores the solution of three problems using parallel computing: a Monte Carlo simulation of the birthday paradox, the solution of the Van der Pol oscillator, and the training of a classification model for chest X-ray images.

Solution

Download the workshop_parallelcomputing. file, extract its content with MATLAB and follow the instructions inside the workshop_script.mlx file.

Data

NIH Chest X-ray Dataset, which can be found here:

https://www.kaggle.com/datasets/nih-chest-xrays/data

Scripts

No scripts are needed for these experiments as their execution on TeraStat 2 is automatically instrumented by Matlab, provided that the initial configuration has been completed.