2nd Workshop on Supercomputing @ Department of Statistical Sciences
On Tuesday, 23 April, starting at 9:00, the Department of Statistical Sciences will host the 2nd Workshop on Supercomputing.
The event, which can be attended in person or remotely, aims to illustrate the potential and opportunities offered by supercomputing tools in scientific research. It will also include technical sessions in which short case studies from a wide range of research domains will be solved using TeraStat 2, the supercomputer of the Department of Statistical Sciences, which is part of the University Research Infrastructure (Infrastruttura di Ricerca di Ateneo). The materials used in the presentations will be made available to participants in advance, so that they can replicate the presented experiments themselves.
Venue
The workshop will be held in Aula 1.01, on the first floor of Edificio D at the Sapienza site in Viale Regina Elena 295, starting at 9:00, and will continue throughout the day according to the programme (Google Maps link). Directions for reaching the room from the rear of the Città Universitaria are attached.
Event organized with the support of:
- PhD programme in Environmental and Evolutionary Biology
- PhD programme in Genetics and Molecular Biology
- PhD programme in Infrastructures and Transport
- PhD programme in Computer Engineering
- National PhD programme in Earth Observation
- PhD programme "School of Statistical Sciences"
- Sapienza School for Advanced Studies (Scuola Superiore di Studi Avanzati Sapienza)
To register:
Please note: registration has been extended until Monday, 22 April.
Programme of talks
Institutional Greetings (9:00)
- Alberto Marchetti Spaccamela (Vice-Rector for Digital Technologies, President of InfoSapienza)
- Fabio Sciarrino (Vice-Rector for Competitive Strategies for International Research)
- Giovanna Jona Lasinio (Director of the Department of Statistical Sciences)
Morning Session
- 09:20 The TeraStat 2 project (Umberto Ferraro Petrillo, scientific lead of TeraStat 2)
- 09:40 Basic concepts and first connection to TeraStat 2 (Antonio Mastrandrea, Emanuele Corti, Department of Statistical Sciences)
- 10:20 Leveraging the multi-core capabilities of a super-computing cluster to train Machine Learning models in R (Pierfrancesco Alaimo Di Loro, LUMSA)
  Abstract: Recent strides in supercomputing technology have shifted the focus from "slim" statistical models to "brute-force" algorithmic routines that leverage the new computational capabilities to learn unstructured patterns from huge amounts of data. Versatile Machine Learning methods now have the upper hand in solving many real-world challenges, but applying them within reasonable time frames demands exploiting all the available computational resources. This necessity becomes particularly pronounced when ensuring robust validation and uncertainty quantification through resampling techniques such as cross-validation and the bootstrap, which require training the same model a (sometimes large) number of times. The execution time of a naive sequential implementation of these techniques grows linearly with the number of resamples, which is highly inefficient. Since such operations can be executed independently, we propose harnessing the distributed architecture of TeraStat 2 via R to perform them in parallel on multiple cores. This can significantly reduce the overall execution time whenever the cost of each single operation is sufficiently large.
  Tools used: R, tidyverse, xgboost, caret
- 11:00 Fast activity rhythms estimation of bears through Bayesian modeling in Stan (Aurora Donatelli, PhD programme in Environmental and Evolutionary Biology)
  Abstract: The aim of the study is to evaluate the effects of anthropogenic pressure, land productivity, and ambient temperature on the circadian activity rhythms of brown bears at a transcontinental scale. The study used complex Bayesian models implemented in R with the Stan language, capable of handling extensive datasets and incorporating multiple variables. The models were run on TeraStat 2 to handle the large dataset (approximately 400,000 lines) and the complex model variables efficiently. This allowed multiple models and parallelized processes to run simultaneously, significantly speeding up the analysis.
  Tools used: R, Stan
- 11:40 De novo diploid human genome assembly using TeraStat 2 (Emilia Volpe, PhD programme in Genetics and Molecular Biology)
  Abstract: In recent years, rapid advancements in sequencing technologies have revolutionized human genomics, making fields such as de novo genome assembly more accessible. Here, we propose a new concept: creating a complete reference genome against which to map and analyze omics data from the same cell line in an isogenomic manner. With this goal, we assemble the genome of the non-tumoral diploid hTERT RPE-1 human cell line. For the assembly, we used the latest third-generation sequencing technologies (PacBio High Fidelity, Oxford Nanopore Technology, and Hi-C) and assemblers. Obtaining the assembly required 128 computational cores and 256 GB of RAM on TeraStat 2 for an overall time of 3 days. We showcase the use of matched reference-reads for high-precision alignments.
  Tools used: Conda, R, Snakemake, Python, Rukki, MBG, Graphaligner, Winnowmap, Bedtools, Samtools, BWA
- 12:20 Consistency of Maximum Likelihood Estimators via EM in relative survival cure models: A large-scale simulation study (Fabrizio Di Mari, PhD School in Statistical Sciences)
  Abstract: In this case study, we implement the EM algorithm to obtain the Maximum Likelihood Estimates of a Relative Survival Cure Model. We then assess whether increasing the sample size brings our estimates closer to the true values of the model parameters. For each sample size, the algorithm runs in parallel on 250 simulated datasets. Once all estimates have been obtained, the mean squared difference between the estimates and the true values is computed across the datasets as an estimate of the MSE.
  Tools used: R, parallel
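As a rough illustration of the simulation design described in the 12:20 abstract (many independent simulated datasets per sample size, fitted in parallel, then averaged into an MSE estimate), here is a minimal Python sketch. It is not the speaker's code: the talk uses R with the parallel package and an EM algorithm for a relative survival cure model, whereas this sketch swaps in the closed-form MLE of an exponential rate to stay short. The names TRUE_RATE, fit_one and estimate_mse are hypothetical.

```python
import random
from concurrent.futures import ProcessPoolExecutor

TRUE_RATE = 2.0  # hypothetical "true" parameter used to simulate the data


def fit_one(task):
    """Simulate one dataset and return the squared error of its estimate."""
    sample_size, seed = task
    rng = random.Random(seed)  # independent, reproducible stream per dataset
    data = [rng.expovariate(TRUE_RATE) for _ in range(sample_size)]
    rate_hat = sample_size / sum(data)  # closed-form MLE (stand-in for EM)
    return (rate_hat - TRUE_RATE) ** 2


def estimate_mse(sample_size, n_datasets=250, workers=1):
    """Average the squared errors over n_datasets independent simulations."""
    tasks = [(sample_size, seed) for seed in range(n_datasets)]
    if workers > 1:
        # The datasets are independent, so they can be spread across processes.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            errors = list(pool.map(fit_one, tasks))
    else:
        errors = [fit_one(task) for task in tasks]
    return sum(errors) / len(errors)
```

Calling estimate_mse(n, workers=4) for increasing values of n should show the estimated MSE shrinking as the sample size grows, which is the consistency check the abstract describes. On platforms that start worker processes by spawning, the parallel call needs to sit under an `if __name__ == "__main__":` guard.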
Afternoon Session
- 14:00 Leverage TeraStat2 to speed up MATLAB Algorithms and Applications (Alessio Conte, MathWorks)
  Abstract: In this talk, we will present an innovative technique for submitting MATLAB code to TeraStat 2 directly from the MATLAB environment on your laptop. In addition, we will present fundamental tools and techniques to optimize and parallelize your MATLAB algorithms and applications on TS2 through simple case studies. Instructions for configuring MATLAB with TS2 will be provided before the event; make sure to complete the integration beforehand in order to take an active part in the hands-on tasks.
  Tools used: MATLAB, Parallel Computing Toolbox
- 14:40 Containers: an ocean of softwares for NGS data analysis (and everything else) (Giacomo Chiappa, PhD programme in Environmental and Evolutionary Biology)
  Abstract: In this talk, we will show how to use containers on TeraStat 2 by discussing a case study involving marine gastropod protein files. First, we will filter the input files by sequence length using the SeqKit package. Then, we will identify proteins that match a toxin reference database using the BLAST package.
  Tools used: Singularity, SeqKit, BLAST
- 15:20 Optimizing Computational Costs in Fluid Dynamics Simulator with Terastat: Scaling Techniques and Applications (Marta Galuppi, PhD programme in Infrastructures and Transport)
  Abstract: The objective of this project is to enhance the computational efficiency of Fire Dynamics Simulator (FDS), a computational fluid dynamics (CFD) model of fire-driven fluid flow, by leveraging TeraStat 2 (TS2), a High-Performance Computing (HPC) cluster. The work focuses primarily on scaling techniques and their practical applications within the FDS framework. FDS is an open-source scientific tool released by NIST for simulating fire scenarios in a controlled virtual environment; it numerically solves a form of the Navier-Stokes equations suitable for low-speed, thermally driven flow. However, the computational costs of fluid dynamics simulations, particularly for large-scale systems or complex fluid behaviours, present significant challenges. By exploiting the capabilities of TS2, the goal is to reduce these costs, enabling faster and more resource-efficient simulations. This optimization is crucial for detailed analyses of complex fire dynamics, including heat transfer, smoke movement, and structural element interactions, in alignment with observed data from model-scale tests. The iterative process of validating and refining FDS models contributes to the advancement of fire safety research and engineering practices, ultimately promoting the development of more accurate and reliable fire safety strategies.
  Tools used: FDS
- 16:00 Study of the Protein Structural Dynamics in Solution (Alessandro Nicola Nardi, Giuseppe Chen, Department of Chemistry)
  Abstract: Classical molecular dynamics (MD) simulations are nowadays among the most commonly used computational techniques for obtaining information about the structural dynamics of condensed-phase systems. The Gromacs software package is an efficient and versatile program for performing classical MD simulations of proteins and nucleic acids in solution. Gromacs offers extensive (CPU and GPU) parallelization that speeds up bottleneck calculations. In this demonstration, we will exploit the CPU parallelization offered by Gromacs on TeraStat 2. As a case study, the performance of TS2 will be tested on the dynamics of a small protein in aqueous solution. Additionally, the typical analyses performed on an MD trajectory will be shown.
  Tools used: Gromacs
- 16:40 Benchmarking computational topology tools using interactive jobs (Riccardo Ceccaroni, PhD programme in Data Science)
  Abstract: We frequently encounter multiple software implementations that tackle the same problem, which prompts the need to identify the most efficient option. This scenario is evident in the computation of persistent homology, a method for computing topological features of a space at different spatial resolutions, for which numerous software solutions have emerged. This study focuses on software designed to extract topological features from images, with a specific interest in zero-dimensional persistent homology obtained through simplicial complex filtering. Two options for this task are Ripser, renowned for its stability and versatility, and the newly developed PixHomology. To determine the most efficient solution in terms of execution time and memory usage, we propose a benchmarking approach on TeraStat 2 that uses interactive jobs to obtain results in real time.
  Tools used: gcc, conda, wget, git, ripser
TeraStat 2
TeraStat 2 (TS2) is the general-purpose supercomputer of the Department of Statistical Sciences for solving mathematical and statistical models on Big Data. Overall, the system has 12 "fat" compute nodes for a total of 1,920 cores. Access to TS2 is provided free of charge to all Sapienza staff who need it for projects that require supercomputing. For particularly intensive use of the computing resources, a fee of non-commercial value is charged, which contributes to covering the maintenance and upgrade costs of TS2. Further information on TeraStat 2 is available at: https://www.dss.uniroma1.it/