
Latent variables account for unobserved heterogeneity: they induce dependence (or similarity) between observations in the same class (as in multilevel, longitudinal, or multi-way data), or explain why certain units or variables are closely related while others are not (as in network or multivariate data).
Latent Variable Models (LVMs) are used successfully in many empirical areas and comprise methods for both supervised and unsupervised learning. The former include, among others, Generalized Linear Mixed Models, Item Response Theory models, and Multilevel models. The latter include Finite Mixture Models for clustering and factor models. Finite Mixture Models have gained popularity in recent decades as a powerful tool for modelling distributions and handling unobserved heterogeneity in the data. Factor models summarize observed variables by describing their interrelationships through a smaller number of latent factors.
Clustering and factor models can also be formalized without any distributional assumption, as heuristic methods in the larger class of Dimensionality Reduction Methods (DRMs). These methods summarize data along the dimension of units and/or variables, aiming to lose only a negligible portion of the information.
Research on LVMs and DRMs has been intensive in recent years, but most of the proposed tools are still inadequate for handling complex data. Complexity may arise either from specific features of the data collection process (when integrated data are obtained from multiple sources, or when information gathered from innovative data sources comes with unconventional structures, such as networks or trees), or from features of the data themselves (when data are collected through surveys with multiple waves, or when repeated observations on the same units are recorded, as in multivariate panel/longitudinal data). We may either apply standard LVMs and DRMs, at the cost of losing a significant portion of the available information, or handle the complexity and the rich information that comes with it by defining more appropriate models and methods. These should be computationally and statistically efficient, so as to uncover relevant information that would otherwise remain hidden.
This research project focuses on Latent Variable Models (LVMs) and Dimensionality Reduction Methods (DRMs) for analysing complex, high-dimensional data, with two main aims:
(i) Converting complex data into rich information by formulating new methods and models to address the challenges of unconventional structures, such as multi-way (space-time, multivariate longitudinal), relational, functional, mixed-type, and multilevel data (with observations nested within known clusters), possibly coming from multiple sources.
(ii) Applying the proposed methods and models to complex data coming from several application fields, in particular education and health. Education and health are both centred on people, and they share features calling for a common methodological approach.
To address the methodological target (i) of the research project, the following lines of research, grouped according to LVMs (section A) and DRMs (section B), will be developed:
A1. General LVMs: Model specification and variable selection
A2. Advances in LVMs for spatial, temporal, and spatio-temporal data
A3. Advances in LVMs for item response data
A4. Advances in LVMs for multilevel data
A5. Advances in LVMs for ordinal data
B1. Advances in DRMs for high-dimensional data
B2. Advances in DRMs for functional and relational data
B3. Advances in DRMs for mixed-type data
B4. Advances in DRMs for three-way data
B5. Advances in DRMs for network data
In line with the applicative target (ii), a major focus of this project is to establish a clear connection between theory and practice, i.e. learning from real-life problems how to modify theoretical models to address relevant practical scientific issues. The project focuses on education and health. People are complex entities whose main characteristics, e.g. ability and well-being, are not always directly observable. These characteristics are measured by manifest outcomes, e.g. through achievement and questionnaire items, which raises a measurement error issue. Latent variables (LVs) are the standard statistical tool for representing unobserved characteristics measured by manifest traits. The focus on people also raises the issue of missing data, common to education and health, calling for the adoption of specific methods to measure their impact on results. Education and health data are often characterised by hierarchical structures, with units nested within groups (schools, hospitals, etc.). Such structures are relevant since people from the same environment may have similar outcomes. Some research questions involve different levels, e.g. the effect of the school environment on student achievement or of hospital organisation on patient outcomes. Multilevel models explicitly account for the nesting structure and directly answer such research questions while adjusting for correlated outcomes. Education and health data are also characterised by complex structures, with variables observed across time, e.g. a cohort may be monitored through repeated measurements to evaluate the effects of a risk factor. Here it is crucial to reduce complexity through DRMs, in particular for high-dimensional data.