Álvaro Méndez Civieta, Universidad Carlos III de Madrid
Predictive Models for Big Data Environments
In recent years, advances in data collection technologies have made it possible to extract increasingly large and complex datasets, posing a difficult challenge. Traditionally, statistical methodologies dealt with datasets where the number of variables did not exceed the number of observations; however, problems where the number of variables is larger than the number of observations have become more and more common, arising in areas such as economics, genetics, climate science, and computer vision. This has required the development of new methodologies suitable for the high-dimensional framework.
Most statistical methodologies are limited to the study of averages: least squares regression, principal component analysis, partial least squares, and many other techniques provide mean-based estimates and are built around the key assumption that the data are normally distributed. This assumption often goes unverified in real datasets, where skewness and outliers are easily found. Estimating other metrics, such as the quantiles, can provide a more complete picture of the data distribution.
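As a small illustration of this point (not part of the thesis itself), the sketch below draws from a right-skewed distribution and compares the sample mean with a few sample quantiles; the mean is pulled toward the heavy tail, while the quantiles describe where the bulk of the data actually lies.

```python
import numpy as np

# Illustrative sketch: on a right-skewed sample, the mean is pulled toward
# the tail, while the quantiles describe the bulk of the distribution.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

mean = sample.mean()
q10, q50, q90 = np.quantile(sample, [0.1, 0.5, 0.9])

# For a lognormal(0, 1) the population median is exp(0) = 1, while the
# population mean is exp(0.5) ~ 1.65: the mean sits well above the median.
print(mean, q10, q50, q90)
```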
This thesis is built around these two core ideas: the development of more robust, quantile-based methodologies suitable for high-dimensional problems. The first contribution is centered on the regression framework and studies the formulation of an adaptive sparse group lasso for quantile regression, a flexible formulation that uses adaptive weights to help correct the bias of the well-known sparse group lasso penalization, thereby improving variable selection and prediction accuracy.
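To make the ingredients of this formulation concrete, the sketch below evaluates the two pieces of such an objective on toy data: the quantile (pinball) loss and an adaptive sparse group lasso penalty. This is a simplified illustration under common conventions for these terms, not the thesis's estimator or its solver; the weight vectors `w` and `v` are hypothetical (in practice, adaptive weights come from an initial fit).

```python
import numpy as np

def pinball_loss(y, pred, tau):
    """Quantile (pinball) loss rho_tau, averaged over observations."""
    r = y - pred
    return np.mean(np.maximum(tau * r, (tau - 1) * r))

def adaptive_sgl_penalty(beta, groups, w, v, alpha, lam):
    """Adaptive sparse group lasso penalty (a common convention):
    lam * (alpha * sum_j w_j |beta_j|
           + (1 - alpha) * sum_g v_g sqrt(p_g) ||beta_g||_2)."""
    lasso = np.sum(w * np.abs(beta))
    group = sum(
        v[g] * np.sqrt(len(idx)) * np.linalg.norm(beta[idx])
        for g, idx in enumerate(groups)
    )
    return lam * (alpha * lasso + (1 - alpha) * group)

# Toy evaluation of the penalized objective at a candidate beta.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.5, 0.0])
y = X @ beta + rng.normal(size=50)

groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
w = np.ones(6)  # hypothetical adaptive weights (lasso part)
v = np.ones(2)  # hypothetical adaptive weights (group part)
obj = pinball_loss(y, X @ beta, tau=0.5) + adaptive_sgl_penalty(
    beta, groups, w, v, alpha=0.5, lam=0.1
)
```

Minimizing such an objective over `beta` is what the actual methodology does; here only the objective evaluation is shown.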
An alternative solution to the high-dimensional problem is the use of a dimension reduction technique like partial least squares. Partial least squares (PLS) is a methodology initially proposed in the field of chemometrics as an alternative to traditional least squares regression when the data is high dimensional or suffers from collinearity. It works by projecting the independent data matrix onto a subspace of uncorrelated variables that maximize the covariance with the response matrix. However, being an iterative process based on least squares makes this methodology extremely sensitive to the presence of outliers or heteroscedasticity. The second contribution of this thesis defines the fast partial quantile regression, a technique that performs a projection onto a subspace where a quantile covariance metric is maximized, effectively extending partial least squares to the quantile regression framework.
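A simplified single-component sketch of this idea is shown below, assuming one common definition of quantile covariance (the one proposed by Li, Li and Tsai, 2015); the actual fast partial quantile regression algorithm is more involved. Where PLS weights each predictor by its ordinary covariance with the response, this sketch weights each predictor by its quantile covariance before building a latent score.

```python
import numpy as np

def quantile_covariance(x, y, tau):
    """One common quantile covariance (Li, Li and Tsai, 2015):
    qcov_tau(x, y) = E[psi_tau(y - Q_tau(y)) * (x - E[x])],
    with psi_tau(u) = tau - 1{u < 0}."""
    psi = tau - (y < np.quantile(y, tau)).astype(float)
    return np.mean(psi * (x - x.mean()))

# Simplified one-component projection sketch (not the thesis algorithm):
# weight each predictor by its quantile covariance with the response,
# mimicking how PLS weights predictors by ordinary covariance.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(size=200)

w = np.array([quantile_covariance(X[:, j], y, tau=0.5) for j in range(5)])
w /= np.linalg.norm(w)  # normalized projection direction
t = X @ w               # first latent score vector
```

In a full method, further components would be extracted iteratively after deflating the data, as in PLS.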
Another field where high-dimensional data is common is functional data analysis, where the observations are functions measured over time instead of scalars. A key technique in this field is functional principal component analysis (FPCA), a methodology that provides an orthogonal set of basis functions that best explains the variability in the data. However, FPCA fails to capture shifts in the scale of the data that affect the quantiles. The third contribution introduces the functional quantile factor model, a methodology that extends the concept of FPCA to quantile regression, obtaining a model that can explain the quantiles of the data conditional on a set of common functions.
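For readers less familiar with FPCA, the sketch below shows the standard discretized approximation: when curves are observed on a common dense grid, FPCA reduces to an eigendecomposition of the sample covariance of the discretized curves. This illustrates plain FPCA only, not the functional quantile factor model; the toy curves and basis functions are invented for the example.

```python
import numpy as np

# Minimal FPCA sketch on a common dense grid: eigendecompose the sample
# covariance of the discretized curves (a standard approximation).
rng = np.random.default_rng(2)
grid = np.linspace(0, 1, 50)
n = 100
# Toy functional data: random amplitudes on two smooth basis functions,
# with the first mode dominating (amplitude 2.0 vs 0.5), plus small noise.
scores = rng.normal(size=(n, 2)) * np.array([2.0, 0.5])
basis = np.vstack([np.sin(2 * np.pi * grid), np.cos(2 * np.pi * grid)])
curves = scores @ basis + 0.05 * rng.normal(size=(n, len(grid)))

centered = curves - curves.mean(axis=0)
cov = centered.T @ centered / n
eigvals, eigvecs = np.linalg.eigh(cov)   # ascending order of eigenvalues
fpc1 = eigvecs[:, -1]                    # leading eigenfunction (discretized)
explained = eigvals[-1] / eigvals.sum()  # variance explained by 1st component
```

Since the first mode was generated with much larger amplitude, the leading eigenfunction recovers it and explains most of the variance.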
The final contribution of this thesis is the asgl package for Python, a package that solves penalized least squares and quantile regression models in low- and high-dimensional settings, filling a gap in the currently available implementations of these models.