


Title Authors Journal Abstract
Copying@Scale: Using harvesting accounts for collecting correct answers in a MOOC Alexandron, G., Ruipérez-Valiente, J. A., Chen, Z., Muñoz-Merino, P.J., & Pritchard, D. E. Computers & Education, Vol 108, pp 96-114 (2017) This paper presents a detailed study of a form of academic dishonesty that involves the use of multiple accounts for harvesting solutions in a Massive Open Online Course (MOOC). It is termed CAMEO – Copying Answers using Multiple Existence Online. CAMEO is detected using educational data mining. The study has three main goals: determining the prevalence of CAMEO, studying its detailed characteristics, and inferring the motivation(s) for using it.
Improving the Graphical Lasso Estimation for the Precision Matrix Through Roots of the Sample Covariance Matrix Avagyan, V., Alonso, A.M., and Nogales, F.J. Journal of Computational and Graphical Statistics Vol 26, Issue 4, pp 865-872 (2017) In this article, we focus on the estimation of a high-dimensional inverse covariance (i.e., precision) matrix. We propose a simple improvement of the graphical Lasso (glasso) framework that is able to attain better statistical performance without significantly increasing the computational cost. The proposed improvement is based on computing a root of the sample covariance matrix to reduce the spread of the associated eigenvalues. Through extensive numerical results, using both simulated and real datasets, we show that the proposed modification improves the glasso procedure. Our results reveal that the square-root improvement can be a reasonable choice in practice.
High-fat diet induces metabolic changes and reduces oxidative stress in female mouse hearts Barba, I., Miró-Casas, E., Torrecilla, J.L., Pladevall, E., Tejedor, S., Sebastián-Pérez, R., Ruiz-Meana, M., Berrendero, J.R., Cuevas, A. and García-Dorado, D. Journal of Nutritional Biochemistry, Vol. 40, Pages 187-193 (2017) In this work, we study the differences induced by sex and diet in the metabolic phenotype and mitochondrial function of mice and their relation to cardiac events. The methodology includes the use of variable selection techniques with nuclear magnetic resonance spectra in order to detect relevant metabolites and improve the classification performance.
An Efficient Industrial Big-data Engine Basanta-Val, P. IEEE Transactions on Industrial Informatics, Vol. PP, Issue: 99 (2017) Current trends in industrial systems opt for the use of different big-data engines as a means to process huge amounts of data that cannot be processed with an ordinary infrastructure. The number of issues an industrial infrastructure has to face is large and includes challenges such as the definition of different efficient architecture setups for different applications, and the definition of specific models for industrial analytics. In this context, the article explores the development of a medium size big-data engine (i.e. implementation) able to improve performance in map-reduce computing by splitting the analytic into different segments that may be processed by the engine in parallel using a hierarchical model.
Patterns for Distributed Real-Time Stream Processing Basanta-Val, P., Fernández-García, N., Sánchez-Fernández, L. and Arias-Fisteus, J. IEEE Transactions on Parallel and Distributed Systems, Vol. 28, Issue: 11 (2017) In recent years, big data systems have become an active area of research and development. Stream processing is one of the potential application scenarios of big data systems where the goal is to process a continuous, high velocity flow of information items. High frequency trading (HFT) in stock markets or trending topic detection in Twitter are some examples of stream processing applications. In some cases (for instance, in HFT), these applications have end-to-end quality-of-service requirements and may benefit from the usage of real-time techniques. Taking this into account, the present article analyzes, from the point of view of real-time systems, a set of patterns that can be used when implementing a stream processing application. For each pattern, we discuss its advantages and disadvantages, as well as its impact on application performance, measured as response time, maximum input frequency and changes in utilization demands due to the pattern.
Predictable remote invocations for distributed stream processing Basanta-Val, P., Fernández-García, N. and Sánchez-Fernández, L. Future Generation, DOI: 10.1016/j.future.2017.08.023 (2017) Typical infrastructure for big-data includes multiple machines with data accessed remotely with request–response patterns from different remote locations. Currently, most of the state-of-the-art remote invocation techniques are focused on models for distributed interactions, which have not explored the advantages given by parallel computing, such as those offered to run on distributed stream processors. In this context, the article is focused on the definition of a predictable remote procedure call (RPC) able to take advantage of the distributed stream processing technology.
On the use of reproducing kernel Hilbert spaces in functional classification Berrendero, J.R., Cuevas, A. and Torrecilla, J.L. Journal of the American Statistical Association, DOI: 10.1080/01621459.2017.1320287, (2017) This paper provides: (a) Explicit expressions for the optimal (Bayes) rule in several classification problems of equivalent Gaussian processes.  (b) An interpretation, in terms of mutual singularity, for the “near perfect classification” phenomenon described by Delaigle and Hall (2012) and an asymptotically optimal rule under singularity. (c) As an application, we propose a natural variable selection method and discuss the conditions for optimality. The approach relies on some classical results in the RKHS theory.
Modelling Electricity Swaps with Stochastic Forward Premium Models Blanco, I., Peña, J.I. and Rodríguez, R. Energy Journal, Vol. 39, No. 2 (2017) We present a new model for pricing electricity swaps. We posit that swap electricity prices result from at least three driving forces. First, a stochastic factor acting as an anchor of the level of the forward curve. This is the average “consensus” price for the contracts within a maturity slot (yearly, quarterly, and monthly). Second, an element reflecting deterministic trend-seasonal components, because we assume the market expects weather-related variations in demand. Third, a part accounting for (mean-reverting) stochastic deviations from the last two factors. These deviations depend on time to maturity and length of delivery period. By using a Multivariate Normal Inverse Gaussian (MNIG) distribution, our model embodies realistic probabilities of occurrence of extreme prices. Finally, we test the model using EEX data for the German market.
Humans expect generosity Brañas-Garza, P., Rodríguez-Lara, I. and Sánchez, A. Scientific Reports, 7, Article number: 42446 (2017) Data analysis of experiments with the Dictator game in different setups and countries shows that the majority of people expect generosity from strangers in situations where sharing is non-enforceable.
Combining Multivariate Volatility Forecasts: An Economic-Based Approach Caldeira, J.F., Moura, G.V., Nogales, F.J. and Santos, A.A.P. Journal of Financial Econometrics Vol 15, Issue 2, pp 247-285 (2017) We devise a novel approach to combine predictions of high-dimensional conditional covariance matrices using economic criteria based on portfolio selection. The combination scheme takes into account not only the portfolio objective function but also the portfolio characteristics in order to define the mixing weights. Three important advantages are that i) it does not require a proxy for the latent conditional covariance matrix, ii) it does not require optimization of the combination weights, and iii) it can be calibrated in order to adjust the influence of the best performing models.
Adaptive multiscapes: an up-to-date metaphor to visualize molecular adaptation Catalán, P., Arias, C.F., Cuesta, J. and Manrubia, S. Biology Direct, 12:7 (2017) This paper proposes an update to Wright's fitness landscapes that incorporates the most recent discoveries in molecular evolution.
T-Hoarder: A framework to process Twitter data streams Congosto, M., Basanta-Val, P. and Sanchez-Fernandez, L. Journal of Network and Computer Applications, Vol. 83, Pages 28-39 (2017) This paper describes T-Hoarder: a framework that enables tweet crawling, data filtering, and which is also able to display summarized and analytical information about the Twitter activity with respect to a certain topic or event in a web-page. T-Hoarder is capable of managing very large experiments both in duration (more than one year) and size (millions of tweets).
Langevin diffusions on the torus: estimation and applications Garcia Portugués, E., Sørensen M., Mardia, K.V. and Hamelryck, T. Statistics and Computing, pp 1–22, (2017) We introduce stochastic models for continuous-time evolution of angles and develop their estimation. We focus on studying Langevin diffusions with stationary distributions equal to well-known distributions from directional statistics, since such diffusions can be regarded as toroidal analogues of the Ornstein–Uhlenbeck process. We propose three approximate likelihoods that are computationally tractable and investigate the empirical performance of the approximate likelihoods. The software package sdetorus implements the estimation methods and applications presented in the paper
Disentangling the effects of selection and loss bias on gene dynamics Iranzo J., José A. Cuesta, Susanna Manrubia, Mikhail I. Katsnelson, and Koonin, E. V. Proceedings of the National Academy of Sciences (USA), Early Edition, vol. 114 no. 28 (2017) We combine mathematical modeling of genome evolution with comparative analysis of prokaryotic genomes to estimate the relative contributions of selection and intrinsic loss bias to the evolution of different functional classes of genes and mobile genetic elements
A divisive clustering method for functional data with special consideration of outliers Justel, A. and Svarc, M. Advances in Data Analysis and Classification (2017) This paper presents DivClusFD, a new divisive hierarchical method for the non-supervised classification of functional data. Data of this type present the peculiarity that the differences among clusters may be caused by changes in level as well as in shape. Different clusters can be separated in different subregions and there may be no subregion in which all clusters are separated. In each step of division, the DivClusFD method explores the functions and their derivatives at several fixed points, seeking the subregion in which the highest number of clusters can be separated.
The BIG CHASE: A decision support system for client acquisition applied to financial networks Liberatore F. and Quijano-Sánchez L. Decision Support Systems, Vol. 98, Pages 49-58 (2017) The paper presents a case study of a client acquisition decision support system for Banco Santander, S.L. In it, a reliability graph is built from client and transaction data provided by the bank. This graph models relationships based on a probability of traversal function that includes social measures. Then, an optimization procedure tailored to be efficient on very large sparse graphs with millions of nodes and edges identifies the most reliable sequence of clients that a manager should contact to reach a specific target.
What do we really need to compute the Tie Strength? An empirical study applied to Social Networks Liberatore F. and Quijano-Sánchez L. Computer Communications, Volume 110, Pages 59-74 (2017) The paper empirically presents the relative importance of different social variables for the computation of the tie strength and proposes a computational model independent of the Social Networks' domain. It includes the first dataset publicly available to explicitly include tie strength measures.
Distribution of genotype networks sizes in sequence-to-structure genotype-phenotype maps Manrubia S. and Cuesta J. A. Journal of the Royal Society Interface, Vol. 14, issue 129 (2017) By using very simple statistical arguments we explain the observed distributions of genotype network sizes (the number of genotypes that yield the same phenotype)
Equilibria, information and frustration, in heterogeneous network games with conflicting preferences Mazzoli, M. and Sánchez, A. Journal of Statistical Mechanics: Theory and Experiment, DOI: 10.1088/1742-5468/aa9347 (2017) This paper presents a simulation model to address the problem of people interacting on a network and having to choose between two options, when there is heterogeneity in the population. Thus, preferences are introduced by assigning to every individual a preference for one of the said options. The paper shows that the population then ends up in different situations depending on the type of network and the specific interaction. The model can be used to generate data about specific applications where this generic mechanism of identity is of relevance.
Detecting Steps Walking at very Low Speeds Combining Outlier Detection, Transition Matrices and Autoencoders from Acceleration Patterns Munoz-Organero, M. and Ruiz-Blaquez, R. Sensors, 17(10), 2274 (2017) This work develops and validates a new algorithm to detect steps while walking at very low speeds (between 30 and 40 steps per minute) based on data from a single triaxial accelerometer. The algorithm concatenates three consecutive phases. First, outlier detection based on the Mahalanobis distance is performed on the sensed data to find candidate points in the acceleration time series that may contain a segment of foot-ground contact. Second, the acceleration segments around the pre-detected outlier points are used to compute transition matrices in order to capture the temporal dependencies. Finally, autoencoders trained with data segments containing transition matrices of labeled steps are used to decide whether an outlier corresponds to a step at low speed.
Automatic detection of traffic lights, street crossings and urban roundabouts combining outlier detection and deep learning classification techniques based on GPS traces while driving Munoz-Organero, M., Ruiz-Blaquez, R. and Sánchez-Fernández, L. Computers, Environment and Urban Systems. DOI: 10.1016/j.compenvurbsys.2017.09.005 (2017) This article presents a novel mechanism for the automatic detection of urban infrastructure elements that influence driving, such as traffic lights, street crossings and roundabouts. In order to minimize system requirements and simplify data collection from many users with minimal impact on them, only GPS traces from a mobile device while driving are used. The acceleration and speed time series are derived from the GPS data. An outlier detection algorithm is first used to detect locations of abnormal driving (which may be due to infrastructure elements or particular traffic conditions). Using deep learning tools, the speed and acceleration patterns are then analyzed automatically in order to extract relevant features, which are then classified as a traffic light, street crossing, urban roundabout or other element.
Improving transportation networks: Effects of population structure and decision making policies Pablo-Martí, F. and Sánchez, A. Scientific Reports, 7, Article number: 4498 (2017) In this paper we introduce a method to analyze data from transportation networks in order to identify the criteria used to decide how they have been built. The method can also be used to optimize an existing network subject to different types of constraints reflecting strategic decisions.
The emergence of altruism as a social norm Pereda, M., Brañas-Garza, P., Rodríguez-Lara, I. and Sánchez, A. Scientific Reports 7, Article number: 9684 (2017) Experimental data shows very clearly that people are generous in so far as they give money to others when they are allowed to keep all of it without any punishment. In this work we introduce a simulation model that allows us to understand the experimental data in terms of human behavior arising from reinforcement learning. For the model to reproduce the data properly, we show that mistakes during the process must be taken into account, as the deterministic learning process does not fit the data quantitatively.
Make it personal: A social explanation system applied to group recommendations Quijano-Sanchez, L., Sauer, C., Recio-Garcia, J.A. and Diaz-Agudo, B. Expert Systems with Applications, Vol. 76, Pages 36-48 (2017) The paper proposes a Personalized Social Individual Explanation approach for group recommenders. Its goal is to study how to best explain proposed items to social groups performing joint activities and how to enhance users’ reactions towards a recommender system by recalling the groups’ affective bonds.
Next-Generation Big Data Analytics: State of the Art, Challenges, and Future Research Topics Zhihan Lv, Houbing Song, Basanta-Val, P., Steed, A. and Minho Jo IEEE Transactions on Industrial Informatics Vol. 13, Issue: 4 (Aug. 2017) The term big data occurs more frequently now than ever before. A large number of fields and subjects, ranging from everyday life to traditional research fields (i.e., geography and transportation, biology and chemistry, medicine and rehabilitation), involve big data problems. The popularizing of various types of network has diversified types, issues, and solutions for big data more than ever before. In this paper, we review recent research in data types, storage models, privacy, data security, analysis methods, and applications related to network big data. Finally, we summarize the challenges and development of big data to predict current and future trends.


Title Authors Journal Abstract
D-trace estimation of a precision matrix using adaptive Lasso penalties Avagyan, V., Alonso, A.M., and Nogales, F.J. Advances in Data Analysis and Classification DOI: (2016) The accurate estimation of a precision matrix plays a crucial role in the current age of high-dimensional data explosion. To deal with this problem, one of the prominent and commonly used techniques is the ℓ1 norm (Lasso) penalization for a given loss function. This approach guarantees the sparsity of the precision matrix estimate for properly selected penalty parameters. However, the ℓ1 norm penalization often fails to control the bias of the obtained estimator because of its overestimation behavior. In this paper, we introduce two adaptive extensions of the recently proposed ℓ1 norm penalized D-trace loss minimization method. They aim at reducing the produced bias in the estimator.
Stock Return Serial Dependence and Out-of-Sample Portfolio Performance. DeMiguel, A.V., Nogales, F.J. and Uppal, R. The Review of Financial Studies, Vol. 27, Issue 4, Pages 1031–1073 (2014). We study whether investors can exploit serial dependence in stock returns to improve out-of-sample portfolio performance. We show that a vector-autoregressive (VAR) model captures stock return serial dependence in a statistically significant manner.
Multiperiod portfolio optimization with multiple risky assets and general transaction costs Mei, X., De Miguel, V and Nogales, F.J. Journal of Banking & Finance Vol 69, pp 108-120, (2016) We analyze the optimal portfolio policy for a multiperiod mean–variance investor facing multiple risky assets in the presence of general transaction costs. For proportional transaction costs, we give a closed-form expression for a no-trade region, shaped as a multi-dimensional parallelogram, and show how the optimal portfolio policy can be efficiently computed for many risky assets by solving a single quadratic program. For market impact costs, we show that at each period it is optimal to trade to the boundary of a state-dependent rebalancing region. Finally, we show empirically that the losses associated with ignoring transaction costs and behaving myopically may be large.
Common Seasonality in Multivariate Time Series Nieto, F.H., Peña, D. and Saboyá, D. Statistica Sinica, 26, 1389-1410, 2016. Common factors for seasonal multivariate time series are usually obtained by first filtering the series to eliminate the seasonal component and then extracting the nonseasonal common factors.
Generalized Dynamic Principal Components Peña, D. and Yohai, V.J. The Journal of American Statistical Association, 111, 515, 1121-1131, 2016. Brillinger defined dynamic principal components (DPC) for time series based on a reconstruction criterion. He gave a very elegant theoretical solution and proposed an estimator which is consistent under stationarity. Here we propose a new entirely empirical approach to DPC.
Dependence patterns for modeling simultaneous events. Rodríguez, J., Lillo, R.E. and Ramírez Cobo, P. (2016). Reliability Engineering & System Safety, 154, 19-30. In this paper we examine in detail some of the modeling capabilities of the stationary m-state BMAP, with simultaneous events up to size k, noted BMAPm(k). Specifically, we study the forms of the auto-correlation functions of the inter-event times and event sizes.
Analyzing the Impact of Using Optional Activities in Self-Regulated Learning Ruipérez-Valiente, J.A., Muñoz-Merino, P.J., Delgado Kloos, C., Niemann, K., Schefeld, M. and Wolpers, M. IEEE Transactions on Learning Technologies, Volume: 9, Issue: 3, July-Sept. 1 2016 This paper analyzes the use of optional activities in an educational online environment in two case studies with a Self-Regulated Learning approach. We found that the level of use of optional activities was low. Optional activities which are not related to learning are used more. Students achieved the goals they set more than 50 percent of the time and voted on their peers' comments in a positive way. We also found that gender and the type of course can influence which optional activities are used.
Functional outlier detection by a local depth with application to NOx levels Sguera, C., Galeano, P. and Lillo, R.E. Stochastic Environmental Research and Risk Assessment, Volume 30, Issue 4, pp 1115–1130 (2016) This paper proposes methods to detect outliers in functional data sets. The task of identifying atypical curves is carried out using the recently proposed kernelized functional spatial depth (KFSD).


Title Authors Journal
Daily rhythms in mobile telephone communication Aledavood, T., López, E., Roberts, S., Reed-Tsochas, F., Moro, E., Dunbar, R. and Saramäki, J. PLoS ONE 10, e0138098 (2015)
Short-Range Mobility and the Evolution of Cooperation: An Experimental Study Antonioni, A., Tomassini, M. and Sánchez, A. Scientific Reports 5, 10282 (2015).
Time series segmentation procedures to detect, locate and estimate change-points Badagian, A.L., Kaiser, R. and Peña, D. In festschrift for Prof. Heiler, Empirical Economic and Financial Research – Theory, Methods and Practice, Beran, J., Feng, Y. and Hebbel, H. (eds.) Springer, Berlin. 2015.
Revealing patterns of local species richness along environmental gradients with a novel network tool. Baudena, M., Sánchez, A., Georg, C.P., Ruíz-Benito, P., Zavala, M.A., Rodríguez, M.A. and Rietkerk, M.G. Scientific Reports 5, 11561 (2015).
Reputation drives cooperative behaviour and network formation in human groups Cuesta, J.A., Gracia-Lázaro, C., Ferrer, A., Moreno, Y. and Sánchez, A. Scientific Reports 5, 7843 (2015).
Performance of Social Network Sensors During Hurricane Sandy. Kryvasheyeu, Y., Chen, H., Moro, E., Van Hentenryck, P. and Cebrian, M. PLoS ONE 10, 0117288 (2015)
Detection and evaluation of emotions in Massive Open Online Courses. Leony, D., Muñoz-Merino, P.J., Ruipérez-Valiente, J.A., Pardo, A., Arellano, D. and Delgado Kloos, C. Journal of Universal Computer Science, vol. 21, no. 5, pp. 638-655 (2015)
Social Media Fingerprints of Unemployment. Llorente, A., García-Herranz, M., Cebrián, M. and Moro, E. PLoS ONE 10, 0128692 (2015)
Precise effectiveness strategy for analyzing the effectiveness of students with educational resources and activities in MOOCs Muñoz-Merino, P.J., Ruipérez-Valiente, J.A., Alario-Hoyos, C., Pérez-Sanagustín, M. and Delgado Kloos, C. Computers in Human Behavior, vol. 47, pp. 108-118 (2015)
A Software Engineering Model for the Development of Adaptation Rules and its Application in a Hinting Adaptive E-learning System Muñoz-Merino, P.J., Delgado Kloos, C., Muñoz-Organero, M. and Pardo, A. Computer Science and Information Systems, vol. 12, no. 1, pp. 203-231 (2015).
Rethinking Statistics with Big Data: learning from George Box Peña, D. Quality Technology & Quantitative Management 12, 1, 2015.
ALAS-KA: A learning analytics extension for better understanding the learning process in the Khan Academy platform Ruipérez-Valiente, J.A., Muñoz-Merino, P.J., Leony, D. and Delgado Kloos, C. Computers in Human Behavior, vol. 47, pp. 139-148 (2015).
Theory must be informed by experiments (and back) - Comment on "Universal scaling for the dilemma strength in evolutionary games", by Z. Wang et al. Sánchez, A. Physics of Life Reviews 14, 52-53 (2015).
From seconds to months: an overview of multi-scale dynamics of mobile telephone calls Saramäki, J. and Moro, E. Eur. Phys. J. B 88, 164 (2015).