Alexandron, G., Ruipérez-Valiente, J.A., Chen, Z., Muñoz-Merino, P.J. and Pritchard, D.E.
Computers & Education, Vol 108, pp 96-114 (2017)
This paper presents a detailed study of a form of academic dishonesty that involves the use of multiple accounts for harvesting solutions in a Massive Open Online Course (MOOC). It is termed CAMEO – Copying Answers using Multiple Existence Online. The detection of CAMEO is done using educational data mining. The study has three main goals: determining the prevalence of CAMEO, studying its detailed characteristics, and inferring the motivation(s) for using it.
Journal of Computational and Graphical Statistics, Vol. 26, Issue 4, pp 865-872 (2017)
In this article, we focus on the estimation of a high-dimensional inverse covariance (i.e., precision) matrix. We propose a simple improvement of the graphical Lasso (glasso) framework that is able to attain better statistical performance without significantly increasing the computational cost. The proposed improvement is based on computing a root of the sample covariance matrix to reduce the spread of the associated eigenvalues. Through extensive numerical results, using both simulated and real datasets, we show that the proposed modification improves the glasso procedure. Our results reveal that the square-root improvement can be a reasonable choice in practice.
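The eigenvalue-spread argument can be illustrated numerically: for a symmetric positive definite sample covariance, the matrix square root shares its eigenvectors but has square-rooted eigenvalues, so its condition number is the square root of the original. A minimal sketch of that effect (variable names are illustrative; this is not the paper's full glasso pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sample covariance: 200 observations of 10 variables.
X = rng.standard_normal((200, 10))
S = np.cov(X, rowvar=False)

# Eigendecomposition of the symmetric positive definite sample covariance.
vals, vecs = np.linalg.eigh(S)
# Matrix square root: same eigenvectors, square-rooted eigenvalues.
S_root = vecs @ np.diag(np.sqrt(vals)) @ vecs.T

# The root shrinks the eigenvalue spread: cond(S_root) = sqrt(cond(S)).
print(np.linalg.cond(S), np.linalg.cond(S_root))
```

A better-conditioned input matrix is what makes the subsequent glasso-type estimation numerically and statistically easier.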
Barba, I., Miró-Casas, E., Torrecilla, J.L., Pladevall, E., Tejedor, S., Sebastián-Pérez, R., Ruiz-Meana, M., Berrendero, J.R., Cuevas, A. and García-Dorado, D.
Journal of Nutritional Biochemistry, Vol. 40, Pages 187-193 (2017)
In this work, we study the differences induced by sex and diet in the metabolic phenotype and mitochondrial function of mice and their relation to cardiac events. The methodology includes the use of variable selection techniques with nuclear magnetic resonance spectra in order to detect relevant metabolites and improve classification performance.
IEEE Transactions on Industrial Informatics, Vol. PP, Issue 99 (2017)
Current trends in industrial systems opt for the use of different big-data engines as a means to process huge amounts of data that cannot be processed with an ordinary infrastructure. The number of issues an industrial infrastructure has to face is large, and includes challenges such as the definition of efficient architecture setups for different applications and the definition of specific models for industrial analytics. In this context, the article explores the development of a medium-size big-data engine (i.e., implementation) able to improve performance in map-reduce computing by splitting the analytic task into different segments that may be processed by the engine in parallel using a hierarchical model.
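The segment-splitting idea can be sketched in a few lines: divide the input into segments, map each segment in parallel, and combine partial results pairwise in a tree. This is a toy illustration of the general pattern, not the engine described in the paper:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def hierarchical_map_reduce(data, map_fn, reduce_fn, n_segments=4):
    """Split `data` into segments, map-reduce each segment in parallel,
    then combine partial results pairwise in a tree (hierarchical) fashion."""
    size = max(1, len(data) // n_segments)
    segments = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda s: reduce(reduce_fn, map(map_fn, s)),
                                 segments))
    # Tree reduction: combine partial results level by level.
    while len(partials) > 1:
        partials = [reduce_fn(*pair) if len(pair) == 2 else pair[0]
                    for pair in (partials[i:i + 2]
                                 for i in range(0, len(partials), 2))]
    return partials[0]

# Example analytic: sum of squares over 0..99, split into 4 segments.
print(hierarchical_map_reduce(list(range(100)),
                              lambda x: x * x, lambda a, b: a + b))  # 328350
```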
Basanta-Val, P., Fernández-García, N., Sánchez-Fernández,L. and Arias-Fisteus, J.
IEEE Transactions on Parallel and Distributed Systems, Vol. 28, Issue: 11 (2017)
In recent years, big data systems have become an active area of research and development. Stream processing is one of the potential application scenarios of big data systems where the goal is to process a continuous, high velocity flow of information items. High frequency trading (HFT) in stock markets or trending topic detection in Twitter are some examples of stream processing applications. In some cases (like, for instance, in HFT), these applications have end-to-end quality-of-service requirements and may benefit from the usage of real-time techniques. Taking this into account, the present article analyzes, from the point of view of real-time systems, a set of patterns that can be used when implementing a stream processing application. For each pattern, we discuss its advantages and disadvantages, as well as its impact on application performance, measured as response time, maximum input frequency and changes in utilization demands due to the pattern.
Typical big-data infrastructure includes multiple machines with data accessed remotely, with request–response patterns, from different remote locations. Currently, most state-of-the-art remote invocation techniques are focused on models for distributed interactions and have not explored the advantages offered by parallel computing, such as those available on distributed stream processors. In this context, the article focuses on the definition of a predictable remote procedure call (RPC) able to take advantage of distributed stream processing technology.
Journal of the American Statistical Association, DOI: 10.1080/01621459.2017.1320287, (2017)
This paper provides: (a) Explicit expressions for the optimal (Bayes) rule in several classification problems of equivalent Gaussian processes. (b) An interpretation, in terms of mutual singularity, for the “near perfect classification” phenomenon described by Delaigle and Hall (2012) and an asymptotically optimal rule under singularity. (c) As an application, we propose a natural variable selection method and discuss the conditions for optimality. The approach relies on some classical results in the RKHS theory.
We present a new model for pricing electricity swaps. We posit that swap electricity prices result from at least three driving forces. First, a stochastic factor acting as an anchor of the level of the forward curve: the average "consensus" price for the contracts within a maturity slot (yearly, quarterly, and monthly). Second, an element reflecting deterministic trend-seasonal components, because we assume the market expects weather-related variations in demand. Third, a part accounting for (mean-reverting) stochastic deviations from the previous two factors. These deviations depend on time to maturity and length of the delivery period. By using a Multivariate Normal Inverse Gaussian (MNIG) distribution, our model embodies realistic probabilities of occurrence of extreme prices. Finally, we test the model using EEX data for the German market.
We devise a novel approach to combine predictions of high-dimensional conditional covariance matrices using economic criteria based on portfolio selection. The combination scheme takes into account not only the portfolio objective function but also the portfolio characteristics in order to define the mixing weights. Three important advantages are that (i) it does not require a proxy for the latent conditional covariance matrix, (ii) it does not require optimization of the combination weights, and (iii) it can be calibrated in order to adjust the influence of the best-performing models.
Congosto, M., Basanta-Val, P. and Sanchez-Fernandez, L.
Journal of Network and Computer Applications, Vol. 83, Pages 28-39 (2017)
This paper describes T-Hoarder: a framework that enables tweet crawling and data filtering, and that displays summarized, analytical information about Twitter activity with respect to a certain topic or event on a web page. T-Hoarder is capable of managing very large experiments, both in duration (more than one year) and size (millions of tweets).
Garcia Portugués, E., Sørensen M., Mardia, K.V. and Hamelryck, T.
Statistics and Computing, pp 1–22 (2017)
We introduce stochastic models for continuous-time evolution of angles and develop their estimation. We focus on studying Langevin diffusions with stationary distributions equal to well-known distributions from directional statistics, since such diffusions can be regarded as toroidal analogues of the Ornstein–Uhlenbeck process. We propose three approximate likelihoods that are computationally tractable and investigate the empirical performance of the approximate likelihoods. The software package sdetorus implements the estimation methods and applications presented in the paper.
Iranzo, J., Cuesta, J.A., Manrubia, S., Katsnelson, M.I. and Koonin, E.V.
Proceedings of the National Academy of Sciences (USA), Early Edition, vol. 114 no. 28 (2017)
We combine mathematical modeling of genome evolution with comparative analysis of prokaryotic genomes to estimate the relative contributions of selection and intrinsic loss bias to the evolution of different functional classes of genes and mobile genetic elements.
Advances in Data Analysis and Classification, DOI: 10.1007/s11634-017-0290-1 (2017)
This paper presents DivClusFD, a new divisive hierarchical method for the non-supervised classification of functional data. Data of this type present the peculiarity that the differences among clusters may be caused by changes in level as well as in shape. Different clusters may be separated in different subregions, and there may be no single subregion in which all clusters are separated. In each division step, the DivClusFD method explores the functions and their derivatives at several fixed points, seeking the subregion in which the highest number of clusters can be separated.
Decision Support Systems, Vol. 98, Pages 49-58 (2017)
The paper presents a case study of a client acquisition decision support system for Banco Santander. In it, a reliability graph is built from client and transaction data provided by the bank. This graph models relationships based on a probability-of-traversal function that includes social measures. Then, an optimization procedure, tailored to be efficient on very large sparse graphs with millions of nodes and edges, identifies the most reliable sequence of clients that a manager should contact to reach a specific target.
The paper empirically assesses the relative importance of different social variables in the computation of tie strength and proposes a computational model that is independent of the social network domain. It also includes the first publicly available dataset to explicitly include tie strength measures.
Journal of Statistical Mechanics: Theory and Experiment, DOI: 10.1088/1742-5468/aa9347 (2017)
This paper presents a simulation model to address the problem of people interacting on a network and having to choose between two options, when there is heterogeneity in the population. Thus, preferences are introduced by assigning to every individual a preference for one of the said options. The paper shows that the population then ends up in different situations depending on the type of network and the specific interaction. The model can be used to generate data about specific applications where this generic mechanism of identity is of relevance.
This work develops and validates a new algorithm to detect steps while walking at very low speed (between 30 and 40 steps per minute), based on data from a single triaxial accelerometer. The algorithm chains three consecutive phases. First, outlier detection based on the Mahalanobis distance is applied to the sensed data to find candidate points in the acceleration time series that may contain a segment of foot contact with the ground. Second, the acceleration segments around the pre-detected outlier points are used to compute transition matrices in order to capture temporal dependencies. Finally, autoencoders trained on data segments containing transition matrices of labeled steps are used to decide whether an outlier corresponds to a low-speed step.
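The first phase, Mahalanobis-distance outlier detection over windows of accelerometer samples, can be sketched as follows (window construction, feature names and the threshold are illustrative, not the paper's exact configuration):

```python
import numpy as np

def mahalanobis_outliers(features, threshold=3.0):
    """Flag rows of `features` whose Mahalanobis distance from the sample
    mean exceeds `threshold`."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    cov_inv = np.linalg.pinv(cov)               # pseudo-inverse for stability
    diff = features - mu
    # Squared Mahalanobis distance for every row at once.
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return np.sqrt(d2) > threshold

# Toy data: 100 three-axis acceleration summaries plus one injected spike.
rng = np.random.default_rng(1)
windows = rng.standard_normal((100, 3))
windows[42] += 10.0                             # simulated foot-contact event
print(np.where(mahalanobis_outliers(windows))[0])
```

Points flagged this way would then feed the transition-matrix and autoencoder stages described above.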
Munoz-Organero, M., Ruiz-Blaquez, R. and Sánchez-Fernández, L.
Computers Environment and Urban Systems. DOI: 10.1016/j.compenvurbsys.2017.09.005 (2017)
This article presents a novel mechanism for the automatic detection of urban infrastructure elements that influence driving, such as traffic lights, street crossings and roundabouts. In order to minimize system requirements and simplify the collection of data from many users with minimal impact on them, only GPS traces from a mobile device recorded while driving are used. Acceleration and speed time series are derived from the GPS data. An outlier detection algorithm is first applied to detect locations of abnormal driving (which may be due to infrastructure elements or particular traffic conditions). Using deep learning tools, the speed and acceleration patterns are then analyzed automatically to extract relevant features, which are classified as a traffic light, street crossing, urban roundabout or other element.
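The preprocessing step, deriving speed and acceleration series from raw GPS fixes, can be sketched with a haversine-based finite-difference scheme (function names and sampling assumptions are illustrative):

```python
import numpy as np

EARTH_R = 6371000.0  # mean Earth radius in metres

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between GPS fixes given in degrees."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = phi2 - phi1
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlmb / 2) ** 2
    return 2 * EARTH_R * np.arcsin(np.sqrt(a))

def speed_and_acceleration(lats, lons, ts):
    """Derive speed (m/s) and acceleration (m/s^2) series from a GPS trace
    sampled at timestamps `ts` (seconds), by finite differences."""
    d = haversine(lats[:-1], lons[:-1], lats[1:], lons[1:])
    dt = np.diff(ts)
    speed = d / dt
    accel = np.diff(speed) / dt[1:]
    return speed, accel

# Example: a 1 Hz trace moving 1e-4 degrees of latitude per second,
# i.e. roughly 11.1 m/s of constant speed and near-zero acceleration.
lats = 40.0 + 1e-4 * np.arange(5)
lons = np.full(5, -3.70)
speed, accel = speed_and_acceleration(lats, lons, np.arange(5.0))
print(speed.round(2), accel.round(6))
```

Series like these are what the outlier-detection and deep-learning stages described above would consume.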
In this paper we introduce a method to analyze data from transportation networks in order to identify the criteria used to decide how they have been built. The method can also be used to optimize an existing network subject to different types of constraints reflecting strategic decisions.
Pereda, M., Brañas-Garza,P., Rodríguez-Lara,I. and Sánchez, A.
Scientific Reports 7, Article number: 9684 (2017)
Experimental data show very clearly that people are generous: they give money to others even when they are allowed to keep all of it without any punishment. In this work we introduce a simulation model that allows us to understand the experimental data in terms of human behavior arising from reinforcement learning. For the model to reproduce the data properly, we show that mistakes during the process must be taken into account, as the deterministic learning process does not fit the data quantitatively.
Quijano-Sanchez, L., Sauer, C., Recio-Garcia, J.A. and Diaz-Agudo, B.
Expert Systems with Applications, Vol. 76, Pages 36-48 (2017)
The paper proposes a Personalized Social Individual Explanation approach for group recommenders. Its goal is to study how to best explain proposed items to social groups performing joint activities and how to enhance users’ reactions towards a recommender system by recalling the groups’ affective bonds.
The term big data occurs more frequently now than ever before. A large number of fields and subjects, ranging from everyday life to traditional research fields (e.g., geography and transportation, biology and chemistry, medicine and rehabilitation), involve big data problems. The popularization of various types of networks has diversified the types, issues, and solutions for big data more than ever before. In this paper, we review recent research in data types, storage models, privacy, data security, analysis methods, and applications related to network big data. Finally, we summarize the challenges and development of big data and discuss current and future trends.
Advances in Data Analysis and Classification, DOI:
The accurate estimation of a precision matrix plays a crucial role in the current age of high-dimensional data explosion. To deal with this problem, one of the prominent and commonly used techniques is ℓ1 norm (Lasso) penalization for a given loss function. This approach guarantees the sparsity of the precision matrix estimate for properly selected penalty parameters. However, the ℓ1 norm penalization often fails to control the bias of the obtained estimator because of its overestimation behavior. In this paper, we introduce two adaptive extensions of the recently proposed ℓ1 norm penalized D-trace loss minimization method. They aim at reducing the bias produced in the estimator.
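The adaptive idea, down-weighting the penalty on entries whose initial estimates are large, can be illustrated with adaptive-Lasso-style weights (a generic sketch, not the paper's exact D-trace formulation):

```python
import numpy as np

def adaptive_weights(omega_init, gamma=1.0, eps=1e-8):
    """Adaptive-Lasso-style penalty weights: entries with large initial
    estimates receive small penalties, so strong edges are shrunk less.
    `omega_init` is a pilot estimate of the precision matrix."""
    return 1.0 / (np.abs(omega_init) + eps) ** gamma

# A large initial entry (a likely true edge) gets a much smaller penalty
# than a near-zero one, which is how the adaptive step reduces bias.
omega_init = np.array([[1.0, 0.01],
                       [0.01, 1.0]])
print(adaptive_weights(omega_init))
```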
The Review of Financial Studies, Vol. 27, Issue 4, Pages 1031–1073 (2014)
We study whether investors can exploit serial dependence in stock returns to improve out-of-sample portfolio performance. We show that a vector-autoregressive (VAR) model captures stock return serial dependence in a statistically significant manner.
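A VAR(1) model of the kind the abstract refers to can be fitted by least squares in a few lines (a generic sketch; names and the one-lag choice are illustrative, not the paper's exact specification):

```python
import numpy as np

def fit_var1(returns):
    """Least-squares fit of a VAR(1) model r_t = c + A r_{t-1} + e_t,
    where `returns` is a (T, N) array of asset returns."""
    Y = returns[1:]                                       # r_t
    X = np.column_stack([np.ones(len(Y)), returns[:-1]])  # [1, r_{t-1}]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    c, A = coef[0], coef[1:].T
    return c, A

def predict_next(c, A, last_return):
    """One-step-ahead conditional mean, usable for portfolio timing."""
    return c + A @ last_return

# Example: one-step forecast from the last observed return vector.
rng = np.random.default_rng(0)
r = 0.01 * rng.standard_normal((500, 3))   # toy i.i.d. "returns"
c, A = fit_var1(r)
print(predict_next(c, A, r[-1]).shape)     # (3,)
```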
Journal of Banking & Finance, Vol 69, pp 108-120 (2016)
We analyze the optimal portfolio policy for a multiperiod mean–variance investor facing multiple risky assets in the presence of general transaction costs. For proportional transaction costs, we give a closed-form expression
for a no-trade region, shaped as a multi-dimensional parallelogram, and show how the optimal portfolio policy
can be efficiently computed for many risky assets by solving a single quadratic program. For market impact costs, we show that at each period it is optimal to trade to the boundary of a state-dependent rebalancing region. Finally, we show empirically that the losses associated with ignoring transaction costs and behaving myopically may be large.
Journal of the American Statistical Association, Vol. 111, Issue 515, pp 1121-1131 (2016)
Brillinger defined dynamic principal components (DPC) for time series based on a reconstruction criterion. He gave a very elegant theoretical solution and proposed an estimator which is consistent under stationarity. Here we propose a new, entirely empirical approach to DPC.
Rodríguez, J., Lillo, R.E. and Ramírez Cobo, P. (2016).
Reliability Engineering & System Safety, 154, 19-30.
In this paper we examine in detail some of the modeling capabilities of the stationary m-state batch Markovian arrival process (BMAP) with simultaneous events of size up to k, denoted BMAP_m(k). Specifically, we study the forms of the auto-correlation functions of the inter-event times and event sizes.
This paper analyzes the use of optional activities in an educational online environment in two case studies with a Self-Regulated Learning approach. We found that the level of use of optional activities was low, and that optional activities not related to learning were used more. Students finished the goals they set more than 50 percent of the time and voted their peers' comments in a positive way. We also found that gender and the type of course can influence which optional activities are used.
Stochastic Environmental Research and Risk Assessment, Volume 30, Issue 4, pp 1115–1130 (2016)
This paper proposes methods to detect outliers in functional data sets. The task of identifying atypical curves is carried out using the recently proposed kernelized functional spatial depth (KFSD).