Unsupervised learning methods for sequences: Theory and Applications

Unsupervised methods aim at extracting knowledge from an unlabeled dataset. Specifically, clustering is a widespread technique whose goal is to separate a dataset into a certain number of disjoint groups or clusters. Data within a given cluster should be very similar and, at the same time, different from data in other clusters. With such a coarse description, it is clear that clustering is a highly subjective task. That subjectivity comes from the notion of similarity, which is highly task-dependent.

In this project we focus on developing similarity measures for a kind of data whose importance is growing every day: sequences. Examples of sequences include stock market evolution, speech, EEG and video, among others. Working with sequences poses additional difficulties compared with the standard case of individual vectors in a Euclidean space. For example, different sequences can have different lengths, so standard similarity measures such as the Euclidean inner product cannot be applied directly. We have developed similarity measures for sequences that take into account the dynamic evolution of each sequence, in order to group together sequences that behave similarly over time. These methods have been shown to be highly competitive and have applications such as speaker clustering and music genre recognition.
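To make the length problem concrete, the sketch below compares sequences of different lengths with dynamic time warping (DTW). DTW is a generic, well-known alignment technique used here purely as an illustration; it is not the similarity measure developed in this project. The function name `dtw_distance` and the toy sequences are assumptions made for the example.

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    Illustrative only: a generic alignment-based measure, not the
    project's own similarity measure.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # cost[i, j] = best cumulative cost aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(
                cost[i - 1, j],      # advance in a only
                cost[i, j - 1],      # advance in b only
                cost[i - 1, j - 1],  # advance in both
            )
    return cost[n, m]


# Hypothetical toy data: the second sequence is the first one played
# at half speed, the third follows a different trajectory.
short = [0.0, 1.0, 2.0, 1.0, 0.0]
stretched = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]
different = [2.0, 1.0, 0.0, 1.0, 2.0]

print(dtw_distance(short, stretched))
print(dtw_distance(short, different))
```

A plain Euclidean inner product is undefined for `short` and `stretched` because they have 5 and 10 samples respectively, whereas an alignment-based measure can still recognise that both follow the same trajectory.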

Special emphasis has been put on controlling the computational complexity, with the aim of developing methods that scale well to large datasets. These methods have also been extended to allow for sequence segmentation, which consists of finding adjacent groups of samples within a sequence that exhibit high similarity. Recent work aims at defining new similarity measures for bags-of-vectors, that is to say, sets of vectors drawn i.i.d. from a certain distribution. This is closely connected to the standard problem of defining distance measures or divergences between probability distributions.
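The link between bags-of-vectors and divergences between distributions can be illustrated with a minimal sketch: fit a Gaussian to each bag and compare the two fits with the symmetrised Kullback-Leibler divergence. This is a textbook construction chosen for illustration, not the measure under development in this project; the function names, the `ridge` regularisation term and the synthetic bags are all assumptions.

```python
import numpy as np


def fit_gaussian(bag, ridge=1e-6):
    """Estimate mean and covariance of a bag (rows are vectors).

    `ridge` is an assumed small diagonal term that keeps the
    covariance invertible.
    """
    bag = np.asarray(bag, dtype=float)
    mean = bag.mean(axis=0)
    cov = np.cov(bag, rowvar=False) + ridge * np.eye(bag.shape[1])
    return mean, cov


def kl_gaussian(mean_p, cov_p, mean_q, cov_q):
    """Closed-form KL(p || q) between two multivariate Gaussians."""
    d = mean_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mean_q - mean_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (
        np.trace(cov_q_inv @ cov_p)
        + diff @ cov_q_inv @ diff
        - d
        + logdet_q
        - logdet_p
    )


def symmetric_kl(bag_a, bag_b):
    """Symmetrised KL divergence between Gaussian fits of two bags."""
    p = fit_gaussian(bag_a)
    q = fit_gaussian(bag_b)
    return kl_gaussian(*p, *q) + kl_gaussian(*q, *p)


# Hypothetical bags of different sizes: a and b share a distribution,
# c is drawn from a shifted one.
rng = np.random.default_rng(0)
bag_a = rng.normal(0.0, 1.0, size=(200, 2))
bag_b = rng.normal(0.0, 1.0, size=(150, 2))
bag_c = rng.normal(5.0, 1.0, size=(180, 2))

print(symmetric_kl(bag_a, bag_b))
print(symmetric_kl(bag_a, bag_c))
```

Because each bag is summarised by a distribution rather than by its individual vectors, the comparison does not depend on the bags having the same number of elements. A single Gaussian is a strong modelling assumption and would be a poor fit for multimodal bags.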

Partners

  • GPM, Dept. of Signal Theory & Communications, Universidad Carlos III de Madrid
  • G2PI, Dept. of Signal Theory & Communications, Universidad Carlos III de Madrid
  • Department Informatik, Universität Hamburg

Related Publications

  • D. García-García, E. Parrado-Hernández, F. Díaz-de-María, State-Space Dynamics Distance for Clustering Sequential Data, (under review)
  • D. García-García, E. Parrado-Hernández, J. Arenas-García, F. Díaz-de-María, Music Genre Classification using the Temporal Structure of Songs, IEEE Workshop on Machine Learning for Signal Processing (MLSP 2010), Kittilä, Finland, Aug. 2010.
  • D. García-García, E. Parrado-Hernández, F. Díaz-de-María, Sequence Segmentation via Clustering of Subsequences, IEEE International Conference on Machine Learning and Applications (ICMLA 09), Miami Beach, Dec. 2009.
  • D. García-García, E. Parrado-Hernández, F. Díaz-de-María, Model-Based Clustering and Segmentation of Sequences, NIPS'09 Workshop on Temporal Segmentation, Whistler, Dec. 2009.
  • D. García-García, E. Parrado-Hernández, F. Díaz-de-María, A New Distance Measure for Model-Based Sequence Clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 7, pp. 1325-1331, July 2009.