Vassilis Athitsos               VLM Lab

Subsequence Matching in Large Databases of Time Series and Strings

This project has been funded by NSF grants IIS-0812601 (to UTA) and IIS-0812309 (to Boston University). The project was conducted in collaboration with Professor Gautam Das at UTA, and Professor George Kollios at Boston University. The project investigated methods for efficient subsequence matching in large databases of sequences (time series and strings).

Problem Definition and Background

In subsequence matching, given a specific sequence as input, we want to identify the best matching subsequences of (possibly long) sequences stored in a database. The sequences can be sequences of letters (for example, text, or DNA/protein sequences), sequences of numbers (such as closing stock market prices for each day of a year or a decade), sequences of images (movies), or sequences of notes (melodies).

Since sequences can represent such diverse data, the subsequence matching problem is relevant for several interesting applications. An example is sign spotting. Here, the input sequence is a video of a specific sign from some sign language, e.g., American Sign Language (ASL). The database contains sign language videos. The goal is to identify spots in the database videos where the input sign occurs.

Another example application is query-by-humming. Imagine having a melody in your mind, that you can't remember (or never knew in the first place) what song it is from. In a query-by-humming system, the user gives that melody as input to the system, by simply humming that melody on the computer's microphone. The system contains a large database of melodies, and it identifies the database melodies that contain parts similar to what the user hummed.

In this project, our goal was to different design methods, sometimes more generally applicable, sometimes applicable to more specific types of data, that can improve the state-of-the-art in subsequence matching. Improving the state-of-the-art means improving accuracy and efficiency in such systems, so that the system identifies more often the correct results, and the system produces those results to the user faster.

Key Findings

Software and Datasets

Software partially or fully developed under the support of the NSF Awards IIS-0812601 and IIS-0812309:

Selected Publications

Any opinions, findings, and conclusions or recommendations expressed here are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.