Pavel's Software Engineering Log: February 2009

Friday, February 27, 2009

PAA and SAX R implementations

Just want to post the figure I've got in R while designing test cases for the Java implementation of the PAA and SAX algorithms. The figure shows original trajectories, their normalized versions, PAA reduction from 14 to 10 points and finally the result of SAX transform into string.

Updated! Linky to the R code

Updated test results:

And the Java-based library I am working on

The lower bounding distance example:

Monday, February 23, 2009

SAX and iSAX

While I was following experiments from the previous week with similar setup trying to see the performance of DTW on the simpler trajectories I was continuing reviewing literature and hit the Symbolic Aggregate approXimation paper and SAX algorithm. What a find! Next were iSAX and MK. This algorithm for finding motifs (patterns) in time-series looks very good and after studying all this material I've decided to contact authors for Java or R code to try it on the telemetry data, - they replied quick but unfortunately they had no Java/R code in hands, only Matlab/C# BUT they were willing to help with Java code development. Currently I am setting up the project and going to code timeseries motif finding module for the Hackystat.

Tuesday, February 17, 2009

Euclidean and DTW metrics based sequence (subsequence) matching

Just did a small experiment with R using Euclidean and DTW metrics.
So, I had two synthetic timeseries in hands - one was a kind of devtime telemetry trend which I was using as the reference and the second one just a "peak pattern" The purpose was to see if this "peak pattern" will match anything on the reference timeseries and if it will, than what exactly.

So, figure 1 represents both timeseries and their normalized versions

Figure 2 shows the same normalized timeseries at the top, found matches for "peak pattern" when using Euclidean distance at the middle, and Euclidean distances at the bottom of the figure (red bars correspond to the set of the minimal found distances (7 in total)).

Figure 3 is very similar to the number 2, but the DTW distance is used.

Figure 4 mimics #3 but threshold is lowered, so more matches shown.

The R code is here. Will follow with more details in the next post.

Monday, February 16, 2009

Working on the DTW implementation and Tests while planning the subsequence search implementation.

Last week I was working on the DTW package for the Trajectory project. I've spent about 12 hours hacking the Trajectory DTW java package itself: now it performs DTW alignment using most of the step patterns (it was quite interesting to learn about IEEE standards for NaN (not a number)) about the same amount of hours I've spend working in R making test templates and verifying Java methods; hope all the R plots I've made will contribute to litreview/thesis.

I did some research and collected available time-series normalization techniques and typed all in TeX for the literature review and thesis.

Right now I am working on implementing DTW Constraints in Java and on the example of the time-series similarity search. And... it seems like it does work in R for now, I'll post the results tomorrow, today it is too late to process all the plots and clean-up the code.

Monday, February 9, 2009

Progress report

Last week I've finished the introductory part of the literature review concerned the time-series definition, history and classical analyses development. FYI: the first time-series plot is dated by the 10th century.

Currently I am reviewing literature concerning metrics used in the similarity search, specifically: Euclidian and Edit distances and their applications in DTW and LCS algorithms. While it seems sound somehow trivial, the time-series transformation rules defined in the research literature through years such as a moving averages smoothing, local and global scales and shifts are making life a little bit harder. The further development through the piecewise matching of the time-series without saving the temporal ordering moves it to the next level of complexity while the concept seems to be being valid for the software development which consists of episodes.

I'm drafting the review flow by walking through the time-series similarity measurements approaches while arranging the relevant research within sections chronologically. Think it should work.
Getting ready to move further in the review to the indexing/clustering of time-series methods.

Tuesday, February 3, 2009

Writing up a literature review.

Current progress.
I've spent a week working on the literature review. I finished the introductory part and skimmed through couple of classical time-series analysis books refreshing my knowledge of the well established methods of the time-series analysis and forecasting such as autoregressive ARMA/ARIMA models and based on these forecasting, lag analysis and spectrum analysis of the time series.

I think that while all these classical time-series analysis could be potentially used in the trajectory application for finding similarity between time series, it seems to be that this direction would require far more computations to be done (decomposing the time series) and would require a whole bunch of preliminary theoretical work researching specific models suitable for the software development and proving them, which seems to be really unnecessary and moreover impossible.

Interesting that while reading books and walking through the given practical examples from econometrics, I found that interrupted time series and lag analysis could an useful addition to the Hackystat analyzes family. Specifically it would be valuable to implement some kind of such analysis modules to see whether or not some development or managerial events (like (i) stopping regular development activity and switching to test coverage boosting; or (ii) adding/removing a developer to the team; (iii) switching development approach to TDD etc) really impact the development trends and how.

Plans.
This week I will be reviewing the time-series similarity measures and all kinds of applications based on this approach.

It is in my plans to start research on the time-series database indexing after the similarity measurement.

I hope that once these three parts will be done I will cover pretty much all of the stuff that I need to finish the literature review and sketch my path with dissertation proposal. Can't wait to get to this point.

Note on the development
While writing up the review I've started the design of the time-series analysis sub-module, DTW sub-module and thinking on the database (sensorbase) extension for the indexing.

Pavel's Software Engineering Log