Monday, February 9, 2009

Progress report

Last week I finished the introductory part of the literature review, covering the time-series definition, history, and the development of classical analyses. FYI: the first known time-series plot dates back to the 10th century.

Currently I am reviewing literature on the metrics used in similarity search, specifically the Euclidean and edit distances and their applications in the DTW and LCS algorithms. While this may sound somewhat trivial, the time-series transformation rules defined in the research literature over the years, such as moving-average smoothing and local and global scaling and shifting, make life a little harder. The further development toward piecewise matching of time series without preserving the temporal ordering moves things to the next level of complexity, though the concept seems valid for software development, which consists of episodes.
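Since the review centers on DTW with a point-wise (Euclidean-style) local cost, here is a minimal sketch of the classic dynamic-programming formulation. The class and method names are my own for illustration, not Hackystat or Trajectory code:

```java
// A minimal DTW sketch (illustrative class, not Hackystat/Trajectory code).
public class Dtw {

    /** Dynamic-programming DTW cost with an absolute-difference local cost. */
    public static double distance(double[] a, double[] b) {
        int n = a.length, m = b.length;
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) {
            java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
        }
        d[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double cost = Math.abs(a[i - 1] - b[j - 1]); // local point-wise cost
                // best of match (diagonal), insertion, and deletion steps
                d[i][j] = cost + Math.min(d[i - 1][j - 1], Math.min(d[i - 1][j], d[i][j - 1]));
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        double[] s = {0, 1, 2, 3, 2, 1};
        double[] t = {0, 0, 1, 2, 3, 2, 1}; // same shape, locally stretched in time
        System.out.println(Dtw.distance(s, t)); // prints 0.0: DTW absorbs the warp
    }
}
```

Because the recurrence may take vertical or horizontal steps, two series tracing the same shape at slightly different speeds align at (near) zero cost, which is exactly the property the similarity-search literature exploits.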

I'm drafting the review flow by walking through the approaches to time-series similarity measurement, arranging the relevant research chronologically within sections. I think it should work.
I'm getting ready to move further in the review, to time-series indexing and clustering methods.

Tuesday, February 3, 2009

Writing up a literature review.

Current progress.
I've spent a week working on the literature review. I finished the introductory part and skimmed through a couple of classical time-series analysis books, refreshing my knowledge of the well-established methods of time-series analysis and forecasting, such as the autoregressive ARMA/ARIMA models and the forecasting, lag analysis, and spectrum analysis of time series built on them.
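As a reminder of how little machinery the simplest of these models needs, here is a hedged sketch of a one-step AR(1) forecast, with the coefficient estimated from the lag-1 autocorrelation (the Yule-Walker estimate for AR(1)). Class and method names are mine, for illustration only:

```java
// A one-step AR(1) forecast sketch (illustrative names, not Hackystat code).
public class Ar1Forecast {

    /**
     * Estimates the AR(1) coefficient phi as the lag-1 autocorrelation
     * (the Yule-Walker estimate for AR(1)) and returns the one-step-ahead
     * forecast: mean + phi * (last - mean).
     */
    public static double forecastNext(double[] x) {
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;
        double num = 0.0, den = 0.0;
        for (int t = 0; t < x.length; t++) {
            double d = x[t] - mean;
            den += d * d;                            // lag-0 sum of squares
            if (t > 0) num += d * (x[t - 1] - mean); // lag-1 cross products
        }
        double phi = num / den;
        return mean + phi * (x[x.length - 1] - mean);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 1, 2, 1, 2}; // alternating series, so phi < 0
        System.out.println(forecastNext(x)); // ~1.083 (13/12): pulled back below the mean
    }
}
```

A real ARMA/ARIMA fit would of course estimate more parameters jointly (this is part of the extra computation mentioned above); the sketch only shows the flavor of the approach.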

I think that while all these classical time-series analyses could potentially be used in the Trajectory application for finding similarity between time series, this direction would require far more computation (decomposing the time series) and a whole lot of preliminary theoretical work researching and validating specific models suitable for software development, which seems really unnecessary and, moreover, hardly feasible.

Interestingly, while reading the books and walking through the practical examples from econometrics, I found that interrupted time series and lag analysis could be a useful addition to the Hackystat analyses family. Specifically, it would be valuable to implement analysis modules of this kind to see whether or not certain development or managerial events (like (i) stopping regular development activity and switching to boosting test coverage; (ii) adding or removing a developer on the team; or (iii) switching the development approach to TDD) really impact the development trends, and how.
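To make the lag-analysis idea concrete, here is a small sketch of a sample cross-correlation at a given lag between two metric streams. Applied to, say, an activity metric around one of the events above and a later telemetry stream, a value near +/-1 at some lag k would hint that the event's effect shows up about k intervals later. The names are illustrative, not existing Hackystat APIs:

```java
// A small lag-analysis sketch (illustrative names, not an existing Hackystat module).
public class LagAnalysis {

    /**
     * Sample cross-correlation between x[t] and y[t + lag]. A value near +/-1
     * at some lag suggests that changes in x are echoed in y `lag` steps later.
     */
    public static double crossCorrelation(double[] x, double[] y, int lag) {
        int n = x.length - lag;            // number of overlapping points
        double mx = 0.0, my = 0.0;
        for (int t = 0; t < n; t++) { mx += x[t]; my += y[t + lag]; }
        mx /= n;
        my /= n;
        double num = 0.0, sx = 0.0, sy = 0.0;
        for (int t = 0; t < n; t++) {
            double dx = x[t] - mx, dy = y[t + lag] - my;
            num += dx * dy;
            sx += dx * dx;
            sy += dy * dy;
        }
        return num / Math.sqrt(sx * sy);
    }

    public static void main(String[] args) {
        double[] event  = {1, 2, 3, 4};    // e.g. a rising activity metric
        double[] effect = {0, 1, 2, 3, 4}; // same trend, shifted one step later
        System.out.println(crossCorrelation(event, effect, 1)); // prints 1.0
    }
}
```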

Plans.
This week I will be reviewing the time-series similarity measures and all kinds of applications based on this approach.

I plan to start researching time-series database indexing after the similarity measures.

I hope that once these three parts are done I will have covered pretty much all of the stuff I need to finish the literature review and sketch my path toward the dissertation proposal. Can't wait to get to this point.

Note on the development
While writing up the review I've started designing the time-series analysis sub-module and the DTW sub-module, and I'm thinking about the database (SensorBase) extension for indexing.

Tuesday, January 27, 2009

User-account environment variables under Vista

I had some problems running the TeXLive binaries under Vista, caused by a conflict between the Cygwin TeX installation and TeXLive. The TeXLive installer adds the path to its binaries as a user-specific environment variable, which is then appended to the very end of the PATH variable; consequently, the TeXLive binaries are never reached. The way to fix this issue is to reshuffle the variables so that the TeXLive binaries are picked up first (the user environment settings are under Control Panel → System → Advanced system settings → Environment Variables).

Monday, January 26, 2009

Literature review plans

Finally I was able (almost) to install TeXLive on my laptop. The installer never finished, so I don't really know if everything works (the 00-00 example compiles, though). I'm running Windows Vista, and it looks like the TeXLive installer has permission issues under a non-administrative account; there also seems to be a bug where the installer tries to use the system Perl instead of the one shipped with it (which causes some library and runtime issues). It took me some time to figure out both issues, and I don't really see the advantages of TeXLive versus MiKTeX at this point. One more disadvantage I see is the lack of a DVI viewer in the TeXLive distro; it looks like I'd need to install a viewer, but I'm not sure I will need one, so we'll see.

Among other things, I've set up all the Java stuff, updated the libraries and Hackystat, checked out all the latest sources, etc., and backed up the whole thing, just in case.
So, the system is ready to go.

Most of the time I spent putting together the outline for the literature review. I see the purpose of this writing as a comprehensive walkthrough of the field of time-series analysis, outlining the milestones and major discoveries and connecting them with my research. I found that I had totally missed some major things in time-series analysis (funny, huh?) and am filling these gaps by reading and collecting the literature.

Following is the draft plan. I'm working on the third part, and since it is based on the material from part 2, I am changing its flow too.

Literature review plan


  1. Introduction. (definitions, research field boundaries and common applications)



    1. Introduction to time series.

      1. Data sources, time-series representation and common applications

        (the time series “origin”, common representation and mainstream applications)

      2. Streaming time-series.

        Time series as streams.

      3. Time-series databases and indexing

        (examples of existing time-series collections (+ the Hackystat SensorBase) and the common time-series database toolkit for time-series data storage, search, and retrieval)



    2. Classical time series analyses.

      1. General exploration & description

        (time series descriptive exploration and common tools used: spectral analysis, autocorrelation, trends, periodicity (+ Hackystat Telemetry, + Hackystat Zorro?, + Hackystat Trajectory))

      2. Prediction and forecasting

        (stochastic modeling: AR, MA, ARMA, ARIMA and uses (+ Hackystat Trajectory))


    3. Time series similarity (homogeneity) based analyses.

      1. The speech and handwriting recognition.

        (pioneering the area of DTW, LCS and HMM)

      2. Sign language, motion and gesture recognition.

        (ongoing research)

      3. Trajectory patterns recognition, surveillance applications, shape recognition.

        (modern applications)






  2. Time series similarity-based analyses and algorithms

    (known research tools, implemented applications, and up-to-date research directions)



    1. Similarity metrics

      1. Euclidean distance.

        (application and problem of normalization)

      2. Hamming and Edit distances.

        (the formal introduction of edit distance, time-series transformations)



    2. Similarity-finding algorithms

      1. DTW

      2. LCS



    3. Methods (whole- and sub-sequence applications)

      1. Clustering

      2. Indexing

      3. Classification

      4. Anomaly detection



    4. Known state-of-the-art applications.




  3. Possible application of the algorithms and methods to the Hackystat Telemetry Streams



    1. Similarity search in the Sensorbase

      (the search for similarity using the raw telemetry data stored within the sensorbase)

    2. Telemetry Streams data Indexing

      (defining the Telemetry patterns, indexing raw telemetry data using the definitions, and conducting search by means of indices and the edit distance)

    3. Live Telemetry Stream analysis and features

      (patterns, anomaly detection)



Tuesday, January 20, 2009

First post in 2009

The TechReport for the 699 course. The importance of it. :)
I finished the Fall 2008 semester by writing a technical report on the DTW algorithm, its existing implementations, uses, and extensions, and outlined a possible application to software metrics based on my own implementation. While I was a little worried about this before starting, I found the report writing to be an extremely helpful and interesting activity, for a number of reasons:

  • first of all, it forced me to summarize and distill the essence of the work done so far and get it graded. By doing so I not only reassessed my current position in the research but, what I found extremely useful, was able to identify gaps (weak points) in the research I am doing. The goal now is to make my research complete by evenly covering all the areas of interest and connecting it with adjacent fields.

  • secondly, I found that in my case writing the "tech report for the 699" is actually more like writing a draft (or an outline) of the Literature Review and the Thesis Proposal: two pieces of writing which are required for the PhD degree. How cool is that?

  • in addition to these two items, reviewing the research done so far, seeing things in the ToDo list, and having outlined the LitReview and Proposal clears my mind and brings confidence that I've chosen the right track.



Current progress.
This week I am working on setting up the "working environment" for this semester. Taking into account the amount of software development and writing ahead, spending some time selecting technologies, cleaning and organizing the system environment and hard drive, updating tools, etc. seems to be a reasonable activity.

As in all previous work, I am going to use Java, Eclipse, the Hackystat infrastructure, and the standard set of Hackystat libraries for the core programming. Most likely I'll be using MiG Layout for the UI development of the stand-alone Trajectory tool. R will be used for making figures and for fast scripting when I need to test something before actually implementing it.

For my latest report and all previous LaTeX-based documents I successfully used a combination of MiKTeX and TeXnicCenter, but the recent changes in the CSDL requirements are moving me toward TeXLive, and currently I'm setting up the tools and environment to test this new-to-me approach.

Monday, December 8, 2008

The 699 report continued.

My work last week was focused on writing the 699 report for Fall '08. The report essentially consists of the following parts:

  • Introduction

  • DTW Algorithm

  • DTW Customization

  • Weighting

  • Global path constraints

  • DTW Optimization

  • Query by the sample

  • Software metrics application

  • Telemetry database indexing

  • Future work



So far I'm done with the first two sections and working on the third one. The progress is a bit slow, mainly because I was working on the last home assignments for my classes, Analysis of Algorithms and Engineering Compiler. The finals are scheduled for Dec 18 and Dec 19; I hope to finish most of the report work by then.

The report is hosted by Google and the permanent link is PDF

Tuesday, December 2, 2008

The 699 report

Last week I tried to evaluate my progress and started the 699 Fall '08 report using TeX. Most of the time was spent collecting and reading articles; afterwards I organized them in CiteULike. (I wonder if there is any way to share my library with another person?) Right now I have around three dozen DTW-related papers describing the basics of the algorithm, its customization and optimization, along with various applications: data mining (time-series database search), time-series clustering, online signature matching, computer vision, computer animation, surveillance, protein sequence alignment, chemical engineering, music, and signal processing. After the reading, I truly believe the approach would work when applied to software metrics.

Find current report version here.