Pavel's Software Engineering Log: March 2009

Monday, March 30, 2009

Current build of the Hackystat Trajectory browser.

The current build of the Trajectory browser represents missing values in motifs with underscores and lets you select among several SAX data models (build, universal and normal in the screenshot):

Sunday, March 29, 2009

SQL Schema evolution

Working on different cut schemes for SAX prompted me to add a new table to the TrajectoryDB; here it is:

Universal cuts for telemetry

Just coded a little function that takes an array of data and derives the desired number of cuts. It is based purely on the data, not on any distribution parameters: it places the cuts so that every bin holds the same number of data points. This is how the cuts look for a 5-letter alphabet:

Of the 21501 data values, about 4200 ended up in each bin.
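
Here is a minimal Python sketch of such equal-frequency cuts. It is not the actual function from the Trajectory code; the names `equal_frequency_cuts` and `bin_counts` and the sample data are made up for illustration. It sorts the data and reads the cut values off at equal rank steps, so no distribution parameters are involved:

```python
from bisect import bisect_right


def equal_frequency_cuts(values, alphabet_size):
    """Return alphabet_size - 1 cut points that split the data into
    bins holding (roughly) the same number of points."""
    ordered = sorted(values)
    n = len(ordered)
    # The k-th cut is the value sitting at rank k * n / alphabet_size.
    return [ordered[(k * n) // alphabet_size] for k in range(1, alphabet_size)]


def bin_counts(values, cuts):
    """Count how many values fall into each bin defined by the cuts."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect_right(cuts, v)] += 1
    return counts


if __name__ == "__main__":
    data = [x * x for x in range(1000)]   # made-up, strongly skewed sample
    cuts = equal_frequency_cuts(data, 5)  # 5-letter alphabet -> 4 cuts
    print(cuts)
    print(bin_counts(data, cuts))
```

When the data contains many repeated values, the bins can only be approximately equal, because identical values always land in the same bin.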

Saturday, March 28, 2009

Picture of the data values distribution in 9 telemetry streams

For each of 9 streams I have in my local DB I've plotted three plots for the raw data (left three) and three plots for the normal data (right three).

I do think it is really helpful to look over these plots in order to design SAX cut intervals specific to Hackystat-type data.

And this is the illustration of the problem I'm working on:

As you can see, the motif finder treated the values 1, 2 and 29(!) as the same.

Thursday, March 26, 2009

Hackystat telemetry data distribution

I've pulled all of the data available in my local database and drawn normal Q-Q plots for 9 types of telemetry to get an idea of the underlying distribution. I have no answer right now... it's not normal, though.
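
For reference, the points behind a normal Q-Q plot can be computed as in the sketch below. This is a plain-Python illustration, not the script that produced the plots; the function name `normal_qq_points` and the sample values are made up. If the data were normal, the two columns would roughly agree; a right-skewed sample shows observed values well above the expected ones in the upper tail:

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(values):
    """Pair each z-normalized, sorted observation with the standard
    normal quantile expected at the same rank."""
    n = len(values)
    mu, sigma = mean(values), stdev(values)
    observed = sorted((v - mu) / sigma for v in values)
    std_normal = NormalDist()
    # (i + 0.5) / n keeps the probabilities strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))


if __name__ == "__main__":
    sample = [1, 1, 1, 2, 2, 3, 5, 8, 13, 40]  # made-up, right-skewed values
    for expected, seen in normal_qq_points(sample):
        print(f"{expected:6.2f} {seen:6.2f}")
```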

Wednesday, March 25, 2009

Latest improvements in the index browser tool.

Latest tool improvements:

- the user can browse all built indexes;
- the regular values chart is displayed along with the normal chart; I will probably change the layout to show both together. *** I have used the value -1.0 for N/A data, which is why it's so messed up. Need to solve this issue ASAP. ***
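
To see why a -1.0 placeholder hurts, compare it with keeping N/A as a real gap. The sketch below is only an illustration (the function name `z_normalize` and the sample values are made up, and it assumes at least two distinct present values): a sentinel gets averaged in like any other point and drags the mean and standard deviation, while `None` is simply skipped and passed through:

```python
from statistics import mean, stdev


def z_normalize(values):
    """Z-normalize a series, leaving missing points (None) untouched."""
    present = [v for v in values if v is not None]
    mu, sigma = mean(present), stdev(present)
    return [None if v is None else (v - mu) / sigma for v in values]


if __name__ == "__main__":
    with_sentinel = [5.0, 6.0, -1.0, 7.0, 6.0]   # -1.0 stands in for N/A
    with_gap = [5.0, 6.0, None, 7.0, 6.0]        # N/A kept as a real gap
    print(z_normalize(with_sentinel))
    print(z_normalize(with_gap))
```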

Next on the list is to look at the distributions of the data and adjust the SAX cutoff intervals to suit Hackystat-type data.

Monday, March 23, 2009

First working build.

Just managed to build and run the Trajectory search tool. Check out some screenshots:

I'll post a more detailed description later; I have to run to the "cubicles area".

Sunday, March 22, 2009

iBATIS + MySQL + MIGLayout

Over the last week I finished improving the TrajectoryDB schema and moved from the tricky JDBC connector business to the iBATIS data mapper. DB work has become much easier, and this is the schema I'm currently using:

My current goal is to find out whether there are any interesting motifs in the data I have. Searching with MySQL queries + R scripts turned out to be quite time-consuming, so I ended up hacking together a neat GUI tool to conduct the search. Here is the current GUI snapshot:

For now I can choose a particular motif (top-left scroll table; frequency is the number of motif occurrences over all charts) and the set of projects/charts in which it was found (second scroll table from the top on the left, showing project, chart and frequency). Tomorrow I am planning to get the whole selection rendered interactively in the right panel using JFreeChart. So, how cool is that?

Monday, March 16, 2009

SAX code and TrajectoryDB schema evolution

In order to improve the indexing speed and move towards automatic motif discovery, I've updated the DB schema to make it more motif-centric and customized the SAX code's data structures to allow automatic motif frequency computation.

Because a fair number of data retrievals from the sensorbase fail, I've decided to create a local chart index and rewrite the telemetry data retrieval code.

Current DB schema:

As you can see, the sax_motif table is the top-level summary of the motifs found. It should be easy to count and sort by frequency and to see the DISTINCT projects where a particular motif occurs. I just need to write the code and populate the data now.
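
The kind of query I have in mind looks roughly like this. It is only a sketch: it uses an in-memory SQLite database instead of MySQL, and the `motif_occurrence` table, its columns and the sample rows are hypothetical stand-ins, not the real TrajectoryDB schema:

```python
import sqlite3


def motif_summary(rows):
    """Return (motif, frequency, distinct projects), most frequent first."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical occurrence table: one row per place a motif was found.
    conn.execute(
        "CREATE TABLE motif_occurrence ("
        " motif TEXT, project TEXT, chart TEXT, position INTEGER)"
    )
    conn.executemany("INSERT INTO motif_occurrence VALUES (?, ?, ?, ?)", rows)
    summary = conn.execute(
        "SELECT motif, COUNT(*) AS frequency,"
        "       COUNT(DISTINCT project) AS projects"
        " FROM motif_occurrence"
        " GROUP BY motif"
        " ORDER BY frequency DESC, motif"
    ).fetchall()
    conn.close()
    return summary


if __name__ == "__main__":
    sample = [  # made-up rows, for illustration only
        ("abc", "projA", "Build", 3),
        ("abc", "projA", "DevTime", 10),
        ("abc", "projB", "Build", 7),
        ("bcd", "projB", "Build", 1),
    ]
    for row in motif_summary(sample):
        print(row)
```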

Thursday, March 12, 2009

The confirmation of valid search

The figure depicts the result of an unsupervised motif search over all the data available in the local sensorbase. I ran the search with a single parameter, a length of 7, which corresponds to a week, and this is what I got:

The first motif search results

Today I was able to index all of the data I currently have locally, and here is a screenshot of the highest frequencies:

I've decided to pull more information about a couple of the motif entries, and here is the result:

As can be seen, for example, the motif `cgcccgi` was found in project hackystat at position 23 and in project Default at position 2945, while the motif `ccjjcccc` was found at various positions in the hackystat-sensor-* projects, as well as in my compilers project at position 105, and so on...

But the data distribution is still far from normal :(

Sensorbase Index schema evolution

The latest changes in the Sensorbase IndexDB follow the implementation of SAX indexing. The current schema is, IMO, unoptimized, but it looks like it works. Right now I'm getting charts over the WAN from dasha, and we'll see whether I'm able to find any "motifs" today.

Schema:

Wednesday, March 11, 2009

SAX based time series motif search primer

This post illustrates SAX-based motif search in time series.
Let's assume that we are given the following time series: (1,1,3,5,8,7,6,2,3,4,3,1,1,5,4,5,2,3,4,6,9,8,5,2,3,4,2,5,3,2,5,6,8,9,0,3,3), which is 37 points long:

Now if we run the SAX algorithm using a sliding window of size 7, no PAA and an alphabet of size 6, this is what the resulting matrix of substrings looks like:

[aabdfee, abdfeea, bdfeeab, dfeeabb, ffeabcb, ffbcdca, fbdedaa, bdedaaf, dedaafe, dcaafdf, daafefb, aafefbd, afdfbcd, eceabcf, cdabcef, cabbdff, abbdffc, bbdffca, bdffcab, dffcabb, ffdbbcb, fdabcad, faceafc, bdebfdb, ceafcaf, daebaef, adbadef, cbacdff, bbddffa, bddffab, ddffabb]
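
The conversion can be sketched in a few lines of Python. This is an illustration rather than the code I actually ran, and it rests on some assumptions: every window is z-normalized on its own using the sample standard deviation, the cuts are the equiprobable breakpoints of the standard normal, and the function names are made up. Under these assumptions the series yields 31 words, starting with aabdfee, with ffeabcb at index 4:

```python
from bisect import bisect_left
from statistics import NormalDist, mean, stdev

SERIES = [1, 1, 3, 5, 8, 7, 6, 2, 3, 4, 3, 1, 1, 5, 4, 5, 2, 3, 4,
          6, 9, 8, 5, 2, 3, 4, 2, 5, 3, 2, 5, 6, 8, 9, 0, 3, 3]


def normal_cuts(alphabet_size):
    """Breakpoints splitting the standard normal into equiprobable regions."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(k / alphabet_size)
            for k in range(1, alphabet_size)]


def sax_word(window, cuts):
    """Z-normalize one window and map every point to a letter (no PAA)."""
    mu, sigma = mean(window), stdev(window)  # sample standard deviation
    if sigma == 0:
        # Flat window: every point sits exactly at the mean.
        zs = [0.0] * len(window)
    else:
        zs = [(v - mu) / sigma for v in window]
    return "".join(chr(ord("a") + bisect_left(cuts, z)) for z in zs)


def sax_words(series, window_size, alphabet_size):
    """Slide a window over the series and emit one SAX word per position."""
    cuts = normal_cuts(alphabet_size)
    return [sax_word(series[i:i + window_size], cuts)
            for i in range(len(series) - window_size + 1)]


if __name__ == "__main__":
    words = sax_words(SERIES, window_size=7, alphabet_size=6)
    print(len(words), words[:5])
```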

If we compute all pairwise distances over these strings, we find the following matches (the distance between the strings is zero; the second column gives the substring indexes):

ffeabcb - ffdbbcb, [4] - [20]
cabbdff - cbacdff, [15] - [27]

which correspond to the following two motifs:
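
Note that the matched strings are not identical letter by letter. One distance consistent with the zero values above is the SAX MINDIST, under which neighbouring letters count as distance zero. The sketch below assumes that this is the distance in use; the function names are made up, and with no PAA the usual scaling factor is 1:

```python
from math import sqrt
from statistics import NormalDist


def normal_cuts(alphabet_size):
    """Breakpoints splitting the standard normal into equiprobable regions."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(k / alphabet_size)
            for k in range(1, alphabet_size)]


def letter_distance(x, y, cuts):
    """MINDIST cell: zero for equal or adjacent letters, otherwise the
    gap between the breakpoints that separate the two letters."""
    lo, hi = sorted((ord(x) - ord("a"), ord(y) - ord("a")))
    if hi - lo <= 1:
        return 0.0
    return cuts[hi - 1] - cuts[lo]


def mindist(word_a, word_b, cuts):
    """SAX MINDIST for equal-length words built without PAA."""
    return sqrt(sum(letter_distance(x, y, cuts) ** 2
                    for x, y in zip(word_a, word_b)))


if __name__ == "__main__":
    cuts = normal_cuts(6)
    print(mindist("ffeabcb", "ffdbbcb", cuts))  # substrings [4] and [20]
    print(mindist("cabbdff", "cbacdff", cuts))  # substrings [15] and [27]
    print(mindist("aabdfee", "ffeabcb", cuts))  # an unrelated pair
```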

Monday, March 9, 2009

Indexing the Sensorbase and puzzled by the distribution (for now)

I'm following the SAX approach to database indexing and have pulled some data to run a distribution analysis. This is the current schema of the database I'm using:

What I am doing right now is running the "sensorbase crawler", which pulls all the project summaries available for a given user and then pulls all possible charts for each project (and member). After that the data gets normalized, transformed into the SAX representation and stored in the database.

While I'm working on configuring chart retrieval (those parameters), I am worried about the distribution of the data points from the Build and Devtime streams: it does not look normal, but rather exponential. I'm planning to pull other streams from the sensorbase and work a little more on the data distribution analysis, and if they look the same I guess the SAX scheme will need correcting.
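
If the data really were exponential, one possible correction would be to take the cuts from the exponential quantiles instead of the normal ones. This is a sketch of that idea only, not something I have implemented; the function name `exponential_cuts` and the `rate` parameter are assumptions, and the cuts would apply to the raw, non-negative values rather than to z-normalized ones:

```python
from math import log


def exponential_cuts(alphabet_size, rate=1.0):
    """Cut points giving equiprobable bins under an Exponential(rate) model."""
    # Inverse CDF of the exponential distribution: -ln(1 - p) / rate.
    return [-log(1 - k / alphabet_size) / rate
            for k in range(1, alphabet_size)]


if __name__ == "__main__":
    print(exponential_cuts(5))            # 4 cuts for a 5-letter alphabet
    print(exponential_cuts(5, rate=0.5))  # a slower decay stretches the cuts
```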
