Tuesday, October 13, 2009

voidbase

voidbase : open source implementation of queue-based computing paradigm is a project aimed at providing framework for development of applications for analytical processing on large-scale data streams, using the model of computation oriented around channeling of data streams in appropriate queues with assigned "stream-friendly" (convolutional) aggregation operations. Initial release is now available at project website.

Sunday, July 26, 2009

Queue-based computing : framework for soft real-time parallel data analysis

Queue-based computing represents a simple paradigm we stumbled upon in the process of work on voidsearch* real time data analysis engine.

Let the Q={q_1..q_n} represents a dynamic set of queues q_i of fixed size s_i. We define aggregation A_i as a arbitrary convolution function of queue elements such that A_i(t)=f(A_i(t-1),q_i), meaning that the update of value of aggregation for new queue entry can be computed from the previous value of aggregation and value of new entry.

We now define arbitrary analysis task on set of n-field data entries D={d_1..d_k} as a task of providing proper mapping M = {dq_1..dq_k} | (dq : d_i -> q_i) and a proper set of aggregation functions A.

Finally, (Q,D,A,M) represents a complete description of queue computation for given data D, mapping via set of functions M to the queues from the set Q, each provided with aggregation function A. Data D represents an input to the problem, while values of aggregates represents the output of the problem.

What makes this paradigm different than similar parallel data processing techniques is the constraint on convolution nature of the aggregate function. This means that by providing (Q,D,A,M) tuple, as described above, we would be immediately able to track the values of aggregates (updated as new data arrive to the queue) - and even observe interesting patterns like convergence, which can be of particular interest in data analysis tasks. Additionally, this paradigm is especially suited for analysis of continuous flows of data (data streams), which are not particularly well tackled by the standard batch-processing approach to data analysis.

to be continued...

Friday, May 08, 2009

randomNode : revisited

After a >2 year delay since the original randomNode clustering search engine experimental project (and a related paper), I have finally decided to give it a shot and try to apply the concept in the case of practical real-world search engine. It seems that such approximate graph partitioning might be even more useful in the context of realtime data clustering (for example, for dynamic sources like twitter). Stay tuned for a upcoming demo .... :)

Tuesday, November 25, 2008

Conference slides..

Just got back from conference, and here are the slides from given talk. All in all - not bad. Now - off to the further research on dimension reduction using query-induced subgraphs. Got really interesting question at the conference - what is the minimal number of nodes for which the similarity of \beta (power-law exponent) for entire graph and of induced subgraph holds? This would be a "phase transition" figure bellow for which \beta should explode, and all nice properties (existence of 'giant' component, walk convergence, etc) - should disappear. Anyhow ... we'll see :) . Meanwhile - you can checkout the paper - here . Stay tuned. ;)

Saturday, October 11, 2008

Paper submited !

Voila!

After more than a month of day-slicing and hour-saving, I have finally managed to complete the paper I was working on for quite some time, and submit it to 16th Telecommunications Forum (TELFOR) 2008 conference.

Title of the paper is "Search Result Clustering via Randomized Partitioning of Query-Induced Subgraphs", and it deals with the problem of graph clustering and it's applicability to the general problem of search result clustering. General scale-free properties of proposed concept of query-induced subgraphs are analyzed, and new algorithm is introduced, which performs efficient clustering on power-law graphs in computationally and space efficient manner.

Getting this paper out there is quite pleasing, actually, especially since it finaly enables me to let the topic go and move on with further research. (memory note : sticking to single topic and research result for too long, waiting for you to "perfect" it is super-counter-productive and will get you in a no-progress-no-paper infinite loop sooner or later. Just break the loop, wrap what you've have accomplished into a paper and move on) :)

Tuesday, August 12, 2008

Complete subgraphs in random graph : a generative approach

Question : "Describe the random graph model M with minimum number of vertices V, having a complete subgraph of size k , with probability > p"

..........

Friday, July 25, 2008

back on (research) track.....

After a while - i'm getting back to my research. The current topic is "Clustering the Social Interaction Graph", a paper i'm currently working on in process of getting into a phd programme. The topic builds on the previous research i have done in the field of approximate graph clustering (a "Search result clustering by randomized partition of topic-induced subgraphs" - to be published at TELFOR 2008 - a small diploma-thesis research topic i have done last year).

Allthough a year had passed since the original topic, a social element had gain a significant momentum, which revitalized interest in the graph-related applications, which makes a perfect timing for getting back into this area of research.

So, i'm planning to get my stuff together, and start rolling on this new research project. I'll be posting various stuff/ideas related to the general study of networks and bunch of practical ideas/applications - which is where we are actually headed for.

stay tuned.... ;)