### Queue-based computing: a framework for soft real-time parallel data analysis

Queue-based computing is a simple paradigm we stumbled upon while working on the voidsearch* real-time data analysis engine.

Let Q = {q_1..q_n} be a dynamic set of queues, where each queue q_i has a fixed size s_i. We define the aggregation A_i as an arbitrary convolution function of the queue's elements such that A_i(t) = f(A_i(t-1), q_i), meaning that when a new entry arrives in queue q_i, the updated value of the aggregation can be computed from the previous value of the aggregation and the value of the new entry.
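To make the update rule concrete, here is a minimal Python sketch of one queue with an incrementally updated aggregate. The class name `AggregatedQueue`, the `running_mean` function, and the choice to carry a `(count, mean)` pair as the aggregate state are all assumptions made for illustration, not part of the voidsearch engine.

```python
from collections import deque


class AggregatedQueue:
    """A fixed-size queue q_i paired with an aggregate A_i (hypothetical helper)."""

    def __init__(self, size, update, initial):
        self.entries = deque(maxlen=size)  # fixed size s_i; oldest entries fall off
        self.update = update               # f(previous aggregate, new entry)
        self.value = initial               # A_i before any entry has arrived

    def push(self, entry):
        # A_i(t) = f(A_i(t-1), new entry): no pass over the whole queue is needed.
        self.entries.append(entry)
        self.value = self.update(self.value, entry)
        return self.value


def running_mean(state, entry):
    # Assumed aggregate: mean of all entries seen so far, kept as (count, mean).
    count, mean = state
    count += 1
    return count, mean + (entry - mean) / count


q = AggregatedQueue(size=3, update=running_mean, initial=(0, 0.0))
for x in [2.0, 4.0, 6.0, 8.0]:
    q.push(x)
```

Note that this aggregate covers every entry that has passed through the queue, not only the `size` entries currently held. An aggregate restricted to the current window would also need the evicted entry, which goes slightly beyond the update rule as stated above.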

We now define an arbitrary analysis task on a set of n-field data entries D = {d_1..d_k} as the task of providing a proper mapping M = {dq_1..dq_k} | (dq : d_i -> q_i) and a proper set of aggregation functions A.

Finally, the tuple (Q,D,A,M) is a complete description of a queue computation: the data D is mapped by the set of functions M to the queues in the set Q, each of which is provided with an aggregation function from A. The data D is the input to the problem, while the values of the aggregates are its output.
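As a rough illustration of how the four parts fit together, the sketch below runs a tiny (Q,D,A,M) computation in Python. Everything concrete in it is assumed for the example: the `run` function, the queue names `latency` and `errors`, the field names `ms` and `status`, and the reading of each mapping function as "extract one value from a data entry for one queue".

```python
from collections import deque


def run(queues, data, aggregations, mapping):
    """Feed every data entry through the mapping and update each aggregate."""
    values = {name: initial for name, (_, initial) in aggregations.items()}
    for entry in data:
        for name, extract in mapping.items():
            item = extract(entry)           # a function from M maps the entry to a queue
            queues[name].append(item)       # the queue keeps the most recent items
            update, _ = aggregations[name]
            values[name] = update(values[name], item)  # A(t) = f(A(t-1), item)
    return values


# Q: fixed-size queues (names are hypothetical)
Q = {"latency": deque(maxlen=100), "errors": deque(maxlen=100)}
# D: n-field data entries (field names are hypothetical)
D = [
    {"ms": 120, "status": 200},
    {"ms": 80, "status": 500},
    {"ms": 100, "status": 200},
]
# A: one (update function, initial value) pair per queue
A = {
    "latency": (max, 0),                      # largest latency seen so far
    "errors": (lambda acc, x: acc + x, 0),    # count of server errors
}
# M: one extraction function per queue
M = {
    "latency": lambda d: d["ms"],
    "errors": lambda d: int(d["status"] >= 500),
}

result = run(Q, D, A, M)
```

Because each queue only touches its own aggregate, the inner loop over queues could in principle be distributed across workers. The sequential loop here is only the simplest way to show the data flow.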

What distinguishes this paradigm from similar parallel data processing techniques is the constraint on the convolution nature of the aggregate function. By providing a (Q,D,A,M) tuple as described above, we can immediately track the values of the aggregates (updated as new data arrives in the queues) and even observe interesting patterns such as convergence, which can be of particular interest in data analysis tasks. Additionally, this paradigm is especially well suited to the analysis of continuous flows of data (data streams), which the standard batch-processing approach to data analysis does not handle particularly well.
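The following Python sketch shows one way to track an aggregate over a stream and notice when it settles. The helper names `track` and `first_converged`, the exponential moving average used as the aggregate, and the convergence test (two successive values closer than a tolerance) are assumptions chosen for the example. The post itself does not prescribe a convergence criterion.

```python
from itertools import repeat


def track(stream, update, initial):
    """Yield the aggregate after every new entry, without storing the stream."""
    value = initial
    for entry in stream:
        value = update(value, entry)  # A(t) = f(A(t-1), new entry)
        yield value


def first_converged(values, tol):
    """Return (step, value) once two successive aggregates differ by less than tol."""
    previous = None
    for step, value in enumerate(values, start=1):
        if previous is not None and abs(value - previous) < tol:
            return step, value
        previous = value
    return None  # the stream ended before the aggregate settled


def ema(previous, entry, alpha=0.5):
    # Assumed aggregate: exponential moving average with smoothing factor alpha.
    return previous + alpha * (entry - previous)


# A constant stream of 10.0 stands in for live data here.
stream = repeat(10.0, 20)
converged = first_converged(track(stream, ema, 0.0), tol=0.1)
```

Since `track` is a generator, the aggregate is available after every entry, which is the property that a batch job recomputing everything at the end would not provide.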

to be continued...
