Friday, November 15, 2013

Streaming Systems history, part 1

Lately I've been very busy studying streaming systems in general. I've took a look at the last decade streaming systems and what literature proposed. I want to share a little bit my findings summing up things here a little bit. I'm going to divide this series of posts in three, one for each generation of streaming systems. All the data should be more or less correct, but feel free to comment below and ask question of correct mistakes if there are! I would like to be the more precise as possible.

The first generation of streams proposes, among all, StreamIt (2002), Sawzall (2003) and CQL (2003).

StreamIt is a compilation infrastructure and programming language created to setup pipelines of streams. Users can create any kind of topology of the pipeline thanks to different kind of filters available. The filter structure available are:

  •  Pipeline: Let the user build a filter which has one input channel and one output channel, thus with a series of pipeline filters you can only build linear streaming pipelines.
  • SplitJoin: The first filter is a Split which has two output streams, then there can be different pipeline filters and at the end a Join filter which joins the work performed in parallel.
  • Feedback Loop: Let the user build a node which has two output streams, one that goes on in the pipeline and the other that feeds itself. Useful for tail recursive computations (i.e. Fibonacci).
The programming model is imperative, with a much Java/C++-like syntax. Pipelines built with StreamIt can be run on clusters of machines. To cope with load problems it has a mechanism of reinitialisation, where if a bottleneck is found, the pipeline is restarted with a different configuration to cope with the load changes.

Sawzall is a procedural domain-specific programming language which includes support for statistical aggregation of values read or computed from the input, very similar to Pig. Sawzall was developed by Google to process log data generated by Google's server. It processes one record at a time, and emits an output in tabular form. Sawzall is stateless, and thanks to MapReduce, each Sawzall program can run on multiple hosts (cluster of machines). 

Last but not least, CQL (Continuous Query Language) is an SQL based language for writing and maintaining continuous queries over a stream of data. Those queries are suitable for reactive and real-time programs; for example to keep an up to date view of data. It was implemented as a part of a project named STREAM.
Here is an example of a CQL query:

Select Sum(O.cost)
From Orders O, Fulfillments F [Range 1 Day]
Where O.orderID = F.orderID And F.clerk = "Alice"
    And O.customer = "Bob"

As you can see, CQL is much alike SQL, with the difference of having a timespan on the query (i.e. the day range in the example). This is because queries are performed over continuous streams of data, thus a time range has to be specified to get a result. In the example, the query selects the sum of the total costs of orders in one day, performed by Alice and bought by Bob.

And that's it for the first part of this streaming system history. I'm sorry about not writing something more about Sawzall, if you have some suggestions, comments or corrections please comment the post!