Monday, November 25, 2013

Streaming Systems history, part 2

Here I am again with some history on streaming systems! Today I would like to dive into systems born from 2005 to 2010. If you find any mistake please comment the post and let me know!

In this post I'm going to talk about the second generation of streaming systems: Borealis (2005), SPC (2006), DryadLINQ (2008) and S4 (2010).

Borealis is a distributed stream processing engine which inherits much from Aurora and Medusa. Aurora is a framework for monitoring application born around 2002. At the high level, the system model is made out of operators which receive and forward data thorough continuous queries. After Aurora was born, Aurora* was proposed, which is a version of Aurora with distribution and scalability. Borealis implements all the functionality of Aurora and is designed to support not only dynamic query modification and revision but also to be flexible an highly scalable.


Borealis can run Directed Acyclic Graphs (DAGs) as pipelines but not Arbitrary pipelines. A bachelor thesis (if I'm not mistaken) proposed a graphical user interface for Borealis that let users build pipelines using visual building blocks. Otherwise, one can implement a pipeline directly in C# (imperative programming). Borealis is the only one system in the second generation that supports not only deployment on clusters and the cloud, but also on pervasive devices (i.e. microcontrollers).
At runtime, it is flexible at node level to handle failures: it is capable of reinitialising a failed node while the stream is running, thanks to a replication mechanism that sends the data to the closest upstream replica. Load balancing is again performed with load shedding.

SPC stands for Stream Processing Core and is a middleware for distributed stream processing which targets data mining applications. It was built to support applications that extract data from multiple digital data streams. Topologies are composed by Processing Elements (which are what I call nodes) which implement application-defined operators that are connected by stream subscriptions.

Likewise Borealis, this system supports DAG as topology and has an imperative programming language as well as a graphical user interface. The deployment is only available on cluster of machines and the cloud. It is flexible at pipeline level, in the sense that it can change pipeline while the stream is flowing. The idea is to have the pipeline to be extended at runtime: new nodes can join the pipeline and connected to the stream while at runtime. Again like Borealis, it has replication to cope with faults and has no mean to cope with load balancing.

DyadLINQ imports a new programming model for distributed computing on large scale. It supports both general-purpose declarative and imperative operations on datasets thanks o a high-level programming language. A streaming application built with DryadLINQ is composed by LINQ expressions automatically translated by the compiler into a distributed execution plan (which DryadLINQ calls "job graph", and is a "topology") passed to the Dryad executing platform.

DryadLINQ supports DAG pipelines and both imperative and declarative programming models. The deployment is only targeting cluster of machines, while likewise SPC, it supports dynamic topology reconfiguration at runtime. Fault tolerance is indeed tackled with reconfiguration while load balancing is faced by a dynamic adaptive controller. More in details, DryadLINQ exploits some hooks in the Dryad API to mutate on runtime the pipeline to improve performance. Ideally performance is improved by aggregating nodes, thus decreasing I/O times.

Last but not least, we have S4 (Simple Scalable Streaming System), which is again a general-purpose distributed streaming platform developed by Yahoo! and inspired by the Actor model and MapReduce. It allows developers to program applications that process unbounded streams of data. Developers can set up pipelines with Processing Nodes that host Processing Elements. Each Processing Element (PE) is associated with a function and the type of event that it consumes.

Also S4, like almost all the others, supports DAGs as topology. The programming model is imperative and the deployment is possible on cluster of machines and the cloud. The pipeline is flexible at Node level, while for fault tolerance it supports reconfiguration (it can reconfigure a pipeline splitting the nodes deployed on a failed host among the remaining available execution resources) and for load balancing it again shows a dynamic adaptive controller.

And that's it. Hopefully I haven't made much mistakes.

See you next time with the last part!

0 comments:

Post a Comment