Friday, February 7, 2014

The Blog Has Moved!

Hi all,

I'm finally (almost) done with transitioning the contents from here to the new domain, . From now on I will update that blog, but I'll keep this open for a while so that people can find me there!

See ya on the new domain!

Thursday, February 6, 2014

Surprise? Not really...

Hi all,

as mentioned in the previous post, I wanted to make a huge surprise to all of you. I bought a couple of weeks ago a domain for this blog, here! Yay! But... I had several issues setting it up, and I'm still struggling to bring all the old posts there and make it work. When everything is going to be setup, I will close this blog and move to the other one!

In the future I have to finish the streaming history as it is taking a long time to end... and then I have already a couple of ideas in mind as for working reasons I'm going to treat REST, thus I might end up posting a couple of entries about it.

Stay tuned, and sorry for the lack of updates!

Sunday, February 2, 2014

Happy new year!

I know it's a bit late, but happy new year! I had many deadlines coming and I couldn't really update the blog. I have a couple of ideas for articles but I don't really have time to write them. Anyways I'm working on a surprise. Hopefully by tomorrow I will unveil it!

Stay tuned!

Saturday, December 28, 2013

Streaming Systems history, part 3

Finally, here we are on the third part of the streaming system history. I'm sorry it took so long, but I had many, many things to do for work and I was very busy with real life stuff.
In this part I'm going to introduce systems from 2011. If you find any mistake or anything incomplete feel free to let me know by commenting the post! Not all the 2011 systems are shown in this post, I will complete it with the final, fourth part.

In this part I'm going to quickly talk about Storm (2011) and WebRTC (2011).

Storm is a distributed computational environment, free and open source (check out the link!) originally developed by Twitter. The building blocks, where information can be manipulated and created are called "spouts" (blocks that produce data) and "bolts" (blocks that receive streamed data). With them, developers can build a pipeline; they are indeed very similar to MapReduce jobs, with the only difference that they can (theoretically) run forever. Every spout and bolt can be run by one or more worker, where the core of the Storm computation is executed.

Given spouts and bolts as basic building blocks for a pipeline, developers can build Direct Acyclic Graph pipelines. The programming language is Java, thus the programming representation is imperative. Storm pipelines can be deployed on the cloud: that is they can run on local machine and have spouts or bolts somewhere on the cloud. This implementation detail became very important as time passed by. The pipeline is dynamic at node level, that is a Storm spout or bolt can alter the number of workers at runtime. Faults are dealt by reconfiguring the pipeline at runtime thanks to some controller threads that constantly run checking for errors. I call these kind of controllers "dynamic adaptive", like the ones seen in DryadLINQ and S4.

Web Real-Time Communication (WebRTC) is an API currently drafted by the World Wide Web Consortium (W3C). It enables browser-to-browser connection for applications like video, chat or voIP. Many examples of this API have been implemented since its initial release, for example peer-to-peer file sharing or video conferencing applications. This API is not a real streaming system per se, but brings a whole new connectivity for what concerns browser applications, opening path for future browser streaming applications.

Given the fact that WebRTC is only an API, we can build any kind of pipeline (arbitrary pipelines). It can only be programmed in JavaScript, thus likewise Storm, its programming representation is imperative while the deployment, for now, is only the web browser. The pipeline flexibility is dynamic at topology level, as nodes (browsers) can connect and disconnect from the pipeline at runtime. Of course there is no fault tolerance or load balancing, thus once the pipeline is disrupted, it can't be reconstructed again. Ideally, we would need a software on top of it to do so.

This was a small part, in respect to the previous two. As I mentioned, I'm very busy lately thus I only found time to write this. The next part will hopefully be the last one (this one was supposed to be, but it turned out to be very long).
See ya!


Monday, November 25, 2013

Streaming Systems history, part 2

Here I am again with some history on streaming systems! Today I would like to dive into systems born from 2005 to 2010. If you find any mistake please comment the post and let me know!

In this post I'm going to talk about the second generation of streaming systems: Borealis (2005), SPC (2006), DryadLINQ (2008) and S4 (2010).

Borealis is a distributed stream processing engine which inherits much from Aurora and Medusa. Aurora is a framework for monitoring application born around 2002. At the high level, the system model is made out of operators which receive and forward data thorough continuous queries. After Aurora was born, Aurora* was proposed, which is a version of Aurora with distribution and scalability. Borealis implements all the functionality of Aurora and is designed to support not only dynamic query modification and revision but also to be flexible an highly scalable.

Borealis can run Directed Acyclic Graphs (DAGs) as pipelines but not Arbitrary pipelines. A bachelor thesis (if I'm not mistaken) proposed a graphical user interface for Borealis that let users build pipelines using visual building blocks. Otherwise, one can implement a pipeline directly in C# (imperative programming). Borealis is the only one system in the second generation that supports not only deployment on clusters and the cloud, but also on pervasive devices (i.e. microcontrollers).
At runtime, it is flexible at node level to handle failures: it is capable of reinitialising a failed node while the stream is running, thanks to a replication mechanism that sends the data to the closest upstream replica. Load balancing is again performed with load shedding.

SPC stands for Stream Processing Core and is a middleware for distributed stream processing which targets data mining applications. It was built to support applications that extract data from multiple digital data streams. Topologies are composed by Processing Elements (which are what I call nodes) which implement application-defined operators that are connected by stream subscriptions.

Likewise Borealis, this system supports DAG as topology and has an imperative programming language as well as a graphical user interface. The deployment is only available on cluster of machines and the cloud. It is flexible at pipeline level, in the sense that it can change pipeline while the stream is flowing. The idea is to have the pipeline to be extended at runtime: new nodes can join the pipeline and connected to the stream while at runtime. Again like Borealis, it has replication to cope with faults and has no mean to cope with load balancing.

DyadLINQ imports a new programming model for distributed computing on large scale. It supports both general-purpose declarative and imperative operations on datasets thanks o a high-level programming language. A streaming application built with DryadLINQ is composed by LINQ expressions automatically translated by the compiler into a distributed execution plan (which DryadLINQ calls "job graph", and is a "topology") passed to the Dryad executing platform.

DryadLINQ supports DAG pipelines and both imperative and declarative programming models. The deployment is only targeting cluster of machines, while likewise SPC, it supports dynamic topology reconfiguration at runtime. Fault tolerance is indeed tackled with reconfiguration while load balancing is faced by a dynamic adaptive controller. More in details, DryadLINQ exploits some hooks in the Dryad API to mutate on runtime the pipeline to improve performance. Ideally performance is improved by aggregating nodes, thus decreasing I/O times.

Last but not least, we have S4 (Simple Scalable Streaming System), which is again a general-purpose distributed streaming platform developed by Yahoo! and inspired by the Actor model and MapReduce. It allows developers to program applications that process unbounded streams of data. Developers can set up pipelines with Processing Nodes that host Processing Elements. Each Processing Element (PE) is associated with a function and the type of event that it consumes.

Also S4, like almost all the others, supports DAGs as topology. The programming model is imperative and the deployment is possible on cluster of machines and the cloud. The pipeline is flexible at Node level, while for fault tolerance it supports reconfiguration (it can reconfigure a pipeline splitting the nodes deployed on a failed host among the remaining available execution resources) and for load balancing it again shows a dynamic adaptive controller.

And that's it. Hopefully I haven't made much mistakes.

See you next time with the last part!

Friday, November 15, 2013

Streaming Systems history, part 1

Lately I've been very busy studying streaming systems in general. I've took a look at the last decade streaming systems and what literature proposed. I want to share a little bit my findings summing up things here a little bit. I'm going to divide this series of posts in three, one for each generation of streaming systems. All the data should be more or less correct, but feel free to comment below and ask question of correct mistakes if there are! I would like to be the more precise as possible.

The first generation of streams proposes, among all, StreamIt (2002), Sawzall (2003) and CQL (2003).

StreamIt is a compilation infrastructure and programming language created to setup pipelines of streams. Users can create any kind of topology of the pipeline thanks to different kind of filters available. The filter structure available are:

  •  Pipeline: Let the user build a filter which has one input channel and one output channel, thus with a series of pipeline filters you can only build linear streaming pipelines.
  • SplitJoin: The first filter is a Split which has two output streams, then there can be different pipeline filters and at the end a Join filter which joins the work performed in parallel.
  • Feedback Loop: Let the user build a node which has two output streams, one that goes on in the pipeline and the other that feeds itself. Useful for tail recursive computations (i.e. Fibonacci).
The programming model is imperative, with a much Java/C++-like syntax. Pipelines built with StreamIt can be run on clusters of machines. To cope with load problems it has a mechanism of reinitialisation, where if a bottleneck is found, the pipeline is restarted with a different configuration to cope with the load changes.

Sawzall is a procedural domain-specific programming language which includes support for statistical aggregation of values read or computed from the input, very similar to Pig. Sawzall was developed by Google to process log data generated by Google's server. It processes one record at a time, and emits an output in tabular form. Sawzall is stateless, and thanks to MapReduce, each Sawzall program can run on multiple hosts (cluster of machines). 

Last but not least, CQL (Continuous Query Language) is an SQL based language for writing and maintaining continuous queries over a stream of data. Those queries are suitable for reactive and real-time programs; for example to keep an up to date view of data. It was implemented as a part of a project named STREAM.
Here is an example of a CQL query:

Select Sum(O.cost)
From Orders O, Fulfillments F [Range 1 Day]
Where O.orderID = F.orderID And F.clerk = "Alice"
    And O.customer = "Bob"

As you can see, CQL is much alike SQL, with the difference of having a timespan on the query (i.e. the day range in the example). This is because queries are performed over continuous streams of data, thus a time range has to be specified to get a result. In the example, the query selects the sum of the total costs of orders in one day, performed by Alice and bought by Bob.

And that's it for the first part of this streaming system history. I'm sorry about not writing something more about Sawzall, if you have some suggestions, comments or corrections please comment the post!

Thursday, November 7, 2013

I'm still alive

Hi all, I'm sorry I couldn't post much later. I've been in vacation for a while, and then suddenly a big chunk of work arrived, and I had practically 0 time to write here. I'm not even programming anymore, just paper writing and teaching assisting. Hopefully I'll be back on track after the 15th, the last deadline for the last paper I wrote.
Until next time, see ya!