
Developing data processing job using Apache Beam - Streaming pipeline

This time we are going to talk about one of the most in-demand topics in the modern Big Data world: processing of streaming data. The principal difference between batching and streaming is the type of input data source. When your data set is limited (even if it is huge in size) and is not updated while it is being processed, then you would likely use a batching pipeline. The input source in this case can be, for instance, files, database tables, objects in object storage, etc. I want to underline one more time that, with batching, we assume that the data is immutable during the whole processing time and that the number of input records is constant. Why should we pay attention to this? Because even with files we can have an unlimited data stream, when files are constantly added or changed. In such a case we have to apply a streaming approach to work with the data. So, if we know that our data is limited and immutable, then we need to develop a batching pipeline, as was shown and explained in the first part of this series.
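
To make the distinction concrete, here is a minimal sketch (not taken from the original post) in the Beam Java SDK, contrasting a bounded input (a fixed set of files) with an unbounded one (a Pub/Sub topic). The bucket and topic names are placeholders for illustration only:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class BoundedVsUnbounded {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Batching: the input is a fixed, immutable set of files known at run time,
    // so the resulting PCollection is bounded.
    p.apply("ReadFiles",
        TextIO.read().from("gs://my-bucket/input/*.csv"));   // hypothetical path

    // Streaming: new records keep arriving while the job runs,
    // so reading from Pub/Sub produces an unbounded PCollection.
    p.apply("ReadStream",
        PubsubIO.readStrings()
            .fromTopic("projects/my-project/topics/my-topic")); // hypothetical topic

    p.run().waitUntilFinish();
  }
}
```

The choice of input source is what drives everything else: a bounded PCollection can simply be processed to completion, while an unbounded one forces us to think about windowing and when results should be emitted, which is exactly what the rest of this post is about.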