Skip to main content

Posts

Cross-language pipelines in Beam

Java is famous of its paradigm – Write once, Run anywhere – which was defined in 1995 by Sun Microsystems to illustrate the cross-platform benefits of the Java language. Apache Beam follows the similar principle but in regard to cross-platform data processing engines – Write pipeline once and Run it on every data processing engine . Beam achieves that by leveraging a conception of Beam Runner – the programming framework, which is responsible to translate a pipeline, written in Beam model way, into a code that can be run on required processing engine, like Apache Spark, Apache Flink, Google Dataflow, etc. All translations actually happen in runtime – users don’t even need to recompile their code to change a runner if all dependencies were already provided in compile time. Therefore, Beam already supports a bunch of different runners  that can be easily used to run a user’s pipeline on different platforms.  However, the classical runner translates user code only from and to th...
Recent posts

Apache Beam for developers: Creating new IO connectors, part 1

By this post, I'll start a series of blog posts about creating new IO connectors in Apache Beam . Introduction to Beam IO Before getting into Beam IO internals, let's take a quick look on what actually Beam pipeline codebase is. In general, logically all code, that required to run a user pipeline, can be split into 4 layers - Runner , SDK , Transforms & IO , User Code .  On the bottom level, there is a Runner  code, which is responsible for all translations of user pipeline to make it possible to run on preferred data processing engine, like Apache Spark, Apache Flink, Google Dataflow, etc.   On the second level, we have a SDK  code. This part of code allows to write a Beam pipeline in favourite user programming language. For the moment, Beam supports the following SDKs: Java, Python and Go. Scala is supported through 3rd party SDK called Scio .  Third level incorporates different Beam Transforms , like ParDo , GroupByKey , Combine , e...

Developing data processing job using Apache Beam - Avro schema and schema-aware PCollections

Avro in Beam Before, we always talked about PCollections that were not aware about the structure of their records. For example, in this post we have had to parse text files manually, extract the fields and perform GroupByKey operation based on this. This is inconvenient and error-prone way of working with structured data. It can be done sufficiently easier and in more clear and elegant way if we had schema-based records as an input data collection. For the moment, there are many different data structure formats “on the market” - JSON, Avro, Parquet, Protocol Buffers, etc. - that could be used for structuring our data in more fancy way than just plain text. Apache Beam provides a good support of Avro format , so we will see in this post how to work efficiently with it. First of all, there is an AvroIO , which allows to work with files, written in Avro format, in Beam pipelines. Under the hood, it uses FileIO , so it means that we can read/write files from/to different File Systems...

Developing data processing job using Apache Beam - Windowing

Talking about Streaming data processing, it's definitively worth to mention "Windowing".  An example, showed in the previous post about streaming, was quite simple and didn't incorporate any aggregation operations like GroupByKey or Combine. But what if we need to perform such data transforms on unbounded stream? The answer is not as obvious as for bounded data collections because, in this case. we don't have an exact moment on a time line when we can conclude that all data has been arrived. So, for this purpose, most of the data processing engines use a conception of Windowing that allows to split unbounded data stream into time "windows" and perform such operation inside it. Beam is not an exception and it provides rich API for doing that. Problem description Let's take another example to see how windowing works in practice. Just to recall, the previous task was the following – there is unlimited stream of geo coordinates (X and Y) of some obje...

Developing data processing job using Apache Beam - Streaming pipeline

This time we are going to talk about one of the most demanded thing in modern BigData world nowadays – processing of Streaming data. The principal difference between Batching and Streaming is type of input data source. When your data set is limited (even if it’s huge in terms of size) and it is not being updated along the time of processing, then you would likely use Batching pipeline. Input source in this case can be, for instance, files, database tables, objects in object storages, etc. I want to underline one more time that, with batching, we assume that data is immutable during all the processing time and number of input records is constant. Why we should pay attention on this? Because even with files we can have unlimited data stream when files are always added or changed. In such case we have to apply streaming approach to work with data. So, if we know that our data is limited and immutable then we need to develop batching pipeline, like it was showed and explained in the fi...

Developing data processing job using Apache Beam - Batch pipeline

Have you heard something about Apache Beam ? No? Well, I’m not surprised - this project is quite new in data processing world. Actually, I was in the same boat with you until recently, when I started to work closely with it. In short, Apache Beam is unified programming model that provides an easy way to implement batch and streaming data processing jobs and run them on any execution engine using a set of different IOs. Sounds promising but not very clear, right? Ok, let’s try to look more closely on what actually it does mean. Starting with this, I’m going to launch a series of posts where I’ll show some examples and highlight several use cases of data processing jobs using Apache Beam. Our topic for today is batch processing. Let’s take the following example - you work on analytics and you want to analyze how many cars of each brand were sold for the whole period of time of observations. It means that our data set is bounded (finite amount of data) and it won’t be updated. So, i...