
Apache Beam for developers: Creating new IO connectors, part 1

With this post, I'm starting a series of blog posts about creating new IO connectors in Apache Beam.

Introduction to Beam IO

Before getting into Beam IO internals, let's take a quick look at what a Beam pipeline codebase actually is. Logically, all the code required to run a user pipeline can be split into four layers: Runner, SDK, Transforms & IO, and User Code.

At the bottom level, there is the Runner code, which is responsible for translating the user pipeline so that it can run on the preferred data processing engine, such as Apache Spark, Apache Flink, Google Dataflow, etc.

On the second level, we have the SDK code. This part allows users to write a Beam pipeline in their favourite programming language. At the moment, Beam supports the following SDKs: Java, Python and Go. Scala is supported through a third-party SDK called Scio.

The third level incorporates the various Beam transforms, like ParDo, GroupByKey, Combine, etc. It also includes all the IO connectors that allow reading/writing data from/to different data sources and sinks.

And finally, on top of that, there is the fourth level of Beam code, called User Code. This is where all the business logic of a user pipeline lives. Usually, it defines how and from where to read the data, how it will be transformed downstream in the pipeline, and where the results will be stored in the end.



As we can see from this diagram, IO connectors and Beam transforms sit on the same level of the hierarchy. This is not by chance: a Beam IO is, in fact, a Beam transform. It can be quite complicated, provide many configuration options and a different user API, and it can even consist of several transforms (a composite transform), but in the end it is just a Beam transform.
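To make this concrete, here is a minimal sketch of a hypothetical connector called MyIO (the name, the source descriptor and the ReadFn internals are just placeholders) showing that an IO connector is nothing more than a composite PTransform that expands into ordinary Beam transforms:

```java
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;

public class MyIO {

  // Entry point of the connector's user API, similar in spirit to TextIO.read().
  public static Read read() {
    return new Read();
  }

  // The "connector" itself is just a composite PTransform.
  public static class Read extends PTransform<PBegin, PCollection<String>> {
    @Override
    public PCollection<String> expand(PBegin input) {
      return input
          // Seed the pipeline with something to process (here, a placeholder descriptor).
          .apply(Create.of("source-descriptor"))
          // Delegate the actual reading to an ordinary ParDo.
          .apply(ParDo.of(new ReadFn()));
    }
  }

  // Hypothetical DoFn that would talk to the external system and emit records.
  static class ReadFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      // A real connector would open a client here and emit actual records.
      c.output("record read from " + c.element());
    }
  }
}
```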

Let's take the famous WordCount task as an example. Below is Java code that consists of four different steps: two business logic transforms (CountWords and MapElements) and two IO transforms that are implemented as part of TextIO and allow reading and writing files on different file systems.
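A condensed version of that pipeline could look like the sketch below, in the spirit of Beam's canonical WordCount example (the input/output paths and the word-splitting logic are simplified placeholders):

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCount {

  // Business-logic composite transform: split lines into words and count them.
  static class CountWords
      extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
    @Override
    public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
      return lines
          .apply(FlatMapElements
              .into(TypeDescriptors.strings())
              .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
          .apply(Count.perElement());
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p
        // Step 1 (IO): read input lines via TextIO, from any supported file system.
        .apply("ReadLines", TextIO.read().from("input.txt"))
        // Step 2 (business logic): the CountWords composite transform.
        .apply(new CountWords())
        // Step 3 (business logic): format each (word, count) pair as a text line.
        .apply("FormatResults", MapElements
            .into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        // Step 4 (IO): write the results back out through TextIO.
        .apply("WriteCounts", TextIO.write().to("wordcounts"));

    p.run().waitUntilFinish();
  }
}
```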

Beam already provides a lot of IO connectors to choose from, so you can use them to work with different file systems, file formats, SQL and NoSQL databases, messaging and distributed queue systems, cloud services and so on. Note, though, that most of the connectors are available only in the Java SDK.
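For example, reading from a messaging system such as Kafka with the existing KafkaIO connector takes only a few lines (the broker address and topic name below are placeholders):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaReadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The connector hides all the details of talking to Kafka behind one transform.
    PCollection<KafkaRecord<Long, String>> records =
        p.apply(KafkaIO.<Long, String>read()
            .withBootstrapServers("broker:9092")   // placeholder broker address
            .withTopic("my-topic")                 // placeholder topic name
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class));

    // ... downstream transforms would process `records` here ...

    p.run().waitUntilFinish();
  }
}
```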

Why would I need to write yet another IO?

There are a bunch of reasons why you might want to write your own IO connector in Beam.

The BigData world is growing

Every year we see more and more storage technologies, databases and streaming systems appear on the market. Unfortunately, there is no universal BigData API that every system follows and implements. They all work differently, implement their own architecture and provide their own API. So, to add support for them in Beam, we need to create new IO connectors.

We can’t use the native IOs of data processing engines

The logical question here could be: why not use the IO implementation of the underlying data processing engine? For instance, Apache Spark already provides an effective way to work with Kafka, one of the most popular messaging systems. Why do we need to do more?

There are several reasons for that:
  • Beam is based on its own model that defines how to process distributed data. Native IOs don't comply (or comply only partly) with the Beam model, so they won't work properly in a Beam pipeline.
  • All Beam IOs must be supported by all Beam runners. IOs created on top of Beam get this automatically.
  • Developed specifically for Beam, such IOs will be much more effective in terms of performance.

Provide more features for existing IOs

Most data storage systems evolve over time. So, if you need to use recent versions of their API and benefit from its new features, that could be another strong reason to extend an existing connector and implement such improvements.

Write your own connector for specific purposes

If you use Beam with specific enterprise data storage systems that have a closed API, then it probably won't be possible to share the created connector with the open source world. In this case, you need to create your own enterprise connector yourself.

Also, sometimes you want to add specific features to an already existing Beam IO, but for various reasons they won't be accepted into the Beam project code base. In such a case, you will need to extend the existing API on your own as well.

Way of learning

In the end, writing your own IO connector or improving an existing one is a very good way to learn how Beam works internally, and it's a good starting point for contributing to Apache Beam and open source.

What's next?

In the next part, we will talk about the structure of every Beam IO:

  • which parts it consists of;
  • what the differences between Bounded and Unbounded sources are;
  • what the general requirements for IO code are.
Stay tuned and happy Beaming!





