
Apache Beam for developers: Creating new IO connectors, part 1

With this post, I'm starting a series of blog posts about creating new IO connectors in Apache Beam.

Introduction to Beam IO

Before getting into Beam IO internals, let's take a quick look at what a Beam pipeline codebase actually is. In general, all the code required to run a user pipeline can logically be split into four layers: Runner, SDK, Transforms & IO, and User Code.

At the bottom level, there is the Runner code, which is responsible for translating a user pipeline so that it can run on the preferred data processing engine, like Apache Spark, Apache Flink, Google Dataflow, etc.

On the second level, we have the SDK code. This part allows users to write a Beam pipeline in their favourite programming language. At the moment, Beam supports the following SDKs: Java, Python and Go. Scala is supported through a 3rd-party SDK called Scio.

The third level incorporates the different Beam transforms, like ParDo, GroupByKey, Combine, etc. It also includes all the IO connectors that allow reading/writing data from/to different data sources and sinks.

And finally, on top of that, there is the fourth level of Beam code, called User Code. This is where all the business logic of a user pipeline lives. Usually, it defines how and from where to read data, then how that data will be transformed downstream in the pipeline, and where the results will be stored in the end.



As we can see from this diagram, IO connectors and Beam transforms sit on the same level of the hierarchy. This is not by chance: a Beam IO actually is a Beam transform. It can be quite complicated, provide many configuration options and its own user API. It can even consist of several transforms (a composite transform), but in the end, it is just a Beam transform.
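To make this more concrete, here is a minimal, purely illustrative sketch (in the Java SDK) of what such a composite IO transform could look like. MyIO, Read and ReadSplitFn are hypothetical names, not a real Beam connector; a real IO would discover splits and read actual records instead of producing stub strings.

import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;

// A hypothetical "MyIO" connector: its read() entry point is nothing more
// than a composite PTransform that expands into ordinary Beam transforms.
public class MyIO {

  public static Read read() {
    return new Read();
  }

  public static class Read extends PTransform<PBegin, PCollection<String>> {
    @Override
    public PCollection<String> expand(PBegin input) {
      // A real connector would discover and split the source here;
      // this stub just hard-codes two "splits" to show the structure.
      return input
          .apply("CreateSplits", Create.of("split-1", "split-2"))
          .apply("ReadSplit", ParDo.of(new ReadSplitFn()));
    }
  }

  // A DoFn that would read the records of a single split (stubbed here).
  static class ReadSplitFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String split, OutputReceiver<String> out) {
      out.output("record read from " + split);
    }
  }
}

From the user's point of view, applying it looks exactly like applying any other transform: pipeline.apply(MyIO.read()).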

Let's take the famous WordCount task as an example. Below is Java code which consists of four different steps: two business-logic transforms (CountWords and MapElements) and two IO transforms, which are implemented as part of TextIO and allow reading and writing files on different file systems.
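The version below is a sketch adapted from the canonical Beam WordCount/MinimalWordCount examples; the input and output paths are just placeholders.

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCount {

  // Business-logic composite transform: split lines into words and count them.
  static class CountWords
      extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
    @Override
    public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
      return lines
          .apply(FlatMapElements.into(TypeDescriptors.strings())
              .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
          .apply(Filter.by((String word) -> !word.isEmpty()))
          .apply(Count.perElement());
    }
  }

  // Business-logic function: format every (word, count) pair as a line of text.
  static class FormatAsTextFn extends SimpleFunction<KV<String, Long>, String> {
    @Override
    public String apply(KV<String, Long> wordCount) {
      return wordCount.getKey() + ": " + wordCount.getValue();
    }
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("input.txt"))        // IO transform: read
        .apply(new CountWords())                                  // business logic
        .apply(MapElements.via(new FormatAsTextFn()))             // business logic
        .apply("WriteCounts", TextIO.write().to("wordcounts"));   // IO transform: write

    p.run().waitUntilFinish();
  }
}

Note that TextIO.read() and TextIO.write() are applied to the pipeline in exactly the same way as the business-logic transforms.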

Beam already provides a lot of IO connectors to choose from. You can use them to work with different file systems, file formats, SQL and NoSQL databases, messaging and distributed queue systems, cloud services and so on. Though, most of the connectors are available in the Java SDK.

Why would I need to write yet another IO?

There are a bunch of reasons why you would want to write your own IO connector in Beam.

The BigData world is growing

Every year we see more and more storage technologies, databases and streaming systems appear on the market. Unfortunately, there is no universal BigData API that every system would follow and implement. They all work differently, implement their own architecture and provide their own API. So, to add support for these systems to Beam, we need to create new IO connectors.

We can’t use the native IO of a data processing engine

The logical question here could be: why not use the IO implementation of the underlying data processing engine? For instance, Apache Spark already provides an effective way to work with Kafka, one of the most popular messaging systems. Why do we need to do more?

There are several reasons for that:
  • Beam is based on its own model which defines how to process distributed data. Native IOs don’t comply (or only partly comply) with the Beam model, so they won't work properly in a Beam pipeline.
  • All Beam IOs must be supported by all Beam runners, and IOs built on top of the Beam model get this support automatically.
  • Developed specifically for Beam, such IOs will also be much more effective in terms of performance.

Provide more features for existing IOs

Most data storage systems evolve over time. So, if you need to use recent versions of their API and benefit from its new features, that could be another strong reason to extend an existing connector and make such improvements.

Write your own connector for specific purposes

If you use Beam with a specific enterprise data storage system that has a closed API, then it probably won't be possible to share the created connector with the open source world. In this case, you need to create your own enterprise connector yourself.

Also, sometimes you want to add specific features to an already existing Beam IO, but they won't be accepted into the Beam project code base for various reasons. In such a case, you will need to extend the existing API on your own as well.

Way of learning

In the end, writing your own IO connector or improving an existing one is a very good way to learn how Beam works internally, and it is a good starting point for contributing to Apache Beam and open source.

What's next?

In the next part, we will talk about the structure of every Beam IO:

  • which parts it consists of;
  • what the differences between Bounded and Unbounded sources are;
  • what the general requirements for IO code are.
Stay tuned and happy Beaming!





