It’s been a year since we adopted Kafka as the data pipeline that delivers data to different applications within our company. We still have a long journey ahead of us, but the results so far are quite satisfactory.
Let’s go back in time to exactly one year ago. We decided to explore Kafka to better understand what it could bring to the table. To do that, we started by improving the data pipeline we had in place back then.
To give you a bit of context, at IceMobile we work with digital loyalty programs. “Digital” is a keyword here: when you do your groceries and get paper stamps, you’re participating in a loyalty program. But if, instead of paper, you’re able to see the points you collected via a mobile application, then you’re participating in a digital loyalty program. IceMobile offers retailers an entire back- and front-end solution for digital loyalty programs.
The back-end, commonly referred to as “the platform”, is up 24/7. Its main responsibility is to process all the events coming from the retailer POS (Point of Sale) system and from user interactions through our web front-end, iOS and Android apps. Those events can be a purchase transaction, a login from a device or a result of a push notification, for example. As a result, lots of data is generated, and this data needs to be aggregated and analysed somewhere. This is why we have to move this data into our data warehouse, otherwise known to us as the BI database. We use a data pipeline to move the data from the platform to the warehouse.
A data pipeline is a set of connected services, where the output of one service is the input for the next. At IceMobile, the data pipeline is responsible for delivering data from the main platform into the BI database. The image below shows the flow of the data.
As you can see, the data has its origins in the platform. The platform publishes the data into an SQS queue that is picked up by a Lambda function and then stored into an S3 bucket. Later, the BI platform connects to S3 to request the data events and stores them into the database.
This pipeline worked fine, but it had a few problems:
- It’s not possible to replay old messages that came in the SQS queue;
- There may be data loss in case something goes wrong with the Lambda function;
- It’s not easy to distribute data among other applications.
How would changing to Kafka address those issues? And how much better or worse would it be? But, first of all... what is Kafka? And why did we choose it?
Kafka, you ask?
Apache Kafka (no, not the writer) is, in a nutshell, a message broker. It’s used for “building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies” (Source: https://kafka.apache.org/).
Kafka gives us the idea of topics, producers and consumers.
- A topic is basically a queue;
- A producer is a service that produces data into one or more topics in Kafka;
- A consumer is a service that consumes data from one or more Kafka topics.
The topic works in a FIFO (First In, First Out) schema, meaning that the order in which the data arrives in a topic is the same order in which it will be consumed (this is true if, and only if, the topic has only one partition, see more here: https://kafka.apache.org/documentation/#intro_topics). For example, the image below shows one producer inserting data into two topics, and two consumers retrieving data, each from one particular topic.
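As a rough analogy, a single-partition topic behaves like an in-memory FIFO queue: the consumer sees records in exactly the order the producer appended them. The sketch below is plain Scala with no Kafka involved; the class and method names are ours, not Kafka’s API.

```scala
import scala.collection.mutable

// A toy single-partition "topic": records come out in the order they went in.
// This only illustrates Kafka's per-partition ordering guarantee; it is not
// how Kafka is implemented or used.
class ToyTopic[A] {
  private val records = mutable.Queue.empty[A]
  def produce(record: A): Unit = records.enqueue(record) // append at the tail
  def consume(): Option[A] =                             // read from the head
    if (records.isEmpty) None else Some(records.dequeue())
}

val topic = new ToyTopic[String]
topic.produce("login")
topic.produce("purchase")
val first  = topic.consume() // the record that arrived first is consumed first
val second = topic.consume()
```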
What kind of data would we insert into Kafka? Well, event objects for example. An event object is an object that contains all the necessary data to describe an event of interest. In our case it could be a purchase transaction or a login action from a mobile device. We could create one topic per event, and our services would be listening to the events they’re interested in, as shown in the illustration below.
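To make the idea of event objects concrete, here is a minimal Scala sketch of what such objects and a topic-per-event naming scheme could look like. The event fields and topic names are invented for illustration; they are not our actual schema.

```scala
// Hypothetical event objects; field names are illustrative only.
sealed trait Event
case class PurchaseTransaction(userId: String, amountCents: Long, storeId: String) extends Event
case class DeviceLogin(userId: String, deviceId: String) extends Event

// One topic per event type, so each topic carries a single, well-defined schema.
def topicFor(event: Event): String = event match {
  case _: PurchaseTransaction => "purchase-transactions"
  case _: DeviceLogin         => "device-logins"
}

val purchase  = PurchaseTransaction("user-42", 1999L, "store-7")
val topicName = topicFor(purchase)
```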
And why did we choose Kafka? The idea of topics, producers and consumers isn’t new, and it’s common to many messaging platforms, but that isn’t why we picked it. We chose Kafka because it addresses the issues we had in our previous pipeline in a more elegant way. Plus, it allows us to do more. For example, Kafka:
- Can retain the data in its topics for a certain amount of time, which makes it possible to replay past events and prevent data loss;
- Allows multiple services to register as consumers of a single topic, which is perfect for distributing data among other services within the company;
- Provides data in real time, making it perfect for building real-time applications and dashboards;
- Decreases the latency with which data events move from the platform into BI;
- Provides tools for stream processing.
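Retention, for instance, is just a topic-level setting. Assuming a standard Kafka installation, creating a topic that keeps seven days of data could look roughly like this; the topic name and address are placeholders, and depending on your Kafka version the tool may expect `--zookeeper` instead of `--bootstrap-server`:

```
# Create a topic that retains records for 7 days (retention.ms is in milliseconds)
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic purchase-transactions \
  --partitions 1 --replication-factor 1 \
  --config retention.ms=604800000
```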
Kafka is depicted as a data pipeline solution in this post, but in our eyes it’s much more than that. For example, it lets us better distribute the data, and gives us more control over applications that use it. Kafka also helps us create real-time applications for several purposes, such as a real-time dashboard or a real-time fraud detection system.
Kafka as data pipeline
And how did we improve our data pipeline? We came up with a topic topology and defined which services would put data in and take data out. The illustration below shows how it works.
The platform outputs its data into its own event queue; this could be an SQS queue or any other queuing system. Then, the Ingestion Service picks up the data from there, applies any necessary transformations and inserts it into the corresponding Kafka topic. Finally, the Database Consumption Service picks up the data from the Kafka topics and inserts it into the BI database. Note that the BI database isn’t requesting data anymore as it was before. Instead, in this new setup, it passively receives the data.
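Stripped of the actual Kafka client calls, the Ingestion Service boils down to a small transform-and-route step. The sketch below is plain Scala with made-up event types and field names: take a raw queue message, validate it, and decide which topic it belongs on. In the real service, the final step would be a send via the Kafka Java client rather than a returned value.

```scala
// A raw message as it might arrive from the platform's event queue.
// The field names here are illustrative, not our actual payload.
case class RawMessage(eventType: String, payload: Map[String, String])

// What the Ingestion Service would hand to a Kafka producer:
// a target topic plus the record value.
case class Routed(topic: String, value: Map[String, String])

// Validate and route one message; unknown event types are rejected so that
// malformed data never reaches a topic with a defined schema.
def ingest(msg: RawMessage): Either[String, Routed] = msg.eventType match {
  case "purchase" => Right(Routed("purchase-transactions", msg.payload))
  case "login"    => Right(Routed("device-logins", msg.payload))
  case other      => Left(s"unknown event type: $other")
}

val routed = ingest(RawMessage("purchase", Map("userId" -> "user-42")))
```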
The Ingestion and Database Consumption services are quite simple. They were written in Scala, using the Kafka Java client library to produce and consume messages.
It’s important to note that we’ve created one topic per event type. One reason for that is to organize the data more intuitively. Another, and the most important one, is to properly define the data schema for each topic. As you might guess, each event type contains different fields and a different data structure. To make sure that all the data within a topic follows the same definition, we define its schema using Apache Avro (https://avro.apache.org/), a data serialization library that provides rich data structures and serializes data into a compact binary format, which makes it both faster to process and smaller to store.
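For illustration, an Avro schema for a simplified, hypothetical purchase transaction topic could look like the record below; our real schemas carry more fields, and the namespace here is a placeholder:

```
{
  "type": "record",
  "name": "PurchaseTransaction",
  "namespace": "com.example.events",
  "fields": [
    {"name": "userId",      "type": "string"},
    {"name": "storeId",     "type": "string"},
    {"name": "amountCents", "type": "long"},
    {"name": "timestamp",   "type": "long"}
  ]
}
```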
Summing things up
This first Kafka application within our company, the data pipeline, was a very exciting one. It taught us how to work with Kafka and what to expect from it. In turn, Kafka has proven itself to be the right solution for us: not only because of its performance as a data pipeline, but also (and especially) because it enables us, as a company, to easily distribute and process data among many other applications in (near) real time.
It’s important to say that the Kafka setup discussed in this article shows only one consumer of the data: the BI database. That said, Kafka centralizes all the data flowing through it, making it easy to distribute to other applications, such as support tools, recommendation engines, fraud detection systems and so on. In the same way, the applications consuming data can also produce data into Kafka. The fact that Kafka is fault-tolerant increases our trust in it, making it a good tool for distributing data.
There are many more details to discuss about Kafka (such as topic partitions, offsets and the deployment process) and the technologies that surround it (like ZooKeeper, Schema Registry, Avro, Kafka Streams and KSQL), and we’ll write about them too. But the basics are simple to understand and easy to start with. If you have some basic Docker skills, you can try Kafka using Fast Data Dev by Landoop (https://github.com/Landoop/fast-data-dev). It’s a very easy and fast way to run your own Kafka cluster on your local machine. Just follow the instructions on the main GitHub page and you’re good to go!
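For reference, at the time of writing, spinning up the Landoop image came down to a single docker run along these lines; the exact ports and flags come from the project’s README, so do check the repository for the current invocation:

```
# 3030 = web UI, 9092 = Kafka broker, 2181 = ZooKeeper, 8081 = Schema Registry.
# ADV_HOST makes the broker advertise an address reachable from your host.
docker run --rm \
  -p 2181:2181 -p 3030:3030 -p 8081:8081 -p 9092:9092 \
  -e ADV_HOST=127.0.0.1 \
  landoop/fast-data-dev
```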
Stay tuned for more posts about Kafka.