If you work in the IT industry, you have probably already heard of Kafka. There are plenty of articles on how Kafka works from a technical standpoint, but very few on exactly why it is needed and what use cases it serves. This article is my humble attempt at the latter.
Overview: Kafka as Message Queue
Let's first understand that Kafka is a message queue. A Message-Producer publishes a message onto a Kafka queue, from which a Message-Consumer consumes it. Now, why is a message queue needed in the first place? The answer is that it acts as an intermediate communication layer that helps various modules (a.k.a. services) decouple from each other. This is the basis of a microservices-based architecture.
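The decoupling idea can be sketched in a few lines of plain Python. This is a stand-in for a message queue, not the Kafka API: the hypothetical "order" producer and "billing" consumer know only about the queue, never about each other.

```python
import queue
import threading

# A pure-Python stand-in for a message queue (not Kafka itself):
# producer and consumer communicate only through the queue.
message_queue = queue.Queue()

def producer():
    # A hypothetical order service publishes events without knowing
    # who (if anyone) will consume them.
    for order_id in range(3):
        message_queue.put({"event": "order_placed", "order_id": order_id})
    message_queue.put(None)  # sentinel: no more messages

def consumer(processed):
    # A hypothetical billing service consumes events without knowing
    # who produced them.
    while True:
        message = message_queue.get()
        if message is None:
            break
        processed.append(message["order_id"])

processed = []
t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer, args=(processed,))
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
print(processed)  # → [0, 1, 2]: the consumer saw every published event
```

Because neither side holds a reference to the other, either service can be replaced, restarted, or scaled independently, which is exactly the decoupling a message queue buys you.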
Kafka's Advantages
Next, let us understand how Kafka's founders viewed data, which is fundamental to why the platform was created in the first place. They believed that instead of focusing on piles of static data in relational databases, caches, and key-value stores, we should focus on data in real time, as and when it is captured. Let's understand this by figuring out what data is generated during one user session on Netflix. Events correspond to various user activities: the user logs in, browses different genres, watches a preview, selects and plays a movie, pauses it, and resumes it after a while. Across all these events, appropriate actions, like recommendations or resuming playback on another device, have to be computed and fed back to the user in real time. That 'real time' is where Kafka fits in. Kafka allows events to be published and consumed by various services with high throughput (millions of records per second) and low latency (under 20 milliseconds), and can thus keep up with the volumes of data generated in today's systems.
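One detail worth seeing is how all of a user's session events stay in order: Kafka routes each message to a partition by hashing its key, so events keyed by the same user ID always land on the same partition. The sketch below uses a simplified hash in place of Kafka's actual default partitioner (which uses murmur2); the topic name and partition count are made-up examples.

```python
import hashlib

NUM_PARTITIONS = 6  # hypothetical partition count for a "user-activity" topic

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Simplified stand-in for Kafka's default partitioner (which uses
    # murmur2): hash the message key, take it modulo the partition count.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event in one Netflix-style viewing session is keyed by user ID,
# so the whole session maps to a single partition and is consumed in order.
session_events = ["login", "browse", "preview", "play", "pause", "resume"]
partitions = {partition_for("user-42") for _ in session_events}
print(partitions)  # one partition for the entire session
```

Partitioning is also what enables the throughput figures above: different users hash to different partitions, which can be written and read in parallel across the cluster.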
Kafka wasn't the first one in the market with this idea. We already had JMS, RabbitMQ, and AMQP, but what worked in Kafka's favor were its higher throughput, reliability, and replication characteristics, well suited to today's real-time logging and analytics requirements. RabbitMQ can also process a million messages per second, but it requires a big cluster (30+ nodes) for in-memory operation and is thus less attractive from a hardware perspective.
Another advantage of Kafka is that it allows for on-the-fly horizontal scaling and is fault tolerant. Traditional systems are limited in scalability by hardware constraints and require downtime to add new hardware; in Kafka, adding a new machine requires no downtime, and there is no limit to the number of machines in a cluster. As for fault tolerance, many non-distributed systems have a single point of failure. In Kafka, by contrast, a 3-node cluster can continue to work even if two nodes go down.
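The arithmetic behind the 3-node claim is simple: each partition is copied to N brokers (the replication factor), and the cluster keeps serving as long as at least `min.insync.replicas` copies survive. A minimal sketch of that rule, assuming the common setting of one required in-sync replica:

```python
def tolerable_failures(replication_factor: int, min_insync_replicas: int = 1) -> int:
    # With N copies of each partition, up to N - min.insync.replicas
    # brokers can fail while the partition stays readable and writable.
    return max(replication_factor - min_insync_replicas, 0)

# The 3-node example from the text: replication factor 3 means the
# cluster survives the loss of 2 brokers.
print(tolerable_failures(3))  # → 2

# Stricter durability (min.insync.replicas=2, often paired with acks=all)
# trades some of that headroom away: only 1 broker may fail.
print(tolerable_failures(3, min_insync_replicas=2))  # → 1
```

In practice the trade-off is between availability and durability: raising `min.insync.replicas` guarantees more copies of every acknowledged write, at the cost of tolerating fewer broker failures.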
Usage
Coming to usage in the market, according to HG Insights [1], approximately 20,000 companies use Kafka, including LinkedIn, Spotify, Uber, JP Morgan Chase, New York Times, Shopify, Cisco, Cloudflare, and Netflix. Let's look at some of the use cases:
- Uber uses Apache Kafka as a message bus connecting different parts of its ecosystem. It collects system and application logs as well as event data from the rider and driver apps, such as location coordinates, and uses this for computing the nearest vehicle, the exact route taken, the fare, and so on. Uber handled over a trillion messages per day (as of 2017) across tens of thousands of topics.
- Netflix, which we covered above, generates ~500 billion events and ~1.3 PB of data per day from video viewing activity, UI activity, error logs, performance events, and troubleshooting & diagnostic events.
- New York Times uses Kafka to connect multiple Content Management Systems, third-party data, and wire stories on one side with a range of services and applications on the other: search engines, personalization services, feed generators, and all the different front-end applications, like the website and the native apps, that need access to the published content. Whenever an asset is published, it is made available to all of these systems with very low latency (this is news, after all) and without data loss.
- LinkedIn handles 7 trillion messages per day, divided into 100,000 topics and 7 million partitions, stored on over 4,000 brokers. Kafka is used extensively throughout its software stack, powering use cases like activity tracking, message exchanges, and metrics gathering.
You can view [2] and [3] for more use cases.
I hope this basic info was useful!
[1] https://discovery.hgdata.com/product/apache-kafka
[2] https://blog.softwaremill.com/who-and-why-uses-apache-kafka-10fd8c781f4d
[3] https://kafka.apache.org/powered-by