Sunday, April 26, 2020

Kafka for Business Professionals

If you are in the IT industry, I am sure you have already heard about Kafka. There are a lot of articles on how Kafka works from a technical standpoint, but very few on exactly what need it addresses and what use cases it serves. This article is my humble attempt at the latter.

Overview: Kafka as Message Queue

Let's first understand that Kafka is, at its core, a message queue. This means that a Message-Producer publishes a message onto a Kafka queue, from which a Message-Consumer consumes that message. Now, why is a message queue needed in the first place? The answer is that it acts as an intermediate communication layer that helps various modules (aka services) decouple from each other: the producer does not need to know who consumes its messages, or when. This decoupling is the basis of a microservices-based architecture.
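To make the decoupling concrete, here is a minimal in-memory sketch in Python. It is not Kafka's API; `ToyQueue`, `order_service` and `email_service` are hypothetical names invented for illustration. The point is only that the producer and the consumer never call each other directly.

```python
from collections import deque


class ToyQueue:
    """In-memory stand-in for a message queue (illustrative, not Kafka's API)."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        # Producers append to the end of the queue.
        self._messages.append(message)

    def consume(self):
        # Consumers take the oldest waiting message, or None if the queue is empty.
        return self._messages.popleft() if self._messages else None


def order_service(queue):
    # Producer: publishes an event and moves on without waiting for anyone.
    queue.publish({"event": "order_placed", "order_id": 101})


def email_service(queue):
    # Consumer: picks up whatever is waiting, whenever it is ready.
    message = queue.consume()
    if message is None:
        return "nothing to do"
    return f"send confirmation for order {message['order_id']}"


queue = ToyQueue()
order_service(queue)
first_result = email_service(queue)
second_result = email_service(queue)
print(first_result)
print(second_result)
```

Either service could be rewritten, restarted or replaced without the other noticing, as long as both agree on the shape of the message.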

Kafka's Advantages

Next, let us understand how Kafka's founders viewed data, which is fundamental to why such a platform was created in the first place. They believed that instead of focusing on piles of static data in relational databases, caches, and key-value stores, we should focus on data in real time, as and when it's captured. Let's understand this by looking at the data generated in one user session on Netflix. Data is generated by one or more events corresponding to various user activities, viz. logging in, browsing different genre options, watching a preview, selecting and playing a movie, pausing it, and then resuming it after a while. Across all these events, appropriate actions, like making recommendations or resuming playback on another device, have to be generated and fed back to the user in real time. The 'real-time' in this example is where Kafka fits in. Kafka allows events to be published and consumed by various services with high throughput (on the order of a million records per second) and low latency (less than 20 milliseconds), and thus caters to the high volume of data generated in today's systems.
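The session example above can be sketched as a toy event log in Python. This is an illustration under stated assumptions, not Kafka's client API: `ToyTopic`, the consumer names and the event fields are all hypothetical. It shows one idea only: events are appended once, and each service reads the same stream at its own pace by remembering its own position.

```python
class ToyTopic:
    """Append-only event log; each consumer tracks its own read position."""

    def __init__(self):
        self._log = []
        self._offsets = {}  # consumer name -> index of the next unread event

    def publish(self, event):
        self._log.append(event)

    def poll(self, consumer_name):
        # Return the events this consumer has not seen yet, then advance its position.
        start = self._offsets.get(consumer_name, 0)
        new_events = self._log[start:]
        self._offsets[consumer_name] = len(self._log)
        return new_events


session = ToyTopic()
for action in ["login", "browse_genres", "watch_preview", "play", "pause", "resume"]:
    session.publish({"user": "u1", "action": action})

# Two hypothetical services read the same events independently of each other.
recommendation_signals = [
    e["action"]
    for e in session.poll("recommendations")
    if e["action"] in ("browse_genres", "watch_preview", "play")
]
playback_actions = [
    e["action"]
    for e in session.poll("resume_on_other_device")
    if e["action"] in ("play", "pause", "resume")
]
latest_playback_state = playback_actions[-1]

print(recommendation_signals)
print(latest_playback_state)
```

Reading the log does not remove anything from it, so a new service added later could start from the beginning and see the full session.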

Kafka wasn’t the first in the market with this idea. We already had JMS, RabbitMQ, and AMQP, but what worked in Kafka's favor was higher throughput, reliability, and replication characteristics suited to today's real-time logging and analytics requirements. RabbitMQ can also process a million messages per second, but it requires a big cluster (30+ nodes) for in-memory operations, which makes it less attractive from a hardware perspective.

Another advantage of Kafka is that it allows for on-the-fly horizontal scaling and is fault tolerant. Traditional systems are limited in scalability by hardware limits and by the downtime needed to add new hardware. With Kafka, adding a new machine does not require downtime, and there is no fixed limit on the number of machines you can have in your cluster. As for fault tolerance, many non-distributed systems have a single point of failure. In Kafka, on the other hand, a 3-node cluster that keeps a copy of the data on every node (a replication factor of 3) can continue to serve that data even if two nodes go down.
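The fault-tolerance claim rests on replication: the same records are kept on several machines. The Python sketch below is a deliberately simplified model with hypothetical names (`ToyCluster`, `fail`, `read_all`). It ignores leader election and in-sync replica tracking, which real Kafka uses, and only shows why three copies survive the loss of two nodes while a single copy does not.

```python
class ToyCluster:
    """Simplified model: each record is copied to a fixed set of replica brokers."""

    def __init__(self, broker_names, replication_factor):
        if replication_factor > len(broker_names):
            raise ValueError("replication factor cannot exceed the number of brokers")
        self.storage = {name: [] for name in broker_names}
        self.alive = set(broker_names)
        self.replicas = list(broker_names[:replication_factor])

    def publish(self, record):
        # Write the record to every replica that is currently up.
        for name in self.replicas:
            if name in self.alive:
                self.storage[name].append(record)

    def fail(self, name):
        # Simulate a machine going down.
        self.alive.discard(name)

    def read_all(self):
        # Serve the data from any replica that is still up.
        for name in self.replicas:
            if name in self.alive:
                return list(self.storage[name])
        raise RuntimeError("no live replica holds this data")


replicated = ToyCluster(["node1", "node2", "node3"], replication_factor=3)
replicated.publish("event-1")
replicated.publish("event-2")
replicated.fail("node1")
replicated.fail("node2")
survivors_view = replicated.read_all()  # still served by node3

unreplicated = ToyCluster(["node1", "node2", "node3"], replication_factor=1)
unreplicated.publish("event-1")
unreplicated.fail("node1")
try:
    unreplicated.read_all()
    data_lost = False
except RuntimeError:
    data_lost = True

print(survivors_view, data_lost)
```

The trade-off is storage: three copies cost three times the disk, which is why the replication factor is a per-use-case choice.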

Usage

Coming to usage in the market, according to HG Insights [1], approx. 20,000 companies use Kafka, including LinkedIn, Spotify, Uber, JP Morgan Chase, New York Times, Shopify, Cisco, CloudFlare, and Netflix. Let's look at some of the use cases -

  • Uber uses Apache Kafka as a message bus for connecting different parts of its ecosystem. They collect system and application logs as well as event data from the rider and driver apps, viz. location coordinates of the rider and driver, and use this for computing the nearest vehicle, the exact route taken by the vehicle, the price, etc. They handle a trillion+ messages per day (info from 2017) over tens of thousands of topics.
  • Netflix, which we covered above, has ~500 billion events and ~1.3 PB of data per day generated from video viewing activities, UI activities, error logs, performance events, and troubleshooting & diagnostic events.
  • New York Times uses Kafka to connect multiple Content Management Systems, third-party data and wire stories on one side and a range of services and applications like search engines, personalization services, feed generators, as well as all the different front-end applications, like the website and the native apps that need access to this published content on the other side. Whenever an asset is published, it is made available to all these systems with very low latency — this is news, after all — and without data loss.
  • LinkedIn handles 7 trillion messages per day, divided into 100,000 topics and 7M partitions, stored across 4000 brokers. Kafka is used extensively throughout its software stack, powering use cases like activity tracking, message exchange, and metric gathering.
You can view [1] and [2] for more use cases.
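To put the daily figures above on a more intuitive scale, the short Python calculation below converts them to per-second and per-broker averages. It assumes traffic is spread evenly over the day and that 1 PB means 10^15 bytes; real traffic is bursty, so treat the results as rough averages rather than measured rates.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures quoted in the use cases above.
linkedin_messages_per_day = 7 * 10**12
linkedin_topics = 100_000
linkedin_partitions = 7_000_000
linkedin_brokers = 4_000
netflix_events_per_day = 500 * 10**9
netflix_bytes_per_day = 1_300 * 10**12  # ~1.3 PB, assuming 1 PB = 10**15 bytes
uber_messages_per_day = 10**12

# Average rates, assuming an even spread over 24 hours.
linkedin_per_second = linkedin_messages_per_day / SECONDS_PER_DAY
netflix_per_second = netflix_events_per_day / SECONDS_PER_DAY
uber_per_second = uber_messages_per_day / SECONDS_PER_DAY

# Structural averages.
partitions_per_broker = linkedin_partitions // linkedin_brokers
partitions_per_topic = linkedin_partitions // linkedin_topics
netflix_bytes_per_event = netflix_bytes_per_day // netflix_events_per_day

print(f"LinkedIn: ~{linkedin_per_second / 1e6:.0f} million messages per second")
print(f"Netflix:  ~{netflix_per_second / 1e6:.1f} million events per second")
print(f"Uber:     ~{uber_per_second / 1e6:.1f} million messages per second")
print(f"LinkedIn: {partitions_per_broker} partitions per broker on average")
print(f"LinkedIn: {partitions_per_topic} partitions per topic on average")
print(f"Netflix:  ~{netflix_bytes_per_event} bytes per event on average")
```

In other words, LinkedIn's figure works out to roughly 81 million messages every second, and each Netflix event averages about 2.6 KB.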

I hope this basic info was useful!

[1] https://discovery.hgdata.com/product/apache-kafka
[2] https://blog.softwaremill.com/who-and-why-uses-apache-kafka-10fd8c781f4d
[3] https://kafka.apache.org/powered-by

Saturday, April 25, 2020

Uber and Airbnb - Both sharing economy, yet so different


Uber (founded in 2009) and Airbnb (founded in 2008) together defined what is now known as the "Sharing Economy". In a June 2016 report [1], the U.S. Commerce Department attempted to define and map out the contours of this emerging business sector as follows:
  1. They use information technology (IT systems), typically available via web-based platforms, such as mobile “apps” on Internet-enabled devices, to facilitate peer-to-peer transactions.
  2. They rely on user-based rating systems for quality control, ensuring a level of trust between consumers and service providers who have not previously met.
  3. They offer the workers who provide services via digital matching platforms flexibility in deciding their typical working hours.
  4. To the extent that tools and assets are necessary to provide a service, digital matching firms rely on the workers using their own.

Both Uber and Airbnb tend to meet the above criteria. Despite this, they are different in two main respects -
The first relates to the geographical scalability of the marketplace [2] -
Uber’s model relies on hyperlocal network effects, i.e. the addition of a unit of supply (a driver) makes the product more valuable for the demand side (riders) only within a small geographic radius. As a result, when Uber expands to another city, it has to re-invest in driver acquisition without the benefit of any latent demand. Airbnb’s model, on the other hand, is built on cross-border network effects, i.e. the addition of a unit of supply (a host) makes the product more valuable for the demand side (guests) across geographic boundaries. Uber faced local competitors like Didi in China, Ola in India, and Grab in SE Asia, who replicated the same model in their own markets. Airbnb, in contrast, faced very little competition from regional startups, since their supply outside their home region was limited.

The second relates to the commoditization of supply [3] -
Uber’s suppliers are interchangeable (or commoditized), i.e. customers just want a ride and are not particularly sensitive to driver identity or vehicle brand. Riders are also indifferent to wait times below a certain threshold. Airbnb, on the other hand, has differentiated supply, i.e. each unit of supply is unique to some degree across a number of attributes: type of property, quality, nightly rate, location, capacity, etc. Thus, as Airbnb scaled and offered choices across each of these attributes, it became more and more difficult for its competitors to match the variety, while in the case of Uber it was much easier to build a similar network of vehicle types and drivers.

These two differences account for why the valuation-to-funding ratio is so different for the two companies.

Credits to Sameer Singh from Breadcrumb.vc for insights into this.

Is Zoom sustainable?

Given the usage that Zoom has seen in a Covid-affected world, the question is whether Zoom's business model is sustainable. This article [1] from Sameer Singh explains that Zoom's core business model is based on the virality of its product and not on a network effect. Other video conferencing tools (Skype, Facetime) need you to be part of their network to enjoy their features, but Zoom has no such limitation. A Zoom meeting host can share the meeting ID, and anyone with the link can join. Zoom makes no attempt to form a network in the first place, which could have created a high switching cost for its users. Sameer thus feels that Zoom's business is susceptible to being overtaken by a similar video conferencing app with a better customer experience or better technology.