Kafka Fundamentals

Priyanka Singh
7 min read · Aug 29, 2020

This post is for all the beginners in big data who want to start their career in this field. I am going to explain Kafka in a simple way, so that any IT professional who is curious about the concept can spend a few minutes reading this article and come away with a working knowledge of Kafka.
It will take you through a brief understanding of Kafka: how it became so popular in so little time, how it works, its architecture, and more.

History

To accommodate its growing membership and increasing site complexity, LinkedIn needed to completely redesign its infrastructure. LinkedIn's team had already migrated from a monolithic application infrastructure to one based on microservices, after which search, profile, communications, and other platforms started working more efficiently.

They initially developed several different custom data pipelines for their various streaming and queuing data. The use cases for these platforms ranged from tracking site events like page views to gathering aggregated logs from other services. These needed to scale along with the site. Rather than maintaining and scaling each pipeline individually, they invested in the development of a single, distributed pub-sub platform. Thus, Kafka was born!

Apache Kafka was originally developed by LinkedIn, and was subsequently open sourced in early 2011.

What is Kafka?

Apache Kafka is a fast, horizontally scalable, fault-tolerant messaging system that enables communication between producers and consumers using message-based topics. It also allows a large number of permanent or ad-hoc consumers. One of Kafka's best features is that it is highly available, resilient to node failures, and supports automatic recovery. This makes Apache Kafka ideal for communication and integration between components of large-scale, real-world data systems.

In short, Kafka is used for real-time stream processing, website activity tracking, metrics collection and monitoring, log aggregation, real-time analytics, complex event processing (CEP), ingesting data into Spark or Hadoop, CQRS, message replay, error recovery, and as a guaranteed distributed commit log for in-memory computing (microservices).

Messaging System
A messaging system is a system used for transferring data from one application to another, so that the applications can focus on the data rather than on how to share it. Kafka is a distributed publish-subscribe messaging system. In a publish-subscribe system, messages are persisted in a topic. Message producers are called publishers and message consumers are called subscribers. Consumers can subscribe to one or more topics and consume all the messages in those topics (we will discuss these terms later in the post).

How does it work? (Architecture)

Entities of Kafka
All Kafka messages are organised into topics. If you wish to send a message you send it to a specific topic and if you wish to read a message you read it from a specific topic. A consumer pulls messages off of a Kafka topic while producers push messages into a Kafka topic. Lastly, Kafka, as a distributed system, runs in a cluster. Each node in the cluster is called a Kafka broker.
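
To make this concrete, here is a minimal producer sketch in Java (the language of Kafka's native client). The broker address localhost:9092 and the topic name page-views are hypothetical placeholders for this example, not anything a particular setup requires:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address; any broker in the cluster works as an entry point.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Push one key-value message into the hypothetical "page-views" topic.
            producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /home"));
        } // close() flushes any buffered messages before exiting
    }
}
```

A consumer on the other side would subscribe to the same topic and pull these messages at its own pace.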

Data in Kafka

Figure: Kafka storing data in a topic.

Kafka stores key-value messages that come from arbitrarily many processes called producers. The data can be partitioned into different “partitions” within different “topics”. Within a partition, messages are strictly ordered by their offsets (the position of a message within a partition), and indexed and stored together with a timestamp. Other processes called “consumers” can read messages from partitions. For stream processing, Kafka offers the Streams API that allows writing Java applications that consume data from Kafka and write results back to Kafka. Apache Kafka also works with external stream processing systems such as Apache Apex, Apache Flink, Apache Spark, and Apache Storm.
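
As an illustration of those terms, here is a minimal consumer sketch that prints each record's partition, offset, and timestamp. The broker address, group id, and topic name are assumptions made for the example:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("group.id", "page-view-readers");       // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Every record carries its partition, offset, and timestamp.
                    System.out.printf("partition=%d offset=%d ts=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.timestamp(),
                            record.key(), record.value());
                }
            }
        }
    }
}
```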

How does Kafka work?

Kafka runs on a cluster of one or more servers (called brokers), and the partitions of all topics are distributed across the cluster nodes. Additionally, partitions are replicated to multiple brokers. This architecture allows Kafka to deliver massive streams of messages in a fault-tolerant fashion and has allowed it to replace some of the conventional messaging systems like Java Message Service (JMS), Advanced Message Queuing Protocol (AMQP), etc. Since the 0.11.0.0 release, Kafka offers transactional writes, which provide exactly-once stream processing using the Streams API.
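
Here is a rough sketch of what a transactional write looks like with the Java producer. The transactional.id, broker address, and topic names are made up for the example, and a production application would also need to handle fatal errors such as ProducerFencedException by closing the producer:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("transactional.id", "orders-tx-1");     // hypothetical transactional id
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Both writes become visible together, or not at all.
                producer.send(new ProducerRecord<>("orders", "order-1", "created"));
                producer.send(new ProducerRecord<>("order-audit", "order-1", "created"));
                producer.commitTransaction();
            } catch (Exception e) {
                // On a recoverable error, abort so no partial writes are exposed.
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```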

Kafka supports two types of topics: regular and compacted. Regular topics can be configured with a retention time or a space bound. If there are records older than the specified retention time, or if the space bound is exceeded for a partition, Kafka is allowed to delete old data to free storage space. By default, topics are configured with a retention time of 7 days, but it's also possible to store data indefinitely. For compacted topics, records don't expire based on time or space bounds. Instead, Kafka treats later messages as updates to older messages with the same key and guarantees never to delete the latest message per key. Users can delete messages entirely by writing a so-called tombstone message with a null value for a specific key.
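
As a sketch of how a compacted topic might be created programmatically, the following uses the AdminClient. The topic name user-profiles, the partition count, and the replication factor are arbitrary example values:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker

        try (AdminClient admin = AdminClient.create(props)) {
            // A compacted topic: Kafka keeps at least the latest record per key.
            NewTopic topic = new NewTopic("user-profiles", 3, (short) 1)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singleton(topic)).all().get();
            // To delete a key later, a producer sends a record with a null value (a tombstone).
        }
    }
}
```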

Key terms of Kafka architecture: topics, brokers, ZooKeeper, producers, and consumers.

Topic: A producer writes records to a topic and consumers listen to it. A topic can have many partitions but must have at least one.
Broker: An instance in a Kafka cluster is called a broker. In a Kafka cluster, if you connect to any one broker, you will be able to access the entire cluster. The broker instance that we connect to in order to access the cluster is known as a bootstrap server. Each broker is identified by a numeric ID in the cluster. Three brokers is a good starting point for a Kafka cluster, but there are clusters with hundreds of brokers.
ZooKeeper: Used for managing and coordinating Kafka brokers. The ZooKeeper service mainly notifies producers and consumers about the presence of a new broker in the Kafka system, or the failure of an existing one.
Producer: Creates a record and publishes it to the broker.
Consumer: Consumes records from the broker.

Kafka runs as a cluster on one or more servers.
The Kafka cluster stores a stream of records in categories called topics.
Each record consists of a key, a value, and a timestamp.

Four Core APIs

  1. Producer API: Allows clients to connect to Kafka servers running in the cluster and publish streams of records to one or more Kafka topics.
  2. Consumer API: Allows clients to connect to Kafka servers running in the cluster and consume streams of records from one or more Kafka topics.
  3. Streams API: Allows clients to act as stream processors by consuming streams from one or more topics and producing streams to other output topics, transforming input streams into output streams along the way (see the sketch after this list).
  4. Connector API: Allows writing reusable producer and consumer code; for example, reading data from an RDBMS and publishing it to a topic, or consuming data from a topic and writing it to an RDBMS. We can create reusable source and sink connector components for various data sources.
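
The sketch below illustrates the Streams API from point 3: it consumes a hypothetical raw-events topic, transforms each value, and writes the result to a hypothetical clean-events topic. The application id and topic names are assumptions for the example:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume from an input topic, transform each value, produce to an output topic.
        KStream<String, String> input = builder.stream("raw-events");
        input.mapValues(value -> value.toUpperCase())
             .to("clean-events");

        new KafkaStreams(builder.build(), props).start();
    }
}
```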

Why is Kafka getting all the love?

  1. Reliability. Kafka is distributed, partitioned, replicated, and fault tolerant. Kafka replicates data and is able to support multiple subscribers. Additionally, it automatically balances consumers in the event of failure.
  2. Scalability. Kafka is a distributed system that scales quickly and easily without incurring any downtime.
  3. Durability. Kafka uses a distributed commit log, which means messages are persisted on disk as quickly as possible and replicated within the cluster, hence it is durable.
  4. Performance. Kafka has high throughput for both publishing and subscribing to messages. It maintains stable performance even when dealing with many terabytes of stored messages.

Who uses Kafka?

A lot of large companies that handle a lot of data use Kafka. LinkedIn, where it originated, uses it to track activity data and operational metrics. Twitter uses it alongside Storm to provide a stream processing infrastructure. Square uses Kafka as a bus to move all system events (logs, custom events, metrics, and so on) to its various data centers, with outputs to Splunk and to Graphite (for dashboards), and to implement Esper-like/CEP alerting systems.

It’s also used by other companies like Spotify, Uber, Tumblr, Goldman Sachs, PayPal, Box, Cisco, CloudFlare, and Netflix.

Apache Kafka is a software platform; the following points describe why it is needed.

  1. Apache Kafka is capable of handling millions of messages per second.
  2. Apache Kafka works as a mediator between the source system and the target system: the source system (producer) sends data to Kafka, which decouples it from the target system (consumer) that consumes the data from Kafka.
  3. Apache Kafka has extremely high performance, with latencies often below 10 ms.
  4. Apache Kafka has a resilient architecture that removes much of the complexity from data sharing.
  5. Organizations such as Netflix, Uber, and Walmart, along with thousands of other firms, make use of Apache Kafka.
  6. Apache Kafka is able to maintain fault tolerance. Sometimes a consumer successfully receives a message from the producer but fails to process it, due to a backend database failure or a bug in the consumer code. Because Kafka retains messages, the consumer can read them again and reprocess the data (see the sketch after this list).
  7. Learning Kafka is also a good career investment for those who wish to raise their income in the future.
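
To illustrate the reprocessing behaviour from point 6, here is a sketch of a consumer that turns off auto-commit and commits offsets only after processing succeeds, so a crash or exception causes the same records to be re-delivered and reprocessed. The broker address, group id, topic name, and process method are all hypothetical:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("group.id", "order-processors");        // hypothetical group
        props.put("enable.auto.commit", "false");          // commit only after successful processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // if this throws, the offset is never committed
                }
                // Offsets are committed only here, so a failure before this point
                // means the same records are re-delivered and reprocessed.
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // Hypothetical processing step, e.g. writing to a backend database.
        System.out.println("processed " + record.value());
    }
}
```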

If you enjoyed this story, please click the 👏 button and share to help others find it!
