- Kafka allows us to decouple data streams and applications
- Created by LinkedIn, now an open-source project mainly maintained by Confluent
- Horizontal scalability:
    - Can scale to 100s of brokers
    - Can scale to millions of messages per second
- High performance (latency of less than 10ms) - real time
- Used by 2000+ firms
Apache Kafka: Use cases
- Messaging System
- Activity Tracking
- Gather metrics from many different locations
- Application Logs gathering
- Stream processing (with the Kafka Streams API or Spark for example)
- De-coupling of system dependencies
- Integration with Spark, Flink, Storm, Hadoop and many other Big Data technologies
How Kafka is used in companies:
- Netflix uses Kafka to apply recommendations in real time while users watch TV shows
- Uber uses Kafka to gather user, taxi, and trip data in real time to compute and forecast demand and to compute surge pricing in real time
- LinkedIn uses Kafka to prevent spam and to collect user interactions to make better connection recommendations in real time
Topics, partitions and offsets
- Topics: a particular stream of data
- Similar to a table in a database (without all the constraints)
- You can have as many topics as you want
- Topic is identified by its name
- Topics are split in partitions
    - Each partition is ordered
    - Order is guaranteed only within a partition (not across partitions)
    - Each message within a partition gets an incremental offset
- When a topic is created, the number of partitions must be defined
- When a topic is created, a replication factor should be chosen, usually greater than 1 (see the sketch after this list)
- Data within a partition is kept only for a limited time (the default retention is one week)
- Data inside a partition is immutable: once data is written to a partition, it can't be changed
- Data is assigned randomly to a partition unless a key is provided
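
A minimal sketch of creating a topic with an explicit partition count and replication factor, assuming a broker on localhost:9092 and the kafka-python client (the topic name and the counts are illustrative):

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Connecting to any one broker bootstraps the whole cluster
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# 3 partitions, each replicated on 2 brokers
topic = NewTopic(name="user-events", num_partitions=3, replication_factor=2)
admin.create_topics([topic])
admin.close()
```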
Brokers
A Kafka cluster is composed of multiple brokers (servers); the brokers hold the topics, with partitions distributed among them
- Each broker is identified by its ID (an integer)
- Connection to any broker (called a bootstrap broker) means you are connected to the entire cluster
- A good number to get started is 3 brokers, but some big clusters have over 100 brokers
Concept of Leader for a Partition
- At any time only ONE broker can be a leader for a given partition
- Only that leader can receive and serve data for a partition
- The other brokers will synchronize the data
- Each partition has one leader and multiple ISRs (in-sync replicas)
Producers
- Producers write data to topics
- Producers automatically know which broker and partition to write to
- In case of Broker failures, Producers will automatically recover
- Producers can choose to receive acknowledgment of data writes (see the sketch below):
    - acks=0: producer won't wait for acknowledgment (possible data loss)
    - acks=1: producer waits for the leader's acknowledgment (limited data loss); selected by default
    - acks=all: producer waits for the leader's and the replicas' acknowledgments (no data loss)
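A minimal producer sketch with acks=all, again assuming a local broker and the kafka-python client (the topic name is illustrative):

```python
from kafka import KafkaProducer

# acks="all": the leader waits for all in-sync replicas
# before acknowledging the write
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")

producer.send("user-events", value=b"hello kafka")
producer.flush()  # block until all buffered messages are acknowledged
producer.close()
```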
Producers: Message keys
- Producers can choose to send a key with the message (a string, number, etc.)
- If key=null, data is sent round-robin across partitions
- If a key is sent, then all messages with that key will always go to the same partition
- A key is typically sent when you need message ordering for a specific field (for example, all events of one user)
- We don't choose which key goes to which partition, but we know that messages with the same key end up in the same partition (see the sketch below)
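
A sketch of keyed sends under the same assumptions; because both messages carry the same key, they land in the same partition and keep their order:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Same key -> same partition, so these two events stay ordered
producer.send("user-events", key=b"user-42", value=b"page_view")
producer.send("user-events", key=b"user-42", value=b"click")
producer.flush()
producer.close()
```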
Consumers
- Consumers read data from a topic (identified by name)
- Consumers know which broker to read from
- In case of broker failures, consumers know how to recover
- Data is read in order within each partition
- If there are more consumers in a group than partitions, some consumers will be inactive (see the consumer sketch below)
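
A minimal consumer sketch under the same assumptions (local broker, kafka-python, illustrative topic and group names):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",          # consumers in the same group share partitions
    auto_offset_reset="earliest",   # start from the beginning if no committed offset
)

for msg in consumer:
    print(msg.partition, msg.offset, msg.key, msg.value)
```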
Consumer Offsets
- Kafka stores the offsets at which a consumer group has been reading
- The committed offsets live in a Kafka topic named __consumer_offsets
- When a consumer in a group has processed the data received from Kafka, it should commit the offsets
- If a consumer dies, it will be able to resume from where it left off, because it committed its offsets to that Kafka topic
Delivery semantics for Consumers
There are three delivery semantics for when a consumer commits its offsets:
- At most once:
    - offsets are committed as soon as the message is received
    - if the processing goes wrong, the message is lost (it won't be read again, because the consumer already committed the offset and Kafka moved past it)
- At least once (see the sketch after this list):
    - offsets are committed after the message is processed
    - if the processing goes wrong, the message will be read again
    - the application developer must be cautious here: the same message may be processed multiple times, so processing should be idempotent
- Exactly once:
    - can be achieved for Kafka-to-Kafka workflows using the Kafka Streams API
    - for Kafka-to-external-system workflows, an idempotent consumer must be used
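
A sketch of at-least-once consumption with kafka-python: auto-commit is disabled and offsets are committed only after processing (handle_message is a hypothetical processing function):

```python
from kafka import KafkaConsumer

def handle_message(msg):
    # Hypothetical processing; should be idempotent, since a crash
    # before the commit makes Kafka redeliver the message
    print(msg.value)

consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",
    enable_auto_commit=False,  # we commit manually, after processing
)

for msg in consumer:
    handle_message(msg)
    consumer.commit()  # committing after processing gives at-least-once
```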
Kafka Broker Discovery
- Every Kafka broker is also called a “bootstrap server”
- That means that you only need to connect to one broker, and you will be connected to the entire cluster
- Each broker knows about all brokers, topics and partitions (metadata)
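
An illustrative sketch of this: connecting to a single bootstrap broker is enough to discover metadata for the whole cluster (again assuming kafka-python and a broker on localhost:9092; the topic name is illustrative):

```python
from kafka import KafkaConsumer

# One bootstrap address is enough; the client fetches metadata
# about every broker, topic, and partition from it
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")

print(consumer.topics())                              # all topic names in the cluster
print(consumer.partitions_for_topic("user-events"))   # partition ids for one topic
consumer.close()
```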