Internals
Physical Storage (p128)
the basic storage unit is a partition replica; a replica cannot be split across multiple brokers, or across multiple disks on the same broker
the size of a partition is limited by the space available on a single mount point (a single disk if JBOD is used, multiple disks if RAID)
log.dirs - this config defines the directories Kafka uses to store data
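A minimal server.properties sketch of this setting (the paths are hypothetical); new partitions are spread across the listed directories:

  log.dirs=/data/kafka-a,/data/kafka-b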
Partition Allocation (p128)
spread replicas evenly among brokers by assigning partition leaders to brokers in round-robin style
for each partition, spread its replicas across different brokers by placing each replica at an increasing broker index starting from where the leader is
assign replicas of each partition to different racks by using a rack-alternating broker list
e.g. if rack1 has brokers 0, 1, 2 and rack2 has brokers 3, 4, 5, then the broker order used for assignment is 0, 3, 1, 4, 2, 5 (see the sketch after this list)
on each broker, choose the directory with the fewest partitions to store a new partition
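A minimal Java sketch (not Kafka's actual implementation) of the rack-alternating, round-robin assignment described above, using the broker/rack layout from the example:

  import java.util.*;

  public class ReplicaAssignmentSketch {
      // brokersByRack: rack name -> broker ids; returns partition -> ordered replica broker ids (first is the leader)
      static Map<Integer, List<Integer>> assign(Map<String, List<Integer>> brokersByRack,
                                                int numPartitions, int replicationFactor) {
          // build a rack-alternating broker list, e.g. rack1={0,1,2}, rack2={3,4,5} -> 0,3,1,4,2,5
          List<Iterator<Integer>> racks = new ArrayList<>();
          for (List<Integer> ids : brokersByRack.values()) racks.add(ids.iterator());
          List<Integer> brokerList = new ArrayList<>();
          boolean added = true;
          while (added) {
              added = false;
              for (Iterator<Integer> it : racks) {
                  if (it.hasNext()) { brokerList.add(it.next()); added = true; }
              }
          }
          Map<Integer, List<Integer>> assignment = new LinkedHashMap<>();
          for (int p = 0; p < numPartitions; p++) {
              List<Integer> replicas = new ArrayList<>();
              // the leader moves round-robin over the rack-alternating list;
              // each follower is placed at an increasing index from the leader's position
              for (int r = 0; r < replicationFactor; r++) {
                  replicas.add(brokerList.get((p + r) % brokerList.size()));
              }
              assignment.put(p, replicas);
          }
          return assignment;
      }

      public static void main(String[] args) {
          Map<String, List<Integer>> racks = new LinkedHashMap<>();
          racks.put("rack1", List.of(0, 1, 2));
          racks.put("rack2", List.of(3, 4, 5));
          System.out.println(assign(racks, 6, 3));   // e.g. partition 0 -> [0, 3, 1]
      }
  }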
File Management (p130)
Segment
a partition consists of a series of segments, each stored in a single file
each segment holds by default 1 GB or a week of data, whichever comes first; when the limit is reached, the segment file is closed and a new one is created
the active segment is NEVER deleted or compacted
:!: Kafka keeps an open file handle to every segment of every partition - even inactive segments - leading to a high number of open file handles, so the OS needs to be tuned accordingly
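Segment size and age can also be set per topic; a minimal sketch using the Java AdminClient (the topic name is hypothetical, and the values shown match the usual defaults):

  import org.apache.kafka.clients.admin.AdminClient;
  import org.apache.kafka.clients.admin.NewTopic;
  import org.apache.kafka.common.config.TopicConfig;
  import java.util.List;
  import java.util.Map;
  import java.util.Properties;

  public class CreateTopicWithSegmentConfig {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092");
          try (AdminClient admin = AdminClient.create(props)) {
              NewTopic topic = new NewTopic("events", 6, (short) 3)            // hypothetical topic
                      .configs(Map.of(
                              TopicConfig.SEGMENT_BYTES_CONFIG, "1073741824",  // close the segment at 1 GB...
                              TopicConfig.SEGMENT_MS_CONFIG, "604800000"));    // ...or after 7 days
              admin.createTopics(List.of(topic)).all().get();
          }
      }
  }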
File Format
Kafka uses the same message format on disk and over the wire, which is what makes zero-copy possible
the format consists of key, value, offset, message size, checksum, a magic byte for the format version, compression codec, and timestamp
:star: compressing on the producer side is recommended; it is better for both the network and the broker disks
DumpLogSegment
a tool for inspecting partition segments, showing their contents and all metadata
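The tool ships with Kafka and is run through the kafka-run-class script; a hedged example invocation (the segment file path is hypothetical):

  bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /data/kafka-a/events-0/00000000000000000000.log --print-data-log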
Index (p132)
one index per partition, mapping offsets to segment files and positions within the file
the index is also broken into segments, and is automatically regenerated when missing or corrupted
:question: implementation ?
On-disk File Structure
:question: How does Kafka store topics in Zookeeper?
:question: How does Kafka store topic and its partitions on disk?
Topic per dir under /var/kafka/data/ ?
Compaction (p132)
retention policy
delete - delete events older than the retention time
compact - store only the most recent value for each key in the topic
:!: messages MUST have non-null keys
each broker starts a compaction manager thread, plus multiple compaction threads
compaction threads choose the partition with the highest ratio of dirty messages
:question: Implementation
Deleting Event
to delete a key completely, produce a message with that key and a null value (a tombstone); see the sketch below
:!: give consumers enough time to see the tombstone message
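A minimal sketch of producing a tombstone, assuming a compacted topic (cleanup.policy=compact) named "user-profiles" with String keys; the names and serializers are illustrative:

  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import java.util.Properties;

  public class TombstoneExample {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092");
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
              // null value = tombstone: after compaction, earlier values for this key are gone
              producer.send(new ProducerRecord<>("user-profiles", "user-42", null));
              producer.flush();
          }
      }
  }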
Topic design
put all events of the same type in the same topic, and use different topics for different event types
https://www.confluent.io/blog/put-several-event-types-kafka-topic/
https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
Reliable Delivery (p138)
:star: reliability must be designed into a system from its very beginning
Kafka guarantees the order of messages within a partition
produced messages are considered "committed" when written to all in-sync replicas (but not necessarily flushed to disk; they can still be in the filesystem cache)
consumers can only read committed messages
Broker Configs (p140)
Replication Factor
topic-level: replication.factor
broker-level: default.replication.factor
Rack Awareness
broker.rack
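A hedged server.properties sketch of these broker-level settings (the rack name is hypothetical):

  default.replication.factor=3
  broker.rack=rack1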
Unclean Leader Election (p142)
unclean.leader.election.enable = true (the default) allows out-of-sync replicas to become leaders; this risks data loss and data inconsistency
unclean.leader.election.enable = false disallows out-of-sync replicas from becoming leaders; this risks availability if all in-sync replicas are down
Min In-Sync Replicas
set at both topic and broker level: min.insync.replicas
when not enough in-sync replicas are online, brokers stop accepting produce requests and return NotEnoughReplicasException to the producer client
the remaining in-sync replicas become read-only, which consumers can still consume from
this prevents data loss when an unclean election occurs
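A hedged sketch of the related settings (the values are illustrative); note that min.insync.replicas is only enforced for producers using acks=all:

  # topic- or broker-level
  min.insync.replicas=2
  # broker-level
  unclean.leader.election.enable=false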
Error Handling (p146)
Non-retriable errors
errors that occur before sending the message to the broker, e.g. serialization errors
errors that remain after all retry attempts are exhausted, or when the producer runs out of memory while buffering
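A minimal sketch of handling send errors in a producer callback (topic, keys, and the handling itself are illustrative); RetriableException marks errors the producer may retry on its own:

  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import org.apache.kafka.common.errors.RetriableException;
  import java.util.Properties;

  public class ProducerErrorHandling {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092");
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
              producer.send(new ProducerRecord<>("events", "key", "value"), (metadata, exception) -> {
                  if (exception == null) {
                      return;                                   // success
                  } else if (exception instanceof RetriableException) {
                      // a retriable error survived all automatic retries (e.g. NotEnoughReplicasException);
                      // the application can retry later or park the record somewhere
                  } else {
                      // non-retriable error (e.g. message too large); log and decide what to do
                  }
              });
              producer.flush();
          }
      }
  }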
Maintaining State (p150)
:question: approaches
Handling Long Processing Times (p150)
hand records off to a thread pool for parallel processing, pause() the consumer, and keep polling without fetching new records, to avoid rebalancing due to heartbeat timeout (see the sketch below)
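A minimal sketch of this pattern, assuming a single-threaded KafkaConsumer that hands each batch to a background task (topic, group id, and the processing itself are illustrative):

  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;
  import java.util.concurrent.CompletableFuture;

  public class PauseResumeSketch {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092");
          props.put("group.id", "slow-processors");
          props.put("enable.auto.commit", "false");
          props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
              consumer.subscribe(List.of("events"));
              while (true) {
                  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
                  if (!records.isEmpty()) {
                      // hand the batch to another thread and pause all assigned partitions
                      consumer.pause(consumer.assignment());
                      CompletableFuture<Void> done = CompletableFuture.runAsync(() -> process(records));
                      // keep polling (returns no records while paused) so the consumer stays in the group
                      while (!done.isDone()) {
                          consumer.poll(Duration.ofMillis(200));
                      }
                      consumer.resume(consumer.paused());
                      consumer.commitSync();
                  }
              }
          }
      }

      static void process(ConsumerRecords<String, String> records) {
          for (ConsumerRecord<String, String> record : records) { /* slow work goes here */ }
      }
  }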
Cluster Membership (p118)
each broker registers its ID in Zookeeper by creating an ephemeral node
the ephemeral node is deleted automatically when the broker goes down, and watchers are notified
if a broker goes down and another joins with the same ID, it gets assigned the same partitions and topics
Controller
responsible for electing partition leaders
usually the first broker started in the cluster becomes the controller, by creating an ephemeral Zookeeper node named /controller. Brokers that start later fail to create this node and thereby learn that a controller already exists.
all brokers watch the /controller node and try to become the controller when it disappears
controller epoch - each time a controller is elected it receives a new, higher number, so messages from an older controller are ignored
the controller watches all broker nodes. When a broker leaves the cluster, the controller knows which partitions had their leader on that broker; for each such partition it assigns the leader role to the next replica in the partition's replica list, and sends a request to all brokers holding a replica of that partition to notify them of the change.
when a broker joins the cluster, the controller checks the broker ID for replicas that should exist on that broker; if there are any, it notifies the related brokers, and the new broker starts replicating messages from the existing leaders
Request Processing (p122)
Components
processor (network) threads - number is configurable; they take requests from client connections and place them in a request queue, and pick responses up from a response queue and send them back to clients
acceptor thread - the broker runs one acceptor thread on each port it listens on; it creates the connection and hands it over to a processor thread
IO threads - pick requests up from the request queue and process them
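A hedged server.properties sketch of the corresponding thread counts (the values shown are the usual defaults):

  num.network.threads=3
  num.io.threads=8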
Request type
Produce request
sent by producers; contains the messages the client wants to write to the broker
the broker checks: 1) does the client have write access? 2) is acks a valid value (0, 1, or all)? 3) if acks=all, are there enough in-sync replicas?
the broker writes the messages to the Linux filesystem cache with NO guarantee about when they are written to disk; Kafka does NOT wait for the data to persist to disk
if acks=all, the request is stored in a buffer called purgatory until the leader observes that the followers have replicated the message, and only then does it respond to the client
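A minimal sketch of a producer configured so that its requests wait in purgatory until all in-sync replicas have the message (the address, topic, and serializers are illustrative):

  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerConfig;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import java.util.Properties;

  public class AcksAllProducer {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
          props.put(ProducerConfig.ACKS_CONFIG, "all");   // broker responds only after all in-sync replicas have the message
          props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
          props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
          try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
              // get() blocks until the broker has released the request from purgatory and responded
              producer.send(new ProducerRecord<>("events", "key", "value")).get();
          }
      }
  }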
Fetch request (p124)
sent by consumers or follower replicas when they read messages from brokers; specifies which topics, which partitions, and from what offsets to return data
:star: ZERO-COPY - Kafka keeps messages from producers in the filesystem cache and sends them directly to the network channel without intermediate buffers. If the producing rate matches the consuming rate, messages are served entirely from the filesystem cache and never need to be read back from disk. This removes the overhead of copying bytes and managing buffers in memory.
data size upper bound - the maximum amount of data the broker can return; because clients need to allocate memory to hold the returned messages, a response that is too large could cause an OutOfMemory issue in the client
data size lower bound - the broker only responds to the client when it has at least that much data ready or a timeout is reached; this reduces CPU and network utilization, improving throughput
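A hedged sketch of the consumer settings that control these bounds (the values are illustrative, not recommendations):

  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import java.util.Properties;

  public class FetchBoundsConfig {
      public static Properties fetchProps() {
          Properties props = new Properties();
          props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1048576");            // lower bound: wait for ~1 MB of data...
          props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");              // ...or respond after at most 500 ms
          props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, "52428800");           // upper bound for the whole fetch response
          props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "1048576");  // upper bound per partition
          return props;
      }
  }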
Metadata request
can be sent to any broker; returns, for the list of topics the client is interested in, the partitions of each topic, the replicas of each partition, and which replica is the leader
clients cache the metadata and refresh it periodically, controlled by metadata.max.age.ms
Produce/Fetch requests must be sent to the broker hosting the leader replica; there is NO redirection. Otherwise the client gets a "Not a Leader" error, and it then sends a metadata request to refresh its metadata before trying again
Replication
Leader replica
there is only one leader replica per partition
all writes go to the leader replica before any other replica
the leader needs to know which follower replicas are up-to-date with it
preferred leader
the replica that was the leader when the topic was originally created
because leader partitions are balanced across brokers when the topic's partitions are originally created, when the preferred leader is actually the real leader, load is balanced
auto.leader.rebalance.enable=true - when the preferred leader is in-sync but not the current leader, trigger an election to make it the leader
:!: spread preferred leader partitions across brokers to avoid overloading individual brokers
Follower replica
does NOT serve client requests; only replicates messages from the leader replica
can be elected as the new leader when the leader goes down
periodically sends the leader Fetch requests (containing the offset of the message the follower wants to receive next) to get messages from the leader replica and store them
out-of-sync
a follower is out-of-sync if it hasn't sent a Fetch request to the leader in more than 10 seconds, or if it has requested within 10 seconds but hasn't caught up to the most recent message for more than 10 seconds
an out-of-sync follower cannot become leader
the time window is controlled by replica.lag.time.max.ms
:star: rapid flipping between in-sync and out-of-sync is a sign of misconfiguration in the cluster, mostly wrong Java GC settings on a broker causing frequent long pauses
Delivery Semantics
Exactly-Once
Producer side
Kafka uses the transactional id plus per-partition sequence numbers to detect duplicate publishes and ignore them
producers can therefore safely retry idempotently
set the properties "enable.idempotence" = "true" and "transactional.id" = "your tx id" (see the sketch below)
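A minimal sketch of a transactional producer (topic, keys, and the transactional id are illustrative; the transactional id should stay stable across restarts of the same producer instance):

  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerConfig;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import java.util.Properties;

  public class TransactionalProducerSketch {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
          props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
          props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-tx-1");   // hypothetical transactional id
          props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
          props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
          try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
              producer.initTransactions();                      // registers the transactional id, fences older instances
              producer.beginTransaction();
              try {
                  producer.send(new ProducerRecord<>("orders", "order-1", "created"));
                  producer.send(new ProducerRecord<>("orders", "order-1", "paid"));
                  producer.commitTransaction();                 // both messages become visible atomically
              } catch (Exception e) {
                  producer.abortTransaction();                  // read_committed consumers never see aborted messages
                  throw e;
              }
          }
      }
  }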
Consumer side
the consumer knows the offset of each message, so it can persist the result and the offset together transactionally, e.g. in the same row of an RDBMS
then, after crash recovery, it resumes consuming from the latest stored offsets
set the property "isolation.level" = "read_committed"
then messages written to the topic in an uncommitted transaction are not visible to consumers
messages in the same transaction can span multiple partitions and be read by different consumers, so the Kafka broker maintains a list of all partitions updated by each transaction (see the sketch below)
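A minimal sketch of the consumer side, assuming the offset is stored together with the processing result in an external store (the topic, group id, and store helpers are illustrative stand-ins):

  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.common.TopicPartition;
  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;

  public class ReadCommittedConsumerSketch {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
          props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processor");
          props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");   // skip uncommitted/aborted messages
          props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // offsets live in the external store
          props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
          props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
              TopicPartition tp = new TopicPartition("orders", 0);              // hypothetical partition
              consumer.assign(List.of(tp));
              consumer.seek(tp, loadOffsetFromStore(tp));                       // resume from the offset saved with the results
              while (true) {
                  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                  for (ConsumerRecord<String, String> record : records) {
                      // one external transaction: write the processing result AND record.offset() + 1 together
                      saveResultAndOffsetAtomically(record);
                  }
              }
          }
      }

      static long loadOffsetFromStore(TopicPartition tp) { return 0L; }         // stand-in for a DB lookup
      static void saveResultAndOffsetAtomically(ConsumerRecord<String, String> r) { /* e.g. one RDBMS row */ }
  }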
Transaction