Building data pipelines
Considerations when building data pipelines
Timeliness
delay between producers and consumers
Kafka acts as a buffer between producers and consumers
near real-time
batches
Reliability
delivery guarantees
at least once
exactly once
achieved by using an external datastore with unique keys (idempotent writes)
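As a rough sketch of the settings behind these delivery guarantees, the producer configuration below aims for at-least-once delivery where retries do not create duplicates; the broker address and serializers are illustrative assumptions, not values from the diagram.

import java.util.Properties;

// Minimal sketch: producer settings for at-least-once delivery with idempotent retries.
public class ReliableProducerConfig {
    public static Properties reliableProducerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");                 // wait for all in-sync replicas
        props.put("enable.idempotence", "true");  // retries will not write duplicate records
        return props;
    }
}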
Throughput
scale dynamically when needed
Kafka Connect parallelism
several types of compression
Data Formats
supports different data formats through different serializers
useful when different frameworks expect different formats
Transformations
ETL
process as it passes through
save time and storage
can't get the raw data back or reprocess it later
ELT
process after storage
flexibility to users
ingest raw data
more CPU and storage space
Security
encrypting data
authentication
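A minimal sketch of what encryption and authentication look like on the client side, assuming TLS plus SASL/PLAIN; the broker address, truststore path, and credentials are placeholders.

import java.util.Properties;

// Minimal sketch: encrypt traffic with TLS and authenticate the client with SASL/PLAIN.
public class SecureClientConfig {
    public static Properties secureClientProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093");      // assumed TLS listener
        props.put("security.protocol", "SASL_SSL");          // encryption + authentication
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"pipeline\" password=\"pipeline-secret\";");
        return props;
    }
}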
Failure Handling
data retention
Coupling and Agility
ad-hoc pipelines
connecting each pair of frameworks directly results in many point-to-point couplings
a mess of integrations
effort to deploy, maintain, and monitor each one
loss of metadata
extreme processing
better to store raw data and let each consumer process it as needed
Kafka Connect
VS
Producer and Consumer
Kafka Clients
used when you can modify the application's code
the application pushes data to or pulls data from Kafka
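A minimal sketch of the client-based approach: an application you control pushes records to Kafka with the plain producer API. The topic name and broker address are assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal sketch: an application writes directly to Kafka with the producer client.
public class PipelineProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "pipeline-events" is an assumed topic name
            producer.send(new ProducerRecord<>("pipeline-events", "key-1", "value-1"));
            producer.flush();
        }
    }
}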
Kafka Connect
connects Kafka to an external datastore
used when you can't modify the datastore's code
directly dealing with external datastores
Kafka Connect
Architecture
Source
reads data from the source system, converts it, and hands it to the worker
Sink
receives data from the worker and writes it to the target system
Worker
configurations
group.id
bootstrap.servers
value.converter
key.converter
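A sketch of a distributed worker configuration (connect-distributed.properties) covering the settings listed above; the broker address and internal topic names are assumptions.

# Sketch of a distributed Kafka Connect worker config; values are illustrative.
bootstrap.servers=localhost:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
# internal topics where the workers store connector configs, offsets, and status
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status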
A Deeper Look
Connectors
decide how many tasks will run
split the data-copying work between the tasks (sketched below)
get configurations for the tasks from the workers and pass them along
Tasks
do the actual work of getting data in and out of Kafka
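A hypothetical sketch of the connector/task split: the connector decides how many tasks to run (capped by maxTasks) and builds one configuration per task, while each task does the actual data copying. The "tables" option, "task.slice" key, topic name, and TablesSourceConnector/TablesSourceTask classes are illustrative, not part of any real connector.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical source connector: splits a list of "tables" across tasks.
public class TablesSourceConnector extends SourceConnector {
    private Map<String, String> connectorConfig;

    @Override
    public void start(Map<String, String> props) {
        this.connectorConfig = props; // configuration handed over by the worker
    }

    @Override
    public Class<? extends Task> taskClass() {
        return TablesSourceTask.class;
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // split the data-copying work: one slice of tables per task, capped at maxTasks
        String[] tables = connectorConfig.getOrDefault("tables", "t1,t2").split(",");
        int taskCount = Math.min(maxTasks, tables.length);
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            Map<String, String> taskConfig = new HashMap<>(connectorConfig);
            taskConfig.put("task.slice", Integer.toString(i));
            configs.add(taskConfig);
        }
        return configs;
    }

    @Override
    public void stop() { }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public String version() {
        return "0.1";
    }

    // The task does the actual copying and hands records (with schema) to the worker.
    public static class TablesSourceTask extends SourceTask {
        @Override
        public void start(Map<String, String> props) { }

        @Override
        public List<SourceRecord> poll() throws InterruptedException {
            Thread.sleep(1000); // a real task would block until new data is available
            return Collections.singletonList(new SourceRecord(
                    Collections.singletonMap("table", "t1"),  // source partition
                    Collections.singletonMap("position", 0L), // source offset
                    "pipeline-events",                        // assumed target topic
                    Schema.STRING_SCHEMA,
                    "example row"));
        }

        @Override
        public void stop() { }

        @Override
        public String version() {
            return "0.1";
        }
    }
}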
Workers
execute the connectors and tasks
automatically commit offsets
expose the REST API for configuration (see the example below)
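A sketch of configuring a connector through the worker's REST API (default port 8083), here using Java's built-in HTTP client; the connector name, file path, and topic are assumptions, and the same call is commonly made with curl instead.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: create a connector by POSTing its configuration to the Connect REST API.
public class CreateConnectorExample {
    public static void main(String[] args) throws Exception {
        String body = """
            {
              "name": "file-source-demo",
              "config": {
                "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                "tasks.max": "1",
                "file": "/tmp/input.txt",
                "topic": "pipeline-events"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors")) // assumed worker address
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}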
Converters and connect data model
the source connector reads data and generates schema-plus-value records
the sink connector gets schema-plus-value records and writes the data to the target system
the converter translates between these records and the format stored in Kafka (JSON, Avro, etc.)
Use Cases
Kafka as an endpoint
Intermediary between frameworks