Diving into the Deep End:
Kafka Connect
Dennis Wittekind
Customer Success Engineer
Special Thanks to Jean Louis Bodart and Vincent de Saboulin
About Me
2
● Customer Success Engineer
@ Confluent
● Middleware Core Engineer in
a past Life
● Infrastructure as Code Nerd
● Gear Head
● Overland Camping Enthusiast
Agenda
3
01. Overview
Overview of Kafka Connect and Connector Hub
02. Concepts
Describing concepts of Kafka Connect
03. Management
Configuration management using REST API or Control Center
04. Deployment Model
Installation overview for self managed connectors and quick glance at fully managed connectors
05. Connect Configuration
Overview of the important configuration parameters
06. Security
Security aspects for Kafka Connect
07. Monitoring
Monitor your Kafka Connect clusters using JMX
08. Tips and Tricks
Various tips and tricks to be aware of
Overview
Kafka Connect
No-Code way of connecting known systems (databases, object
storage, queues, etc) to Apache Kafka
Diagram: Data sources → Kafka Connect → Apache Kafka → Kafka Connect → Data sinks
Kafka Connect (ETL like)
6
Diagram: Kafka Connect workers (durable data pipelines) sit between upstream/downstream systems, the Kafka Cluster, and Schema Registry.
Integrate upstream and downstream systems with Apache Kafka®
• Capture schema from sources, use schema to inform data sinks
• Highly Available workers ensure data pipelines aren’t interrupted
• Extensible framework API for building custom connectors
Instantly Connect Popular Data Sources & Sinks
100+
pre-built
connectors
Confluent HUB
Easily browse connectors by:
• Source vs Sinks
• Confluent vs Partner supported
• Commercial vs Free
• Available in Confluent Cloud
confluent.io/hub
Instantly Connect
Popular Data
Sources & Sinks
Confluent Hub - Connector Page
10
- Source or Sink ?
- Free or Commercial ?
- Supported by Confluent or
partners
- Can download plugin
- Link to documentation
- License type
- Link to source code (if open
source)
Concepts
Kafka Connect
Terminology
12
Connector
- Source
- Sink
Connect Worker
- Distributed
- Standalone
Tasks
Converters & Transforms
13
Diagram: a Kafka Cluster served by Connect Standalone Workers (each running on its own) and by a Connect Distributed Cluster (a group of cooperating Connect Workers).
Workers: The JVM(s) that run connectors and tasks. Can be run in either
standalone or distributed mode.
14
Connector
High level abstraction that
coordinates data streaming
by managing tasks
Don’t get confused by “connectors”
Connector plugin :
• The JAR containing all the classes
implemented or used by a connector
instance
Connector instance :
• Logical job responsible for
coordinating tasks
• Instantiated inside of a Worker
• Offset management logic and
partition distribution
• A class implementing the
Connector interface
• Single instance
15
Connectors
Connectors (monitoring the source or sink system for changes that require
re-configuring tasks) and tasks (copying a subset of a connector’s data) are
automatically balanced across the active workers. The division of work between
tasks is shown by the partitions that each task is assigned.
A three-node Kafka Connect distributed mode cluster
16
Tasks
Diagram: a Task reads Source Data and applies the key/value converter to produce Converted Data; Connect internal offsets are written to Kafka (via internal.converter) when running distributed, or to local disk when running standalone.
Tasks are the main actors in the data model for Connect.
Each connector instance coordinates a set of tasks that actually copy the
data. These tasks have no state stored within them.
Workers, Connectors and Tasks (1/3)
17
Diagram: a JDBC Source (Oracle DB1, tasks 1-6), an HDFS Sink Connector (tasks 1-6), and a JDBC Sink (Oracle DB2, task 1) spread across Kafka Connect Worker 1, Worker 2, and Worker 3.
Workers, Connectors and Tasks (2/3)
18
Diagram: the same JDBC Source (Oracle DB1), HDFS Sink Connector, and JDBC Sink (Oracle DB2) tasks, redistributed across Kafka Connect Worker 1, Worker 2, and Worker 3.
Workers, Connectors and Tasks (3/3)
19
Diagram: the JDBC Source (Oracle DB1), HDFS Sink Connector, and JDBC Sink (Oracle DB2) tasks after rebalancing across Kafka Connect Worker 1, Worker 2, and Worker 3.
Tasks - a few more details
20
• Number of tasks is limited only by the connector configuration 'tasks.max'
• Workers will spawn as many tasks as they are told to
• Tasks rebalance just like consumers
• Since Apache Kafka 2.3, KIP-415 (Incremental Cooperative Rebalancing) greatly reduces the
impact of connector re-configuration / stop / start by rebalancing only those tasks that need
to be started, stopped, or moved
• Placement is done automatically by Connect; however, there is no guarantee that two
tasks will be placed on different machines.
• This can be an issue when a connector listens on a network port, for example the Syslog source
(either run it in standalone mode or on a single distributed worker)
Kafka Connect Converters
Convert between the source and sink
ConnectRecord objects and the binary format
(byte[]) used to persist them in Kafka.
String, JSON, Avro, Protobuf, JSON Schema, and others
Reference: https://docs.confluent.io/current/connect/concepts.html#connect-converters
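For example, to use Avro with Schema Registry, the converters are usually set in the worker properties (a sketch; it assumes Schema Registry is reachable at http://schema-registry:8081, and any of these can also be overridden per connector):
# assumes Schema Registry at http://schema-registry:8081 (adjust to your environment)
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://schema-registry:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081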
Kafka Connect Transforms
Single Message Transformations (SMTs) are applied to messages as they flow through
Connect.
• SMTs transform inbound messages after a source connector has produced them, but
before they are written to Kafka.
• SMTs transform outbound messages before they are sent to a sink connector.
Reference: https://docs.confluent.io/current/connect/transforms/index.html
Kafka Connect Transforms - What for ?
23
● Data Masking - Mask sensitive information ahead of sending it to
Kafka
● Event Routing - Modify an event destination based on contents of
the event
● Event Enhancement - Add additional fields to event
● Partitioning - Set the key for the event based on contents of the
event before sending to Kafka
● Timestamp Conversion - Time based data conversion when
integrating different systems (ISO8601 vs Unix Epoch)
Kafka Connect Transforms - Example
24
Diagram: a JDBC Connector reads from MySQL; the MaskField SMT replaces ssn: 123-45-6789 with ssn: “” before the record is written to Kafka.
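A sketch of what this looks like in the connector config (the transform alias "mask" and the field name "ssn" are illustrative):
"transforms": "mask",
"transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
"transforms.mask.fields": "ssn"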
Chaining Transforms
25
Transforms can be chained together for more power:
null {"c1":{"foo":1},"c2":{"string":"bar"},"create_ts":1501796305000,"update_ts":1501796305000}
null {"c1":{"foo":2},"c2":{"string":"bar"},"create_ts":1501796665000,"update_ts":1501796665000}
{
"transforms":"createKey,extractFoo",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"c1",
"transforms.extractFoo.type":"org.apache.kafka.connect.transforms.ExtractField$Key"
,
"transforms.extractFoo.field":"foo"
}
1 {"c1":{"foo":1},"c2":{"string":"bar"},"create_ts":1501796305000,"update_ts":1501796305000}
2 {"c1":{"foo":2},"c2":{"string":"bar"},"create_ts":1501796665000,"update_ts":1501796665000}
Reference: https://docs.confluent.io/current/connect/transforms/index.html
Transforms - When not to use?
26
● Use chaining sparingly, don’t rely on too many transforms: it’s hard to
read and reason about after more than a couple
● Don’t attempt to enrich events in transforms
● Use the right tool for the job: Complicated transformations, joins,
aggregations should be done using Kafka Streams or ksqlDB (KSQL)
Reference: https://kafka-summit.org/sessions/single-message-transformations-not-transformations-youre-looking/
Management
Management Interfaces
28
$ curl -i -X POST -H "Accept:application/json" -H
"Content-Type:application/json"
http://kafkaconnect:8083/connectors/ -d
‘{
"name" : "RabbitMQSourceConnector1",
"config" : {
"connector.class" :
"io.confluent.connect.rabbitmq.RabbitMQSourceConnector",
"tasks.max" : "1",
"kafka.topic" : "rabbitmq",
"rabbitmq.queue" : "myqueue",
"rabbitmq.host" : "localhost",
"rabbitmq.username" : "guest",
"rabbitmq.password" : "guest"
}
}’
Kafka Connect REST API Confluent Control Center
Kafka Connect REST API
29
Reference: Kafka Connect REST Interface
Kafka Connect REST API - Tips
30
● Use the PUT /connectors/(string: name)/config method instead of POST /connectors, as it can
create a new connector or update an existing one (see the curl sketch below)
● Use GET /connectors/(string: name)/status for the current status of the connector, including
whether it is running, failed or paused, which worker it is assigned to, error information if
it has failed, and the state of all its tasks
● Before updating a connector’s config, you can validate it using PUT
/connector-plugins/(string: name)/config/validate
Example: missing topic.prefix
in JDBC source connector 👉
● ⚠ POST /connectors/(string: name)/restart does not restart tasks:
○ To restart all tasks:
■ Pause the connector using PUT /connectors/(string: name)/pause
■ Resume the connector using PUT /connectors/(string: name)/resume
○ Tasks can also be restarted individually using POST /connectors/(string:
name)/tasks/(int: taskid)/restart
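For example, creating or updating a connector idempotently with PUT, then checking its status and restarting a single task (a sketch; the connector name jdbc-source and its settings are illustrative):
# illustrative connector name and config; adjust for your environment
$ curl -s -X PUT -H "Content-Type:application/json" \
  http://kafkaconnect:8083/connectors/jdbc-source/config -d '{
  "connector.class" : "io.confluent.connect.jdbc.JdbcSourceConnector",
  "connection.url" : "jdbc:mysql://mysql:3306/demo?user=connect&password=connect-secret",
  "mode" : "incrementing",
  "incrementing.column.name" : "id",
  "topic.prefix" : "mysql-",
  "tasks.max" : "2"
}'
$ curl -s http://kafkaconnect:8083/connectors/jdbc-source/status | jq
$ curl -s -X POST http://kafkaconnect:8083/connectors/jdbc-source/tasks/0/restart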
Error Handling & Dead letter queue (DLQ)
31
● Motivations : Allow users to configure how bad data should be handled during all phases
of processing records.
○ Failure to deserialize
○ Failure during convert
○ Failure during transforms
○ Lack of availability of external components
● Introduced by KIP-298 in Apache Kafka 2.0
● By default Connect will fail immediately when an error occurs (errors.tolerance=none)
● Error handling must be configured in the individual connector configurations
○ Retry on failure (errors.retry.timeout and errors.retry.delay.max.ms)
○ Task tolerance limits (errors.tolerance=all)
○ Log error context (errors.log.enable and errors.log.include.messages)
○ DLQ: produce error context to a Kafka topic (errors.deadletterqueue.*)
● Can be monitored via JMX:
kafka.connect:type=task-error-metrics,connector=([-.\w]+),task=([-.\w]+)
○ deadletterqueue-produce-requests and deadletterqueue-produce-failures
Dead letter queue - Gotchas
32
● Setting up DLQ is only possible with Sink connectors (not possible with Source connectors
FF-506)
● Only works for:
○ Failure to deserialize
○ Failure during convert
○ Failure during transforms
Note: in AK 2.6, KIP-610 adds the ability for sink connectors to report individual records as
problematic so that they are sent to the DLQ
● For the moment, it will not catch failures that happen when messages are written to the external
system (example: a column too large in an Oracle DB, or a wrong mapping in Elasticsearch)
● Some connectors implement their own DLQ (reporter.error.topic.name):
○ HTTP Sink
○ ServiceNow Sink
○ Google Cloud Functions Sink
○ Azure Search Sink
● Before 5.3.2, the DLQ requires AdminClient and Producer security configurations if your Kafka
cluster is set up with security (KAFKA-9046). See the Confluent Support
Knowledge Base article for full details.
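To see what actually landed in the DLQ, including the error-context headers added when errors.deadletterqueue.context.headers.enable=true (they are prefixed with __connect.errors.), one option is kafkacat (a sketch; the broker address and the topic name dlq are examples):
# assumes kafkacat is installed; broker and topic names are examples
$ kafkacat -b localhost:9092 -t dlq -C -o beginning \
    -f 'Headers: %h\nKey: %k\nValue: %s\n\n'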
Error Handling - Example
33
$ curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json"
http://kafkaconnect:8083/connectors/ -d
‘{
"name" : "jdbc-vertica-sink",
"config" : {
"connector.class" : "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max" : "1",
"connection.url" : "jdbc:vertica://vertica:5433/docker?user=dbadmin&password=",
"topics" : "mytopic",
"errors.log.enable" : "true",
"errors.log.include.messages" : "true",
"error.tolerance" : "all",
"errors.deadletterqueue.topic.name" : "dlq",
"errors.deadletterqueue.topic.replication.factor" : "3",
"errors.deadletterqueue.context.headers.enable" : "true"
}
}’
Changing log level dynamically
34
● Introduced in AK 2.5 (KIP-495), it’s possible to leave the Kafka Connect worker running and change log
levels dynamically
● New endpoint admin/loggers to query current logger levels:
$ curl -s http://localhost:8083/admin/loggers/ | jq
{
"org.apache.kafka.connect.runtime.rest" : {
"level" : "WARN"
},
"org.reflections" : {
"level" : "ERROR"
},
"root": {
"level" : "INFO"
}
}
● Update the org.apache.kafka.connect.runtime.WorkerSourceTask logger level:
$ curl -s -X PUT -H "Content-Type:application/json" \
  http://localhost:8083/admin/loggers/org.apache.kafka.connect.runtime.WorkerSourceTask \
  -d '{ "level" : "TRACE" }'
● A good blog post with more details
Deployment Model
Installation
36
● Zip
● RPM (RHEL and Centos)
● DEB (Ubuntu and Debian)
● cp-ansible
● 🐳Docker
● Kubernetes:
○ Open-source cp-helm-charts
○ Confluent Operator
● Terraform (for GCP/AWS)
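Regardless of the packaging, individual connector plugins are typically installed with the Confluent Hub client and picked up after a worker restart (a sketch; it assumes the client is on the PATH and the install directory is on plugin.path):
$ confluent-hub install --no-prompt confluentinc/kafka-connect-jdbc:latest
# then restart the Connect worker(s) so the new plugin is loaded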
Connect Configuration
Connect Worker configuration - must set:
These are properties that you must set to ensure correct operation
• bootstrap.servers={{CLUSTER_URL}}
• group.id=connect-cluster
• key.converter=org.apache.kafka.connect.json.JsonConverter
• value.converter=org.apache.kafka.connect.json.JsonConverter
• key.converter.schemas.enable=false
• value.converter.schemas.enable=false
• internal.key.converter=org.apache.kafka.connect.json.JsonConverter
• internal.value.converter=org.apache.kafka.connect.json.JsonConverter
• internal.key.converter.schemas.enable=false
• internal.value.converter.schemas.enable=false
(the internal.* converter settings above are deprecated as part of KAFKA-5540)
• plugin.path=/usr/share/java
• rest.port=8083
38
Connect Worker configuration - should set:
If you run Kafka Connect in distributed mode, the workers need to be able to reach each other on
rest.port, so that port must be open.
Depending on your setup, you should set the following properties so that each worker advertises the
appropriate endpoint to the others:
• rest.advertised.host.name
• rest.advertised.port
Caution: Connect rest.advertised.port must be accessible between workers! 39
Connect Worker configuration - could set:
These are properties that you could set depending on your particular use case
A Connect cluster automatically creates three internal topics to manage offset, config, and status
information. Note that these count towards the total partition limit quota.
• offset.storage.topic=connect-offsets
• offset.storage.replication.factor=3
• offset.storage.partitions=3
• config.storage.topic=connect-configs
• config.storage.replication.factor=3
• status.storage.topic=connect-status
• status.storage.replication.factor=3
If you need to run multiple Connect clusters, make sure each one uses a different group.id and
different internal topic names!
40
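A sketch of the worker properties for a hypothetical second cluster sharing the same Kafka brokers (names are examples):
# hypothetical second cluster; names are examples
group.id=connect-cluster-2
offset.storage.topic=connect-offsets-2
config.storage.topic=connect-configs-2
status.storage.topic=connect-status-2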
Connect Worker configuration - internal topics
• Internal topics are automatically created. If you want to create them yourself, make sure to create
them as compacted topics.
• The following example commands show how to manually create compacted and replicated Kafka
topics before starting Connect. Make sure to adhere to the distributed worker guidelines when entering
parameters:
# config.storage.topic=connect-configs
$ bin/kafka-topics --create --bootstrap-server localhost:9092 --topic connect-configs \
    --replication-factor 3 --partitions 1 --config cleanup.policy=compact
# offset.storage.topic=connect-offsets
$ bin/kafka-topics --create --bootstrap-server localhost:9092 --topic connect-offsets \
    --replication-factor 3 --partitions 50 --config cleanup.policy=compact
# status.storage.topic=connect-status
$ bin/kafka-topics --create --bootstrap-server localhost:9092 --topic connect-status \
    --replication-factor 3 --partitions 10 --config cleanup.policy=compact
41
Security
Authentication
43
At the Kafka level :
Like any Kafka client, Connect supports all
authentication mechanisms provided by Kafka (SASL,
mTLS, OAuth, Kerberos)
At REST API level :
• Basic Auth
• Client Certificate
Security config at worker level
For connect internal topics :
• ssl.endpoint.identification.algorithm=https
• sasl.mechanism=PLAIN
• sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<api-key>" password="<api-secret>";
• security.protocol=SASL_SSL
For Sink Connectors :
• consumer.ssl.endpoint.identification.algorithm=https
• consumer.sasl.mechanism=PLAIN
• consumer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<api-key>" password="<api-secret>";
• consumer.security.protocol=SASL_SSL
For Source Connectors :
• producer.ssl.endpoint.identification.algorithm=https
• producer.sasl.mechanism=PLAIN
• producer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<api-key>" password="<api-secret>";
• producer.security.protocol=SASL_SSL
44
All connectors use the same credentials!
At the worker level used for internal topics :
• ssl.endpoint.identification.algorithm=https
• sasl.mechanism=PLAIN
• sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<api-key>" password="<api-secret>";
• security.protocol=SASL_SSL
Allow overrides at worker level :
• connector.client.config.override.policy=All
For source connectors, at the connector level :
• producer.override.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<api-key>" password="<api-secret>";
• producer.override.sasl.mechanism=PLAIN
• producer.override.security.protocol=SASL_SSL
For sink connectors, at the connector level :
• consumer.override.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<api-key>" password="<api-secret>";
• consumer.override.sasl.mechanism=PLAIN
• consumer.override.security.protocol=SASL_SSL
Security config with connector overrides
45
Each connector uses dedicated credentials
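A sketch of how these overrides look inside a source connector's JSON config (the connector name and credentials are illustrative; the fragment is truncated like the Before/After example on the secrets slide):
"name": "source-with-own-creds",
"config": {
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "producer.override.security.protocol": "SASL_SSL",
  "producer.override.sasl.mechanism": "PLAIN",
  "producer.override.sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"<api-key>\" password=\"<api-secret>\";",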
Externalizing Secrets - Example with
FileConfigProvider
46
● Set up your credentials file, e.g. data/foo_credentials.properties:
FOO_USERNAME="rick"
FOO_PASSWORD="n3v3r_g0nn4_g1ve_y0u_up"
● Add the ConfigProvider to your Kafka Connect worker:
kafka-connect:
environment:
CONNECT_CONFIG_PROVIDERS: 'file'
CONNECT_CONFIG_PROVIDERS_FILE_CLASS: 'org.apache.kafka.common.config.provider.FileConfigProvider'
volumes:
- ./data:/data
● Now simply replace the credentials in your connector config with placeholders for the values:
Before:
"name": "source-activemq-01",
"config": {
"connector.class": "io.confluent.connect.activemq.ActiveMQSourceConnector",
"activemq.username": "rick",
"activemq.password": "n3v3r_g0nn4_g1ve_y0u_up",
After:
"name": "source-activemq-01",
"config": {
"connector.class": "io.confluent.connect.activemq.ActiveMQSourceConnector",
"activemq.username": "${file:/data/foo_credentials.properties:FOO_USERNAME}",
"activemq.password": "${file:/data/foo_credentials.properties:FOO_PASSWORD}",
Monitoring
Monitoring - JMX Connect Metrics
In addition to system metrics (CPU, memory, I/O) and JVM metrics, you should collect Kafka
Connect metrics using JMX (added as part of KIP-196).
JMX metrics are exposed on a per-worker basis, so they need to be aggregated to get a
complete cluster view.
These metrics should be graphed over time to help you understand the load on the cluster
before you decide to add more tasks/workers.
48
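A sketch of one way to expose JMX on a worker started from the Kafka scripts (the port number is arbitrary; kafka-run-class honors the JMX_PORT environment variable):
# expose JMX on port 9999 for this worker (port is an example)
$ export JMX_PORT=9999
$ bin/connect-distributed etc/kafka/connect-distributed.properties
Useful MBeans to start with include kafka.connect:type=connect-worker-metrics (connector-count, task-count) and kafka.connect:type=connector-task-metrics,connector=([-.\w]+),task=([-.\w]+).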
Demo
Tips & Tricks
Offsets management
Kafka Connect automatically and periodically commits the progress of connectors.
Connectors restart at the last committed position.
Source connectors track offsets in a dedicated topic, configurable with the
offset.storage.topic property at the worker level.
● It needs to be a compacted topic with a high number of partitions (25 or 50) and a high
replication factor (>= 3).
● This topic tracks the offsets of the remote position (on the source side).
Sink connectors track offsets like a “standard consumer”, in Kafka’s built-in
__consumer_offsets topic.
51
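To see what a source connector has committed, the offsets topic can be read directly; the key identifies the connector and source partition, the value holds the source offset (a sketch; it assumes the default topic name connect-offsets):
# topic name assumes the default offset.storage.topic=connect-offsets
$ bin/kafka-console-consumer --bootstrap-server localhost:9092 \
    --topic connect-offsets --from-beginning --property print.key=true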
Delivery Guarantees
Kafka Connect processes each record once under normal operation,
but things can go wrong, so it can only guarantee at-least-once delivery.
There is no exactly-once support for source connectors yet, but there is a JIRA for it (KAFKA-6080).
Sink connectors can achieve exactly-once when they store offsets in the sink system.
Is a given sink connector at-least-once or exactly-once?
Some are exactly-once (HDFS, S3, Elasticsearch); it is usually stated in the
documentation of the connector itself.
But many at-least-once connectors can be configured to be idempotent.
52
Tips ?
• Use schemas to enforce forward and/or backward compatibility.
• Install connector plugins on all workers in the cluster, not just one of the workers.
● Tune the producers and consumers:
○ producer and consumer configs
○ offset commit intervals
○ poll intervals, batch sizes, linger, compression, # of tasks
• By default, source and sink connectors inherit their client configurations from the
worker configuration.
• You can enable per-connector client configuration properties (i.e. compression, batch
size, credentials) that override the default worker properties by setting the
connector.client.config.override.policy configuration parameter to All or
Principal (see the sketch below)
53
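With connector.client.config.override.policy=All on the worker, individual connectors can tune their own clients from their JSON config; a sketch with illustrative values:
For a source connector:
"producer.override.compression.type": "lz4",
"producer.override.batch.size": "131072",
"producer.override.linger.ms": "50",
For a sink connector:
"consumer.override.max.poll.records": "500",
"consumer.override.fetch.max.wait.ms": "1000",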
Thank you!
cnfl.io/meetups cnfl.io/slack cnfl.io/blog