Diving into the Deep End:
Kafka Connect
Dennis Wittekind
Customer Success Engineer
Special Thanks to Jean Louis Bodart and Vincent de Saboulin
About Me
2
● Customer Success Engineer
@ Confluent
● Middleware Core Engineer in
a past Life
● Infrastructure as Code Nerd
● Gear Head
● Overland Camping Enthusiast
Agenda
3
01. Overview
Overview of Kafka Connect and Connector Hub
02. Concepts
Describing concepts of Kafka Connect
03. Management
Configuration management using REST API or Control Center
04. Deployment Model
Installation overview for self managed connectors and quick glance at fully managed connectors
05. Connect Configuration
Overview of the important configuration parameters
06. Security
Security aspects for Kafka Connect
07. Monitoring
Monitor your Kafka Connect clusters using JMX
08. Tips and Tricks
Various tips and tricks to be aware of
Overview
Kafka Connect
No-Code way of connecting known systems (databases, object
storage, queues, etc) to Apache Kafka
Diagram: Data sources → Kafka Connect → Apache Kafka → Kafka Connect → Data sinks
Kafka Connect (ETL like)
6
Diagram: Kafka Connect workers (durable data pipelines) sit between upstream/downstream systems, the Kafka Cluster, and Schema Registry.
Integrate upstream and downstream systems with Apache Kafka®
• Capture schema from sources, use schema to inform data sinks
• Highly Available workers ensure data pipelines aren’t interrupted
• Extensible framework API for building custom connectors
Instantly Connect Popular Data Sources & Sinks
100+
pre-built
connectors
Confluent HUB
Easily browse connectors by:
• Source vs Sinks
• Confluent vs Partner supported
• Commercial vs Free
• Available in Confluent Cloud
confluent.io/hub
Instantly Connect
Popular Data
Sources & Sinks
Confluent Hub - Connector Page
10
- Source or Sink ?
- Free or Commercial ?
- Supported by Confluent or
partners
- Can download plugin
- Link to documentation
- License type
- Link to source code (if open
source)
Concepts
Kafka Connect
Terminology
12
Connector
- Source
- Sink
Connect Worker
- Distributed
- Standalone
Tasks
Converters & Transforms
13
Diagram: a Kafka Cluster served by Connect Standalone Workers (each running on its own) and by a Connect Distributed Cluster (a group of cooperating Connect Workers).
Workers: The JVM(s) that run connectors and tasks. Can be run in either
standalone or distributed mode.
14
Connector
High level abstraction that
coordinates data streaming
by managing tasks
Don’t get confused by “connectors”
Connector plugin :
• The JAR containing all the classes
implemented or used by a connector
instance
Connector instance :
• Logical job responsible for
coordinating tasks
• Instantiated inside of a Worker
• Offset management logic and
partition distribution
• A class implementing the
Connector interface
• Single instance
15
Connectors
Connectors (monitoring the source or sink system for changes that require
re-configuring tasks) and tasks (copying a subset of a connector’s data) are
automatically balanced across the active workers. The division of work between
tasks is shown by the partitions that each task is assigned.
A three-node Kafka Connect distributed mode cluster
16
Tasks
Diagram: a Task reads Source Data and applies the key/value converter to produce Converted Data; Connect internal offsets are written to Kafka (via internal.converter) when running distributed, or to local disk when running standalone.
Tasks are the main actors in the data model for Connect.
Each connector instance coordinates a set of tasks that actually copy the
data. These tasks have no state stored within them.
Workers, Connectors and Tasks (1/3)
17
Diagram: a JDBC Source (Oracle DB1, tasks 1-6), an HDFS Sink Connector (tasks 1-6), and a JDBC Sink (Oracle DB2, task 1) spread across Kafka Connect Worker 1, Worker 2, and Worker 3.
Workers, Connectors and Tasks (2/3)
18
Diagram: the same JDBC Source (Oracle DB1), HDFS Sink Connector, and JDBC Sink (Oracle DB2) tasks, redistributed across Kafka Connect Worker 1, Worker 2, and Worker 3.
Workers, Connectors and Tasks (3/3)
19
Diagram: the JDBC Source (Oracle DB1), HDFS Sink Connector, and JDBC Sink (Oracle DB2) tasks after rebalancing across Kafka Connect Worker 1, Worker 2, and Worker 3.
Tasks - a few more details
20
• Number of tasks is limited only by the connector configuration 'tasks.max'
• Workers will spawn as many tasks as they are told to
• Tasks rebalance just like consumers
• Since Apache Kafka 2.3, KIP-415 (Incremental Cooperative Rebalancing) greatly reduces the
impact of connector re-configuration / stop / start by rebalancing only those tasks that need
to be started, stopped, or moved
• Placement is done automatically by Connect; however, there is no guarantee that two
tasks will be placed on different machines.
• This can be an issue when a connector listens on a network port, for example the Syslog source
(either run it in standalone mode or on a single distributed worker)
Kafka Connect Converters
Convert between the source and sink
ConnectRecord objects and the binary format
(byte[]) used to persist them in Kafka.
String, JSON, Avro, Protobuf, JSON Schema, and others
Reference: https://docs.confluent.io/current/connect/concepts.html#connect-converters
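For example, to use Avro with Schema Registry, the converters are usually set in the worker properties (a sketch; it assumes Schema Registry is reachable at http://schema-registry:8081, and any of these can also be overridden per connector):
# assumes Schema Registry at http://schema-registry:8081 (adjust to your environment)
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://schema-registry:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081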
Kafka Connect Transforms
Single Message Transformations (SMTs) are applied to messages as they flow through
Connect.
• SMTs transform inbound messages after a source connector has produced them, but
before they are written to Kafka.
• SMTs transform outbound messages before they are sent to a sink connector.
Reference: https://docs.confluent.io/current/connect/transforms/index.html
Kafka Connect Transforms - What for ?
23
● Data Masking - Mask sensitive information ahead of sending it to
Kafka
● Event Routing - Modify an event destination based on contents of
the event
● Event Enhancement - Add additional fields to event
● Partitioning - Set the key for the event based on contents of the
event before sending to Kafka
● Timestamp Conversion - Time based data conversion when
integrating different systems (ISO8601 vs Unix Epoch)
Kafka Connect Transforms - Example
24
Diagram: a JDBC Connector reads from MySQL; the MaskField SMT replaces ssn: 123-45-6789 with ssn: “” before the record is written to Kafka.
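A sketch of what this looks like in the connector config (the transform alias "mask" and the field name "ssn" are illustrative):
"transforms": "mask",
"transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
"transforms.mask.fields": "ssn"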
Chaining Transforms
25
Transforms can be chained together for more power:
null {"c1":{"foo":1},"c2":{"string":"bar"},"create_ts":1501796305000,"update_ts":1501796305000}
null {"c1":{"foo":2},"c2":{"string":"bar"},"create_ts":1501796665000,"update_ts":1501796665000}
{
"transforms":"createKey,extractFoo",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"c1",
"transforms.extractFoo.type":"org.apache.kafka.connect.transforms.ExtractField$Key"
,
"transforms.extractFoo.field":"foo"
}
1 {"c1":{"foo":1},"c2":{"string":"bar"},"create_ts":1501796305000,"update_ts":1501796305000}
2 {"c1":{"foo":2},"c2":{"string":"bar"},"create_ts":1501796665000,"update_ts":1501796665000}
Reference: https://docs.confluent.io/current/connect/transforms/index.html
Transforms - When not to use?
26
● Use chaining sparingly, don’t rely on too many transforms: it’s hard to
read and reason about after more than a couple
● Don’t attempt to enrich events in transforms
● Use the right tool for the job: Complicated transformations, joins,
aggregations should be done using Kafka Streams or ksqlDB (KSQL)
Reference: https://kafka-summit.org/sessions/single-message-transformations-not-transformations-youre-looking/
Management
Management Interfaces
28
$ curl -i -X POST -H "Accept:application/json" -H
"Content-Type:application/json"
http://kafkaconnect:8083/connectors/ -d
‘{
"name" : "RabbitMQSourceConnector1",
"config" : {
"connector.class" :
"io.confluent.connect.rabbitmq.RabbitMQSourceConnector",
"tasks.max" : "1",
"kafka.topic" : "rabbitmq",
"rabbitmq.queue" : "myqueue",
"rabbitmq.host" : "localhost",
"rabbitmq.username" : "guest",
"rabbitmq.password" : "guest"
}
}’
Kafka Connect REST API Confluent Control Center
Kafka Connect REST API
29
Reference: Kafka Connect REST Interface
Kafka Connect REST API - Tips
30
● Use the PUT /connectors/(string: name)/config method instead of POST /connectors, as it can
create a new connector or update an existing one (see the curl sketch below)
● Use GET /connectors/(string: name)/status for the current status of the connector, including
whether it is running, failed or paused, which worker it is assigned to, error information if
it has failed, and the state of all its tasks
● Before updating a connector’s config, you can validate it using PUT
/connector-plugins/(string: name)/config/validate
Example: missing topic.prefix
in JDBC source connector 👉
● ⚠ POST /connectors/(string: name)/restart does not restart tasks:
○ To restart all tasks:
■ Pause the connector using PUT /connectors/(string: name)/pause
■ Resume the connector using PUT /connectors/(string: name)/resume
○ Tasks can also be restarted individually using POST /connectors/(string:
name)/tasks/(int: taskid)/restart
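For example, creating or updating a connector idempotently with PUT, then checking its status and restarting a single task (a sketch; the connector name jdbc-source and its settings are illustrative):
# illustrative connector name and config; adjust for your environment
$ curl -s -X PUT -H "Content-Type:application/json" \
  http://kafkaconnect:8083/connectors/jdbc-source/config -d '{
  "connector.class" : "io.confluent.connect.jdbc.JdbcSourceConnector",
  "connection.url" : "jdbc:mysql://mysql:3306/demo?user=connect&password=connect-secret",
  "mode" : "incrementing",
  "incrementing.column.name" : "id",
  "topic.prefix" : "mysql-",
  "tasks.max" : "2"
}'
$ curl -s http://kafkaconnect:8083/connectors/jdbc-source/status | jq
$ curl -s -X POST http://kafkaconnect:8083/connectors/jdbc-source/tasks/0/restart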
Error Handling & Dead letter queue (DLQ)
31
● Motivations : Allow users to configure how bad data should be handled during all phases
of processing records.
○ Failure to deserialize
○ Failure during convert
○ Failure during transforms
○ Lack of availability of external components
● Introduced by KIP-298 in Apache Kafka 2.0
● By default Connect will fail immediately when an error occurs (errors.tolerance=none)
● Error handling must be configured in the individual connector configurations
○ Retry on failure (errors.retry.timeout and errors.retry.delay.max.ms)
○ Task tolerance limits (errors.tolerance=all)
○ Log error context (errors.log.enable and errors.log.include.messages)
○ DLQ: produce error context to a Kafka topic (errors.deadletterqueue.*)
● Can be monitored via JMX:
kafka.connect:type=task-error-metrics,connector=([-.\w]+),task=([-.\w]+)
○ deadletterqueue-produce-requests and deadletterqueue-produce-failures
Dead letter queue - Gotchas
32
● Setting up DLQ is only possible with Sink connectors (not possible with Source connectors
FF-506)
● Only works for:
○ Failure to deserialize
○ Failure during convert
○ Failure during transforms
Note: in AK 2.6, KIP-610 adds the ability for sink connectors to report individual records as
problematic so that they are sent to the DLQ
● For the moment, it will not catch failures that happen when messages are written to the external
system (example: a column too large in an Oracle DB, or a wrong mapping in Elasticsearch)
● Some connectors implement their own DLQ (reporter.error.topic.name):
○ HTTP Sink
○ ServiceNow Sink
○ Google Cloud Functions Sink
○ Azure Search Sink
● Before 5.3.2, the DLQ requires AdminClient and Producer security configurations if your Kafka
cluster is set up with security (KAFKA-9046). See the Confluent Support
Knowledge Base article for full details.
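To see what actually landed in the DLQ, including the error-context headers added when errors.deadletterqueue.context.headers.enable=true (they are prefixed with __connect.errors.), one option is kafkacat (a sketch; the broker address and the topic name dlq are examples):
# assumes kafkacat is installed; broker and topic names are examples
$ kafkacat -b localhost:9092 -t dlq -C -o beginning \
    -f 'Headers: %h\nKey: %k\nValue: %s\n\n'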
Error Handling - Example
33
$ curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json"
http://kafkaconnect:8083/connectors/ -d
‘{
"name" : "jdbc-vertica-sink",
"config" : {
"connector.class" : "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max" : "1",
"connection.url" : "jdbc:vertica://vertica:5433/docker?user=dbadmin&password=",
"topics" : "mytopic",
"errors.log.enable" : "true",
"errors.log.include.messages" : "true",
"error.tolerance" : "all",
"errors.deadletterqueue.topic.name" : "dlq",
"errors.deadletterqueue.topic.replication.factor" : "3",
"errors.deadletterqueue.context.headers.enable" : "true"
}
}’
Changing log level dynamically
34
● Introduced in AK 2.5 (KIP-495), it’s possible to leave the Kafka Connect worker running and change log
levels dynamically
● New endpoint admin/loggers to query current logger levels:
$ curl -s http://localhost:8083/admin/loggers/ | jq
{
"org.apache.kafka.connect.runtime.rest" : {
"level" : "WARN"
},
"org.reflections" : {
"level" : "ERROR"
},
"root": {
"level" : "INFO"
}
}
● Update the org.apache.kafka.connect.runtime.WorkerSourceTask logger level:
$ curl -s -X PUT -H "Content-Type:application/json" \
  http://localhost:8083/admin/loggers/org.apache.kafka.connect.runtime.WorkerSourceTask \
  -d '{ "level" : "TRACE" }'
● A good blog post with more details
Deployment Model
Installation
36
● Zip
● RPM (RHEL and Centos)
● DEB (Ubuntu and Debian)
● cp-ansible
● 🐳Docker
● Kubernetes:
○ Open-source cp-helm-charts
○ Confluent Operator
● Terraform (for GCP/AWS)
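Regardless of the packaging, individual connector plugins are typically installed with the Confluent Hub client and picked up after a worker restart (a sketch; it assumes the client is on the PATH and the install directory is on plugin.path):
$ confluent-hub install --no-prompt confluentinc/kafka-connect-jdbc:latest
# then restart the Connect worker(s) so the new plugin is loaded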
Connect Configuration
Connect Worker configuration - must set:
These are properties that you must set to ensure correct operation
• bootstrap.servers={{CLUSTER_URL}}
• group.id=connect-cluster
• key.converter=org.apache.kafka.connect.json.JsonConverter
• value.converter=org.apache.kafka.connect.json.JsonConverter
• key.converter.schemas.enable=false
• value.converter.schemas.enable=false
• internal.key.converter=org.apache.kafka.connect.json.JsonConverter
• internal.value.converter=org.apache.kafka.connect.json.JsonConverter
• internal.key.converter.schemas.enable=false
• internal.value.converter.schemas.enable=false
(the internal.* converter settings above are deprecated as part of KAFKA-5540)
• plugin.path=/usr/share/java
• rest.port=8083
38
Connect Worker configuration - should set:
If you run Kafka Connect in distributed mode, the workers need to be able to reach each other on
rest.port, so that port must be open.
Depending on your setup, you should set the following properties so that each worker advertises the
appropriate endpoint to the others:
• rest.advertised.host.name
• rest.advertised.port
Caution: Connect rest.advertised.port must be accessible between workers! 39
Connect Worker configuration - could set:
These are properties that you could set depending on your particular use case
A Connect cluster automatically creates three internal topics to manage offset, config, and status
information. Note that these count towards the total partition limit quota.
• offset.storage.topic=connect-offsets
• offset.storage.replication.factor=3
• offset.storage.partitions=3
• config.storage.topic=connect-configs
• config.storage.replication.factor=3
• status.storage.topic=connect-status
• status.storage.replication.factor=3
If you need to run multiple Connect clusters, make sure each one uses a different group.id and
different internal topic names!
40
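A sketch of the worker properties for a hypothetical second cluster sharing the same Kafka brokers (names are examples):
# hypothetical second cluster; names are examples
group.id=connect-cluster-2
offset.storage.topic=connect-offsets-2
config.storage.topic=connect-configs-2
status.storage.topic=connect-status-2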
Connect Worker configuration - internal topics
• Internal topics are automatically created. If you want to create them yourself, make sure to create
them as compacted topics.
• The following example commands show how to manually create compacted and replicated Kafka
topics before starting Connect. Make sure to adhere to the distributed worker guidelines when entering
parameters:
# config.storage.topic=connect-configs
$ bin/kafka-topics --create --bootstrap-server localhost:9092 --topic connect-configs \
    --replication-factor 3 --partitions 1 --config cleanup.policy=compact
# offset.storage.topic=connect-offsets
$ bin/kafka-topics --create --bootstrap-server localhost:9092 --topic connect-offsets \
    --replication-factor 3 --partitions 50 --config cleanup.policy=compact
# status.storage.topic=connect-status
$ bin/kafka-topics --create --bootstrap-server localhost:9092 --topic connect-status \
    --replication-factor 3 --partitions 10 --config cleanup.policy=compact
41
Security
Authentication
43
At the Kafka level :
Like any Kafka client, Connect supports all
authentication mechanisms provided by Kafka (SASL,
mTLS, OAuth, Kerberos)
At REST API level :
• Basic Auth
• Client Certificate
Security config at worker level
For connect internal topics :
• ssl.endpoint.identification.algorithm=https
• sasl.mechanism=PLAIN
• sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<api-key>" password="<api-secret>";
• security.protocol=SASL_SSL
For Sink Connectors :
• consumer.ssl.endpoint.identification.algorithm=https
• consumer.sasl.mechanism=PLAIN
• consumer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<api-key>" password="<api-secret>";
• consumer.security.protocol=SASL_SSL
For Source Connectors :
• producer.ssl.endpoint.identification.algorithm=https
• producer.sasl.mechanism=PLAIN
• producer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<api-key>" password="<api-secret>";
• producer.security.protocol=SASL_SSL
44
All connectors use the same credentials!
At the worker level used for internal topics :
• ssl.endpoint.identification.algorithm=https
• sasl.mechanism=PLAIN
• sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<api-key>" password="<api-secret>";
• security.protocol=SASL_SSL
Allow overrides at worker level :
• connector.client.config.override.policy=All
For source connectors, at the connector level :
• producer.override.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<api-key>" password="<api-secret>";
• producer.override.sasl.mechanism=PLAIN
• producer.override.security.protocol=SASL_SSL
For sink connectors, at the connector level :
• consumer.override.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<api-key>" password="<api-secret>";
• consumer.override.sasl.mechanism=PLAIN
• consumer.override.security.protocol=SASL_SSL
Security config with connector overrides
45
Each connector uses dedicated credentials
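A sketch of how these overrides look inside a source connector's JSON config (the connector name and credentials are illustrative; the fragment is truncated like the Before/After example on the secrets slide):
"name": "source-with-own-creds",
"config": {
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "producer.override.security.protocol": "SASL_SSL",
  "producer.override.sasl.mechanism": "PLAIN",
  "producer.override.sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"<api-key>\" password=\"<api-secret>\";",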
Externalizing Secrets - Example with
FileConfigProvider
46
● Set up your credentials file, e.g. data/foo_credentials.properties:
FOO_USERNAME="rick"
FOO_PASSWORD="n3v3r_g0nn4_g1ve_y0u_up"
● Add the ConfigProvider to your Kafka Connect worker:
kafka-connect:
environment:
CONNECT_CONFIG_PROVIDERS: 'file'
CONNECT_CONFIG_PROVIDERS_FILE_CLASS: 'org.apache.kafka.common.config.provider.FileConfigProvider'
volumes:
- ./data:/data
● Now simply replace the credentials in your connector config with placeholders for the values:
Before:
"name": "source-activemq-01",
"config": {
"connector.class": "io.confluent.connect.activemq.ActiveMQSourceConnector",
"activemq.username": "rick",
"activemq.password": "n3v3r_g0nn4_g1ve_y0u_up",
After:
"name": "source-activemq-01",
"config": {
"connector.class": "io.confluent.connect.activemq.ActiveMQSourceConnector",
"activemq.username": "${file:/data/foo_credentials.properties:FOO_USERNAME}",
"activemq.password": "${file:/data/foo_credentials.properties:FOO_PASSWORD}",
Monitoring
Monitoring - JMX Connect Metrics
In addition to system metrics (CPU, memory, I/O) and JVM metrics, you should collect Kafka
Connect metrics using JMX (added as part of KIP-196).
JMX metrics are exposed on a per-worker basis, so they need to be aggregated to get a
complete cluster view.
These metrics should be graphed over time to help you understand the load on the cluster
before you decide to add more tasks/workers.
48
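A sketch of one way to expose JMX on a worker started from the Kafka scripts (the port number is arbitrary; kafka-run-class honors the JMX_PORT environment variable):
# expose JMX on port 9999 for this worker (port is an example)
$ export JMX_PORT=9999
$ bin/connect-distributed etc/kafka/connect-distributed.properties
Useful MBeans to start with include kafka.connect:type=connect-worker-metrics (connector-count, task-count) and kafka.connect:type=connector-task-metrics,connector=([-.\w]+),task=([-.\w]+).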
Demo
Tips & Tricks
Offsets management
Kafka Connect automatically and periodically commits the progress of connectors.
Connectors restart at the last committed position.
Source connectors track offsets in a dedicated topic, configurable with the
offset.storage.topic property at the worker level.
● It needs to be a compacted topic with a high number of partitions (25 or 50) and a high
replication factor (>= 3).
● This topic tracks the offsets of the remote position (on the source side).
Sink connectors track offsets like a “standard consumer”, in Kafka’s built-in
__consumer_offsets topic.
51
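To see what a source connector has committed, the offsets topic can be read directly; the key identifies the connector and source partition, the value holds the source offset (a sketch; it assumes the default topic name connect-offsets):
# topic name assumes the default offset.storage.topic=connect-offsets
$ bin/kafka-console-consumer --bootstrap-server localhost:9092 \
    --topic connect-offsets --from-beginning --property print.key=true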
Delivery Guarantees
Kafka Connect processes each record once under normal operation,
but things can go wrong, so it can only guarantee at-least-once delivery.
There is no exactly-once support for source connectors yet, but there is a JIRA for it (KAFKA-6080).
Sink connectors can achieve exactly-once when they store offsets in the sink system.
Is a given sink connector at-least-once or exactly-once?
Some are exactly-once (HDFS, S3, Elasticsearch); it is usually stated in the
documentation of the connector itself.
But many at-least-once connectors can be configured to be idempotent.
52
Tips ?
• Use schemas to enforce forward and/or backward compatibility.
• Install connector plugins on all workers in the cluster, not just one of the workers.
● Tune the producers and consumers:
○ producer and consumer configs
○ offset commit intervals
○ poll intervals, batch sizes, linger, compression, # of tasks
• By default, source and sink connectors inherit their client configurations from the
worker configuration.
• You can enable per-connector client configuration properties (i.e. compression, batch
size, credentials) that override the default worker properties by setting the
connector.client.config.override.policy configuration parameter to All or
Principal (see the sketch below)
53
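With connector.client.config.override.policy=All on the worker, individual connectors can tune their own clients from their JSON config; a sketch with illustrative values:
For a source connector:
"producer.override.compression.type": "lz4",
"producer.override.batch.size": "131072",
"producer.override.linger.ms": "50",
For a sink connector:
"consumer.override.max.poll.records": "500",
"consumer.override.fetch.max.wait.ms": "1000",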
Thank you!
cnfl.io/meetups cnfl.io/slack cnfl.io/blog