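Kafka is a distributed streaming platform with three key capabilities.
Kafka can build streaming data pipelines that reliably move data between systems or applications.
It can also be used to build streaming applications that transform or react to streams of data.
As a messaging system, Kafka has three basic components.
A large system has to interact with many subsystems and pass messages between them. In such a system you will find source systems (message senders) and destination systems (message receivers), and moving data between them requires suitable data pipelines.
Wired point to point, these data exchanges quickly look chaotic; with a messaging system in between, the overall system becomes simpler and tidier.
Kafka has four core APIs.
As a highly scalable, fault-tolerant messaging system, Kafka has a number of basic concepts. The following introduces these Kafka-specific terms.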
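A topic is the category Kafka uses to classify messages. It acts as a label under which messages are filed and is a logical concept. A topic is comparable to a table in a database or a folder in a file system.
A partition is a subdivision of a topic: the messages of a topic are split across one or more partitions. It is a physical concept; each partition maps to a directory on a broker's disk, and each partition is a commit log. Messages are appended to a partition and read back in order, from first to last.
Note: because a topic usually contains several partitions, ordering cannot be guaranteed across the whole topic, but ordering within a single partition is guaranteed. Messages are appended to the tail of each partition. Kafka uses partitions to achieve data redundancy and scalability.
Partitions can be distributed across different servers, so a single topic can span multiple servers and deliver more performance than one server could.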
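The ordering guarantee can be illustrated with a toy model. The sketch below is not Kafka code: `TopicLogSketch` and its methods are hypothetical names, and a partition is reduced to an in-memory list. It only shows that each partition is an append-only log with its own offsets, so order is kept per partition but not across the topic.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a topic: a fixed number of append-only partition logs (assumption, not Kafka's storage code). */
final class TopicLogSketch {
    private final List<List<String>> partitions = new ArrayList<>();

    TopicLogSketch(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** Appends to the tail of one partition and returns the offset assigned to the message. */
    long append(int partition, String message) {
        List<String> log = partitions.get(partition);
        log.add(message);
        return log.size() - 1;
    }

    /** Reads one partition from the given offset, in append order. */
    List<String> read(int partition, int fromOffset) {
        List<String> log = partitions.get(partition);
        return new ArrayList<>(log.subList(fromOffset, log.size()));
    }

    public static void main(String[] args) {
        TopicLogSketch topic = new TopicLogSketch(2);
        topic.append(0, "m1");
        topic.append(1, "m2");
        topic.append(0, "m3");
        System.out.println(topic.read(0, 0)); // prints [m1, m3]
        System.out.println(topic.read(1, 0)); // prints [m2]
    }
}
```

A reader of partition 0 sees m1 before m3, but nothing in the model says whether m2 was written before or after m3, which is exactly the topic-wide ordering that is not guaranteed.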
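A segment is a further subdivision of a partition: each partition is split into several segment files. A new segment is started once the active one reaches the configured maximum size (log.segment.bytes), so closed segments end up roughly the same size.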
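As a rough illustration of size-based rolling, the sketch below tracks only segment sizes. `SegmentRollSketch` is a hypothetical class; the real broker also rolls segments on time and index limits, which are left out here.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified size-based segment rolling; only byte counts are tracked (assumption for illustration). */
final class SegmentRollSketch {
    private final long maxSegmentBytes;
    private final List<Long> segmentSizes = new ArrayList<>();

    SegmentRollSketch(long maxSegmentBytes) {
        this.maxSegmentBytes = maxSegmentBytes;
        segmentSizes.add(0L); // the first active segment
    }

    /** Appends a record, starting a new segment when the active one would exceed the limit. */
    void append(long recordBytes) {
        int active = segmentSizes.size() - 1;
        long current = segmentSizes.get(active);
        if (current > 0 && current + recordBytes > maxSegmentBytes) {
            segmentSizes.add(recordBytes); // roll: the record opens a new segment
        } else {
            segmentSizes.set(active, current + recordBytes);
        }
    }

    List<Long> segmentSizes() {
        return new ArrayList<>(segmentSizes);
    }

    public static void main(String[] args) {
        SegmentRollSketch log = new SegmentRollSketch(100);
        for (int i = 0; i < 5; i++) {
            log.append(40);
        }
        System.out.println(log.segmentSizes()); // prints [80, 80, 40]
    }
}
```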
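A Kafka cluster consists of one or more servers, each of which is called a broker. A broker receives messages from producers, assigns offsets to them, and commits them to disk. It also serves consumers by responding to fetch requests for partitions and returning messages that have been committed to disk.
Brokers are the members of a cluster. In every cluster one broker also acts as the cluster controller, elected from the active members; any member may be elected. The controller handles administrative work, including assigning partitions to brokers and monitoring brokers. Within the cluster each partition is owned by one broker, the leader of that partition, but a partition can also be assigned to several other brokers (followers), in which case the partition is replicated. Replication gives the partition message redundancy: if a broker fails, one of the remaining in-sync replicas is elected as the new leader and takes over.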
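The failover idea can be sketched in a few lines. This is a simplification under stated assumptions, not the controller's actual algorithm: `LeaderFailoverSketch.electLeader` is a hypothetical helper that picks the first replica in the assignment list that is both in sync and alive.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified partition leader failover (assumption: first live in-sync replica wins). */
final class LeaderFailoverSketch {
    /** Returns the broker id of the new leader, or -1 if no replica qualifies. */
    static int electLeader(List<Integer> replicas, Set<Integer> inSync, Set<Integer> alive) {
        for (int broker : replicas) {
            if (inSync.contains(broker) && alive.contains(broker)) {
                return broker;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Integer> replicas = Arrays.asList(1, 2, 3);
        Set<Integer> inSync = new HashSet<>(replicas);
        Set<Integer> alive = new HashSet<>(replicas);
        System.out.println(electLeader(replicas, inSync, alive)); // prints 1
        alive.remove(1); // broker 1 fails
        System.out.println(electLeader(replicas, inSync, alive)); // prints 2
    }
}
```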
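A producer publishes messages: it writes the messages of a topic to that topic's partitions. By default a producer spreads messages evenly across all partitions of the topic and does not care which partition a particular message lands in. In some cases, however, the producer writes a message directly to a specified partition.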
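The three ways a record can end up in a partition can be sketched as follows. `PartitionChoiceSketch` is a hypothetical class, and two details are deliberate simplifications: `Arrays.hashCode` stands in for the murmur2 hash Kafka applies to the serialized key, and plain round-robin stands in for the sticky strategy recent clients use for records without a key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Simplified partition selection: explicit partition, then key hash, then round-robin. */
final class PartitionChoiceSketch {
    private int roundRobinCounter = 0;

    int choose(Integer explicitPartition, String key, int partitionCount) {
        if (explicitPartition != null) {
            return explicitPartition; // the caller named the partition
        }
        if (key != null) {
            // Stand-in hash (assumption); Kafka uses murmur2 on the serialized key bytes.
            int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
            return (hash & 0x7fffffff) % partitionCount;
        }
        // No key: spread records evenly over the partitions.
        return Math.floorMod(roundRobinCounter++, partitionCount);
    }

    public static void main(String[] args) {
        PartitionChoiceSketch chooser = new PartitionChoiceSketch();
        System.out.println(chooser.choose(0, null, 3));    // explicit partition
        System.out.println(chooser.choose(null, "a", 3));  // same key, same partition every time
        System.out.println(chooser.choose(null, null, 3)); // round-robin
    }
}
```

Whatever hash is used, the useful property is the same: records that share a key always map to one partition, so they keep their relative order.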
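A consumer reads messages. One consumer can consume messages from several topics. Within a consumer group, each partition of a topic is consumed by exactly one consumer of the group, although one consumer may be assigned several partitions.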
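A minimal sketch of how partitions might be divided inside one consumer group is shown below. `GroupAssignmentSketch.assign` is a hypothetical helper using plain round-robin; Kafka's real assignors (range, round-robin, sticky) are more elaborate, but they preserve the same invariant that a partition has a single owner within the group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Round-robin assignment of partitions to the consumers of one group (illustrative only). */
final class GroupAssignmentSketch {
    static Map<String, List<Integer>> assign(List<String> consumers, int partitionCount) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String consumer : consumers) {
            result.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = consumers.get(partition % consumers.size());
            result.get(owner).add(partition); // each partition gets exactly one owner
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(assign(Arrays.asList("c1", "c2"), 3)); // prints {c1=[0, 2], c2=[1]}
    }
}
```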
mkdir /opt/soft
cd /opt/soft
wget https://downloads.apache.org/kafka/3.6.1/kafka_2.13-3.6.1.tgz
tar -zxvf kafka_2.13-3.6.1.tgz
mv kafka_2.13-3.6.1 kafka
vim /etc/profile.d/my_env.sh
export KAFKA_HOME=/opt/soft/kafka
export PATH=$PATH:$KAFKA_HOME/bin
The configuration files are stored in the kafka/config directory.
vim /opt/soft/kafka/config/server.properties
Three parameters mainly need to be changed:
broker.id=1 (note: every node must use a different id)
log.dirs=/tmp/kafka-logs, change to log.dirs=/opt/soft/kafka/kafka-logs
zookeeper.connect=localhost:2181, change to
zookeeper.connect=spark01:2181,spark02:2181,spark03:2181/kafka
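The connect string has two parts: a comma-separated host list and an optional chroot path that is written once, after the last host. A small sketch of that structure follows; `ZkConnectSketch` is a hypothetical helper, not part of Kafka or ZooKeeper.

```java
import java.util.Arrays;

/** Splits a ZooKeeper connect string into its host list and optional chroot path. */
final class ZkConnectSketch {
    static String[] hosts(String connect) {
        int slash = connect.indexOf('/');
        String hostPart = slash < 0 ? connect : connect.substring(0, slash);
        return hostPart.split(",");
    }

    /** Returns the chroot path, or "/" when none is given. */
    static String chroot(String connect) {
        int slash = connect.indexOf('/');
        return slash < 0 ? "/" : connect.substring(slash);
    }

    public static void main(String[] args) {
        String connect = "spark01:2181,spark02:2181,spark03:2181/kafka";
        System.out.println(Arrays.toString(hosts(connect))); // prints [spark01:2181, spark02:2181, spark03:2181]
        System.out.println(chroot(connect));                 // prints /kafka
    }
}
```

With the /kafka chroot, all of Kafka's znodes live under /kafka instead of the ZooKeeper root, which keeps them apart from other applications sharing the ensemble.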
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This configuration file is intended for use in ZK-based mode, where Apache ZooKeeper is required.
# See kafka.server.KafkaConfig for additional details and defaults
#
############################# Server Basics #############################
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1
############################# Socket Server Settings #############################
# The address the socket server listens on. If not configured, the host name will be equal to the value of
# java.net.InetAddress.getCanonicalHostName(), with PLAINTEXT listener name, and port 9092.
# FORMAT:
# listeners = listener_name://host_name:port
# EXAMPLE:
# listeners = PLAINTEXT://your.host.name:9092
#listeners=PLAINTEXT://:9092
# Listener name, hostname and port the broker will advertise to clients.
# If not set, it uses the value for "listeners".
#advertised.listeners=PLAINTEXT://your.host.name:9092
# Maps listener names to security protocols, the default is for them to be the same. See the config documentation for more details
#listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=3
# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8
# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400
# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400
# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600
############################# Log Basics #############################
# A comma separated list of directories under which to store log files
log.dirs=/opt/soft/kafka/kafka-logs
# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1
# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1
############################# Internal Topic Settings #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
############################# Log Flush Policy #############################
# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
# 1. Durability: Unflushed data may be lost if you are not using replication.
# 2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
# 3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.
# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000
# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000
############################# Log Retention Policy #############################
# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.
# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168
# A size-based retention policy for logs. Segments are pruned from the log unless the remaining
# segments drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824
# The maximum size of a log segment file. When this size is reached a new log segment will be created.
#log.segment.bytes=1073741824
# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000
############################# Zookeeper #############################
# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=spark01:2181,spark02:2181,spark03:2181/kafka
# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=18000
############################# Group Coordinator Settings #############################
# The following configuration specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance.
# The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms.
# The default value for this is 3 seconds.
# We override this to 0 here as it makes for a better out-of-the-box experience for development and testing.
# However, in production environments the default value of 3 seconds is more suitable as this will help to avoid unnecessary, and potentially expensive, rebalances during application startup.
group.initial.rebalance.delay.ms=0
scp -r /opt/soft/kafka root@spark02:/opt/soft
scp -r /opt/soft/kafka root@spark03:/opt/soft
scp /etc/profile.d/my_env.sh root@spark02:/etc/profile.d
scp /etc/profile.d/my_env.sh root@spark03:/etc/profile.d
Refresh the environment variables on all nodes
source /etc/profile
Start Kafka on each node (each node's server.properties must carry its own broker.id)
kafka-server-start.sh -daemon /opt/soft/kafka/config/server.properties
kafka-server-stop.sh
vim kafka-service.sh
#!/bin/bash
case $1 in
"start"){
for i in spark01 spark02 spark03
do
echo ------------- kafka $i start ------------
ssh $i "/opt/soft/kafka/bin/kafka-server-start.sh -daemon /opt/soft/kafka/config/server.properties"
done
}
;;
"stop"){
for i in spark01 spark02 spark03
do
echo ------------- kafka $i stop ------------
ssh $i "/opt/soft/kafka/bin/kafka-server-stop.sh"
done
}
esac
mkdir /opt/soft
cd /opt/soft
wget https://downloads.apache.org/kafka/3.6.1/kafka_2.13-3.6.1.tgz
tar -zxvf kafka_2.13-3.6.1.tgz
mv kafka_2.13-3.6.1 kafka
vim /etc/profile.d/my_env.sh
export KAFKA_HOME=/opt/soft/kafka
export PATH=$PATH:$KAFKA_HOME/bin
The configuration files are stored in the kafka/config/kraft directory.
vim /opt/soft/kafka/config/kraft/server.properties
The main parameters to change are:
- process.roles=broker,controller
- node.id=1 (note: every node must use a different id)
- controller.quorum.voters=1@localhost:9093, change to controller.quorum.voters=1@spark01:9093,2@spark02:9093,3@spark03:9093
- advertised.listeners=PLAINTEXT://localhost:9092, change to advertised.listeners=PLAINTEXT://spark01:9092 (on each node, use that node's own host name)
- log.dirs=/tmp/kraft-combined-logs, change to log.dirs=/opt/soft/kafka/kraft-combined-logs
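Each entry of controller.quorum.voters has the form id@host:port, where the id must match the node.id of the corresponding node. The sketch below parses such a value into a map so that a malformed entry is caught early; `QuorumVotersSketch` is a hypothetical helper and not part of Kafka.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Parses a controller.quorum.voters value of the form id@host:port,id@host:port,... */
final class QuorumVotersSketch {
    static Map<Integer, String> parse(String voters) {
        Map<Integer, String> result = new LinkedHashMap<>();
        for (String entry : voters.split(",")) {
            String[] parts = entry.trim().split("@", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected id@host:port but got: " + entry);
            }
            // parseInt rejects anything that is not a plain node id
            result.put(Integer.parseInt(parts[0]), parts[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("1@spark01:9093,2@spark02:9093,3@spark03:9093"));
        // prints {1=spark01:9093, 2=spark02:9093, 3=spark03:9093}
    }
}
```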
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This configuration file is intended for use in KRaft mode, where
# Apache ZooKeeper is not present. See config/kraft/README.md for details.
#
############################# Server Basics #############################
# The role of this server. Setting this puts us in KRaft mode
process.roles=broker,controller
# The node id associated with this instance's roles
node.id=1
# The connect string for the controller quorum
controller.quorum.voters=1@spark01:9093,2@spark02:9093,3@spark03:9093
############################# Socket Server Settings #############################
# The address the socket server listens on.
# Combined nodes (i.e. those with `process.roles=broker,controller`) must list the controller listener here at a minimum.
# If the broker listener is not defined, the default listener will use a host name that is equal to the value of java.net.InetAddress.getCanonicalHostName(),
# with PLAINTEXT listener name, and port 9092.
# FORMAT:
# listeners = listener_name://host_name:port
# EXAMPLE:
# listeners = PLAINTEXT://your.host.name:9092
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
# Name of listener used for communication between brokers.
inter.broker.listener.name=PLAINTEXT
# Listener name, hostname and port the broker will advertise to clients.
# If not set, it uses the value for "listeners".
advertised.listeners=PLAINTEXT://spark01:9092
# A comma-separated list of the names of the listeners used by the controller.
# If no explicit mapping set in `listener.security.protocol.map`, default will be using PLAINTEXT protocol
# This is required if running in KRaft mode.
controller.listener.names=CONTROLLER
# Maps listener names to security protocols, the default is for them to be the same. See the config documentation for more details
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=3
# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8
# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400
# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400
# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600
############################# Log Basics #############################
# A comma separated list of directories under which to store log files
log.dirs=/opt/soft/kafka/kraft-combined-logs
# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1
# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1
############################# Internal Topic Settings #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
############################# Log Flush Policy #############################
# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
# 1. Durability: Unflushed data may be lost if you are not using replication.
# 2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
# 3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.
# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000
# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000
############################# Log Retention Policy #############################
# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.
# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168
# A size-based retention policy for logs. Segments are pruned from the log unless the remaining
# segments drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824
# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824
# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000
scp -r /opt/soft/kafka root@spark02:/opt/soft
scp -r /opt/soft/kafka root@spark03:/opt/soft
scp /etc/profile.d/my_env.sh root@spark02:/etc/profile.d
scp /etc/profile.d/my_env.sh root@spark03:/etc/profile.d
Refresh the environment variables on all nodes
source /etc/profile
kafka-storage.sh random-uuid
Generated result:
JfRaZDSORA2xK8pMSCa9AQ
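The 22-character value is consistent with a 128-bit UUID written in URL-safe Base64 without padding. The sketch below encodes and decodes that form; it is an assumption based on the shape of the value above, `ClusterIdSketch` is a hypothetical class, and it is not taken from Kafka's source.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

/** Encodes a 128-bit UUID as 22 URL-safe Base64 characters and back (assumed cluster ID form). */
final class ClusterIdSketch {
    static String encode(UUID uuid) {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.putLong(uuid.getMostSignificantBits());
        buffer.putLong(uuid.getLeastSignificantBits());
        return Base64.getUrlEncoder().withoutPadding().encodeToString(buffer.array());
    }

    static UUID decode(String id) {
        ByteBuffer buffer = ByteBuffer.wrap(Base64.getUrlDecoder().decode(id));
        return new UUID(buffer.getLong(), buffer.getLong());
    }

    public static void main(String[] args) {
        String id = encode(UUID.randomUUID());
        System.out.println(id + " (" + id.length() + " characters)");
        System.out.println(decode("JfRaZDSORA2xK8pMSCa9AQ"));
    }
}
```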
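Note: the ID only needs to be generated once, but the following format command must be run once on every node with that same ID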
kafka-storage.sh format -t JfRaZDSORA2xK8pMSCa9AQ \
-c /opt/soft/kafka/config/kraft/server.properties
Output:
Formatting /opt/soft/kraft-combined-logs with metadata.version 3.4-IV0.
Start Kafka on each node
kafka-server-start.sh -daemon /opt/soft/kafka/config/kraft/server.properties
kafka-server-stop.sh
vim kafka-service.sh
#!/bin/bash
case $1 in
"start"){
for i in spark01 spark02 spark03
do
echo ------------- kafka $i start ------------
ssh $i "/opt/soft/kafka/bin/kafka-server-start.sh -daemon /opt/soft/kafka/config/kraft/server.properties"
done
}
;;
"stop"){
for i in spark01 spark02 spark03
do
echo ------------- kafka $i stop ------------
ssh $i "/opt/soft/kafka/bin/kafka-server-stop.sh"
done
}
esac
kafka-topics.sh
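Option | Description |
---|---|
--bootstrap-server <String: server to connect to> | Host name and port of the Kafka broker to connect to |
--topic <String: topic> | Name of the topic to operate on |
--create | Create a topic |
--delete | Delete a topic |
--alter | Alter a topic |
--list | List all topics |
--describe | Show detailed information about a topic |
--partitions <Integer: # of partitions> | Set the number of partitions |
--replication-factor <Integer: replication factor> | Set the number of replicas per partition |
--config <String: name=value> | Override a default configuration value for the topic |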
kafka-topics.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 --list
Options used when creating a topic:
--topic sets the topic name
--partitions sets the number of partitions
--replication-factor sets the number of replicas
kafka-topics.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 \
--topic lihaozhe --create --partitions 1 --replication-factor 3
kafka-topics.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 \
--describe --topic lihaozhe
Output:
Topic: lihaozhe TopicId: kJWVrG0xQQSaFcrWGMYEGg PartitionCount: 1 ReplicationFactor: 3 Configs:
Topic: lihaozhe Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
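Note:
- The number of partitions can only be increased, never decreased
- The replication factor cannot be changed with kafka-topics.sh --alter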
kafka-topics.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 \
--alter --topic lihaozhe --partitions 3
After the command succeeds, describing the topic again gives:
Topic: lihaozhe TopicId: kJWVrG0xQQSaFcrWGMYEGg PartitionCount: 3 ReplicationFactor: 3 Configs:
Topic: lihaozhe Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: lihaozhe Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1
Topic: lihaozhe Partition: 2 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2
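The replica lists in this output follow a rotation: partition p starts at the p-th broker and wraps around. The sketch below reproduces only that pattern. `ReplicaRotationSketch` is a hypothetical helper; Kafka's actual assignment also uses a random starting broker and rack awareness, so other clusters may show a different layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Reproduces the rotating replica layout shown in the describe output (pattern only). */
final class ReplicaRotationSketch {
    static List<Integer> replicasFor(int partition, List<Integer> brokers, int replicationFactor) {
        List<Integer> replicas = new ArrayList<>();
        for (int r = 0; r < replicationFactor; r++) {
            replicas.add(brokers.get((partition + r) % brokers.size()));
        }
        return replicas; // the first entry is the preferred leader
    }

    public static void main(String[] args) {
        List<Integer> brokers = Arrays.asList(1, 2, 3);
        for (int partition = 0; partition < 3; partition++) {
            System.out.println("Partition " + partition + " Replicas " + replicasFor(partition, brokers, 3));
        }
        // prints:
        // Partition 0 Replicas [1, 2, 3]
        // Partition 1 Replicas [2, 3, 1]
        // Partition 2 Replicas [3, 1, 2]
    }
}
```

Rotating the starting broker spreads partition leaders across the cluster, so no single broker handles all produce and fetch traffic for the topic.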
kafka-console-producer.sh
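The first two rows are command-line options. The remaining rows are producer configuration properties; with the console producer they can be passed as --producer-property name=value.
Option or property | Description |
---|---|
--bootstrap-server <String: server to connect to> | Host name and port of the Kafka broker to connect to |
--topic <String: topic> | Name of the topic to operate on |
key.serializer | Serializer class for the message key; the fully qualified class name is required |
value.serializer | Serializer class for the message value; the fully qualified class name is required |
buffer.memory | Total size of the RecordAccumulator buffer, default 32 MB |
batch.size | Maximum size of one batch in the buffer, default 16 KB. Raising it moderately can improve throughput, but a value that is too large increases delivery latency |
linger.ms | If a batch has not reached batch.size, the sender sends it anyway after waiting linger.ms. Unit: ms; the default is 0 ms, meaning no delay. A value between 5 and 100 ms is suggested for production |
acks | 0: the producer does not wait for any acknowledgement. 1: the leader acknowledges once it has received the data. -1 (all): acknowledged once the leader and all replicas in the ISR have received the data. The default is -1; -1 and all are equivalent |
max.in.flight.requests.per.connection | Maximum number of unacknowledged requests per connection, default 5. With idempotence enabled the value must be between 1 and 5 |
retries | Number of times a failed send is retried. The default is the int maximum, 2147483647. If retries are used with idempotence disabled, preserving message order requires max.in.flight.requests.per.connection=1; otherwise a later message may succeed while a failed one is being retried |
retry.backoff.ms | Interval between two retries, default 100 ms |
enable.idempotence | Whether idempotence is enabled; default true |
compression.type | Compression applied to all data the producer sends. Default none (no compression). Supported types: none, gzip, snappy, lz4 and zstd |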
kafka-console-producer.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 --topic lihaozhe
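The three acks levels can be summarized as a decision function. This is a simplified model for illustration: `AcksSketch.acknowledged` and its parameters are hypothetical, and broker-side settings such as min.insync.replicas are ignored.

```java
/** Simplified model of when a send counts as acknowledged for each acks setting. */
final class AcksSketch {
    static boolean acknowledged(String acks, boolean leaderHasData, int isrSize, int isrWithData) {
        switch (acks) {
            case "0":
                return true; // fire and forget: the producer never waits
            case "1":
                return leaderHasData; // the leader alone confirms
            case "-1":
            case "all":
                return leaderHasData && isrWithData == isrSize; // every in-sync replica has the data
            default:
                throw new IllegalArgumentException("Unknown acks value: " + acks);
        }
    }

    public static void main(String[] args) {
        // The leader has the record, but only 2 of 3 in-sync replicas have it so far.
        System.out.println(acknowledged("0", true, 3, 2));   // prints true
        System.out.println(acknowledged("1", true, 3, 2));   // prints true
        System.out.println(acknowledged("all", true, 3, 2)); // prints false
    }
}
```

The trade-off runs in one direction: each step from 0 to 1 to all adds latency and removes a way to lose data.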
kafka-console-consumer.sh
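Option | Description |
---|---|
--bootstrap-server <String: server to connect to> | Host name and port of the Kafka broker to connect to |
--topic <String: topic> | Name of the topic to operate on |
--from-beginning | Consume from the beginning of the topic |
--group <String: consumer group id> | Name of the consumer group |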
kafka-console-consumer.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 \
--topic lihaozhe
Including historical data
kafka-console-consumer.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 \
--topic lihaozhe --from-beginning
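RecordAccumulator: every producer maintains a fixed-size memory area, used mainly to merge individual messages into batches before sending, which raises throughput and reduces bandwidth consumption.
The size of the RecordAccumulator is configurable through buffer.memory; the default is 33554432 bytes (32 MB).
The memory of the RecordAccumulator is divided into two parts.
The first part is memory in use, which mainly holds a number of queues.
One queue is created for each partition of each topic and holds the batches of messages waiting to be sent to that partition.
The second part is unused memory, which consists of pooled memory and the remaining non-pooled memory (nonPooledAvailableMemory).
The pooled memory is a set of ByteBuffers sized according to batch.size (default 16 KB),
kept in a queue for reuse. All remaining space forms the non-pooled free space.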
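The queue-per-partition structure can be sketched as below. It is a heavily reduced model: `AccumulatorSketch` is a hypothetical class, batches are plain lists of strings, and the ByteBuffer pool is not modeled at all. It only shows how records for one partition are grouped into batches bounded by batch.size, and when a batch becomes eligible for sending.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Reduced model of the accumulator: one queue of batches per topic-partition. */
final class AccumulatorSketch {
    private final int batchSize;
    private final Map<String, Deque<List<String>>> queues = new HashMap<>();
    private final Map<String, Integer> openBatchBytes = new HashMap<>();

    AccumulatorSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Adds a record to the last batch of its partition, opening a new batch when it would overflow. */
    void append(String topicPartition, String record) {
        Deque<List<String>> queue = queues.computeIfAbsent(topicPartition, k -> new ArrayDeque<>());
        int size = record.getBytes(StandardCharsets.UTF_8).length;
        int used = openBatchBytes.getOrDefault(topicPartition, 0);
        if (queue.isEmpty() || used + size > batchSize) {
            queue.addLast(new ArrayList<>());
            used = 0;
        }
        queue.peekLast().add(record);
        openBatchBytes.put(topicPartition, used + size);
    }

    int batchCount(String topicPartition) {
        Deque<List<String>> queue = queues.get(topicPartition);
        return queue == null ? 0 : queue.size();
    }

    /** A batch may be sent once it is full or has waited at least linger.ms. */
    static boolean ready(int batchBytes, int batchSize, long waitedMs, long lingerMs) {
        return batchBytes >= batchSize || waitedMs >= lingerMs;
    }

    public static void main(String[] args) {
        AccumulatorSketch accumulator = new AccumulatorSketch(10);
        accumulator.append("lihaozhe-0", "aaaa");
        accumulator.append("lihaozhe-0", "bbbb");
        accumulator.append("lihaozhe-0", "cccc"); // would exceed 10 bytes, so a second batch is opened
        System.out.println(accumulator.batchCount("lihaozhe-0")); // prints 2
        System.out.println(ready(8, 10, 0, 5)); // prints false: not full, linger.ms not reached
        System.out.println(ready(8, 10, 5, 5)); // prints true: linger.ms elapsed
    }
}
```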
vim file2kafka.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /root/data/app.*
a1.sources.r1.positionFile = /root/flume/taildir_positon.json
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = spark01:9092,spark02:9092,spark03:9092
a1.sinks.k1.kafka.topic = lihaozhe
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start Flume
flume-ng agent -n a1 -c conf -f file2kafka.conf
vim kafka2log.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 50
a1.sources.r1.batchDurationMillis = 200
a1.sources.r1.kafka.bootstrap.servers = spark01:9092,spark02:9092,spark03:9092
a1.sources.r1.kafka.topics = lihaozhe
a1.sources.r1.kafka.consumer.group.id = custom.g.id
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start Flume
flume-ng agent -n a1 -c conf -f kafka2log.conf -Dflume.root.logger=INFO,console
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.lihaozhe</groupId>
<artifactId>kafka-code</artifactId>
<version>1.0.0</version>
<packaging>jar</packaging>
<name>kafka</name>
<url>http://maven.apache.org</url>
<properties>
<jdk.version>8</jdk.version>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<maven.test.failure.ignore>true</maven.test.failure.ignore>
<maven.test.skip>true</maven.test.skip>
</properties>
<dependencies>
<!-- junit-jupiter-api -->
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-api</artifactId>
<version>5.10.1</version>
<scope>test</scope>
</dependency>
<!-- junit-jupiter-engine -->
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-engine</artifactId>
<version>5.10.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.20</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>2.20.0</version>
</dependency>
<dependency>
<groupId>com.alibaba.fastjson2</groupId>
<artifactId>fastjson2</artifactId>
<version>2.0.31</version>
</dependency>
<dependency>
<groupId>com.github.binarywang</groupId>
<artifactId>java-testdata-generator</artifactId>
<version>1.1.2</version>
</dependency>
<dependency>
<groupId>com.mysql</groupId>
<artifactId>mysql-connector-j</artifactId>
<version>8.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>3.6.1</version>
</dependency>
</dependencies>
<build>
<finalName>${project.name}</finalName>
<!--<outputDirectory>../package</outputDirectory>-->
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.11.0</version>
<configuration>
<!-- Source file encoding used by the compiler -->
<encoding>UTF-8</encoding>
<!-- JDK version used for compilation -->
<source>${jdk.version}</source>
<target>${jdk.version}</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-clean-plugin</artifactId>
<version>3.2.0</version>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-resources-plugin</artifactId>
<version>3.3.1</version>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-war-plugin</artifactId>
<version>3.3.2</version>
</plugin>
<!-- Skip JUnit tests when packaging -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>3.2.2</version>
<configuration>
<skip>true</skip>
</configuration>
</plugin>
</plugins>
</build>
</project>
com.lihaozhe.producer.AsyncProducer
package com.lihaozhe.producer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;
/**
 * Producer that sends data to a topic asynchronously, without a callback.
 * Start a console consumer beforehand with:
 * kafka-console-consumer.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 --topic lihaozhe
 *
 * @author 李昊哲
 * @version 1.0.0
 */
public class AsyncProducer {
    public static void main(String[] args) {
        // 1. Basic configuration
        Properties properties = new Properties();
        // Cluster to connect to (bootstrap.servers)
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "spark01:9092,spark02:9092,spark03:9092");
        // Serializer classes for the key and the value (key.serializer, value.serializer)
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // 2. Create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        // 3. Send the data; send() returns immediately
        for (int i = 0; i < 5; i++) {
            producer.send(new ProducerRecord<>("lihaozhe", "李昊哲" + i));
        }
        // 4. Release resources; close() waits for buffered records to be sent
        producer.close();
        System.out.println("success");
    }
}
com.lihaozhe.producer.SyncProducer
package com.lihaozhe.producer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
/**
 * Producer that sends data to a topic synchronously, without a callback.
 * Start a console consumer beforehand with:
 * kafka-console-consumer.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 --topic lihaozhe
 *
 * @author 李昊哲
 * @version 1.0.0
 */
public class SyncProducer {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        // 1. Basic configuration
        Properties properties = new Properties();
        // Cluster to connect to (bootstrap.servers)
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "spark01:9092,spark02:9092,spark03:9092");
        // Serializer classes for the key and the value (key.serializer, value.serializer)
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // 2. Create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        // 3. Send the data; get() blocks until each record is acknowledged
        for (int i = 0; i < 5; i++) {
            producer.send(new ProducerRecord<>("lihaozhe", "李昊哲" + i)).get();
        }
        // 4. Release resources
        producer.close();
        System.out.println("success");
    }
}
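The difference between the asynchronous and the synchronous style comes down to what the caller does with the Future returned by send(). The sketch below shows this with the JDK alone: `fakeSend` is a hypothetical stand-in for producer.send() that completes on another thread, so no Kafka cluster or client library is needed to run it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

/** Contrasts callback-style and blocking use of a future, with a fake send (no Kafka involved). */
final class SendModesSketch {
    /** Hypothetical stand-in for producer.send(): completes asynchronously with the record offset. */
    static CompletableFuture<Long> fakeSend(long offset) {
        return CompletableFuture.supplyAsync(() -> offset);
    }

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        // Asynchronous with a callback: register the handler and continue immediately.
        CompletableFuture<Void> done = fakeSend(0L)
                .thenAccept(offset -> System.out.println("callback got offset " + offset));

        // Synchronous: get() blocks until this one record is acknowledged.
        long offset = fakeSend(1L).get();
        System.out.println("blocking send got offset " + offset);

        done.get(); // make sure the callback has run before the JVM exits
    }
}
```

Blocking on every record gives the simplest error handling but limits throughput to one round trip per record; the callback style keeps the pipeline full at the cost of handling failures out of line.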
com.lihaozhe.producer.AsyncProducerCallback
package com.lihaozhe.producer;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;
/**
 * Producer that sends data to a topic asynchronously, with a callback.
 * Start a console consumer beforehand with:
 * kafka-console-consumer.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 --topic lihaozhe
 *
 * @author 李昊哲
 * @version 1.0.0
 */
public class AsyncProducerCallback {
    public static void main(String[] args) throws InterruptedException {
        // 1. Basic configuration
        Properties properties = new Properties();
        // Cluster to connect to (bootstrap.servers)
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "spark01:9092,spark02:9092,spark03:9092");
        // Serializer classes for the key and the value (key.serializer, value.serializer)
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // 2. Create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        // 3. Send the data; the callback runs once the record is acknowledged or has failed
        for (int i = 0; i < 500; i++) {
            producer.send(new ProducerRecord<>("lihaozhe", "李昊哲" + i), (metadata, exception) -> {
                if (exception == null) {
                    System.out.println("topic: " + metadata.topic() + "\tpartition: " + metadata.partition());
                } else {
                    // Report failed sends instead of dropping them silently
                    exception.printStackTrace();
                }
            });
            Thread.sleep(2);
        }
        // 4. Release resources
        producer.close();
        System.out.println("success");
    }
}
com.lihaozhe.producer.AsyncProducerCallbackPartitions01
package com.lihaozhe.producer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;
/**
 * Producer that sends data to a topic asynchronously, with a callback.
 * The target partition is specified explicitly.
 * Start a console consumer beforehand with:
 * kafka-console-consumer.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 --topic lihaozhe
 *
 * @author 李昊哲
 * @version 1.0.0
 */
public class AsyncProducerCallbackPartitions01 {
    public static void main(String[] args) throws InterruptedException {
        // 1. Basic configuration
        Properties properties = new Properties();
        // Cluster to connect to (bootstrap.servers)
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "spark01:9092,spark02:9092,spark03:9092");
        // Serializer classes for the key and the value (key.serializer, value.serializer)
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // 2. Create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        // 3. Send the data
        for (int i = 0; i < 500; i++) {
            // Arguments: topic, partition, key, value
            producer.send(new ProducerRecord<>("lihaozhe", 0, null, "李昊哲" + i), (metadata, exception) -> {
                if (exception == null) {
                    System.out.println("topic: " + metadata.topic() + "\tpartition: " + metadata.partition());
                } else {
                    // Report failed sends instead of dropping them silently
                    exception.printStackTrace();
                }
            });
            Thread.sleep(2);
        }
        // 4. Release resources
        producer.close();
        System.out.println("success");
    }
}
com.lihaozhe.producer.AsyncProducerCallbackPartitions02
package com.lihaozhe.producer;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

/**
 * Asynchronously sends records to a topic, with a callback.
 * The partition is derived from the record key: the key's hash modulo the number of partitions.
 * Start a console consumer beforehand to watch the output:
 * kafka-console-consumer.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 --topic lihaozhe
 *
 * @author 李昊哲
 * @version 1.0.0
 */
public class AsyncProducerCallbackPartitions02 {
    public static void main(String[] args) throws InterruptedException {
        // 1. Basic configuration
        Properties properties = new Properties();
        // Connect to the cluster (bootstrap.servers)
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "spark01:9092,spark02:9092,spark03:9092");
        // Serializer classes for the record key and value (key.serializer, value.serializer)
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // 2. Create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        // 3. Send the records
        for (int i = 0; i < 500; i++) {
            // Every record uses the key "a", so all of them land in the same partition.
            // Kafka hashes the serialized key bytes with murmur2; it does not use String.hashCode() (97 for "a").
            producer.send(new ProducerRecord<>("lihaozhe", "a", "李昊哲" + i), (metadata, exception) -> {
                if (exception == null) {
                    System.out.println("topic: " + metadata.topic() + "\tpartition: " + metadata.partition());
                } else {
                    // Report the failure instead of dropping it silently
                    exception.printStackTrace();
                }
            });
            Thread.sleep(2);
        }
        // 4. Release resources; close() waits for pending sends to complete
        producer.close();
        System.out.println("success");
    }
}
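The key-based routing used above can be illustrated without a broker. The following is a minimal sketch, assuming a stand-in hash (`Arrays.hashCode` over the UTF-8 key bytes); Kafka's default partitioner applies murmur2 to the serialized key instead, so the partition numbers printed here will generally differ from what the cluster reports. The names `KeyPartitionSketch` and `partitionFor` are hypothetical and not part of the Kafka API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative only: maps a record key to a partition by hashing and taking the remainder. */
final class KeyPartitionSketch {

    /**
     * Returns a partition in [0, numPartitions) for the given key.
     * Assumption: Arrays.hashCode stands in for the murmur2 hash Kafka applies to the key bytes.
     */
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);
        // Clear the sign bit so the remainder is never negative
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 3;
        // The same key always yields the same partition; different keys may or may not collide
        for (String key : new String[] {"a", "a", "b", "c"}) {
            System.out.println("key=" + key + " -> partition " + partitionFor(key, partitions));
        }
    }
}
```

The point of the sketch is the shape of the computation: because the partition depends only on the key and the partition count, all records with the same key stay in one partition and keep their relative order.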
Custom partitioner class
com.lihaozhe.producer.MyPartitioner
package com.lihaozhe.producer;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

import java.util.Map;

/**
 * Custom partitioner that routes records by the content of their value.
 * It returns partitions 0, 1 and 2, so the target topic needs at least three partitions.
 *
 * @author 李昊哲
 * @version 1.0.0
 */
public class MyPartitioner implements Partitioner {
    /**
     * @param topic      The topic name
     * @param key        The key to partition on (or null if no key)
     * @param keyBytes   The serialized key to partition on (or null if no key)
     * @param value      The value to partition on or null
     * @param valueBytes The serialized value to partition on or null
     * @param cluster    The current cluster metadata
     * @return the partition number for the record
     */
    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        // The value may be null, so guard before calling toString()
        String msg = value == null ? "" : value.toString();
        if (msg.contains("李哲")) {
            return 0;
        } else if (msg.contains("李昊哲")) {
            return 1;
        } else {
            return 2;
        }
    }

    @Override
    public void close() {
        // No resources to release
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // No configuration needed
    }
}
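The routing rule in `MyPartitioner` can be checked without a running cluster. Below is a minimal sketch, assuming a hypothetical helper `ValueRoutingSketch.route` that repeats the same `contains` checks as a plain function; treating a null value as unmatched is an assumption of this sketch.

```java
/** Illustrative only: the value-based routing rule from MyPartitioner as a plain function. */
final class ValueRoutingSketch {

    /** Returns 0 for values containing "李哲", 1 for "李昊哲", and 2 for everything else (including null). */
    static int route(Object value) {
        // Assumption: a null value falls through to the last partition
        String msg = value == null ? "" : value.toString();
        if (msg.contains("李哲")) {
            return 0;
        } else if (msg.contains("李昊哲")) {
            return 1;
        }
        return 2;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"李昊哲", "李哲", "李大宝"}) {
            System.out.println(name + " -> partition " + route(name));
        }
    }
}
```

Note that "李昊哲" does not contain "李哲" as a contiguous substring, so the first check does not swallow it and the three names map to three different partitions.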
com.lihaozhe.producer.AsyncProducerCallbackPartitions03
package com.lihaozhe.producer;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/**
 * Asynchronously sends records to a topic, with a callback.
 * Partitions are chosen by the custom partitioner MyPartitioner.
 * Start a console consumer beforehand to watch the output:
 * kafka-console-consumer.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 --topic lihaozhe
 *
 * @author 李昊哲
 * @version 1.0.0
 */
public class AsyncProducerCallbackPartitions03 {
    public static void main(String[] args) throws InterruptedException {
        // 1. Basic configuration
        Properties properties = new Properties();
        // Connect to the cluster (bootstrap.servers)
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "spark01:9092,spark02:9092,spark03:9092");
        // Serializer classes for the record key and value (key.serializer, value.serializer)
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Register the custom partitioner; the value must be the fully qualified class name
        properties.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, MyPartitioner.class.getName());
        // 2. Create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        // 3. Send the records
        List<String> names = Arrays.asList("李昊哲", "李哲", "李大宝");
        for (int i = 0; i < 500; i++) {
            // Only topic and value are given; the custom partitioner picks the partition from the value
            producer.send(new ProducerRecord<>("lihaozhe", names.get(i % names.size())), (metadata, exception) -> {
                if (exception == null) {
                    System.out.println("topic: " + metadata.topic() + "\tpartition: " + metadata.partition());
                } else {
                    // Report the failure instead of dropping it silently
                    exception.printStackTrace();
                }
            });
            Thread.sleep(2);
        }
        // 4. Release resources; close() waits for pending sends to complete
        producer.close();
        System.out.println("success");
    }
}
com.lihaozhe.producer.AsyncProducerParameters
package com.lihaozhe.producer;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

/**
 * Tunes the producer's send parameters: buffer size, batch size, linger time and compression.
 * Start a console consumer beforehand to watch the output:
 * kafka-console-consumer.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 --topic lihaozhe
 *
 * @author 李昊哲
 * @version 1.0.0
 */
public class AsyncProducerParameters {
    public static void main(String[] args) {
        // 1. Basic configuration
        Properties properties = new Properties();
        // Connect to the cluster (bootstrap.servers)
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "spark01:9092,spark02:9092,spark03:9092");
        // Serializer classes for the record key and value (key.serializer, value.serializer)
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // buffer.memory: total memory for buffering unsent records, 32 MB here
        properties.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432);
        // batch.size: upper bound in bytes for a batch of records going to one partition, 16 KB here
        properties.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);
        // linger.ms: how long to wait for more records before sending a batch that is not yet full
        properties.put(ProducerConfig.LINGER_MS_CONFIG, 1);
        // compression.type: one of none, gzip, snappy, lz4, zstd
        properties.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        // 2. Create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        // 3. Send the records
        for (int i = 0; i < 5; i++) {
            producer.send(new ProducerRecord<>("lihaozhe", "李昊哲" + i));
        }
        // 4. Release resources; close() waits for pending sends to complete
        producer.close();
        System.out.println("success");
    }
}
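How `batch.size` and `linger.ms` interact is easier to see in a small model. The sketch below is a simplified, single-partition approximation and not the producer's real accumulator: a batch is handed off when adding a record would exceed the size limit, or when the linger time has passed since the first record was added. `BatchingSketch` and its methods are hypothetical names, and time is passed in explicitly so the behavior is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a toy model of size-based and time-based batching. */
final class BatchingSketch {
    private final int batchSizeBytes;
    private final long lingerMs;
    private final List<String> current = new ArrayList<>();
    private final List<List<String>> sent = new ArrayList<>();
    private int currentBytes = 0;
    private long firstAppendMs = 0;

    BatchingSketch(int batchSizeBytes, long lingerMs) {
        this.batchSizeBytes = batchSizeBytes;
        this.lingerMs = lingerMs;
    }

    /** Adds a record of the given size at time nowMs, sending the open batch first if it would overflow. */
    void append(String record, int sizeBytes, long nowMs) {
        if (!current.isEmpty() && currentBytes + sizeBytes > batchSizeBytes) {
            flush();
        }
        if (current.isEmpty()) {
            firstAppendMs = nowMs;
        }
        current.add(record);
        currentBytes += sizeBytes;
    }

    /** Sends the open batch if it has waited at least lingerMs. */
    void tick(long nowMs) {
        if (!current.isEmpty() && nowMs - firstAppendMs >= lingerMs) {
            flush();
        }
    }

    private void flush() {
        sent.add(new ArrayList<>(current));
        current.clear();
        currentBytes = 0;
    }

    List<List<String>> sentBatches() {
        return sent;
    }

    public static void main(String[] args) {
        BatchingSketch batching = new BatchingSketch(16, 1);
        batching.append("r0", 8, 0);
        batching.append("r1", 8, 0); // the batch now holds 16 bytes
        batching.append("r2", 8, 0); // would overflow: [r0, r1] is sent, r2 starts a new batch
        batching.tick(1);            // linger time reached: [r2] is sent
        System.out.println(batching.sentBatches()); // prints [[r0, r1], [r2]]
    }
}
```

A larger `batch.size` or `linger.ms` means fewer, bigger requests at the cost of a little latency, which is the trade-off the settings above expose.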
com.lihaozhe.producer.AsyncProducerAck
package com.lihaozhe.producer;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

/**
 * Asynchronously sends records to a topic, with a callback.
 * Sets the acks and retries parameters.
 * Start a console consumer beforehand to watch the output:
 * kafka-console-consumer.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 --topic lihaozhe
 *
 * @author 李昊哲
 * @version 1.0.0
 */
public class AsyncProducerAck {
    public static void main(String[] args) throws InterruptedException {
        // 1. Basic configuration
        Properties properties = new Properties();
        // Connect to the cluster (bootstrap.servers)
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "spark01:9092,spark02:9092,spark03:9092");
        // Serializer classes for the record key and value (key.serializer, value.serializer)
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=1: the leader acknowledges once the record is in its own log, without waiting for followers
        properties.put(ProducerConfig.ACKS_CONFIG, "1");
        // retries: how many times a failed send is retried
        properties.put(ProducerConfig.RETRIES_CONFIG, 3);
        // 2. Create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        // 3. Send the records
        for (int i = 0; i < 500; i++) {
            producer.send(new ProducerRecord<>("lihaozhe", "李昊哲" + i), (metadata, exception) -> {
                if (exception == null) {
                    System.out.println("topic: " + metadata.topic() + "\tpartition: " + metadata.partition());
                } else {
                    // Reached only after the configured retries are exhausted or the error is not retriable
                    exception.printStackTrace();
                }
            });
            Thread.sleep(2);
        }
        // 4. Release resources; close() waits for pending sends to complete
        producer.close();
        System.out.println("success");
    }
}
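To make the two settings concrete, here is a minimal sketch of what each `acks` value waits for and how `retries` bounds the number of attempts. It is a simplified model under stated assumptions: `AckRetrySketch` and its methods are hypothetical, every failure is treated as retriable, and the real producer additionally stops retrying once `delivery.timeout.ms` expires.

```java
/** Illustrative only: what each acks setting waits for, and how retries bounds the attempts. */
final class AckRetrySketch {

    /** Returns true when a write counts as acknowledged under the given acks setting. */
    static boolean isAcknowledged(String acks, boolean leaderWrote, boolean allInSyncReplicasWrote) {
        switch (acks) {
            case "0":
                return true; // the producer does not wait for any broker response
            case "1":
                return leaderWrote; // only the leader's write is required
            case "all":
            case "-1":
                return leaderWrote && allInSyncReplicasWrote; // every in-sync replica must have the record
            default:
                throw new IllegalArgumentException("unknown acks: " + acks);
        }
    }

    /**
     * Returns the 1-based attempt on which the send succeeds, or -1 if every attempt fails.
     * A send is tried once and then retried up to `retries` more times.
     */
    static int attemptsUntilSuccess(int retries, int failuresBeforeSuccess) {
        int maxAttempts = 1 + retries;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attempt > failuresBeforeSuccess) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The leader has the record but a follower has not caught up yet
        System.out.println(isAcknowledged("1", true, false));   // prints true
        System.out.println(isAcknowledged("all", true, false)); // prints false
        // With retries=3 there are at most 4 attempts
        System.out.println(attemptsUntilSuccess(3, 2)); // prints 3
        System.out.println(attemptsUntilSuccess(3, 4)); // prints -1
    }
}
```

The model shows the trade-off: `acks=1` returns sooner but can lose a record if the leader fails before followers copy it, while `acks=all` waits longer for a stronger guarantee.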
com.lihaozhe.producer.AsyncProducerTransactions
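package com.lihaozhe.producer;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.AuthorizationException;
import org.apache.kafka.common.errors.OutOfOrderSequenceException;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

/**
 * Sends records to a topic inside a transaction: either all of them are committed or none are.
 * Start a console consumer beforehand to watch the output:
 * kafka-console-consumer.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 --topic lihaozhe
 * Add --isolation-level read_committed to hide records from aborted transactions.
 *
 * @author 李昊哲
 * @version 1.0.0
 */
public class AsyncProducerTransactions {
    public static void main(String[] args) {
        // 1. Basic configuration
        Properties properties = new Properties();
        // Connect to the cluster (bootstrap.servers)
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "spark01:9092,spark02:9092,spark03:9092");
        // Serializer classes for the record key and value (key.serializer, value.serializer)
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Transactional id (transactional.id); required before the transaction API can be used
        properties.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "transactional_id_01");
        // 2. Create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        // Register the transactional id with the cluster; call once per producer
        producer.initTransactions();
        try {
            producer.beginTransaction();
            // 3. Send the records
            for (int i = 0; i < 5; i++) {
                producer.send(new ProducerRecord<>("lihaozhe", "李昊哲" + i));
            }
            // Uncomment the next line to simulate a failure and watch the transaction abort
            // int i = 1 / 0;
            producer.commitTransaction();
            System.out.println("success");
        } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
            // Fatal errors: the transaction cannot be aborted, the producer can only be closed
            System.out.println("failed: " + e.getMessage());
        } catch (Exception e) {
            // Any other error: abort so that none of the records become visible
            System.out.println("failed: " + e.getMessage());
            producer.abortTransaction();
        } finally {
            // 4. Release resources
            producer.close();
        }
    }
}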
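The all-or-nothing behavior of a transaction can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `TransactionSketch` is a hypothetical class, and it models what a consumer reading with `read_committed` would observe, not how the broker stores or marks transactional records.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: records sent in a transaction become visible together on commit, or not at all. */
final class TransactionSketch {
    private final List<String> committed = new ArrayList<>();
    private List<String> pending = null;

    void begin() {
        if (pending != null) {
            throw new IllegalStateException("a transaction is already open");
        }
        pending = new ArrayList<>();
    }

    void send(String record) {
        requireOpen();
        pending.add(record);
    }

    /** Makes every pending record visible at once. */
    void commit() {
        requireOpen();
        committed.addAll(pending);
        pending = null;
    }

    /** Discards every pending record. */
    void abort() {
        requireOpen();
        pending = null;
    }

    /** What a read_committed consumer would see in this model. */
    List<String> visible() {
        return new ArrayList<>(committed);
    }

    private void requireOpen() {
        if (pending == null) {
            throw new IllegalStateException("no open transaction");
        }
    }

    public static void main(String[] args) {
        TransactionSketch tx = new TransactionSketch();
        tx.begin();
        tx.send("r0");
        tx.send("r1");
        tx.abort();
        System.out.println(tx.visible()); // prints []
        tx.begin();
        tx.send("r2");
        tx.commit();
        System.out.println(tx.visible()); // prints [r2]
    }
}
```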
```java
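import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.AuthorizationException;
import org.apache.kafka.common.errors.OutOfOrderSequenceException;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

/**
 * Sends records to a topic inside a transaction.
 * Start a console consumer beforehand to watch the output:
 * kafka-console-consumer.sh --bootstrap-server spark01:9092,spark02:9092,spark03:9092 --topic lihaozhe
 *
 * @author 李昊哲
 * @version 1.0.0
 */
public class AsyncProducerTransactions {
    public static void main(String[] args) {
        // 1. Basic configuration
        Properties properties = new Properties();
        // Connect to the cluster (bootstrap.servers)
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "spark01:9092,spark02:9092,spark03:9092");
        // Serializer classes for the record key and value (key.serializer, value.serializer)
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Transactional id (transactional.id); required before the transaction API can be used
        properties.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "transactional_id_01");
        // 2. Create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        // Register the transactional id with the cluster; call once per producer
        producer.initTransactions();
        try {
            producer.beginTransaction();
            // 3. Send the records
            for (int i = 0; i < 5; i++) {
                producer.send(new ProducerRecord<>("lihaozhe", "李昊哲" + i));
            }
            // Uncomment the next line to simulate a failure and watch the transaction abort
            // int i = 1 / 0;
            producer.commitTransaction();
            System.out.println("success");
        } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
            // Fatal errors: the transaction cannot be aborted, the producer can only be closed
            System.out.println("failed: " + e.getMessage());
        } catch (Exception e) {
            // Any other error: abort so that none of the records become visible
            System.out.println("failed: " + e.getMessage());
            producer.abortTransaction();
        } finally {
            // 4. Release resources
            producer.close();
        }
    }
}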