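Getting Started with Flink: Reading a Real-Time Kafka Stream and Implementing WordCount

This article shows how Flink consumes a text stream from Kafka, runs a WordCount word-frequency computation on it, and writes the result to standard output. Along the way you will see how to write and run a Flink program.

Code walkthrough

First, set up the Flink execution environment:

// Create the Flink execution environment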
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Next, set the Kafka parameters: connect to the broker's host and port, read from the topic named Shakespeare, and name the resulting data stream stream:

// Kafka parameters
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "flink-group");
String inputTopic = "Shakespeare";
String outputTopic = "WordCount";

// Source
FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<String>(inputTopic, new SimpleStringSchema(), properties);
DataStream<String> stream = env.addSource(consumer);
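
As a small aside, the two connection settings above can be gathered in one helper so that the broker address and consumer group are easy to change. This is only a sketch: KafkaSettings and its build method are hypothetical names introduced here and are not part of the original program or of Flink.

```java
import java.util.Properties;

// Hypothetical helper (not part of the original program or of Flink):
// collects the Kafka client settings used in this article in one place.
final class KafkaSettings {

    static Properties build(String bootstrapServers, String groupId) {
        Properties properties = new Properties();
        // Address of the Kafka broker(s) to connect to, as host:port
        properties.setProperty("bootstrap.servers", bootstrapServers);
        // Consumer group id that the Kafka consumer reports to the broker
        properties.setProperty("group.id", groupId);
        return properties;
    }
}
```

Calling KafkaSettings.build("localhost:9092", "flink-group") yields the same Properties object that the snippet above builds by hand.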

Process the data stream with Flink operators:

// Transformations
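// Apply Flink operators to the text of the input stream:
// split on whitespace, emit (word, 1), key by word, apply a time window, aggregate
DataStream<Tuple2<String, Integer>> wordCount = stream
    .flatMap((String line, Collector<Tuple2<String, Integer>> collector) -> {
      // "\\s" is the regex for a single whitespace character; the backslash must be escaped in Java
      String[] tokens = line.split("\\s");
      // Emit (word, 1) for each non-empty token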
      for (String token : tokens) {
        if (token.length() > 0) {
          collector.collect(new Tuple2<>(token, 1));
        }
      }
    })
    .returns(Types.TUPLE(Types.STRING, Types.INT))
    .keyBy(0)
    .timeWindow(Time.seconds(5))
    .sum(1);

This code uses Flink's DataStream API, which mainly covers transformation, grouping (keying), windowing, and aggregation operations.
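
To make the semantics of the operator chain concrete, here is a plain-Java sketch of what flatMap, keyBy, a 5-second tumbling window, and sum compute together. It is not Flink code: the class and method names are invented for illustration, and the timestamps are supplied by hand, whereas in the real job the time that assigns a record to a window comes from the environment's time settings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of flatMap -> keyBy -> 5-second tumbling window -> sum.
// This is NOT Flink code; all names here are made up for this sketch.
final class WindowedWordCountSketch {

    static final long WINDOW_SIZE_MS = 5_000L;

    // Start of the tumbling window containing the timestamp
    // (assumes non-negative timestamps and no window offset).
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_SIZE_MS);
    }

    // flatMap step: split a line on whitespace and drop empty tokens.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("\\s")) {
            if (token.length() > 0) {
                words.add(token);
            }
        }
        return words;
    }

    // keyBy + window + sum: maps window start -> (word -> count).
    static Map<Long, Map<String, Integer>> count(long[] timestampsMs, String[] lines) {
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (int i = 0; i < lines.length; i++) {
            Map<String, Integer> counts =
                result.computeIfAbsent(windowStart(timestampsMs[i]), k -> new TreeMap<>());
            for (String word : tokenize(lines[i])) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] timestamps = {1_000L, 4_000L, 6_000L};
        String[] lines = {"to be or not", "to be", "to be"};
        // prints {0={be=2, not=1, or=1, to=2}, 5000={be=1, to=1}}
        System.out.println(count(timestamps, lines));
    }
}
```

The first two lines fall into the window starting at 0 ms and the third into the window starting at 5000 ms, so each word is counted separately per window, which mirrors how the job emits one (word, count) pair per key per window.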

Print the data stream:

// Sink
wordCount.print();

Finally, execute the program:

// execute
env.execute("kafka streaming word count");

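The call to env.execute is required to start a Flink job. Only when execute() is called are the operations declared before it submitted to a cluster or run on the local machine.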
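
The idea of deferred execution can be shown with a toy model in plain Java. LazyPipeline below is a made-up class for this sketch and has nothing to do with Flink's real classes; it only demonstrates that registering steps does no work until execute is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy model of deferred execution. LazyPipeline is a made-up class for this
// sketch and is unrelated to Flink's real implementation.
final class LazyPipeline {

    private final List<Function<String, String>> steps = new ArrayList<>();

    // Registers a step; nothing is computed here.
    LazyPipeline map(Function<String, String> step) {
        steps.add(step);
        return this;
    }

    // Runs all registered steps, in registration order, on the input.
    String execute(String input) {
        String value = input;
        for (Function<String, String> step : steps) {
            value = step.apply(value);
        }
        return value;
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        LazyPipeline pipeline = new LazyPipeline()
            .map(s -> { calls.add("trim"); return s.trim(); })
            .map(s -> { calls.add("upper"); return s.toUpperCase(); });
        System.out.println(calls.size());                // prints 0: no step has run yet
        System.out.println(pipeline.execute(" flink ")); // prints FLINK
        System.out.println(calls);                       // prints [trim, upper]
    }
}
```

In the same spirit, the flatMap, keyBy, window, and sum calls in the job only describe the computation; the work starts when env.execute is invoked.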

The complete code is as follows:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

import java.util.Properties;

public class WordCountKafkaInStdOut {

    public static void main(String[] args) throws Exception {

        // Create the Flink execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka parameters
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("group.id", "flink-group");
        String inputTopic = "Shakespeare";
        String outputTopic = "WordCount";

        // Source
        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<String>(inputTopic, new SimpleStringSchema(), properties);
        DataStream<String> stream = env.addSource(consumer);

        // Transformations
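        // Apply Flink operators to the text of the input stream:
        // split on whitespace, emit (word, 1), key by word, apply a time window, aggregate
        DataStream<Tuple2<String, Integer>> wordCount = stream
            .flatMap((String line, Collector<Tuple2<String, Integer>> collector) -> {
                // "\\s" is the regex for a single whitespace character; the backslash must be escaped in Java
                String[] tokens = line.split("\\s");
                // Emit (word, 1) for each non-empty token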
                for (String token : tokens) {
                    if (token.length() > 0) {
                        collector.collect(new Tuple2<>(token, 1));
                    }
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            .keyBy(0)
            .timeWindow(Time.seconds(5))
            .sum(1);

        // Sink
        wordCount.print();

        // execute
        env.execute("kafka streaming word count");

    }
}

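Running the program

The article "Introduction to Kafka" described how to start a Kafka cluster and send a data stream to a topic. Before starting this Flink job, start a Kafka cluster as described there, create the corresponding topic, and write data to it.

Debugging and running in IntelliJ IDEA

In IntelliJ IDEA, click the green run button to execute the program. Either of the two green buttons shown in the figure starts it.

The panel at the bottom of IntelliJ IDEA shows everything the program writes to standard output, including the results we want to print.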

Congratulations, your first Flink program ran successfully!

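Submitting the job to a cluster

In the first step we downloaded Flink and set up a local cluster. We then added code on top of the project template and ran it in IntelliJ IDEA for debugging. In production, the code usually needs to be compiled, packaged, and submitted to a cluster.

Note that two directories are involved here. One is the project directory, which holds the code we just wrote. The other is the Flink home directory, which was downloaded from the Flink website and unpacked; its bin subdirectory contains the command-line tools that Flink provides.

Go to the project directory and build the package with Maven:

# Compile and package your code with Maven
# The resulting package is usually placed in the target subfolder of the project directory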
$ mvn clean package

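Return to the Flink home directory and use the flink command-line tool to submit the packaged job to the cluster. The --class option specifies which main class to use as the entry point. The command-line tool is covered in more detail later.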

$ bin/flink run --class com.flink.tutorials.java.api.projects.wordcount.WordCountKafkaInStdOut /Users/luweizheng/Projects/big-data/flink-tutorials/target/flink-tutorials-0.1.jar

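The Flink dashboard now shows one more job.

The program's output is written to the .out files in the log directory under the Flink home directory. Use the following command to view the results: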

$ tail -f log/flink-*-taskexecutor-*.out

Stop the local cluster:

$ ./bin/stop-cluster.sh

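During Flink development and debugging, there are generally several ways to run a program:

Use the run button built into IntelliJ IDEA. This is mainly for local debugging.
Use Flink's standard command-line tool to submit a job to a cluster; this works for both Java and Scala programs. This approach is better suited to production.
Use the other command-line tools that Flink provides, such as the interactive environments for Scala, Python, and SQL. These are also used for debugging.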

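Copyright notice: this article comes from CSDN, with thanks to the original author. It follows the CC 4.0 BY-SA license; when reposting, include the link to the original source and this notice.
Original link: https://blog.csdn.net/qq_42596142/article/details/104325765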

Posted on 2020-03-06 22:49:22