(4) Spark Streaming Operators Walkthrough: Kafka createDirectStream

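Contents
天小天: (1) Spark Streaming Operators Walkthrough: A brief introduction to how streaming runs
天小天: (2) Spark Streaming Operators Walkthrough: flatMap and mapPartitions
天小天: (3) Spark Streaming Operators Walkthrough: the transform operator
天小天: (4) Spark Streaming Operators Walkthrough: Kafka createDirectStream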

Preface

This article walks through how KafkaUtils.createDirectStream is implemented, covering both its structure and how it consumes data from Kafka.

Example

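import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object DirectKafkaWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println(s"""
        |Usage: DirectKafkaWordCount <brokers> <topics>
        |  <brokers> is a list of one or more Kafka brokers
        |  <topics> is a list of one or more kafka topics to consume from
        |
        """.stripMargin)
      System.exit(1)
    }

    // Helper from Spark's examples module (package org.apache.spark.examples.streaming)
    StreamingExamples.setStreamingLogLevels()

    val Array(brokers, topics) = args

    // Create context with 2 second batch interval
    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create direct kafka stream with brokers and topics.
    // The kafka010 integration uses the new Kafka consumer, which expects
    // bootstrap.servers, deserializers and a group.id rather than the old
    // metadata.broker.list. The group id below is a placeholder for illustration.
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> brokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "direct-kafka-word-count")
    val messages = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))

    // Get the lines, split them into words, count the words and print
    val lines = messages.map(_.value)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
    wordCounts.print()

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}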

This example is taken from the examples in the Spark source tree. It pulls data from Kafka and computes a word count over each batch.
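To make the counting step easier to follow without a running Kafka cluster, here is a minimal sketch that applies the same split, pair, and sum logic to an in-memory batch of lines using plain Scala collections. The object name WordCountSketch and the sample input are assumptions made for illustration; groupBy plus sum stands in for reduceByKey, which is only available on DStreams and RDDs.

```scala
object WordCountSketch {
  // Same per-batch logic as the DStream pipeline:
  // split each line on spaces, pair every word with 1L, then sum per word.
  def countWords(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Hypothetical batch standing in for the values of the Kafka records
    val batch = Seq("a b a", "b c")
    // Prints (a,2), (b,2), (c,1), one per line
    countWords(batch).toSeq.sortBy(_._1).foreach(println)
  }
}
```

In the streaming job the same computation runs once per 2 second batch, over whatever records arrived in that interval.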

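Parameters

The parameters are:

ssc: StreamingContext: the streaming context
locationStrategy: LocationStrategy: how Kafka consumers are distributed across executors
consumerStrategy: ConsumerStrategy[K, V]: the Kafka consumer configuration (which topics to consume and the consumer parameters)

There are three parameters. The first and third are easy to understand for anyone who has used the API. The second is less obvious, so LocationStrategy is explained in detail here.

Here is the source of the LocationStrategies object: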

object LocationStrategies {
  /**
   * :: Experimental ::
   * Use this only if your executors are on the same nodes as your Kafka brokers.
   */
  @Experimental
  def PreferBrokers: LocationStrategy =
    org.apache.spark.streaming.kafka010.PreferBrokers

  /**
   * :: Experimental ::
   * Use this in most cases, it will consistently distribute partitions across all executors.
   */
  @Experimental
  def PreferConsistent: LocationStrategy =
    org.apache.spark.streaming.kafka010.PreferConsistent

  /**
   * :: Experimental ::
   * Use this to place particular TopicPartitions on particular hosts if your load is uneven.
   * Any TopicPartition not specified in the map will use a consistent location.
   */
  @Experimental
  def PreferFixed(hostMap: collection.Map[TopicPartition, String]): LocationStrategy =
    new PreferFixed(new ju.HashMap[TopicPartition, String](hostMap.asJava))

  /**
   * :: Experimental ::
   * Use this to place particular TopicPartitions on particular hosts if your load is uneven.
   * Any TopicPartition not specified in the map will use a consistent location.
   */
  @Experimental
  def PreferFixed(hostMap: ju.Map[TopicPartition, String]): LocationStrategy =
    new PreferFixed(hostMap)
}
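As a rough illustration of how these factory methods are called from user code, the sketch below builds one value per strategy. It assumes the spark-streaming-kafka-0-10 and kafka-clients libraries are on the classpath; the object name LocationStrategySketch, the topic name "events", and the host "worker-1.example.com" are hypothetical.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{LocationStrategies, LocationStrategy}

object LocationStrategySketch {
  // The usual choice: distribute partitions consistently across all executors.
  val consistent: LocationStrategy = LocationStrategies.PreferConsistent

  // Only when executors run on the same nodes as the Kafka brokers.
  val brokers: LocationStrategy = LocationStrategies.PreferBrokers

  // Place particular partitions on particular hosts when load is uneven.
  // Partitions missing from the map use a consistent location.
  // Topic and host names here are made up for the example.
  val fixed: LocationStrategy = LocationStrategies.PreferFixed(
    Map(new TopicPartition("events", 0) -> "worker-1.example.com"))
}
```

Any of these values can be passed as the second argument of KafkaUtils.createDirectStream.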

Three location strategies are provided. Their names and when to use each are: