Big Industries recently entered into a partnership with Confluent.
Big Industries' blog
The aim of this post is to help you getting started with creating a data pipeline using flume, kafka and spark streaming that will enable you to fetch twitter data and analyze it in hive.
Apache Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable. This open source project – licensed under the Apache license – has gained popularity within the Hadoop ecosystem, across multiple industries. Its key strength is the ability to make high volume data available as a real-time stream for consumption in systems with very different requirements—from batch systems like Hadoop, to real-time systems that require low-latency access, to stream processing engines like Apache Spark Streaming that transform the data streams as they arrive. Kafka’s flexibility makes it ideal for a wide variety of use cases, from replacing traditional message brokers, to collecting user activity data, aggregating logs, operational application metrics and device instrumentation.