Hadoop Matters Blog

Big Industries' blog

Creating a Data Pipeline using Flume, Kafka, Spark and Hive

The aim of this post is to help you get started with creating a data pipeline using Flume, Kafka and Spark Streaming that will enable you to fetch Twitter data and analyze it in Hive.

Building Real Time Data Pipelines with Apache Kafka

Introduction

Apache Kafka is a distributed publish-subscribe messaging system designed to be fast, scalable, and durable. This open source project, licensed under the Apache License, has gained popularity within the Hadoop ecosystem and across multiple industries. Its key strength is the ability to make high-volume data available as a real-time stream for consumption by systems with very different requirements: batch systems like Hadoop, real-time systems that require low-latency access, and stream processing engines like Apache Spark Streaming that transform the data streams as they arrive. This flexibility makes Kafka a good fit for a wide variety of use cases, from replacing traditional message brokers to collecting user activity data and aggregating logs, operational application metrics, and device instrumentation.
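
To make the publish-subscribe idea above a little more concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the Kafka API: the `ToyTopic` class, its `publish` and `poll` methods, and the group names `realtime` and `batch` are all hypothetical names made up for illustration. The one thing it tries to show is that messages are appended to a shared log while each consumer group keeps its own read position, so a low-latency reader and a batch reader can consume the same stream at different paces.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a single topic partition (illustrative only, not Kafka)."""

    def __init__(self):
        self._log = []                    # append-only list of messages
        self._offsets = defaultdict(int)  # next offset to read, per consumer group

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, group, max_messages=10):
        """Return up to max_messages unread messages for a group and advance its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


topic = ToyTopic()
for event in ["page_view", "click", "purchase"]:
    topic.publish(event)

# A low-latency consumer reads one message at a time...
realtime_first = topic.poll("realtime", max_messages=1)
# ...while a batch consumer picks up everything accumulated so far.
batch_all = topic.poll("batch", max_messages=100)
# The first consumer's position is unaffected by the batch read.
realtime_rest = topic.poll("realtime", max_messages=100)
```

Reading a message does not remove it from the log, so adding another consumer group later would not disturb the existing ones. A real Kafka deployment adds partitioning, replication and persistence on top of this basic picture.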
