Running Apache Mesos on Cloudera Hadoop

Big Industries' blog

Learn about the benefits of running Apache Mesos on a Cloudera Hadoop cluster, how to set it up and get your first service deployed.

Big Industries, Cloudera systems integration and reseller partner for Belgium and Luxembourg, has been working to put together an integration of Apache Mesos and the Cloudera Distribution for Hadoop, that can be deployed and managed through Cloudera Manager. In this post, Big Industries’ Rob Gibbon will explain what Mesos offers, the benefits of deploying Mesos to your cluster and walk through the process of setting it up.

Mesos
Apache Mesos is a distributed, generic grid workload manager. Similar to Hadoop YARN, Mesos is designed to run the generic tasks and services that your Hadoop cluster might not otherwise be able to manage – its ideal for running stuff like Memcached, MySQL Server, Apache httpd, Nginx, HAProxy, Snort, ActiveMQ, or whatever it is that you need to run, for as long as you need to run it. Mesos is designed to scale, like YARN is, and Mesos services can be deployed on clusters of up to 10,000s of nodes.

mesos_logo

Why would you want to run things like web servers, proxies and caches on a Hadoop cluster though? Well when assembling a technical solution, especially an off-the-shelf solution, it is common that the buyer expects the vendor to provide a complete, ready to go platform, with a single bill-of-materials. Solutions often make use of operational, front-end serving components (reverse proxies, load balancers, web servers, application servers) and middle-tier components (object caches, JMS, workflow engine etc.) in addition to backend components and while Hadoop is great at solving backend data processing challenges, until Mesos it has been pretty difficult to deploy and operate front-end and middle tier components in a consistent manner as part of a complete, Hadoop-powered solution.

mesos-overview

Apache Mesos components

For example, building a Security Information and Event Management (SIEM) solution on top of Cloudera Hadoop would mean including a live traffic inspection layer as well as an active archive of security related event logs, and reporting tools; while Hadoop perfectly fits the needs for the active archiving element, without Mesos integration to run a live traffic inspection system and a reporting server, it would be quite difficult to deliver on the complete system requirements in a consistent way from a single platform.

mesos-example

A conceptual SIEM solution running on Mesos

Mesos comes with a framework called Marathon to launch tasks on the cluster, and a scheduler framework called Chronos, which offers a highly available, fault tolerant alternative to Unix cron.

 

Docker
In order to launch an application, Mesos Marathon uses Docker, an application virtualization system that enables portable, standardized and containerized deployment of applications and components across the cluster. The engineer writes a dockerfile, which is a text file containing a set of automation instructions for deploying and configuring the application.

There are other ways to launch applications on Mesos, but Docker offers a robust solution with extensive features.

 

Putting it together: Mesos on Cloudera
In order to get Apache Mesos running on a Cloudera environment, we put together Custom Service Descriptors and custom deployment parcels for CentOS 6.5, RHEL 6.5 and Ubuntu 14.04 LTS that can be installed into Cloudera Manager. With this approach, deploying Mesos and Docker is a similar experience to deploying other Hadoop components like YARN, Impala or Hive.

Note that we made two parcels (one for Mesos and one for Docker) because Docker needs to be run as root, whilst Mesos can run under a dedicated user account.

First, add the Big Industries parcel repo to Cloudera Manager. You can find the parcel repo at http://www.bigindustries.be/parcels.

CM parcel repo source

Add the Mesos parcel repo to Cloudera Manager via CM’s “Settings” screen

In order to get Cloudera Manager to work with the Mesos and Docker parcels, you need to copy the CSD files to the Cloudera Manager CSD repository:

scp MESOS-1.0.jar myhost.com:/opt/cloudera/csd/.
scp DOCKER-1.0.jar myhost.com:/opt/cloudera/csd/.

Once done you’ll need to restart CM to pick up the changes:

service cloudera-scm-server restart

The next step is to download, distribute and activate the Mesos and Docker parcels via Cloudera Manager.

Download, distribute and activate the Mesos and Docker parcels in Cloudera Manager

Download, distribute and activate the Mesos and Docker parcels in Cloudera Manager

We can now set up and configure a new Mesos service on our cluster from Cloudera Manager, in the same way we would set up any other hadoop service.

Add Mesos and Docker services using the "Add Service" wizard in Cloudera Manager

Add Mesos and Docker services using the “Add Service” wizard in Cloudera Manager

You can choose which nodes of the cluster to use as Mesos slaves, Mesos masters, and where to deploy the Marathon service. You should deploy and run Docker on each node that will run as a Mesos slave.
In order to ensure solid resource isolation, you can use Cloudera Manager’s Linux Control Groups integration to allocate appropriate system resource shares to the Mesos framework; this way Mesos and other Hadoop components like YARN and Impala can coexist.

Set up the hosts to roles mappings in CM

Set up the hosts to roles mappings in CM

 

Running Docker images in Marathon
Marathon has a REST API. Docker images can be started with a POST to:

http://[host]:[port]/v2/apps

containing a configuration file in JSON format.

 

Editing the JSON files
Various settings can be configured in the JSON file. They are used to configure Marathon and the service the Docker image contains. For example:

 {
   "id": "example",
   "instances": 1,
   "cpus": 0.5,
   "mem": 1024.0,
   "disk": 128,
   "constraints": [["hostname", "CLUSTER", "[desired-hostname]"]],
   "container": {
     "docker": {
       "type": "DOCKER",
       "image": "repo/example:1.0",
       "network": "BRIDGE",
       "parameters": [],
       "portMappings": [
         {
           "containerPort": 5000,
           "hostPort": 0,
           "protocol": "tcp",
           "servicePort": 5000
         }
       ]
     },
     "volumes": [
       {
         "hostPath": "/docker/packages",
         "containerPath": "/storage",
         "mode": "RW"
       }
     ]
   },
   "env": {
     "SETTINGS_FLAVOR": "local",
     "STORAGE_PATH": "/storage"
   },
   "ports": [ 0 ]
 }

Further documentation can be found here. To find the exposed ports, volumes and enviroment variables, check the Dockerfile for the following commands:

EXPOSE
ENV
VOLUME

 

Setting up a Docker Registry containing the images
Docker images must be made available to the Docker daemons on the cluster.

The best way to do this is to provide a Docker Registry, which is comparable to a Git-repository for Docker images. A json file to setup a registry using Marathon is included in the project. To move docker images from one host to another use the following commands.

sudo docker save [imagename] > [imagename].tar
sudo docker load [imagename] > [imagename].tar

To put these images on the docker registry first tag them, then push them. IP address and port of the registry can be found in the Marathon UI.

sudo docker tag imagename [registry_ip]:[port]/[image_name]:[version]
sudo docker push [registry_ip]:[port]/[image_name]:[version] 

Note that when using an insecure private registry, like the one from the json file, it is important to add the –insecure-registry argument to the start command.

 

Example: launching Memcached on Mesos, on Cloudera Hadoop
To launch memcached on a cluster, we need a docker image for Memcached – we’ll use sameersbn/memcached:latest from the public registry.

You need to create a marathon configuration file like the one below:

{
   "id": "memcached",
   "cmd": "",
   "cpus": 0.5,
   "mem": 512,
   "instances": 5,
   "container": {
     "type": "DOCKER",
     "docker": {
       "image": "sameersbn/memcached:latest",
       "network": "BRIDGE",
       "portMappings": [
         {
           "containerPort": 11211,
           "hostPort": 0,
           "servicePort": 0,
           "protocol": "tcp"
         }
       ]
     }
   }
 }

Then you need to launch it on the cluster via the Marathon REST API:

curl -H "Content-Type: application/json" -X POST --data @memcached.json http://[host]:[port]/v2/apps

 

Troubleshooting
If the Marathon app stays in the deploying state:

* Make sure there are enough resources on the slaves.

* Make sure the ip address and the port number of the registry are set correctly and the registry is added as insecure registry on the Docker daemon.

* Make sure the image name has a version if required.

 

Conclusions
In this article we have explained some of the features and benefits of Apache Mesos, seen how to deploy Mesos and Docker under Cloudera Hadoop using Cloudera Manager and custom parcels, and had a look at launching an application component (Memcached) across the cluster using Mesos Marathon.

The source code for the Cloudera Manager Mesos and Docker extensions is available on github and its Apache v2 licensed. The blog is also published on the Cloudera website.

Rob Gibbon is architect, manager and partner at Big Industries, the industry leading Hadoop SI partner for Belgium and Luxembourg.www.bigindustries.be

Posted by Robert Gibbon on Jun 29, 2016 6:57:50 PM

Robert Gibbon

About the Author

Rob is a Hadoop and large-scale distributed computing evangelist. Solution Architect by trade, Rob is a managing partner at Big Industries - the premiere Hadoop & Big Data systems integrator for Belgium and Luxembourg.