Hello there.

Let’s talk about the big data stack, and Hadoop in particular. The sad thing about it is that it consists of too many interlinked products, and you need several months just to learn the basics. Let’s list some of these products:

  • Hadoop/HDFS
  • MapReduce
  • Yarn
  • Tez
  • Hive
  • Oozie
  • Spark
  • Kafka
  • HBase
  • Cassandra
  • Apache NiFi
  • and many more :)

Apache Ambari

It would be much easier if you had a nice UI that let you quickly install and configure all these Hadoop-related products together. Such a UI exists, and it is called Apache Ambari. The main thing to remember is that it does not work out of the box. You can install and run it, but you will not be able to do anything because the UI will be empty. That is because Ambari by default is an empty tool: you have to configure it with an appropriate package set and provide the repositories for that package set to make everything work.

So how do you deal with an empty Ambari? This is where Cloudera (Hortonworks) comes in. Cloudera provides the Hortonworks Data Platform (HDP) and Hortonworks DataFlow (HDF) stacks for Apache Ambari. They assemble all the compatible software packages and provide them via repositories. Unfortunately, access to the latest HDP/HDF stack repositories is not free anymore, but you can still use the Ambari 2.7.4 configuration free of charge.

Ambari and Docker issues

If you try to find a working Docker image, you will most likely fail. I was not able to find anything on Docker Hub.

The Ambari setup does not align with the Docker approach because it was developed to run inside a real OS. To make it work, you have to build a Docker image that can run systemd. That does the trick: Docker runs systemd, and systemd runs any service you need. It is almost like running inside a VM, but with shared resources as in Docker.

By the way, this all means that this setup only makes sense if your OS has proper Docker support, which means Linux or Windows with WSL2. If you are on macOS, you should probably use a VirtualBox-based Ambari HDP setup instead.

Preparing Docker configuration

Here is what we are about to do:

  • create 3 Dockerfiles for:
    • base image
    • ambari server
    • ambari agent
  • write docker-compose.yml
  • write some tricky scripts to fix stuff

Keep in mind that you must put all these files in a single directory.

centos7.systemd.Dockerfile -> centos7-systemd:latest

Although Ambari supports many OSes, in my setup it worked without issues only on CentOS 7 based images. Most of the tricks here are only there to support systemd.

FROM centos:7

ENV container docker

RUN echo "Setting up root password to \"ambari\""
RUN echo "root:ambari" | chpasswd

## Install some basic utilities that aren't in the default image
RUN yum clean all -y && yum update -y

RUN yum -y install deltarpm 
RUN yum -y install systemd sudo vim wget curl which tar initscripts chrony python

RUN yum clean all && rm -rf /var/cache/yum

ENV HOME /root

RUN (cd /lib/systemd/system/sysinit.target.wants/; for i in *; do [ "$i" = systemd-tmpfiles-setup.service ] || rm -f $i; done); \
rm -f /lib/systemd/system/multi-user.target.wants/*; \
rm -f /etc/systemd/system/*.wants/*; \
rm -f /lib/systemd/system/local-fs.target.wants/*; \
rm -f /lib/systemd/system/sockets.target.wants/*udev*; \
rm -f /lib/systemd/system/sockets.target.wants/*initctl*; \
rm -f /lib/systemd/system/basic.target.wants/*; \
rm -f /lib/systemd/system/anaconda.target.wants/*;

RUN /usr/bin/systemctl enable chronyd.service

VOLUME [ "/sys/fs/cgroup" ]

CMD ["/usr/sbin/init"]

Now we have to build our base image.

docker image build -f centos7.systemd.Dockerfile -t centos7-systemd:latest .

BTW, you can skip this step, but then you would have to copy the whole content of centos7.systemd.Dockerfile into the next Dockerfiles.

ambari.server.centos7.Dockerfile

This Dockerfile builds the Ambari server image.

# You can just paste the whole content of centos7.systemd.Dockerfile instead of the next line
FROM centos7-systemd:latest

#
# Ambari server specific config
#

RUN wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.7.4.0/ambari.repo -O /etc/yum.repos.d/ambari.repo

RUN yum update -y && yum -y install ambari-server
RUN yum clean all && rm -rf /var/cache/yum

RUN mkdir -p /usr/share/java/
# Mysql connector that could be used by agent
# RUN wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.19-1.el7.noarch.rpm
# RUN /usr/bin/rpm -i ./mysql-connector-java-8.0.19-1.el7.noarch.rpm
# RUN rm ./mysql-connector-java-8.0.19-1.el7.noarch.rpm
# RUN ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java-8.0.19.jar

#Postgres connector
RUN wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar
RUN /usr/bin/mv ./postgresql-42.2.9.jar /usr/share/java/postgresql-42.2.9.jar && chmod 644 /usr/share/java/postgresql-42.2.9.jar
RUN ambari-server setup --jdbc-db=postgres --jdbc-driver=/usr/share/java/postgresql-42.2.9.jar

WORKDIR /var/lib/ambari-server/resources/
RUN wget http://public-repo-1.hortonworks.com/ARTIFACTS/jdk-8u112-linux-x64.tar.gz
RUN wget http://public-repo-1.hortonworks.com/ARTIFACTS/jce_policy-8.zip

WORKDIR /root
# NOTICE !
COPY ./init-server.sh /root/init.sh
RUN chmod 0751 /root/init.sh
RUN chown root:root /root/init.sh

RUN echo -e "\\n/root/init.sh" >> /etc/rc.d/rc.local
RUN chmod +x /etc/rc.d/rc.local
RUN /usr/bin/systemctl enable rc-local.service

EXPOSE 5432 8080 8440 8441

The server image also requires an additional script to be executed at startup.

Create the init-server.sh script in your Docker build directory with the following content:

#!/usr/bin/env sh

FLAG_FILE=/ambari-ok

# This runs only once, at first startup. It will:
# - run the Ambari server initial setup
# - configure PostgreSQL with database "hive" and user/password hive/hive
if [ ! -f "$FLAG_FILE" ]; then
    echo "Ambari initial setup"
    echo -e "n\n1\ny\nn\n" | ambari-server setup
    sed -i "s/ambari,mapred/all/g" /var/lib/pgsql/data/pg_hba.conf
    service postgresql restart
    ambari-server start
    su - postgres -c "psql -c \"create database hive\""
    su - postgres -c "psql -c \"create user hive with password 'hive';\""
    su - postgres -c "psql -c \"grant all privileges on database hive to hive;\""
    touch $FLAG_FILE
else
    echo "Ambari already initialized"
fi
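
The flag-file trick used above (do the expensive setup once, then skip it on every later boot) can be pulled out into a small helper. This is only a sketch of the pattern and not part of the original setup: run_once is a hypothetical name, and unlike init-server.sh it creates the flag only when the command succeeds, so a failed setup is retried on the next start.

```shell
#!/usr/bin/env sh
# run_once FLAG_FILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND only if FLAG_FILE does not exist yet.
# The flag is created only when COMMAND exits successfully.
run_once() {
    flag="$1"
    shift
    if [ -f "$flag" ]; then
        echo "already initialized"
        return 0
    fi
    "$@" && touch "$flag"
}
```

With such a helper, the body of the if branch could live in its own function or script and be called as run_once /ambari-ok ./first-boot.sh, where first-boot.sh is again just an illustrative name.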

ambari.agent.centos7.Dockerfile

The agent image is simpler than the server one. You just need to have the Ambari agent properly installed.

# You can just paste the whole content of centos7.systemd.Dockerfile instead of the next line
FROM centos7-systemd:latest

#
# Ambari agent specific config
#

RUN wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.7.4.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
RUN yum update -y && yum -y install ambari-agent
RUN yum clean all && rm -rf /var/cache/yum

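# Bugfix for the agent messing up the hostname value.
# printf writes a literal "#!" shebang; "E=! echo ... $E" expands $E before the assignment applies, so the "!" is lost.
RUN printf '#!/usr/bin/env sh\nhostname -f\n' > /var/lib/ambari-agent/public_hostname.sh && chmod +x /var/lib/ambari-agent/public_hostname.sh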
RUN sed -i "s/\[agent\]/[agent]\npublic_hostname_script=\/var\/lib\/ambari-agent\/public_hostname.sh/g" /etc/ambari-agent/conf/ambari-agent.ini

# NOTICE !
COPY ./init-agent.sh /root/agent.sh
RUN chmod 0751 /root/agent.sh
RUN chown root:root /root/agent.sh 

And again we need a tricky script to do some configuration magic.

Create the init-agent.sh script in the Docker build directory with the following content:

#!/usr/bin/env sh
FLAG_FILE=/ambari-agent-ok
AMBARI_SERVER="$1"
if [ -z "$AMBARI_SERVER" ]; then
    echo "Hostname not provided"
    exit 1;
fi

if [ ! -f "$FLAG_FILE" ]; then
    echo "Ambari agent setup with host $AMBARI_SERVER"

    sed -i "s/hostname=localhost/hostname=${AMBARI_SERVER}/g" /etc/ambari-agent/conf/ambari-agent.ini
    /usr/sbin/ambari-agent restart

    touch $FLAG_FILE
else
    echo "Ambari agent already configured"
fi
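
To see what the sed line in this script does without touching a real agent, here is a small sketch that applies a similar substitution to any ini file. The function name set_agent_server is made up for this example. It also differs from the script above in one assumption: it replaces whatever value the hostname= line currently has, not only localhost, so it can be re-run with a different server name. The ^ anchor keeps it from touching the public_hostname_script= line added in the Dockerfile.

```shell
#!/usr/bin/env sh
# set_agent_server INI_FILE SERVER_HOST
# Hypothetical helper: rewrites the "hostname=..." line of an
# ambari-agent.ini style file so that it points at SERVER_HOST.
set_agent_server() {
    ini="$1"
    server="$2"
    tmp="${ini}.tmp"
    # Write to a temp file first, then replace the original.
    sed "s/^hostname=.*/hostname=${server}/" "$ini" > "$tmp" && mv "$tmp" "$ini"
}
```

For a quick experiment, copy /etc/ambari-agent/conf/ambari-agent.ini out of a container, run set_agent_server ./ambari-agent.ini ambari-server on the copy, and diff the result against the original.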

Note that this script ends up in the image as /root/agent.sh and is not executed by default.

docker-compose.yml

Now it is time to create our docker-compose.yml. It will use your Dockerfiles and sh scripts to build and launch one Ambari server container and 3 Ambari agent containers. All containers will be in the same network, “ambari”. Please notice which ports are exposed to your local environment.

version: "3"

networks:
  ambari:

services:
  ambari-server:
    container_name: "ambari-server"
    hostname: ambari-server
    image: "ambari-srerver-274"
    privileged: true
    volumes:
      - "/sys/fs/cgroup:/sys/fs/cgroup:ro"
    build:
      context: .
      dockerfile: "ambari.server.centos7.Dockerfile"
    ports:
      - "80:80"
      - "5005:5005"
      - "8080:8080"
    networks:
      - ambari

  ambari-agent1:
    container_name: "ambari-agent1"
    hostname: ambari-agent1
    image: "ambari-agent-274"
    privileged: true
    volumes:
      - "/sys/fs/cgroup:/sys/fs/cgroup:ro"
    build:
      context: .
      dockerfile: "ambari.agent.centos7.Dockerfile"
    depends_on:
      - "ambari-server"
    ports:
      - "3001:3000"
    networks:
      - ambari


  ambari-agent2:
    container_name: "ambari-agent2"
    hostname: ambari-agent2
    image: "ambari-agent-274"
    privileged: true
    volumes:
      - "/sys/fs/cgroup:/sys/fs/cgroup:ro"
    build:
      context: .
      dockerfile: "ambari.agent.centos7.Dockerfile"
    depends_on:
      - "ambari-server"
      - "ambari-agent1"
    ports:
      - "3002:3000"
    networks:
      - ambari


  ambari-agent3:
    container_name: "ambari-agent3"
    hostname: ambari-agent3
    image: "ambari-agent-274"
    privileged: true
    volumes:
      - "/sys/fs/cgroup:/sys/fs/cgroup:ro"
    build:
      context: .
      dockerfile: "ambari.agent.centos7.Dockerfile"
    depends_on:
      - "ambari-server"
      - "ambari-agent1"
      - "ambari-agent2"
    ports:
      - "3003:3000"
    networks:
      - ambari
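
The three agent stanzas above differ only in the agent number and the host port. If you want more (or fewer) agents, a tiny generator can print the stanzas for you. This is a sketch under a few assumptions: the function names are invented here, the host port is simply 3000 plus the agent number as in the file above, and depends_on is reduced to ambari-server only instead of chaining every agent to the previous ones.

```shell
#!/usr/bin/env sh
# agent_service N
# Hypothetical helper: prints a compose service stanza for agent number N,
# mirroring the hand-written stanzas (host port 3000+N -> container port 3000).
agent_service() {
    n="$1"
    cat <<EOF
  ambari-agent${n}:
    container_name: "ambari-agent${n}"
    hostname: ambari-agent${n}
    image: "ambari-agent-274"
    privileged: true
    volumes:
      - "/sys/fs/cgroup:/sys/fs/cgroup:ro"
    build:
      context: .
      dockerfile: "ambari.agent.centos7.Dockerfile"
    depends_on:
      - "ambari-server"
    ports:
      - "$((3000 + $n)):3000"
    networks:
      - ambari

EOF
}

# agent_services COUNT
# Prints the stanzas for agents 1..COUNT.
agent_services() {
    i=1
    while [ "$i" -le "$1" ]; do
        agent_service "$i"
        i=$((i + 1))
    done
}
```

Calling agent_services 5 prints five stanzas that can be pasted under the ambari-server service. If you change the number of agents, remember that any script looping over ambari-agent1..N has to use the same count.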

Time to launch our cluster

Create the Docker containers by executing

docker-compose up

It will take some time because Docker has to build the images first. After the build completes, you should see all your Docker containers up and running.

docker ps -a

CONTAINER ID   IMAGE                COMMAND            CREATED         STATUS         PORTS                                                                                         NAMES
82f5cc67dec9   ambari-agent-274     "/usr/sbin/init"   3 minutes ago   Up 3 minutes   0.0.0.0:3003->3000/tcp                                                                        ambari-agent3
c7a32bbc9844   ambari-agent-274     "/usr/sbin/init"   3 minutes ago   Up 3 minutes   0.0.0.0:3002->3000/tcp                                                                        ambari-agent2
0faaab93d6a1   ambari-agent-274     "/usr/sbin/init"   3 minutes ago   Up 3 minutes   0.0.0.0:3001->3000/tcp                                                                        ambari-agent1
1b40b5f19cd8   ambari-srerver-274   "/usr/sbin/init"   3 minutes ago   Up 3 minutes   0.0.0.0:80->80/tcp, 0.0.0.0:5005->5005/tcp, 5432/tcp, 0.0.0.0:8080->8080/tcp, 8440-8441/tcp   ambari-server

Also, you should see the Ambari login page at http://localhost:8080.
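
The login page does not appear the moment the containers are up, because the first-boot setup inside ambari-server takes a while. Instead of refreshing the browser, you can poll until the port answers. Below is a minimal retry helper; wait_until is a name invented for this sketch, and the number of tries and the delay are arbitrary.

```shell
#!/usr/bin/env sh
# wait_until TRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to TRIES times, sleeping
# DELAY_SECONDS after each failure. Returns 0 as soon as COMMAND
# succeeds, 1 if it never does.
wait_until() {
    tries="$1"
    delay="$2"
    shift 2
    attempt=1
    while [ "$attempt" -le "$tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, wait_until 60 5 curl -sf -o /dev/null http://localhost:8080 && echo "Ambari is up" checks every 5 seconds for up to 5 minutes, assuming curl is installed on the host.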

Do not forget that from now on you can use docker-compose start and docker-compose stop to start and stop your Ambari containers.

We still have to do some magic! We have 3 Ambari agents with names like ambari-agent# and a server named ambari-server. Now we have to execute the next script; let’s call it agents-configure.sh.

#!/usr/bin/env sh
AMBARI_SERVER="ambari-server"
AGENTS_COUNT=3
for i in $(seq $AGENTS_COUNT); do 
    echo -e "\nConfigure agent ${i}\n"
    docker exec -t "ambari-agent${i}" /root/agent.sh "$AMBARI_SERVER"
done 

It will point our agents at our ambari-server instance so that they can register with it.
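
To check that the agents really registered, you can ask the Ambari REST API for the list of known hosts instead of waiting for the wizard to complain. The sketch below only extracts the host_name values from the JSON returned by /api/v1/hosts. It assumes the default admin/admin credentials and relies on grep -o, which GNU and BSD grep support; host_names is a made-up function name.

```shell
#!/usr/bin/env sh
# host_names
# Hypothetical helper: reads an Ambari /api/v1/hosts JSON response on
# stdin and prints one registered host name per line.
host_names() {
    grep -o '"host_name" *: *"[^"]*"' | sed 's/.*: *"\(.*\)"/\1/'
}
```

Usage would look like curl -s -u admin:admin http://localhost:8080/api/v1/hosts | host_names. If all went well, the three agent hostnames should show up in the output.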

Ambari (HDP) initial cluster configuration

After everything is configured, go to http://localhost:8080 and log in as admin/admin :) Start the installation wizard and pick any name for your cluster.

Choose the latest HDP stack version and verify that the hortonworks.com domain is used in the repositories section.

Now you have to tell your server where your agents are located. You must provide the Docker hostnames: ambari-agent1, ambari-agent2, ambari-agent3. After that, press the register button and wait for the agent setup to complete.

Now you have to choose which services will be provided in your cluster. I suggest you initially install only HDFS and ZooKeeper. It takes a hell of a lot of time to install even a single service, and you do not need all of them at the same time anyway.

Leave ambari-agent3 as a data processing node and do not install any server/slave components on it.

After about 30 minutes, or maybe a whole hour :), of installation you will be able to see the Ambari dashboard. The first thing you should do is stop and delete the SmartSense service. It will not work properly in your local environment anyway.

Now you can install any other service you need and play with your local Ambari cluster.

If you want to install Hive, note that you already have a PostgreSQL DB on the ambari-server container with a database named “hive” and the same user/password. So when you configure the Hive server during installation, use these credentials.

Install HDF stack on existing HDP cluster

If you want to try the data processing platform Apache NiFi in your setup, you should install an additional Ambari management pack. Hortonworks DataFlow (HDF) installation looks easy, but it is not. Say hello to this issue and to the developers who did not bother to point out version compatibility issues in the manual. It seems that for Ambari v2.7.4 you can use at most HDF mpack v3.2.0. You can find useful installation instructions here:

And here is my CLI shortcut to install HDF on your dockerized Ambari server.

# Open a shell inside the ambari-server container
docker exec -ti ambari-server /bin/bash

# Inside docker container
cd /tmp

wget http://public-repo-1.hortonworks.com/HDF/amazonlinux2/3.x/updates/3.2.0.0/tars/hdf_ambari_mp/hdf-ambari-mpack-3.2.0.0-520.tar.gz

ambari-server install-mpack --mpack=/tmp/hdf-ambari-mpack-3.2.0.0-520.tar.gz --verbose

ambari-server restart

That is all, folks. After Ambari restarts, you should see new components available for installation, such as Apache NiFi.

Takeaway

Installing Ambari on Docker is not easy, because it was not meant to run in such an environment. But it is a cool way to emulate a whole Hadoop cluster on a single machine.

You will suffer if you have only 16 GB of RAM. The whole Hadoop stack is extremely memory hungry. It is almost impossible to configure YARN to execute at least 3 applications at the same time on 16 GB. The default Ambari memory configuration does not work well in such a Docker deployment, because every container sees all of your PC's RAM as available.

About suffering with YARN: if it is misconfigured, which is the case by default, you will not receive any errors. YARN will wait for free memory indefinitely and the job will never be launched.
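
A bit of back-of-the-envelope arithmetic helps to see why jobs get stuck. The sketch below is a deliberately simplified model, not YARN's real scheduler: it assumes every container has the same size and that each application needs a fixed number of containers (for example one application master plus one task). It ignores extra limits such as yarn.scheduler.capacity.maximum-am-resource-percent, which can reduce the result further. The function names and the sample numbers are made up for illustration.

```shell
#!/usr/bin/env sh
# max_containers NODE_MB CONTAINER_MB
# How many containers of CONTAINER_MB fit into a NodeManager that offers
# NODE_MB (think yarn.nodemanager.resource.memory-mb divided by the
# container size, rounded down).
max_containers() {
    echo $(( $1 / $2 ))
}

# max_apps NODE_MB NODES CONTAINER_MB CONTAINERS_PER_APP
# Rough upper bound on applications that can run at the same time.
max_apps() {
    per_node=$(max_containers "$1" "$3")
    echo $(( $per_node * $2 / $4 ))
}
```

With 2 NodeManagers offering 3072 MB each, 1024 MB containers and 2 containers per application, max_apps 3072 2 1024 2 gives 3. With a single 1024 MB node, max_apps 1024 1 1024 2 gives 0: the application master fits, but there is no room left for a task, so the job just sits there waiting.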

Component compatibility issues will make you cry too. The current HDP/HDF stack is assembled in a tricky way. For example, your Spark application will require you to provide an exact version of the HDFS dependencies, because the default HDFS library shipped with Spark does not work in this setup. Launching Spark from Oozie is a nightmare here, because Oozie has its own vision of a Spark job and its dependencies. So pay attention to the exact versions of all HDP components if you are about to write your application.

Ambari looks cool as a Hadoop administration tool, but unfortunately Cloudera closed free access to the HDP and HDF stacks. This is why you should probably look into other deployment tools like Helm.