Hello there.
Let’s talk about the big data stack, and Hadoop in particular.
The sad thing about it is that there are too many products linked together;
you need several months just to learn the basics.
Let’s list some of these products:
- Hadoop/HDFS
- MapReduce
- Yarn
- Tez
- Hive
- Oozie
- Spark
- Kafka
- HBase
- Cassandra
- Apache NiFi
- and many more :)
Apache Ambari
Everything gets much easier if you have a nice UI that lets you quickly install and configure all these Hadoop-related products together, and that UI is called Apache Ambari.
The main thing to remember is that it does not work out of the box.
Well, you can install and run it, but you will not be able to do anything because the UI will be empty.
That is because Ambari is an empty tool by default: you have to configure it with an appropriate package set
and provide that package set to make everything work.
So how to deal with empty Ambari?
This is where Cloudera (formerly Hortonworks) comes into the picture.
Cloudera provides the Hortonworks Data Platform (HDP) and Hortonworks DataFlow (HDF) stacks for Apache Ambari.
They assemble all compatible software packages together and provide them via repositories.
Unfortunately, access to the latest HDP/HDF stack repositories is not free anymore, but you can still use the Ambari 2.7.4 setup free of charge.
Ambari and Docker issues
If you try to find Docker images that work, you will definitely fail.
I was not able to find anything on Docker Hub.
The Ambari setup does not align with the Docker approach because it was developed to run inside a real OS.
So in order to make it work, you actually have to build a Docker image that can run systemd.
That approach does the trick: Docker runs systemd, and systemd runs any service you need.
It is almost like running inside a VM, but with shared resources, as in Docker.
By the way, this also means there is a point in running this setup only if your OS has proper Docker support.
That means Windows with WSL2, or Linux.
If you are on macOS, you should probably use a VirtualBox-based Ambari HDP setup instead.
Preparing Docker configuration
Here is what we are about to do:
- create 3 Dockerfiles for:
- base image
- ambari server
- ambari client
- write docker-compose.yml
- write some knotty scripts to fix stuff
Keep in mind that you must put all these files in a single directory.
centos7.systemd.Dockerfile -> centos7-systemd:latest
Although Ambari supports many OSes, in my setup it worked without issues only on CentOS 7-based images.
Most of the tricks here exist only to support systemd.
FROM centos:7
ENV container docker
RUN echo "Setting up root password to \"ambari\""
RUN echo "root:ambari" | chpasswd
## Install some basic utilities that aren't in the default image
RUN yum clean all -y && yum update -y
RUN yum -y install deltarpm
RUN yum -y install systemd sudo vim wget curl which tar initscripts chrony python
RUN yum clean all && rm -rf /var/cache/yum
ENV HOME /root
RUN (cd /lib/systemd/system/sysinit.target.wants/; for i in *; do [ "$i" = systemd-tmpfiles-setup.service ] || rm -f $i; done); \
rm -f /lib/systemd/system/multi-user.target.wants/*; \
rm -f /etc/systemd/system/*.wants/*; \
rm -f /lib/systemd/system/local-fs.target.wants/*; \
rm -f /lib/systemd/system/sockets.target.wants/*udev*; \
rm -f /lib/systemd/system/sockets.target.wants/*initctl*; \
rm -f /lib/systemd/system/basic.target.wants/*; \
rm -f /lib/systemd/system/anaconda.target.wants/*;
RUN /usr/bin/systemctl enable chronyd.service
VOLUME [ "/sys/fs/cgroup" ]
CMD ["/usr/sbin/init"]
Now we have to build our base ambari image.
docker image build -f centos7.systemd.Dockerfile -t centos7-systemd:latest .
BTW, you can skip this step, but then you would have to copy the whole centos7.systemd.Dockerfile
content into the next Dockerfiles.
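Before moving on, it is worth a quick sanity check that systemd actually boots inside the base image. This is just a sketch; the throwaway container name is my own invention, and "running" (or "degraded") from systemctl means systemd came up as PID 1:

```shell
# Boot a throwaway container from the freshly built base image
IMAGE="centos7-systemd:latest"
NAME="systemd-smoke-test"   # arbitrary name, used only for this check
docker run -d --name "$NAME" --privileged \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro "$IMAGE" >/dev/null 2>&1 || true
# Give systemd a moment to come up, then ask for its state
sleep 5
docker exec "$NAME" systemctl is-system-running 2>/dev/null || true
# Clean up the test container
docker rm -f "$NAME" >/dev/null 2>&1 || true
```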
ambari.server.centos7.Dockerfile
This Dockerfile builds the Ambari server image.
# You can just paste the whole content of centos7.systemd.Dockerfile instead of the next line
FROM centos7-systemd:latest
#
# Ambari server specific config
#
RUN wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.7.4.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
RUN yum update -y && yum -y install ambari-server
RUN yum clean all && rm -rf /var/cache/yum
RUN mkdir -p /usr/share/java/
# Mysql connector that could be used by agent
# RUN wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.19-1.el7.noarch.rpm
# RUN /usr/bin/rpm -i ./mysql-connector-java-8.0.19-1.el7.noarch.rpm
# RUN rm ./mysql-connector-java-8.0.19-1.el7.noarch.rpm
# RUN ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java-8.0.19.jar
#Postgres connector
RUN wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar
RUN /usr/bin/mv ./postgresql-42.2.9.jar /usr/share/java/postgresql-42.2.9.jar && chmod 644 /usr/share/java/postgresql-42.2.9.jar
RUN ambari-server setup --jdbc-db=postgres --jdbc-driver=/usr/share/java/postgresql-42.2.9.jar
WORKDIR /var/lib/ambari-server/resources/
RUN wget http://public-repo-1.hortonworks.com/ARTIFACTS/jdk-8u112-linux-x64.tar.gz
RUN wget http://public-repo-1.hortonworks.com/ARTIFACTS/jce_policy-8.zip
WORKDIR /root
# NOTICE !
COPY ./init-server.sh /root/init.sh
RUN chmod 0751 /root/init.sh
RUN chown root:root /root/init.sh
RUN echo -e "\\n/root/init.sh" >> /etc/rc.d/rc.local
RUN chmod +x /etc/rc.d/rc.local
RUN /usr/bin/systemctl enable rc-local.service
EXPOSE 5432 8080 8440 8441
It also requires an additional script that is executed at startup.
Create an init-server.sh
script in your docker build directory with the following content:
#!/usr/bin/env sh
FLAG_FILE=/ambari-ok
# It runs only once, at the first startup. It will:
# - run the Ambari server initial setup
# - configure PostgreSQL with a "hive" database and user/password hive/hive
if [ ! -f "$FLAG_FILE" ]; then
    echo "Ambari initial setup"
    echo -e "n\n1\ny\nn\n" | ambari-server setup
    sed -i "s/ambari,mapred/all/g" /var/lib/pgsql/data/pg_hba.conf
    service postgresql restart
    ambari-server start
    su - postgres -c "psql -c \"create database hive\""
    su - postgres -c "psql -c \"create user hive with password 'hive';\""
    su - postgres -c "psql -c \"grant all privileges on database hive to hive;\""
    touch $FLAG_FILE
else
    echo "Ambari already initialized"
fi
ambari.agent.centos7.Dockerfile
The agent image is simpler than the server one.
You just need the Ambari agent properly installed.
# You can just paste the whole content of centos7.systemd.Dockerfile instead of the next line
FROM centos7-systemd:latest
#
# Ambari agent specific config
#
RUN wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.7.4.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
RUN yum update -y && yum -y install ambari-agent
RUN yum clean all && rm -rf /var/cache/yum
# bugfix: the agent messes up the hostname value
RUN E=! echo -e "#$E/usr/bin/env sh\necho \`hostname -f\`" > /var/lib/ambari-agent/public_hostname.sh; chmod +x /var/lib/ambari-agent/public_hostname.sh
RUN sed -i "s/\[agent\]/[agent]\npublic_hostname_script=\/var\/lib\/ambari-agent\/public_hostname.sh/g" /etc/ambari-agent/conf/ambari-agent.ini
# NOTICE !
COPY ./init-agent.sh /root/agent.sh
RUN chmod 0751 /root/agent.sh
RUN chown root:root /root/agent.sh
And again we need a tricky script to do some configuration magic.
Create an init-agent.sh
script in the docker build directory with the following content:
#!/usr/bin/env sh
FLAG_FILE=/ambari-agent-ok
AMBARI_SERVER="$1"
if [ -z "$AMBARI_SERVER" ]; then
    echo "Hostname not provided"
    exit 1
fi
if [ ! -f "$FLAG_FILE" ]; then
    echo "Ambari agent setup with host $AMBARI_SERVER"
    sed -i "s/hostname=localhost/hostname=${AMBARI_SERVER}/g" /etc/ambari-agent/conf/ambari-agent.ini
    /usr/sbin/ambari-agent restart
    touch $FLAG_FILE
else
    echo "Ambari agent already configured"
fi
An attentive developer will notice that this script is available as /root/agent.sh
but is not executed by default.
docker-compose.yml
Now it is time to create our docker-compose.yml.
It will use your Dockerfiles and shell scripts to build and launch the Ambari server container and 3 Ambari agent containers.
Also, all containers will be in the same “ambari” network. Please notice which ports are exposed to the local environment.
version: "3"
networks:
  ambari:
services:
  ambari-server:
    container_name: "ambari-server"
    hostname: ambari-server
    image: "ambari-server-274"
    privileged: true
    volumes:
      - "/sys/fs/cgroup:/sys/fs/cgroup:ro"
    build:
      context: .
      dockerfile: "ambari.server.centos7.Dockerfile"
    ports:
      - "80:80"
      - "5005:5005"
      - "8080:8080"
    networks:
      - ambari
  ambari-agent1:
    container_name: "ambari-agent1"
    hostname: ambari-agent1
    image: "ambari-agent-274"
    privileged: true
    volumes:
      - "/sys/fs/cgroup:/sys/fs/cgroup:ro"
    build:
      context: .
      dockerfile: "ambari.agent.centos7.Dockerfile"
    depends_on:
      - "ambari-server"
    ports:
      - "3001:3000"
    networks:
      - ambari
  ambari-agent2:
    container_name: "ambari-agent2"
    hostname: ambari-agent2
    image: "ambari-agent-274"
    privileged: true
    volumes:
      - "/sys/fs/cgroup:/sys/fs/cgroup:ro"
    build:
      context: .
      dockerfile: "ambari.agent.centos7.Dockerfile"
    depends_on:
      - "ambari-server"
      - "ambari-agent1"
    ports:
      - "3002:3000"
    networks:
      - ambari
  ambari-agent3:
    container_name: "ambari-agent3"
    hostname: ambari-agent3
    image: "ambari-agent-274"
    privileged: true
    volumes:
      - "/sys/fs/cgroup:/sys/fs/cgroup:ro"
    build:
      context: .
      dockerfile: "ambari.agent.centos7.Dockerfile"
    depends_on:
      - "ambari-server"
      - "ambari-agent1"
      - "ambari-agent2"
    ports:
      - "3003:3000"
    networks:
      - ambari
Time to launch our cluster
Create the docker containers by executing docker-compose up -d
in that directory.
It will take a while because Docker has to build the images first.
After the build completes, you should see all your docker containers up and running.
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
82f5cc67dec9 ambari-agent-274 "/usr/sbin/init" 3 minutes ago Up 3 minutes 0.0.0.0:3003->3000/tcp ambari-agent3
c7a32bbc9844 ambari-agent-274 "/usr/sbin/init" 3 minutes ago Up 3 minutes 0.0.0.0:3002->3000/tcp ambari-agent2
0faaab93d6a1 ambari-agent-274 "/usr/sbin/init" 3 minutes ago Up 3 minutes 0.0.0.0:3001->3000/tcp ambari-agent1
1b40b5f19cd8 ambari-server-274 "/usr/sbin/init" 3 minutes ago Up 3 minutes 0.0.0.0:80->80/tcp, 0.0.0.0:5005->5005/tcp, 5432/tcp, 0.0.0.0:8080->8080/tcp, 8440-8441/tcp ambari-server
Also, you should see the Ambari login page at http://localhost:8080 .
Do not forget that from now on you can use docker-compose start
and docker-compose stop
to start and stop your Ambari containers.
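If you want more than a glance at the login page, you can probe the server directly; ambari-server ships a status subcommand, and curl can check the UI port (the URL matches the port mapping above). A small sketch:

```shell
# Ask ambari-server itself whether it is up
docker exec -t ambari-server ambari-server status || true
# Probe the UI: -s silences progress, -o drops the body,
# -w prints only the HTTP status code
UI_URL="http://localhost:8080"
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$UI_URL" || true)
echo "Ambari UI answered with HTTP ${HTTP_CODE:-000}"
```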
We still have to do some magic!
Recall that we have 3 Ambari agents with names like ambari-agent#
and a server named ambari-server.
Now we have to execute the next script; let’s call it agents-configure.sh.
#!/usr/bin/env sh
AMBARI_SERVER="ambari-server"
AGENTS_COUNT=3
for i in $(seq $AGENTS_COUNT); do
    echo -e "\nConfigure agent ${i}\n"
    docker exec -t "ambari-agent${i}" /root/agent.sh "$AMBARI_SERVER"
done
It will configure our agents to accept requests from our ambari-server instance.
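To double-check that the agents really registered, you can query Ambari’s REST API: each registered host shows up under the standard /api/v1/hosts endpoint, and admin/admin are the default credentials. A sketch:

```shell
# Query the registration status of each agent through the Ambari REST API
AMBARI_URL="http://localhost:8080"
for i in 1 2 3; do
  HOST="ambari-agent${i}"
  # A registered host returns a JSON document; an unknown one returns an error
  curl -s -u admin:admin "${AMBARI_URL}/api/v1/hosts/${HOST}" || true
  echo
done
```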
Ambari (HDP) initial cluster configuration
After everything is configured, you can go to http://localhost:8080 and log in as admin/admin :)
Start the installation wizard and give your cluster whatever name you like.
Choose the latest HDP stack version and verify that the hortonworks.com domain is used in the repositories section.
Now you have to tell your server where your agents are located.
You must provide the Docker hostnames: ambari-agent1, ambari-agent2, ambari-agent3.
After that, press the register button and wait for the agent setup to complete.
Now you have to choose which services will run in your cluster.
I suggest initially installing only HDFS and ZooKeeper.
It takes a hell of a lot of time to install even a single service, and you do not need all of them at once anyway.
Leave ambari-agent3 as a data processing node and do not install server/slave components on it.
After about 30 minutes, or maybe just an hour :) of installation, you will see the Ambari dashboard.
The first thing you should do is stop and delete the SmartSense service.
It would not work properly in your local environment anyway.
Now you can install any other services you need and play with your local Ambari cluster.
If you want to install Hive, note that you already have a PostgreSQL DB on the ambari-server container with a database named “hive” and the same user/password. Use these credentials when you configure the Hive server during installation.
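Before starting the Hive install you can double-check that init-server.sh really created that database; a sketch using psql inside the server container:

```shell
# List PostgreSQL databases quietly and look for the "hive" entry
DB_NAME="hive"
docker exec -t ambari-server \
  su - postgres -c "psql -lqt" 2>/dev/null | grep -w "$DB_NAME" || true
```

If nothing is printed, the init script has not run yet (or failed) and the flag file /ambari-ok on the server container is worth inspecting.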
Install HDF stack on existing HDP cluster
If you want to try the data flow platform Apache NiFi in your setup, you have to install an additional Ambari management pack.
Hortonworks DataFlow (HDF) installation looks easy, but it is not.
Say hello to this issue
and to dumb developers who do not bother to point out version compatibility issues in the manual.
It seems that with Ambari v2.7.4 you can use at most HDF mpack v3.2.0.
You can find useful installation instructions here:
And here is my CLI shortcut to install HDF on your dockerized Ambari server.
# kinda log in to the ambari server container
docker exec -ti ambari-server /bin/bash
# Inside docker container
cd /tmp
wget http://public-repo-1.hortonworks.com/HDF/amazonlinux2/3.x/updates/3.2.0.0/tars/hdf_ambari_mp/hdf-ambari-mpack-3.2.0.0-520.tar.gz
ambari-server install-mpack --mpack=/tmp/hdf-ambari-mpack-3.2.0.0-520.tar.gz --verbose
ambari-server restart
That is all, folks.
After Ambari restarts you should be able to see new components available for installation, such as Apache NiFi.
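If the restart went fine but you still do not see new components, you can check whether the mpack actually landed; Ambari unpacks management packs under the server’s resources directory (path assumed from the setup above):

```shell
# Management packs are unpacked under the server's resources directory;
# an hdf-ambari-mpack-* entry should appear here after installation
MPACK_DIR="/var/lib/ambari-server/resources/mpacks"
docker exec -t ambari-server ls "$MPACK_DIR" || true
```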
Takeaway
Installing Ambari on Docker is not an easy thing, because it was never meant to run in such an environment.
But it is a cool way to emulate a whole Hadoop cluster on a single machine.
You will suffer if you only have 16 GB of RAM.
The whole Hadoop stack is extremely memory hungry.
It is almost impossible to configure YARN to execute even 3 applications at the same time on 16 GB.
The default Ambari memory configuration does not work well in such a Docker deployment,
because every container sees all of your PC’s RAM as available.
About suffering with YARN: if it is configured incorrectly, which it is by default, you will not get any errors.
YARN will wait for free memory indefinitely and the job will never be launched.
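For reference, these are the standard yarn-site memory knobs to look at under YARN → Configs in Ambari. The property names are standard YARN configuration; the values below are only an illustration for a 16 GB machine, not recommendations:

```xml
<!-- Illustrative values only, assuming ~16 GB of host RAM -->
<property>
  <!-- how much RAM each NodeManager is allowed to hand out to containers -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <!-- smallest container YARN will allocate -->
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <!-- largest single container a job may request -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>
```

Keeping the maximum allocation well below the NodeManager total is what lets several applications run side by side instead of one job grabbing everything.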
Component compatibility issues will make you cry too.
The current HDP/HDF stack is assembled in a tricky way.
For example, your Spark application will require you to provide exact versions of the HDFS dependencies, because the default HDFS library shipped with Spark does not work in this setup.
Launching Spark from Oozie here is a nightmare, because Oozie has its own vision of a Spark job and its dependencies.
So pay attention to the exact versions of all HDP components if you are about to write your own application.
Ambari as a Hadoop administration tool looks cool, but unfortunately Cloudera closed free access to the HDP and HDF stacks.
This is why you should probably look into other deployment tools, like Helm.