How to Benchmark Apache Hive on Hadoop Node using ‘hive-testbench’

Dawood Munavar Feb 26 - 10 min read

Audio : Listen to This Blog.

Apache Hive is a data warehouse software project built on top of Apache Hadoop for the querying of large data systems in the open-source Hadoop platform.

Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

The three important features for which Hive is deployed are
1) Data summarization
2) Data analysis
3) Data query

The query language, exclusively supported by Hive, is HiveQL. This language translates SQL-like queries into MapReduce jobs for deploying them on Hadoop nodes.

Install Hadoop

Step 1:

I’ve built a virtual machine with CentOS 7, with below software specifications:

vSphere web client: Version 6.7
Hypervisor: VMware ESXi, 6.7.0,
Datastore Backend: SAN volume connected via FC
Datastore Type: VMFS6
VM version: 14

Software versions used:

  • Java version “1.8.0_221”
  • Hadoop version: 3.2.1
  • Hive version: 3.1.2

Step 2: Prerequisites:

  • Prepare a new VM with CentOS7 installed.
  • Make sure VM have proper IP and HOSTNAME entries in /etc/hosts.
  • Setup Passwordless SSH on VM using ssh-keygen (copy id_rsa.pub to /root/.ssh/authorized_keys on the same hosts)
  • Disable IPv6 by editing /etc/sysctl.conf
 
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Step 3: Install Java on CentOS7

Download Tar or RPM file of Java 8:

tar -xvf jdk-8u221-linux-x64.tar.gz
mkdir /usr/java
mv jdk1.8.0_221/ /usr/java/

alternatives --install /usr/bin/java java /usr/java/jdk1.8.0_221/bin/java 2
alternatives --install /usr/bin/javaws javaws /usr/java/jdk1.8.0_221/bin/javaws 2
alternatives --install /usr/bin/javac javac /usr/java/jdk1.8.0_221/bin/javac 2
alternatives --install /usr/bin/jar jar /usr/java/jdk1.8.0_221/bin/jar 2
alternatives --install /usr/bin/jps jps /usr/java/jdk1.8.0_221/bin/jps 2

[root@hive ~]# java -version
java version "1.8.0_221"
Java(TM) SE Runtime Environment (build 1.8.0_221-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.221-b11, mixed mode)

Update /etc/profile:

echo "" >> /etc/profile
echo "## Setting JAVA_HOME and PATH for all USERS ##" >> /etc/profile
echo "export JAVA_HOME=/usr/java/jdk1.8.0_221" >> /etc/profile
echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile
source /etc/profile

Step 4: Download Hadoop 3.2.1 and extract it to /opt/ directory

wget http://mirrors.estointernet.in/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar -xvf hadoop-3.2.1.tar.gz

Update /etc/profile:

echo "" >> /etc/profile
echo "### HADOOP Variables ###" >> /etc/profile
echo "export HADOOP_HOME=/opt/hadoop-3.2.1" >> /etc/profile
echo "export HADOOP_MAPRED_HOME=\$HADOOP_HOME" >> /etc/profile
echo "export HADOOP_COMMON_HOME=\$HADOOP_HOME" >> /etc/profile
echo "export HADOOP_HDFS_HOME=\$HADOOP_HOME" >> /etc/profile
echo "export YARN_HOME=\$HADOOP_HOME" >> /etc/profile
echo "export HADOOP_COMMON_LIB_NATIVE_DIR=\$HADOOP_HOME/lib/native" >> /etc/profile
echo "export PATH=\$PATH:\$HADOOP_HOME/sbin:\$HADOOP_HOME/bin" >> /etc/profile
source /etc/profile

Step 5: Setting up Hadoop Environment

mkdir -p /data/hadoop-data/nn
mkdir -p /data/hadoop-data/snn
mkdir -p /data/hadoop-data/dn
mkdir -p /data/hadoop-data/mapred/system
mkdir -p /data/hadoop-data/mapred/local

edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set JAVA_HOME environment variable

export JAVA_HOME=/usr/java/jdk1.8.0_221/

Step 6: Edit Hadoop XML Configuration files

hdfs-site.xml

    
        dfs.replication
        1
    
    
        dfs.name.dir
        file:///data/hadoop-data/nn
    
    
        dfs.data.dir
        file:///data/hadoop-data/dn
    
    
        dfs.namenode.checkpoint.dir
        file:///data/hadoop-data/snn
    

core-site.xml

    
        fs.default.name
        hdfs://hive:9000
    

Note: hive is the hostname of VM

mapred-site.xml

  yarn.app.mapreduce.am.env
  HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1


  mapreduce.map.env
  HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1


  mapreduce.reduce.env
  HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1


  mapreduce.framework.name
  yarn


yarn-site.xml

    
        yarn.resourcemanager.hostname
        hive
    
    
        yarn.nodemanager.aux-services
        mapreduce_shuffle
    
    
        yarn.nodemanager.aux-services.mapreduce.shuffle.class
        org.apache.hadoop.mapred.ShuffleHandler
    

Step 7: We need to format Hadoop NameNode using the below command prior tobefore starting the Hadoop cluster.

hdfs namenode -format

Sample output:
WARNING: /home/hadoop/hadoop/logs does not exist. Creating.
2018-05-02 17:52:09,678 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hive/xxx.xxx.xxx.xxx
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.2.1
...
...
...
2018-05-02 17:52:13,717 INFO common.Storage: Storage directory /opt/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.
2018-05-02 17:52:13,806 INFO namenode.FSImageFormatProtobuf: Saving image file /opt/hadoop/hadoopdata/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
2018-05-02 17:52:14,161 INFO namenode.FSImageFormatProtobuf: Image file /opt/hadoop/hadoopdata/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 391 bytes saved in 0 seconds .
2018-05-02 17:52:14,224 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2018-05-02 17:52:14,282 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hive/xxx.xxx.xxx.xxx
************************************************************/

Step 8: Commands for starting and stopping Hadoop Cluster

	cd $HADOOP_HOME/sbin/
./start-dfs.sh
./start-yarn.sh

If you’rer using a root user to start hadoop on VM, then following commands needs to be executed as root user prior to the above commands.
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"

Step 9: Access Hadoop Services in Browser

  • Hadoop NameNode started on port 9870 default. Access your server on port 9870 in your favorite web browser.
  • Now access port 8042 for getting the information about the cluster and all applications.
  • Access port 9864 to get details about your Hadoop node.
Port Filtering in Firewall by updating below entries in /etc/sysconfig/iptables file.
Append below lines in /etc/sysconfig/iptables file
-A INPUT -m state --state NEW -m tcp -p tcp --dport 9870 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 8042 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 9864 -j ACCEPT

(Or)

sudo service iptables stop - To disable all the port

Now you can access Hadoop Services in Web Browser.

Install Apache Hive:

Step 1: Download the appropriate version of Hive & Extract it.

wget http://apachemirror.wuchna.com/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

tar -xvf apache-hive-3.1.2-bin.tar.gz

Update /etc/profile

echo "### Hive Variables ###" >> /etc/profile
cd /opt/hadoop-3.2.1/bin/hadoop
echo "export HADOOP=/opt/hadoop-3.2.1/bin/hadoop" >> /etc/profile
echo "export HIVE_HOME=/opt/apache-hive-3.1.2-bin" >> /etc/profile
echo "export PATH=\$HIVE_HOME/bin:\$PATH" >> /etc/profile
source /etc/profile

Step 2: Create hive directories on HDFS (Hadoop FileSystem)

	1) Start hadoop if not started yet:
cd $HADOOP_HOME/sbin
./start-all.sh

2) Check if all below 5 daemons are active and running using the command JPS:
jps

Hadoop daemons:
ResourceManager
SecondaryNameNode
DataNode
NodeManager
NameNode

3) Create directories:

hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse

4) Inform Hive about home directory of Hadoop:

vi $HIVE_HOME/bin/hive-config.sh
export HADOOP_HOME=/opt/hadoop-3.2.1

Step 3: Open hive shell

	
[root@hive ~]# $HIVE_HOME/bin/hive
which: no hbase in (/opt/apache-hive-3.1.2-bin/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/java/jdk1.8.0_221/bin:/opt/hadoop-3.2.1/sbin:/opt/hadoop-3.2.1/bin:/root/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = eb69b2ba-b509-4ae6-94f9-b7b82281b587

Logging initialized using configuration in jar:file:/opt/apache-hive-3.1.2-bin/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
hive>

Install hive-testbench

A testbench for experimenting with Apache Hive at any data scale.

Overview: The hive-testbench is a data generator and set of queries that lets you experiment with Apache Hive at scale. The testbench allows you to experience base Hive performance on large datasets, and gives an easy way to see the impact of Hive tuning parameters and advanced settings.

hive-testbench comes with data generators and sample queries based on both the TPC-DS and TPC-H benchmarks.

	
Download hive-testbench-hdp3.zip and extract it,

[root@hive hive-testbench-hdp3]# ls -l
total 5116
drwxr-xr-x. 6 root root     107 Oct  3 04:22 apache-maven-3.0.5
-rw-r--r--. 1 root root 5144659 Oct  3 04:22 apache-maven-3.0.5-bin.tar.gz
drwxr-xr-x. 4 root root      41 Jun 21 14:11 ddl-tpcds
drwxr-xr-x. 4 root root      45 Jun 21 14:11 ddl-tpch
-rw-r--r--. 1 root root   22213 Oct  3 06:00 derby.log
drwxr-xr-x. 5 root root     133 Oct  3 06:00 metastore_db
-rw-r--r--. 1 root root    4550 Jun 21 14:11 README.md
-rwxr-xr-x. 1 root root    1887 Jun 21 14:11 runSuite.pl
drwxr-xr-x. 3 root root    4096 Oct  3 05:59 sample-queries-tpcds
drwxr-xr-x. 2 root root    4096 Jun 21 14:11 sample-queries-tpch
drwxr-xr-x. 2 root root      71 Jun 21 14:11 settings
drwxr-xr-x. 2 root root    4096 Jun 21 14:11 spark-queries-tpcds
-rwxr-xr-x. 1 root root    1111 Jun 21 14:11 tpcds-build.sh
drwxr-xr-x. 5 root root     115 Oct  3 04:22 tpcds-gen
-rwxr-xr-x. 1 root root    3725 Jun 21 14:11 tpcds-setup.sh
-rwxr-xr-x. 1 root root    1107 Jun 21 14:11 tpch-build.sh
drwxr-xr-x. 5 root root      91 Jun 21 14:11 tpch-gen
-rwxr-xr-x. 1 root root    2448 Jun 21 14:11 tpch-setup.sh

In addition to Hadoop and Hive, before you begin ensure gcc is installed on VM

sudo yum group install "Development Tools"
sudo yum install man-pages

[root@hive ~]# gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Decide which test suite(s) you want to use
“hive-testbench comes with data generators and sample queries based on both the TPC-DS and TPC-H benchmarks”

Compile and package the appropriate data generator (TPC-DS or TPC-H)
For TPC-DS, run ./tpcds-build.sh - downloads, compiles and packages the TPC-DS data generator. 
For TPC-H, run ./tpch-build.sh - downloads, compiles and packages the TPC-H data generator

Generate and load the data
Example:
Build 1 TB of TPC-DS data: ./tpcds-setup.sh 1000
Build 1 TB of TPC-H data: ./tpch-setup.sh 1000
Sample tests:
[root@hive hive-testbench-hdp3]# ./tpcds-setup.sh 10
2019-10-03 04:38:21,137 INFO mapreduce.Job: The url to track the job: http://hive:8088/proxy/application_1570081767440_0002/
2019-10-03 04:38:21,138 INFO mapreduce.Job: Running job: job_1570081767440_0002
2019-10-03 04:38:27,255 INFO mapreduce.Job: Job job_1570081767440_0002 running in uber mode : false
2019-10-03 04:38:27,256 INFO mapreduce.Job:  map 0% reduce 0%
2019-10-03 04:42:18,571 INFO mapreduce.Job:  map 10% reduce 0%
2019-10-03 04:42:21,595 INFO mapreduce.Job:  map 20% reduce 0%
2019-10-03 04:42:28,713 INFO mapreduce.Job:  map 30% reduce 0%
2019-10-03 04:42:34,839 INFO mapreduce.Job:  map 40% reduce 0%
2019-10-03 04:42:39,900 INFO mapreduce.Job:  map 50% reduce 0%
2019-10-03 04:42:46,952 INFO mapreduce.Job:  map 60% reduce 0%
2019-10-03 04:45:56,411 INFO mapreduce.Job:  map 70% reduce 0%
2019-10-03 04:46:10,552 INFO mapreduce.Job:  map 80% reduce 0%
2019-10-03 04:46:15,568 INFO mapreduce.Job:  map 90% reduce 0%
2019-10-03 04:50:25,305 INFO mapreduce.Job:  map 100% reduce 0%
2019-10-03 04:50:26,311 INFO mapreduce.Job: Job job_1570081767440_0002 completed successfully
2019-10-03 04:50:26,408 INFO mapreduce.Job: Counters: 34
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=2278570
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=4349
                HDFS: Number of bytes written=12194709083
                HDFS: Number of read operations=297
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=183
                HDFS: Number of bytes read erasure-coded=0
        Job Counters
                Killed map tasks=5
                Launched map tasks=14
                Other local map tasks=14
                Total time spent by all maps in occupied slots (ms)=3202768
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=3202768
                Total vcore-milliseconds taken by all map tasks=3202768
                Total megabyte-milliseconds taken by all map tasks=3279634432
        Map-Reduce Framework
                Map input records=10
                Map output records=0
                Input split bytes=1030
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=12320
                CPU time spent (ms)=366120
                Physical memory (bytes) snapshot=4094967808
                Virtual memory (bytes) snapshot=28064542720
                Total committed heap usage (bytes)=3325034496
                Peak Map Physical memory (bytes)=465870848
                Peak Map Virtual memory (bytes)=2856607744
        File Input Format Counters
                Bytes Read=3319
        File Output Format Counters
                Bytes Written=0
TPC-DS text data generation complete.
Loading text data into external tables.
Optimizing table date_dim (1/24).
Optimizing table catalog_returns (23/24).
Optimizing table inventory (24/24).
Loading constraints
Data loaded into database tpcds_bin_partitioned_orc_10.

References:

https://github.com/hortonworks/hive-testbench
https://github.com/abzetdin-adamov/install-hive-on-centos-7

Leave a Reply