Leveraging Hadoop’s Strengths with SAP HANA & SAP Lumira

Posted on September 26, 2017 under Blog

By:
Arman Avetisyan
Sr. SAP Basis/BI & Software Developer
aavetisyan@comerit.com

SAP HANA is still very expensive for big data, so many organizations are trying to leverage Hadoop in their landscape, because it runs on commodity hardware and can store huge volumes of data.

Rather than simply archiving HANA’s historical data, we can use a multi-node Hadoop cluster to store historical data, analyze it, build applications, perform machine learning and more.

In this tutorial, we will look at the differences between Hadoop and SAP HANA and at how to leverage the strengths of both platforms.

We will learn how to move data from SAP HANA to Hadoop and perform data visualizations with SAP Lumira.

Finally, we will review how to use Apache Spark (PySpark) to create applications using the data located in Hadoop.

Overview
  • What is Hadoop?
  • What is Hive & Sqoop?
  • Hadoop vs SAP HANA
  • Hadoop with SAP HANA & SAP Lumira
  • Uploading and transforming data from SAP HANA to a Multi-node Hadoop cluster
  • Importing Data From SAP HANA to Hadoop HDFS using Sqoop
  • Hive as metastore DB for HDFS -> create structured data from unstructured data located in HDFS
  • Moving data from HDFS to Hive
  • Connecting SAP Lumira to Hive for Big Data Analysis
  • Connecting PySpark to Hive for Application development, Analysis, Machine Learning and more.
What is Hadoop?

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. An application built on the Hadoop framework runs in an environment that provides distributed storage and computation across those clusters. Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and storage.

What is Hive & Sqoop?

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority.

Sqoop is a tool designed to transfer data between Hadoop and relational database servers, including SAP HANA. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.

Hadoop vs SAP HANA

[Image: hadoop_vs_hana]

Using Hadoop with SAP HANA & SAP Lumira
  • Import data from HANA to Hadoop HDFS using Sqoop.
  • Connect SAP Lumira to Hadoop for data visualization using Hive.
  • Use PySpark & Hive to create applications on Hadoop.

[Image: hadoop_and_hana]

Importing Data From HANA to HDFS using Sqoop

sqoop import --username <USERNAME> --password <PASSWORD> --connect jdbc:sap://<host address>:<port>/?currentschema=<SCHEMA_NAME> --driver com.sap.db.jdbc.Driver --table <TABLE_NAME> --split-by <Column Name>

[Image: schema_and_tablename]
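When the import has to be repeated for several tables, it can help to assemble the Sqoop arguments in a script instead of typing them by hand. The following is a minimal sketch, not part of the original tutorial: the function name `build_sqoop_import` and all sample values (host, port, user, schema, table, split column) are made up for illustration, and only the Sqoop flags shown in the command above are used.

```python
import shlex


def build_sqoop_import(username, password, host, port, schema, table, split_by):
    """Return the argument list for a `sqoop import` from SAP HANA.

    Uses only the flags shown in the tutorial's command line.
    """
    jdbc_url = f"jdbc:sap://{host}:{port}/?currentschema={schema}"
    return [
        "sqoop", "import",
        "--username", username,
        "--password", password,
        "--connect", jdbc_url,
        "--driver", "com.sap.db.jdbc.Driver",
        "--table", table,
        "--split-by", split_by,
    ]


# Hypothetical example values; replace them with your own landscape details.
args = build_sqoop_import(
    username="DEMO_USER",
    password="secret",
    host="hana.example.com",
    port=30015,
    schema="COMERIT_DEMO",
    table="CUSTOMER",
    split_by="CUSTOMER_NUMBER",
)

# Print a shell-safe version of the command for review before running it.
print(shlex.join(args))
```

The list can be handed to `subprocess.run(args, check=True)` on a machine where Sqoop and the SAP HANA JDBC driver are installed. Passing the arguments as a list avoids shell quoting problems with the `?` in the JDBC URL.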

Check to see if the data was uploaded to HDFS

hadoop fs -ls /user/hduser/

Hive as metastore DB for HDFS -> create structured data from unstructured data located in HDFS
  • Open Hive
  • Create table
  • Move Data from HDFS to Hive
Creating a Table in Hive

Execute the command below in Hive to create the “customer” table.

CREATE TABLE customer (
  customer_number INT,
  customer_name STRING,
  city STRING,
  valid_to STRING,
  sales_organization STRING,
  country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
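To make the effect of the ROW FORMAT clause concrete, the sketch below mimics in plain Python how one comma-delimited, newline-terminated line from HDFS maps onto the six columns of the “customer” table. It is an illustration only: the sample line is invented, and real Hive is more lenient than this code (it yields NULL for values it cannot convert instead of raising an error).

```python
from typing import NamedTuple


class Customer(NamedTuple):
    """Mirrors the columns of the Hive `customer` table."""
    customer_number: int
    customer_name: str
    city: str
    valid_to: str
    sales_organization: str
    country: str


def parse_customer_line(line: str) -> Customer:
    """Split one delimited text line into the six table columns."""
    # LINES TERMINATED BY '\n': drop the trailing newline.
    # FIELDS TERMINATED BY ',': split the record on commas.
    fields = line.rstrip("\n").split(",")
    if len(fields) != 6:
        # Stricter than Hive, which would not fail here.
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return Customer(int(fields[0]), *fields[1:])


# Invented sample record, not taken from the tutorial's data set.
sample = "1001,Acme Corp,Los Angeles,99991231,1000,US\n"
row = parse_customer_line(sample)
print(row)
```

Note that a simple split like this breaks if a field value itself contains a comma, which is also a limitation of a plain delimited Hive table.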

Moving data from HDFS to Hive

LOAD DATA INPATH '<filepath>' INTO TABLE <tablename>;

Example:

LOAD DATA INPATH '/user/hduser/COMERIT_DEMO/customer/part*' INTO TABLE customer;

Run Hive Server

Start HiveServer2 so that the Hive tables can be accessed by clients running outside the Hadoop host.

$HIVE_HOME/bin/hive --service hiveserver2
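Once HiveServer2 is running, JDBC-based clients address it with a URL of the form `jdbc:hive2://<host>:<port>/<database>`, where the port is 10000 unless it has been changed in the Hive configuration. The helper below is a small sketch for building that URL and checking that the port is reachable from a client machine. The function names and the host name are hypothetical and not part of the original tutorial.

```python
import socket


def hive_jdbc_url(host: str, port: int = 10000, database: str = "default") -> str:
    """Build a HiveServer2 JDBC URL (10000 is the default port)."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def is_reachable(host: str, port: int = 10000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


# Hypothetical host name; use the address of your HiveServer2 machine.
url = hive_jdbc_url("hadoop-master.example.com")
print(url)
```

Calling `is_reachable("hadoop-master.example.com")` from the client machine is a quick way to rule out firewall or network issues before configuring a reporting tool.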

Connect SAP Lumira to Hive Server for Big Data Analysis

[Image: hive_and_lumira]

Connect Spark to Hive for Application development, Analysis, Machine Learning and more (with Python)

This is a small console application built in Python using PySpark.

The app takes a “Product ID” as input and returns similar products from the same category that have a higher “Sold Quantity” count.
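The selection rule described above can be sketched in a few lines. This is not the author's source code: to stay runnable without a cluster it uses plain Python over in-memory rows instead of PySpark, and the column names `product_id`, `category` and `sold_quantity`, as well as the sample data, are assumptions made for illustration.

```python
def similar_products(rows, product_id):
    """Return products in the same category as `product_id`
    that have a strictly higher sold quantity, best sellers first."""
    target = next((r for r in rows if r["product_id"] == product_id), None)
    if target is None:
        return []  # unknown product ID: nothing to recommend
    matches = [
        r for r in rows
        if r["category"] == target["category"]
        and r["sold_quantity"] > target["sold_quantity"]
    ]
    return sorted(matches, key=lambda r: r["sold_quantity"], reverse=True)


# Invented sample rows standing in for a Hive table.
products = [
    {"product_id": 1, "category": "laptop", "sold_quantity": 40},
    {"product_id": 2, "category": "laptop", "sold_quantity": 75},
    {"product_id": 3, "category": "laptop", "sold_quantity": 10},
    {"product_id": 4, "category": "phone", "sold_quantity": 90},
    {"product_id": 5, "category": "laptop", "sold_quantity": 55},
]

result = similar_products(products, 1)
print([r["product_id"] for r in result])
```

In a PySpark version, the same rule would typically be expressed as a filter on a DataFrame read from the Hive table, so that the comparison runs on the cluster rather than on the client.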

Download the source code
