Sr. SAP Basis/BI & Software Developer
SAP HANA is still very expensive for big data workloads, so many organizations are trying to leverage Hadoop in their landscape, since it runs on commodity hardware and can store huge volumes of data.
Rather than simply archiving HANA’s historical data, we can use a multi-node Hadoop cluster to store that data, analyze it, build applications, perform machine learning, and more.
In this tutorial, we are going to see what the differences between Hadoop and SAP HANA are, and how to leverage the strengths of both the Apache Hadoop and SAP HANA platforms.
We will learn how to move data from SAP HANA to Hadoop and perform data visualizations with SAP Lumira.
Finally, we will review how to use Apache Spark (PySpark) to create applications using the data located in Hadoop.
Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority.
Sqoop is a tool designed to transfer data between Hadoop and relational database servers such as SAP HANA. It is used to import data from relational databases such as MySQL, Oracle, or SAP HANA into Hadoop HDFS, and to export data from the Hadoop file system back into relational databases.
sqoop import --username <USERNAME> --password <PASSWORD> --connect jdbc:sap://<HOST>:<PORT>/?currentschema=<SCHEMA_NAME> --driver com.sap.db.jdbc.Driver --table <TABLE_NAME> --split-by <COLUMN_NAME>

The --split-by column tells Sqoop how to partition the table across its parallel mapper tasks, so pick a column whose values are evenly distributed, such as a numeric primary key.
Check whether the data was uploaded to HDFS (Sqoop writes the import as part-m-* files under the target directory):
hadoop fs -ls /user/hduser/
Execute the command below in Hive to create the “customer” table. The column list here is an example; use the columns of your actual HANA table:
CREATE TABLE customer (
  customer_id INT,
  name STRING,
  category STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
LOAD DATA INPATH '<filepath>' INTO TABLE <tablename>;
LOAD DATA INPATH '/user/hduser/COMERIT_DEMO/customer/part*' INTO TABLE customer;
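As a quick sanity check, we can count the loaded rows from PySpark. This is a minimal sketch, assuming Spark was built with Hive support and can reach the Hive metastore from this machine:

from pyspark.sql import SparkSession

# Assumes Spark is built with Hive support and the Hive metastore
# is reachable from this machine.
spark = (SparkSession.builder
         .appName("hive-load-check")
         .enableHiveSupport()
         .getOrCreate())

# Count the rows that Sqoop imported and we loaded into Hive.
spark.sql("SELECT COUNT(*) AS row_count FROM customer").show()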
Run HiveServer2 so the Hive tables can be accessed from outside the machine:
$HIVE_HOME/bin/hive --service hiveserver2
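With HiveServer2 running, any HiveServer2-compatible client can query the tables remotely. Below is a minimal sketch using the third-party PyHive package (my choice of client, not the author’s); the host name and username are placeholders, and 10000 is HiveServer2’s default port:

from pyhive import hive  # third-party package: pip install pyhive

# Connect to HiveServer2; host, port, and username are placeholders.
conn = hive.Connection(host="hadoop-master", port=10000, username="hduser")
cursor = conn.cursor()

# Read a few rows from the customer table loaded earlier.
cursor.execute("SELECT * FROM customer LIMIT 10")
for row in cursor.fetchall():
    print(row)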
This is a small console application built in Python using PySpark.
The app takes a product ID as input and returns similar products from the same category that have a higher “Sold Quantity” count.
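The sketch below shows how such an app could look. It is not the author’s exact code: the Hive table name (products) and column names (product_id, category, sold_quantity) are assumptions, so adjust them to your actual schema:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("similar-products")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical Hive table and columns; rename to match your schema.
products = spark.table("products")

product_id = input("Product ID: ").strip()

# Look up the category and sold quantity of the requested product.
target = products.filter(F.col("product_id") == product_id).first()

if target is None:
    print("Product %s not found." % product_id)
else:
    # Similar products: same category, higher sold-quantity count.
    (products
     .filter(F.col("category") == target["category"])
     .filter(F.col("sold_quantity") > target["sold_quantity"])
     .orderBy(F.col("sold_quantity").desc())
     .show(10, truncate=False))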