In this blog we’ll develop Cognos BI reports using BigInsights (IBM’s Hadoop distribution) alongside warehouse data sources. “Data warehouse augmentation” is a big data use case of great importance to the traditional analytics industry (visit http://www.ibm.com/developerworks/library/ba-augment-data-warehouse1/index.html to learn more). To explore and implement a big data project, you can augment an existing data warehouse environment by introducing one or more of the use cases below, as the business requires.
- Use case 1: Use the Hadoop environment as a landing zone to pull in data from various sources, process it, and transfer the processed data to the existing data warehouse or other repositories. For details please visit http://www.ibm.com/developerworks/library/ba-augment-data-warehouse2/index.html
- Use case 2: Use the Hadoop environment to set up an active (queryable) archive for data from the enterprise data warehouse. For details please visit http://www.ibm.com/developerworks/library/ba-augment-data-warehouse3/index.html
- Use case 3: Use the Hadoop environment to incrementally augment the existing data warehouse and gain better business insight from unstructured data that was previously unavailable for analysis due to the limitations of traditional data warehouse technology. For details please visit http://www.ibm.com/developerworks/library/ba-augment-data-warehouse4/index.html
This blog applies directly to use case 3, though Big SQL can be used effectively in all three. My previous blogs discussed Cognos BI in detail; if you are new to it, you can find more here - http://www.ibm.com/software/products/en/business-intelligence. Below is a brief description of BigInsights and Big SQL before we start the integration steps.
IBM InfoSphere BigInsights (http://www.ibm.com/software/data/infosphere/biginsights/) combines Apache Hadoop (including the MapReduce framework and the Hadoop Distributed File System) with unique, enterprise-ready technologies and capabilities from across IBM, including Big SQL, built-in analytics, visualization, BigSheets, and security. InfoSphere BigInsights is a single platform to manage all of the data, and it offers many benefits:
- Provides flexible, enterprise-class support for processing large volumes of data by using streams and MapReduce
- Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
- Applies advanced analytics to information in its native form to enable ad hoc analysis
- Integrates with enterprise software
Big SQL (http://www.ibm.com/developerworks/library/bd-bigsql/) provides SQL access to data stored in InfoSphere BigInsights by using JDBC, ODBC, and other connections. Big SQL supports large ad hoc queries by using IBM SQL/PL support, SQL stored procedures, SQL functions, and IBM Data Server drivers. These are low-latency queries that return information quickly to reduce response time and improve access to data. Big SQL offers unmatched simplicity, performance and security for SQL on Hadoop, providing a single point of access and view across all big data, exactly where it lives.
OK, with this background we are ready to start the implementation tasks using Big SQL. In case you are interested in Hive-based work, please refer to http://www.ibm.com/developerworks/library/ba-cognosbi10-infospherebiginsights/index.html.
We’ll complete 3 tasks here –
1) Set up the environment with BigInsights 3.0, DB2 10.5 Warehouse (BLU) and Cognos BI 10.2.1 FP3.
2) Prepare data sources: create tables and load data in the warehouse and Hadoop environments.
3) Create Cognos data sources, a meta-data model and a sample report.
Task 1 - Setting up the environment with BigInsights 3.0, DB2 10.5 (BLU) and Cognos BI 10.2.1 FP3
In my case, all the software below is installed on Red Hat Enterprise Linux 6.3. In your case it can all be on different machines as well.
- For the Cognos BI 10.2.1 setup you can either download the free developer edition for Windows from the IBM website (http://www.ibm.com/developerworks/downloads/im/cognosbi/) or use the installation steps given in my previous blog (http://vmanoria.blogspot.in/2014/08/ibm-cognos-bi-installation.html) if you have the software for Linux.
- If you don’t have a licensed version of DB2 10.5, please download and install the DB2 10.5 Express edition (http://www-01.ibm.com/software/data/db2/express-c/download.html). Installation steps for Windows are shown here: https://www.youtube.com/watch?v=2AtSEHC6iAQ
- For the BigInsights 3.0 setup you can either download the free QuickStart edition images from the IBM website (http://www.ibm.com/developerworks/downloads/im/biginsightsquick/) or use the installation steps given in my previous blog (http://vmanoria.blogspot.in/2014/08/infosphere-biginsight-30-installation.html) if you have the software. If you are not using the images, you also need to follow the step below.
- Copy the Big SQL JDBC driver into the Cognos library folder and restart the Cognos BI services:
cp /opt/ibm/biginsights/bigsql/bigsql1/jdbc/bigsql-jdbc-driver.jar /opt/ibm/cognos/c10_64/webapps/p2pd/WEB-INF/lib/
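Before restarting the services, it is worth confirming the driver actually landed where Cognos will look for it. A minimal check along these lines (the paths are the ones from this install; adjust them for yours):

```shell
# Confirm the Big SQL JDBC driver is in the Cognos lib folder
# before restarting services. Paths match the install above.
JAR=/opt/ibm/cognos/c10_64/webapps/p2pd/WEB-INF/lib/bigsql-jdbc-driver.jar
if [ -f "$JAR" ]; then
  echo "driver in place: $JAR"
else
  echo "driver missing: $JAR"
fi
```

If the jar is missing after the restart, Cognos will fail to create the Big SQL connection later, so catching it here saves a debugging round-trip.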
Task 2 - Prepare data sources. Create tables and load data in the warehouse and Hadoop environments.
To keep things simple we are going to work with three tables: 1) Student, 2) Student_Details and 3) Student_Facts. The first two tables are created in the DB2 environment; the third is created in the BigInsights HDFS environment using Big SQL. After that, we'll also create a Student_Details table in the HDFS environment and load its data from the DB2 database using the JDBC driver.
In DB2 BLU, let's create the tables 1) Student and 2) Student_Details and load data from the CSV files. The commands below are run in the RHEL shell.
[root@scekvm1 sample]# su db2inst1
[db2inst1@scekvm1 sample]$ ls
ER.jpg  Exam.csv  Old  Performance.csv  QBank.csv  Student.csv  Student_Details.csv  StuFact.csv
[db2inst1@scekvm1 sample]$ db2 connect to gs_db

   Database Connection Information

 Database server        = DB2/LINUXX8664 10.5.3
 SQL authorization ID   = DB2INST1
 Local database alias   = GS_DB

[db2inst1@scekvm1 sample]$ db2 -tvf db2ddl.sql
CREATE TABLE DB2INST1.STUDENT (
  STUDENT_ID INTEGER NOT NULL,
  STUDENT_NAME VARCHAR(30) NOT NULL,
  YEAR_OF_ADMISSION INTEGER NOT NULL,
  SCHOOL VARCHAR(30) NOT NULL,
  CLASS VARCHAR(10) NOT NULL,
  SECTION VARCHAR(3) NOT NULL,
  HOSTELER VARCHAR(3) NOT NULL
)
DB20000I  The SQL command completed successfully.

CREATE TABLE DB2INST1.STUDENT_DETAILS (
  STUDENT_ID INTEGER NOT NULL,
  DOB DATE NOT NULL,
  GENDER VARCHAR(2) NOT NULL,
  HOME_CITY VARCHAR(15) NOT NULL,
  HOME_STATE VARCHAR(3) NOT NULL,
  ADMISSION_CATEGORY VARCHAR(15) NOT NULL,
  SOCIAL_CATEGORY VARCHAR(15) NOT NULL,
  SCHOOL_CATEGORY VARCHAR(15) NOT NULL,
  NATIONALITY VARCHAR(15) NOT NULL,
  RELIGION VARCHAR(15) NOT NULL
)
DB20000I  The SQL command completed successfully.
[db2inst1@scekvm1 sample]$ db2 import from Student.csv of del messages msg.txt insert into student
Number of rows read       = 1000
Number of rows skipped    = 0
Number of rows inserted   = 1000
Number of rows updated    = 0
Number of rows rejected   = 0
Number of rows committed  = 1000

[db2inst1@scekvm1 sample]$ db2 import from Student_Details.csv of del messages msg.txt insert into student_details
Number of rows read       = 1000
Number of rows skipped    = 0
Number of rows inserted   = 1000
Number of rows updated    = 0
Number of rows rejected   = 0
Number of rows committed  = 1000
SQL3107W  At least one warning message was encountered during LOAD processing.
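Before pointing `db2 import` at a delimited file, a quick structural check can save a failed load. A hedged sketch using a throwaway sample file (the real Student.csv has 1000 rows and the seven columns of the STUDENT DDL above; swap in your own file path):

```shell
# Sanity-check a comma-delimited file before import: line count and
# field count on the first row. The sample file here is a stand-in
# with the 7-column layout of the STUDENT table.
sample=/tmp/student_sample.csv
printf '1,Alice,2010,SchoolA,X,A,Yes\n2,Bob,2011,SchoolB,X,B,No\n' > "$sample"
lines=$(wc -l < "$sample" | tr -d ' ')
fields=$(head -n 1 "$sample" | awk -F',' '{print NF}')
echo "lines=$lines fields=$fields"
```

If the field count doesn't match the table's column count, the import will reject rows, which is exactly what warnings like SQL3107W above point you toward in msg.txt.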
Now in BigInsights, let us create Hadoop tables for Student_Details and Student_Facts. After that we’ll load data into Student_Details from DB2 and into Student_Facts from a CSV file. Before we start, please make sure BigInsights is running; if not, start it by running /opt/ibm/biginsights/bin/start-all.sh. There are two ways to work with Big SQL:
A) Using JSqsh. For details please refer to https://developer.ibm.com/hadoop/docs/tutorials/big-sql-hadoop-tutorial/big-sql-hadoop-lab-2-big-sql-command-line-interface/
B) Using the Eclipse environment. For details please visit https://developer.ibm.com/hadoop/docs/tutorials/big-sql-hadoop-tutorial/big-sql-hadoop-lab-3-using-eclipse/
Here we’ll use JSqsh. BigInsights supports a command-line interface for Big SQL through the Java SQL Shell (JSqsh, pronounced “jay-skwish”). JSqsh is an open source project for querying JDBC databases. You may find it handy to become familiar with basic JSqsh capabilities, particularly if you don’t expect to have access to an Eclipse environment at all times for your work.
The commands below are run in the RHEL shell.
[root@scekvm1 sample]# su biadmin
[biadmin@scekvm1 sample]$ cd /opt/ibm/biginsights/jsqsh/bin/
[biadmin@scekvm1 bin]$ ls
jsqsh  jsqsh.bat
[biadmin@scekvm1 bin]$ ./jsqsh bigsql
Password: ********
WARN [State: ][Code: 0]: Statement processing was successful.. SQLCODE=0, SQLSTATE= , DRIVER=3.67.33
JSqsh Release 2.1.2, Copyright (C) 2007-2014, Scott C. Gray
Type \help for available help topics. Using JLine.
[scekvm1.iicbang.ibm.com][biadmin] 1>
Just copy and paste the commands below at the JSqsh prompt to create the tables:
CREATE HADOOP TABLE IF NOT EXISTS STUDENT_FACTS (
STUDENT_ID INTEGER NOT NULL,
ATTENDANCE INTEGER NOT NULL,
FEE_COLLECTED INTEGER NOT NULL,
FEE_BALANCE INTEGER NOT NULL,
MARKS INTEGER NOT NULL
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;
CREATE HADOOP TABLE IF NOT EXISTS STUDENT_DETAILS (
STUDENT_ID INTEGER NOT NULL,
DOB DATE NOT NULL,
GENDER VARCHAR (2) NOT NULL,
HOME_CITY VARCHAR (15) NOT NULL,
HOME_STATE VARCHAR (3) NOT NULL,
ADMISSION_CATEGORY VARCHAR (15) NOT NULL,
SOCIAL_CATEGORY VARCHAR (15) NOT NULL,
SCHOOL_CATEGORY VARCHAR (15) NOT NULL,
NATIONALITY VARCHAR (15) NOT NULL,
RELIGION VARCHAR (15) NOT NULL
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
You'll get response like this -
0 rows affected (total: 0.49s)
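Note that both tables above declare FIELDS TERMINATED BY '\t'. The LOAD HADOOP statement in the next step handles the comma-to-tab conversion itself via SOURCE PROPERTIES, but if you ever stage files into a table's HDFS directory by hand, they must already be tab-delimited. A small sketch of that conversion (the file names are illustrative):

```shell
# Convert a comma-delimited sample into the tab-delimited layout the
# STUDENT_FACTS table declares (5 integer columns).
printf '1,85,5000,0,72\n2,90,4500,500,81\n' > /tmp/stufact_sample.csv
tr ',' '\t' < /tmp/stufact_sample.csv > /tmp/stufact_sample.tsv
fields=$(head -n 1 /tmp/stufact_sample.tsv | awk -F'\t' '{print NF}')
echo "tab-delimited fields=$fields"
```

Files staged directly must match the declared row format exactly, or the columns will read back as NULLs.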
Now let's load the data into the STUDENT_FACTS table from the CSV file. Here's the command:
LOAD HADOOP USING FILE URL
'file:///images/vmanoria/engagements/sample/StuFact.csv'
WITH SOURCE PROPERTIES ('field.delimiter'=',')
INTO TABLE STUDENT_FACTS OVERWRITE;
On successful completion, you'll get response like this -
WARN [State: ][Code: 5108]: The LOAD HADOOP statement completed. Number of rows loaded into the Hadoop table: "1000". Total number of source records: "1000". If the source is a file, number of lines skipped: "0". Number of source records that were rejected: "0". Job identifier: "job_201409242156_0009".. SQLCODE=5108, SQLSTATE= , DRIVER=3.67.33
0 rows affected (total: 21.26s)
Now let's load data into the STUDENT_DETAILS table from the DB2 database table we created earlier.
LOAD HADOOP USING JDBC CONNECTION URL
'jdbc:db2://scekvm1:50000/GS_DB'
WITH PARAMETERS (user = 'db2inst1',password='db2inst1')
FROM TABLE STUDENT_DETAILS SPLIT COLUMN STUDENT_ID
INTO TABLE STUDENT_DETAILS APPEND
WITH LOAD PROPERTIES ('num.map.tasks' = 1);
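The connection URL follows the usual DB2 pattern jdbc:db2://<host>:<port>/<database>. A quick shell sketch of how the URL used above breaks down (the host name and port are this install's, not general defaults):

```shell
# Decompose the JDBC URL used in the LOAD HADOOP statement above.
URL='jdbc:db2://scekvm1:50000/GS_DB'
rest=${URL#jdbc:db2://}      # strip the scheme -> scekvm1:50000/GS_DB
hostport=${rest%%/*}         # scekvm1:50000
db=${rest#*/}                # GS_DB
host=${hostport%%:*}
port=${hostport##*:}
echo "host=$host port=$port db=$db"
```

If the load fails to connect, checking each of these three pieces against your DB2 instance (hostname resolution, listener port, database alias) is the fastest way to isolate the problem.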
For more info on the Big SQL LOAD command please visit -
http://www-01.ibm.com/support/knowledgecenter/api/content/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.commsql.doc/doc/biga_load_from.html
Task 3 - Create Cognos data sources, meta-data model and a sample report.
Let’s quickly create two data source connections from the Cognos Administration interface: one to the DB2 database (GS_DB in our case) and another JDBC connection using the “IBM InfoSphere BigInsights (Big SQL)” connection type. Provide valid sign-on details and test each connection.
Now let’s open Framework Manager and pull in the tables from their respective sources. Create relationships between them and set the query item properties correctly. In my case, I changed the usage property of ‘Student_ID’ from ‘fact’ to ‘identifier’, since its integer data type had caused it to default to ‘fact’. Before you create and publish the package, test that the aggregate data comes out correctly. Now you can create a package and publish it to Cognos Connection.
You are then ready to create your report using Report Studio.
References:
- Cognos Business Intelligence 10.2 reporting on InfoSphere BigInsights (using Hive)
- Big data and data warehouse augmentation
- Use big data technologies as a landing zone for source data
- Use big data technology for an active archive
- Use big data technologies for initial data exploration
- What's the big deal about Big SQL?