Thursday, November 8, 2018

Oracle analytics - Different pricing schemes

OBIEE on-premise licensing:

Component-based licensing: Pricing is based on the user base or on volume (COGS/revenue). It is a buy-as-you-go model priced according to the price list.

Enterprise license agreement: Oracle sets the price list based on the customer's overall employee base and revenue. It does not consider whether only a part of the total employee strength actually uses the product. Once the licensed metric is exceeded, additional fees are charged in increments of 10%.

Unlimited license agreement: An unlimited license allows use of the listed products by an unlimited number of users; the agreement term, however, is typically limited to 3 years. There is a risk of overbuying in this case.

BICS: Licensed on a per-user/month basis ($150 as per the Oracle price list) with a minimum of 10 users (DVD included free)

DVCS: Licensed on a per-user/month basis ($75 as per the Oracle price list) with a minimum of 5 users (DVD included free)

DVD: Licensed per user at $1250 as per the Oracle price list (if only a DVD license is bought; otherwise it comes free with a specified user base)


OAC Licensing:

The licensing is on an OCPU-per-hour basis. The cost differs depending on whether you bring your own license (BYOL) or buy a new license. OAC comes in three editions:

Standard
Data Lake
Enterprise

OAC also needs Oracle Cloud database and storage licenses to set up the service. OAC upgrades are controlled by the customer and not Oracle, while BICS upgrades are Oracle managed.




Saturday, August 18, 2018

Connecting OAC or OBIEE 12c with Oracle autonomous data warehouse cloud

Oracle is picking up the pace in providing new and enhanced cloud services. The new Oracle Autonomous Data Warehouse Cloud (ADWC) service gives a cloud-based data store that is highly performant, highly available, and able to tune itself automatically to give the best performance for BI and analytics applications. Built-in machine learning capabilities also let customers explore the AI and data science angle on top of the data already stored in the data warehouse.

As many Oracle customers are already using OBIEE or OAC, they would like to migrate and connect OBIEE with Oracle ADWC. However, since ADWC supports Oracle client 12c and above, connecting through the OCI client becomes difficult because the normal OBIEE administration tool supports OCI 10g/11g clients only.

Below are the steps to connect OBIEE or OAC with Oracle ADWC.

  1. Download the client credentials from the ADWC console. This is a zip file containing the wallet credentials for the ADWC database.
  2. Download the Oracle 12c client. ADWC supports Oracle client 12c and above.
  3. Copy tnsnames.ora and sqlnet.ora from the zip file downloaded from the ADWC console into the Oracle client's network/admin folder. This overrides sqlnet.ora so it can state the wallet location on the local system where the Oracle client is installed.
  4. Create a wallet folder inside the Oracle client folder and copy into it all the files from the client credentials zip.
  5. Edit the sqlnet.ora copied in step 3 and point the wallet location to the folder created in step 4 (a sample sqlnet.ora is shown after these steps).
  6. Download the OAC admin tool. The normal OBIEE admin tool is not supported.
  7. Open the RPD and point the connection pool to the TNS alias.
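For reference, a minimal sqlnet.ora after step 5 typically looks like the sketch below; the wallet directory path is an illustrative assumption and will differ on your machine.

# Sample sqlnet.ora - the DIRECTORY path below is a placeholder
WALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY = "C:\oracle\client\network\admin\wallet")))
SSL_SERVER_DN_MATCH = yes

The tnsnames.ora shipped in the credentials zip already contains the service aliases for the database (for example high/medium/low variants); the alias you pick is the one to reference in the connection pool in step 7.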



That's it for now. Have a good day.

Thursday, June 28, 2018

Amazon Redshift as a data warehouse

Redshift is a columnar database provided by Amazon AWS as a cloud database service. Redshift uses a cluster architecture with a leader node responsible for cluster and node management and query coordination, and compute nodes for data storage and computation. Redshift is based on the PostgreSQL database, though with some relaxed constraints and rules, so standard PostgreSQL drivers can talk to it.
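Because Redshift speaks the PostgreSQL protocol, a quick way to test connectivity from Python is a plain psycopg2 session. This is only a sketch; the endpoint, database name and credentials below are placeholders.

import psycopg2

# Placeholder endpoint and credentials - replace with your cluster's values
conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="********")

cur = conn.cursor()
cur.execute("select current_database(), version();")
print(cur.fetchone())

cur.close()
conn.close()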

A Redshift cluster is created by specifying the number of compute nodes required, and customers are charged only for compute nodes; leader node usage is completely free. A sketch of creating a cluster programmatically is given below, followed by the physical and logical architecture diagrams.
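As a minimal sketch, a cluster can also be provisioned from Python with boto3; the cluster identifier, node type, and credentials used here are illustrative assumptions only.

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# All names and credentials below are placeholders
response = redshift.create_cluster(
    ClusterIdentifier="demo-cluster",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=2,
    DBName="dev",
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe123!")

print(response["Cluster"]["ClusterStatus"])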





[Figure: Redshift physical architecture]

[Figure: Redshift logical architecture]


When a user logs in and runs queries in a database, search_path is used to resolve which schema an unqualified object belongs to. To add a schema to search_path, use the SET command.

set search_path to '$user', pg_temp, pg_catalog, ...;
------------------
pg_temp and pg_catalog are searched first by default, in that order, and then any schemas specified in search_path are scanned.

To set a timeout for queries in the session, use statement_timeout.

set statement_timeout to 60000;
----------------------
This specifies the timeout, in milliseconds, for any statement executed on the cluster (60000 above is just an example value of one minute); 0 turns the timeout off.

To set the timezone of a session, use timezone.

SET timezone TO 'America/New_York';
-----------------------
This sets the timezone of the current session. Run "select pg_timezone_names();" to view the list of valid timezones. To set the timezone for a database user, use
"ALTER USER <username> SET timezone to 'America/New_York';"


WLM: Workload management
You can use workload management (WLM) to define multiple query queues and to route queries to the appropriate queues at runtime.

When you have multiple sessions or users running queries at the same time, some queries might consume cluster resources for long periods of time and affect the performance of other queries. For example, suppose one group of users submits occasional complex, long-running queries that select and sort rows from several large tables. Another group frequently submits short queries that select only a few rows from one or two tables and run in a few seconds. In this situation, the short-running queries might have to wait in a queue for a long-running query to complete.

You can improve system performance and your users’ experience by modifying your WLM configuration to create separate queues for the long-running queries and the short-running queries. At run time, you can route queries to these queues according to user groups or query groups.

You can configure up to eight query queues and set the number of queries that can run in each of those queues concurrently. You can set up rules to route queries to particular queues based on the user running the query or labels that you specify. You can also configure the amount of memory allocated to each queue, so that large queries run in queues with more memory than other queues. You can also configure the WLM timeout property to limit long-running queries.
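WLM queues live in the cluster's parameter group as the wlm_json_configuration parameter. Below is a minimal sketch of updating it from Python with boto3; the parameter group name and the two queue definitions are illustrative assumptions, not a recommended configuration.

import boto3
import json

# Two example queues: one for a "long_running" query group, one default queue.
# The parameter group name "my-redshift-params" is a placeholder.
wlm_config = [
    {"query_group": ["long_running"], "query_concurrency": 2, "memory_percent_to_use": 60},
    {"query_concurrency": 5, "memory_percent_to_use": 40}
]

redshift = boto3.client("redshift", region_name="us-east-1")
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-redshift-params",
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config)
    }])

Depending on which WLM properties change, the cluster may need a reboot for the new configuration to take effect.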

SET query_group TO group_label;
------------------------------
This sets the query group for the session; WLM uses it to route the session's queries to the queue associated with that query group. Queries can also be routed by user group, which can be created as follows:
create group admin_group with user admin246, admin135, sec555;










Wednesday, June 20, 2018

Pandas: Read encrypted zip file to data frames

This is easy....


#!/usr/bin/env python
import pandas as pd
import zipfile


# Open the password-protected zip and read each file inside it into a DataFrame
zf = zipfile.ZipFile("./kivaData_augmented.zip")
txtfiles = zf.infolist()
for txtfile in txtfiles:
    print(txtfile.filename)
    # pwd must be passed as bytes
    df = pd.read_csv(zf.open(txtfile, "r", pwd="1234".encode('cp850', 'replace')))
    print(df['term_in_months'])

Pandas: Python for data analysis

Pandas: The pandas library in Python is used for all kinds of data analysis. Pandas is highly efficient where label-oriented analysis is required. A major advantage of using Python and pandas is that you do not need a relational database to first load the data and then do the analysis with SQL. Also, with the multitude of data formats around (csv, xlsx, zip, pdf, xml, json, etc.), you need a tool that supports reading data from all of them, and as of now no single tool supports everything. Python, being open source, can be used freely by any organisation to create a data reading framework that reads data from files (structured and unstructured), does some data cleansing on the go, and carries out data analysis using pandas with labelled data columns.
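As a small illustration, pandas ships with readers for several of these formats out of the box (the file names below are placeholders):

import pandas as pd

df_csv = pd.read_csv("data.csv")        # delimited text
df_xlsx = pd.read_excel("data.xlsx")    # Excel (needs an engine such as openpyxl)
df_json = pd.read_json("data.json")     # JSON

# zip archives can be handled with the standard zipfile module, as shown in the snippet above.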


Pandas' support for two main data types, i.e. Series and DataFrames, makes it very flexible and efficient in data analysis.


Series: A Series is similar to an ndarray with the added advantage of having labels for the items in the array. Any compute operation between Series aligns the values based on their labels, which is not available with a plain ndarray.

             sr = pd.Series([1,2,3,4],index=['col1','col2','col3','col4'])
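For instance, a minimal sketch of label alignment between two Series (the values and labels here are made up):

             s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
             s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
             s1 + s2   # a -> NaN, b -> 12.0, c -> 23.0, d -> NaN  (matched by label, not position)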

DataFrames: A DataFrame in pandas acts like a SQL table with columns and index labels. An added advantage of DataFrames is that you can pass index labels as well as column labels. This helps in fetching specific data values for analysis based on the row index label and the column names (column labels).

df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),
   'two':pd.Series([4,5,6,7],index=['a','b','c','d'])})


An advantage of data analysis with DataFrames and Series is that element-wise operations can be carried out on every item of the DataFrame or Series in one go, using operators directly on the object itself.

For example (with numpy imported as np),

df1 = pd.DataFrame([[1, 2, 3, np.nan], [4, 5, 6, 7]])
df2 = pd.DataFrame([[8, 9, 10, np.nan], [11, 12, 13, 14]])

df3 = df1 + df2

which will give

df3 = [[9, 11, 13, nan], [15, 17, 19, 21]]

This is similar to operator overloading in C++.

Also some of the basic operations in dataframes are as follows:

Operation                        Syntax          Result
Select column                    df[col]         Series
Select row by label              df.loc[label]   Series
Select row by integer location   df.iloc[loc]    Series
Slice rows                       df[5:10]        DataFrame
Select rows by boolean vector    df[bool_vec]    DataFrame
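
A short runnable illustration of these operations (the small frame here is just for demonstration):

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])

df['one']           # Series - select a column
df.loc['b']         # Series - select a row by label
df.iloc[1]          # Series - select a row by integer position
df[0:2]             # DataFrame - slice rows
df[df['one'] > 1]   # DataFrame - select rows by boolean vector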

Every DataFrame object also exposes a transposed view of the same data, which can be accessed using the df.T attribute.

See you soon!!

Sunday, February 18, 2018

Download OAC Admin tool (Cloud BI)

Oracle has now made available a cloud version of the admin tool (Oracle BI client) which can be installed locally and used to open an RPD in cloud analytics (OAC).

Below is the link to download OAC Client tool (Cloud Client)

http://www.oracle.com/technetwork/middleware/bicloud/downloads/bi-client-tools-3959521.html

Oracle Analytics Cloud (OAC) - Open RPD in cloud using OBIEE Admintool

The new cloud version of the OBIEE client, aka admin tool, has an option, "In the Cloud", to open an RPD deployed in Oracle Analytics Cloud directly.


User: Oracle cloud admin user (Used to login to Oracle cloud)
Password: Cloud admin user password
Cloud: bootstrap (default)
Host Name: Public IP address of OAC service deployed. You can obtain it from OAC service console.
Port Number: 9704
SSL: Uncheck


Once RPD is open, work on it as usual


Good Luck!

Sunday, February 11, 2018

Oracle Data Visualization (ODV)

Oracle Data visualization:


Oracle released its data visualization capabilities some time back to compete with the myriad of visualization tools in the market, such as Tableau. Within its coveted business intelligence and analytics tool (OBIEE) the feature comes under the name Visual Analyzer (VA, to be licensed separately); it is also available as Data Visualization Desktop and, in Oracle Cloud, as Data Visualization Cloud Service, part of Oracle Analytics Cloud.

All of these are licensed products details of which can be found here.

https://blogs.oracle.com/emeapartnerbiepm/oracle-data-visualisation-license-pricing

The desktop version of Oracle DV does not require OBIEE or a database to be installed to get started with data visualisation and exploration activities. Once installed, it works like any other desktop application.

Here we will go through the step by step process of creating a project and get going with creating visualization.

Creating a sample project in Oracle DV:

1. The first step is to create a project in Oracle Data Visualization. To do this, go to the home screen in DVD and click on Create > Project.

2. If this is the first time, then create a data set or pick the data set already imported.

3. To create a data set from scratch you have to create a connection. A connection can be either directly through a file or to any data source. Here I have created a connection to Oracle Database Cloud Service (DBaaS).

4. After creating the connection, you can import the data set with the columns needed.


One can also add data set from within the project by clicking on the + sign at the top left corner of the project data elements.


5. Once the data sets are added, and in case multiple data sets (tables) are added, you can check whether the joins are reflected properly by clicking on the hamburger icon at the top left corner.



6. Once done, click on Visualize at the top right, and then drag and drop the attributes and measures that need to be visualized. Oracle DV will automatically pick the visualization that best suits the data added to the palette.

7. However the visualization can be changed to whatever is needed from the first option in the left side toolbar.

8. Lastly, click on Narrate, just beside the Visualize option, and add the canvases created in the project with visualizations (in our case just the one shown above). Users can add commentary to the canvas to make it clearer and then save the project.

9. The final project can be exported as a file (e.g. as a PPT), shared over email as a PPT, or uploaded to cloud storage.


And that's it. 





Wednesday, September 6, 2017

Text mining in python - Finding most important words from a bag of words in a document

Hi there,

You have been working in the data mining and analytics area for some time now and want to know how to automatically get meaningful information from a bag of words in a document written in natural, human-readable English. If this is something that excites your mind, then go ahead and read on.

Text mining is the process of extracting meaningful information from text written in natural English, automatically and programmatically, so as to automate information retrieval and generate analytics from this unstructured data. With new ways being devised daily to find information of relevance for decisive analytics, text mining is growing in prominence. To do text mining, one must be able to write code that can "read and understand" human language. The text can be read from any source, be it a website, a social network profile or a word document.

The above statement gives us two problem areas to solve.

1. Read data
2. Understand data

For the first part, reading data is easy, as pretty much any programming language gives you a way to read data from a source. For the sake of this article I will concentrate on Python and its features.

To read data in Python, you open a file handle and read through it. So consider that we have a text file "ML.txt" with some information on machine learning stored on the machine. To read from the file, you open it in Python and read its contents.

"""Sample code for the same goes here."""

""" This file contains text from a website"""

with open("ML.txt") as f:
    data = f.read();

Now the data variable in Python holds the content of the text file ML.txt, so we have solved the first step of reading data. To understand the data, we need to split the bag of words into separate words and then remove unnecessary words from the bunch we have got. From what remains, we check the most frequently occurring words and derive information from them.

Let's try to note this down as points of what needs to be done.

1. Tokenize - Break the stream of strings or bag of words into separate words.
2. Remove stop words - Remove unnecessary words out of the list of words obtained from step above.
3. Remove special characters which do not convey any meaning.
4. Get the frequency distribution for the list of remaining words to get most occurring words.

So the python code for the above goes like this.

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer

""" This file contains text from a website"""

with open("ML.txt") as f:
    data = f.read()

# Removing special characters - keep only letters and digits
reg_pattern = r'[^0-9a-zA-Z]'
new_data = re.sub(reg_pattern, ' ', data)

# Tokenize

word_tokens = WordPunctTokenizer().tokenize(new_data)

# Removing stop words (case-insensitive match against the English stop word list)

nostop_word_tokens = [x for x in word_tokens if x.lower() not in stopwords.words('english')]

# Get frequency distribution

fq = nltk.FreqDist(nostop_word_tokens)

print(fq.most_common(5))
If we show what we discussed above as a flow, it would look something like the block diagram below.

[Block diagram: read data -> remove special characters -> tokenize -> remove stop words -> frequency distribution]
So now you know how text mining and natural language processing are done. We will see more on NLP in upcoming articles. Till then, have a nice day!
