Wednesday, December 28, 2011

Updating an Oracle view

Updating a view in Oracle means updating the base tables on which the view is defined. A view can be either updatable or non-updatable, and a few conditions need to be satisfied for a view to be updatable.

Restrictions on updating a view in Oracle:
A view whose definition contains any of the following cannot be updated in Oracle:
  • A set operator
  • A DISTINCT operator
  • An aggregate or analytic function
  • A GROUP BY, ORDER BY, MODEL, CONNECT BY, or START WITH clause
  • A collection expression in a SELECT list
  • A subquery in a SELECT list
  • A subquery designated WITH READ ONLY

Example,

Consider a scenario.

There are three tables: emp_details, emp_wages, department as follows:
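The original post showed the three tables as an image. A minimal sketch of definitions consistent with the views below might look like this (the column names and datatypes are assumptions inferred from the view definitions):

```sql
create table department (
  dept_id   number primary key,
  dept_name varchar2(50)
);

create table emp_details (
  emp_id       number primary key,
  emp_name     varchar2(50),
  emp_location varchar2(50)
);

create table emp_wages (
  emp_id    number references emp_details(emp_id),
  dept_id   number references department(dept_id),
  basic_pay number(10,2),
  hra       number(10,2),
  pf        number(10,2)
);
```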


Consider a view v_emp --

create or replace view v_emp as
select emp_name name, emp_location location, dept_name department,
       basic_pay basic, hra, pf
from emp_details detl, emp_wages wages, department dept
where detl.emp_id = wages.emp_id
  and wages.dept_id = dept.dept_id;

The view above is updatable because its definition does not contain any of the restrictive clauses.

Now, consider a view v_emp_distinct --


create or replace view v_emp_distinct as
select distinct emp_name name, emp_location location, dept_name department,
       basic_pay basic, hra, pf
from emp_details detl, emp_wages wages, department dept
where detl.emp_id = wages.emp_id
  and wages.dept_id = dept.dept_id;

Because this view contains a DISTINCT operator, it is not updatable.

Again consider a view v_emp_aggr --

create or replace view v_emp_aggr as
select emp_location location, sum(basic_pay) total_basic
from emp_details detl, emp_wages wages, department dept
where detl.emp_id = wages.emp_id
  and wages.dept_id = dept.dept_id
group by emp_location;

The aggregate function and the GROUP BY clause in this definition make the view non-updatable.

You can copy and paste the code yourself to verify these restrictions.
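As a quick check, attempting DML against one of the non-updatable views fails. The error below is what Oracle typically raises in this situation; the sample data values are hypothetical:

```sql
-- Updating the plain join view can work (subject to Oracle's
-- key-preserved table rules for join views):
update v_emp set location = 'NY' where name = 'John';

-- Updating the DISTINCT view fails with an error such as:
-- ORA-01732: data manipulation operation not legal on this view
update v_emp_distinct set location = 'NY' where name = 'John';
```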

Wednesday, December 7, 2011

Partition table in Oracle 11g using virtual columns

Starting with 11g, Oracle has added a few more partitioning methods to the list. One of these is partitioning on a virtual column. A virtual column is a column that does not store any data; its value is calculated on the fly from an expression.

Actions that can be performed on virtual columns:

1. A virtual column can be used in WHERE clause of a SELECT, UPDATE or DELETE statement.
2. Statistics can be collected on them.
3. They can be used as a partition key to partition a table.
4. Indexes can be created on these columns.
5. Constraints can be created on them.

Actions that cannot be performed on virtual columns:

1. Virtual columns cannot be updated.
2. A virtual column cannot be of a user-defined datatype.
3. No DML is allowed on virtual columns.
4. Virtual columns can only be created on normal tables and not on index organized or temporary tables.

Virtual column based partitioning:

SYNTAX:

CREATE TABLE <table_name>
(
COL1 datatype1,
COL2 datatype2,
COL3 [datatype] GENERATED ALWAYS AS (<expression>) [VIRTUAL]
) [TABLESPACE tablespace_name]
PARTITION BY <partition_type> (COL3)
(
PARTITION <partition_name> VALUES LESS THAN (value1),
PARTITION <partition_name> VALUES LESS THAN (value2),
PARTITION <partition_name> VALUES LESS THAN (value3),
PARTITION <partition_name> VALUES LESS THAN (MAXVALUE)
);

Here we have used range partitioning method to partition a table.
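A concrete sketch, with assumed table and column names, could be:

```sql
-- Hypothetical sales table partitioned on a virtual column.
-- total_amount is never stored; it is derived from quantity * unit_price.
CREATE TABLE sales
(
  sale_id      NUMBER,
  quantity     NUMBER,
  unit_price   NUMBER(10,2),
  total_amount NUMBER GENERATED ALWAYS AS (quantity * unit_price) VIRTUAL
)
PARTITION BY RANGE (total_amount)
(
  PARTITION p_small  VALUES LESS THAN (1000),
  PARTITION p_medium VALUES LESS THAN (10000),
  PARTITION p_large  VALUES LESS THAN (MAXVALUE)
);
```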

Thursday, November 24, 2011

Partitioning in Oracle

Partitioning in Oracle is a way of distributing table data into separate segments to improve the performance of data retrieval. A table with more than 2 GB of data is generally considered a candidate for partitioning. Data in a partitioned table is physically divided and stored in separate physical datafiles, but this is transparent to the end user.

A table can be partitioned in different ways:

--> Range Partition
--> List Partition
--> Hash Partition

Range Partition:

A table can be divided using range partitioning if you want to distribute the data using ranges of values of one or more columns. Generally a date column is used as the partition key for range partitioning, e.g. separate partitions for Jan-2011, Feb-2011, ......, Dec-2011.

Syntax:

CREATE TABLE SAMPLE
(
COLUMN1 VARCHAR2(10),
COLUMN2 DATE,
COLUMN3 NUMBER(10,2)
)TABLESPACE TABLE1
PARTITION BY RANGE(COLUMN2)
(
PARTITION PART1 VALUES LESS THAN (TO_DATE('01/02/2011','DD/MM/YYYY')),
PARTITION PART2 VALUES LESS THAN (TO_DATE('01/03/2011','DD/MM/YYYY')),
....
....
PARTITION PART12 VALUES LESS THAN (TO_DATE('01/12/2011','DD/MM/YYYY'))
);

List Partition:

A table can be divided using list partitioning if you want to distribute the data using lists of discrete values of a column. This method is generally used when the table is mostly queried by the value of that column.

Syntax:

CREATE TABLE SAMPLE
(
COLUMN1 VARCHAR2(10),
COLUMN2 DATE,
COLUMN3 NUMBER(10,2)
)TABLESPACE TABLE1
PARTITION BY LIST(COLUMN1)
(
PARTITION PART1 VALUES ('A','B'),
PARTITION PART2 VALUES ('C','D','E'),
....
....
PARTITION PART12 VALUES ('G','H','V')
);

Hash Partition:

Hash partitioning enables easy partitioning of data that does not lend itself to range or list partitioning.

Syntax:

CREATE TABLE SAMPLE
(
COLUMN1 VARCHAR2(10),
COLUMN2 DATE,
COLUMN3 NUMBER(10,2)
)TABLESPACE TABLE1
PARTITION BY HASH(COLUMN2)
PARTITIONS n
STORE IN (TS1,TS2,....TSn);


Key points while partitioning a table:

1. A table containing a LONG or LONG RAW datatype column cannot be partitioned.
2. Pseudocolumns like LEVEL, ROWID, MLSLABEL cannot be used as a partition key.
3. Partition key columns can be nullable.
4. In the case of range or hash partitioning, you can use up to 16 columns as the partition key.
5. On a partitioned table, you cannot define a unique local index unless the partitioning key is part of the index key.
6. A bitmap index on a partitioned table cannot be a global index.
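After creating a partitioned table, the data dictionary can be queried to verify the partitions (the table name below refers to the range-partitioning example above):

```sql
-- List the partitions of the SAMPLE table with their upper bounds
SELECT partition_name, high_value, tablespace_name
FROM   user_tab_partitions
WHERE  table_name = 'SAMPLE'
ORDER  BY partition_position;
```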

Wednesday, November 16, 2011

Informatica backend query to get the load type in a session for a task

SELECT distinct subj.SUBJ_NAME AS FOLDERNAME,
task.TASK_NAME AS SESSIONNAME,
swdgt.INSTANCE_NAME AS INSTANCENAME,
DECODE (sessattr.ATTR_VALUE, '0', 'NORMAL', '1', 'BULK', NULL)
AS TARGETLOADTYPE
FROM OPB_SESSION sess,
OPB_SWIDGET_INST swdgt,
OPB_EXTN_ATTR sessattr,
OPB_TASK task,
OPB_SUBJECT subj
WHERE sess.SESSION_ID = swdgt.SESSION_ID
AND sess.SESSION_ID = sessattr.SESSION_ID
AND sessattr.ATTR_ID = 3
AND swdgt.SESS_WIDG_INST_ID = sessattr.SESS_WIDG_INST_ID
AND sess.SESSION_ID = task.TASK_ID
AND task.SUBJECT_ID = subj.SUBJ_ID
and swdgt.widget_type=2
order by subj.SUBJ_NAME ,
task.TASK_NAME ,
swdgt.INSTANCE_NAME

Monday, November 14, 2011

Informatica backend query to find mappings that use an Aggregator transformation.

select distinct map.mapping_name,inst.instance_name from opb_widget wdgt,opb_widget_inst inst,opb_mapping map
where
wdgt.widget_id=inst.widget_id and
inst.mapping_id=map.mapping_id and
wdgt.widget_type=9;

Informatica backend query to find a mapping parameter used in a mapping.

select subj.subj_name,map.mapping_name ,param.pv_name from opb_map_parmvar param,opb_mapping map,opb_subject subj
where param.mapping_id=map.mapping_id
and map.subject_id=subj.subj_id
and param.pv_name='$$PARAMETER';

Friday, September 30, 2011

Informatica ETL backend query to get mapping name, source, target for a mapping.

select
subj.SUBJ_NAME AS FOLDER_NAME,
WrkFlw.TASK_NAME AS WORKFLOW_NAME,
Sess.INSTANCE_NAME AS SESSION_NAME,
map.MAPPING_NAME,
SRC.INSTANCE_NAME AS SOURCE_NAME,
TGT.INSTANCE_NAME AS TARGET_NAME
from
OPB_SUBJECT subj,
OPB_TASK WrkFlw,
(SELECT
WORKFLOW_ID,
INSTANCE_ID,
TASK_ID,
TASK_TYPE,
INSTANCE_NAME,
MAX(VERSION_NUMBER)
FROM OPB_TASK_INST SESS
WHERE
SESS.TASK_TYPE=68
GROUP BY
WORKFLOW_ID,INSTANCE_ID,TASK_ID,TASK_TYPE,INSTANCE_NAME) Sess,
(SELECT SESSION_ID,MAPPING_ID,MAX(VERSION_NUMBER) FROM OPB_SESSION GROUP BY SESSION_ID,MAPPING_ID) ssn,
opb_mapping map,
(SELECT MAPPING_ID,INSTANCE_NAME ,WIDGET_TYPE,MAX(VERSION_NUMBER) from opb_widget_inst GROUP BY MAPPING_ID,INSTANCE_NAME,WIDGET_TYPE) SRC,
(SELECT MAPPING_ID,INSTANCE_NAME ,WIDGET_TYPE,MAX(VERSION_NUMBER) from opb_widget_inst GROUP BY MAPPING_ID,INSTANCE_NAME,WIDGET_TYPE) TGT
where WrkFlw.IS_VISIBLE = 1
AND WrkFlw.SUBJECT_ID = subj.SUBJ_ID
AND Sess.WORKFLOW_ID = WrkFlw.TASK_ID
AND WrkFlw.TASK_TYPE = 71
AND sess.task_id = ssn.session_id
AND ssn.mapping_id = map.mapping_id
AND map.IS_VISIBLE = 1
AND SRC.MAPPING_ID=map.mapping_id
AND SRC.WIDGET_TYPE=1
AND TGT.MAPPING_ID=map.mapping_id
AND TGT.WIDGET_TYPE=2

Wednesday, August 24, 2011

ETL in Datawarehousing

The ETL phase of data warehousing deals with extracting data from a source and loading it into the data warehouse, after transforming it into a standard required format.

An ETL process can be implemented in many ways. The simplest and earliest approach is hand-coded PL/SQL procedures, but over time more advanced ETL tools have been developed. A few of them are:

Oracle Warehouse Builder (OWB) - Oracle
Data Services - SAP BusinessObjects
IBM Information Server (DataStage) - IBM
PowerCenter - Informatica
DataFlow Manager - Pitney Bowes Business Insight
Transformation Manager - ETL Solutions Ltd.
Pentaho Data Integration - Pentaho

Of these, Informatica is a powerful ETL tool that has recently become very popular because of its features.
To know more about Informatica tool visit the blog again.

Thursday, June 30, 2011

Indexes in data warehouse:

Imagine a telephone directory without an index: it would take hours to search for a name. This simple example shows the importance and use of an index. Similarly, an index on a table with millions of rows helps reduce search time.

An index is a database object that contains an entry for each value that appears in the indexed column(s) of the table or cluster and
provides direct, fast access to rows.

A normal index in an Oracle database is created using a B-tree (balanced tree) structure. While many types of indexes can be created in Oracle, the ones mainly used in data warehousing are normal (B-tree) and bitmap indexes.
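Creating the two index types is straightforward; the table and column names below are illustrative:

```sql
-- B-tree index: suits high-cardinality columns (many distinct values)
CREATE INDEX idx_cust_name ON customer (customer_name);

-- Bitmap index: suits low-cardinality columns, common in data warehouses
CREATE BITMAP INDEX idx_cust_gender ON customer (gender);
```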

Now before going forward, let us get a clear idea of how an index works in Oracle.

In Oracle, everything is accessed in terms of blocks.

What is a block?

Oracle stores data in blocks (logical blocks). One data block corresponds to a specific number of bytes of physical database space
on disk. It is the smallest unit of disk that Oracle reads. Since rows in a table are usually small, many rows can fit into
one block.

The block size in Oracle is configurable, but is generally 4 KB, 8 KB, 16 KB, or 32 KB.

What is a B-Tree structure?

A B-tree, or balanced tree, is a database tree structure in which each node can have multiple children.
If d is the minimum degree of the tree, the number of children per node usually varies between d and 2d. When a node fills up,
it is split to keep the tree balanced.

In Oracle, a B-tree index stores its nodes in database blocks.

How does an index work?

For example, suppose you want to search a telephone directory for the name James. You look at the index and see that
page 5 starts with Jack and page 10 starts with Jaqueline, so James must be somewhere between page 5 and page 10.

On page 5 there is a sub-index with entries for names between Jack and Japper, found on pages 5 through 9.
From this sub-index you learn that page 8 starts with Jame and ends with Jamy, so James must be on that
page. Now you can easily search for James on this page.

Similarly, in a B-tree index each branch block points to the next block to search for a particular row, and finally
a leaf block gives us the required row.

Eg.

In main index,
Page 5 starts with Jack.
Page 10 starts with Jaqueline.

In Page 5 sub index,
Page 5 starts with Jack and page 9 starts with Japper.
Page 8 starts with Jame and ends with Jamy.

So Page 8 is the target page.

Wednesday, June 8, 2011

Surrogate keys

A surrogate key is a unique identifier for the records in a table. Surrogate keys are generally used in data warehousing.

Then what is the difference between surrogate key and a primary key?

Well, it depends on the type of database model being used. In a current database, where only current data is kept, there is a one-to-one correspondence between the surrogate key and the primary key.

But in a temporal database, many rows can share the same primary key value, because historical data is also saved.

In a temporal database, it is necessary to distinguish between the surrogate key and the primary key. Typically, every row has both. The surrogate key identifies the unique row in the database, while the primary key identifies the unique entity in the modeled world; these two keys are not the same. For example, consider a record containing a customer's location details from when he lived in the US, and another from now, when he lives in the UK. Both have the same customer_id, which is the primary key. In a temporal database, where historical data is stored, a surrogate key is required to distinguish between these two records. It is for such situations that surrogate keys are needed.

Example:

A customer table containing customer details.

Customer_id | Customer_name | Customer_address | Customer_city | Phone_number | Email_id
1002 | John Smith | 10, Wall Street | New York | 1123301111 | john.smith@abc.com
1002 | John Smith | 12, Grand Street | London | 1122331111 | john.smith@abc.com

As you can see, we cannot tell the two records apart, since the customer_id values are the same. The first record was entered 10 years ago, and the customer has since moved to a new location, so a new record was added as well. Such a situation can be handled using surrogate keys.

Customer_key | Customer_id | Customer_name | Customer_address | Customer_city | Phone_number | Email_id
1 | 1002 | John Smith | 10, Wall Street | New York | 1123301111 | john.smith@abc.com
2 | 1002 | John Smith | 12, Grand Street | London | 1122331111 | john.smith@abc.com

Now you can see that the surrogate keys are unique even when there is temporal data in the database.

Surrogate keys are generally generated using database sequence objects.
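In Oracle, that typically looks like the following sketch (the sequence, table, and column names are illustrative):

```sql
-- Sequence that supplies surrogate key values
CREATE SEQUENCE customer_key_seq START WITH 1 INCREMENT BY 1;

-- Assign a new surrogate key on every insert
INSERT INTO customer (customer_key, customer_id, customer_name, customer_city)
VALUES (customer_key_seq.NEXTVAL, 1002, 'John Smith', 'London');
```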

Advantages of a surrogate key:

Immutability

Surrogate keys do not change while the row exists. This has the following advantages.

Requirement changes:

Attributes that uniquely identify an entity might change, which might invalidate the suitability of the natural, compound keys.

Performance

Surrogate keys tend to be a compact data type such as a four-byte integer. This allows the database to query the single key column faster than it could multiple columns.

Tuesday, May 24, 2011

Modelling a STAR schema from ER model

An ER model is a two-dimensional data model used in transactional databases. These models do not provide the features required in a data warehouse; for this purpose the star schema model is used. We can design a star schema from an ER data model.

A few points that need to be considered when converting an ER model to a star schema are as follows:


  • The FACT table generally corresponds to the table in the ER model that has the maximum cardinality across all its relationships, i.e. the "many" side in one-to-many or many-to-many relations.

  • Dimensions are the tables with cardinality of 0 or 1.

A star schema can also be modeled from scratch using a top-down approach. The following steps are followed when modeling a STAR schema:



  • Identify the business process for analysis.

  • Identify measures or facts.

  • Identify dimensions for facts.

  • Determine the lowest grain, i.e. the lowest level of cardinality, for the fact table.

In the next post we will see the concept of surrogate keys.



Sunday, May 22, 2011

Business Intelligence and reporting:

Business intelligence is used for reporting purposes: to show the performance of a business across different dimensions such as time, region, and product.
Reporting requires data to be presented in a form that supports decision making. For this purpose, the concept of data warehousing was introduced.

The data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The process of gathering, cleaning and integrating data from various sources, usually from long-term existing operational systems to a target data warehouse is called data warehousing.

Design model for Data warehouse:
Generally two models are followed for storing data in data warehouse.
• Normalized approach – In this approach the dimension tables are normalized, usually to 3NF. This approach is called snowflaking.
The advantage of this schema :
• Loading data from OLTP(source database) is easy .
Disadvantages of this schema:
• It can be difficult to join data from different sources into meaningful information, and
• to access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.

• Denormalized or dimensional approach: In this approach the data is divided into dimensions and facts. The fact table stores the measures to be analyzed, while the dimension tables store the context in which the facts in the fact table are analyzed.

Advantages of dimensional approach :
• Complex query processing can be done efficiently.
• The schema can closely depict the business model.
• It incorporates the concept of slowly changing dimensions.

Difference between dimensional model and relational model:
• An E-R model (used in OLTP or transactional systems) is highly normalized (even at a logical level), whereas a dimensional model aggregates most of the attributes and hierarchies of a dimension into a single entity.
• An E-R diagram is a complex maze of hundreds of entities linked with each other, whereas the Dimensional model has logical grouped set of star-schemas.
• The E-R diagram is split as per the entities. A dimension model is split as per the dimensions and facts.
• In an E-R diagram all attributes for an entity including textual as well as numeric, belong to the entity table. Whereas a 'dimension' entity in dimension model has mostly the textual attributes, and the 'fact' entity has mostly numeric attributes.
Dimensional approach of data modeling leads to a design approach called OLAP (Online Analytical Processing). The core of OLAP system is an OLAP cube. It consists of numeric facts called measures which are categorized by dimensions. The cube metadata is typically created from a star schema or snowflake schema of tables in a relational database.
Data warehousing uses the OLAP design approach for data modeling. An OLAP uses a technique of OLAP cube for modeling and building a decision support system.

To understand an OLAP cube, let's take an example.
A company has got a project to warehouse the sales data of different soft drink companies. The data will be used to create reports of sales by region, time, and product.
Here the data from the operational systems will be extracted and stored in a data warehouse. The data warehouse will be designed using an OLAP cube as shown below.





Here, there are three dimensions on which our data warehouse reporting will be done. Time, Geography and Product.
If we have to find the sales of a product class in a particular year for a particular region, we move to that product class on the product axis, then to that year on the time axis, and then to that region along the geography axis. This leads us to a cell of the cube that gives the sales value we require. This multidimensional approach to warehouse design lets us drill into sales data for any product class, for any year, month, or week, in any geography.
This OLAP cube now has to be implemented on a data storage system. An OLAP system designed on a relational database is called ROLAP (Relational OLAP).
For a ROLAP system, the data modeling schema used to implement this multidimensional structure is the STAR schema.
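A minimal star schema for the sales example above might be sketched as follows (the table and column names are assumptions):

```sql
-- Dimension tables hold the context for analysis
CREATE TABLE dim_time      (time_key NUMBER PRIMARY KEY, sale_date DATE, month VARCHAR2(10), year NUMBER);
CREATE TABLE dim_geography (geo_key  NUMBER PRIMARY KEY, city VARCHAR2(50), region VARCHAR2(50));
CREATE TABLE dim_product   (prod_key NUMBER PRIMARY KEY, product_name VARCHAR2(50), product_class VARCHAR2(50));

-- Fact table holds the numeric measures, keyed by the dimensions
CREATE TABLE fact_sales
(
  time_key     NUMBER REFERENCES dim_time(time_key),
  geo_key      NUMBER REFERENCES dim_geography(geo_key),
  prod_key     NUMBER REFERENCES dim_product(prod_key),
  sales_amount NUMBER(12,2),
  units_sold   NUMBER
);
```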

In the next post we will see how to model a star schema from an ER model.

Business Intelligence:

Business Intelligence refers to computer-based techniques used in identifying and analyzing business data to gain insight into the health of a business: for example, sales summaries and market summaries by region, time period, product, or product class.
Business intelligence aims to support business decision making by creating processes that transform business data into a form that helps in decision making.
Often transforming of business data is done using the concept of data warehousing.
PS: Remember that all data warehouses are used for business intelligence, but not all business intelligence requires a data warehouse.
In simple terms “Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making”.
Common uses of business intelligence are reporting, data mining, process mining, business performance intelligence, predictive analysis.

In the next post I will describe business intelligence and its use in reporting.
