Thursday, June 28, 2018

Amazon Redshift as a data warehouse

Redshift is a columnar database offered by Amazon AWS as a cloud database service. Redshift uses a cluster architecture with a leader node, responsible for cluster and node management, and compute nodes for data storage and computation. Redshift is based on the PostgreSQL database, although with some relaxed constraints and rules.

A Redshift database cluster is created by specifying the number of compute nodes required in the cluster, and customers are charged only for compute nodes; leader node usage is completely free. The Redshift database architecture is shown below.





[Figure: Physical Architecture]

[Figure: Logical Architecture]


When a user logs in and runs queries in a database, search_path is used to resolve which schema an unqualified object name refers to. To add a schema to search_path, use the SET command.

set search_path to '$USER',pg_temp,pg_catalog,...;
------------------
pg_temp and pg_catalog are searched first by default, in that order, and then any schema specified in search_path is scanned.

To set the timeout of queries in session use statement_timeout.

set statement_timeout to milliseconds;
----------------------
This specifies the timeout for any statement being executed on the cluster. 0 turns off the timeout.

To set the timezone of a session use timezone.

SET timezone TO time_zone
-----------------------
This sets the timezone of the current session. Run "select pg_timezone_names();" to view the list of timezone names. To set the default timezone for a database user, use
"ALTER USER username SET timezone to 'America/New_York';"


WLM: Workload management
You can use workload management (WLM) to define multiple query queues and to route queries to the appropriate queues at runtime.

When you have multiple sessions or users running queries at the same time, some queries might consume cluster resources for long periods of time and affect the performance of other queries. For example, suppose one group of users submits occasional complex, long-running queries that select and sort rows from several large tables. Another group frequently submits short queries that select only a few rows from one or two tables and run in a few seconds. In this situation, the short-running queries might have to wait in a queue for a long-running query to complete.

You can improve system performance and your users’ experience by modifying your WLM configuration to create separate queues for the long-running queries and the short-running queries. At run time, you can route queries to these queues according to user groups or query groups.

You can configure up to eight query queues and set the number of queries that can run in each of those queues concurrently. You can set up rules to route queries to particular queues based on the user running the query or labels that you specify. You can also configure the amount of memory allocated to each queue, so that large queries run in queues with more memory than other queues. You can also configure the WLM timeout property to limit long-running queries.

SET query_group TO group_label   
------------------------------
This sets the query group for the session. It can be used to drive WLM to assign a query to a queue based on the group assignment for the session.
create group admin_group with user admin246, admin135, sec555;
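
A hedged sketch of the session flow that follows (the queue setup itself lives in the cluster's WLM configuration, not in SQL, and the table name is illustrative):

```sql
-- label subsequent queries in this session so WLM can route them
-- to the queue configured for this query group
set query_group to 'admin_group';

-- this query runs in the queue matched to 'admin_group'
select count(*) from some_large_table;

-- stop labelling queries for the rest of the session
reset query_group;
```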










Wednesday, June 20, 2018

Pandas: Read encrypted zip file to data frames

This is easy....


#!/usr/bin/env python
import pandas as pd
import zipfile


# open the (password-protected) zip archive
zf = zipfile.ZipFile("./kivaData_augmented.zip")
for txtfile in zf.infolist():
    print(txtfile.filename)
    # pwd must be bytes; it is only needed for encrypted archives
    df = pd.read_csv(zf.open(txtfile, "r", pwd=b"1234"))
    print(df['term_in_months'])

Pandas: Python for data analysis

Pandas: The pandas library in Python is used to do all kinds of data analysis. Pandas is highly efficient where label-oriented analysis is required. A major advantage of using Python and pandas is that one does not need a relational database to first load the data before doing the analysis with SQL. Also, with the multitude of data formats like csv, xlsx, zip, pdf, xml, json, etc., one needs a tool that supports reading data from all of these formats, and as of now no single tool supports everything. Python, being open source, can be used freely by any organisation to create a data-reading framework which reads data from files (structured and unstructured), does some data cleansing on the go, and carries out data analysis using pandas with labelled data columns.


Pandas supports mainly two data types, i.e. Series and DataFrames, which makes it very flexible and efficient for data analysis.


Series: It is similar to an ndarray, with the added advantage of having labels for the items in the array. Any computation between Series aligns values based on labels. This is not available with ndarray.

             sr = pd.Series([1,2,3,4],index=['col1','col2','col3','col4'])
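
Label alignment can be seen in a small sketch (values are illustrative):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# addition aligns on index labels, not positions;
# labels present in only one Series produce NaN
total = s1 + s2
print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
```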

DataFrames: A DataFrame in pandas acts like a SQL table with columns and index labels. An added advantage of DataFrames is that one can pass index labels as well as column labels. This helps in getting specific data values for analysis based on the row index label and the column names (column labels).

df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),
   'two':pd.Series([4,5,6,7],index=['a','b','c','d'])})
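
Running that construction shows how pandas fills the label missing from one column with NaN, and how a value can be fetched by row and column label (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
                   'two': pd.Series([4, 5, 6, 7], index=['a', 'b', 'c', 'd'])})

# row 'd' has no value in column 'one', so pandas fills it with NaN
print(df)
#    one  two
# a  1.0    4
# b  2.0    5
# c  3.0    6
# d  NaN    7

# fetch a single value by row label and column label
print(df.loc['b', 'two'])   # 5
```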


An advantage of data analysis using DataFrames and Series is that one can easily carry out an element-wise operation on every item of the DataFrame or Series in one go, using the operators of the object itself.

For example, given two DataFrames built from:

df1 = [[1,2,3,nan],[4,5,6,7]]
df2 = [[8,9,10,nan],[11,12,13,14]]

df3 = df1 + df2

which will give:

df3 = [[9,11,13,nan],[15,17,19,21]]

This is similar to operator overloading in C++.
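
The same example runs as-is once the nested lists are wrapped in DataFrames (using numpy's nan):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, np.nan], [4, 5, 6, 7]])
df2 = pd.DataFrame([[8, 9, 10, np.nan], [11, 12, 13, 14]])

# '+' is applied element-wise across aligned rows and columns;
# NaN in either operand yields NaN in the result
df3 = df1 + df2
print(df3)
#       0     1     2     3
# 0   9.0  11.0  13.0   NaN
# 1  15.0  17.0  19.0  21.0
```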

Also some of the basic operations in dataframes are as follows:

Operation                        Syntax          Result
Select column                    df[col]         Series
Select row by label              df.loc[label]   Series
Select row by integer location   df.iloc[loc]    Series
Slice rows                       df[5:10]        DataFrame
Select rows by boolean vector    df[bool_vec]    DataFrame
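
Each row of the table can be tried on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'col': range(12), 'other': range(100, 112)},
                  index=[f'r{i}' for i in range(12)])

col_series = df['col']           # select column -> Series
row_by_label = df.loc['r5']      # select row by label -> Series
row_by_pos = df.iloc[5]          # select row by integer location -> Series
sliced = df[5:10]                # slice rows 5..9 -> DataFrame
filtered = df[df['col'] > 8]     # boolean vector -> DataFrame (3 rows)

print(type(col_series).__name__, type(sliced).__name__)  # Series DataFrame
```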

Every DataFrame also exposes a transposed view of itself. This can be accessed using the df.T attribute of the DataFrame object.

See you soon!!

Oracle analytics - Different pricing schemes

OBIEE on premise Licensing: Component based licensing: The pricing is based on the user base or volume(COGS/revenue). It is a buy as ...