Wednesday, June 20, 2018

Pandas: Python for data analysis

Pandas: The pandas library in Python is used for all kinds of data analysis. It is highly efficient where label-oriented analysis is required. A major advantage of using Python and pandas is that one does not need to first load the data into a relational database and then do the analysis using SQL. Also, with a multitude of data formats such as csv, xlsx, zip, pdf, xml and json, one needs a tool that supports reading data from all of them. As of now, no single tool supports everything. Python, being open source, can be used freely by any organisation to create a data-reading framework that reads data from files (structured and unstructured), does some data cleansing on the go, and carries out data analysis using pandas with labelled data columns.
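
As a rough sketch of the idea of reading several formats into one labelled table, the snippet below loads a small CSV and a small JSON snippet and combines them. The sample text, the column names (name, qty) and the variable names are made up for illustration; in-memory strings stand in for real files.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV text standing in for a real file on disk.
csv_text = "name,qty\napple,3\npear,5\n"
# Hypothetical JSON records carrying the same kind of data.
json_text = '[{"name": "plum", "qty": 2}]'

# Both readers return a DataFrame with labelled columns.
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

# Stack the two sources into one table and renumber the rows.
combined = pd.concat([csv_df, json_df], ignore_index=True)
print(combined)
```

With real files one would pass a file path instead of io.StringIO; other formats (xlsx, xml and so on) have their own readers and may need extra packages installed.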


Pandas mainly supports two data structures, i.e. Series and DataFrames, which make it very flexible and efficient for data analysis.


Series: It is similar to a NumPy ndarray, with the added advantage of having labels for the items in the array. Any compute operation between two Series matches up the values by label before computing. This is not available in ndarray.

import pandas as pd
sr = pd.Series([1, 2, 3, 4], index=['col1', 'col2', 'col3', 'col4'])
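
To illustrate the label-based computation, here is a small sketch with two Series whose labels are in a different order. The names s1, s2, total and positional are only for this example.

```python
import pandas as pd

# Same labels, but listed in a different order in each Series.
s1 = pd.Series([1, 2, 3], index=['col1', 'col2', 'col3'])
s2 = pd.Series([10, 20, 30], index=['col3', 'col2', 'col1'])

# Series addition matches values by label, not by position:
# col1 -> 1 + 30, col2 -> 2 + 20, col3 -> 3 + 10
total = s1 + s2
print(total)

# The underlying ndarrays know nothing about labels and add by position:
# 1 + 10, 2 + 20, 3 + 30
positional = s1.to_numpy() + s2.to_numpy()
print(positional)
```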

DataFrames: A DataFrame in pandas acts like a SQL table with column labels and index labels. An added advantage of DataFrames is that one can pass row index labels as well as column labels. This helps in getting specific data values for analysis based on the row index label and the column name (column label).

df = pd.DataFrame({'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
                   'two': pd.Series([4, 5, 6, 7], index=['a', 'b', 'c', 'd'])})
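
A short sketch of looking up values by row label and column label, using the same DataFrame as above. The variable names value and missing are just for illustration. Note that column 'one' has no entry for label 'd', so pandas fills that cell with NaN.

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
                   'two': pd.Series([4, 5, 6, 7], index=['a', 'b', 'c', 'd'])})

# Row label 'b', column label 'two'.
value = df.loc['b', 'two']
print(value)

# Column 'one' had no value for row 'd', so this cell is NaN.
missing = df.loc['d', 'one']
print(missing)
```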


An advantage of doing data analysis with DataFrames and Series is that one can easily apply an arithmetic operation to every item of the DataFrame or Series in one go, by using the operator directly on the object itself.

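For example (with numpy imported as np),

df1 = pd.DataFrame([[1, 2, 3, np.nan], [4, 5, 6, 7]])
df2 = pd.DataFrame([[8, 9, 10, np.nan], [11, 12, 13, 14]])

df3 = df1 + df2

Which will give a DataFrame df3 holding the values

[[9, 11, 13, nan], [15, 17, 19, 21]]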

This is similar to operator overloading in C++.
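
The same operators also work between a DataFrame and a single scalar. A minimal sketch, with made-up variable names doubled and shifted:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, np.nan], [4, 5, 6, 7]])

# The scalar is applied to every item; NaN stays NaN.
doubled = df1 * 2
shifted = df1 + 100

print(doubled)
print(shifted)
```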

Some of the basic operations on DataFrames are as follows:

Operation                        Syntax          Result
Select column                    df[col]         Series
Select row by label              df.loc[label]   Series
Select row by integer location   df.iloc[loc]    Series
Slice rows                       df[5:10]        DataFrame
Select rows by boolean vector    df[bool_vec]    DataFrame
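
The operations in the table can be tried on a small DataFrame like the one below. The data and variable names are made up for illustration, and the slice is 1:3 rather than 5:10 because the example has only four rows.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [5, 6, 7, 8]},
                  index=['a', 'b', 'c', 'd'])

col = df['one']                # select column -> Series
row_by_label = df.loc['b']     # select row by label -> Series
row_by_pos = df.iloc[1]        # select row by integer location -> Series
sliced = df[1:3]               # slice rows -> DataFrame (rows 'b' and 'c')
filtered = df[df['one'] > 2]   # boolean vector -> DataFrame (rows 'c' and 'd')

print(type(col).__name__, type(sliced).__name__)
```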

Every DataFrame can also be transposed, i.e. its rows and columns swapped. The transposed DataFrame is returned by the df.T attribute of the DataFrame object.
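
A quick sketch of what df.T gives back, using a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]},
                  index=['a', 'b', 'c'])

# Column labels become row labels and vice versa.
t = df.T
print(t)

# The value at row 'b', column 'two' is now at row 'two', column 'b'.
print(t.loc['two', 'b'])
```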

See you soon!!
