pySQL

The objective of this article is to understand how to perform data manipulation on pandas DataFrames using SQL, with the help of the pandasql library.



What is pandasql?
As per the documentation, pandasql allows us to query pandas DataFrames using SQL syntax. Under the hood it loads the DataFrames into an in-memory SQLite database, runs the query there, and returns the result as a new pandas DataFrame.

Installation of pandasql
The library can be installed using either of the two methods below; both use pip:
  • Using Terminal
pip install -U pandasql
  • Using Jupyter Notebooks
!pip install -U pandasql

Use Case
The main function used in pandasql is sqldf. sqldf accepts two arguments:
  1. a SQL query
  2. session environment variables ( globals() or locals() )
The session environment argument is optional and is handled by Python itself; even if we do not provide the session variables, we can still query a pandas DataFrame with SQL.
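A minimal sketch of both calling styles, using a throwaway DataFrame (the data here is purely illustrative):
from pandasql import sqldf
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# environment discovered automatically from the calling frame
print(sqldf('select * from df'))
# environment passed explicitly as the second argument
print(sqldf('select * from df', locals()))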

Import Necessary Libraries
from pandasql import sqldf
import pandas as pd

Import Data Sets 
emp_df=pd.read_csv(r'D:\python_coding\pandas_practice\emp.csv')
emp_df.head(10)

dept_df=pd.read_csv(r'D:\python_coding\pandas_practice\dept.csv')
dept_df.head()
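If the CSV files are not available, small stand-ins with the same columns (the classic EMP/DEPT schema) can be built in memory; the values below are purely illustrative:
emp_df = pd.DataFrame({
    'empno':  [7369, 7499, 7521],
    'ename':  ['SMITH', 'ALLEN', 'WARD'],
    'sal':    [800, 1600, 1250],
    'comm':   [None, 300, 500],
    'deptno': [20, 30, 30],
})
dept_df = pd.DataFrame({
    'deptno': [10, 20, 30],
    'dname':  ['ACCOUNTING', 'RESEARCH', 'SALES'],
    'loc':    ['NEW YORK', 'DALLAS', 'CHICAGO'],
})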

Querying DataFrames
In this section of the blog we will write the code that lets us query the pandas DataFrames with SQL.
Create the query variables:
equery='select empno,ename,sal,comm,deptno from emp_df'
dquery='select deptno,dname,loc from dept_df'
As discussed in the previous section, sqldf is the main method of the pandasql library; it is what gives the DataFrames their SQL querying capability. To keep the calls tidy, we will wrap sqldf in a small Python function that takes the query variables created above as its argument. Such a wrapper can be written in two ways, with def or with a lambda; this blog covers both approaches.
using def 
def df_to_sql(q):
    return sqldf(q)
using lambda 
dftosql = lambda q:sqldf(q)
Read Data using SQL query
print(type(df_to_sql(equery)))
df_to_sql(equery)

print(type(dftosql(dquery)))
dftosql(dquery)
Both calls return a regular pandas DataFrame, so anything you would normally do with a DataFrame still applies to the result.
Perform Join operation
edquery = '''select e.empno, e.ename, e.sal,
             d.deptno, d.dname from emp_df e
             inner join dept_df d on e.deptno = d.deptno'''
merge_df=dftosql(edquery)
print(type(merge_df))
merge_df.head(10)
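For comparison only, the same inner join expressed in plain pandas (not part of the pandasql workflow, just a cross-check on the result):
# equivalent inner join in pandas, projected to the same columns as the SQL query
merge_df_native = emp_df.merge(dept_df, on='deptno')[['empno', 'ename', 'sal', 'deptno', 'dname']]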

Group By and Aggregation
agg_query='select deptno, round(avg(sal),2) as avg from emp_df group by deptno'
agg_df=dftosql(agg_query)
print(type(agg_df))
agg_df.head()
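And the same aggregation in plain pandas, again only as a cross-check:
# average salary per department, rounded to 2 decimals
avg_native = (emp_df.groupby('deptno', as_index=False)['sal']
              .mean().round(2).rename(columns={'sal': 'avg'}))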

The pandasql documentation mentions the session or environment variables globals() and locals(), but beyond that not much is explained. In the documentation these session variables are passed as an argument to the sqldf method, which performs the conversion; while working on this use case, however, I did not pass them and it still worked. The reason is that globals and locals are maintained internally by Python and hold the namespace of the current session, so sqldf can look them up on its own when no environment is supplied. The output of globals() is a Python dictionary containing an entry for every object created in the session. To keep the article simple to understand, I have only extracted the keys of the dictionary, which correspond to all the objects of the session.
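If you prefer to be explicit anyway, the pattern shown in the pandasql README passes the environment yourself:
# documented pattern: hand sqldf the session namespace explicitly
pysqldf = lambda q: sqldf(q, globals())
pysqldf('select deptno,dname,loc from dept_df')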
Let's check what exactly globals() holds. 
print(globals().keys())

dict_keys(['__name__', '__doc__', '__package__', '__loader__', '__spec__', 
'__builtin__', '__builtins__','_ih', '_oh', '_dh', 'In', 'Out', 'get_ipython', 'exit', 
'quit', '_', '__', '___', '_i', '_ii', '_iii', '_i1','sqldf', 'pd', '_i2', 
'emp_df', 'dept_df', '_i3', '_3', '_i4', 'equery', 'dquery', '_i5', 'dftosql', '_i6', 
'df_to_sql', '_i7', '_7', '_i8', '_8', '_i9', '_9', '_i10', 'edquery', 'merge_df', 
'_10', '_i11', 'agg_query', 'agg_df', '_11', '_i12', '_12', '_i13'])
  
To summarize, the topics covered in this article:
  • use of the pandasql library to execute SQL queries on pandas DataFrames.
  • SQL operations such as joins, group by, and aggregations on pandas DataFrames via SQL queries.
That's all for performing SQL operations on pandas DataFrames.
Thanks for reading my blog and supporting the content. 
Please subscribe and follow me on Blogspot for upcoming content. A word of appreciation always helps to keep up the spirit, and a healthy knowledge network helps us grow.
Share the Knowledge.
