pySQL

The objective of this article is to understand how to perform data manipulation on pandas DataFrames using SQL, with the help of the pandasql library.



What is pandasql?
As per the documentation, pandasql allows us to query pandas DataFrames using SQL syntax. Under the hood it loads the DataFrames into an in-memory SQLite database, runs the query there, and returns the result as a new pandas DataFrame.

Installation of pandasql
The library can be installed using either of the two methods below; both use pip:
  • Using Terminal
pip install -U pandasql
  • Using Jupyter Notebooks
!pip install -U pandasql

Use Case
The main function used in pandasql is sqldf. sqldf accepts two arguments:
  1. a SQL query
  2. session environment variables ( globals() or locals() )
The session environment argument is optional and is handled by Python itself; even if we do not provide the session variables, we can still query a pandas DataFrame with SQL.
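A minimal sketch of both calling styles, using a throwaway DataFrame (the data here is purely illustrative):
from pandasql import sqldf
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# environment discovered automatically from the calling frame
print(sqldf('select * from df'))
# environment passed explicitly as the second argument
print(sqldf('select * from df', locals()))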

Import Necessary Libraries
from pandasql import sqldf
import pandas as pd

Import Data Sets 
emp_df=pd.read_csv(r'D:\python_coding\pandas_practice\emp.csv')
emp_df.head(10)

dept_df=pd.read_csv(r'D:\python_coding\pandas_practice\dept.csv')
dept_df.head()
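If the CSV files are not available, small stand-ins with the same columns (the classic EMP/DEPT schema) can be built in memory; the values below are purely illustrative:
emp_df = pd.DataFrame({
    'empno':  [7369, 7499, 7521],
    'ename':  ['SMITH', 'ALLEN', 'WARD'],
    'sal':    [800, 1600, 1250],
    'comm':   [None, 300, 500],
    'deptno': [20, 30, 30],
})
dept_df = pd.DataFrame({
    'deptno': [10, 20, 30],
    'dname':  ['ACCOUNTING', 'RESEARCH', 'SALES'],
    'loc':    ['NEW YORK', 'DALLAS', 'CHICAGO'],
})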

Querying DataFrames
In this section of the blog we will write the code that lets us query the pandas DataFrames with SQL.
Create the query variables:
equery='select empno,ename,sal,comm,deptno from emp_df'
dquery='select deptno,dname,loc from dept_df'
As discussed in the previous section, sqldf is the main method of the pandasql library; it is what gives the DataFrames their SQL querying capability. To keep the calls tidy, we will wrap sqldf in a small Python function that takes the query variables created above as its argument. Such a wrapper can be written in two ways, with def or with a lambda; this blog covers both approaches.
using def 
def df_to_sql(q):
    return sqldf(q)
using lambda 
dftosql = lambda q:sqldf(q)
Read Data using SQL query
print(type(df_to_sql(equery)))
df_to_sql(equery)

print(type(dftosql(dquery)))
dftosql(dquery)
Both calls return a regular pandas DataFrame, so anything you would normally do with a DataFrame still applies to the result.
Perform Join operation
edquery = '''select e.empno, e.ename, e.sal,
             d.deptno, d.dname from emp_df e
             inner join dept_df d on e.deptno = d.deptno'''
merge_df=dftosql(edquery)
print(type(merge_df))
merge_df.head(10)
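For comparison only, the same inner join expressed in plain pandas (not part of the pandasql workflow, just a cross-check on the result):
# equivalent inner join in pandas, projected to the same columns as the SQL query
merge_df_native = emp_df.merge(dept_df, on='deptno')[['empno', 'ename', 'sal', 'deptno', 'dname']]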

Group By and Aggregation
agg_query='select deptno, round(avg(sal),2) as avg from emp_df group by deptno'
agg_df=dftosql(agg_query)
print(type(agg_df))
agg_df.head()
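And the same aggregation in plain pandas, again only as a cross-check:
# average salary per department, rounded to 2 decimals
avg_native = (emp_df.groupby('deptno', as_index=False)['sal']
              .mean().round(2).rename(columns={'sal': 'avg'}))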

The pandasql documentation mentions the session or environment variables globals() and locals(), but beyond that not much is explained. In the documentation these session variables are passed as an argument to the sqldf method, which performs the conversion; while working on this use case, however, I did not pass them and it still worked. The reason is that globals and locals are maintained internally by Python and hold the namespace of the current session, so sqldf can look them up on its own when no environment is supplied. The output of globals() is a Python dictionary containing an entry for every object created in the session. To keep the article simple to understand, I have only extracted the keys of the dictionary, which correspond to all the objects of the session.
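If you prefer to be explicit anyway, the pattern shown in the pandasql README passes the environment yourself:
# documented pattern: hand sqldf the session namespace explicitly
pysqldf = lambda q: sqldf(q, globals())
pysqldf('select deptno,dname,loc from dept_df')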
Let's check what exactly globals() holds. 
print(globals().keys())

dict_keys(['__name__', '__doc__', '__package__', '__loader__', '__spec__', 
'__builtin__', '__builtins__','_ih', '_oh', '_dh', 'In', 'Out', 'get_ipython', 'exit', 
'quit', '_', '__', '___', '_i', '_ii', '_iii', '_i1','sqldf', 'pd', '_i2', 
'emp_df', 'dept_df', '_i3', '_3', '_i4', 'equery', 'dquery', '_i5', 'dftosql', '_i6', 
'df_to_sql', '_i7', '_7', '_i8', '_8', '_i9', '_9', '_i10', 'edquery', 'merge_df', 
'_10', '_i11', 'agg_query', 'agg_df', '_11', '_i12', '_12', '_i13'])
  
To summarize, the topics covered in this article:
  • use of the pandasql library to execute SQL queries on pandas DataFrames.
  • SQL operations such as joins, group by, and aggregations on pandas DataFrames via SQL queries.
That's all for performing SQL operations on pandas DataFrames.
Thanks for reading my blog and supporting the content. 
Please subscribe and follow me on Blogspot for upcoming content. A word of appreciation always helps to keep up the spirit, and a healthy knowledge network helps us grow.
Share the Knowledge.
