
Merge Datasets

In the data universe, joins are among the most critical and frequently performed operations. With the Python Pandas API we perform the same kind of operations while working on a data science algorithm or an ETL (Extract, Transform and Load) project.


Pandas provides two methods for joins: merge() and join(). They work in a similar way, but internally they differ: merge() joins on columns by default, while join() joins on the index. In this blog I have tried my best to list out the differences in the behaviour of these two methods.

merge()

merge() performs the join operation on common columns by default.


import pandas as pd
d1 = {'Id': [1, 2, 3, 4, 5],
	'Name':['Vivek', 'Rahul', 'Gunjan', 'Ankit','Vishakha'],
	'Age':[30, 24, 32, 32, 28],}
d2 = {'Id': [1, 2, 3, 4],
	'Address':['Delhi', 'Gurgaon', 'Noida', 'Pune'],
    'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}

df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)

merge() when common columns are present in data frames

'''pandas merge by default performs an equi (inner) join
on the common column, here 'Id' '''
pd.merge(df1,df2, how='inner')
Output:
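Since the rendered output is not reproduced here, below is a minimal runnable sketch of the same inner merge (variable names like `merged` are my own); the result keeps only the four Ids present in both frames, so Vishakha (Id 5) is dropped:

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Name': ['Vivek', 'Rahul', 'Gunjan', 'Ankit', 'Vishakha'],
                    'Age': [30, 24, 32, 32, 28]})
df2 = pd.DataFrame({'Id': [1, 2, 3, 4],
                    'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune'],
                    'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']})

# inner merge on the shared 'Id' column: 4 rows, 5 columns
merged = pd.merge(df1, df2, how='inner')
print(merged)
```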

merge() when common columns are not present

d1 = {'Id': [1, 2, 3, 4, 5],
  	'Name':['Vivek', 'Rahul', 'Gunjan', 'Ankit','Vishakha'],
  	'Age':[27, 24, 22, 32, 28],}
d2 = {'EMPNO': [1, 2, 3, 4],
  	'Address':['Delhi', 'Gurgaon', 'Noida', 'Pune'],
  	'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
pd.merge(df1,df2, how='inner')
Output:
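The call fails because pandas finds no common column to infer a join key from. A small sketch that surfaces the actual error (the `try`/`except` wrapper is mine, added just to capture the message):

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Name': ['Vivek', 'Rahul', 'Gunjan', 'Ankit', 'Vishakha']})
df2 = pd.DataFrame({'EMPNO': [1, 2, 3, 4],
                    'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune']})

# no shared column name, so pandas cannot infer a join key
try:
    pd.merge(df1, df2, how='inner')
    error_message = None
except pd.errors.MergeError as exc:
    error_message = str(exc)
print(error_message)
```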

How to fix the error?

To fix the merge error, we can use the pandas arguments 'left_on' and 'right_on' to tell pandas explicitly which key columns to merge the data frames on; everything else remains the same.


pd.merge(df1,df2,left_on='Id',right_on='EMPNO',how='inner')
Output:
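One thing to note: when merging on differently named keys, both key columns survive in the result. A sketch of one common cleanup, dropping the redundant column afterwards (the `merged` name and the drop step are my own):

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Name': ['Vivek', 'Rahul', 'Gunjan', 'Ankit', 'Vishakha']})
df2 = pd.DataFrame({'EMPNO': [1, 2, 3, 4],
                    'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune']})

merged = pd.merge(df1, df2, left_on='Id', right_on='EMPNO', how='inner')
# both key columns ('Id' and 'EMPNO') are kept; drop the redundant one
merged = merged.drop(columns='EMPNO')
print(merged)
```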

Till now we have performed only the inner (natural/equi) join. Let's visit the other types of joins as well using merge(), going back to the original df1 and df2 that share the Id column; we only need to change the value of the how argument and nothing else.

Left merge

'''matching and non-matching records from the left
data frame (df1) are present in the result'''
pd.merge(df1,df2,how='left')
Output:
Right merge
pd.merge(df1,df2,how='right')
Output:
Outer merge
# outer join works the same as a database FULL OUTER JOIN
pd.merge(df1,df2,how='outer')
Output:
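One related merge() feature worth knowing: the indicator argument tags each result row with where it came from, which is handy for debugging outer joins. A small sketch (using the original frames; `merged` is my own name):

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Name': ['Vivek', 'Rahul', 'Gunjan', 'Ankit', 'Vishakha']})
df2 = pd.DataFrame({'Id': [1, 2, 3, 4],
                    'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(df1, df2, how='outer', indicator=True)
print(merged[['Id', '_merge']])
```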
That's all with merge(); in the next section we will discuss the join() operation.

join()
join() performs join operations on the basis of the index values of the data frame rows.
  
import pandas as pd
d1 = {'key': [0, 1, 5, 3, 4, 2],
	'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']}
d2 = {'key': [0, 1, 2],
	'B': ['B0', 'B1', 'B2']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
perform join operation on data frame indexes


df1.join(df2, lsuffix='_l', rsuffix='_r')

Output:
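To make the index alignment visible without the screenshot: df1 has indexes 0 to 5 while df2 only has 0 to 2, so B is NaN for the last three rows. A runnable sketch (the `joined` name is mine):

```python
import pandas as pd

df1 = pd.DataFrame({'key': [0, 1, 5, 3, 4, 2],
                    'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']})
df2 = pd.DataFrame({'key': [0, 1, 2],
                    'B': ['B0', 'B1', 'B2']})

# join() aligns on the row index (0..5 vs 0..2), not on the 'key' values
joined = df1.join(df2, lsuffix='_l', rsuffix='_r')
print(joined)
```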
What if we don't mention the suffixes? Since both data frames contain a 'key' column, join() cannot resolve the name clash and raises a ValueError; let's see how the function behaves.

 
df1.join(df2)

Output:
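A sketch that captures the error message (the `try`/`except` wrapper is mine, added only to show what pandas reports):

```python
import pandas as pd

df1 = pd.DataFrame({'key': [0, 1, 5, 3, 4, 2],
                    'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']})
df2 = pd.DataFrame({'key': [0, 1, 2],
                    'B': ['B0', 'B1', 'B2']})

# the shared 'key' column collides, and no suffixes are given to resolve it
try:
    df1.join(df2)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```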

create data frames with different indexes
 
d1 = pd.DataFrame({'key': [0, 1, 5, 3, 4, 2],
	 'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']},
  	 index=[0,1,2,3,4,5])

d2 = pd.DataFrame({'key': [0, 1, 2],
 	'B': ['B0', 'B1', 'B2']},
    index=[6,7,8])

The default join strategy of join() is a left outer join.
Left Join
d1.join(d2,lsuffix='_l',rsuffix='_r',how='left')
#OR
d1.join(d2,lsuffix='_l',rsuffix='_r')
Output:
Equi Join
  
d1.join(d2,lsuffix='_l',rsuffix='_r',how='inner')
Output: an empty data frame, as the indexes have no values in common.
Full Outer Join 
d1.join(d2,lsuffix='_l',rsuffix='_r',how='outer')
Output:
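Because the indexes (0 to 5 and 6 to 8) are disjoint, the outer join simply stacks every row from both frames: 9 rows, with NaN filling the side that has no match. A runnable sketch (the `joined` name is mine):

```python
import pandas as pd

d1 = pd.DataFrame({'key': [0, 1, 5, 3, 4, 2],
                   'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']},
                  index=[0, 1, 2, 3, 4, 5])
d2 = pd.DataFrame({'key': [0, 1, 2],
                   'B': ['B0', 'B1', 'B2']},
                  index=[6, 7, 8])

# disjoint indexes: the outer join keeps every row from both frames
joined = d1.join(d2, lsuffix='_l', rsuffix='_r', how='outer')
print(joined)
```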
perform join operation on common columns
Data frames can be joined on columns as well, but since join() works on indexes, we need to convert the join key into the index and then perform the join; the rest is similar.

# let's say the data frames have different column names to join on
df1 = pd.DataFrame({'key1': ['K0', 'K1', 'K5', 'K3', 'K4', 'K2'],
	'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']},
    index=[0,1,2,3,4,5])
df2 = pd.DataFrame({'key2': ['K0', 'K1', 'K2'],
	'B': ['B0', 'B1', 'B2']},
    index=[6,7,8])

Left Join

Now the join will be performed on the key columns.

df1.set_index('key1').join(df2.set_index('key2'))
Output:
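One detail of this pattern: set_index() moves the key into the index, so it is no longer a regular column in the result. If you want it back as a column, a reset_index() afterwards does the trick. A sketch (the `joined` name and the rename to 'key' are my own choices):

```python
import pandas as pd

df1 = pd.DataFrame({'key1': ['K0', 'K1', 'K5', 'K3', 'K4', 'K2'],
                    'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']})
df2 = pd.DataFrame({'key2': ['K0', 'K1', 'K2'],
                    'B': ['B0', 'B1', 'B2']})

# move the key columns into the index, join, then restore the key as a column
joined = df1.set_index('key1').join(df2.set_index('key2'))
joined = joined.reset_index().rename(columns={'key1': 'key'})
print(joined)
```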
Outer Join

df1.set_index('key1').join(df2.set_index('key2'),how='outer')

Output:

To summarize, topics covered in the blog are listed below:
  • join operation using merge() and join()
  • difference between merge() and join().
That's all about join operations using the Python Pandas API. In the normal course merge() is the most commonly used method to perform a join, but it is good to know the other methods as well.
Thanks for reading my blog and supporting the content. 
Please subscribe and follow me on Blogspot for upcoming content. A word of appreciation always helps to keep up the spirit, and a healthy knowledge network helps us grow.
Share the Knowledge.


