
Merge Datasets

In the data universe, joins are among the most critical and frequently performed operations. With the Python Pandas API we perform the same kind of operations while working on a data science algorithm or an ETL (Extract, Transform and Load) project.


Pandas provides two methods for joins: merge() and join(). They work in a similar way, but internally they differ: merge() joins on columns by default, while join() joins on the index. In this blog I have tried my best to list out the differences in the behaviour of these two methods.

merge()

merge() performs the join operation on common columns by default.


import pandas as pd
d1 = {'Id': [1, 2, 3, 4, 5],
	'Name':['Vivek', 'Rahul', 'Gunjan', 'Ankit','Vishakha'],
	'Age':[30, 24, 32, 32, 28],}
d2 = {'Id': [1, 2, 3, 4],
	'Address':['Delhi', 'Gurgaon', 'Noida', 'Pune'],
    'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}

df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)

merge() when common columns are present in data frames

'''pandas merge by default performs an equi (inner) join
on the common column, here 'Id' '''
pd.merge(df1,df2, how='inner')
Output:
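Since the rendered output is not reproduced here, below is a minimal runnable sketch of the same inner merge (variable names like `merged` are my own); the result keeps only the four Ids present in both frames, so Vishakha (Id 5) is dropped:

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Name': ['Vivek', 'Rahul', 'Gunjan', 'Ankit', 'Vishakha'],
                    'Age': [30, 24, 32, 32, 28]})
df2 = pd.DataFrame({'Id': [1, 2, 3, 4],
                    'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune'],
                    'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']})

# inner merge on the shared 'Id' column: 4 rows, 5 columns
merged = pd.merge(df1, df2, how='inner')
print(merged)
```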

merge() when common columns are not present

d1 = {'Id': [1, 2, 3, 4, 5],
  	'Name':['Vivek', 'Rahul', 'Gunjan', 'Ankit','Vishakha'],
  	'Age':[27, 24, 22, 32, 28],}
d2 = {'EMPNO': [1, 2, 3, 4],
  	'Address':['Delhi', 'Gurgaon', 'Noida', 'Pune'],
  	'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
pd.merge(df1,df2, how='inner')
Output:
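The call fails because pandas finds no common column to infer a join key from. A small sketch that surfaces the actual error (the `try`/`except` wrapper is mine, added just to capture the message):

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Name': ['Vivek', 'Rahul', 'Gunjan', 'Ankit', 'Vishakha']})
df2 = pd.DataFrame({'EMPNO': [1, 2, 3, 4],
                    'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune']})

# no shared column name, so pandas cannot infer a join key
try:
    pd.merge(df1, df2, how='inner')
    error_message = None
except pd.errors.MergeError as exc:
    error_message = str(exc)
print(error_message)
```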

How to fix the error?

To fix the merge error, we can use the pandas arguments 'left_on' and 'right_on' to tell pandas explicitly which key columns to merge the data frames on; everything else remains the same.


pd.merge(df1,df2,left_on='Id',right_on='EMPNO',how='inner')
Output:
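One thing to note: when merging on differently named keys, both key columns survive in the result. A sketch of one common cleanup, dropping the redundant column afterwards (the `merged` name and the drop step are my own):

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Name': ['Vivek', 'Rahul', 'Gunjan', 'Ankit', 'Vishakha']})
df2 = pd.DataFrame({'EMPNO': [1, 2, 3, 4],
                    'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune']})

merged = pd.merge(df1, df2, left_on='Id', right_on='EMPNO', how='inner')
# both key columns ('Id' and 'EMPNO') are kept; drop the redundant one
merged = merged.drop(columns='EMPNO')
print(merged)
```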

Till now we have performed only the inner (natural/equi) join. Let's visit the other types of joins as well using merge(), going back to the original df1 and df2 that share the Id column; we only need to change the value of the how argument and nothing else.

Left merge

'''matching and non-matching records from the left
data frame (df1) are present in the result'''
pd.merge(df1,df2,how='left')
Output:
Right merge
pd.merge(df1,df2,how='right')
Output:
Outer merge
# outer join works the same as a database FULL OUTER JOIN
pd.merge(df1,df2,how='outer')
Output:
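One related merge() feature worth knowing: the indicator argument tags each result row with where it came from, which is handy for debugging outer joins. A small sketch (using the original frames; `merged` is my own name):

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Name': ['Vivek', 'Rahul', 'Gunjan', 'Ankit', 'Vishakha']})
df2 = pd.DataFrame({'Id': [1, 2, 3, 4],
                    'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(df1, df2, how='outer', indicator=True)
print(merged[['Id', '_merge']])
```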
That's all with merge(); in the next section we will discuss the join() operation.

join()
join() performs join operations on the basis of the index values of the data frame rows.
  
import pandas as pd
d1 = {'key': [0, 1, 5, 3, 4, 2],
	'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']}
d2 = {'key': [0, 1, 2],
	'B': ['B0', 'B1', 'B2']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
perform join operation on data frame indexes


df1.join(df2, lsuffix='_l', rsuffix='_r')

Output:
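To make the index alignment visible without the screenshot: df1 has indexes 0 to 5 while df2 only has 0 to 2, so B is NaN for the last three rows. A runnable sketch (the `joined` name is mine):

```python
import pandas as pd

df1 = pd.DataFrame({'key': [0, 1, 5, 3, 4, 2],
                    'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']})
df2 = pd.DataFrame({'key': [0, 1, 2],
                    'B': ['B0', 'B1', 'B2']})

# join() aligns on the row index (0..5 vs 0..2), not on the 'key' values
joined = df1.join(df2, lsuffix='_l', rsuffix='_r')
print(joined)
```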
What if we don't mention the suffixes? Since both data frames contain a 'key' column, join() cannot resolve the name clash and raises a ValueError; let's see how the function behaves.

 
df1.join(df2)

Output:
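A sketch that captures the error message (the `try`/`except` wrapper is mine, added only to show what pandas reports):

```python
import pandas as pd

df1 = pd.DataFrame({'key': [0, 1, 5, 3, 4, 2],
                    'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']})
df2 = pd.DataFrame({'key': [0, 1, 2],
                    'B': ['B0', 'B1', 'B2']})

# the shared 'key' column collides, and no suffixes are given to resolve it
try:
    df1.join(df2)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```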

create data frames with different indexes
 
d1 = pd.DataFrame({'key': [0, 1, 5, 3, 4, 2],
	 'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']},
  	 index=[0,1,2,3,4,5])

d2 = pd.DataFrame({'key': [0, 1, 2],
 	'B': ['B0', 'B1', 'B2']},
    index=[6,7,8])

The default join strategy of join() is a left outer join.
Left Join
d1.join(d2,lsuffix='_l',rsuffix='_r',how='left')
#OR
d1.join(d2,lsuffix='_l',rsuffix='_r')
Output:
Equi Join
  
d1.join(d2,lsuffix='_l',rsuffix='_r',how='inner')
Output: an empty data frame, as the indexes have no values in common.
Full Outer Join 
d1.join(d2,lsuffix='_l',rsuffix='_r',how='outer')
Output:
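Because the indexes (0 to 5 and 6 to 8) are disjoint, the outer join simply stacks every row from both frames: 9 rows, with NaN filling the side that has no match. A runnable sketch (the `joined` name is mine):

```python
import pandas as pd

d1 = pd.DataFrame({'key': [0, 1, 5, 3, 4, 2],
                   'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']},
                  index=[0, 1, 2, 3, 4, 5])
d2 = pd.DataFrame({'key': [0, 1, 2],
                   'B': ['B0', 'B1', 'B2']},
                  index=[6, 7, 8])

# disjoint indexes: the outer join keeps every row from both frames
joined = d1.join(d2, lsuffix='_l', rsuffix='_r', how='outer')
print(joined)
```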
perform join operation on common columns
Data frames can be joined on columns as well, but since join() works on indexes, we need to convert the join key into the index and then perform the join; the rest is similar.

# let's say the data frames have different column names to join on
df1 = pd.DataFrame({'key1': ['K0', 'K1', 'K5', 'K3', 'K4', 'K2'],
	'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']},
    index=[0,1,2,3,4,5])
df2 = pd.DataFrame({'key2': ['K0', 'K1', 'K2'],
	'B': ['B0', 'B1', 'B2']},
    index=[6,7,8])

Left Join

Now the join will be performed on the key columns.

df1.set_index('key1').join(df2.set_index('key2'))
Output:
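One detail of this pattern: set_index() moves the key into the index, so it is no longer a regular column in the result. If you want it back as a column, a reset_index() afterwards does the trick. A sketch (the `joined` name and the rename to 'key' are my own choices):

```python
import pandas as pd

df1 = pd.DataFrame({'key1': ['K0', 'K1', 'K5', 'K3', 'K4', 'K2'],
                    'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']})
df2 = pd.DataFrame({'key2': ['K0', 'K1', 'K2'],
                    'B': ['B0', 'B1', 'B2']})

# move the key columns into the index, join, then restore the key as a column
joined = df1.set_index('key1').join(df2.set_index('key2'))
joined = joined.reset_index().rename(columns={'key1': 'key'})
print(joined)
```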
Outer Join

df1.set_index('key1').join(df2.set_index('key2'),how='outer')

Output:

To summarize, topics covered in the blog are listed below:
  • join operation using merge() and join()
  • difference between merge() and join().
That's all about join operations using the Python Pandas API. In the normal course merge() is the most commonly used method to perform a join, but it is good to know the other methods as well.
Thanks for reading my blog and supporting the content. 
Please subscribe and follow me on Blogspot for upcoming content. A word of appreciation always helps to keep up the spirit, and a healthy knowledge network helps us grow.
Share the Knowledge.


