In the Data universe, Joins and Unions are the most critical and frequently performed operations. In my experience, almost every other operation is either a join or a union. As joins are inevitable so do unions. In previous article we have covered how joins work in Pandas.
Link to article:
https://letscodewithvivek.blogspot.com/2021/12/python-joins.html
The scope of this article is to understand about how concat() methods helps us achieve the union of data frames.
concat()
Concatenate or concat pandas objects along a particular axis with optional set logic along the other axes. create two data frames to understand how concat method works.
concat data frames on axis=0, default operation (union)
import pandas as pd
df1 = pd.DataFrame('Name': ['Vivek', 'Amy', 'Vishakha', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
df2 = pd.DataFrame({'Name': ['Billy', 'Rahul', 'Bran', 'Aniket', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[4,5,6,7,8])
df1.head()
df2.head()
df1 Output:df2 Output:concat data frames on axis=0 to perform union
pd.concat([df1,df2])
#or
pd.concat([df1,df2], axis=0)
#both work exactly same way
Output:
Observe the output, the indexes are preserved.pd.concat([df1,df2],ignore_index=True)
Output:
with ignore_index argument, indexes appear in sequential manner.
we can also define the keys for both data frames (overlapping indexes) to identify their outputs and make them more understandable.
pd.concat([df1,df2],keys=['x','y'])
Output:
when axis=1
concat() with axis=1 behaves like a join operation, perform join operation on basis of common indexes.
pd.concat([df1,df2],axis=1)
Output:
#Note: default behavior is full outer join
#mention join type for inner join
pd.concat([m1,m2],join='inner',axis=1)
Output:
concat unequal shape data frames Important things to consider while performing unions in SQL is that dataset must be identical in terms of shape and datatypes must match. Pandas take care of both aspects.
df1 = pd.DataFrame({'A':[1,2,3], 'B':[1,2,3]})
df2 = pd.DataFrame({'A':[4,5]})
#Note: pandas by default put a Nan for missing values, as Nan is of float type so the series value is converted to a float number
print(df1.dtypes)
print(df2.dtypes)
Output:
df= pd.concat([df1,df2],ignore_index=True)
print(df.dtypes)
df.head()
Output:
#Note: Datatype of series or column B altered by pandas to float to accommodate NaN, as NaN are float in nature. To summarize, topics covered in blog as listed below:
- pandas union operation using concat- axis variation in concat to perform joins
- concat unequal shape data framesThat's all with the Python union operation using Pandas API.
Thanks for reading my blog and supporting the content.
Please subscribe and follow me on blogspot for upcoming contents. Word of appreciation always helps to keep up the spirit and a healthy knowledge network helps us grow.
Share the Knowledge.
Comments
Post a Comment