Append Datasets

In the Data universe, Joins and Unions are the most critical and frequently performed operations. In my experience, almost every other operation is either a join or a union. As joins are inevitable, so are unions. In a previous article we covered how joins work in Pandas.

Link to article: 
https://letscodewithvivek.blogspot.com/2021/12/python-joins.html


The scope of this article is to understand how the concat() method helps us achieve the union of data frames.

concat()
concat() concatenates pandas objects along a particular axis, with optional set logic along the other axes. First, create two data frames to understand how the concat() method works.

import pandas as pd

#two sample data frames; index labels 4 and 5 overlap
df1 = pd.DataFrame({'Name': ['Vivek', 'Amy', 'Vishakha', 'Alice', 'Ayoung'],
                    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
                    'Marks_scored': [98, 90, 87, 69, 78]},
                   index=[1, 2, 3, 4, 5])

df2 = pd.DataFrame({'Name': ['Billy', 'Rahul', 'Bran', 'Aniket', 'Betty'],
                    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
                    'Marks_scored': [89, 80, 79, 97, 88]},
                   index=[4, 5, 6, 7, 8])

df1.head()
df2.head()
df1 and df2 output:
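Both frames have five rows; note that index labels 4 and 5 appear in both, which matters for the examples below.

df1:
       Name subject_id  Marks_scored
1     Vivek       sub1            98
2       Amy       sub2            90
3  Vishakha       sub4            87
4     Alice       sub6            69
5    Ayoung       sub5            78

df2:
     Name subject_id  Marks_scored
4   Billy       sub2            89
5   Rahul       sub4            80
6    Bran       sub3            79
7  Aniket       sub6            97
8   Betty       sub5            88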
concat data frames on axis=0 to perform union

pd.concat([df1, df2])
#or, equivalently, since axis=0 is the default
pd.concat([df1, df2], axis=0)
#both calls work exactly the same way
Output:
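The ten rows are simply stacked, df2 beneath df1:

       Name subject_id  Marks_scored
1     Vivek       sub1            98
2       Amy       sub2            90
3  Vishakha       sub4            87
4     Alice       sub6            69
5    Ayoung       sub5            78
4     Billy       sub2            89
5     Rahul       sub4            80
6      Bran       sub3            79
7    Aniket       sub6            97
8     Betty       sub5            88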

Observe the output: the original indexes are preserved, so labels 4 and 5 now appear twice.
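Duplicate labels can surprise you later; a label-based lookup on the combined frame now returns multiple rows:

pd.concat([df1, df2]).loc[4]
#returns both rows labelled 4: Alice (from df1) and Billy (from df2)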
pd.concat([df1,df2],ignore_index=True)
Output:
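Compare the leftmost column with the previous result:

       Name subject_id  Marks_scored
0     Vivek       sub1            98
1       Amy       sub2            90
2  Vishakha       sub4            87
3     Alice       sub6            69
4    Ayoung       sub5            78
5     Billy       sub2            89
6     Rahul       sub4            80
7      Bran       sub3            79
8    Aniket       sub6            97
9     Betty       sub5            88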

With the ignore_index argument, the indexes are regenerated in a sequential manner.
We can also assign keys to the two data frames (helpful with overlapping indexes) to identify which frame each row came from and make the output easier to understand.

pd.concat([df1,df2],keys=['x','y'])
Output:
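Each row is now addressed by a two-level index, the key plus the original label:

          Name subject_id  Marks_scored
x 1      Vivek       sub1            98
  2        Amy       sub2            90
  3   Vishakha       sub4            87
  4      Alice       sub6            69
  5     Ayoung       sub5            78
y 4      Billy       sub2            89
  5      Rahul       sub4            80
  6       Bran       sub3            79
  7     Aniket       sub6            97
  8      Betty       sub5            88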
when axis=1

concat() with axis=1 behaves like a join operation: the data frames are joined on the basis of their indexes.


pd.concat([df1,df2],axis=1)

Output:
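Rows are aligned on the union of the indexes (1 through 8), and missing positions are filled with NaN:

       Name subject_id  Marks_scored    Name subject_id  Marks_scored
1     Vivek       sub1          98.0     NaN        NaN           NaN
2       Amy       sub2          90.0     NaN        NaN           NaN
3  Vishakha       sub4          87.0     NaN        NaN           NaN
4     Alice       sub6          69.0   Billy       sub2          89.0
5    Ayoung       sub5          78.0   Rahul       sub4          80.0
6       NaN        NaN           NaN    Bran       sub3          79.0
7       NaN        NaN           NaN  Aniket       sub6          97.0
8       NaN        NaN           NaN   Betty       sub5          88.0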
#Note: the default behavior is a full outer join

#mention the join type to perform an inner join instead
pd.concat([df1, df2], join='inner', axis=1)

Output:
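Only the index labels common to both frames (4 and 5) survive:

     Name subject_id  Marks_scored   Name subject_id  Marks_scored
4   Alice       sub6            69  Billy       sub2            89
5  Ayoung       sub5            78  Rahul       sub4            80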

concat unequal shape data frames
An important consideration while performing unions in SQL is that the datasets must be identical in shape and their datatypes must match. Pandas takes care of both aspects for us.

  
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})
df2 = pd.DataFrame({'A': [4, 5]})

#Note: pandas fills missing values with NaN by default; since NaN is a float, the affected column gets converted to float

  
print(df1.dtypes)
print(df2.dtypes)

Output:
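Both frames start out with int64 columns:

A    int64
B    int64
dtype: object
A    int64
dtype: object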

df = pd.concat([df1, df2], ignore_index=True)
print(df.dtypes)
df.head()

Output:
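Column A stays int64, while column B is promoted to float64 and the rows coming from df2 hold NaN in B:

A      int64
B    float64
dtype: object

   A    B
0  1  1.0
1  2  2.0
2  3  3.0
3  4  NaN
4  5  NaN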
#Note: pandas altered the datatype of column B to float to accommodate the NaN values, as NaN is a float.

To summarize, the topics covered in this blog are listed below:

- pandas union operation using concat
- axis variation in concat to perform joins
- concat unequal shape data frames
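One closing note: concat keeps duplicate rows, so it behaves like SQL's UNION ALL rather than UNION. A minimal sketch to get UNION semantics is to chain drop_duplicates():

pd.concat([df1, df2], ignore_index=True).drop_duplicates()
#drops fully duplicated rows, mimicking SQL UNION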

That's all for the Python union operation using the Pandas API.

Thanks for reading my blog and supporting the content.
Please subscribe and follow me on Blogspot for upcoming content. Words of appreciation always help to keep up the spirit, and a healthy knowledge network helps us grow.
Share the Knowledge.






