Split Datasets

The objective of this article is to transform a dataset by splitting a delimited column into multiple rows using the explode() method. The scope of this article is to understand how to unnest, or explode, a dataset using the parallel processing framework PySpark and the Python native library Pandas.


The dataset looks like this:

dept,name
10,vivek#ruby#aniket
20,rahul#john#amy
30,shankar#jagdish
40,
50,yug#alex#alexa

Pandas explode()
import pandas as pd
# Read the sample dataset
pan_df = pd.read_csv(r'explode.csv')
# Split the delimited 'name' column into lists, then explode each list into its own row
df_exp = pan_df.assign(name=pan_df['name'].str.split('#')).explode('name')
df_exp
Output:
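Running this against the sample dataset should produce roughly the following (the repeated index values are expected, since explode() preserves the original row index):

   dept     name
0    10    vivek
0    10     ruby
0    10   aniket
1    20    rahul
1    20     john
1    20      amy
2    30  shankar
2    30  jagdish
3    40      NaN
4    50      yug
4    50     alex
4    50    alexa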

The dataset is transformed successfully, and we are able to create new rows from the nested values. The Pandas way of exploding is simple, crisp, and straightforward unless the dataset is complex.
The next section of this article covers the PySpark way of exploding, or unnesting, a dataset.

PySpark explode()

Import libraries and connect to Spark
from pyspark.sql import SparkSession
# Start (or reuse) a Spark session for this job
spark = SparkSession.builder.appName('explode_data').getOrCreate()
Load the dataset
from pyspark.sql.functions import explode, split, explode_outer
# Read the CSV with a header row, letting Spark infer the column types
exp_df = spark.read.csv(r'explode.csv', header=True, inferSchema=True)
explode()
# Split the delimited 'name' column and emit one row per element
e_df = exp_df.withColumn('name', explode(split('name', '#')))
e_df.show(15)
Output:
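Given the sample dataset, show() should print roughly the following (note that dept=40 does not appear):

+----+-------+
|dept|   name|
+----+-------+
|  10|  vivek|
|  10|   ruby|
|  10| aniket|
|  20|  rahul|
|  20|   john|
|  20|    amy|
|  30|shankar|
|  30|jagdish|
|  50|    yug|
|  50|   alex|
|  50|  alexa|
+----+-------+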
The dataset is transformed successfully. Now let's compare the PySpark output with the Pandas output. Observe carefully: for dept=40 in our dataset, no employee is assigned. Pandas handled this scenario by assigning a null (NaN) value in the data frame, but in PySpark dept=40 is not present in the final output at all, because explode() drops rows whose array is null or empty. If we want the output to retain this dept value as well, PySpark provides another built-in method called explode_outer(). Let's see how explode_outer() works.

PySpark explode_outer()
# explode_outer() emits a row with a null value instead of dropping it when the array is null or empty
e2_df = exp_df.withColumn('name', explode_outer(split('name', '#')))
e2_df.show(15)
Output:
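This time the output should look roughly as below; depending on your Spark version the missing value prints as null or NULL:

+----+-------+
|dept|   name|
+----+-------+
|  10|  vivek|
|  10|   ruby|
|  10| aniket|
|  20|  rahul|
|  20|   john|
|  20|    amy|
|  30|shankar|
|  30|jagdish|
|  40|   null|
|  50|    yug|
|  50|   alex|
|  50|  alexa|
+----+-------+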

The data is transformed successfully, and dept=40, with no employee assigned, is also available in the output. So explode_outer() essentially adds an extra capability on top of explode(); the two methods have their own use cases, and explode_outer() can loosely be construed as a kind of outer join (though not exactly).

To summarize, the topics covered in this blog are listed below:
  • transforming a delimited column into rows
  • the Pandas explode() method
  • the PySpark explode() and explode_outer() methods
That's all for the explode operation on datasets using the PySpark and Pandas APIs. Thanks for reading my blog and supporting the content.
Please subscribe and follow me on Blogspot for upcoming articles. A word of appreciation always helps keep up the spirit, and a healthy knowledge network helps us all grow.
Share the Knowledge.
