Can Julia Compete with PySpark? A Data Comparison

The creators of the Julia language claim that Julia is very fast, performance-wise, because it does not suffer from the two-language problem the way Python does: Julia is a compiled language, whereas Python is an amalgamation of compilation and interpretation. It would be interesting to dig deep into how both of these languages behave behind the scenes, but the objective of this blog is not to get into the theoretical details of those differences.

As a Data Engineer, my innate instinct is to understand how Julia behaves when it is bombarded with GBs or TBs of data. Since I am talking about GBs or TBs of data, I obviously cannot compare Julia straight away with Python or even the rich Pandas library; as all of us know, the processing would never complete because Python is quite slow at that scale. So the scope of this blog is to draw parallels between Julia and PySpark. I know for some this is unfair, but pardon me. The inspiration behind the blog is the Twitter podcast on Julia that took place in January 2022.

#Note: I have done this R&D on my personal laptop so that the performance of both languages can be measured on the same grounds.

My System configuration:

Processor: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz, 1.80 GHz
Installed RAM: 16.0 GB
System type: 64-bit operating system

For the demo I have used a 6.5 GB csv file, with Python 3.6, Spark 2.3.3, and Julia 1.7.1, all installed on my local system.
In this analysis no data manipulation has been performed, just basic read/write operations, to keep it simple and straightforward.

1. PySpark
from datetime import datetime
t1 = datetime.now()

#connect to spark session
import findspark
findspark.init(r'D:\spark-2.3.3-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
print('modules imported')
spark= SparkSession.builder.appName('BigData').getOrCreate()
print('app created')
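#optional sanity check (my addition, not in the original run): confirm the Spark build in use, 2.3.3 here
print(spark.version)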

#read the source dataset
sales_df= spark.read.csv(r"D:\python_coding\Sales Data\sales_data.csv",
                         inferSchema=True)
sales_df.show(10)

#write dataset to target csv file
sales_df.write.format('csv') \
        .option('header','true') \
        .save(r'D:\python_coding\Sales Data\spark_emp.csv',
              mode='overwrite')
t2 = datetime.now()
print(str((t2 - t1).total_seconds() * 1000) + ' milliseconds')

#Time taken by Pyspark for Read Write operation of 6.5GB csv file: 344340.066 milliseconds
Output:


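As an aside, if you want to see how the total time splits between the read and the write, a small helper around the same calls does the job. This is only a minimal sketch under the same source path as above; the time_step helper and the spark_emp_timed.csv output path are my own illustrative names, not part of the original benchmark. Also note that Spark evaluates lazily: with inferSchema=True the read does scan the file to infer types, but most of the data movement still happens at the write step.

from time import perf_counter

def time_step(label, fn):
    #run fn(), report the elapsed wall-clock time in milliseconds, and return its result
    start = perf_counter()
    result = fn()
    print(label + ': ' + str((perf_counter() - start) * 1000) + ' milliseconds')
    return result

#same source path as above; the output path below is hypothetical
sales_df = time_step('read', lambda: spark.read.csv(
    r"D:\python_coding\Sales Data\sales_data.csv", inferSchema=True))
time_step('write', lambda: sales_df.write.format('csv')
          .option('header', 'true')
          .save(r'D:\python_coding\Sales Data\spark_emp_timed.csv', mode='overwrite'))
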
2. Julia 
#import libraries
using CSV
using DataFrames
using Dates

d1=now()
#read csv file
sales=CSV.read("D:\\python_coding\\Sales Data\\sales_data.csv",DataFrame)
first(sales,10) #print first 10 rows of julia dataframe
#write dataframe as csv file
CSV.write("D:\\python_coding\\Sales Data\\julia_sale.csv.csv", sales)
d2=now()
print(d2-d1)
#Time taken by Julia for Read Write operation of 6.5GB csv file: 453396 milliseconds
Output:


The time taken by Julia to process the 6.5 GB file is around 453396 milliseconds, whereas the processing time for PySpark is 344340.066 milliseconds.
The difference is around 109,055.934 milliseconds, i.e. about 109 seconds or roughly 2 minutes, which seems quite good: Julia has almost approached the performance of the parallel computing framework PySpark.
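As a quick sanity check on those figures (plain Python arithmetic, not part of either benchmark):

julia_ms = 453396          #Julia read/write time from the run above
pyspark_ms = 344340.066    #PySpark read/write time from the run above
print(julia_ms - pyspark_ms)   #109055.934 ms, i.e. about 109 seconds
print(julia_ms / pyspark_ms)   #about 1.32, so Julia took roughly 32% longer on this run
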
And who knows what the future holds; some day Julia may become an alternative to Spark for processing big data. Anything is possible, and the possibilities are endless.
I hope I have made my point in a rational manner, with facts and figures, for this particular use case. In case I have missed anything, please share your feedback and I will be happy to include it.

To Summarize:
  • Software used: Python 3.6, Spark 2.3.3, and Julia 1.7.1.
  • Data set size: 6.5 GB.
Github link: 

#Note: I was unable to attach the 6.5 GB dataset due to its size, so a sample dataset is attached instead.

Please subscribe and follow me on Blogspot for upcoming articles. A word of appreciation always helps to keep up the spirit, and a healthy knowledge network helps us grow.
Share the Knowledge.

