Can Julia compete with PySpark? A Data Comparison
The creators of the Julia language claim that Julia is very fast. Performance-wise, it does not suffer from the two-language problem the way Python does: Julia is a compiled language, whereas Python is an amalgamation of compilation and interpretation. It would be interesting to dig deeper into how both languages behave behind the scenes, but the objective of this blog is not to get into the theoretical details of those differences.
As a Data Engineer, my innate instinct is to see how Julia behaves when it is bombarded with GBs or TBs of data. Since I am talking about data at that scale, I obviously cannot compare Julia straight away with plain Python or even the rich Pandas library; as we all know, the processing would never finish, because single-node Python is quite slow at this size. So the scope of this blog is to draw parallels between Julia and PySpark. I know some will find this unfair, but pardon me. The inspiration behind this blog is the Twitter podcast on Julia that took place in January 2022.
System type: 64-bit operating system.
from datetime import datetime

t1 = datetime.now()

# connect to a Spark session
import findspark
findspark.init(r'D:\spark-2.3.3-bin-hadoop2.7')

import pyspark
from pyspark.sql import SparkSession
print('modules imported')

spark = SparkSession.builder.appName('BigData').getOrCreate()
print('app created')

# read the source dataset
# header=True assumes the first row of the source file holds column names
sales_df = spark.read.csv(r"D:\python_coding\Sales Data\sales_data.csv",
                          inferSchema=True, header=True)
sales_df.show(10)

# write the dataset to the target csv file
sales_df.write.format('csv') \
    .option('header', 'true') \
    .save(r'D:\python_coding\Sales Data\spark_emp.csv',
          mode='overwrite')

t2 = datetime.now()
print(str((t2 - t1).total_seconds() * 1000) + ' milliseconds')
# Time taken by PySpark for the read/write of a 6.5GB csv file: 344340.066 milliseconds
Now the same read/write operation in Julia:

# import libraries
using CSV
using DataFrames
using Dates

d1 = now()

# read the csv file
sales = CSV.read("D:\\python_coding\\Sales Data\\sales_data.csv", DataFrame)
first(sales, 10)  # print the first 10 rows of the Julia dataframe

# write the dataframe as a csv file
CSV.write("D:\\python_coding\\Sales Data\\julia_sale.csv", sales)

d2 = now()
println(d2 - d1)
# Time taken by Julia for the read/write of a 6.5GB csv file: 453396 milliseconds
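One caveat on the Julia number: a single cold run like this also includes Julia's JIT compilation time for CSV.read and CSV.write, which inflates the measurement. Below is a minimal sketch (my addition, not part of the original benchmark run) that times the read and write steps separately, assuming the same file paths as above:

using CSV
using DataFrames
using Dates

src = "D:\\python_coding\\Sales Data\\sales_data.csv"
dst = "D:\\python_coding\\Sales Data\\julia_sale.csv"

# time the read step on its own (a first call still includes compilation)
t0 = now()
sales = CSV.read(src, DataFrame)
println("read: ", now() - t0)

# time the write step on its own
t0 = now()
CSV.write(dst, sales)
println("write: ", now() - t0)

Repeating the read in the same session would reveal how much of the 453 seconds is one-off compilation rather than actual parsing.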
Notes:
- Software used: Python 3.6, Spark 2.3.3, and Julia 1.7.1.
- Data set size: 6.5GB.
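Finally, one factor that could narrow the gap: PySpark parallelises work across cores, while the Julia run above parses on however many threads the session was started with (one, by default). Here is a hedged sketch of an explicitly multi-threaded read, assuming CSV.jl v0.9 or newer, where the ntasks keyword controls concurrent parsing:

# start Julia with multiple threads first, e.g.: julia -t 8
using CSV
using DataFrames

println("parsing with ", Threads.nthreads(), " thread(s)")

# ntasks defaults to Threads.nthreads() in CSV.jl v0.9+; it splits the file
# into chunks that are parsed concurrently
sales = CSV.read("D:\\python_coding\\Sales Data\\sales_data.csv", DataFrame;
                 ntasks = Threads.nthreads())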