Ingest Excel Data

March 08, 2022

Data can be in any format, recently I have got chance to work on excel data sets. However we have pandas as very efficient library to work with different types of datasets but, pandas degrades performance whenever data size goes beyond MBs to GBs. For efficient processing of GBs datasets parallel computing was designed and Spark shines here. And we have a library named com.crealytics:spark-excel_xxx,this package allows querying Excel spreadsheets as Spark Data Frames and leverage the parallel computing infrastructure.

The Objective of this article is to understand the usage of spark-excel library with python version of spark or Pyspark.

Connect to spark (standalone cluster)

import pyspark
from pyspark.sql import SparkSession
spark=SparkSession.builder \
	.appName('Spark_DB') 
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
    .getOrCreate()

com.crealytics:spark-excel_2.11:0.12.2 is the creaytics spark-excel package used for handling excel data sets in python version of spark.

Import Data sets

emp_df=spark.read \
    .format("com.crealytics.spark.excel") \
    .option("dataAddress", "'Employees'!A1") \
    .option("useHeader", "true") \
    .load("D:\python_coding\pyspark_tutorial\demo_excel.xlsx")

emp_df.show(10)

dept_df=spark.read \
	.format("com.crealytics.spark.excel") \
    .option("dataAddress", "'Departments'!A1") \
    .option("useHeader", "true") \
    .load("D:\python_coding\pyspark_tutorial\demo_excel.xlsx")
dept_df.show(10)

region_df=spark.read \ .format("com.crealytics.spark.excel") \ .option("dataAddress", "'Regions'!F5:G9") \ .option("useHeader", "true") \ .load("D:\python_coding\pyspark_tutorial\demo_excel.xlsx") region_df.show(10)
Write Data Frame output to Excel file

 #join emp and dept datasets
emp_dept_df= emp_df.join(dept_df,emp_df.DEPARTMENT_ID==dept_df.DEPARTMENT_ID,how='inner')
emp_dept_df.show(15)

#write emp_dept_df to excel using write method with specified format
emp_dept_df.write.format("com.crealytics.spark.excel") \
    .option("dataAddress", "emp_dept [#All]") \
    .option("useHeader", "true") \
    .mode("append") \
    .save("D:\python_coding\pyspark_tutorial\output_demo.xlsx")


region_df.write.format("com.crealytics.spark.excel") \
    .option("dataAddress", "region_details [#All]") \
    .option("useHeader", "true") \
    .mode("append") \
    .save("D:\python_coding\pyspark_tutorial\output_demo.xlsx")

Github link for Datasets and Notebook: https://github.com/viv07/python_ETL/tree/master/spark-excel

To summarize, topics covered in blog as listed below:

- use of crealytics spark-excel library

- read/write excel data sets

That's all with how to process Excel datasets with python version spark- Pyspark.

Thanks for reading my blog and supporting the content.

Please subscribe and follow me on blogspot for upcoming contents. Word of appreciation always helps to keep up the spirit and a healthy knowledge network helps us grow.

Share the Knowledge.

Search This Blog

Let's Code with Vivek

Split Datasets

Ingest Excel Data

Comments

Post a Comment

Popular posts from this blog

Can Julia compete PySpark? A Data Comparison

Split Datasets

Spark Window Functions