Posts

Split Datasets

The objective of this article is to transform nested values in a dataset into new rows using the explode() method. The scope is to understand how to unnest, or explode, a dataset using the parallel-processing framework PySpark and the Python native library Pandas.

The dataset looks like this:

dept,name
10,vivek#ruby#aniket
20,rahul#john#amy
30,shankar#jagdish
40,
50,yug#alex#alexa

Pandas explode()

import pandas as pd

pan_df = pd.read_csv(r'explode.csv')
df_exp = pan_df.assign(name=pan_df['name'].str.split('#')).explode('name')
df_exp

Output: the dataset is transformed successfully, and we are able to create new rows from the nested values. The Pandas way of exploding is simple, crisp and straightforward unless the dataset is complex. The next section of this article covers the PySpark way of exploding, or unnesting, a dataset.

PySpark explode()

Import libraries and connect to Spark:

from pyspark import SparkContext, SparkConf
import pyspark
from pyspark.sql import SparkSession
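The preview cuts off at the Spark imports. A minimal sketch of how the PySpark side might continue, assuming the same explode.csv file; the app name and the choice of explode_outer (to keep the empty dept 40 row) are my assumptions, not from the original:

# Sketch only: file name, app name and explode_outer are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode_outer

spark = SparkSession.builder.appName('explode_demo').getOrCreate()

spark_df = spark.read.csv(r'explode.csv', header=True, inferSchema=True)

# split('#') turns the delimited string into an array of names;
# explode_outer emits one row per element and keeps null names (dept 40).
exploded_df = spark_df.withColumn('name', explode_outer(split('name', '#')))
exploded_df.show()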

pySQL

The objective of this article is to understand how to perform data manipulation on Pandas DataFrames using SQL, with the pandasql library.

What is pandasql?

As per the documentation, pandasql allows us to query Pandas DataFrames using SQL syntax.

Installation of pandasql

The library can be installed using either of the two methods below; both use pip:

Using a terminal:

pip install -U pandasql

Using a Jupyter notebook:

!pip install -U pandasql

Use Case

The main function used in pandasql is sqldf. sqldf accepts two arguments: a SQL query and the session environment variables (globals() or locals()). The session environment variables are optional and handled by Python itself; even if we do not provide them, we can still achieve the goal of querying a Pandas DataFrame with SQL.

Import the necessary libraries:

from pandasql import sqldf
import pandas as pd

Import the datasets:

emp_df = pd.read_csv(r'D:\python_coding\pandas_practice\emp.csv')
emp_df.head(10)

dept_df = pd.read_csv(r'…
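A minimal sketch of how sqldf is typically called; the inline DataFrame and its column names below are assumptions standing in for the truncated emp.csv:

# Sketch only: the DataFrame and its columns are assumptions.
from pandasql import sqldf
import pandas as pd

emp_df = pd.DataFrame({'empno': [7369, 7499, 7521],
                       'ename': ['SMITH', 'ALLEN', 'WARD'],
                       'deptno': [20, 30, 30]})

# First argument is the SQL query; the optional second argument is the
# session environment (globals() or locals()) where sqldf looks up emp_df.
result = sqldf('SELECT deptno, COUNT(*) AS emp_count FROM emp_df GROUP BY deptno', globals())
print(result)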

Can Julia compete with PySpark? A Data Comparison

The creators of the Julia language claim Julia to be very fast performance-wise, as it does not suffer from the two-language problem the way Python does: Julia is a compiled language, whereas Python is an amalgamation of compilation and interpretation. It would be interesting to dig deep into how both of these languages behave behind the scenes, but the objective of this blog is not to get into the theoretical details of the differences.

As a Data Engineer, my innate instinct is to understand how Julia behaves when it is bombarded with GBs or TBs of data. Since I am talking about GBs or TBs of data, I obviously cannot directly compare Julia with plain Python, or even with the rich Pandas library; as all of us know, the processing would never complete, as Python is quite slow at that scale. So the scope of this blog is to draw parallels between Julia and PySpark. I know for some this is unfair, but pardon me. The inspiration behind the blog is the Twitter podcast on Julia that took place in January 2022.

#Note: I ha…

Spark Window Functions

The objective of this article is to understand PySpark window functions. The blog does a comparative study of PySpark window functions and the analytical functions of relational database systems such as Oracle. Spark window functions operate on a group of rows (a frame, or partition) and return a single value for every input row.

To perform an operation on a group, we first need to partition the data using Window.partitionBy(), and for the row_number and rank functions we additionally need to order the partitioned data using the orderBy() clause.

Connect to Spark:

import pyspark
from pyspark.sql import SparkSession
print('modules imported')

spark = SparkSession.builder.appName('Spark_window_functions').getOrCreate()

Load the dataset:

emp_df = spark.read.csv(r'emp.csv', header=True, inferSchema=True)
emp_df.show(10)

Import the necessary libraries:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number, rank, dense_rank
from pyspark.sql import functions as F
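Continuing from the imports above, a minimal sketch of a window in action; the deptno and sal column names are assumptions about emp.csv, not from the original:

# Sketch only: deptno and sal are assumed column names in emp_df.
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number, rank, dense_rank

# Partition by department and order each partition by salary, highest first.
win = Window.partitionBy('deptno').orderBy(col('sal').desc())

# row_number gives a unique sequence per partition; rank and dense_rank
# handle ties differently (rank leaves gaps, dense_rank does not).
ranked_df = (emp_df
             .withColumn('row_num', row_number().over(win))
             .withColumn('rank', rank().over(win))
             .withColumn('dense_rank', dense_rank().over(win)))
ranked_df.show()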

Merge Datasets

In the data universe, joins are among the most critical and frequently performed operations. With the help of the Python Pandas API we perform similar operations while working on a data-science algorithm or any ETL (Extract, Transform and Load) project. The methods available for joins in Pandas are merge() and join(). merge and join work in a similar way, but internally they have some differences, and in this blog I have tried my best to list out the differences in the behaviour of these methods.

merge()

merge performs a join operation on common columns.

import pandas as pd

d1 = {'Id': [1, 2, 3, 4, 5],
      'Name': ['Vivek', 'Rahul', 'Gunjan', 'Ankit', 'Vishakha'],
      'Age': [30, 24, 32, 32, 28]}
d2 = {'Id': [1, 2, 3, 4],
      'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune'],
      'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']}

df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
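A minimal sketch of the merge itself, following the two DataFrames above; the join key Id comes from the excerpt, while the rest is an assumption about how the article proceeds:

# merge joins on the common Id column; an inner join is the default.
merged_df = pd.merge(df1, df2, on='Id')
print(merged_df)
# Id 5 (Vishakha) drops out because df2 has no matching key;
# pass how='left' to keep every row of df1 instead.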
