
Split Datasets

The objective of this article is to transform a dataset from rows to columns using the explode() method. The scope is to understand how to unnest, or explode, a dataset using both the parallel-processing framework PySpark and the native Python library Pandas.

The dataset looks like this:

    dept,name
    10,vivek#ruby#aniket
    20,rahul#john#amy
    30,shankar#jagdish
    40,
    50,yug#alex#alexa

Pandas explode()

    import pandas as pd

    pan_df = pd.read_csv(r'explode.csv')
    df_exp = pan_df.assign(name=pan_df['name'].str.split('#')).explode('name')
    df_exp

Output: the dataset is transformed successfully, and new rows are created from the nested values. The Pandas way of exploding is simple, crisp, and straightforward unless the dataset is complex. The next section of this article covers the PySpark way of exploding, or unnesting, a dataset.

PySpark explode()

Import libraries and connect to Spark:

    from pyspark import SparkContext, SparkConf
    import pyspark
    from pyspark.sql import SparkSession
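One detail worth noting about the explode() example is how it treats the empty name in the dept 40 row. A minimal self-contained sketch (the CSV text is inlined via io.StringIO here purely for illustration; the original post reads it from explode.csv):

```python
import io
import pandas as pd

# Inline copy of the sample dataset so the sketch is self-contained.
csv_text = """dept,name
10,vivek#ruby#aniket
20,rahul#john#amy
30,shankar#jagdish
40,
50,yug#alex#alexa
"""

df = pd.read_csv(io.StringIO(csv_text))

# str.split('#') leaves the missing name as NaN, and explode()
# keeps that NaN as a single row rather than dropping dept 40.
df_exp = df.assign(name=df["name"].str.split("#")).explode("name")
print(len(df_exp))  # 3 + 3 + 2 + 1 + 3 = 12 rows
```

So the empty row survives the transformation as a single NaN row, which is usually the behavior you want when departments without members must not silently disappear.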

Merge Datasets

In the data universe, joins are among the most critical and frequently performed operations. With the Python Pandas API we perform similar operations while working on a data-science algorithm or any ETL (Extract, Transform and Load) project. The join methods available in Pandas are merge() and join(). merge and join work in a similar way, but internally they have some differences, and in this blog I have tried my best to list out the differences in the nature of these methods.

merge()

merge() performs a join operation on common columns.

    import pandas as pd

    d1 = {'Id': [1, 2, 3, 4, 5],
          'Name': ['Vivek', 'Rahul', 'Gunjan', 'Ankit', 'Vishakha'],
          'Age': [30, 24, 32, 32, 28]}
    d2 = {'Id': [1, 2, 3, 4],
          'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune'],
          'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']}
    df1 = pd.DataFrame(d1)
    df2 = pd.DataFrame(d2)
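Continuing from the excerpt above, a hedged sketch of merging the two frames on their common Id column. The how='inner' choice is an assumption for illustration; the original post may demonstrate other join types as well:

```python
import pandas as pd

d1 = {'Id': [1, 2, 3, 4, 5],
      'Name': ['Vivek', 'Rahul', 'Gunjan', 'Ankit', 'Vishakha'],
      'Age': [30, 24, 32, 32, 28]}
d2 = {'Id': [1, 2, 3, 4],
      'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune'],
      'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

# merge() finds the common column 'Id' and joins on it.
# An inner join keeps only Ids present in both frames (1-4),
# so Vishakha (Id 5) drops out of the result.
merged = df1.merge(df2, on='Id', how='inner')
print(merged.shape)  # (4, 5): 4 matched rows, 5 combined columns
```

Passing how='left' instead would keep all five rows of df1, filling Vishakha's Address and Qualification with NaN.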
