Posts

Showing posts with the label data structures

Split Datasets

The objective of this article is to transform a data set in which each row holds several delimited values into one row per value, using the explode() method. The scope is to understand how to unnest, or explode, a data set using the parallel processing framework PySpark and the native Python library pandas.

The dataset looks like this:

    dept,name
    10,vivek#ruby#aniket
    20,rahul#john#amy
    30,shankar#jagdish
    40,
    50,yug#alex#alexa

Pandas explode()

    import pandas as pd
    pan_df = pd.read_csv(r'explode.csv')
    df_exp = pan_df.assign(name=pan_df['name'].str.split('#')).explode('name')
    df_exp

Output: The dataset is transformed successfully, and we are able to create new rows from the nested dataset. The pandas way of exploding is simple, crisp and straightforward unless the dataset is complex. In the next section of this article we will cover the PySpark way of exploding or unnesting a dataset.

PySpark explode()

Import libraries and connect to Spark

    from pyspark import SparkContext,SparkConf
    import pyspark
    from pyspark.sql import SparkSes...
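The pandas step described in this excerpt can be tried without the explode.csv file. The following is a minimal sketch, assuming the sample rows are inlined as a string; the use of io.StringIO and the final reset_index call are additions for illustration and are not part of the original post.

```python
import io
import pandas as pd

# Sample rows from the excerpt, inlined so the sketch runs without explode.csv
# (assumption: the file contains exactly these lines).
csv_text = """dept,name
10,vivek#ruby#aniket
20,rahul#john#amy
30,shankar#jagdish
40,
50,yug#alex#alexa
"""

pan_df = pd.read_csv(io.StringIO(csv_text))

# Split each '#'-delimited string into a list, then emit one row per list element.
# reset_index is an illustrative addition so the result has a clean 0..n-1 index.
df_exp = (
    pan_df.assign(name=pan_df["name"].str.split("#"))
    .explode("name")
    .reset_index(drop=True)
)

print(df_exp)
```

Note that dept 40 has no names, so pandas reads the field as a missing value; explode() keeps that row as a single row with a missing name rather than dropping it.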

Append Datasets

In the data universe, joins and unions are the most critical and frequently performed operations. In my experience, almost every other operation is either a join or a union. Just as joins are inevitable, so are unions. In a previous article we covered how joins work in pandas. Link to article: https://letscodewithvivek.blogspot.com/2021/12/python-joins.html

The scope of this article is to understand how the concat() method helps us achieve the union of data frames.

concat()

Concatenate pandas objects along a particular axis, with optional set logic along the other axes. Create two data frames to understand how the concat method works.

Concat data frames on axis=0, the default operation (union)

    import pandas as pd
    df1 = pd.DataFrame({'Name': ['Vivek', 'Amy', 'Vishakha', 'Alice', 'Ayoung'], 'subject_id':['sub1','sub2','sub4','sub6','sub5'], 'Marks_scored':[98,90,87,69,...
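Since the excerpt's data frames are cut off, here is a minimal sketch of the same idea on small made-up frames. The names and values below are hypothetical and are not the article's data; the drop_duplicates step and the join="inner" variant are added only to illustrate the union and "set logic along the other axes" behaviour that the excerpt mentions.

```python
import pandas as pd

# Hypothetical frames for illustration only (not the article's data).
df1 = pd.DataFrame({"Name": ["A", "B"], "subject_id": ["sub1", "sub2"]})
df2 = pd.DataFrame({"Name": ["C", "B"], "subject_id": ["sub3", "sub2"]})

# axis=0 is the default: rows are stacked and duplicates are kept,
# which behaves like SQL UNION ALL.
union_all = pd.concat([df1, df2], ignore_index=True)

# Dropping duplicate rows afterwards gives the equivalent of SQL UNION.
union_distinct = union_all.drop_duplicates().reset_index(drop=True)

# join="inner" keeps only the columns shared by every frame.
df3 = pd.DataFrame({"Name": ["D"], "Marks_scored": [75]})
common_cols = pd.concat([df1, df3], join="inner", ignore_index=True)

print(union_all)
print(union_distinct)
print(common_cols)
```

With the default join="outer", columns missing from one frame would instead be kept and filled with missing values.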
