Posts

Showing posts with the label python

Split Datasets

The objective of this article is to transform a delimited column into multiple rows using the explode() method. The scope is to understand how to unnest, or explode, a dataset using the parallel processing framework PySpark and the native Python library pandas.

The dataset looks like this:

    dept,name
    10,vivek#ruby#aniket
    20,rahul#john#amy
    30,shankar#jagdish
    40,
    50,yug#alex#alexa

Pandas explode()

    import pandas as pd
    pan_df = pd.read_csv(r'explode.csv')
    df_exp = pan_df.assign(name=pan_df['name'].str.split('#')).explode('name')
    df_exp

Output: the dataset is transformed successfully, and new rows are created from the nested values. The pandas way of exploding is simple, crisp and straightforward unless the dataset is complex. The next section of this article covers the PySpark way of exploding or unnesting a dataset.

PySpark explode()

Import libraries and connect to Spark

    from pyspark import SparkContext, SparkConf
    import pyspark
    from pyspark.sql import SparkSes...
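The pandas step in this excerpt can be tried without a CSV file on disk. The following is a minimal sketch, assuming the sample rows are fed through an in-memory StringIO buffer instead of explode.csv; the reset_index call and the df_clean frame are illustrative additions, not part of the original post. It also shows what happens to dept 40, which has no names.

```python
from io import StringIO

import pandas as pd

# Sample rows copied from the excerpt; StringIO stands in for explode.csv (assumption).
raw = StringIO(
    "dept,name\n"
    "10,vivek#ruby#aniket\n"
    "20,rahul#john#amy\n"
    "30,shankar#jagdish\n"
    "40,\n"
    "50,yug#alex#alexa\n"
)
pan_df = pd.read_csv(raw)

# Split each name string on '#', then emit one row per list element.
df_exp = (
    pan_df.assign(name=pan_df["name"].str.split("#"))
    .explode("name")
    .reset_index(drop=True)
)

# dept 40 has no names, so it survives as one row with a missing value.
# Dropping it is optional and shown here only for illustration.
df_clean = df_exp.dropna(subset=["name"])

print(df_exp)
```

Whether the empty dept 40 row should be kept or dropped depends on the use case; the sketch keeps both versions so the difference is visible.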

pySQL

The objective of this article is to understand how to manipulate pandas DataFrames using SQL with the pandasql library.

What is pandasql?

As per its documentation, pandasql allows us to query pandas DataFrames using SQL syntax.

Installation of pandasql

The library can be installed in either of two ways, both of which use pip:

Using a terminal

    pip install -U pandasql

Using Jupyter Notebooks

    !pip install -U pandasql

Use Case

The main function in pandasql is sqldf. sqldf accepts two arguments:

- SQL query
- Session environment variables ( globals() or locals() )

The session environment argument is optional and handled by Python itself; even if we do not provide it, we can still run SQL queries against pandas DataFrames.

Import Necessary Libraries

    from pandasql import sqldf
    import pandas as pd

Import Data Sets

    emp_df = pd.read_csv(r'D:\python_coding\pandas_practice\emp.csv')
    emp_df.head(10)
    dept_df=pd....
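pandasql uses SQLite syntax, so the idea of querying DataFrames with SQL can be sketched using only pandas and the standard-library sqlite3 module. This is a stand-in for illustration, not pandasql itself: the two small frames, their column names and the query are hypothetical, since the contents of emp.csv are not shown in the excerpt.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-ins for emp_df and dept_df; the real CSV contents are not shown.
emp_df = pd.DataFrame(
    {"emp_id": [1, 2, 3], "name": ["Vivek", "Rahul", "Amy"], "dept_id": [10, 20, 10]}
)
dept_df = pd.DataFrame({"dept_id": [10, 20], "dept_name": ["Sales", "HR"]})

# Copy both frames into an in-memory SQLite database as tables.
conn = sqlite3.connect(":memory:")
emp_df.to_sql("emp_df", conn, index=False)
dept_df.to_sql("dept_df", conn, index=False)

query = """
    SELECT e.name, d.dept_name
    FROM emp_df AS e
    JOIN dept_df AS d ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
"""

# Run the SQL and get the result back as a DataFrame.
result = pd.read_sql_query(query, conn)
conn.close()

print(result)
```

With pandasql installed, the same query string would be passed to sqldf together with globals() or locals(), as the excerpt describes, and the table names would refer to the DataFrame variables directly.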

Merge Datasets

In the data universe, joins are among the most critical and frequently performed operations. With the pandas API we perform the same kind of operation while working on a data science algorithm or any ETL (Extract, Transform and Load) project. The pandas methods available for joins are merge() and join(). They work in a similar way but differ internally, and in this blog I have tried my best to list the differences in how these methods behave.

merge()

merge performs a join operation on common columns.

    import pandas as pd
    d1 = {'Id': [1, 2, 3, 4, 5], 'Name': ['Vivek', 'Rahul', 'Gunjan', 'Ankit', 'Vishakha'], 'Age': [30, 24, 32, 32, 28]}
    d2 = {'Id': [1, 2, 3, 4], 'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune'], 'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']}
    df1 = pd.DataFrame(d1)
    df2=pd.DataFra...
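The difference between merge() and join() is easier to see side by side. The following is a minimal sketch that reuses the d1 and d2 dictionaries from the excerpt; the variable names inner, left and joined are illustrative, and the choice of joins shown is an assumption about what the full post goes on to cover.

```python
import pandas as pd

d1 = {
    "Id": [1, 2, 3, 4, 5],
    "Name": ["Vivek", "Rahul", "Gunjan", "Ankit", "Vishakha"],
    "Age": [30, 24, 32, 32, 28],
}
d2 = {
    "Id": [1, 2, 3, 4],
    "Address": ["Delhi", "Gurgaon", "Noida", "Pune"],
    "Qualification": ["Btech", "B.A", "Bcom", "B.hons"],
}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

# merge() joins on a column; the default how="inner" keeps only matching Ids.
inner = pd.merge(df1, df2, on="Id")

# how="left" keeps every row of df1; Id 5 has no match, so its df2 columns are NaN.
left = pd.merge(df1, df2, on="Id", how="left")

# join() aligns on the index and defaults to a left join,
# so the key column is moved into the index first.
joined = df1.set_index("Id").join(df2.set_index("Id"))

print(inner)
print(left)
print(joined)
```

In short, merge() defaults to an inner join on columns, while join() defaults to a left join on the index.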

Ingest Excel Data

Data can come in any format; recently I got the chance to work on Excel datasets. pandas is a very efficient library for working with different types of datasets, but its performance degrades as data size grows from MBs to GBs. Parallel computing was designed for efficient processing of GB-scale datasets, and Spark shines here. The library com.crealytics:spark-excel_xxx allows querying Excel spreadsheets as Spark DataFrames and leverages the parallel computing infrastructure. The objective of this article is to understand how to use the spark-excel library with the Python version of Spark, PySpark.

Connect to Spark (standalone cluster)

    import pyspark
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
        .appName('Spark_DB') \
        .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
        .getOrCreate()

com.crealytics:spark-excel_2.11:0.12.2 is the crealytics spark-excel package used fo...

Append Datasets

In the data universe, joins and unions are the most critical and frequently performed operations. In my experience, almost every other operation is either a join or a union. Unions are as inevitable as joins. A previous article covered how joins work in pandas. Link to article: https://letscodewithvivek.blogspot.com/2021/12/python-joins.html

The scope of this article is to understand how the concat() method helps us achieve the union of data frames.

concat()

concat concatenates pandas objects along a particular axis, with optional set logic along the other axes. Create two data frames to understand how the concat method works.

Concat data frames on axis=0, the default operation (union)

    import pandas as pd
    df1 = pd.DataFrame({'Name': ['Vivek', 'Amy', 'Vishakha', 'Alice', 'Ayoung'], 'subject_id':['sub1','sub2','sub4','sub6','sub5'], 'Marks_scored':[98,90,87,69,...
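Because the data frames in this excerpt are cut off, the following is a minimal sketch of concat() with two small, hypothetical frames that reuse the same column names. The second frame and its extra Grade column are assumptions, added only to show the "set logic along the other axes" that the join parameter controls.

```python
import pandas as pd

# Hypothetical, shortened frames; the originals are truncated in the excerpt.
df1 = pd.DataFrame(
    {"Name": ["Vivek", "Amy"], "subject_id": ["sub1", "sub2"], "Marks_scored": [98, 90]}
)
df2 = pd.DataFrame(
    {
        "Name": ["Alice", "Ayoung"],
        "subject_id": ["sub6", "sub5"],
        "Marks_scored": [69, 75],
        "Grade": ["C", "B"],  # extra column, present only in df2
    }
)

# axis=0 (the default) stacks rows, like a SQL UNION ALL.
# The default join="outer" keeps all columns and fills gaps with NaN.
union = pd.concat([df1, df2], axis=0, ignore_index=True)

# join="inner" keeps only the columns common to both frames.
union_common = pd.concat([df1, df2], axis=0, join="inner", ignore_index=True)

# axis=1 places the frames side by side, aligned on the row index.
side_by_side = pd.concat([df1, df2], axis=1)

print(union)
print(union_common)
print(side_by_side)
```

Passing ignore_index=True gives the combined frame a fresh 0..n-1 index; without it, the original row labels of each frame are repeated.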
