Pandas - Coggle Diagram
Pandas
Features
Fast and efficient since it is built on NumPy
Tools for reading/writing data
Powerful data structures
Easy data aggregation and transformation
Intelligent data alignment
High-performance merging and joining of data sets
Dataframe
2-dimensional labeled data structure with columns of potentially different datatypes
The data inputs can be
ndarray, list, dict, Series, DataFrame
How to create DataFrame
import pandas as pd
If ndarrays are used as input, NumPy must also be imported (import numpy as np)
df_movie_rating=pd.DataFrame({'movie1':[5,4,3,3,2,1], 'movie2':[4,5,2,3,4,2]}, index=['Tom','Jeff','Peter','Ram','Ted','Paul'])
df_movie_rating=pd.DataFrame({'movie1':[5,4,3,3,2,1], 'movie2':[4,5,2,3,4,2]})
participation=pd.Series([205,204,201],index=[2012,2008,2004])
cities=pd.Series(['London','Beijing','Athens'],index=[2012,2008,2004])
dataframe=pd.DataFrame({'No. of Participating countries':participation, 'Hostcities':cities})
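The two-Series example above can be run end-to-end as follows; the shared index (the Olympic years) aligns the columns automatically:

```python
import pandas as pd

# Two Series sharing the same index (Olympic years)
participation = pd.Series([205, 204, 201], index=[2012, 2008, 2004])
cities = pd.Series(['London', 'Beijing', 'Athens'], index=[2012, 2008, 2004])

# Columns are matched up on the shared index automatically
dataframe = pd.DataFrame({'No. of Participating countries': participation,
                          'Hostcities': cities})
print(dataframe)
```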
Data operations
with functions
Step 1
create a dataframe eg df
Step 2
create a function eg func
Step 3
df.applymap(func)
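The three steps above can be sketched as follows (the rating data and the `func` definition are illustrative; in pandas 2.1+ `applymap` is also available under the name `DataFrame.map`):

```python
import pandas as pd

# Step 1: create a dataframe (sample ratings, assumed for illustration)
df = pd.DataFrame({'movie1': [5, 4, 3], 'movie2': [4, 5, 2]})

# Step 2: create an element-wise function
def func(x):
    return 'good' if x >= 4 else 'ok'

# Step 3: applymap applies func to every element
result = df.applymap(func)
print(result)
```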
with statistical functions
Step 1
create a dataframe eg df
Step 2
use functions like df.max(), df.min(), df.std()
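A minimal sketch of the statistical functions, using the movie-rating data from earlier in the diagram:

```python
import pandas as pd

df = pd.DataFrame({'movie1': [5, 4, 3, 3, 2, 1], 'movie2': [4, 5, 2, 3, 4, 2]})

print(df.max())   # column-wise maximum
print(df.min())   # column-wise minimum
print(df.std())   # column-wise standard deviation
```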
using groupby
Step 1
create a dataframe eg df
Step 2
grouped=df.groupby('columnname')
Step 3
grouped.get_group('groupvalue')
Step 4
View the size of each group using grouped.size()
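The groupby steps in one runnable sketch (the genre/rating dataset is hypothetical). Note that `get_group` is called on the groupby object, not on the original frame:

```python
import pandas as pd

# Step 1: a sample dataset with a column to group on
df = pd.DataFrame({'genre': ['action', 'drama', 'action', 'drama'],
                   'rating': [5, 3, 4, 4]})

grouped = df.groupby('genre')          # Step 2: group by a column
action = grouped.get_group('action')   # Step 3: pull out one sub-group
sizes = grouped.size()                 # Step 4: number of rows per group
print(sizes)
```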
merge,duplicate and concatenation
merge
Step 1
create 2 dataframes df1 and df2
Step 2
use the merge function;
pd.merge(df1,df2)
pd.merge(df1,df2,on='columnname',how='left')
how can be 'left', 'right', 'inner' (default) or 'outer'
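A small merge sketch (the id/name/score data is invented). The default is an inner join on the common column; `how='left'` keeps every row of the left frame:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Tom', 'Jeff', 'Peter']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'score': [90, 85, 70]})

inner = pd.merge(df1, df2)                      # inner join on the shared 'id' column
left = pd.merge(df1, df2, on='id', how='left')  # keep every row of df1
print(left)
```

Rows of df1 with no match in df2 get NaN in the merged-in columns.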
concatenate
Step 1
create 2 dataframes df1 and df2
Step 2
use the concat function;
pd.concat([df1,df2],ignore_index=True)
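Concatenation in pandas is done with `pd.concat` (the actual API name), which takes a list of frames; `ignore_index=True` renumbers the rows 0..n-1:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# Stack the frames vertically and rebuild the row index
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```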
duplicated
Step 1
create a data frame df
Step 2
df.duplicated()
drop_duplicates
Step 1
create a data frame df
Step 2
df.drop_duplicates('columnname')
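Both duplicate-handling steps in one sketch (sample data invented). `duplicated()` flags rows that repeat an earlier row; `drop_duplicates` with a column name keeps the first row for each value of that column:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Jeff', 'Tom'], 'rating': [5, 4, 5]})

print(df.duplicated())                 # True for each row that repeats an earlier row
deduped = df.drop_duplicates('name')   # keep the first row for each name
print(deduped)
```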
Data standardization
Step 1
define a standardize function (which returns (test - test.mean()) / test.std())
Step 2
create a dataframe eg df
Step 3
transform the dataset values into standardized values by passing df as the argument to the standardize function
standardize(df)
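The standardization steps as runnable code, using the movie-rating data from the diagram; each column ends up with mean 0 and standard deviation 1:

```python
import pandas as pd

# Step 1: z-score each column: subtract the mean, divide by the std
def standardize(test):
    return (test - test.mean()) / test.std()

# Step 2: create a dataframe
df = pd.DataFrame({'movie1': [5, 4, 3, 3, 2, 1], 'movie2': [4, 5, 2, 3, 4, 2]})

# Step 3: transform the values
z = standardize(df)
print(z)
```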
Functions
df.describe()
Step 1
Create a dataframe eg df
Step 2
df.describe()
df.head()
df.tail()
df.iloc[]
df.loc[]
df.iat[]
df.columns
df.index
type(df)
df.dtypes
df.shape
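A quick tour of the inspection functions listed above, on the movie-rating data:

```python
import pandas as pd

df = pd.DataFrame({'movie1': [5, 4, 3, 3, 2, 1], 'movie2': [4, 5, 2, 3, 4, 2]})

print(df.describe())   # count, mean, std, min, quartiles, max per column
print(df.head(2))      # first two rows
print(df.shape)        # (rows, columns)
print(df.dtypes)       # dtype of each column
print(df.iloc[0, 0])   # position-based lookup
print(df.loc[0, 'movie2'])  # label-based lookup
```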
4 main data structures in Pandas
Series
One dimensional
Dataframe
Two dimensional
Panel
Three dimensional
Panel (4D)
Four dimensional
Series and DataFrames are widely used. Panel and Panel4D are deprecated and have been removed from modern versions of pandas.
Series
Contains data and labels (index). Every data value is assigned a label or index, and this process is called
Data alignment
A series can be created by various inputs like
ndarray, dict, list, scalar
How to create Series?
import pandas as pd
If ndarrays are used as input, NumPy must also be imported (import numpy as np)
S=pd.Series(data,index=[index])
S= pd.Series(list("abcdef"))
s= np.array(["Germany", "Australia","India", "USA","Canada"])
t=pd.Series(s)
s=pd.Series([4,16,36,64],index=['TWO','FOUR','SIX','EIGHT'])
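The three Series-creation snippets above, collected into one runnable sketch:

```python
import numpy as np
import pandas as pd

# From a list — gets a default integer index 0..5
s1 = pd.Series(list("abcdef"))

# From a NumPy ndarray
arr = np.array(["Germany", "Australia", "India", "USA", "Canada"])
s2 = pd.Series(arr)

# With explicit index labels
s3 = pd.Series([4, 16, 36, 64], index=['TWO', 'FOUR', 'SIX', 'EIGHT'])
print(s3)
```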
Functions
loc
Step 1
create a series s and define index names for every element
Step 2
use s.loc['INDEX NAME']
iloc
Step 1
create a series s and define index names for every element
Step 2
use s.iloc[index number]
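Both accessors side by side — note they use square brackets, not parentheses: `loc` looks up by index label, `iloc` by integer position:

```python
import pandas as pd

s = pd.Series([4, 16, 36, 64], index=['TWO', 'FOUR', 'SIX', 'EIGHT'])

print(s.loc['FOUR'])   # label-based lookup
print(s.iloc[1])       # position-based lookup (same element)
```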
Vector additions
index wise addition
vectorseries1=pd.Series([1,2,3,4],index=['a','b','c','d'])
vectorseries2=pd.Series([5,6,7,8],index=['a','b','c','d'])
total=vectorseries1+vectorseries2
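The addition is matched by index label, not by position — this is pandas' intrinsic data alignment. Labels present on only one side produce NaN:

```python
import pandas as pd

v1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
v2 = pd.Series([5, 6, 7, 8], index=['a', 'b', 'c', 'd'])

total = v1 + v2        # element-wise, aligned on the index labels
print(total)

# A partially overlapping index: only 'a' has values on both sides
v3 = pd.Series([10], index=['a'])
partial = v1 + v3
print(partial)         # 'b', 'c', 'd' become NaN
```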
Why is Pandas required?
Intrinsic data alignment
Data operation functions
Functions for handling missing data
Support for data standardization
Missing values
Cause?
not provided by the source
software issue
network issue
data integration issue
Handling missing values
dropna
drops all rows containing missing values (NaN)
df.dropna()
fillna()
fills all the missing values with the desired value
df.fillna('No')
df.fillna('No',inplace=True)
With inplace=True, the missing values are replaced in df itself rather than in a returned copy
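Both missing-value strategies in one sketch (the NaN placement is invented for illustration). Without `inplace=True`, `dropna` and `fillna` return new frames and leave the original untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'movie1': [5, np.nan, 3], 'movie2': [np.nan, 5, 2]})

dropped = df.dropna()        # keep only rows with no NaN
filled = df.fillna('No')     # replace every NaN with 'No' (returns a copy)
df.fillna(0, inplace=True)   # or modify df itself in place
print(df)
```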
read files
read_csv
returns a DataFrame, so all DataFrame functions are available on the result
data= pd.read_csv("location.csv")
If 3 columns are to be displayed from the entire dataset then
data[['column1','column2','column3']]
data= pd.read_csv("location.csv",skiprows=1)
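A self-contained sketch of reading CSV data — an in-memory file via `io.StringIO` stands in for a real path, and the contents are invented. Note the double brackets when selecting multiple columns, and that `skiprows=1` skips the first line of the file (here, the header row):

```python
import io
import pandas as pd

# Simulated CSV file; pd.read_csv("location.csv") works the same on a real path
csv_text = "column1,column2,column3\n1,2,3\n4,5,6\n"

data = pd.read_csv(io.StringIO(csv_text))
print(data[['column1', 'column2', 'column3']])   # double brackets for a column list

# skiprows=1 drops the header line, so the first data row becomes the header
data2 = pd.read_csv(io.StringIO(csv_text), skiprows=1)
print(data2)
```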