Pandas
Operations: in general, missing data is excluded
df.mean(), df.mean(1): the argument 1 computes the mean along the other axis (one value per row)
df.sub(s, axis='index'): subtract s from df, aligned on the row index
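A minimal sketch of the operations above. The frame and the Series s are made-up illustration data, not taken from the original notes.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption): one missing value in column A
df = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [4.0, 5.0, 6.0]})

col_means = df.mean()     # one mean per column; the NaN in A is skipped
row_means = df.mean(1)    # one mean per row (axis=1)

s = pd.Series([1.0, 1.0, 1.0])
shifted = df.sub(s, axis="index")  # subtract s from every column, row by row
```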
Apply: apply a function to each column
df.apply(np.cumsum)
df.apply(lambda x: x.max() - x.min())
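The two apply calls can be tried on a small frame. The column names and values here are assumptions chosen so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption)
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 30, 20]})

running = df.apply(np.cumsum)                    # cumulative sum down each column
spread = df.apply(lambda x: x.max() - x.min())   # one value (max - min) per column
```

apply returns a DataFrame when the function returns a sequence per column, and a Series when it returns a single value per column.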
Histogramming
s = pd.Series(np.random.randint(0, 7, size=10)): 10 random integers from 0 to 6 (the upper bound 7 is excluded)
s.value_counts(): count how many times each distinct value occurs
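A small sketch of the histogramming step. The fixed seed is an assumption added only to make the run repeatable.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed so repeated runs give the same numbers

s = pd.Series(np.random.randint(0, 7, size=10))  # 10 integers in the range 0..6
counts = s.value_counts()                        # occurrences per distinct value, most frequent first
```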
String method
s.str.lower(): convert every string in the Series to lower case
df = pd.DataFrame(np.random.randn(10, 4)): create a 10 x 4 DataFrame of random numbers drawn from the standard normal distribution
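The .str.lower() note can be tried as follows. The strings are illustrative assumptions; the missing entry is included to show that it stays missing.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption), including one missing entry
s = pd.Series(["A", "Bb", np.nan, "CAT"])

lowered = s.str.lower()  # element-wise lower-casing; the missing entry stays missing
```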
Selection
df['A'] : column A
df[0:3]: first three rows (positions 0 to 2)
df['20130102':'20130104']: select rows in a date range (both endpoints included)
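A sketch of the basic selections above, assuming a frame indexed by six consecutive dates starting 2013-01-01. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumption): date index, columns A..D, values 0..23
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

col_a = df["A"]                       # one column, returned as a Series
first_three = df[0:3]                 # rows at positions 0, 1, 2
by_date = df["20130102":"20130104"]   # rows in the date range, endpoints included
```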
by Label
dates is the index
df.loc[dates[0]]: single axis (one row)
df.loc[:, ['A','B']]: multi-axis (all rows, columns A and B)
df.loc['20130102':'20130104', ['A','B']]: label slice on rows, both endpoints included
df.loc[dates[0], 'A']: scalar value
df.at[dates[0], 'A']: scalar value, but quicker
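Label-based selection, sketched on a frame whose index is a hypothetical dates range. The data is illustrative, not from the original notes.

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumption): date index, columns A..D, values 0..23
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

row = df.loc[dates[0]]                              # one row as a Series
cols = df.loc[:, ["A", "B"]]                        # all rows, two columns
block = df.loc["20130102":"20130104", ["A", "B"]]   # label slice includes both ends
value = df.loc[dates[0], "A"]                       # scalar
fast = df.at[dates[0], "A"]                         # same scalar through the scalar accessor
```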
by position
df.iloc[3]
df.iloc[3:5, 0:2]: rows 3 to 4 and columns 0 to 1 (end positions excluded)
df.iloc[[1,2,4],[0,2]]
df.iloc[1,1] : scalar value
df.iat[1,1] : scalar value but faster access
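Position-based selection on a small illustrative frame (the values are assumptions chosen so each result can be read off directly).

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumption): 6 rows x 4 columns holding 0..23
df = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list("ABCD"))

row = df.iloc[3]                        # fourth row as a Series
block = df.iloc[3:5, 0:2]               # rows 3-4, columns 0-1 (end excluded)
picked = df.iloc[[1, 2, 4], [0, 2]]     # explicit lists of row and column positions
value = df.iloc[1, 1]                   # scalar
fast = df.iat[1, 1]                     # same scalar through the scalar accessor
```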
boolean Indexing: select data where a boolean condition is met
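df[df.A > 0]: condition on one column, returns the whole rows that match
df[df > 0]: condition on every cell; cells that fail the condition become NaN
df2[df2['E'].isin(['two','four'])]: rows whose column E holds one of the listed values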
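The three boolean-indexing forms, sketched with made-up data. The column E and its values are assumptions added for the isin example.

```python
import pandas as pd

# Illustrative data (assumption)
df = pd.DataFrame({"A": [1, -2, 3], "B": [-1, 5, -6]})

rows = df[df.A > 0]      # keeps whole rows where column A is positive
masked = df[df > 0]      # same shape as df; failing cells become NaN

df2 = df.copy()
df2["E"] = ["one", "two", "four"]
chosen = df2[df2["E"].isin(["two", "four"])]  # rows whose E is in the list
```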
Viewing Data
head(), tail(3), index, columns, values
describe(): summary statistics
T: transpose
.sort_index(axis=1, ascending=False)
axis=1 sorts by column labels
.sort_values(by='B')
sorts rows by the values in column 'B', ascending unless specified otherwise
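A short sketch of the viewing and sorting calls above. The column order A, C, B is a deliberate assumption so that sorting the column labels has a visible effect.

```python
import pandas as pd

# Illustrative data (assumption): columns deliberately out of alphabetical order
df = pd.DataFrame(
    {"A": [9, 8, 7], "C": [1, 2, 3], "B": [3, 1, 2]},
    index=["x", "y", "z"],
)

first_two = df.head(2)                            # first two rows
summary = df.describe()                           # count, mean, std, min, quartiles, max
flipped = df.T                                    # rows and columns swapped
by_cols = df.sort_index(axis=1, ascending=False)  # column labels in descending order
by_b = df.sort_values(by="B")                     # rows ordered by column B, ascending
```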
Merge
join: SQL-style merge, widens the df (adds columns)
pd.merge(left, right, on='key')
combine the left df with the right df on the 'key' column
rows whose 'key' values match are combined into one row, so 'key' acts like an index here
concat: lengthens the data (adds rows)
pd.concat([s1, s2])
if no index is specified, each object keeps its default index 0, 1, ..., so the result shows something like 0, 1, 0, 1; for this case people often add one more index level, called a hierarchical index
Append: append rows to a df (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
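A sketch of merge versus concat. The frames, the column names lval and rval, and the keys "first" and "second" are illustrative assumptions.

```python
import pandas as pd

# Illustrative data (assumption)
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
merged = pd.merge(left, right, on="key")   # wider: key, lval, rval

s1 = pd.Series([1, 2])
s2 = pd.Series([3, 4])
stacked = pd.concat([s1, s2])                          # longer; index repeats 0, 1, 0, 1
renumbered = pd.concat([s1, s2], ignore_index=True)    # fresh index 0..3
keyed = pd.concat([s1, s2], keys=["first", "second"])  # hierarchical index
```

Passing keys is one way to get the hierarchical index mentioned above; ignore_index=True is the simpler choice when the original labels do not matter.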
missing data
np.nan: represents missing data; by default it is left out of calculations
df1.dropna(how='any'): drop every row that contains at least one NaN
df1.fillna(value=5): replace every NaN with 5
pd.isna(df1): get a boolean mask, True where a value is missing
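The missing-data calls, sketched on a frame with two NaN cells (illustrative data, not from the original notes).

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption): one NaN in each column, on different rows
df1 = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

dropped = df1.dropna(how="any")   # only rows without any NaN survive
filled = df1.fillna(value=5)      # NaN cells replaced by 5
mask = pd.isna(df1)               # True where a value is missing
```

None of these calls changes df1 itself; each returns a new object.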
Reshaping
df2.stack(): move the column labels into the innermost index level
stacked.unstack(): reverse of the above; moves the last index level back into columns
stacked.unstack(1): with a hierarchical index, level 1 (the second level) becomes the columns
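A sketch of stack and unstack on a frame with a two-level index. The level names "first" and "second" and the data are assumptions for illustration.

```python
import pandas as pd

# Illustrative frame (assumption) with a two-level row index
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df2 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}, index=index)

stacked = df2.stack()            # column labels A/B become a third index level
restored = stacked.unstack()     # last level goes back to columns
by_second = stacked.unstack(1)   # level 1 ("second") becomes the columns instead
```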
pivot table
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
df is the DataFrame
values is the column whose data is shown (aggregated) in the table
index works like a group by: these columns form the row index
columns lists the column(s) whose distinct values become the columns of the pivot table
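A minimal pivot-table sketch using the same column roles as the call above. The data is made up; the default aggregation (the mean) is used because the notes do not name one.

```python
import pandas as pd

# Illustrative data (assumption)
df = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "B": ["x", "y", "x", "y"],
    "C": ["foo", "bar", "foo", "bar"],
    "D": [1.0, 2.0, 3.0, 4.0],
})

# Rows are grouped by (A, B); each distinct value of C becomes a column;
# cells hold the mean of D, or NaN where no row has that combination.
table = pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
```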
Categoricals
df["grade"] = df["raw_grade"].astype("category")
convert raw_grade to the categorical data type
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
rename the categories to more meaningful names (assigning to .cat.categories directly no longer works in pandas 2.0 and later)
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
reorder the categories and add the missing ones
df.sort_values(by="grade")
sorts by category order, not lexical order
df.groupby("grade").size()
shows the count for each category
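The categorical steps as one runnable sketch. The raw grades are illustrative assumptions; rename_categories is used for the renaming step, and observed=False is passed explicitly so that empty categories are counted regardless of the pandas version's default.

```python
import pandas as pd

# Illustrative data (assumption)
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "raw_grade": ["a", "b", "b", "a", "a", "e"],
})

df["grade"] = df["raw_grade"].astype("category")   # categories: a, b, e
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

ordered = df.sort_values(by="grade")                   # follows category order
counts = df.groupby("grade", observed=False).size()    # includes empty categories
```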
Series vs Dataframe
Series: pd.Series([...]), one-dimensional, like a labelled list
DataFrame: pd.DataFrame({...}), two-dimensional, like a dict of columns
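A two-line sketch of the contrast above, with made-up values.

```python
import pandas as pd

s = pd.Series([1, 3, 5])                                 # built from a list; default index 0, 1, 2
df = pd.DataFrame({"A": [1.0, 2.0], "B": ["x", "y"]})    # built from a dict; keys become columns

column = df["A"]   # each column of a DataFrame is itself a Series
```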
show attributes
df.<tab> in IPython
df.dtypes
shows the data type of each column
Grouping
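df.groupby('A').sum(): group rows that share a value in column A and sum the remaining columns (on recent pandas versions, select the numeric columns or pass numeric_only=True to sum only numeric data)
df.groupby(['A','B']).sum(): group by two columns
both of the above work like pivot tables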
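A grouping sketch on illustrative data. The numeric columns are selected explicitly in the single-key case so the result does not depend on how a given pandas version treats string columns in sum().

```python
import pandas as pd

# Illustrative data (assumption)
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": [1, 2, 3, 4],
    "D": [10.0, 20.0, 30.0, 40.0],
})

by_a = df.groupby("A")[["C", "D"]].sum()   # one row per value of A
by_ab = df.groupby(["A", "B"]).sum()       # one row per (A, B) pair, hierarchical index
```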
Plotting
ts.plot()
calling plot() with no arguments plots the values against the index; on a DataFrame, every column is drawn as its own line