Please enable JavaScript.
Coggle requires JavaScript to display documents.
Pandas(py) - Coggle Diagram
Pandas(py)
CREATING, LOADING, AND SELECTING DATA WITH PANDAS
Create a DataFrame I
df1 = pd.DataFrame({
'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
'age': [34, 28, 51]
})
Create a DataFrame II
df2 = pd.DataFrame([
['John Smith', '123 Main St.', 34],
['Jane Doe', '456 Maple Ave.', 28],
['Joe Schmo', '789 Broadway', 51]
],
columns=['name', 'address', 'age'])
Loading and Saving CSVs
Loading
df.read_csv('my-csv-file.csv')
Saving
df.to_csv('new-csv-file.csv')
The method what gives the first 5 rows of a DataFrame.
.head()
Select two or more columns from a DataFrame
new_df = df[['column_1', 'column_2']]
Select Rows
df.iloc[index]
Selecting Multiple Rows
df.iloc[3:7] OR
df.iloc[:3] OR
df.iloc[-3:]
Select Rows with Logic I
df[df.MyColumnName == desired_column_value]
sample:
df[df.age == 30]
df[df.age < 30]
Select Rows with Logic II
df[(df.age < 30) |
(df.name == 'Martha Jones')]
In Python, | means “or” and & means “and”.
Select Rows with Logic III
df[df.name.isin(['Martha Jones',
'Rose Tyler',
'Amy Pond'])]
Setting indices
return a new DataFrame:
df.reset_index()
Without old indexes:
df.reset_index(drop=True)
Modify existing DataFrame:
.reset_index(inplace=True)
MODIFYING DATAFRAMES
Adding a Column I
One way that we can add a new column is by giving a list of the same length as the existing DataFrame.
df['Quantity'] = [100, 150, 50, 35]
Adding a Column II
We can also add a new column that is the same for all rows in the DataFrame.
df['In Stock?'] = True
Adding a Column III
you can add a new column by performing a function on the existing columns.
df['Sales Tax'] = df.Price * 0.075
! "Price" this is column !
Performing Column Operations
more complex calculation than multiplication or addition
df['Name'] = df.Name.apply(str.upper)
Reviewing Lambda Function
stringlambda = lambda x: x.lower()
print(stringlambda("Oh Hi Mark!"))
The output:
"oh hi mark!"
Reviewing Lambda Function: If Statements
lambda x: [OUTCOME IF TRUE] if [CONDITIONAL] else [OUTCOME IF FALSE]
myfunction = lambda x: 40 + (x - 40) * 1.50 if x > 40 else x
Applying a Lambda to a Column
df['Email Provider'] = df.Email.apply(
lambda x: x.split('
')[-1]
)
Applying a Lambda to a Row
If we use apply and add the argument
axis=1
,the input to our lambda function will be an entire
row
, not a
column
.
df['Price with Tax'] = df.apply(lambda row:
row['Price'] * 1.075
if row['Is taxed?'] == 'Yes'
lse row['Price'],
axis=1
)
Renaming Columns
.columns
This command edits the existing DataFrame df.
df = pd.DataFrame({
'name': ['John', 'Jane', 'Sue', 'Fred'],
'age': [23, 29, 21, 18]
})
df.columns = ['First Name', 'Age']
Renaming Columns II
There are several reasons why
.rename
is preferable to
.columns
:
1.You can rename just one column
You can be specific about which column names are getting changed (with .column you can accidentally switch column names if you’re not careful)
Using
inplace=True
lets us edit the
original
DataFrame.
df = pd.DataFrame({
'name': ['John', 'Jane', 'Sue', 'Fred'],
'age': [23, 29, 21, 18]
})
df.rename(columns={
'name': 'First Name',
'age': 'Age'},
inplace=True)
WORKING WITH MULTIPLE DATAFRAMES
.merge() ищет столбцы, которые являются общими для двух DataFrames, а затем ищет строки, в которых значения этих столбцов совпадают.
new_df = pd.merge(orders, customers)
В дополнение к использованию pd.merge() каждый DataFrame имеет собственный метод .merge(). Например, если вы хотите объединить
заказы
с
клиентами
, вы можете использовать:
new_df = orders.merge(customers)
This produces the same DataFrame as if we had called pd.merge(orders, customers)
sample:
results = all_data[(all_data.revenue > all_data.target) & (all_data.women > all_data.men)]
use
.rename()
to rename the columns for our merges. if columns have the same name
pd.merge(orders, customers.rename(columns={'id': 'customer_id'}))
another way is use the keywords
left_on
and
right_on
pd.merge(
orders,
customers,
left_on='customer_id',
right_on='id')
Pandas won’t let you have two columns with the same name, so it will change them to
id_x
and
id_y
.
id_x
and
id_y
. it's unreadable for us. So, use the keyword
suffixes
pd.merge(
orders,
customers,
left_on='customer_id',
right_on='id',
suffixes=['_order', '_customer']
)
merging only those elements, who perfectly match each other is calling
INNER
merge
OUTER
join combine all rows, even if they don't match each other. Any missing values are filled in with
None
or
nan
pd.merge(company_a, company_b, how='outer')
left
and
right
merge
left
merge includes all rows from first table(left), and rows from seond(right) table, that perfectly match row from first table
pd.merge(company_a, company_b, how='left')
Concatenate DataFrames
sometimes dataset is split on multiple pices. For instance, often dataset is split into multiple CSV files. And if we want to combine in together we can use the method:
pd.concat([df1, df2, df3, ...])
realy imortant
, that
columns
in dataframes are the
same
Example:
pd.concat(
[
df1, df2
]
)
AGGREGATES IN PANDAS
Calculating Column Statistics
The general syntax for these calculations is:
df.column_name.command()
examples:
print(customers.age)
'>> [23, 25, 31, 35, 35, 46, 62]
print(customers.age.median())
'>> 35
print(inventory.color)
'>> ['blue', 'blue', 'blue', 'blue', 'blue', 'green', 'green', 'orange', 'orange', 'orange']
print(inventory.color.unique())
'>> ['blue', 'green', 'orange']
common commands:
Command:
1.mean
2.std
3.median
4.max
5.min
6.coun
7.nunique
8.unique
Description:
1.Average of all values in column
2.Standard deviation
3.Median
4.Maximum value in column
5.Minimum value in column
6.Number of values in column
7.Number of unique values in column
8.List of unique values in column
Calculating Aggregate Functions I
df.groupby('column1').column2.measurement()
example:
grades = df.groupby('student').grade.mean()
output:
studen
t |
grade
Amy | 80
Bob | 90
Chris | 75
Calculating Aggregate Functions II
Generally,
groupby
statement followed by
reset_index
df.groupby('column1').column2.measurement().reset_index()
Example:
teas_counts = teas.groupby('category').id.count().reset_index()
output:
category
|
id
0 black 3
1 green 4
2 herbal 8
3 white 2
Rename column:
teas_counts = teas_counts.rename(columns={"id": "counts"})
output:
category
|
counts
0 black 3
1 green 4
2 herbal 8
3 white 2
Calculating Aggregate Functions III
np.percentile
can calculate any percentile over an array of values
high_earners = df.groupby('category').wage
.apply(lambda x: np.percentile(x, 75))
.reset_index()
Calculating Aggregate Functions IV
we can
group by
more than one column
df.groupby(['Location', 'Day of Week'])['Total Sales'].mean().reset_index()
Pivot Tables
In Pandas, the command for pivot is:
df.pivot(columns='ColumnToPivot',
index='ColumnToBeRows',
values='ColumnToBeValues')
Example
First use the groupby statement:
unpivoted = df.groupby(['Location', 'Day of Week'])['Total Sales'].mean().reset_index()
Now pivot the table:
pivoted = unpivoted.pivot(
columns='Day of Week',
index='Location',
values='Total Sales')
Remember to use reset_index() at the end of your code!