Data Science
NumPy
What is Numpy
Numerical Python
import numpy as np
Python Library
Sophisticated way of creating arrays
Data Types
bool
int8, int16, int64, uint64
float, float16, float32, float64
shape, size, ndim, dtype
Give information about array
.shape is the shape of the array (a tuple with one length per dimension)
.size is the number of elements in the array
.ndim is the number of dimensions
.dtype is the data type of the elements inside the array
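A minimal sketch of the four attributes above. The array values are chosen only for illustration.

```python
import numpy as np

# A 2x3 array of 64-bit integers (illustrative values).
x = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

print(x.shape)  # (2, 3): one length per dimension
print(x.size)   # 6: total number of elements
print(x.ndim)   # 2: number of dimensions
print(x.dtype)  # int64: data type of the elements
```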
Tuples in Python
immutable, can store elements of different types
written with parentheses
Arrays
Array Indexing
x[0,1]
x[-1]
x[1, 1:2]
to get a subset of the values: x[start:stop:step]
https://numpy.org/doc/stable/reference/arrays.indexing.html
x[0][1]
Fancy indexing: passing arrays as parameters
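The indexing forms listed above can be sketched on a small array. The array `x` and its contents are assumptions made for this example.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)   # rows: [0..3], [4..7], [8..11]

a = x[0, 1]                # single element at row 0, column 1 -> 1
b = x[0][1]                # same element, reached in two indexing steps
last_row = x[-1]           # negative index counts from the end -> [8, 9, 10, 11]
c = x[1, 1:2]              # a slice keeps the dimension -> array([5])
every_other = x[0, ::2]    # start:stop:step with step 2 -> [0, 2]
fancy = x[[0, 2], [1, 3]]  # fancy indexing pairs (0, 1) and (2, 3) -> [1, 11]
```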
Array Reshaping
.reshape()
parameter is a tuple
x[..., 0] selects index 0 along the last axis, which removes that dimension
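A short sketch of reshaping, assuming a six-element array. The shapes are picked only to show the effect.

```python
import numpy as np

x = np.arange(6)          # shape (6,)
m = x.reshape((2, 3))     # the new shape is passed as a tuple -> shape (2, 3)
col = x.reshape((6, 1))   # a single column -> shape (6, 1)
flat = col[..., 0]        # index 0 on the last axis drops it -> shape (6,)
```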
Concatenating Arrays
.concatenate([a,b])
always takes a sequence (list or tuple) of arrays as input
.concatenate([a,b], axis =1)
the number of axes you can choose from is equal to the number of dimensions
axis = 0 is the default behaviour
Merging 1D Arrays
.vstack([a,b]) vertical stack
.hstack([a,b]) horizontal stack
.T will transpose the matrix
np.newaxis adds a new axis of length 1, e.g. x[np.newaxis, :] turns a 1D array into a 2D array with one row
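The concatenation and stacking calls above could look like this in practice. The arrays `a`, `b`, `u` and `v` are illustrative assumptions.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

rows = np.concatenate([a, b])           # axis=0 is the default -> shape (4, 2)
cols = np.concatenate([a, b], axis=1)   # join side by side -> shape (2, 4)

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])
stacked_v = np.vstack([u, v])   # one array per row -> shape (2, 3)
stacked_h = np.hstack([u, v])   # end to end -> shape (6,)
transposed = stacked_v.T        # rows and columns swapped -> shape (3, 2)
row = u[np.newaxis, :]          # new leading axis -> shape (1, 3)
column = u[:, np.newaxis]       # new trailing axis -> shape (3, 1)
```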
Splitting of Arrays
the split method takes as input the split points
x, y, z = np.split(a, [2,4])
.vsplit(x, [2]) and .hsplit(x, [2])
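A sketch of splitting at given split points, assuming a six-element array and a 4x4 grid for illustration.

```python
import numpy as np

a = np.arange(6)                 # [0, 1, 2, 3, 4, 5]
x, y, z = np.split(a, [2, 4])    # split before index 2 and before index 4

grid = np.arange(16).reshape(4, 4)
top, bottom = np.vsplit(grid, [2])   # first two rows / remaining rows
left, right = np.hsplit(grid, [2])   # first two columns / remaining columns
```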
Sorting arrays
np.sort(x)
np.argsort(x) gives index of sorted elements
np.sort(x, axis = 0)
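Partial Sorting: np.partition(x, 3) puts the element that belongs at index 3 in its sorted position, with all smaller elements before it (in no particular order)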
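The sorting functions above, sketched on small arrays whose values are assumptions for this example.

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])

sorted_x = np.sort(x)      # sorted copy; x itself is unchanged
order = np.argsort(x)      # indices that would sort x, so x[order] is sorted

m = np.array([[3, 1], [2, 4]])
by_col = np.sort(m, axis=0)   # sort each column independently -> [[2, 1], [3, 4]]

# Partial sort: the three smallest values end up first, in no guaranteed order.
part = np.partition(x, 3)
three_smallest = np.sort(part[:3])   # -> [1, 1, 2]
```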
Creating Arrays
zeros, full, arange, random, identity, diagonal
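One possible reading of the creation helpers listed above. Using `np.random.default_rng` for the random case and `np.diag` for the diagonal case is an assumption; the notes do not say which calls were meant.

```python
import numpy as np

zeros = np.zeros((2, 3))       # 2x3 array filled with 0.0
sevens = np.full((2, 2), 7)    # 2x2 array filled with 7
steps = np.arange(0, 10, 2)    # start, stop (exclusive), step -> [0, 2, 4, 6, 8]

rng = np.random.default_rng(0)   # seeded generator (assumed choice of API)
noise = rng.random((2, 2))       # uniform samples in [0, 1)

eye = np.identity(3)             # 3x3 identity matrix
diag = np.diag([1, 2, 3])        # matrix with the given values on the diagonal
```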
Vectorized Operations
Add things of different dimensions: a + 5. This is broadcasting. It only works when the shapes are compatible (each pair of dimensions is equal or one of them is 1); otherwise the arrays need to be the same shape
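A small sketch of broadcasting, assuming a 2x3 array `a`. The last case shows shapes that cannot be broadcast together.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)

plus_five = a + 5                       # the scalar is applied to every element

row = np.array([10, 20, 30])            # shape (3,) matches the last dimension
plus_row = a + row                      # row is added to each row of a

col = np.array([[100], [200]])          # shape (2, 1)
plus_col = a + col                      # col is added to each column of a

try:
    a + np.array([1, 2])                # shape (2,) does not match the last dimension (3)
except ValueError as err:
    print("cannot broadcast:", err)
```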
Custom Function Vector: np.vectorize(customFunc)(a) will vectorize a function so it works on an array
Combining two arrays: np.vectorize(myFunction, otypes=[np.float64])(a, b)
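A sketch of `np.vectorize` with one and with two input arrays. The functions `clip_negative` and `weighted_sum` are hypothetical stand-ins for `customFunc` and `myFunction`.

```python
import numpy as np

def clip_negative(value):
    """Hypothetical scalar function: replace negative values with 0."""
    return value if value > 0 else 0

def weighted_sum(p, q):
    """Hypothetical scalar function of two inputs."""
    return 0.5 * p + 2 * q

a = np.array([-2, 3, -1, 4])
b = np.array([1, 1, 2, 2])

clipped = np.vectorize(clip_negative)(a)                          # -> [0, 3, 0, 4]
combined = np.vectorize(weighted_sum, otypes=[np.float64])(a, b)  # -> [1.0, 3.5, 3.5, 6.0]
```

Note that `np.vectorize` is a convenience wrapper around a Python loop, so it helps with readability rather than speed.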
The apply_along_axis Function: np.apply_along_axis(customFunc, 0, a). axis = 0 applies the function to each column and axis = 1 applies it to each row
Outer Product function: np.multiply.outer(x, y) takes 2 arrays as input and creates a 2D array that has a value for every possible combination of elements from the two arrays
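The two functions above could be used as follows. `value_range` is a hypothetical stand-in for `customFunc`, and the arrays are illustrative.

```python
import numpy as np

def value_range(v):
    """Hypothetical function of a 1D slice: largest minus smallest value."""
    return v.max() - v.min()

a = np.array([[1, 5, 9],
              [4, 2, 6]])

per_column = np.apply_along_axis(value_range, 0, a)   # one result per column -> [3, 3, 3]
per_row = np.apply_along_axis(value_range, 1, a)      # one result per row -> [8, 4]

x = np.array([1, 2, 3])
y = np.array([10, 20])
table = np.multiply.outer(x, y)   # table[i, j] == x[i] * y[j], shape (3, 2)
```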
Predefined functions that we can call on NumPy arrays: sum, max, std, var, argmin (index of smallest element), mean, all, median, percentile
Using Axis: x.min(axis=0)
Boolean Conditions: np.all(x < 8, axis=1)
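A brief sketch of a few of these aggregations, with and without an axis. The array `x` is an assumption for this example.

```python
import numpy as np

x = np.array([[1, 9, 3],
              [4, 5, 6]])

total = x.sum()            # over all elements -> 28
col_min = x.min(axis=0)    # minimum of each column -> [1, 5, 3]
row_max = x.max(axis=1)    # maximum of each row -> [9, 6]
smallest_at = x.argmin()   # index of the smallest element in the flattened array -> 0

rows_below_8 = np.all(x < 8, axis=1)   # is every element in the row below 8? -> [False, True]
```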
Pandas
Basics
import pandas as pd
DataFrame:
table with heterogeneous elements, labelled columns, and indexed rows.
states = pd.DataFrame({'population': population, 'area': area})
pd.DataFrame([[1,2,3],[3,4,5]], columns = ['A', 'B', 'C'], index = ['1','2'])
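Both construction styles above, as a runnable sketch. The notes do not define `population` and `area`, so the Series below use placeholder labels and numbers, not real data.

```python
import pandas as pd

# Placeholder inputs (assumed): two Series sharing the same index labels.
population = pd.Series({'X': 100, 'Y': 250})
area = pd.Series({'X': 10, 'Y': 50})

# From a dictionary of Series: keys become column labels.
states = pd.DataFrame({'population': population, 'area': area})

# From a list of rows, with explicit column labels and row index.
df = pd.DataFrame([[1, 2, 3], [3, 4, 5]],
                  columns=['A', 'B', 'C'],
                  index=['1', '2'])

print(states)
print(df.loc['2', 'B'])   # row label '2', column 'B' -> 4
```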
Creating a DataFrame with Hierarchical Columns: df = pd.DataFrame(d, index=[])
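Operations on DataFrames: add, A.stack().mean()
stack Method: .stack() moves the column labels into the row index and, for a df with a single column level, gives back a Series. If you call it with a level parameter, it tells which column level of a hierarchical df should move into the index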
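A sketch of `.stack()` on a plain and on a hierarchical DataFrame. Building the column levels with `pd.MultiIndex.from_product` is an assumption, since the notes do not show how `d` was built. Recent pandas versions may emit a FutureWarning about the stack implementation.

```python
import pandas as pd

# Single column level: stack() returns a Series indexed by (row, column).
df = pd.DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['A', 'B'])
s = df.stack()
overall_mean = df.stack().mean()   # mean over all values -> 2.5

# Two column levels (assumed construction for illustration).
columns = pd.MultiIndex.from_product([['A', 'B'], ['x', 'y']])
wide = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                    index=['r1', 'r2'], columns=columns)

# level=0 moves the outer column level ('A'/'B') into the row index.
long = wide.stack(level=0)
```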
Broadcasting: .describe() to get summary information about x
np.nan to specify a missing value. Data cleaning: dropna() keeps only the rows without missing values and creates a new DataFrame. data.fillna() fills NaN values
.set_index() to move a column to be the index
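The cleaning steps above, sketched on a tiny DataFrame. The column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'name': ['a', 'b', 'c'],
                     'score': [1.0, np.nan, 3.0]})   # np.nan marks a missing value

complete = data.dropna()           # new DataFrame without the row containing NaN
filled = data.fillna(0)            # new DataFrame with NaN replaced by 0
indexed = data.set_index('name')   # the 'name' column becomes the row index
summary = data.describe()          # count, mean, std, min, quartiles, max per numeric column
```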
joining dataframes
.join: joins on a column or on the index. The join keys need to have the same type
.merge: more complete than join
how="" could be either inner (only keys present in both) or outer (all keys from both)
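A sketch of inner and outer merges plus an index-based join. The frames `left` and `right` are assumptions for this example.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

inner = pd.merge(left, right, on='key', how='inner')   # only keys in both: b, c
outer = pd.merge(left, right, on='key', how='outer')   # all keys, NaN where a side is missing

# join works on the index; by default it keeps the rows of the calling frame.
joined = left.set_index('key').join(right.set_index('key'))
```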
analyzing dfs
groupby(): .count() per group, .groups (for lookups)
.nunique(): number of unique values
.value_counts(): number of rows for each unique value
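The analysis helpers above, sketched on a small DataFrame with assumed column names and values.

```python
import pandas as pd

df = pd.DataFrame({'team': ['red', 'blue', 'red', 'blue', 'red'],
                   'city': ['X', 'X', 'Y', 'X', 'Y']})

counts = df.groupby('team').count()    # non-null values per column for each team
groups = df.groupby('team').groups     # mapping: team -> row labels in that group

n_cities = df['city'].nunique()        # number of distinct cities -> 2
per_team = df['team'].value_counts()   # rows per team: red 3, blue 2
```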
Pivot Tables
numeric values
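A minimal pivot table sketch, assuming a hypothetical `sales` frame. The aggregated column has to be numeric, and summing is only one possible aggregation.

```python
import pandas as pd

# Hypothetical data for illustration.
sales = pd.DataFrame({'region': ['N', 'N', 'S', 'S'],
                      'product': ['p', 'q', 'p', 'q'],
                      'amount': [1.0, 2.0, 3.0, 4.0]})

# One row per region, one column per product, cells hold the summed amount.
table = sales.pivot_table(values='amount', index='region',
                          columns='product', aggfunc='sum')
print(table)
```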
Series
data.loc[0] to access elements in a Series. .loc slicing is by label and includes the end label; .iloc slicing is by position and excludes the end, like regular Python
like a DataFrame with a single column
using custom indexes: data = pd.Series(np.arange(5) + 5, index=['a', ..])
mySeries = pd.Series(pythonDictionary)
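A sketch of the Series operations above. The index labels and the dictionary contents are assumptions, since the notes leave them out.

```python
import numpy as np
import pandas as pd

# Custom index labels (assumed for illustration).
data = pd.Series(np.arange(5) + 5, index=['a', 'b', 'c', 'd', 'e'])

by_label = data.loc['a':'c']   # label slice includes the end label -> a, b, c
by_pos = data.iloc[0:2]        # position slice excludes the end -> a, b

# From a dictionary: keys become the index.
my_series = pd.Series({'x': 1, 'y': 2})
```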
Time Series
Create dates in Python: datetime
In numpy: np.datetime64
Vector arithmetic works
In pandas: pd.to_datetime()
Index
period_range() for a period and not a specific timestamp
pd.to_timedelta(): tracks durations (e.g. a number of days) rather than exact dates
pd.date_range()
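The date tools above side by side, as a hedged sketch. All dates and frequencies are arbitrary choices for illustration.

```python
from datetime import datetime

import numpy as np
import pandas as pd

py_date = datetime(2024, 1, 15)          # plain Python date and time
np_date = np.datetime64('2024-01-15')    # NumPy date with day precision
next_days = np_date + np.arange(3)       # vector arithmetic: the 15th, 16th, 17th

ts = pd.to_datetime('2024-01-15')        # pandas Timestamp parsed from a string

days = pd.date_range('2024-01-01', periods=3, freq='D')   # index of timestamps
months = pd.period_range('2024-01', periods=3, freq='M')  # index of whole months

delta = pd.to_timedelta(2, unit='D')     # a duration of two days
shifted = days + delta                   # every timestamp moved two days later
```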