Data Wrangling, Cleaning and Tidying
Data Wrangling
Data Wrangling focuses on the overall preparation and transformation of data. Its core tasks include removing duplicates, handling missing values, correcting errors, and so on.
df.shape
The .shape attribute in pandas gives the number of rows and columns in our dataset as (rows, columns).
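Example (a minimal sketch, assuming a DataFrame named df has already been loaded):
# .shape returns a tuple of (number of rows, number of columns)
print(df.shape)   # e.g. (1000, 8) means 1000 rows and 8 columns
rows, columns = df.shape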
Preliminary data cleaning
duplicate data
To remove any duplicate rows, we can use the drop_duplicates() function.
Example:
restaurants = df.drop_duplicates()
Convert the column names to all lowercase
map() applies the str.lower() function to each of the column names in our dataset to convert them to all lowercase.
Example:
restaurants.columns = map(str.lower, restaurants.columns)
Rename columns
We can use the rename() function and a dictionary to relabel our columns.
Example:
restaurants = restaurants.rename({'dba': 'name', 'cuisine description': 'cuisine'}, axis=1)
# axis=1 refers to the columns, axis=0 would refer to the rows
Data Types
data types
Let's take a look at the data types in our DataFrame. We can use .dtypes.
Example:
restaurants.dtypes
unique values
It might be informative to look at the number of unique values in each column using the nunique() function.
Example:
restaurants.nunique()
Missing data
To do this we can use isna() to identify whether a value is missing. This gives us a boolean indicating whether the observation in that column is missing (True) or not (False). We also use sum() to count the number of missing values, since isna() returns True for missing entries.
restaurants.isna().sum()
print(df.column.unique())           # inspect the distinct values in a column
heart[heart.isnull().any(axis=1)]   # show all rows that contain at least one missing value
In our DataFrame we have missing values in latitude and longitude, recorded as (0.000, 0.000). We need to replace them with NaN. We will use the where() function to replace the 0.000 coordinates with np.nan.
Example:
# here .where() keeps latitudes above 40 and replaces the rest (the 0.000 placeholders) with NaN
restaurants['latitude'] = restaurants['latitude'].where(restaurants['latitude'] > 40, np.nan)
# here .where() keeps longitudes below -70 and replaces the rest (the 0.000 placeholders) with NaN
restaurants['longitude'] = restaurants['longitude'].where(restaurants['longitude'] < -70, np.nan)
restaurants.isna().sum()
Characterizing missingness with crosstab
Let’s try to understand the missingness in the url column by counting the missing values across each borough. We will use the crosstab() function in pandas to do this.
cross_tab = pd.crosstab(index=restaurants['boro'], columns=restaurants['url'].isna(), rownames=['boro'], colnames=['url is na'])
Removing prefixes
It might be easier to read the url links by removing the website prefixes, such as "https://www.". We will use str.lstrip() to remove the prefixes.
Example:
restaurants['url'] = restaurants['url'].str.lstrip('https://')
restaurants['url'] = restaurants['url'].str.lstrip('www.')
Data Tidying
Data Tidying (or Data Cleaning) refers to the process of structuring data according to the principles of tidy data.
According to tidy data principles, each variable should have its own column, each observation should have its own row, and each type of observational unit should form its own table.
Melt
The melt() function in pandas is used to convert data from wide format to long format. This process is called "melting" the data.
import pandas as pd
data = {'Student': ['Alice', 'Bob', 'Carl'], 'Math': [85, 92, 78], 'Chemistry': [82, 90, 85]}
df = pd.DataFrame(data)
melted_df = df.melt(id_vars='Student', var_name='Subject', value_name='Score')
Regular Expressions
Regular expressions are special sequences of characters that describe a pattern of text that is to be matched
We can use literals to match the exact characters that we desire
Alternation, using the pipe symbol |, allows us to match the text preceding or following the |
Character sets, denoted by a pair of brackets [], let us match one character from a series of characters
Wildcards, represented by the period or dot ., will match any single character (letter, number, symbol or whitespace)
Ranges allow us to specify a range of characters in which we can make a match (for example: [a-z], [0-9] or [A-Z])
Shorthand character classes like \w, \d and \s represent the ranges of word characters, digit characters, and whitespace characters, respectively (their opposites are \W, \D, \S)
Groupings, denoted with parentheses (), group parts of a regular expression together, and allow us to limit alternation to part of a regex
Fixed quantifiers, represented with curly braces {}, let us indicate the exact quantity or a range of quantity of a character we wish to match
Optional quantifiers, indicated by the question mark ?, allow us to indicate that a character in a regex is optional, i.e. can appear either 0 times or 1 time
The Kleene star, denoted with the asterisk *, is a quantifier that matches the preceding character 0 or more times
The Kleene plus, denoted by the plus +, matches the preceding character 1 or more times
The anchor symbols, hat ^ and dollar sign $, are used to match text at the start and end of a string, respectively
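A minimal sketch, using Python's re module, that illustrates a few of these patterns (the sample strings are hypothetical):
import re
# character set with a range plus a shorthand class
print(re.search(r'[A-Z]\w+', 'hello World'))        # matches 'World'
# grouping with alternation and an optional quantifier
print(re.findall(r'(cat|dog)s?', 'cats and dogs'))  # ['cat', 'dog']
# shorthand classes, the Kleene plus, and anchors
print(bool(re.match(r'^\d+\s\w+$', '30 reps')))     # True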
HOW TO CLEAN DATA WITH PYTHON
Diagnose the Data
For data to be tidy, it must have:
1. Each variable as a separate column
2. Each row as a separate observation
pandas functions to explore and probe the dataset:
.head() — display the first 5 rows of the table
.info() — display a summary of the table
.describe() — display the summary statistics of the table
.columns — display the column names of the table
.value_counts() — display counts of the distinct values in a column
df[column].unique() — display the unique values of a column
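A minimal sketch applying these calls, assuming the data lives in a hypothetical students.csv with a grade column:
import pandas as pd
df = pd.read_csv('students.csv')
print(df.head())                   # first 5 rows
df.info()                          # column names, non-null counts, dtypes
print(df.describe())               # summary statistics for numeric columns
print(df.columns)                  # column names
print(df['grade'].value_counts())  # counts of each distinct value
print(df['grade'].unique())        # the unique values themselves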
Useful questions for inspecting data:
How many (non-null) observations do we have?
How many unique columns/features do we have?
Which columns (if any) contain missing data?
What is the data type of each column?
Dealing with Multiple Files
Let's say that we have a ton of files following the filename structure 'file1.csv', 'file2.csv', 'file3.csv', and so on, and we need to combine them into a single DataFrame.
The glob.glob() function takes a wildcard pattern (not a full regular expression) as input and returns the matching filenames.
import glob
import pandas as pd
student_files = glob.glob("exams*.csv")
df_list = []
for filename in student_files:
    df_list.append(pd.read_csv(filename))
students = pd.concat(df_list)
Reshaping your Data
Suppose we want to reshape our data, and we have a table like this:
Account | Checking | Savings
"12456543" | 8500 | 8900
"12283942" | 6410 | 8020
"12839485" | 78000 | 92000
We want to reshape it so that it complies with tidy principles, into a table that looks more like:
Account | Account Type | Amount
"12456543" | "Checking" | 8500
"12456543" | "Savings" | 8900
"12283942" | "Checking" | 6410
"12283942" | "Savings" | 8020
"12839485" | "Checking" | 78000
"12839485" | "Savings" | 92000
To do that, you would use pd.melt() to perform this transformation.
df = pd.melt(frame=df, id_vars="Account", value_vars=["Checking", "Savings"], value_name="Amount", var_name="Account Type")
The parameters you provide are:
frame: the DataFrame you want to melt
id_vars: the column(s) of the old DataFrame to preserve
value_vars: the column(s) of the old DataFrame that you want to turn into variables
value_name: what to call the column of the new DataFrame that stores the values
var_name: what to call the column of the new DataFrame that stores the variables
Dealing with Duplicates
To check for duplicates, we can use the pandas function df.duplicated(), which will return a Series telling us which rows are duplicate rows.
We can use the pandas .drop_duplicates() function to remove all rows that are duplicates of another row.
If we wanted to remove every row with a duplicate value in the item column, we could specify a subset:
fruits = fruits.drop_duplicates(subset=['item'])
Example:
duplicates = students.duplicated()
print(duplicates.value_counts())
students = students.drop_duplicates()
Splitting by Index | Dealing with multiple measurements recorded in the same column
We should split that data by index. For example, the column gender_age (with data like "M14") should be separated into gender and age:
students["gender"] = students.gender_age.str[:1]
students["age"] = students.gender_age.str[1:]
After splitting by index, we set the students DataFrame to be the students DataFrame with all columns except gender_age:
students = students[['full_name','grade','score','gender','age', 'exam']]
Splitting by Character | Dealing with multiple measurements recorded in the same column
For example, we need to split the data in the full_name column. The data looks like "Timofei Strowan", and we should split it on " " and store the parts in two new columns, first_name and last_name:
name_split = students['full_name'].str.split()
students['first_name'] = name_split.str.get(0)
students['last_name'] = name_split.str.get(1)
String Parsing
If we inspect our DataFrame (using print(df.dtypes)) we know the data types. Sometimes we need to modify strings in our DataFrames to help us transform them into more meaningful metrics.
For example, we have this DataFrame (fruit):
item | price | calories
"banana" | "$1" | 105
And we need to convert price to floats.
First we need to get rid of all of the dollar signs using regex:
fruit.price = fruit['price'].replace({r'\$': ''}, regex=True)
Then we convert the data to numeric using the .to_numeric() function:
fruit.price = pd.to_numeric(fruit.price)
More String Parsing
Suppose we had this DataFrame df representing a workout regimen:
date | exerciseDescription
10/18/2018 | "lunges - 30 reps"
We'd like to separate the data in "exerciseDescription" into 2 columns: "exercise" and "reps".
To extract the numbers from the string we can use pandas' .str.split() function:
split_df = df['exerciseDescription'].str.split(r'(\d+)', expand=True)
which would result in this DataFrame split_df:
0 | 1 | 2
"lunges - " | "30" | "reps"
First we convert the reps data to numeric:
df['reps'] = pd.to_numeric(split_df[1])
Then, we can assign columns from this DataFrame to the original df:
df['exercise'] = split_df[0].replace('[- ]', '', regex=True)
Missing Values
The missing elements normally show up as NaN (or Not a Number) values.
Method 1: drop all of the rows with a missing value
bill_df = bill_df.dropna()
If we wanted to remove every row with a NaN value in the num_guests column only, we could specify a subset:
bill_df = bill_df.dropna(subset=['num_guests'])
Method 2: fill the missing values with the mean of the column, or with some other aggregate value.
We can use .fillna() to do this:
bill_df = bill_df.fillna(value={"bill": bill_df.bill.mean(), "num_guests": bill_df.num_guests.mean()})
Missing Data
Introduction to Handling Missing Data
Why Data Might Be Missing
One of the simplest causes of missing data is that the data was never provided in the first place.
How to fix?
Verify that data was uploaded correctly in the first place.
Since most missing data use cases happen from a systematic error, try to find the culprit and correct the faulty data feed.
Identifying and Checking for Missing Data
Try looking at small chunks of the data.
Look at either the beginning or end of a dataset, or retrieve a random sampling of data to look at. If there is a significant amount of missing data, it will be apparent by doing so.
Look at statistics for the entire dataset.
We can count how many values there are in each column of your dataset, and note any discrepancies. If a column has a count lower than our total number of rows, it has missing data!
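For instance, a quick sketch of both checks on a hypothetical DataFrame df:
print(df.sample(10))    # random sample of rows to eyeball
print(df.count())       # non-null count per column
print(df.isna().sum())  # missing-value count per column; compare against len(df)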
Types of Missing Data
How important is this data to the larger dataset?
It seems obvious that we can't use data that doesn't fit our questions – but you'd be surprised how often we accidentally try to.
For example, let’s imagine we are collecting data for a health survey. We are collecting activity level (measured in minutes), location (city), and blood pressure. If we are missing a lot of data on blood pressure, then we can’t use that dataset to answer questions about blood pressure. We can still use it to understand the relationship between location and activity, but not blood pressure.
Different types of missing data
Structurally Missing Data: we expect this data to be missing for some logical reason
Missing Completely at Random (MCAR): the probability of any datapoint being MCAR is the same for all data points – this type of missing data is mostly hypothetical
Missing at Random (MAR): the probability of any data point being MAR is the same within groups of the observed data – this is much more realistic than MCAR
Missing at Random (MAR) means the missing values can be explained by other observed variables in the dataset. That is, the probability of a value being missing depends on the values of other variables, but not on the missing value itself. Visually this can look random and unsystematic.
Missing Not at Random (MNAR): there is some reason why the data is missing
It's best practice to always assume that the data is MNAR and try to uncover clues as to that reason. As the analyst, you will probably never truly know why data is missing, but finding a pattern can often inform whether the MNAR data is important to your study or not. And knowing that can help you decide what to do about the missing data.
Missing Not at Random (MNAR) means the missing values depend on the missing value itself, or on unobserved factors that are not captured in the dataset. This can point to hidden causes or mechanisms that lead to the missing values. Visually, MNAR can show up as systematic patterns that depend on other variables.
Data about Data
When trying to diagnose the type of missingness, data about the data (aka metadata) can be invaluable. The date/time the data was collected, how it was collected, who collected it, where it was collected, etc. can all give invaluable clues to solving the problem of missing data.
Handling Missing Data with Deletion
When is it safe to use deletion?
The missing data is either MAR or MCAR.
However, if the percentage of missing data is too high, then we can’t delete the data — we would be reducing our sample size too much.
The missing data has a low correlation with other features in the data
If the missing data is not important for what we’re doing, then we can safely remove that data.
Types of deletion
Listwise Deletion
Listwise deletion, also known as complete-case analysis, is a technique in which we remove the entire observation when there is missing data. It is usually performed with MAR or MCAR missing data.
In general, we should be cautious when using listwise deletion, as we lose a lot of information when we remove an entire row. This means we would have less confidence in the accuracy of any conclusions we draw from the resulting dataset.
As a best practice, we should only use listwise deletion when the number of rows with missing data is relatively small, to avoid significant bias (less than 5% of data is missing).
Pandas
Drop rows that have any missing data:
data.dropna(inplace=True)
Pairwise Deletion
In pairwise deletion, we only remove rows when there are missing values in the variables we are directly analyzing.
Pairwise deletion has the advantage of retaining as much data as possible, while still letting us handle the missing data for our key variables.
In general, pairwise deletion is the preferred technique to use.
Pandas
data.dropna(subset=['Height', 'Education'],  # only looks at these two columns
            inplace=True,                    # removes the rows and keeps the data variable
            how='any')                       # removes data with missing data in either field
Dropping Variables
If a variable is missing enough data, then we really don’t know enough about that variable to use it in our analysis, so we can’t be confident in any conclusions we draw from it.
We generally don't want to drop entire variables. We should only drop a variable as a last resort, and if that variable is missing a very significant amount of data (at least 60%).
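A minimal sketch of this rule of thumb, assuming a DataFrame df and the 60% threshold mentioned above:
# fraction of missing values in each column
missing_share = df.isna().mean()
# drop only the columns missing at least 60% of their values
too_sparse = missing_share[missing_share >= 0.6].index
df = df.drop(columns=too_sparse)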
Single Imputation
What is time-series data?
The stock ticker data we were watching is called a time-series dataset. These datasets have interesting properties and analytics capabilities because they track data over (usually) set periods of time.
The methods that we will talk about in this article are specific to time-series data because of this property.
Is it MNAR?
We must verify that our missing data can be categorized as MNAR — these techniques assume that to be the case.
There are two key aspects to be able to accurately describe missing data as MNAR:
Use domain knowledge: Probably the quickest way to identify MNAR is by having extensive knowledge of the data and industry we are working in. Missing data that falls into MNAR will look and feel different from "normal" data in our dataset.
For example, someone might know that data in a survey is missing in a particular column because the participant was either too embarrassed to answer, or didn’t know the answer. This would let us know that the data is MNAR.
Analyze the dataset to find patterns:
As we explore, filter, and separate our data, we should ultimately be able to identify something about our missing data that doesn’t line up with everything else.
For example, if we have some survey data, we might find that our missing data almost exclusively comes from men older than 55 years old. If we see a pattern like this emerge, we can confidently say it is MNAR.
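As a sketch of that kind of pattern hunt (the gender, age, and income columns here are hypothetical):
import pandas as pd
# share of missing income responses within each gender / age-band group
df['age_band'] = pd.cut(df['age'], bins=[0, 35, 55, 120])
print(df.groupby(['gender', 'age_band'])['income'].apply(lambda s: s.isna().mean()))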
What can we do?
If we have identified the missing data as MNAR, the best way to handle this missing data is by using the data around it to "fill in the blanks". This is called single imputation, and there are many ways that we can tackle this problem.
LOCF
LOCF stands for Last Observation Carried Forward.
With this technique, we are going to fill in the missing data with the previous value. LOCF is often used when we see a relatively consistent pattern that has continued to either increase or decrease over time.
If your data is in a pandas DataFrame, we can use the .ffill() method (forward fill, another name for LOCF) on a particular column:
df['comfort'] = df['comfort'].ffill()
If our data is in a NumPy array, there is a commonly used library called impyute that has a variety of time-series functions.
Applying LOCF to the dataset:
impyute.imputation.ts.locf(data, axis=0)
NOCB
NOCB stands for Next Observation Carried Backward, and it solves imputation in the opposite direction of LOCF. NOCB is usually used when we have more recent data, and we know enough about the past to fill in the blanks that way.
If your data is in a pandas DataFrame, we can use the .bfill() method on a particular column:
df['comfort'] = df['comfort'].bfill()
To use impyute to perform NOCB:
impyute.imputation.ts.nocb(data, axis=0)
BOCF
One such alternative is BOCF, or Baseline Observation Carried Forward. In this approach, the initial values for a given variable are applied to missing values. This is a common approach in medical studies, particularly in drug studies.
Example:
For example, we could assume that missing data for a reported pain level could return to the baseline value. This would happen if a drug were not working as well, so it could be a safe assumption to make if the data is trending in that direction
# Isolate the first (baseline) value for our data
baseline = df['concentration'].iloc[0]
# Replace missing values with our baseline value
df['concentration'] = df['concentration'].fillna(value=baseline)
WOCF
WOCF, or Worst Observation Carried Forward. With this kind of imputation, we want to assume that the data is the worst possible value. This would be useful if the purpose of our analysis was to record improvement in some value (for example, if we wanted to study if a treatment was helping a particular patient's condition).
# Isolate the worst pain value (in this case, the highest)
worst = df['pain'].max()
# Replace all missing values with the worst value
df['pain'] = df['pain'].fillna(value=worst)
What are the disadvantages?
Data often will change in unexpected ways. Single imputation will ignore these potential changes and will “smooth” out our data, which could lead to inaccurate results.
In general, single imputation can be an effective technique to handle missing data for our time-series datasets. While it might not be the most sophisticated approach, it can be a useful quick-fix in the right circumstances.
Multiple Imputation
What is Multiple Imputation:
Multiple imputation is a technique for filling in missing data, in which we replace the missing data multiple times. Multiple imputation, in particular, is used when we have missing data across multiple categorical columns in our dataset. After we have tried different values, we use an algorithm to pick the best values to replace our missing data. By doing this, we are able to, over time, find the correct value for our missing data.
When to use it
Multiple imputation is best for MAR data, so we should ensure that our data fits that description.
Benefits, if we meet the criteria for using multiple imputation:
We can safely assume that our data won’t be biased, since we start the process off with a random assignment of values for the missing data.
Because the goal of multiple imputation is to have a model that fits the data, we can be pretty confident that the resulting data will be a close approximation of the real data.
How to use it
One place to start would be with the IterativeImputer module within sklearn.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # enables IterativeImputer
from sklearn.impute import IterativeImputer
# Load the DataFrame from a CSV file
df = pd.read_csv('your_data.csv')
# Build a test dataset with missing values
dTest = df.sample(frac=0.2, random_state=0)              # example: take a 20% sample
dTest = dTest.mask(np.random.random(dTest.shape) < 0.2)  # example: randomly mask ~20% of the values
# Create the IterativeImputer object
imp = IterativeImputer(max_iter=10, random_state=0)
# Fit the imputer on the test dataset
imp.fit(dTest)
# Use the fitted imputer to fill in missing values in the original dataset
dfComplete = pd.DataFrame(np.round(imp.transform(df), 1), columns=df.columns)
# Print the first 10 rows of the completed dataset
print(dfComplete.head(10))