Data Wrangling, Cleaning and Tidying
Data Wrangling
Data Wrangling focuses on the overall preparation and transformation of data. Its core tasks include removing duplicates, handling missing values, correcting errors, and so on.
df.shape
The .shape attribute in pandas gives the number of rows and columns in our dataset as (rows, columns).
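Example (a minimal sketch, assuming a DataFrame named df has already been loaded):
# .shape returns a tuple of (number of rows, number of columns)
print(df.shape)   # e.g. (1000, 8) means 1000 rows and 8 columns
rows, columns = df.shape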
Preliminary data cleaning
duplicate data
To remove any duplicate rows, we can use the drop_duplicates() function.
Example:
restaurants = df.drop_duplicates()
Convert the column names to all lowercase
map() applies the str.lower() function to each of the column names in our dataset to convert them to all lowercase.
Example:
restaurants.columns = map(str.lower, restaurants.columns)
Rename columns
We can use the rename() function and a dictionary to relabel our columns.
Example:
restaurants = restaurants.rename({'dba': 'name', 'cuisine description': 'cuisine'}, axis=1)
# axis=1 refers to the columns, axis=0 would refer to the rows
Data Types
data types
Let's take a look at the data types in our DataFrame. We can use .dtypes.
Example:
restaurants.dtypes
unique values
It might be informative to look at the number of unique values in each column using the nunique() function.
Example:
restaurants.nunique()
Missing data
To do this we can use isna() to identify whether a value is missing. This gives us a boolean indicating whether the observation in that column is missing (True) or not (False). We also use sum() to count the number of missing values, since isna() returns True for missing entries.
restaurants.isna().sum()
print(df.column.unique())           # inspect the distinct values in a column
heart[heart.isnull().any(axis=1)]   # show all rows that contain at least one missing value
In our DataFrame we have missing values in latitude and longitude, recorded as (0.000, 0.000). We need to replace them with NaN. We will use the where() function to replace the 0.000 coordinates with np.nan.
Example:
# here .where() keeps latitudes above 40 and replaces the rest (the 0.000 placeholders) with NaN
restaurants['latitude'] = restaurants['latitude'].where(restaurants['latitude'] > 40, np.nan)
# here .where() keeps longitudes below -70 and replaces the rest (the 0.000 placeholders) with NaN
restaurants['longitude'] = restaurants['longitude'].where(restaurants['longitude'] < -70, np.nan)
restaurants.isna().sum()
Characterizing missingness with crosstab
Let’s try to understand the missingness in the url column by counting the missing values across each borough. We will use the crosstab() function in pandas to do this.
cross_tab = pd.crosstab(index=restaurants['boro'], columns=restaurants['url'].isna(), rownames=['boro'], colnames=['url is na'])
Removing prefixes
It might be easier to read the url links by removing the website prefixes, such as "https://www.". We will use str.lstrip() to remove the prefixes.
Example:
restaurants['url'] = restaurants['url'].str.lstrip('https://')
restaurants['url'] = restaurants['url'].str.lstrip('www.')
Data Tidying
Data Tidying (or Data Cleaning) refers to the process of structuring data according to the principles of tidy data.
According to tidy data principles, each variable should have its own column, each observation should have its own row, and each type of observational unit should form its own table.
Melt
The melt() function in pandas is used to convert data from wide format to long format. This process is called "melting" the data.
import pandas as pd
data = {'Student': ['Alice', 'Bob', 'Carl'], 'Math': [85, 92, 78], 'Chemistry': [82, 90, 85]}
df = pd.DataFrame(data)
melted_df = df.melt(id_vars='Student', var_name='Subject', value_name='Score')
Regular Expressions
Regular expressions are special sequences of characters that describe a pattern of text that is to be matched
We can use literals to match the exact characters that we desire
Alternation, using the pipe symbol |, allows us to match the text preceding or following the |
Character sets, denoted by a pair of brackets [], let us match one character from a series of characters
Wildcards, represented by the period or dot ., will match any single character (letter, number, symbol or whitespace)
Ranges allow us to specify a range of characters in which we can make a match (for example: [a-z], [0-9] or [A-Z])
Shorthand character classes like \w, \d and \s represent the ranges of word characters, digit characters, and whitespace characters, respectively (their opposites are \W, \D, \S)
Groupings, denoted with parentheses (), group parts of a regular expression together, and allow us to limit alternation to part of a regex
Fixed quantifiers, represented with curly braces {}, let us indicate the exact quantity or a range of quantity of a character we wish to match
Optional quantifiers, indicated by the question mark ?, allow us to indicate that a character in a regex is optional, i.e. can appear either 0 times or 1 time
The Kleene star, denoted with the asterisk *, is a quantifier that matches the preceding character 0 or more times
The Kleene plus, denoted by the plus +, matches the preceding character 1 or more times
The anchor symbols, hat ^ and dollar sign $, are used to match text at the start and end of a string, respectively
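A minimal sketch, using Python's re module, that illustrates a few of these patterns (the sample strings are hypothetical):
import re
# character set with a range plus a shorthand class
print(re.search(r'[A-Z]\w+', 'hello World'))        # matches 'World'
# grouping with alternation and an optional quantifier
print(re.findall(r'(cat|dog)s?', 'cats and dogs'))  # ['cat', 'dog']
# shorthand classes, the Kleene plus, and anchors
print(bool(re.match(r'^\d+\s\w+$', '30 reps')))     # True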
HOW TO CLEAN DATA WITH PYTHON
Diagnose the Data
For data to be tidy, it must have:
1. Each variable as a separate column
2. Each row as a separate observation
pandas functions to explore and probe the dataset:
.head() — display the first 5 rows of the table
.info() — display a summary of the table
.describe() — display the summary statistics of the table
.columns — display the column names of the table
.value_counts() — display counts of the distinct values in a column
df[column].unique() — display the unique values of a column
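A minimal sketch applying these calls, assuming the data lives in a hypothetical students.csv with a grade column:
import pandas as pd
df = pd.read_csv('students.csv')
print(df.head())                   # first 5 rows
df.info()                          # column names, non-null counts, dtypes
print(df.describe())               # summary statistics for numeric columns
print(df.columns)                  # column names
print(df['grade'].value_counts())  # counts of each distinct value
print(df['grade'].unique())        # the unique values themselves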
Useful questions for inspecting data:
How many (non-null) observations do we have?
How many unique columns/features do we have?
Which columns (if any) contain missing data?
What is the data type of each column?
Dealing with Multiple Files
Let's say that we have a ton of files following the filename structure 'file1.csv', 'file2.csv', 'file3.csv', and so on, and we need to combine them into a single DataFrame.
The glob.glob() function takes a wildcard pattern (not a full regular expression) as input and returns the matching filenames.
import glob
import pandas as pd
student_files = glob.glob("exams*.csv")
df_list = []
for filename in student_files:
    df_list.append(pd.read_csv(filename))
students = pd.concat(df_list)
Reshaping your Data
Suppose we want to reshape our data, and we have a table like this:
Account | Checking | Savings
"12456543" | 8500 | 8900
"12283942" | 6410 | 8020
"12839485" | 78000 | 92000
We want to reshape it so that it complies with tidy principles, into a table that looks more like:
Account | Account Type | Amount
"12456543" | "Checking" | 8500
"12456543" | "Savings" | 8900
"12283942" | "Checking" | 6410
"12283942" | "Savings" | 8020
"12839485" | "Checking" | 78000
"12839485" | "Savings" | 92000
To do that, you would use pd.melt() to perform this transformation.
df = pd.melt(frame=df, id_vars="Account", value_vars=["Checking", "Savings"], value_name="Amount", var_name="Account Type")
The parameters you provide are:
frame: the DataFrame you want to melt
id_vars: the column(s) of the old DataFrame to preserve
value_vars: the column(s) of the old DataFrame that you want to turn into variables
value_name: what to call the column of the new DataFrame that stores the values
var_name: what to call the column of the new DataFrame that stores the variables
Dealing with Duplicates
To check for duplicates, we can use the pandas function df.duplicated(), which will return a Series telling us which rows are duplicate rows.
We can use the pandas .drop_duplicates() function to remove all rows that are duplicates of another row.
If we wanted to remove every row with a duplicate value in the item column, we could specify a subset:
fruits = fruits.drop_duplicates(subset=['item'])
Example:
duplicates = students.duplicated()
print(duplicates.value_counts())
students = students.drop_duplicates()
Splitting by Index | Dealing with multiple measurements recorded in the same column
We should split that data by index. For example, the column gender_age (with data like "M14") should be separated into gender and age:
students["gender"] = students.gender_age.str[:1]
students["age"] = students.gender_age.str[1:]
After splitting by index, we set the students DataFrame to be the students DataFrame with all columns except gender_age:
students = students[['full_name','grade','score','gender','age', 'exam']]
Splitting by Character | Dealing with multiple measurements recorded in the same column
For example, we need to split the data in the full_name column. The data looks like "Timofei Strowan", and we should split it on " " and store the parts in two new columns, first_name and last_name:
name_split = students['full_name'].str.split()
students['first_name'] = name_split.str.get(0)
students['last_name'] = name_split.str.get(1)
String Parsing
If we inspect our DataFrame (using print(df.dtypes)) we know the data types. Sometimes we need to modify strings in our DataFrames to help us transform them into more meaningful metrics.
For example, we have this DataFrame (fruit):
item | price | calories
"banana" | "$1" | 105
And we need to convert price to floats.
First we need to get rid of all of the dollar signs using regex:
fruit.price = fruit['price'].replace({r'\$': ''}, regex=True)
Then we convert the data to numeric using the .to_numeric() function:
fruit.price = pd.to_numeric(fruit.price)
More String Parsing
Suppose we had this DataFrame df representing a workout regimen:
date | exerciseDescription
10/18/2018 | "lunges - 30 reps"
We'd like to separate the data in "exerciseDescription" into 2 columns: "exercise" and "reps".
To extract the numbers from the string we can use pandas' .str.split() function:
split_df = df['exerciseDescription'].str.split(r'(\d+)', expand=True)
which would result in this DataFrame split_df:
0 | 1 | 2
"lunges - " | "30" | "reps"
First we convert the reps data to numeric:
df['reps'] = pd.to_numeric(split_df[1])
Then, we can assign columns from this DataFrame to the original df:
df['exercise'] = split_df[0].replace('[- ]', '', regex=True)
Missing Values
The missing elements normally show up as NaN (or Not a Number) values.
Method 1: drop all of the rows with a missing value
bill_df = bill_df.dropna()
If we wanted to remove every row with a NaN value in the num_guests column only, we could specify a subset:
bill_df = bill_df.dropna(subset=['num_guests'])
Method 2: fill the missing values with the mean of the column, or with some other aggregate value.
We can use .fillna() to do this:
bill_df = bill_df.fillna(value={"bill": bill_df.bill.mean(), "num_guests": bill_df.num_guests.mean()})
Missing Data
Introduction to Handling Missing Data
Why Data Might Be Missing
One of the simplest causes of missing data is that the data was never provided in the first place.
How to fix?
Verify that data was uploaded correctly in the first place.
Since most missing data use cases happen from a systematic error, try to find the culprit and correct the faulty data feed.
Identifying and Checking for Missing Data
Try looking at small chunks of the data.
Look at either the beginning or end of a dataset, or retrieve a random sampling of data to look at. If there is a significant amount of missing data, it will be apparent by doing so.
Look at statistics for the entire dataset.
We can count how many values there are in each column of your dataset, and note any discrepancies. If a column has a count lower than our total number of rows, it has missing data!
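For instance, a quick sketch of both checks on a hypothetical DataFrame df:
print(df.sample(10))    # random sample of rows to eyeball
print(df.count())       # non-null count per column
print(df.isna().sum())  # missing-value count per column; compare against len(df)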
Types of Missing Data
How important is this data to the larger dataset?
It seems obvious that we can't use data that doesn't fit our questions – but you'd be surprised how often we accidentally try to.
For example, let’s imagine we are collecting data for a health survey. We are collecting activity level (measured in minutes), location (city), and blood pressure. If we are missing a lot of data on blood pressure, then we can’t use that dataset to answer questions about blood pressure. We can still use it to understand the relationship between location and activity, but not blood pressure.
Different types of missing data
Structurally Missing Data: we expect this data to be missing for some logical reason
Missing Completely at Random (MCAR): the probability of any datapoint being MCAR is the same for all data points – this type of missing data is mostly hypothetical
Missing at Random (MAR): the probability of any data point being MAR is the same within groups of the observed data – this is much more realistic than MCAR
Missing at Random (MAR) means the missing values can be explained by other observed variables in the dataset. That is, the probability of a value being missing depends on the values of other variables, but not on the missing value itself. Visually this can look random and unsystematic.
Missing Not at Random (MNAR): there is some reason why the data is missing
It's best practice to always assume that the data is MNAR and try to uncover clues as to that reason. As the analyst, you will probably never truly know why data is missing, but finding a pattern can often inform whether the MNAR data is important to your study or not. And knowing that can help you decide what to do about the missing data.
Missing Not at Random (MNAR) means the missing values depend on the missing value itself, or on unobserved factors that are not captured in the dataset. This can point to hidden causes or mechanisms that lead to the missing values. Visually, MNAR can show up as systematic patterns that depend on other variables.
Data about Data
When trying to diagnose the type of missingness, data about the data (aka metadata) can be invaluable. The date/time the data was collected, how it was collected, who collected it, where it was collected, etc. can all give invaluable clues to solving the problem of missing data.
Handling Missing Data with Deletion
When is it safe to use deletion?
The missing data is either MAR or MCAR.
However, if the percentage of missing data is too high, then we can’t delete the data — we would be reducing our sample size too much.
The missing data has a low correlation with other features in the data
If the missing data is not important for what we’re doing, then we can safely remove that data.
Types of deletion
Listwise Deletion
Listwise deletion, also known as complete-case analysis, is a technique in which we remove the entire observation when there is missing data. It is usually performed with MAR or MCAR missing data.
In general, we should be cautious when using listwise deletion, as we lose a lot of information when we remove an entire row. This means we would have less confidence in the accuracy of any conclusions we draw from the resulting dataset.
As a best practice, we should only use listwise deletion when the number of rows with missing data is relatively small, to avoid significant bias (less than 5% of data is missing).
Pandas
Drop rows that have any missing data:
data.dropna(inplace=True)
Pairwise Deletion
In pairwise deletion, we only remove rows when there are missing values in the variables we are directly analyzing.
Pairwise deletion has the advantage of retaining as much data as possible, while still letting us handle the missing data for our key variables.
In general, pairwise deletion is the preferred technique to use.
Pandas
data.dropna(subset=['Height', 'Education'],  # only looks at these two columns
            inplace=True,                    # removes the rows and keeps the data variable
            how='any')                       # removes data with missing data in either field
Dropping Variables
If a variable is missing enough data, then we really don’t know enough about that variable to use it in our analysis, so we can’t be confident in any conclusions we draw from it.
We generally don't want to drop entire variables. We should only drop a variable as a last resort, and if that variable is missing a very significant amount of data (at least 60%).
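A minimal sketch of this rule of thumb, assuming a DataFrame df and the 60% threshold mentioned above:
# fraction of missing values in each column
missing_share = df.isna().mean()
# drop only the columns missing at least 60% of their values
too_sparse = missing_share[missing_share >= 0.6].index
df = df.drop(columns=too_sparse)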
Single Imputation
What is time-series data?
The stock ticker data we were watching is called a time-series dataset. These datasets have interesting properties and analytics capabilities because they track data over (usually) set periods of time.
The methods that we will talk about in this article are specific to time-series data because of this property.
Is it MNAR?
We must verify that our missing data can be categorized as MNAR — these techniques assume that to be the case.
There are two key aspects to be able to accurately describe missing data as MNAR:
Use domain knowledge: Probably the quickest way to identify MNAR is by having extensive knowledge of the data and industry we are working in. Missing data that falls into MNAR will look and feel different from "normal" data in our dataset.
For example, someone might know that data in a survey is missing in a particular column because the participant was either too embarrassed to answer, or didn’t know the answer. This would let us know that the data is MNAR.
Analyze the dataset to find patterns:
As we explore, filter, and separate our data, we should ultimately be able to identify something about our missing data that doesn’t line up with everything else.
For example, if we have some survey data, we might find that our missing data almost exclusively comes from men older than 55 years old. If we see a pattern like this emerge, we can confidently say it is MNAR.
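As a sketch of that kind of pattern hunt (the gender, age, and income columns here are hypothetical):
import pandas as pd
# share of missing income responses within each gender / age-band group
df['age_band'] = pd.cut(df['age'], bins=[0, 35, 55, 120])
print(df.groupby(['gender', 'age_band'])['income'].apply(lambda s: s.isna().mean()))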
What can we do?
If we have identified the missing data as MNAR, the best way to handle this missing data is by using the data around it to "fill in the blanks". This is called single imputation, and there are many ways that we can tackle this problem.
LOCF
LOCF stands for Last Observation Carried Forward.
With this technique, we are going to fill in the missing data with the previous value. LOCF is often used when we see a relatively consistent pattern that has continued to either increase or decrease over time.
If your data is in a pandas DataFrame, we can use the .ffill() method (forward fill, another name for LOCF) on a particular column:
df['comfort'] = df['comfort'].ffill()
If our data is in a NumPy array, there is a commonly used library called impyute that has a variety of time-series functions.
Applying LOCF to the dataset:
impyute.imputation.ts.locf(data, axis=0)
NOCB
NOCB stands for Next Observation Carried Backward, and it solves imputation in the opposite direction of LOCF. NOCB is usually used when we have more recent data, and we know enough about the past to fill in the blanks that way.
If your data is in a pandas DataFrame, we can use the .bfill() method on a particular column:
df['comfort'] = df['comfort'].bfill()
To use impyute to perform NOCB:
impyute.imputation.ts.nocb(data, axis=0)
BOCF
One such alternative is BOCF, or Baseline Observation Carried Forward. In this approach, the initial values for a given variable are applied to missing values. This is a common approach in medical studies, particularly in drug studies.
Example:
For example, we could assume that missing data for a reported pain level could return to the baseline value. This would happen if a drug were not working as well, so it could be a safe assumption to make if the data is trending in that direction
# Isolate the first (baseline) value for our data
baseline = df['concentration'].iloc[0]
# Replace missing values with our baseline value
df['concentration'] = df['concentration'].fillna(value=baseline)
WOCF
WOCF, or Worst Observation Carried Forward. With this kind of imputation, we want to assume that the data is the worst possible value. This would be useful if the purpose of our analysis was to record improvement in some value (for example, if we wanted to study if a treatment was helping a particular patient's condition).
# Isolate the worst pain value (in this case, the highest)
worst = df['pain'].max()
# Replace all missing values with the worst value
df['pain'] = df['pain'].fillna(value=worst)
What are the disadvantages?
Data often will change in unexpected ways. Single imputation will ignore these potential changes and will “smooth” out our data, which could lead to inaccurate results.
In general, single imputation can be an effective technique to handle missing data for our time-series datasets. While it might not be the most sophisticated approach, it can be a useful quick-fix in the right circumstances.
Multiple Imputation
What is Multiple Imputation:
Multiple imputation is a technique for filling in missing data, in which we replace the missing data multiple times. Multiple imputation, in particular, is used when we have missing data across multiple categorical columns in our dataset. After we have tried different values, we use an algorithm to pick the best values to replace our missing data. By doing this, we are able to, over time, find the correct value for our missing data.
When to use it
Multiple imputation is best for MAR data, so we should ensure that our data fits that description.
Benefits, if we meet the criteria for using multiple imputation:
We can safely assume that our data won’t be biased, since we start the process off with a random assignment of values for the missing data.
Because the goal of multiple imputation is to have a model that fits the data, we can be pretty confident that the resulting data will be a close approximation of the real data.
How to use it
One place to start would be with the IterativeImputer module within sklearn.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # enables IterativeImputer
from sklearn.impute import IterativeImputer
# Load the DataFrame from a CSV file
df = pd.read_csv('your_data.csv')
# Build a test dataset with missing values
dTest = df.sample(frac=0.2, random_state=0)              # example: take a 20% sample
dTest = dTest.mask(np.random.random(dTest.shape) < 0.2)  # example: randomly mask ~20% of the values
# Create the IterativeImputer object
imp = IterativeImputer(max_iter=10, random_state=0)
# Fit the imputer on the test dataset
imp.fit(dTest)
# Use the fitted imputer to fill in missing values in the original dataset
dfComplete = pd.DataFrame(np.round(imp.transform(df), 1), columns=df.columns)
# Print the first 10 rows of the completed dataset
print(dfComplete.head(10))