ML
Intermediate ML
Data Cleaning
Scaling & Normalizing
Normalization
The point of normalization is to change your observations
so that they can be described as a normal distribution.
In general, you'll only want to normalize your data if you're going to be using a machine learning or statistics technique that assumes your data is normally distributed.
Linear regression, Gaussian..., LDA, t-tests, ANOVA
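A minimal sketch of normalization with SciPy's Box-Cox transform (assuming scipy is available; the data here is synthetic and only illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)  # positive, right-skewed data

# Box-Cox transforms positive data toward a normal distribution;
# it returns the transformed values and the fitted lambda parameter.
normalized, fitted_lambda = stats.boxcox(skewed)

print(normalized.shape)
```

After the transform, the skewness should be much closer to zero than for the raw exponential sample.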
Text cleaning
Inconsistent data entry
~80% of inconsistent entries can be fixed with .str.lower() and .str.strip()
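A quick pandas sketch of those two fixes (the country values are made up):

```python
import pandas as pd

countries = pd.Series([' Germany', 'germany ', 'GERMANY'])

# Lowercasing plus stripping whitespace collapses most inconsistent entries.
cleaned = countries.str.lower().str.strip()

print(cleaned.nunique())  # → 1: the three variants collapse to one value
```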
Categorical values
One Hot Encoding
high cardinality: one-hot encoding does not perform well when the categorical
variable takes on a large number of values (roughly > 15)
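One-hot encoding in pandas, on a small illustrative column:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=['color'])

print(list(encoded.columns))  # → ['color_blue', 'color_green', 'color_red']
```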
Cross-validation
For small datasets, where extra computational burden
isn't a big deal, you should run cross-validation.
For larger datasets, a single validation set is sufficient.
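A minimal cross-validation sketch with scikit-learn (synthetic data; model and fold count are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=0)

# 5-fold cross-validation: each fold serves once as the validation set.
scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X, y, cv=5, scoring='neg_mean_absolute_error',
)
print(len(scores))  # → 5: one score per fold
```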
Xgboost
n_estimators
Typical values range from 100 to 1000,
though this depends a lot on the learning_rate
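A sketch of the trade-off using scikit-learn's GradientBoostingRegressor, which exposes the same n_estimators / learning_rate pair as XGBoost's XGBRegressor (the values below are illustrative, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=0.3, random_state=0)

# A smaller learning_rate generally needs more boosting rounds
# (n_estimators) to reach the same fit; the two trade off against each other.
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                  random_state=0)
model.fit(X, y)
print(model.n_estimators_)  # rounds actually fitted
```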
Data leakage
train-test contamination
If your validation is based on a simple train-test split, exclude the validation data from any type of fitting, including the fitting of preprocessing steps
When using cross-validation, it's even more
critical that you do your preprocessing inside the pipeline
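A sketch of preprocessing inside a scikit-learn pipeline (imputer and model are illustrative choices), so each cross-validation fold fits the preprocessor on its own training data only:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Because the imputer lives inside the pipeline, it is re-fit on each
# training fold, so no validation-fold statistics leak into training.
pipeline = make_pipeline(SimpleImputer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)
print(len(scores))  # → 5
```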
Embeddings
purpose
Embeddings suit a categorical variable with many possible values (high cardinality), where only a few of them (often just one) are present in any given observation. Words are a good example.
Embedding layers
maps each element in a set of discrete things (like words, users, or movies) to a dense vector of real numbers (its embedding)
A key implementation detail is that embedding layers
take as input the index of the entity being embedded.
You can think of it as a sort of 'lookup table'
An object's embedding, if it's any good, should capture some useful latent properties of that object.
It's up to the model to discover whatever properties of the entities are useful for the prediction task, and encode them in the embedding space.
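The 'lookup table' view can be sketched in plain NumPy (vocabulary and dimensions are made up; in a real model the table's weights are learned):

```python
import numpy as np

vocab = {'cat': 0, 'dog': 1, 'fish': 2}
embedding_dim = 4

# An embedding layer is just a (vocab_size, embedding_dim) weight matrix;
# looking up a word means indexing into it with the word's integer id.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

vector = embedding_table[vocab['dog']]
print(vector.shape)  # → (4,)
```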
Matrix Factorization
Predicted score = dot product of the two factor vectors (which have the same length)
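For example, in a recommender setting the predicted rating is the dot product of a user vector and an item vector (the factor values below are made up):

```python
import numpy as np

# Hypothetical learned factors, embedding size 4.
user_vec = np.array([0.5, -1.0, 0.2, 0.8])
movie_vec = np.array([1.0, 0.3, -0.5, 0.4])

# Matrix factorization predicts a rating as the dot product of the
# user embedding and the item embedding (same length by construction).
predicted_rating = user_vec @ movie_vec
print(round(float(predicted_rating), 2))  # → 0.42
```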
t-SNE
purpose
It learns a mapping from a set of high-dimensional vectors, to a space with a smaller number of dimensions (usually 2), which is hopefully a good representation of the high-dimensional space.
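A minimal t-SNE sketch with scikit-learn (random data; perplexity is an illustrative choice and must be smaller than the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
high_dim = rng.normal(size=(60, 50))  # 60 points in 50 dimensions

# t-SNE maps the points to 2D while trying to keep similar points close.
low_dim = TSNE(n_components=2, perplexity=10,
               random_state=0).fit_transform(high_dim)
print(low_dim.shape)  # → (60, 2)
```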
AI + AR + Mobile
Mobile AR components
Pattern Matching
Object, Room and Scene Recognition
Anchoring, Tracking, and Persistence
Data Visualization
Trends
sns.lineplot
show trends over a period of time, and multiple lines
can be used to show trends in more than one group
Relationships
Scatter plots
sns.lmplot
draws multiple regression lines when the scatter
plot contains multiple, color-coded groups
if color-coded, we can also show the
relationship with a third categorical variable
Distribution
show the possible values that we can expect to
see in a variable, along with how likely they are
sns.kdeplot
show an estimated, smooth distribution
of a single numerical variable
sns.jointplot
simultaneously displaying a 2D KDE plot with the
corresponding KDE plots for each individual variable.
SQL & BigQuery
BigQuery
hints
For initial exploration, look at just part of the table instead of the whole thing.
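One way to peek at part of a table is a LIMIT clause (the table name below is only illustrative); note that in BigQuery, LIMIT caps the rows returned but not the bytes scanned, so for pure previews the Python client's list_rows with max_results avoids running a query at all:

```sql
-- Preview a handful of rows instead of reading the whole table.
-- (dataset/table name is hypothetical)
SELECT *
FROM `bigquery-public-data.hacker_news.full`
LIMIT 5;
```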