Lecture 4 Pre-processing (Data Understanding (Exploring all available…

- - - - The digital revolution
      - Healthcare systems are shifting from patient care in hospitals to monitored care at home
      - Remote patient management systems are being used
        
        Doctors can:
        
        Monitor the conditions of patients remotely
        
        Sensor data
        
        Adjust therapy
        
        Advise
        
        Patients can:
        
        Report symptoms
        
        Exchange information with medical professionals
        
        e.g. http://www.lifewatch.com/
    - - Patients that have had a heart attack are being monitored at home
      - The goal is to predict worsening of the conditions (hospitalisation)
      - Data mining setting
        (Individual predictive model for each patient)
        
        An example
        
        instance is a day for a given patient
        
        Input attributes
        
        Pressure
        
        Weight
        
        Complaints
        
        Medical records
        
        Target
        
        Hospitalisation within the next 14 days?
        
        {Yes, No}
        
        Many data sources need to be integrated
        
        Aligning the time stamps
        
        Different frequencies of different data
        
        Missing/inaccurate data needs to be taken care of
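
A minimal sketch of how the daily instances described above could be assembled for one patient. The data values, the admission date, and the helper names (`mean_or_none`, `build_instances`) are hypothetical; readings taken at different frequencies are averaged per day, and a day without readings gets a missing value.

```python
from datetime import date, timedelta

# Hypothetical raw data for one patient: sensors report at different
# frequencies, so some days have several readings and some have none.
pressure = {date(2012, 10, 1): [120, 124], date(2012, 10, 2): [130]}
weight = {date(2012, 10, 1): [81.5]}
admissions = [date(2012, 10, 16)]  # assumed hospital admission dates


def mean_or_none(values):
    """Average the readings of one day; None marks a missing value."""
    return sum(values) / len(values) if values else None


def build_instances(days, pressure, weight, admissions, horizon=14):
    """Return one row per day: (day, pressure, weight, target)."""
    rows = []
    for day in days:
        # Target: is there an admission within the next `horizon` days?
        target = any(
            timedelta(0) < admitted - day <= timedelta(days=horizon)
            for admitted in admissions
        )
        rows.append((
            day,
            mean_or_none(pressure.get(day, [])),
            mean_or_none(weight.get(day, [])),
            "Yes" if target else "No",
        ))
    return rows


days = [date(2012, 10, 1) + timedelta(days=i) for i in range(3)]
instances = build_instances(days, pressure, weight, admissions)
```

Each tuple is one instance: aligned by calendar day, with `None` left in place for the missing-data step to handle later.
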
- - - - Simple errors
      - Inconsistencies
      - Spelling variants
      - Varying formats
        
        e.g. dates
    - - Make the formats uniform e.g:
        
        Dates
        
        Times
      - Split fields that carry mixed information
        
        e.g. "Chocolate, 100g" -> "Chocolate" "100g"
        
        e.g. "23/10/2012" -> "2012" "10" or "autumn"
      - Normalise addresses and names, possibly ignoring word order
      - Convert numerical values into standard units
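
The clean-up steps above can be sketched as small helper functions. The function names, the day/month/year input format, the northern-hemisphere season mapping, and the choice of grams as the standard unit are all assumptions made for illustration.

```python
from datetime import datetime

# Assumed mapping from month number to (northern hemisphere) season.
SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def split_product(field):
    """Split a mixed field such as "Chocolate, 100g" into name and quantity."""
    name, quantity = (part.strip() for part in field.split(",", 1))
    return name, quantity


def split_date(text, fmt="%d/%m/%Y"):
    """Parse a date in one known format and return year, month and season."""
    parsed = datetime.strptime(text, fmt)
    return str(parsed.year), f"{parsed.month:02d}", SEASONS[parsed.month]


def to_grams(value, unit):
    """Convert a weight to grams (the assumed standard unit)."""
    factors = {"g": 1.0, "kg": 1000.0}
    return value * factors[unit]


product = split_product("Chocolate, 100g")
date_parts = split_date("23/10/2012")
grams = to_grams(2, "kg")
```
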
    - - Missing data
        
        Value is missing
        
        Data contains factual errors
        
        Subjective identification -> declaring missing
        
        e.g. price -99 GBP; year of construction 2999
        
        Outliers
        
        Extreme values
        
        Subjective identification -> declaring missing
      - Handling missing data
        
        Ignoring
        
        Some tools can handle this automatically
        
        e.g. WEKA
        
        Deleting examples with missing values
        
        Replacing all missing values by a global constant
        
        e.g. "missing"
        
        Imputing missing values
        
        Replacing missing values with the average or mode
        
        From the training data for a given attribute
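
Replacing missing values with the average (numeric attributes) or the mode (nominal attributes) could look roughly like this. The replacement values are learned from the training data only; `fit_imputer` and `impute` are hypothetical names, `None` is assumed to mark a missing value, and every column is assumed to contain at least one known value.

```python
from collections import Counter


def fit_imputer(train_rows):
    """Learn one replacement value per attribute from the training data."""
    fills = []
    for column in zip(*train_rows):
        known = [v for v in column if v is not None]
        if all(isinstance(v, (int, float)) for v in known):
            fills.append(sum(known) / len(known))  # numeric: average
        else:
            fills.append(Counter(known).most_common(1)[0][0])  # nominal: mode
    return fills


def impute(rows, fills):
    """Replace every missing value with the learned value for its attribute."""
    return [
        tuple(fill if value is None else value for value, fill in zip(row, fills))
        for row in rows
    ]


train = [(120, "none"), (None, "cough"), (140, "none"), (130, None)]
fills = fit_imputer(train)
completed = impute(train, fills)
```

The same `fills` would then be applied to test data, so that no information from the test set leaks into the replacement values.
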
      - Deleting examples
        
        Deleting redundant or noisy training examples may improve accuracy and/or reduce complexity of the predictive models
        
        Examples can be considered noisy if
        
        Most of the attribute values are missing
        
        Two training examples have exactly the same input attribute values, but different labels
        
        Duplicated examples can be considered redundant
        
        We can remove one
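
The deletion rules above can be sketched as a single filtering pass. The threshold for "most attribute values are missing" (more than half) and the name `clean_examples` are assumptions; examples are taken to be `(inputs, label)` pairs with `None` marking a missing value.

```python
def clean_examples(examples, max_missing_fraction=0.5):
    """Drop noisy and redundant (inputs, label) training examples."""
    # Collect every label that was observed for each input vector.
    labels_seen = {}
    for inputs, label in examples:
        labels_seen.setdefault(inputs, set()).add(label)

    kept, emitted = [], set()
    for inputs, label in examples:
        missing = sum(value is None for value in inputs)
        if missing > max_missing_fraction * len(inputs):
            continue  # noisy: most attribute values are missing
        if len(labels_seen[inputs]) > 1:
            continue  # noisy: same inputs, different labels
        if inputs in emitted:
            continue  # redundant: exact duplicate, keep only the first
        emitted.add(inputs)
        kept.append((inputs, label))
    return kept


examples = [
    ((120, 80.0), "No"),
    ((120, 80.0), "No"),     # duplicate of the first example
    ((150, 90.0), "Yes"),
    ((150, 90.0), "No"),     # conflicts with the previous example
    ((None, None), "No"),    # all attribute values missing
]
cleaned = clean_examples(examples)
```
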
- - - - Potentially improve prediction accuracy
  - - - e.g. from numeric to binary
      - Discretisation, binarisation
        
        Converting numerical data to categorical
        
        WEKA: filters -> unsupervised -> attribute -> Discretize
        
        Changing attribute format
        
        WEKA: filters -> unsupervised -> attribute ->
        
        NominalToBinary
        
        NumericToBinary
        
        NumericToNominal
      - Normalisation transforms numeric attributes to fall within [0, 1]
        
        WEKA: filters -> unsupervised -> attribute -> Normalize
      - Standardisation transforms numeric attributes to have zero mean and unit variance
        
        WEKA: filters -> unsupervised -> attribute -> Standardize
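
The three transformations above, written out for a single numeric attribute. This is an illustrative sketch, not WEKA's implementation: equal-width bins and the population standard deviation are assumptions (a tool may use other binning strategies or the sample standard deviation), and the attribute is assumed not to be constant.

```python
from statistics import mean, pstdev


def normalise(values):
    """Min-max scaling: map the values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardise(values):
    """Shift and scale the values to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def discretise(values, bins):
    """Equal-width binning: replace each value by its bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # The maximum value would land in bin `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


values = [2.0, 4.0, 6.0, 8.0]
scaled = normalise(values)
z_scores = standardise(values)
binned = discretise(values, bins=2)
```
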
    - - Derived values can help "present" better the information about the target label that we want to predict
      - We can use our background knowledge about the application to create new attributes
        
        e.g. given length and width we can compute area
        
        e.g. given start and end coordinates of a journey we can compute the travel distance
      - In WEKA
        
        We can create derived attributes using math expressions
        
        e.g. new_attribute = attribute1 + attribute2
        
        WEKA: filters -> unsupervised -> attribute -> MathExpression
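
The two derived attributes mentioned above, as a small sketch. The function names are hypothetical, and the travel distance is assumed to be the straight-line (Euclidean) distance between planar coordinates; real journeys would usually need a road or geodesic distance.

```python
import math


def area(length, width):
    """Derived attribute: area from length and width."""
    return length * width


def travel_distance(start, end):
    """Derived attribute: straight-line distance between two (x, y) points."""
    return math.hypot(end[0] - start[0], end[1] - start[1])


room_area = area(4.0, 2.5)
journey = travel_distance((0.0, 0.0), (3.0, 4.0))
```
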
    - - e.g. reducing correlation between attributes
        
        Which may have a negative impact on the performance of:
        
        Linear regression
        
        Decision trees
        
        Etc.
- - - - The more the better
  - - - Dimensionality reduction can improve accuracy
      - Can reduce computational time
      - Lead to simpler and more interpretable predictive models
    - - Attribute selection
        
        Discard some attributes, keep others
      - Attribute extraction/transformation
        
        Transform data mathematically to get new attributes
    - - WEKA can automatically suggest which attribute to select
  - - - Compute a score describing how well an attribute or a set of attributes explain the target label
      - Does not build a predictive model for attribute selection
      - In WEKA:
        
        attributeSelection -> FilteredAttributeEval
        
        attributeSelection -> GainRatioAttributeEval
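
As an illustration of a score computed without building a predictive model, here is a sketch of the gain ratio of one nominal attribute with respect to the target label. It follows the usual textbook definition (information gain divided by split information) and is not taken from WEKA's source; the attribute values and labels are made up.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of an attribute divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["high", "high", "low", "low"]
informative = gain_ratio(["yes", "yes", "no", "no"], labels)
uninformative = gain_ratio(["a", "b", "a", "b"], labels)
```

An attribute that separates the labels perfectly scores 1.0 here, while one that is independent of the label scores 0.0.
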
    - - Build a predictive model and test how well it predicts using different attributes
      - In WEKA:
        
        attributeSelection -> WrapperSubsetEval
        
        Select the inner classification/regression algorithm as a parameter
    - - If we have k attributes, there are 2^k-1 possible combinations for how to select attributes
        
        e.g. if we have three attributes, "area", "granite" and "no. of rooms", we have the following options to select as input attributes:
        \[ \begin{matrix} \text{"area"} \\ \text{"granite"} \\ \text{"no. of rooms"} \\ \text{"area" and "granite"} \\ \text{"granite" and "no. of rooms"} \\ \text{"area" and "no. of rooms"} \\ \text{"area", "granite", "no. of rooms"} \end{matrix} \]
        
        In order to make an informed decision we would need to compute scores for all of them or try to build models and evaluate all of them
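
The candidate subsets for the three-attribute example can be enumerated directly, which also confirms the count of 2^k - 1 non-empty combinations. A minimal sketch using only the standard library:

```python
from itertools import combinations

attributes = ["area", "granite", "no. of rooms"]

# Every non-empty subset, from single attributes up to the full set.
subsets = [
    subset
    for size in range(1, len(attributes) + 1)
    for subset in combinations(attributes, size)
]

for subset in subsets:
    print(subset)
```
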
      - In larger data-sets it may not be feasible and practical to try all possible combinations (exhaustive search)
      - Search methods
        
        Exhaustive search -> try all possible combinations
        
        In WEKA: AttributeSelection -> ExhaustiveSearch
        
        Linear forward search -> add/remove a fixed number of attributes at every iteration
        
        In WEKA: AttributeSelection -> LinearForwardSelection
        
        Random Search -> blind search
        
        In WEKA: AttributeSelection -> RandomSearch
        
        Genetic -> guided blind search
        
        In WEKA: AttributeSelection -> GeneticSearch
        
        There are many more search methods/variations available in WEKA, they may be interesting to explore for the assignment
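
To show how a search method and a wrapper-style evaluation fit together, here is a sketch of a plain greedy forward search: it adds the single best attribute per iteration and stops when the score no longer improves. This is a simplified variant, not any specific WEKA search class. The inner model (leave-one-out 1-nearest-neighbour accuracy), the data, and all names are assumptions for illustration.

```python
def loo_1nn_accuracy(rows, labels, subset):
    """Wrapper score: leave-one-out accuracy of 1-nearest-neighbour."""
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[a] - rows[j][a]) ** 2 for a in subset),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def forward_selection(attributes, score):
    """Greedy forward search over attribute subsets."""
    selected, best = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break  # no improvement: stop searching
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best


# Hypothetical data: "noise" carries no information about the label.
rows = [
    {"area": 50, "rooms": 2, "noise": 7},
    {"area": 55, "rooms": 2, "noise": 1},
    {"area": 120, "rooms": 5, "noise": 6},
    {"area": 130, "rooms": 5, "noise": 2},
]
labels = ["low", "low", "high", "high"]

selected, best = forward_selection(
    ["area", "rooms", "noise"],
    lambda subset: loo_1nn_accuracy(rows, labels, subset),
)
```

The search evaluates far fewer subsets than an exhaustive search, at the cost of possibly missing the best combination.
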
- - - - Examples:
        
        Text
        
        Web data (click-stream)
        
        Images
        
        Music
        
        Network data
    - - Convert it into tabular format
      - Examples:
        
        Classify textual news articles according to the topic
        
        Predicting web user behaviour based on click-stream
        
        Automated postcode digit recognition from scans
  - - - e.g. health
        
        What could be the input attributes?
    - - e.g.
        
        Document 1: "to be or not to be"
        
        Document 2: "I am not sick"
        
        Document 3: "I may be sick"
    - - Also normalises data with respect to document length and word frequency
      - Pre-process tab: filters -> unsupervised -> attribute -> StringToWordVector
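
A minimal bag-of-words sketch for the three example documents, turning each text into a row of word counts. It uses raw counts and whitespace tokenisation only; the normalisation with respect to document length and word frequency mentioned above is deliberately left out.

```python
from collections import Counter

documents = ["to be or not to be", "I am not sick", "I may be sick"]

# Lower-case and split on whitespace (a deliberately simple tokeniser).
tokenised = [doc.lower().split() for doc in documents]

# One column per distinct word, in alphabetical order.
vocabulary = sorted({word for doc in tokenised for word in doc})

# One row per document: how often each vocabulary word occurs in it.
vectors = [[Counter(doc)[word] for word in vocabulary] for doc in tokenised]

print(vocabulary)
for vector in vectors:
    print(vector)
```
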
  - - - To classify into "painting" or "photo"
      - To classify (fMRI) into "healthy" or "abnormal"
      - Classification/regression tasks in forensics
        
        e.g. fingerprint recognition
      - What could be the input attributes?
    - - Reduce resolution, place a grid, record intensity
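
One way to read "reduce resolution, place a grid, record intensity" is to average the pixel intensities inside each grid cell and use those averages as input attributes. The sketch below assumes a greyscale image stored as a list of rows whose height and width are divisible by the grid size; `grid_features` is a hypothetical name.

```python
def grid_features(image, rows, cols):
    """Average pixel intensity in each cell of a rows x cols grid."""
    height, width = len(image), len(image[0])
    cell_h, cell_w = height // rows, width // cols
    features = []
    for r in range(rows):
        for c in range(cols):
            cell = [
                image[y][x]
                for y in range(r * cell_h, (r + 1) * cell_h)
                for x in range(c * cell_w, (c + 1) * cell_w)
            ]
            features.append(sum(cell) / len(cell))
    return features


# A 4x4 greyscale image reduced to a 2x2 grid of intensities.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
features = grid_features(image, 2, 2)
```

Each image then becomes one row of a table with `rows * cols` numeric attributes.
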