Data mining (Predictive analytics:obtain a model - an approximation of an…

- - - - linear model : linear regression / linear discriminants
      - logical approaches : trees / rules
      - probabilistic approaches : Naive Bayes
      - complex models: Neural Networks/ support vector machines
      - set of models : ensembles
    - - strictness of the assumed functional form of f()
      - the computational complexity of the task of finding the best instance of that form
      - interpretability of the resulting model
- - - - nominal variable
        
        bar plot
      - continuous variable
        
        histogram
        
        idea: divide the range of the numeric variable into bins
        
        count the frequency of cases in each bin
        
        show these counts as the heights of the bars
        
        hist(): obtain the histogram (base R)
        xlab(): set the label for the X axis (ggplot2)
        ggplot(iris, aes(x = Petal.Length)) + geom_histogram() + xlab("")
        
        boxplot()
        
        one of the best options
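The binning idea behind the histogram can be sketched directly in base R. This is only an illustration on the built-in `iris` data; the bin edges (`breaks`) are an assumption chosen for the example, not something the notes specify.

```r
# Sketch of the binning idea behind a histogram, using the built-in iris data.
x <- iris$Petal.Length

# Assumption: 6 equal-width bins spanning 1 to 7 (chosen for illustration).
breaks <- seq(1, 7, by = 1)

# Assign each value to a bin, then count the frequency of each bin.
bins <- cut(x, breaks = breaks, include.lowest = TRUE)
counts <- table(bins)
print(counts)

# hist() performs the same counting; plot = FALSE returns the counts only.
h <- hist(x, breaks = breaks, plot = FALSE)
print(h$counts)
```

Calling `hist(x, breaks = breaks)` without `plot = FALSE` draws the bars whose heights are these counts.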
  - - - Partitioning methods
        
        input: a data set + a target number of clusters k
        
        use the information on the distances between cases in the dataset to obtain the k "best" groups according to a certain criterion
        
        iterative process: some cases may be moved between the clusters to improve the overall quality of the solution
      - Hierarchical methods
        
        obtain a hierarchy of alternative clustering solutions: a dendrogram
        
        reference (in Vietnamese): http://scholar.vimaru.edu.vn/sites/default/files/thinhnv/files/dm_-_chapter_5_-_clustering.pdf
        
        follow a divisive / an agglomerative approach to build the hierarchy
        
        divisive: start with a single group containing all observations, then iteratively split one of the current groups into 2 separate clusters according to some criterion
        
        agglomerative: proceed from n groups to a single group; at each iteration, the 2 most similar groups are selected to be merged
      - Density-based methods
        
        overcome limitations on the shape of clusters by using density
        
        find regions of the feature space where cases are packed together with high density --> cases outside these regions can be flagged as outliers
      - Grid-based methods
        
        obtain clusters using a division of the feature space into a grid-like structure
        
        high computational efficiency, often combined with hierarchical / density-based methods
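As a rough illustration of the hierarchical (agglomerative) branch only, the sketch below uses base R's `hclust()` on the built-in `iris` data. The standardisation step, the complete-linkage criterion, and `k = 3` are assumptions made for the example.

```r
# Agglomerative hierarchical clustering on the numeric columns of iris.
x <- scale(iris[, 1:4])              # standardise so no variable dominates the distance
d <- dist(x, method = "euclidean")   # pairwise distances between cases

# Assumption: complete linkage as the criterion for merging groups.
hc <- hclust(d, method = "complete")

# Cutting the dendrogram yields one of the alternative solutions, here k = 3.
groups <- cutree(hc, k = 3)
print(table(groups))

# plot(hc) would draw the dendrogram itself.
```

Starting from n single-case groups, `hclust()` records n - 1 merges in `hc$merge`, which is the hierarchy the dendrogram displays.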
- - - - 2 important criteria to evaluate a solution
        
        separation: how different is a cluster from the others
        
        compactness: how similar are the cases in each cluster
      - choose a criterion that assigns a score h(c) to each cluster (group of cases) c, given a clustering solution formed by k clusters C = {c1, c2, c3, ..., ck}
      - obtain the overall score of this clustering H(C,k)
      - evaluate each cluster: sum of squares to the center of the cluster / L1 measure to the median value
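The two per-cluster scores and the overall score can be sketched in base R as follows. The helper names `h_ss`, `h_l1`, and `H` are hypothetical, and combining the per-cluster scores by a plain sum is an assumption; the notes do not say how H(C,k) aggregates h(c).

```r
# Score one cluster: sum of squared distances of its cases to the cluster mean.
h_ss <- function(cases) {
  center <- colMeans(cases)
  sum(sweep(cases, 2, center)^2)
}

# Alternative score: L1 distance of the cases to the per-variable median.
h_l1 <- function(cases) {
  center <- apply(cases, 2, median)
  sum(abs(sweep(cases, 2, center)))
}

# Overall score H(C, k): here simply the sum of the per-cluster scores (assumption).
H <- function(x, labels, h) {
  sum(sapply(split(as.data.frame(x), labels), function(g) h(as.matrix(g))))
}

x <- as.matrix(iris[, 1:4])
labels <- as.integer(iris$Species)   # stand-in clustering, used only for illustration
print(H(x, labels, h_ss))
print(H(x, labels, h_l1))
```

With both scores, lower values indicate more compact clusters.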
    - - initialize the centers of the k groups to a set of randomly chosen observations
      - repeat
        
        allocate each observation to the group whose center is nearest
        
        recalculate the center of each group (cluster's center = mean value of the objects in the cluster)
        
        until the groups are stable
      - weakness
        
        distance calculation = Euclidean distance
        
        "maximize" the inter-cluster dissimilarity (separation)
        different starting points for the cluster centers may lead to convergence to different solutions --> approaches to overcome this :question:
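The iterative steps above can be sketched in base R. `simple_kmeans` is a hypothetical helper written for illustration, not a library function, and `max_iter` is an assumed safety cap. The last lines show one common way to reduce sensitivity to the starting centers, the `nstart` argument of `stats::kmeans()`; it is offered as one possible approach, not necessarily the one the notes have in mind.

```r
# Minimal k-means sketch following the steps above (Euclidean distance).
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Initialise the centers with k randomly chosen observations.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  groups <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Squared Euclidean distance from every observation to every center.
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_groups <- max.col(-d, ties.method = "first")  # index of the nearest center
    if (all(new_groups == groups)) break              # groups are stable
    groups <- new_groups
    # Recalculate each center as the mean of its cases
    # (an empty group keeps its previous center).
    for (j in seq_len(k)) {
      if (any(groups == j)) centers[j, ] <- colMeans(x[groups == j, , drop = FALSE])
    }
  }
  list(cluster = groups, centers = centers)
}

set.seed(1)
x <- iris[, 1:4]
fit <- simple_kmeans(x, k = 3)
print(table(fit$cluster))

# Run several random starts and keep the best one, via nstart in stats::kmeans.
best <- kmeans(x, centers = 3, nstart = 25)
print(best$tot.withinss)
```

Re-running `simple_kmeans()` with a different seed may give a different grouping, which is the weakness noted above.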
      - cluster validation
        
        :question:
        
        is the obtained group structure random?
        
        what is the right number of clusters for this dataset?
        
        validation measures
        
        external metric
        
        requires external info: class labels (ground truth), often not available in practice
        
        internal metrics
        
        evaluate from the point of view of cluster compactness / cluster separation
        
        metric: silhouette coefficient
        
        select the ideal number of clusters
        
        compare different clustering solutions
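A possible way to use the silhouette coefficient for choosing the number of clusters is sketched below. It assumes the `cluster` package (normally shipped with R as a recommended package) is installed; the candidate range `2:6`, the scaling step, and the use of `kmeans()` are choices made for the example.

```r
library(cluster)  # assumption: the recommended 'cluster' package is installed

x <- scale(iris[, 1:4])
d <- dist(x)

set.seed(1)
ks <- 2:6   # candidate numbers of clusters (assumed range)

# Average silhouette width for each candidate k.
avg_sil <- sapply(ks, function(k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  sil <- silhouette(fit$cluster, d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- ks
print(round(avg_sil, 3))

# The k with the highest average silhouette width is a candidate "ideal" k.
best_k <- ks[which.max(avg_sil)]
print(best_k)
```

The same average width can be used to compare solutions from different clustering methods on the same data, since it only needs the cluster labels and the distance matrix.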