Lecture 4
Pre-processing
Data Understanding
-
-
-
To prepare a meaningful and useful data-set for data mining, we first need to investigate and understand the available data
-
Data Cleaning
Data scrubbing
-
-
Missing values
-
-
Deleting examples
Deleting redundant or noisy training examples may improve the accuracy and/or reduce the complexity of predictive models
-
-
-
Data Reduction
-
As a rule of thumb, one should aim at having at least 10 times more instances than attributes in historical data for learning a model
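The rule of thumb above can be checked with simple arithmetic. The following is a minimal sketch; the function names and the `ratio` parameter are illustrative assumptions, not part of the lecture or of any library.

```python
def meets_rule_of_thumb(n_instances, n_attributes, ratio=10):
    """True if there are at least `ratio` times more instances than attributes."""
    return n_instances >= ratio * n_attributes


def max_attributes(n_instances, ratio=10):
    """Largest number of attributes the rule of thumb allows for this many instances."""
    return n_instances // ratio


# Hypothetical data-set sizes, chosen only for illustration.
print(meets_rule_of_thumb(500, 50))   # 500 instances, 50 attributes
print(meets_rule_of_thumb(499, 50))   # one instance short
print(max_attributes(250))            # attributes to keep for 250 instances
```

If the check fails, the options are to collect more instances or to reduce the number of attributes.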
-
Dimensionality reduction
Dimensionality reduction aims at reducing the number of attributes, so that the data becomes more focused on the target that we want to predict
-
-
-
-
-
Attribute Evaluation
-
-
Attribute subset search
If we have k attributes, there are 2^k - 1 possible non-empty subsets of attributes to choose from
e.g. if we have the three attributes "area", "granite" and "no. of rooms", we have the following 2^3 - 1 = 7 options to select as input attributes
\[
\begin{matrix}
\text{"area"} \\
\text{"granite"} \\
\text{"no. of rooms"} \\
\text{"area" and "granite"} \\
\text{"granite" and "no. of rooms"} \\
\text{"area" and "no. of rooms"} \\
\text{"area", "granite", "no. of rooms"}
\end{matrix}
\]
To make an informed decision, we would need to compute a score for every subset, or build and evaluate a model for each one
In larger data-sets, trying all possible combinations (exhaustive search) may not be feasible or practical, because the number of subsets doubles with every added attribute
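The enumeration above can be sketched in a few lines of Python. This is only an illustration of why exhaustive search scales badly; the helper name `nonempty_subsets` is an assumption made for this example, and no scoring or model building is shown.

```python
from itertools import combinations


def nonempty_subsets(attributes):
    """Return every non-empty subset of `attributes`, smallest subsets first."""
    subsets = []
    for size in range(1, len(attributes) + 1):
        subsets.extend(combinations(attributes, size))
    return subsets


attributes = ["area", "granite", "no. of rooms"]
subsets = nonempty_subsets(attributes)
for subset in subsets:
    print(subset)
print(len(subsets))  # 2**3 - 1 = 7

# The count grows exponentially with the number of attributes k.
for k in (10, 20, 30):
    print(k, 2**k - 1)
```

An exhaustive search would score each of these subsets in turn, which quickly becomes impractical as k grows.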
Search methods
-
-
-
-
Many more search methods and variations are available in WEKA; they may be interesting to explore for the assignment
-