01 Preintensive Week: Lecture 1: Overview and course introduction (Data…
Lecture 3: Data extraction and storage, data warehousing
- Data extraction
- The process of retrieving data from data sources for further processing and storage;
- Sources can be internal or external;
- Different sources require different extraction methods;
- Structured and unstructured sources;
- The process of extracting data from the web is called web scraping;
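As an illustration, a minimal web-scraping sketch in Python, assuming the requests and BeautifulSoup (bs4) packages; the URL is a placeholder:
```python
# Minimal web-scraping sketch (placeholder URL; real scraping should
# respect the site's robots.txt and terms of use).
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Extract the text of every top-level heading on the page.
for heading in soup.find_all("h1"):
    print(heading.get_text(strip=True))
```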
- Extraction, transformation and loading (ETL)
- ETL is an integral part of data warehousing;
- Extraction involves retrieving data from disparate sources;
- In the loading phase, the extracted data are loaded into a staging area of the data warehouse, where extraction logic is used to ensure that suitable data are added to the warehouse;
- In the transformation phase, the data are transformed to conform to the structure and formats of the data warehouse;
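A toy ETL sketch in pandas, assuming a CSV source file and a SQLite file standing in for the warehouse; file, table, and column names are illustrative:
```python
import pandas as pd
import sqlite3

# Extract: retrieve data from a (hypothetical) source file.
raw = pd.read_csv("sales_source.csv")

# Transform: conform the data to the warehouse's structure and formats.
raw["sale_date"] = pd.to_datetime(raw["sale_date"])
clean = raw.rename(columns={"amt": "amount"}).dropna(subset=["amount"])

# Load: append the transformed rows to a warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("fact_sales", conn, if_exists="append", index=False)
```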
- Extracting data from PDF files
- Many documents are stored in PDF (Portable Document Format);
- PDF files often contain valuable information;
- Various PDF extraction tools are available in R or Python;
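For example, a minimal sketch using the Python pypdf package (one of several such tools; the file name is a placeholder):
```python
from pypdf import PdfReader

reader = PdfReader("report.pdf")
for page in reader.pages:
    print(page.extract_text())  # text layer of each page, if present
```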
- Data storage
- Various ways to store data: databases, data warehouses, document management systems, files (text, binary, multimedia, proprietary formats), and the cloud;
- Data storage should be persistent, robust, secure, consistent, and available;
- The garbage-in, garbage-out principle applies to data quality;
- Data warehousing
- A data warehouse is a decision support database that is maintained separately from the organisation's operational databases;
- Providing a solid platform of consolidated, historical data for analysis and mining;
- Longer time horizon than operational systems;
- Contains a time element;
- Only two operations: initial loading and querying of data;
- Data warehousing architecture: data cubes, dimension tables, data are stored at different levels of detail;
- Concept hierarchies can be created by discretising or grouping numerical values;
- For data warehouses, a multi-dimensional data model is most popular;
- Implemented as a star schema, snowflake schema, or fact constellation schema;
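A small sketch of a star schema in pandas: one fact table referencing two dimension tables, joined and aggregated into a tiny cube (table and column names are illustrative):
```python
import pandas as pd

dim_time = pd.DataFrame({"time_id": [1, 2], "year": [2023, 2024]})
dim_product = pd.DataFrame({"product_id": [10, 20], "category": ["Books", "Games"]})
fact_sales = pd.DataFrame({
    "time_id":    [1, 1, 2, 2],
    "product_id": [10, 20, 10, 20],
    "amount":     [100.0, 250.0, 120.0, 300.0],
})

# Join the fact table to its dimensions and aggregate along them.
cube = (fact_sales
        .merge(dim_time, on="time_id")
        .merge(dim_product, on="product_id")
        .groupby(["year", "category"])["amount"].sum())
print(cube)
```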
- Data Warehouse Operations
- Roll-up;
- Drill-down (roll-down);
- Slice and dice;
- Pivot
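A hedged sketch of these operations in pandas (the sales table and its columns are illustrative assumptions):
```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "city":    ["Canberra", "Sydney", "Canberra", "Sydney"],
    "amount":  [100.0, 250.0, 120.0, 300.0],
})

# Roll-up: aggregate from (year, quarter) up to year.
rollup = sales.groupby("year")["amount"].sum()

# Drill-down: go back to the finer (year, quarter) level.
drilldown = sales.groupby(["year", "quarter"])["amount"].sum()

# Slice: fix one dimension; dice: fix several at once.
canberra = sales[sales["city"] == "Canberra"]

# Pivot: reorient the cube so quarters become columns.
pivot = sales.pivot_table(index="year", columns="quarter",
                          values="amount", aggfunc="sum")
```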
- Data Warehouse Architecture
- Data sources;
- Data Storage;
- OLAP (Online analytical processing) engine;
- Front-end tools;
Week 3 - Lecture 7: Data Transformation, Aggregation and Reduction
- Data Transformation
- Transforming data from one structure or format to another;
1.1 Generalisation
- Transforming data from a lower conceptual level to a higher conceptual level;
- Concept hierarchy: replacing a lower-level concept by a higher-level concept; e.g. Street (highest number of distinct values) < City < State < Country (lowest number of distinct values)
- Value generalisation hierarchy: specifies a hierarchy for the values of an attribute by explicit data grouping;
e.g. {Dickson,Lyneham,Watson} < Canberra
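A minimal sketch of this value generalisation in pandas, reusing the lecture's suburb-to-city grouping:
```python
import pandas as pd

suburb_to_city = {"Dickson": "Canberra", "Lyneham": "Canberra", "Watson": "Canberra"}
df = pd.DataFrame({"suburb": ["Dickson", "Watson", "Lyneham"]})

# Replace the lower-level concept (suburb) by the higher-level one (city).
df["city"] = df["suburb"].map(suburb_to_city)
```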
1.2 Normalisation
- Database normalisation is the process of transforming a database design into something that adheres to a common standard for databases;
- Attribute normalisation brings data values into a certain range:
- Min-max normalisation: rescales values to the range 0 to 1: (value - min) / (max - min)
- Z-score normalisation: (value - mean) / (standard deviation)
- Robust normalisation: (value - median) / (median absolute deviation)
- Logarithm normalisation: useful for skewed distributions with a broad range of numeric values, and for data with outliers or large variance.
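A sketch of the four methods above in numpy (the sample values, including the outlier, are illustrative):
```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])  # note the outlier

min_max = (x - x.min()) / (x.max() - x.min())   # rescale to [0, 1]
z_score = (x - x.mean()) / x.std()              # centre on mean, scale by std dev
mad = np.median(np.abs(x - np.median(x)))       # median absolute deviation
robust = (x - np.median(x)) / mad               # less affected by the outlier
log_norm = np.log(x)                            # compresses a broad, skewed range
```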
1.3 Attribute/feature Selection
- Reduce the number of attributes by removing those that are not significant for a given analysis task;
- Select a minimal set of attributes such that the resulting class probability distribution (or information gain) is as close as possible to that obtained using all attributes;
- The number of possible attribute subsets is exponential, so heuristic methods are used:
- Step-wise forward selection: the best feature is selected first, then the next best, and so on;
- Step-wise backward elimination: the worst feature is eliminated first, then the next worst, and so on;
- Combining forward selection and backward elimination
- Decision-tree induction
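A hedged sketch of step-wise selection using scikit-learn's SequentialFeatureSelector (one possible tool; the dataset here is synthetic):
```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# direction="forward" adds the best feature first;
# direction="backward" eliminates the worst feature first.
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=4,
                                     direction="forward")
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected features
```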
1.4 Attribute/feature Construction:
- The process of adding derived features to data, also known as constructive induction or attribute discovery;
- Construct new attributes based on existing attributes: combining or splitting existing raw attributes into new ones with higher predictive power.
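For instance, a minimal pandas sketch deriving a new attribute from two existing ones (the columns are illustrative):
```python
import pandas as pd

df = pd.DataFrame({"total_price": [10.0, 24.0, 9.0], "quantity": [2, 6, 3]})

# Construct a new attribute from existing raw attributes.
df["unit_price"] = df["total_price"] / df["quantity"]
```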
- Data Aggregation
- Compiling and summarising data to prepare new aggregated data;
- The aim is to get more information about particular groups based on specific attributes, such as age, income, and location.
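A sketch of data aggregation with pandas groupby (the customer table and its columns are illustrative):
```python
import pandas as pd

customers = pd.DataFrame({
    "location": ["ACT", "NSW", "ACT", "NSW"],
    "age":      [25, 41, 33, 29],
    "income":   [55000, 82000, 61000, 47000],
})

# Summarise income and age per location group.
summary = customers.groupby("location").agg(
    mean_income=("income", "mean"),
    mean_age=("age", "mean"),
    count=("age", "size"),
)
```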
2.1 Data Reduction
- The process of reducing data volume by using smaller forms of representation;
- Parametric methods: construct a model that fits the data;
- Non-parametric methods: based on histograms, clustering, and sampling.
2.2 Parametric Methods:
- Linear regression: fits the data to a straight line;
- Multiple regression: models the data as a linear function of multiple variables; non-linear functions can often be transformed into this form;
- Log-linear models: approximate discrete multi-dimensional probability distributions.
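A minimal sketch of parametric reduction: fit a straight line and keep only its two parameters instead of the raw points (synthetic data):
```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.size)

# Two parameters (slope, intercept) now summarise 100 data points.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)
```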
2.3 Non-parametric method:
- Histograms:
- Binning: divides data into buckets and stores a summary for each bucket;
- Binning methods: equal width or equal frequency/depth (see the sketch after this list);
- Clustering:
- Partition data into clusters based on similarity;
- Clustering techniques:
- Centroid-based;
- Connectivity-based;
- Density-based;
- Distribution-based;
- Sampling:
- Generate a small sample to represent the whole dataset;
- Choose a representative subset of the data;
- Sampling methods: Simple random sampling does not perform well on skewed data;
- Stratified sampling: divides data into groups and a probability sample is drawn from each group;
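A hedged sketch of binning and sampling with pandas (the income and state columns are illustrative; clustering is omitted here):
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.exponential(scale=50000, size=1000),  # skewed distribution
    "state":  rng.choice(["ACT", "NSW", "VIC"], size=1000),
})

# Binning: equal-width vs equal-frequency (depth) buckets.
df["equal_width"] = pd.cut(df["income"], bins=5)
df["equal_depth"] = pd.qcut(df["income"], q=5)

# Simple random sampling vs stratified sampling (a fixed fraction per state).
simple = df.sample(frac=0.1, random_state=0)
stratified = df.groupby("state", group_keys=False).sample(frac=0.1, random_state=0)
```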
Summary:
- Data transformation, aggregation, and reduction are used in data science applications to improve the effectiveness and quality of data analysis and mining;
- Data pre-processing includes:
- Data cleaning, transformation, aggregation, and reduction
- Data standardisation and parsing;
- Data integration;