Business Intelligence (week 7-8)
Big data concepts and tools
Big Data
Sources
Banks
Healthcare
Google (20PB/day)
Facebook (500TB/day)
MB -> GB -> TB -> PB
Main Characteristics
1. Volume (Scale)
Large amount of data
Generated data has been increasing exponentially
2. Variety (Complexity)
Structured data (e.g. relational / spreadsheet data)
Semi-structured data (e.g. email / document)
Unstructured data (e.g. video, image)
-> Linking all these data together makes it possible to extract knowledge
3. Velocity (Speed)
Data needs to be processed fast
Late decisions -> Missed opportunities
e.g. Healthcare monitoring = Monitor activity in the body -> Abnormal reaction -> Immediate help required
Formula: Big Data = Transactions + Operations + Interactions
Other Characteristics
4. Veracity
Quality and accuracy of the data
5. Variability
Data flows can be inconsistent, with periodic peaks
6. Value
Provides business value
Popular Big Data Technologies
MapReduce
Goal: Achieve high performance with "simple" computers.
Good at processing large volumes of multi-structured data in a timely manner.
Used for indexing the web for search, graph analysis, text analysis and machine learning.
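To make the map / shuffle / reduce idea concrete, here is a minimal single-machine sketch of a word count written in the MapReduce style. It only imitates the programming model; a real framework would run the map and reduce calls on many machines in parallel. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, not part of any Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data tools", "big data concepts"])
print(counts)
```

Because each `map_phase` call looks at one document only and each `reduce_phase` call looks at one key only, the work can be spread over many "simple" computers, which is the goal stated above.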
NoSQL (Not only SQL)
A newer style of database that processes large volumes of multi-structured data.
Often works in conjunction with Hadoop.
E.g. Cassandra
Hadoop
Definition: An open-source framework for storing and analyzing massive amounts of distributed, semi-structured and unstructured data.
Open source = Hundreds of contributors continuously improve the core technology.
Runs on inexpensive commodity hardware, so projects can scale out inexpensively.
MapReduce + Hadoop = Big data core technology.
How does it work
Distributed with some centralization:
Data is broken up into "parts", which are then loaded into a file system spanning a cluster made up of multiple nodes (machines).
Each part is replicated multiple times and loaded into the file system for redundancy and fail-safe processing.
Consists of 2 components
Hadoop distributed file system (HDFS)
MapReduce
Jobs are distributed to the clients and, once completed, their results are collected and aggregated using MapReduce.
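The "break into parts, then replicate across nodes" idea can be sketched as follows. This is a toy illustration only: the round-robin placement, the part size and the node names are assumptions made for the example and do not reproduce the placement policy HDFS actually uses.

```python
def split_into_parts(data, part_size):
    """Break the data into consecutive parts of at most part_size items."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def place_replicas(num_parts, nodes, replication=3):
    """Assign every part to `replication` distinct nodes (round-robin, assumed)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for part_id in range(num_parts):
        placement[part_id] = [
            nodes[(part_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def surviving_parts(placement, failed_node):
    """Return the parts that still have a copy after one node fails."""
    return [
        part_id
        for part_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


parts = split_into_parts("abcdefghij", 4)
placement = place_replicas(len(parts), ["node1", "node2", "node3", "node4"])
print(parts)
print(placement)
print(surviving_parts(placement, "node2"))
```

Since every part lives on several nodes, losing a single machine leaves all parts readable, which is what makes the processing fail-safe.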
Co-existence with the data warehouse (DW)
Use Hadoop for filtering, transforming and consolidating multi-structured data
Use Hadoop to analyze the data and publish the analytical results
Use Hadoop for storing and archiving multi-structured data
Representative list of big data vendors
e.g. Cloudera, MapR, Hortonworks
Stream Analytics
Examples
E-commerce
Use of click stream data to make product recommendations and bundles
Smart lamp posts
Sensors for humidity, rainfall and temperature.
Track criminals
Detect suspicious activities
Why?
The store-everything approach becomes infeasible as the number of data sources increases
Needed for critical event processing: complex pattern variations that need to be detected and acted on as soon as they happen
Definition
Used for extracting actionable information from streaming data sources
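As a rough sketch of acting on events as they happen without storing everything, the generator below keeps only a small sliding window of recent readings and flags any value that deviates strongly from the window mean. The window size, the threshold and the sample readings are made-up assumptions for illustration.

```python
from collections import deque


def detect_anomalies(readings, window=5, threshold=20.0):
    """Yield (index, value) for readings far from the mean of the last `window` values."""
    recent = deque(maxlen=window)  # only the last `window` readings are kept
    for index, value in enumerate(readings):
        if len(recent) == window:
            mean = sum(recent) / window
            if abs(value - mean) > threshold:
                yield index, value
        recent.append(value)


# Hypothetical heart-rate style stream with one abnormal spike.
stream = [70, 72, 71, 69, 70, 120, 71, 70]
alerts = list(detect_anomalies(stream, window=5, threshold=20.0))
print(alerts)
```

The input can be any iterable, including an endless one, because each reading is examined once and then dropped when it leaves the window.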
Text and Web Analytics
Text mining applications
Marketing
E.g. blogs and user reviews of products reveal user sentiment
Biomedical
E.g. DNA analysis, analysis of gene expression.
Text mining vs data mining
Text Mining: Unstructured data, e.g. Word documents, PDF / XML files
Data Mining: Structured data in databases
Text Mining Application Areas
Clustering
Group similar documents together
Summarisation
To save time for readers
Information Extraction
Identification of key phrases and relationships within text by looking for patterns
Natural Language Processing (NLP)
Important component in text mining.
A subfield of Artificial intelligence
The study of "understanding" the natural human language.
WordNet (a hand-coded database of English words)
Challenges
Text containing abbreviations, misspellings and acronyms
e.g. customer -> cust, customar, csmr
Syntactic Ambiguities
A sentence can be interpreted in more than one way.
E.g. grammatical ambiguity
3-step text mining process
Step 1: Establish the corpus
Collect all relevant unstructured data.
Digitize the collection as ASCII text files.
Place the collection in a common place (in a flat file).
Step 2: Create the term-by-document matrix (TDM)
Fill the cells of the TDM with the most appropriate indices.
Eliminate terms with very few occurrences in very few documents (filtering)
Step 3: Extract knowledge
Classification
Clustering
Association
Trend Analysis
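Step 2 above can be sketched in a few lines. This is a minimal example that assumes raw term counts as the cell index (other indices, such as binary or weighted values, are equally possible) and assumes a simple document-frequency cutoff `min_docs` as the filter; the three-document corpus is invented for illustration.

```python
from collections import Counter


def build_tdm(corpus, min_docs=2):
    """Return (terms, matrix); matrix[i][j] is the count of terms[i] in document j.

    Terms that appear in fewer than `min_docs` documents are filtered out.
    """
    counts = [Counter(doc.lower().split()) for doc in corpus]
    doc_freq = Counter(term for doc_counts in counts for term in doc_counts)
    terms = sorted(t for t, df in doc_freq.items() if df >= min_docs)
    matrix = [[doc_counts[t] for doc_counts in counts] for t in terms]
    return terms, matrix


corpus = [
    "big data needs big storage",
    "data mining finds patterns",
    "text mining uses data",
]
terms, tdm = build_tdm(corpus, min_docs=2)
for term, row in zip(terms, tdm):
    print(term, row)
```

Each row is one term and each column one document, so the Step 3 techniques (classification, clustering, association, trend analysis) can treat the matrix like ordinary structured data.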
Sentiment Analysis
Sentiment = a settled opinion reflective of one's feelings
Categorised by topic
An integral part of CRM and customer experience management systems
E.g. Voice of customer / market (VOC/VOM)
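One simple way to approximate sentiment analysis is a word-list (lexicon) approach: count positive and negative words and compare. The sketch below is a deliberately naive illustration; the two tiny word lists and the sample reviews are invented assumptions, and real systems handle negation, sarcasm and context far more carefully.

```python
# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def sentiment_score(text):
    """Return (#positive words - #negative words) for one piece of text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(text):
    """Map the score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


reviews = [
    "Great phone, love the camera!",
    "Terrible battery and poor support.",
    "It arrived on Monday.",
]
labels = [label(r) for r in reviews]
print(labels)
```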
Web Mining
Web usage mining applications
Determine the lifetime value of clients
Design cross-marketing strategies across products
Predict user behavior based on user profile.
Challenges
The web has everything
The web is too complex
The web is too big for effective data mining.
Definition:
The process of discovering intrinsic relationships from web data.
It is divided into 3 parts
Web Structure Mining
The development of useful information from the links included in the web documents.
Web pages include hyperlinks, which help identify e.g. authoritative pages and hubs.
Web Usage Mining (Web analytics)
Extraction of information from click-stream analysis of web server logs generated through web page visits and transactions.
Gather click-stream data and analyze the click streams in web server logs.
Web Content Mining
The extraction of useful information from web pages
Data collection via web crawlers.
Used for competitive intelligence, sentiment analysis and information collection.
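For web structure mining, the notions of hubs and authoritative pages can be illustrated with the HITS algorithm. Note that HITS is brought in here as one well-known way to compute such scores; these notes do not name a specific algorithm. The four-page link graph is a made-up assumption.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length so they stay bounded."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=20):
    """Compute hub and authority scores; `links` maps a page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page is a good authority if good hubs link to it.
        auth = normalize(
            {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        )
        # A page is a good hub if it links to good authorities.
        hub = normalize(
            {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        )
    return hub, auth


# Hypothetical graph: pages a and b both link to c, and c links to d.
links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
hub, auth = hits(links)
print(max(auth, key=auth.get))
```

In this toy graph, page c collects the most incoming links from hub-like pages and therefore ends up with the highest authority score.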
KDD (Knowledge Discovery and Data Mining) for web mining
Step 1: Business Understanding
Goals for web-based mining:
Improve the usability of the web site by decreasing the number of pages a customer visits before a purchase transaction.
Personalize web pages for customers.
Success Stories:
e.g. Amazon
Good recommendation of products or services
Optimized web navigation.
Step 2 & 3: Data Understanding and Preparation
3 main issues with data preparation
Identify unwanted log file entries
A single page request usually generates multiple log file entries from several types of server.
A technique should be used to identify unwanted log file entries.
Data transformation
New attributes can be added to user session records to improve prediction of outcomes
e.g. average purchase amount of a repeat customer.
Differentiate individual user sessions
Host addresses are not useful, as multiple users may access a site from the same host
Best solution to differentiate user sessions: place a cookie that contains session information
Step 4: Modelling
Association Rule Mining
Unsupervised Clustering, e.g. usage profiles
Step 5: Evaluation
Interpret the results of a web mining session using association rule mining and clustering
E.g. if p4 & p5, then p10
Using confidence and support
Step 6: Deployment
Targeted marketing communication by sending emails and setting up online advertisements to promote products.
Personalisation of web pages based on user profiles, manually or through data mining
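The Step 5 evaluation of a rule such as "if p4 & p5, then p10" can be sketched with the usual definitions: support is the fraction of sessions containing all pages in the rule, and confidence is the fraction of sessions containing p4 and p5 that also contain p10. The five sessions below are invented purely for illustration.

```python
def support(sessions, items):
    """Fraction of sessions that contain every page in `items`."""
    items = set(items)
    return sum(items <= session for session in sessions) / len(sessions)


def confidence(sessions, antecedent, consequent):
    """Of the sessions containing the antecedent, the fraction that also contain the consequent."""
    both = support(sessions, set(antecedent) | set(consequent))
    ante = support(sessions, antecedent)
    return both / ante if ante else 0.0


# Hypothetical user sessions, each a set of visited pages.
sessions = [
    {"p4", "p5", "p10"},
    {"p4", "p5", "p10"},
    {"p4", "p5"},
    {"p1", "p10"},
    {"p2"},
]

rule_support = support(sessions, {"p4", "p5", "p10"})
rule_confidence = confidence(sessions, {"p4", "p5"}, {"p10"})
print(rule_support, rule_confidence)
```

Here 2 of the 5 sessions contain all three pages, and 2 of the 3 sessions with p4 and p5 also visit p10, so the rule holds in about two thirds of the cases where it applies.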