Big Data Concepts and Tools
What is Big Data?
Big data describes the exponential growth, availability, and use of information, both structured and unstructured
e.g. from social media, the web, RFID, GPS, textual data, and sensors
Characteristics of Big Data
Volume (Scale)
Data volume
Variety (Complexity)
Structured data
e.g. relational data (tables/transactions/legacy data), spreadsheet data
Semi-structured data
e.g. email, logs, documents
Big public data
online, weather, finance, etc
Unstructured data
e.g. videos, images, audio files, streaming data, graphs, GPS location data, simulation data, etc.
A single application can be generating/collecting many types of data
Velocity (Speed)
Data is being generated fast and needs to be processed fast
Online Data Analytics
Late decisions -> missing opportunities
Veracity
Accuracy, quality, truthfulness, trustworthiness
Variability
data flows can be inconsistent with periodic peaks
Value
provides business values
Limitations of Data Warehouse/ Relational Database
Schema (fixed)
Scalability
Unable to handle huge amounts of new / contemporary data sources
Speed
Unable to handle speed at which big data is arriving
Others
unable to handle sophisticated processing such as machine learning
unable to perform queries on big data efficiently
Challenges of Big Data Analytics
Data Volume
The ability to capture, store, and process the huge volume of data in a timely manner
Data integration
The ability to combine data quickly and at reasonable cost
Processing capabilities
The ability to process the data quickly, as it is captured (i.e. stream analytics)
Data governance
Security, privacy, ownership, quality issues
Skill availability
Shortage of data scientists
Solution cost
Return on investment
Critical Success Factors for Big Data Analytics
A Clear Business need
Strong committed sponsorship
Alignment between the business & IT Strategy
A fact-based decision-making culture
A strong data infrastructure
The right analytics tools
Personnel with advanced analytical skills
Popular Big Data Technologies
Hadoop
An open-source framework for storing and analyzing massive amounts of distributed, semi-structured and unstructured data
Hadoop clusters run on inexpensive commodity hardware, so projects can scale out inexpensively
MapReduce
NoSQL (Not only SQL)
A new style of database that processes large volumes of multi-structured data
Serves discrete data stored among large volumes of multi-structured data to end users and Big Data applications
Examples: Cassandra, MongoDB, CouchDB, HBase, etc.
HIVE
Hadoop-based data-warehousing-like framework developed by Facebook
Allows users to write queries in an SQL-like language called HiveQL, which are then converted to MapReduce.
PIG
Hadoop-based query language developed by Yahoo
It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL)
High-performance computing for Big Data
In-memory analytics
Storing and processing the complete data set in RAM
In-database analytics
Placing analytic procedures close to where the data is stored
Grid computing & MPP
Use of many machines and processors in parallel (MPP - massively parallel processing)
Appliances
Combining hardware, software, and storage in a single unit for performance and scalability
Fundamentals of Big Data analytics
Big Data by itself, regardless of the size, type, or speed, is worthless
Big Data + “big” analytics = value
With the value proposition, Big Data also brought about big challenges
Effectively and efficiently capturing, storing, and analyzing Big Data
New breed of technologies needed (developed or purchased or hired or outsourced …)
Coexistence of Hadoop and DW
Use Hadoop for storing and archiving multi-structured data
Use Hadoop for filtering, transforming, and/or consolidating multi-structured data
Use Hadoop to analyze large volumes of multi-structured data and publish the analytical results
Use a relational DBMS that provides MapReduce capabilities as an investigative computing platform
Use a front-end query tool to access and analyze data
Stream Analytics Applications
e-Commerce
use of click-stream data to make product recommendations and bundles
Financial Services
use transactional data to detect fraud and illegal activities
Health Services
use medical data to detect anomalies so as to improve patient conditions and save lives
Government
use data from traffic sensors to change traffic light sequences and traffic lanes to ease the pain caused by traffic congestion
Law Enforcement and Cyber Security
use video surveillance and face recognition for real-time situational awareness to improve crime prevention and law enforcement
Why Stream Analytics?
Store-everything approach is infeasible when the number of data sources increases
Need for critical event processing: complex pattern variations that need to be detected and acted on as soon as they happen
Analytic process of extracting actionable information from continuously flowing data
Data-in-motion analytics and real-time data analytics
One of the Vs in Big Data = Velocity
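The "act as soon as it happens" idea above can be sketched in a few lines of Python. This is an illustrative toy, not a real stream-processing engine: events are processed one at a time over a sliding window, and a spike is flagged the moment it arrives instead of after storing everything. The window size and threshold are made-up values.

```python
# Toy sketch of data-in-motion analytics: flag a spike the moment it
# arrives, using only a small sliding window of recent values.
from collections import deque

def detect_spikes(stream, window=5, threshold=2.0):
    """Yield (index, value) whenever a value exceeds `threshold` times
    the average of the preceding `window` values."""
    recent = deque(maxlen=window)      # only the window is kept, not the full stream
    for i, value in enumerate(stream):
        if len(recent) == window:
            avg = sum(recent) / window
            if avg > 0 and value > threshold * avg:
                yield (i, value)       # critical event: emitted immediately
        recent.append(value)

# e.g. transactions per second; the spike at index 7 is flagged on arrival
readings = [10, 11, 9, 10, 10, 11, 10, 45, 10]
alerts = list(detect_spikes(readings))
```

Because the detector keeps only a bounded window, memory stays constant no matter how long the stream runs, which is exactly why the store-everything approach can be avoided.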
Hadoop Cluster
Hadoop
How does Hadoop work?
Consists of 2 components: the Hadoop Distributed File System (HDFS) and MapReduce
Distributed with some centralization: data is broken up into “parts,” which are then loaded into a file system (cluster) made up of multiple nodes (machines)
Each “part” is replicated multiple times and loaded into the file system for replication and failsafe processing
Jobs are distributed to the clients, and once completed, the results are collected and aggregated using MapReduce
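The "break into parts, replicate, spread across nodes" behaviour can be mimicked in plain Python. This is a hypothetical simulation, not the real HDFS API; the block size, replication factor, and node names are illustrative (real HDFS defaults to 128 MB blocks and a replication factor of 3).

```python
# Hypothetical sketch (not the real HDFS API): split a file into
# fixed-size blocks and place each block on several distinct data nodes.
import itertools

BLOCK_SIZE = 4                 # bytes per block (toy value; HDFS default is 128 MB)
REPLICATION = 3                # copies of each block (HDFS default)
DATA_NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Break the file into fixed-size 'parts' (blocks)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    starts = itertools.cycle(range(len(nodes)))
    for block_id, _ in enumerate(blocks):
        start = next(starts)
        placement[block_id] = [nodes[(start + r) % len(nodes)]
                               for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello big data world")
placement = place_replicas(blocks, DATA_NODES)
# every block lives on 3 distinct nodes, so losing any one node loses no data
```

The point of the simulation is the failsafe property: since each block has copies on several machines, any single node can fail without losing the file.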
MapReduce
Goal - achieving high performance with “simple” computers
Developed and popularized by Google
Good at processing and analyzing large volumes of multi-structured data in a timely manner
Distributes the processing of very large multi-structured data files across a large cluster of ordinary machines/processors
Used in indexing the Web for search, graph analysis, text analysis, machine learning, etc.
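The MapReduce pattern described above can be shown with the classic word-count example. This is a minimal single-process sketch of the map → shuffle/group → reduce stages; a real Hadoop job distributes each stage across the cluster's machines.

```python
# Minimal single-process sketch of the MapReduce pattern: word count.
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (key, value) pair for every word."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data needs big analytics", "big data moves fast"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = reduce_phase(shuffle_phase(pairs))
# counts["big"] == 3, counts["data"] == 2
```

Because the map calls are independent per document and the reduce calls are independent per key, both stages parallelize naturally across ordinary machines, which is how MapReduce achieves high performance with "simple" computers.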
Big Data Technologies Hadoop - Demystifying Facts
Hadoop consists of multiple products
Hadoop is open source but available from vendors too
Hadoop is an ecosystem, not a single product
HDFS is a file system, not a DBMS
Hadoop and MapReduce are related but not the same
MapReduce provides control for analytics
Hadoop is about data diversity, not just data volume
Hadoop complements a DW; it’s rarely a replacement
Hadoop enables many types of analytics, not just Web analytics
Hadoop Technical Components
Hadoop Distributed File System (HDFS)
Name Node (primary facilitator)
Secondary Node (backup to Name Node)
Job Tracker
Slave Nodes (the grunts of any Hadoop cluster)
Additionally, the Hadoop ecosystem is made up of a number of complementary sub-projects: NoSQL (Cassandra, HBase), DW (Hive), …
NoSQL = not only SQL
2 node types:
Master: Name Node and Job Tracker
Slave: Data Node and Task Tracker
Data Nodes are referred to as storage nodes, where the data is stored
The Name Node keeps track of files and directories and provides information on where in the cluster data is stored and whether any nodes have failed
The Job Tracker and Task Tracker process data and are known as compute nodes
The Job Tracker initiates and coordinates jobs (the processing of data) and dispatches compute tasks to the Task Tracker
Open source: hundreds of contributors continuously improve the core technology
Business Problems Addressed by Big Data Analytics
Process efficiency and cost reduction
Brand management
Revenue maximization, cross-selling/up-selling
Enhanced customer experience
Churn identification, customer recruiting
Improved customer service
Identifying new products and market opportunities
Risk management
Regulatory compliance
Enhanced security capabilities
…these use cases rely on data collected in relation to behaviour
Big Data And Data Warehousing
What is the impact of Big Data on DW?
Big Data and RDBMS do not go nicely together
Will Hadoop replace data warehousing/RDBMS?
Use Cases for Hadoop
Hadoop as the repository and refinery
Hadoop as the active archive
Use Cases for Data Warehousing
Data warehouse performance
Integrating data that provides business value
Interactive BI tools
Skills That Define a Data Scientist
Domain Expertise, Problem Definition and Decision Modelling
Data Access and Management
Programming, Scripting and Hacking
Internet and Social Media / Social Networking Technologies
Curiosity and Creativity
Communication and interpersonal skills