Big Data Concepts and Tools
What is Big Data?
exponential growth in the availability and use of information, both structured and unstructured
massive volumes of data
Characteristics of Big Data
Volume (scale)
data volume increasing exponentially
Variety (complexity)
structured data (spreadsheet data)
unstructured data (video/images)
semi-structured data (email, documents)
big public data (weather/finance)
Velocity (speed)
data is generated and processed fast
Variability, value, veracity
Fundamentals of Big Data Analytics
regardless of its size, type, or speed, Big Data is worthless by itself
Big Data + "big" analytics = value
Big Data brought about big challenges
a new breed of technologies is needed to effectively and efficiently capture, store, and analyze Big Data
Limitations of Data Warehouses
Fixed Schema
Scalability
unable to scale to the huge volumes of data from new data sources
Speed
unable to handle the speed at which Big Data arrives
Others
unable to handle sophisticated processing
unable to perform queries on Big Data efficiently
Challenges of Big Data Analytics
Data Volume
ability to capture, store, and process the huge volume of data in a timely manner
Data Integration
ability to combine data quickly and at a reasonable cost
Processing Capabilities
ability to process the data quickly, as it is captured (e.g., stream analytics)
Data Governance
Skill Availability
shortage of data scientists
Solution Cost
Return on Investment
Critical Success Factors for Big Data Analytics
a clear business need
strong, committed sponsorship
alignment between the business and IT strategy
a fact-based decision-making culture
a strong data infrastructure
the right analytics tools
personnel with advanced analytical skills
High-Performance Computing for Big Data
In-memory analytics
In-database analytics
Grid computing & MPP (massively parallel processing)
Appliances
Popular Big Data Technologies
Hadoop
"how to process big data with reasonable cost and time?"
an open-source framework for storing and analyzing massive amounts of distributed, semi-structured, and unstructured data
runs on inexpensive commodity hardware
Hadoop + MapReduce = Big Data core technology
How does Hadoop work?
consists of the Hadoop Distributed File System (HDFS) and MapReduce
data is broken up into "parts", which are then loaded into a file system (cluster) made up of multiple nodes
each "part" is
replicated multiple times
and loaded into the file system for replication and failsafe procesing
Jobs are distributed to clients and, once completed, the results are collected and aggregated using MapReduce
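A minimal Python sketch of the flow just described: break the data into "parts", replicate each part across nodes, process the parts, and aggregate the results. The node names, block size, and stand-in compute task are illustrative assumptions; HDFS itself uses large blocks (e.g., 128 MB), replicated three times by default.

```python
# Conceptual sketch of the Hadoop flow: split -> replicate -> process -> aggregate.
# Illustrative only: node names, block size, and the compute step are assumptions.

BLOCK_SIZE = 16          # bytes per "part" (HDFS uses far larger blocks)
REPLICATION = 3          # each part is stored on 3 nodes (HDFS's usual default)
NODES = ["node1", "node2", "node3", "node4"]

def split_into_parts(data: bytes) -> list[bytes]:
    """Break the input into fixed-size parts (HDFS blocks)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def replicate(parts: list[bytes]) -> dict[int, list[str]]:
    """Assign each part to REPLICATION different nodes for failsafe processing."""
    return {i: [NODES[(i + r) % len(NODES)] for r in range(REPLICATION)]
            for i in range(len(parts))}

def process_part(part: bytes) -> int:
    """Stand-in compute task: count the bytes in one part."""
    return len(part)

data = b"big data is worthless without big analytics" * 4
parts = split_into_parts(data)
placement = replicate(parts)
results = [process_part(p) for p in parts]     # tasks would run on the nodes holding each part
print(len(parts), "parts, total bytes processed:", sum(results))   # aggregated result
print("part 0 stored on:", placement[0])
```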
Hadoop Cluster
consists of two types of nodes
Master
Name Node
keeps track of the files and directories
provides information on where in the cluster data is stored and whether any nodes have failed
Job Tracker
initiates and coordinates jobs, or the processing of data, and dispatches compute tasks to the Task Tracker
Slave
Data Node
referred to as a "storage node"
where data is stored
Task Tracker
known as "compute node"
whereby data is processed
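A toy Python sketch of how these master and slave roles fit together, assuming a tiny in-memory "cluster": a Name Node-style index maps each file's blocks to the Data Nodes holding them (and knows which nodes have failed), while a Job Tracker-style dispatcher sends one compute task per block to a Task Tracker on a healthy node. All names and data are made up.

```python
# Toy model of the master/slave roles described above (names are illustrative).

# Name Node: tracks files/directories and where each block is stored in the cluster.
name_node = {
    "/logs/2024.txt": {"blk_0": ["dn1", "dn2"], "blk_1": ["dn2", "dn3"]},
}
alive = {"dn1": True, "dn2": True, "dn3": False}   # Name Node also knows which nodes failed

# Data Nodes ("storage nodes"): hold the actual block contents.
data_nodes = {
    "dn1": {"blk_0": "error warn info"},
    "dn2": {"blk_0": "error warn info", "blk_1": "info error error"},
    "dn3": {"blk_1": "info error error"},
}

def task_tracker(node: str, block: str) -> int:
    """Task Tracker ("compute node"): runs a task on the node where the data lives."""
    return data_nodes[node][block].split().count("error")

def job_tracker(path: str) -> int:
    """Job Tracker: initiates the job, dispatches one task per block to a healthy
    replica, then aggregates the results."""
    total = 0
    for block, replicas in name_node[path].items():
        node = next(n for n in replicas if alive[n])   # skip failed nodes
        total += task_tracker(node, block)
    return total

print(job_tracker("/logs/2024.txt"))   # -> 3 "error" entries across both blocks
```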
MapReduce
goal is to achieve high performance with "simple" computers
good at processing and analyzing large volumes of multi-structured data in a timely manner
distributes the processing of very large multi-structured data files across a large cluster of ordinary machines/processors
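A minimal pure-Python word-count sketch of the MapReduce pattern itself: map tasks emit (key, value) pairs from each part of the data, the shuffle groups the pairs by key, and reduce tasks aggregate each group. Hadoop would run the equivalent logic (usually written in Java) across many machines; the toy input here is an assumption.

```python
from collections import defaultdict

def map_phase(part: str):
    """Map: emit a (word, 1) pair for every word in one part of the data."""
    for word in part.split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key (here, sum the counts)."""
    return key, sum(values)

# Each "part" would be an HDFS block processed on a different machine.
parts = ["big data needs big analytics", "big analytics creates value"]
pairs = [pair for part in parts for pair in map_phase(part)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)   # {'big': 3, 'data': 1, 'needs': 1, 'analytics': 2, 'creates': 1, 'value': 1}
```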
NoSQL
a new style of database (DBMS)
processes large volumes of multi-structured data
works in conjunction with Hadoop
serves discrete data stored among large volumes of multi-structured data to end-users and Big Data applications
e.g. Cassandra, MongoDB, CouchDB, HBase
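A short sketch of the NoSQL access style using MongoDB's Python driver (pymongo); it assumes a MongoDB server on localhost, and the database, collection, and field names are made up.

```python
from pymongo import MongoClient

# Assumes a MongoDB server on localhost; database/collection names are hypothetical.
client = MongoClient("mongodb://localhost:27017")
events = client["bigdata_demo"]["click_events"]

# Multi-structured documents: records need not share a fixed schema.
events.insert_one({"user": "u1", "page": "/home", "device": {"type": "mobile"}})
events.insert_one({"user": "u2", "page": "/cart", "referrer": "email-campaign"})

# Serve discrete records quickly to an application, rather than scanning everything.
events.create_index("user")
for doc in events.find({"user": "u1"}):
    print(doc["page"])
```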
HIVE
Hadoop-based data warehousing-like framework
allows users to write queries in an SQL-like language called HiveQL, which are then converted into MapReduce jobs
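A brief sketch of querying Hive from Python, assuming a reachable HiveServer2 endpoint and the third-party pyhive package; the web_logs table and its columns are hypothetical. The HiveQL itself reads like ordinary SQL, and Hive translates it into MapReduce jobs over the data in HDFS.

```python
from pyhive import hive   # third-party package; assumes HiveServer2 is running

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL; Hive compiles it into MapReduce jobs behind the scenes.
cursor.execute("""
    SELECT page, COUNT(*) AS visits
    FROM web_logs            -- hypothetical table over files stored in HDFS
    GROUP BY page
    ORDER BY visits DESC
    LIMIT 10
""")
for page, visits in cursor.fetchall():
    print(page, visits)
```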
PIG
Hadoop-based query language
easy to learn
adept at very long data pipelines (a known limitation of SQL)
Coexistence of Hadoop and DW
use Hadoop for storing and archiving multi-structured data
use Hadoop for filtering, transforming, and consolidating multi-structured data
use Hadoop to analyze large volumes of multi-structured data and publish the analytical results
use a relational DBMS that provides MapReduce capabilities as an investigative computing platform
use a front-end query tool to access and analyze data
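A minimal sketch of that hand-off, assuming a Hadoop job has already filtered and aggregated raw multi-structured files into simple (page, visits) rows: the analytical results are published into a relational DBMS (SQLite as a stand-in) where a front-end query tool can reach them with ordinary SQL.

```python
import sqlite3

# Pretend these aggregates were produced by a Hadoop job over raw click-stream files.
hadoop_results = [("/home", 1200), ("/cart", 430), ("/checkout", 95)]

# Publish the analytical results into a relational DBMS for BI / front-end tools.
dw = sqlite3.connect("warehouse.db")
dw.execute("CREATE TABLE IF NOT EXISTS page_visits (page TEXT, visits INTEGER)")
dw.executemany("INSERT INTO page_visits VALUES (?, ?)", hadoop_results)
dw.commit()

# A front-end query tool would now run ordinary SQL against the warehouse.
for row in dw.execute("SELECT page, visits FROM page_visits ORDER BY visits DESC"):
    print(row)
```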
Stream Analytics
extracting actionable information from continuously flowing or streaming data sources
the store-everything approach becomes infeasible as the number of data sources increases
need for critical event processing: complex pattern variations need to be detected and acted on as soon as they happen
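A small Python sketch of critical event processing, assuming the stream arrives as a simple iterator of sensor readings: only a sliding window is kept in memory (no store-everything), and an action fires the moment the pattern, here three consecutive readings above a threshold, is detected.

```python
from collections import deque

THRESHOLD = 90          # illustrative limit for a sensor reading
WINDOW = 3              # pattern: three consecutive readings above the limit

def act(reading_index: int) -> None:
    """Placeholder for the real-time action (alert, block transaction, etc.)."""
    print(f"critical event at reading {reading_index}: acting immediately")

def process_stream(readings) -> None:
    window = deque(maxlen=WINDOW)      # only the sliding window is stored, not the stream
    for i, value in enumerate(readings):
        window.append(value)
        if len(window) == WINDOW and all(v > THRESHOLD for v in window):
            act(i)

# Simulated continuously flowing data source (e.g., a traffic or health sensor).
process_stream([72, 85, 91, 95, 97, 60, 92, 94, 96])
```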
Stream Analytics Applications
E-Commerce
use of click-stream data to make product recommendations and bundles
Law Enforcement and Cyber Security
use video surveillance and face recognition
Financial Services
use transactional data to detect fraud and illegal activities
Health Services
use medical data to detect anomalies, improving patient conditions and saving lives
Government
use data from traffic sensors to ease the pain caused by traffic congestion
Big Data Vendors
Cloudera
MapR
Hortonworks
IBM
Oracle
Google