Week 7: Big Data Concepts & Tools
Big Data Concept
What is Big Data?
describes the exponential growth, availability and use of information, both structured and unstructured
Where does Big Data come from?
Facebook, YouTube, Google, (Healthcare, govts, military, education, media), (Banks, credit card transactions), (Web data, e-Commerce), (Grocery, dept store purchases)
Characteristics of Big Data: 3Vs
Volume (Scale)
Data volume is increasing exponentially
Variety (complex)
Structured Data eg. relational data, spreadsheet data
Semi-Structured data eg. email, logs, documents
Unstructured Data eg. videos, images, audio files, streaming data, graphs
A single application can be generating/collecting many types of data
Big public data (online, weather, finance etc)
Velocity (speed)
Data is being generated fast and needs to be processed fast
Online Data Analytics
Late decisions mean missed opportunities
eg. e-promotions, healthcare monitoring
Other Vs of Big Data
Veracity: accuracy, quality, truthfulness, trustworthiness
Variability: data flows can be inconsistent with periodic peaks
Value: provides business value
Limitations of Data Warehouse/Relational Database
Schema (fixed)
Scalability
unable to handle the huge volumes of data arriving from new/contemporary data sources
Speed
unable to handle the speed at which big data arrives
Others
unable to handle sophisticated processing such as machine learning
unable to perform queries on big data efficiently
Challenges of Big Data Analytics
Data Volume: the ability to capture, store and process the huge volume of data in a timely manner
Data Integration: the ability to combine data quickly and at reasonable cost
Processing capabilities: the ability to process the data quickly, as it is captured
Data governance: security, privacy, ownership, quality issues
Skill availability: shortage of data scientists
Solution cost: Return on Investment
Critical Success Factors for Big Data Analytics
a clear business need
strong committed sponsorship
alignment between the business & IT strategy
a fact based decision making culture
a strong data infrastructure
the right analytics tools
personnel with advanced analytical skills
High-Performance Computing for Big Data
In-memory analytics: storing & processing the complete data set in RAM
In-database analytics: placing analytic procedures close to where data is stored
Grid computing & MPP: use of many machines & processors in parallel
Appliances: combining hardware, software and storage in a single unit for performance & scalability
Popular Big Data Technologies
Hadoop, MapReduce, NoSQL, HIVE, PIG
Hadoop
an open source framework for storing and analyzing massive amounts of distributed, semi-structured & unstructured data
MapReduce + Hadoop = Big Data core technology
How does Hadoop work?
consists of 2 components: Hadoop Distributed File System (HDFS) & MapReduce
Data is broken up into "parts", which are then loaded into a file system (cluster) made up of multiple nodes (machines)
each "part" is replicated multiple times & loaded into the file system for redundancy & fail-safe processing
Jobs are distributed to the nodes and, once completed, the results are collected & aggregated using MapReduce
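the split-and-replicate idea above can be sketched in plain Python (a toy illustration, not Hadoop's actual implementation; the block size and node names are made up):

```python
def split_into_blocks(data: bytes, block_size: int):
    # Break the file into fixed-size "parts" (HDFS uses large blocks, e.g. 128 MB; tiny here)
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=3):
    # Name Node-style bookkeeping: assign each block to `replication` distinct Data Nodes
    return {i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i in range(len(blocks))}

blocks = split_into_blocks(b"0123456789abcdef", block_size=4)
plan = place_replicas(blocks, nodes=["dn1", "dn2", "dn3", "dn4"])
# each of the 4 blocks now lives on 3 different nodes, so a single node
# failure never makes any block unreadable
```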
Hadoop Cluster
Master: Name Node & Job Tracker
Slave: Data Node & Task Tracker
Data Node: referred to as storage node where the data is stored
Name Node: keeps track of files and directories, provides information on where in the cluster data is stored, and reports whether nodes have failed
Job Tracker: initiates and coordinates jobs (the processing of data) and dispatches compute tasks to the Task Trackers
Job & Task Trackers handle the processing of data and are known as compute nodes
Demystifying Facts (Hadoop)
consists of multiple products
open source but available from vendors too
an ecosystem, not a single product
is related to MapReduce but not the same
MapReduce provides control for analytics
about diversity, not just data volume
complements a data warehouse; it's rarely a replacement
enables many types of analytics, not just web analytics
Technical Components
Hadoop Distributed File System (HDFS)
Name Node (primary facilitator)
Secondary Name Node (backup to Name Node)
Job Tracker
Slave Nodes
MapReduce
Goal: achieving high performance with "simple" computers
Good at processing and analyzing large volumes of multi-structured data in a timely manner
Distributes the processing of very large multi-structured files across a large cluster of ordinary machines/processors
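as a minimal sketch of the MapReduce model (plain Python, no Hadoop cluster): a map step emits key/value pairs from each input split, and a reduce step aggregates the values per key

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Mapper: emit a (word, 1) pair for every word in one input split
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reducer: sum the counts for each key (word)
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# two "splits" standing in for file blocks spread across the cluster
chunks = ["big data needs big tools", "data moves fast"]
mapped = chain.from_iterable(map_phase(c) for c in chunks)
result = reduce_phase(mapped)
print(result["big"])   # 2
print(result["data"])  # 2
```

in a real cluster the mappers run in parallel on the nodes holding each split, and a shuffle phase routes all pairs with the same key to the same reducer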
NoSQL
a new style of database that processes large volumes of multi-structured data
often works in conjunction with Hadoop
serves discrete data, stored among large volumes of multi-structured data, to end users and Big Data applications
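the schema-less, key-based access pattern can be illustrated with a toy in-memory document store (purely illustrative; real NoSQL systems add distribution, persistence and indexing):

```python
# toy document store: records are keyed by id and need not share a schema,
# unlike rows in a relational table
store = {}

def put(doc_id, document):
    store[doc_id] = document

def get(doc_id):
    # serve one discrete record by key: the typical NoSQL access pattern
    return store.get(doc_id)

put("u1", {"name": "Ana", "clicks": [3, 7]})         # has click history
put("u2", {"name": "Ben", "location": "Singapore"})  # entirely different fields
```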
HIVE
Hadoop-based data warehousing-like framework developed by Facebook
Allows users to write queries in an SQL-like language called HiveQL, which are then converted to MapReduce jobs
PIG
Hadoop-based query language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines
Coexistence of Hadoop & DW
use Hadoop for storing & archiving multi-structured data
use Hadoop for filtering, transforming &/or consolidating multi-structured data
use Hadoop to analyze large volumes of multi-structured data and publish the analytical results
use a relational DBMS that provides MapReduce capabilities as an investigative computing platform
use a front-end query tool to access and analyze data
Big Data & Stream Analytics
Data-in-motion Analytics & Real-time data analytics: velocity
Analytic process of extracting actionable information from continuously flowing data
Stream Analytics
Why Stream Analytics?
Store-everything approach becomes infeasible as the number of data sources increases
Need for critical event processing - complex pattern variations that need to be detected & acted on as soon as they happen
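a minimal sketch of critical-event processing over a continuous stream (the window size and alert threshold are assumed for illustration; production engines such as Apache Flink or Spark Streaming do this at scale):

```python
from collections import deque

class SlidingWindowMonitor:
    """Flag readings that spike sharply above the recent window average."""

    def __init__(self, window_size=5, threshold=2.0):
        self.window = deque(maxlen=window_size)  # only recent data is kept, not everything
        self.threshold = threshold

    def observe(self, value):
        # process each reading as it arrives (data in motion), acting immediately
        alert = False
        if len(self.window) == self.window.maxlen:
            avg = sum(self.window) / len(self.window)
            alert = value > avg * self.threshold
        self.window.append(value)
        return alert

monitor = SlidingWindowMonitor()
readings = [10, 11, 9, 10, 10, 50]          # e.g. sensor or transaction values
alerts = [monitor.observe(r) for r in readings]
# alerts[-1] is True: the spike to 50 is flagged the moment it arrives
```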
Applications
e-Commerce: use of click-stream data to make products recommendations and bundles
Law Enforcement & Cyber Security: use video surveillance and face recognition for real-time situational awareness to improve crime prevention and law enforcement
Financial Services: use transactional data to detect fraud and illegal activities
Health Services: use medical data to detect anomalies so as to improve patient conditions and save lives
Government: use data from traffic sensors to change traffic light sequences and traffic lanes to ease the pain caused by traffic congestion