Big Data Concepts and tools
What is Big Data?
Massive volumes of data
Describes the exponential growth, availability and use of information, both structured and unstructured.
Characteristics
Volume (Scale)
Data volume is increasing
Variety (Complexity)
Structured data, Semi-structured data & unstructured data
Velocity (Speed)
Data needs to be generated and processed fast
Veracity
Refers to accuracy, quality, truthfulness, trustworthiness
Variability
Data flows can be inconsistent with periodic peaks
Value
Provides business value
Real time/Fast data
Social media and networks: Generating data
Scientific instruments: collecting all sorts of data
Mobile devices: tracking all objects all the time
Sensor technology and networks: Measuring all kinds of data
Fundamentals of big data analysis
Big data alone is worthless
Big data + "big" analysis = value
Challenges
Capturing, storing & analysing data effectively & efficiently
A new breed of technologies is needed
Limitations of the data warehouse
Schema (fixed)
Scalability
Unable to handle huge volumes from new data sources
Speed
Unable to handle high-velocity data
Unable to handle sophisticated processing
Unable to perform queries efficiently
Challenges of big data analytics
Data Volume
Ability to capture, store & process huge volume of data in a timely manner
Data Integration
Ability to combine data quickly and at lower cost
Processing capabilities
ability to process data quickly
Data governance
security, privacy, ownership, quality issues
Skill availability
shortage of data scientists
Solution Cost
return on investment
Success factors for Data Analytics
A clear business need
Strong committed sponsorship
Alignment between the business and IT strategy
A fact based decision making culture
A strong data infrastructure
The right analytics tools
Personnel with advanced analytical skills
Big Data Technologies
Hadoop
-It is an open-source framework for storing and analysing massive amounts of distributed, semi-structured and unstructured data.
-Big data core technology: MapReduce + Hadoop
-Consists of 2 components: the Hadoop Distributed File System (HDFS) and MapReduce
How does Hadoop work?
Data is broken up into “parts”, which are loaded into a file system (cluster) made up of multiple nodes
Each “part” is replicated several times and loaded into the file system for redundancy and failsafe processing
Jobs are distributed to the nodes holding the data; upon completion, results are collected and aggregated using MapReduce
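The split-and-replicate steps above can be sketched in miniature. This is a toy simulation, not HDFS itself: the block size, replication factor and node names are illustrative values only.

```python
# Toy simulation of the HDFS idea: data is split into "parts" (blocks),
# and each block is replicated onto several nodes of the cluster.
BLOCK_SIZE = 8          # bytes per block (illustrative; HDFS defaults to 128 MB)
REPLICATION = 3         # copies kept of each block

def split_into_blocks(data: bytes, size: int):
    """Break the data into fixed-size parts."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes (round-robin)."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        for r in range(replication):
            placement[nodes[(i + r) % len(nodes)]].append(block)
    return placement

data = b"big data needs a distributed file system"
blocks = split_into_blocks(data, BLOCK_SIZE)
cluster = place_replicas(blocks, ["node1", "node2", "node3", "node4"], REPLICATION)

# Failsafe: losing one node still leaves a copy of every block elsewhere.
surviving = {b for n, blks in cluster.items() if n != "node2" for b in blks}
assert set(blocks) <= surviving
```

Because every block lives on three of the four nodes, any single node can fail without data loss, which is the point of the replication step.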
Hadoop cluster
Master
-Name node: keeps track of files and directories, provides information on where data is stored in the cluster, and detects node failures
-Job Tracker: schedules MapReduce jobs and tracks their progress across the cluster
Slave
-Data node: storage node where data is stored
-Task Tracker: executes MapReduce tasks on its node, known as a compute node
MapReduce
It is a technique for distributing the processing of very large multi-structured data files across a large cluster of commodity machines
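The map/shuffle/reduce pattern can be shown as a minimal single-process sketch, using the classic word-count illustration. Real Hadoop runs the same three phases distributed across many machines; here they run in one process.

```python
# Minimal in-process sketch of the MapReduce pattern:
# map emits (key, value) pairs, shuffle groups them by key,
# reduce aggregates each group.
from collections import defaultdict

def map_phase(document: str):
    """Map: emit (word, 1) for every word in the document."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big analysis", "big value"]
pairs = [p for doc in docs for p in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # → {'big': 3, 'data': 1, 'analysis': 1, 'value': 1}
```

Because each map call and each reduce call is independent, the framework can scatter them across the cluster and only the shuffle needs coordination.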
NoSQL (Not only SQL)
A new style of database that processes large volumes of multi-structured data and works in conjunction with Hadoop. E.g. Cassandra, MongoDB, CouchDB, HBase
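As a toy illustration of why schema-less, document-style storage suits multi-structured data: records in one collection need not share a fixed schema, unlike rows in a relational table. The store below is a plain in-memory dict, not the API of any of the products named above, and the field names are invented.

```python
# A "collection" of documents keyed by id; no fixed schema is declared.
orders = {}

# Two documents with different shapes coexist in the same collection.
orders["o1"] = {"customer": "alice", "items": ["book"], "total": 12.5}
orders["o2"] = {"customer": "bob", "clickstream": ["home", "cart"],
                "coupon": {"code": "SAVE10", "pct": 10}}

# Queries must tolerate missing fields instead of relying on a schema.
with_coupon = [oid for oid, doc in orders.items() if "coupon" in doc]
print(with_coupon)  # → ['o2']
```

A relational table would force both records into one set of columns (with many NULLs) or an upfront schema change; the document model absorbs new shapes as they arrive.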
HIVE (By Facebook)
It is a Hadoop-based, data-warehouse-like framework that enables users to write queries in an SQL-like language known as HiveQL
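As a rough illustration of the SQL-like style of query HiveQL enables, here is a comparable aggregation run with Python's built-in sqlite3. The table and column names are invented; a Hive query would read much the same but compile to MapReduce jobs over files in HDFS rather than run against a local database.

```python
import sqlite3

# Invented example data: page views to be aggregated per page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, visitor TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("home", "a"), ("home", "b"), ("cart", "a")])

# The SQL-like query style that Hive brings to data stored in Hadoop.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # → [('cart', 1), ('home', 2)]
```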
PIG (by Yahoo!)
A Hadoop-based query language that is relatively easy to learn and adept at very deep, long data pipelines
Big Data & Data Warehouse
Impact of big data on the data warehouse
They do not go well together
Use cases for Hadoop
Hadoop as the repository and refinery and the active archive
Use cases for data warehousing
Data warehouse performance
Data integration
Interactive BI tools
Stream analytics
Provides data-in-motion and real time data analytics
Ability to store everything as the number of data sources increases
Ability to perform critical processing
An analytic process of extracting actionable information from continuously flowing data
Needed for critical event processing: complex pattern variations that need to be detected and acted on as soon as they happen
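A minimal sketch of acting on data in motion: each record is examined as it arrives and an alert is raised the moment a pattern is detected, rather than after everything has been stored. The window size and spike threshold are invented illustration values.

```python
from collections import deque

def detect_spikes(readings, window=3, threshold=2.0):
    """Alert when a reading exceeds `threshold` x the recent-window average."""
    recent = deque(maxlen=window)   # sliding window over the stream
    alerts = []
    for i, value in enumerate(readings):
        if len(recent) == window and value > threshold * (sum(recent) / window):
            alerts.append((i, value))   # act as soon as the event happens
        recent.append(value)
    return alerts

# e.g. a sensor stream with one sudden spike
stream = [10, 11, 9, 30, 10, 12]
print(detect_spikes(stream))  # → [(3, 30)]
```

The key property is that the detector keeps only a small window of state, so it can run indefinitely over an unbounded stream instead of loading the data first.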
Applications
e-Commerce
Use of click-stream data to make product recommendations and bundles
Law enforcement and cyber security
Use video surveillance and face recognition for real-time situational awareness to improve crime prevention and law enforcement
Financial services
Use transactional data to detect fraud and illegal activities
Health Services
Use medical data to detect anomalies so as to improve patient conditions and save lives
Government
Use data from traffic sensors to change traffic-light sequences and traffic lanes to ease traffic congestion
Business problems addressed by big data analytics
process efficiency and cost reduction
brand management
revenue maximization, cross selling/upselling
Enhanced customer experience
churn identification, customer recruiting
improved customer service
identifying new products and market opportunities
risk management
regulatory compliance
enhanced security capabilities
Big data's high performance computing
In-database analytics
able to place analytic procedures close to where data is stored
Grid computing & massively parallel processing (MPP)
use of many machines and processors in parallel
In-memory analysis
able to store and process complete data set in RAM
Appliances
combine hardware, software and storage in a single unit for performance and scalability
Hadoop or data warehouse?
use Hadoop to store and archive multi-structured data
use Hadoop to filter, transform and consolidate multi-structured data
use Hadoop to analyse large volumes of multi-structured data and publish analytical results
use a relational DBMS that provides MapReduce capabilities as an investigative computing platform
use a front-end query tool to access and analyse data
Data scientist skill sets
Domain expertise, problem definition and decision making
Data access and management
programming, scripting and hacking
internet and social media/social networking technologies
curiosity and creativity
communication and interpersonal skills