Big Data Concepts and Tools
What is Big Data?
- Massive volumes of data
- Describes the exponential growth, availability and use of information, both structured and unstructured.
Characteristics
- Volume (Scale): data volume is increasing exponentially
- Variety (Complexity): structured, semi-structured and unstructured data
- Velocity (Speed): data needs to be generated and processed fast; real-time/fast data
- Veracity: accuracy, quality, truthfulness and trustworthiness of the data
- Variability: data flows can be inconsistent, with periodic peaks
- Value: the data must provide business value
Sources driving this growth
- Social media and networks: generating data
- Scientific instruments: collecting all sorts of data
- Mobile devices: tracking all objects all the time
- Sensor technology and networks: measuring all kinds of data
Fundamentals of big data analysis
- By itself, big data is worthless unless it is analysed and put to use
- Big data + "big" analysis = value
- Challenges:
- Capture, store and analyse the data effectively and efficiently
- A new breed of technologies is needed
Limitations of the data warehouse
- Schema (fixed)
- Scalability: unable to handle the huge amounts of data from new data sources
- Speed: unable to handle the speed at which data arrives
- Unable to handle sophisticated processing
- Unable to perform queries efficiently
Challenges of big data analytics
- Data volume: the ability to capture, store and process the huge volume of data in a timely manner
- Data integration: the ability to combine data quickly and at reasonable cost
- Processing capabilities: the ability to process data quickly
- Data governance: security, privacy, ownership and quality issues
- Skill availability: shortage of data scientists
- Solution cost: return on investment
Success factors for big data analytics
- A clear business need
- Strong committed sponsorship
- Alignment between the business and IT strategy
- A fact-based decision-making culture
- A strong data infrastructure
- The right analytics tools
- Personnel with advanced analytical skills
Big Data Technologies
Core big data technology: Hadoop + MapReduce
- Hadoop: an open-source framework for storing and analysing massive amounts of distributed, semi-structured and unstructured data; it consists of two components, the Hadoop Distributed File System (HDFS) and MapReduce
- MapReduce: a technique for distributing the processing of very large, multi-structured data files across a large cluster of commodity machines (a minimal sketch follows this list)
- NoSQL (Not Only SQL): a new style of database that processes large volumes of multi-structured data and often works in conjunction with Hadoop, e.g. Cassandra, MongoDB, CouchDB, HBase
- HIVE (by Facebook): a Hadoop-based, data-warehouse-like framework that lets users write queries in an SQL-like language known as HiveQL
- PIG (by Yahoo!): a Hadoop-based query language that is relatively easy to learn and is adept at very deep and long data pipelines
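To make the MapReduce pattern concrete, here is a minimal Python sketch of the map, shuffle and reduce steps over two in-memory "splits"; the function names and sample documents are illustrative assumptions, and real Hadoop distributes these same steps across HDFS blocks on many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1

def shuffle(mapped_pairs):
    """Shuffle step: group all intermediate values by key (the word)."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: aggregate the grouped values for each key into a final count."""
    return {word: sum(counts) for word, counts in grouped.items()}

if __name__ == "__main__":
    # Each string stands in for one data split stored on a different cluster node.
    splits = ["big data needs big analysis", "big data plus analysis equals value"]
    mapped = [pair for split in splits for pair in map_phase(split)]
    print(reduce_phase(shuffle(mapped)))  # {'big': 3, 'data': 2, 'analysis': 2, ...}
```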
Big Data & Data Warehouse
Impact of big data on the data warehouse
- They do not go well together
Use cases for Hadoop
- Hadoop as the repository and refinery, and as the active archive
Use cases for data warehousing
- Data warehouse performance
- Integrating data
- Interactive BI tools
Stream analytics
- Provides data-in-motion and real-time data analytics
- An analytic process of extracting actionable information from continuously flowing data
- The ability to store everything first and analyse later diminishes as the number of data sources increases
- Needed for critical event processing: complex pattern variations that must be detected and acted on as soon as they happen (see the sketch below)
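A minimal sketch of the data-in-motion idea, assuming a simple numeric sensor feed: only a small sliding window of recent readings is kept in memory, and an alert is raised the moment a value spikes well above the recent average. The window size and spike threshold are illustrative choices, not part of any particular streaming product.

```python
from collections import deque

def stream_monitor(readings, window_size=5, spike_factor=2.0):
    """Scan a (potentially unbounded) stream while keeping only a small window in memory."""
    window = deque(maxlen=window_size)
    for position, value in enumerate(readings):
        if len(window) == window_size:
            baseline = sum(window) / window_size
            # Critical event processing: act the moment the pattern appears.
            if value > spike_factor * baseline:
                yield position, value, baseline
        window.append(value)

if __name__ == "__main__":
    sensor_feed = [10, 11, 9, 10, 12, 11, 35, 10, 9]  # 35 is the anomalous spike
    for position, value, baseline in stream_monitor(sensor_feed):
        print(f"alert at position {position}: {value} vs recent average {baseline:.1f}")
```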
Applications
- e-Commerce: use click-stream data to make product recommendations and bundles
- Law enforcement and cyber security: use video surveillance and face recognition for real-time situational awareness to improve crime prevention and law enforcement
- Financial services: use transactional data to detect fraud and illegal activities
- Health services: use medical data to detect anomalies so as to improve patient conditions and save lives
- Government: use data from traffic sensors to change traffic-light sequences and traffic lanes to ease traffic congestion
Business problems addressed by big data analytics
- process efficiency and cost reduction
- brand management
- revenue maximization, cross selling/upselling
- Enhanced customer experience
- churn identification, customer recruiting
- improved customer service
- identifying new products and market opportunities
- risk management
- regulatory compliance
- enhanced security capabilities
Big data's high-performance computing
- In-database analytics: able to place analytic procedures close to where the data is stored
- Grid computing & massively parallel processing (MPP): use of many machines and processors in parallel (a sketch of the idea follows this list)
- In-memory analytics: able to store and process the complete data set in RAM
- Appliances: combine hardware, software and storage in a single unit for performance and scalability
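A rough sketch of the grid/MPP pattern using only the Python standard library: the data is partitioned, each worker process aggregates its own partition independently, and the partial results are combined, loosely mirroring how MPP systems spread work across many processors or machines.

```python
from multiprocessing import Pool

def partial_sum(partition):
    """Each worker aggregates its own partition independently (shared-nothing)."""
    return sum(partition)

def parallel_total(values, workers=4):
    """Split the data into one chunk per worker, then combine the partial results."""
    chunk = (len(values) + workers - 1) // workers
    partitions = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with Pool(processes=workers) as pool:
        return sum(pool.map(partial_sum, partitions))

if __name__ == "__main__":
    data = list(range(1_000_000))
    print(parallel_total(data))  # same answer as sum(data), computed in parallel
```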
How Hadoop works
Hadoop cluster
- Data is broken up into "parts" which are loaded into a file system (cluster) made up of multiple nodes (sketched below)
- Each "part" is replicated multiple times and loaded into the file system for redundancy and failsafe processing
- Jobs are distributed to the clients; upon completion, the results are collected and aggregated using MapReduce
Master node
- Name Node: keeps track of files and directories, and reports where in the cluster data is stored and whether any nodes have failed
- Job Tracker: schedules and coordinates the processing of data across the compute nodes
Slave nodes
- Data Node: storage node where the data is stored
- Task Tracker: compute node that processes the data
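To make the splitting and replication concrete, the small Python sketch below breaks the input into blocks and places each block on several nodes; the block size, replication factor of three, and round-robin-style placement are simplifying assumptions rather than HDFS's real block size or placement policy.

```python
def distribute(items, nodes, block_size=3, replication=3):
    """Break the input into blocks and place each block on `replication` distinct nodes."""
    placement = {node: [] for node in nodes}
    blocks = [items[i:i + block_size] for i in range(0, len(items), block_size)]
    for block_id, block in enumerate(blocks):
        # Keep copies on different nodes so a single node failure loses no data.
        targets = [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for node in targets:
            placement[node].append((block_id, block))
    return placement

if __name__ == "__main__":
    words = "to be or not to be that is the question".split()
    layout = distribute(words, ["node1", "node2", "node3", "node4"])
    for node, held in layout.items():
        print(node, [block_id for block_id, _ in held])
```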
Hadoop or data warehouse?
- use Hadoop to store and archive multi-structured data
- use Hadoop to filter, transform and consolidate multi-structured data
- use Hadoop to analyse large volumes of multi-structured data and publish analytical results
- use a relational DBMS that provides MapReduce capabilities as an investigative computing platform
- use a front-end query tool to access and analyse data
Data scientist skill sets
- Domain expertise, problem definition and decision making
- Data access and management
- programming, scripting and hacking
- internet and social media/social networking technologies
- curiosity and creativity
- communication and interpersonal skills