Big Data Concepts and Tools
What is Big Data?
- Massive volumes of data
- Describes the exponential growth, availability and use of information, both structured and unstructured.
Characteristics
- Volume (Scale): data volume is increasing exponentially
- Variety (Complexity): structured, semi-structured and unstructured data
- Velocity (Speed): data needs to be generated and processed fast; real-time/fast data
- Veracity: accuracy, quality, truthfulness and trustworthiness of the data
- Variability: data flows can be inconsistent, with periodic peaks
- Value: the data must provide business value
Sources driving this growth
- Social media and networks: generating data
- Scientific instruments: collecting all sorts of data
- Mobile devices: tracking all objects all the time
- Sensor technology and networks: measuring all kinds of data
Fundamentals of big data analysis
- By itself, big data is worthless unless it is analysed and put to use
- Big data + "big" analysis = value
- Challenges:
- Capture, store and analyse the data effectively and efficiently
- A new breed of technologies is needed
Limitations of the data warehouse
- Schema (fixed)
- Scalability: unable to handle the huge amounts of data from new data sources
- Speed: unable to handle the speed at which data arrives
- Unable to handle sophisticated processing
- Unable to perform queries efficiently
Challenges of big data analytics
- Data volume: the ability to capture, store and process the huge volume of data in a timely manner
- Data integration: the ability to combine data quickly and at reasonable cost
- Processing capabilities: the ability to process data quickly
- Data governance: security, privacy, ownership and quality issues
- Skill availability: shortage of data scientists
- Solution cost: return on investment
Success factors for big data analytics
- A clear business need
- Strong committed sponsorship
- Alignment between the business and IT strategy
- A fact-based decision-making culture
- A strong data infrastructure
- The right analytics tools
- Personnel with advanced analytical skills
Big Data Technologies
Core big data technology: Hadoop + MapReduce
- Hadoop: an open-source framework for storing and analysing massive amounts of distributed, semi-structured and unstructured data; it consists of two components, the Hadoop Distributed File System (HDFS) and MapReduce
- MapReduce: a technique for distributing the processing of very large, multi-structured data files across a large cluster of commodity machines (a minimal sketch follows this list)
- NoSQL (Not Only SQL): a new style of database that processes large volumes of multi-structured data and often works in conjunction with Hadoop, e.g. Cassandra, MongoDB, CouchDB, HBase
- HIVE (by Facebook): a Hadoop-based, data-warehouse-like framework that lets users write queries in an SQL-like language known as HiveQL
- PIG (by Yahoo!): a Hadoop-based query language that is relatively easy to learn and is adept at very deep and long data pipelines
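To make the MapReduce pattern concrete, here is a minimal Python sketch of the map, shuffle and reduce steps over two in-memory "splits"; the function names and sample documents are illustrative assumptions, and real Hadoop distributes these same steps across HDFS blocks on many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1

def shuffle(mapped_pairs):
    """Shuffle step: group all intermediate values by key (the word)."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: aggregate the grouped values for each key into a final count."""
    return {word: sum(counts) for word, counts in grouped.items()}

if __name__ == "__main__":
    # Each string stands in for one data split stored on a different cluster node.
    splits = ["big data needs big analysis", "big data plus analysis equals value"]
    mapped = [pair for split in splits for pair in map_phase(split)]
    print(reduce_phase(shuffle(mapped)))  # {'big': 3, 'data': 2, 'analysis': 2, ...}
```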
Big Data & Data Warehouse
Impact of big data on the data warehouse
- They do not go well together
Use cases for Hadoop
- Hadoop as the repository and refinery, and as the active archive
Use cases for data warehousing
- Data warehouse performance
- Integrating data
- Interactive BI tools
Stream analytics
- Provides data-in-motion and real-time data analytics
- An analytic process of extracting actionable information from continuously flowing data
- The ability to store everything first and analyse later diminishes as the number of data sources increases
- Needed for critical event processing: complex pattern variations that must be detected and acted on as soon as they happen (see the sketch below)
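A minimal sketch of the data-in-motion idea, assuming a simple numeric sensor feed: only a small sliding window of recent readings is kept in memory, and an alert is raised the moment a value spikes well above the recent average. The window size and spike threshold are illustrative choices, not part of any particular streaming product.

```python
from collections import deque

def stream_monitor(readings, window_size=5, spike_factor=2.0):
    """Scan a (potentially unbounded) stream while keeping only a small window in memory."""
    window = deque(maxlen=window_size)
    for position, value in enumerate(readings):
        if len(window) == window_size:
            baseline = sum(window) / window_size
            # Critical event processing: act the moment the pattern appears.
            if value > spike_factor * baseline:
                yield position, value, baseline
        window.append(value)

if __name__ == "__main__":
    sensor_feed = [10, 11, 9, 10, 12, 11, 35, 10, 9]  # 35 is the anomalous spike
    for position, value, baseline in stream_monitor(sensor_feed):
        print(f"alert at position {position}: {value} vs recent average {baseline:.1f}")
```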
Applications
- e-Commerce: use click-stream data to make product recommendations and bundles
- Law enforcement and cyber security: use video surveillance and face recognition for real-time situational awareness to improve crime prevention and law enforcement
- Financial services: use transactional data to detect fraud and illegal activities
- Health services: use medical data to detect anomalies so as to improve patient conditions and save lives
- Government: use data from traffic sensors to change traffic-light sequences and traffic lanes to ease traffic congestion
Business problems addressed by big data analytics
- process efficiency and cost reduction
- brand management
- revenue maximization, cross selling/upselling
- Enhanced customer experience
- churn identification, customer recruiting
- improved customer service
- identifying new products and market opportunities
- risk management
- regulatory compliance
- enhanced security capabilities
Big data's high-performance computing
- In-database analytics: able to place analytic procedures close to where the data is stored
- Grid computing & massively parallel processing (MPP): use of many machines and processors in parallel (a sketch of the idea follows this list)
- In-memory analytics: able to store and process the complete data set in RAM
- Appliances: combine hardware, software and storage in a single unit for performance and scalability
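A rough sketch of the grid/MPP pattern using only the Python standard library: the data is partitioned, each worker process aggregates its own partition independently, and the partial results are combined, loosely mirroring how MPP systems spread work across many processors or machines.

```python
from multiprocessing import Pool

def partial_sum(partition):
    """Each worker aggregates its own partition independently (shared-nothing)."""
    return sum(partition)

def parallel_total(values, workers=4):
    """Split the data into one chunk per worker, then combine the partial results."""
    chunk = (len(values) + workers - 1) // workers
    partitions = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with Pool(processes=workers) as pool:
        return sum(pool.map(partial_sum, partitions))

if __name__ == "__main__":
    data = list(range(1_000_000))
    print(parallel_total(data))  # same answer as sum(data), computed in parallel
```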
How Hadoop works
Hadoop cluster
- Data is broken up into "parts" which are loaded into a file system (cluster) made up of multiple nodes (sketched below)
- Each "part" is replicated multiple times and loaded into the file system for redundancy and failsafe processing
- Jobs are distributed to the clients; upon completion, the results are collected and aggregated using MapReduce
Master node
- Name Node: keeps track of files and directories, and reports where in the cluster data is stored and whether any nodes have failed
- Job Tracker: schedules and coordinates the processing of data across the compute nodes
Slave nodes
- Data Node: storage node where the data is stored
- Task Tracker: compute node that processes the data
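To make the splitting and replication concrete, the small Python sketch below breaks the input into blocks and places each block on several nodes; the block size, replication factor of three, and round-robin-style placement are simplifying assumptions rather than HDFS's real block size or placement policy.

```python
def distribute(items, nodes, block_size=3, replication=3):
    """Break the input into blocks and place each block on `replication` distinct nodes."""
    placement = {node: [] for node in nodes}
    blocks = [items[i:i + block_size] for i in range(0, len(items), block_size)]
    for block_id, block in enumerate(blocks):
        # Keep copies on different nodes so a single node failure loses no data.
        targets = [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for node in targets:
            placement[node].append((block_id, block))
    return placement

if __name__ == "__main__":
    words = "to be or not to be that is the question".split()
    layout = distribute(words, ["node1", "node2", "node3", "node4"])
    for node, held in layout.items():
        print(node, [block_id for block_id, _ in held])
```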
Hadoop or data warehouse?
- use Hadoop to store and archive multi-structured data
- use Hadoop to filter, transform and consolidate multi-structured data
- use Hadoop to analyse large volumes of multi-structured data and publish analytical results
- use a relational DBMS that provides MapReduce capabilities as an investigative computing platform
- use a front-end query tool to access and analyse data
Data scientist skill sets
- Domain expertise, problem definition and decision making
- Data access and management
- programming, scripting and hacking
- internet and social media/social networking technologies
- curiosity and creativity
- communication and interpersonal skills