Big Data Concepts & Tools

Characteristics of Big Data

Volume (Scale)

Data volume is increasing exponentially

Variety (Complexity)

Velocity (Speed)

Data is being generated fast and needs to be processed fast

Veracity

accuracy, quality, truthfulness, trustworthiness

Variability

data flows can be inconsistent, with periodic peaks

Value

provide business value

Limitations of Data Warehouse/Relational Database

Schema (fixed)

Scalability

Unable to handle the huge volumes of data from new/contemporary data sources

Speed

Unable to handle the speed at which big data arrives

Others

Unable to handle sophisticated processing such as machine learning

Unable to perform queries on big data efficiently

Challenges of Big Data Analytics

Data volume

ability to capture, store & process the huge volume of data in a timely manner

Data integration

ability to combine data quickly and at a reasonable cost

Processing capabilities

ability to process the data quickly, as it is captured

Data governance

Security, privacy, ownership, quality issues

Skill availability: shortage of data scientists

Solution cost: return on investment (ROI)
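The processing-capabilities challenge above (processing data as it is captured) can be illustrated with a minimal sketch. It assumes a stream of numeric readings and keeps only a running count and mean, so no reading has to be stored. The class name RunningStats and the sample readings are hypothetical, chosen only for illustration.

```python
class RunningStats:
    """Keeps a running mean without storing the individual readings."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: shift the old mean toward the new value.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


stats = RunningStats()
# Hypothetical readings arriving one at a time.
for reading in [10.0, 20.0, 30.0, 40.0]:
    stats.update(reading)

print(stats.count, stats.mean)
```

Each reading is handled once, on arrival, which is the property a system needs when data arrives faster than it can be stored and re-scanned.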

Critical Success Factors

A clear business need

Strong committed sponsorship

Alignment between the business & IT strategy

A fact-based decision-making culture

A strong data infrastructure

The right analytics tools

Personnel with advanced analytical skills

Business Problems

Process efficiency and cost reduction

Brand management

Revenue maximization, cross-selling/up-selling

Enhanced customer experience

Churn identification, customer recruiting

Improved customer service

Identifying new products & market opportunities

Risk management

Regulatory compliance

Enhanced security capabilities

High-Performance Computing

In-memory analytics

Storing & processing the complete data set in RAM

In-database analytics

Placing analytic procedures close to where data is stored

Grid computing & MPP

Use of many machines & processors in parallel

Appliances

Combining hardware, software & storage in a single unit for performance & scalability
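The difference between pulling the data into RAM and placing the analytic procedure close to where the data is stored can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for a database, and the sales table and its rows are hypothetical.

```python
import sqlite3

# Stand-in database with a small, hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0), ("south", 20.0)],
)


def totals_in_memory(connection):
    """Fetch every row into application memory, then aggregate there."""
    rows = connection.execute("SELECT region, amount FROM sales").fetchall()
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def totals_in_database(connection):
    """Let the database aggregate; only the summary rows are returned."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return dict(connection.execute(query).fetchall())


print(totals_in_memory(conn))
print(totals_in_database(conn))
```

Both functions return the same totals. The first moves every row to the application before aggregating, while the second moves only one row per region, which is the motivation for in-database analytics when the table is large.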

Big Data Technologies

Hadoop

MapReduce

NoSQL

Hive

Pig

Hadoop cluster nodes

Master: NameNode & JobTracker

Slave: DataNode & TaskTracker
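The MapReduce model listed above can be sketched in plain Python as a word count. This is a single-process illustration of the map, shuffle and reduce phases, not Hadoop's actual API. The function names and the two input splits are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input splits; on a cluster each would be mapped on a different node.
splits = ["big data needs big tools", "data tools for big data"]
counts = word_count(splits)
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold each split, and the framework performs the shuffle across the network before the reducers run.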