Big Data Concepts and Tools

What is Big Data?

  • Massive volumes of data
  • Describes the exponential growth, availability and use of information, both structured and unstructured

Characteristics

  1. Volume (Scale): data volume is increasing
  2. Variety (Complexity): structured, semi-structured and unstructured data
  3. Velocity (Speed): data needs to be generated and processed fast (real-time/fast data)
  4. Veracity: accuracy, quality, truthfulness, trustworthiness
  5. Variability: data flows can be inconsistent, with periodic peaks
  6. Value: provides business value

Sources of big data

  1. Social media and networks: generating data
  2. Scientific instruments: collecting all sorts of data
  3. Mobile devices: tracking all objects all the time
  4. Sensor technology and networks: measuring all kinds of data

Fundamentals of big data analysis

  • Big data by itself is worthless
  • Big data + "big" analysis = value
  • Challenges:
  1. Capturing, storing and analysing data effectively and efficiently
  2. A new breed of technologies is needed

Limitations of data warehouses

  1. Schema (fixed)
  2. Scalability
  3. Speed
  4. Unable to handle sophisticated processing
  5. Unable to perform queries efficiently

Challenges of big data analytics

  1. Data volume
  2. Data integration
  3. Processing capabilities
  4. Data governance
  5. Skill availability
  6. Solution cost

Success factors for Data Analytics

  1. A clear business need
  2. Strong, committed sponsorship
  3. Alignment between the business and IT strategy
  4. A fact-based decision-making culture
  5. A strong data infrastructure
  6. The right analytics tools
  7. Personnel with advanced analytical skills

Big Data Technologies

  1. Hadoop: an open-source framework for storing and analysing massive amounts of distributed, semi-structured and unstructured data. It consists of two components, the Hadoop Distributed File System (HDFS) and MapReduce; together they form the core big data technology.
  2. MapReduce: a technique for distributing the processing of very large multi-structured data files across a large cluster of commodity machines.
  3. NoSQL (Not only SQL): a new style of database that processes large volumes of multi-structured data and works in conjunction with Hadoop. E.g. Cassandra, MongoDB, CouchDB, HBase.
  4. Hive (by Facebook): a Hadoop-based, data-warehouse-like framework that enables users to write queries in an SQL-like language known as HiveQL.
  5. Pig (by Yahoo!): a Hadoop-based query language that is relatively easy to learn and is adept at very deep, very long data pipelines.
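The MapReduce model described above can be sketched in plain Python: a map phase emits key/value pairs from each input split, a shuffle groups the pairs by key, and a reduce phase aggregates each group. This is an illustrative single-machine simulation of the idea, not the Hadoop API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return (key, sum(values))

documents = ["big data needs big analysis", "big data tools"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # 3
```

In a real cluster the map calls run on the nodes holding each data block and the shuffle moves pairs across the network, but the three-phase logic is the same.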

Big Data & Data Warehouse

Impact of big data on the data warehouse

  • They do not go well together

Use cases for Hadoop

  • Hadoop as the repository and refinery, and as the active archive

Use cases for data warehousing

  • Data warehouse performance
  • Integrating data
  • Interactive BI tools

Stream analytics

  • Provides data-in-motion and real-time data analytics
  • Needed because storing everything becomes infeasible as the number of data sources increases
  • An analytic process of extracting actionable information from continuously flowing data
  • Needed for critical event processing: complex pattern variations that must be detected and acted on as soon as they happen
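The data-in-motion idea above can be sketched in a few lines: rather than storing the full stream, keep only a sliding window of recent values in memory and act the moment a pattern appears. The window size, threshold and "average breach" pattern here are illustrative assumptions, not a specific product's behaviour.

```python
from collections import deque

def detect_events(stream, window_size=5, threshold=100.0):
    """Process readings as they arrive; keep only the last `window_size`
    values in memory and flag any window whose average breaches the threshold."""
    window = deque(maxlen=window_size)
    alerts = []
    for i, value in enumerate(stream):
        window.append(value)
        if len(window) == window.maxlen and sum(window) / len(window) > threshold:
            alerts.append(i)  # act as soon as the pattern is detected
    return alerts

# e.g. sensor readings arriving one by one
readings = [90, 95, 92, 94, 96, 130, 140, 150, 80, 85]
print(detect_events(readings))  # [5, 6, 7, 8, 9]
```

Note that memory use is bounded by the window size regardless of how long the stream runs, which is what makes this approach viable when storing everything is not.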

Applications

  1. e-Commerce
  2. Law enforcement and cyber security
  3. Financial services
  4. Health services
  5. Government

Why big data and the traditional data warehouse do not go well together:

  • Unable to handle huge amounts of new data sources
  • Unable to handle the speed of the data

Descriptions of the big data analytics challenges:

  1. Data volume: ability to capture, store and process huge volumes of data in a timely manner
  2. Data integration: ability to combine data quickly and at lower cost
  3. Processing capabilities: ability to process data quickly
  4. Data governance: security, privacy, ownership and quality issues
  5. Skill availability: shortage of data scientists
  6. Solution cost: return on investment

Business problems addressed by big data analytics

  • Process efficiency and cost reduction
  • Brand management
  • Revenue maximization, cross-selling/upselling
  • Enhanced customer experience
  • Churn identification, customer recruiting
  • Improved customer service
  • Identifying new products and market opportunities
  • Risk management
  • Regulatory compliance
  • Enhanced security capabilities

Big data's high performance computing

  • In-database analytics: placing analytic procedures close to where the data is stored
  • Grid computing & massively parallel processing (MPP): using many machines and processors in parallel
  • In-memory analytics: storing and processing the complete data set in RAM
  • Appliances: combining hardware, software and storage in a single unit for performance and scalability
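The MPP idea, many workers each processing their own partition of the data with partial results combined at the end, can be sketched with a worker pool. A thread pool stands in here for what would be separate machines in a real shared-nothing MPP system; the partitioning scheme is an illustrative assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    """Each worker aggregates only its own shard of the data."""
    return sum(partition)

def parallel_total(data, workers=4):
    # Shared-nothing style: split the data into one partition per worker.
    partitions = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum, partitions)
    return sum(partials)  # combine the partial results

print(parallel_total(list(range(1_000_000))))  # 499999500000
```

The key property is that no worker ever sees the whole data set; only the small partial results travel to the final combine step, which is what lets MPP systems scale out.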

How Hadoop works

Hadoop cluster

  1. Data is broken up into “parts”, which are loaded into a file system (cluster) made up of multiple nodes
  2. Each “part” is replicated many times and loaded into the file system for redundancy and failsafe processing
  3. Jobs are distributed to the nodes holding the data; upon completion, results are collected and aggregated using MapReduce

Master

  • Name Node: keeps track of files and directories, and provides information on where in the cluster data is stored and whether any nodes have failed
  • Job Tracker: schedules jobs and assigns MapReduce tasks to the Task Trackers

Slave

  • Data Node: storage node where the data blocks are stored
  • Task Tracker: processes data on the compute nodes by running the map and reduce tasks
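The split between name node metadata and data node storage can be sketched as a small simulation: the file is cut into fixed-size blocks, each block is copied to several data nodes, and the name node keeps only the mapping from blocks to locations. The block size, node names and round-robin placement are illustrative assumptions (HDFS defaults to 128 MB blocks and a replication factor of 3, with rack-aware placement).

```python
import itertools

BLOCK_SIZE = 8          # bytes per block (tiny, for illustration)
REPLICATION = 3         # copies of each block
DATA_NODES = ["node1", "node2", "node3", "node4"]

def store_file(name, content):
    """Split `content` into blocks, replicate each block across data nodes,
    and record the block-to-node mapping the way a name node would."""
    blocks = [content[i:i + BLOCK_SIZE] for i in range(0, len(content), BLOCK_SIZE)]
    node_cycle = itertools.cycle(DATA_NODES)
    metadata = {}                                 # name node: block id -> holding nodes
    storage = {node: {} for node in DATA_NODES}   # data nodes: block id -> bytes
    for b, block in enumerate(blocks):
        block_id = f"{name}-blk{b}"
        replicas = [next(node_cycle) for _ in range(REPLICATION)]
        metadata[block_id] = replicas
        for node in replicas:
            storage[node][block_id] = block
    return metadata, storage

metadata, storage = store_file("log.txt", b"abcdefghijklmnopqrstuvwxyz")
print(metadata)
```

Because every block exists on several nodes, the loss of one data node costs nothing: the name node simply redirects readers to another replica, which is the failsafe behaviour the steps above describe.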

Hadoop or data warehouse?

  • use Hadoop to store and archive multi-structured data
  • use Hadoop to filter, transform and consolidate multi-structured data
  • use Hadoop to analyse large volumes of multi-structured data and publish analytical results
  • use a relational DBMS that provides MapReduce capabilities as an investigative computing platform
  • use a front-end query tool to access and analyse data
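The "filter, transform and consolidate" step above can be sketched as a single pass over semi-structured records (JSON-like dicts), dropping the malformed ones and flattening the survivors into uniform rows that a warehouse or BI tool could load. The field names and validation rules are illustrative assumptions.

```python
def consolidate(records):
    """Filter out malformed records, then transform the survivors into flat,
    uniformly shaped rows suitable for loading into a relational store."""
    rows = []
    for rec in records:
        if "user" not in rec or "event" not in rec:   # filter: drop malformed input
            continue
        rows.append({                                  # transform: flatten and normalise
            "user_id": rec["user"].get("id"),
            "event": rec["event"].lower(),
            "amount": float(rec.get("amount", 0)),
        })
    return rows

raw = [
    {"user": {"id": 1}, "event": "PURCHASE", "amount": "19.90"},
    {"event": "click"},                                # malformed: no user
    {"user": {"id": 2}, "event": "Click"},
]
print(consolidate(raw))
```

This division of labour is the point of the list above: Hadoop absorbs the messy multi-structured input cheaply, and only the clean, consolidated rows reach the warehouse and the front-end query tools.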

Data scientist skill sets

  • Domain expertise, problem definition and decision making
  • Data access and management
  • programming, scripting and hacking
  • internet and social media/social networking technologies
  • curiosity and creativity
  • communication and interpersonal skills

Examples by application area:

  • e-Commerce: use of click-stream data to make product recommendations and bundles
  • Law enforcement and cyber security: use of video surveillance and face recognition for real-time situational awareness to improve crime prevention and law enforcement
  • Financial services: use of transactional data to detect fraud and illegal activities
  • Health services: use of medical data to detect anomalies so as to improve patient conditions and save lives
  • Government: use of data from traffic sensors to change traffic-light sequences and traffic lanes to ease traffic congestion