Big Data Concepts and Tools

Learn about the tools and technologies for Big Data analytics

Understand what Big Data is and how it is changing the world of analytics

Describe the applications of stream analytics

Compare and contrast the complementary uses of data warehousing and Big Data

Become familiar with the vendors of Big Data tools and services

What Is Big Data?

Traditionally, “Big Data” simply meant massive volumes of data
Today, Big Data describes the exponential growth, availability, and use of information, both structured and unstructured
e.g., from social media, the Web, RFID, GPS, textual data, sensors, etc.

Where does Big Data come from?

Facebook, YouTube, Google, healthcare, government, military,
education

Characteristics of Big Data: 3V’s

Volume

Velocity

Variety

Data volumes are growing rapidly, from terabytes into petabytes

Speed: data is being generated fast and needs to
be processed fast
E-Promotions: based on your current location, your purchase history, and what you like,
 send promotions right now for the store next to you

- Structured data, e.g., relational data (tables/transactions/legacy data), spreadsheet
data
- Semi-structured data, e.g., email, logs, documents
- Unstructured data, e.g., videos, images, audio files,
streaming data, graphs, GPS location data, simulation data, etc.
- A single application can generate/collect
many types of data
- Big public data (online, weather, finance, etc.)

Other Vs of Big Data

The other Vs that define Big Data
- Veracity: accuracy, quality, truthfulness, trustworthiness
- Variability: data flows can be inconsistent, with periodic peaks
- Value: provides business value

Limitations of Data
Warehouse/Relational Database

- Schema: fixed, predefined schemas cannot easily accommodate new/contemporary data sources
- Scalability: unable to handle huge amounts (terabytes, petabytes) of data
- Speed: unable to handle the speed at which Big Data arrives
- Others:
  Unable to handle sophisticated processing such as machine learning
  Unable to perform queries on Big Data efficiently

Challenges of Big Data Analytics

Data volume: the ability to capture, store, and process the huge volume
of data in a timely manner

Data integration: the ability to combine data quickly and at reasonable cost

Processing capabilities: the ability to process the data quickly, as it is captured (i.e.,
stream analytics)

Data governance: security, privacy, ownership, and quality issues

Skill availability: shortage of data scientists

Solution cost: return on investment

Business Problems Addressed by
Big Data Analytics

Process efficiency and cost reduction
Brand management
Revenue maximization, cross-selling/up-selling
Enhanced customer experience
Churn identification, customer recruiting
Improved customer service
Identifying new products and market opportunities
Risk management
Regulatory compliance
Enhanced security capabilities

High-Performance Computing for Big Data

- In-memory analytics:
  Storing and processing the complete data set in RAM
- In-database analytics:
  Placing analytic procedures close to where the data
  is stored
- Grid computing & MPP:
  Use of many machines and processors in parallel (MPP: massively parallel processing)
- Appliances:
  Combining hardware, software, and storage in a single unit for performance and scalability
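The difference between the first two approaches can be sketched in a few lines of Python. This is an illustrative toy example (the data set and table name are invented), not vendor code: the first aggregation runs entirely in application RAM, while the second pushes the computation into the database engine so it runs where the data lives.

```python
import sqlite3
from collections import defaultdict

# Toy data set: (region, amount) sales records.
sales = [("east", 100.0), ("west", 250.0), ("east", 50.0), ("west", 75.0)]

# In-memory analytics: the complete data set is held in RAM and
# aggregated there, with no disk round-trips.
in_memory = defaultdict(float)
for region, amount in sales:
    in_memory[region] += amount

# In-database analytics: the aggregation is pushed into the database
# engine (here SQLite), so the analytic procedure runs close to where
# the data is stored instead of shipping rows out to the application.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
in_db = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

print(dict(in_memory))  # {'east': 150.0, 'west': 325.0}
print(in_db)            # {'east': 150.0, 'west': 325.0}
```

Both produce the same totals; the trade-off is where the work happens and how much data moves.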

Popular Big Data Technologies


Hadoop

[MapReduce]

NoSQL (Not only SQL)

HIVE

PIG

- Designed to process big data at reasonable cost and time


- Hadoop is an open-source framework for storing and analyzing massive amounts of distributed, semi-structured and unstructured data


- Open source: many people continuously improve it


- Hadoop clusters run on inexpensive commodity
hardware, so projects can scale out inexpensively


- MapReduce + Hadoop = Big Data core technology

How does Hadoop work?

- Consists of two components: the Hadoop Distributed File System (HDFS) and MapReduce


-Distributed with some centralization: Data is broken up into “parts,” which are then loaded into a file system (cluster) made up of multiple nodes (machines)


-Each “part” is replicated multiple times and loaded into the file system for replication and failsafe processing


-Jobs are distributed to the clients, and once completed, the results are collected and aggregated using MapReduce
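The map–shuffle–reduce flow described above can be sketched as a single-process word count. This is a minimal illustration of the programming model, not Hadoop code; the function names and input strings are invented for the example.

```python
from collections import defaultdict
from itertools import chain

def map_phase(part):
    """Map: each node emits (word, 1) for every word in its data part."""
    return [(word, 1) for word in part.split()]

def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key into the final result."""
    return {key: sum(values) for key, values in groups.items()}

# The input is broken into "parts", each mapped independently
# (in a real cluster, on different nodes), then the results are
# collected and aggregated.
parts = ["big data big tools", "data tools data"]
mapped = chain.from_iterable(map_phase(p) for p in parts)
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 3, 'tools': 2}
```

In real Hadoop the parts are HDFS blocks and the phases run on many machines, but the dataflow is the same.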

Hadoop Cluster

- Two node types:
  - Master: Name Node and Job Tracker
  - Slave: Data Node and Task Tracker


- Data Nodes are referred to as storage nodes, where the data is stored


- The Name Node keeps track of files and directories, provides information on where in the cluster data is stored, and reports whether any nodes have failed


- The Job Tracker and Task Tracker process data and are known as compute nodes


- The Job Tracker initiates and coordinates jobs (the processing of data) and dispatches compute tasks to the Task Tracker
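The Name Node's bookkeeping role can be illustrated with a toy placement table. This is not real HDFS code: the node names, file name, and replication policy are invented for the sketch, which just shows that the master holds only metadata (block id → list of Data Nodes) while the Data Nodes hold the actual bytes.

```python
import itertools

REPLICATION = 3                      # each block is replicated 3 times
data_nodes = ["node1", "node2", "node3", "node4"]
name_node = {}                       # metadata only: block id -> data nodes
node_cycle = itertools.cycle(data_nodes)

def store_file(filename, num_blocks):
    """Break a file into blocks and record where each replica lands.
    Round-robin placement is a simplification of HDFS's real policy."""
    for i in range(num_blocks):
        block_id = f"{filename}#blk{i}"
        name_node[block_id] = [next(node_cycle) for _ in range(REPLICATION)]

store_file("weblog.txt", num_blocks=2)
print(name_node["weblog.txt#blk0"])  # ['node1', 'node2', 'node3']
print(name_node["weblog.txt#blk1"])  # ['node4', 'node1', 'node2']
```

If a Data Node fails, the Name Node's table still lists two surviving replicas per block, which is what makes the failsafe processing described above possible.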

MapReduce

- Goal: achieving high performance with “simple” computers


-Developed and popularized by Google


-Good at processing and analyzing large volumes of multi-structured data in a timely manner


-Distributes the processing of very large multi-structured data files across a large cluster of ordinary machines/processors


-Used in indexing the Web for search, graph analysis, text analysis, machine learning, …

NoSQL (Not Only SQL)

- A new style of database that processes large volumes of multi-structured data


-Often works in conjunction with Hadoop


-Serves discrete data stored among large volumes of multi-structured data to end-users and Big Data applications


- Examples: Cassandra, MongoDB, CouchDB, HBase, etc. (e.g., Cassandra in eBay’s multi-data-center deployment)
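The document data model used by stores like MongoDB and CouchDB can be sketched with plain Python dictionaries. This is an illustration of the model, not any vendor's API; the collection and field names are invented.

```python
# A "collection" of schema-free documents keyed by id. Unlike a
# relational table, two records may carry completely different fields.
users = {}

users["u1"] = {"name": "Ana", "email": "ana@example.com"}
users["u2"] = {"name": "Ben", "follows": ["u1"], "location": "SG"}

# Serving a discrete record to an application is a key lookup,
# not a join over fixed-schema tables.
print(users["u2"]["follows"])  # ['u1']
```

This flexibility is what lets NoSQL stores serve discrete records out of large volumes of multi-structured data without a predefined schema.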

HIVE

- Hadoop-based data-warehousing-like framework
developed by Facebook
- Allows users to write queries in an SQL-like language called HiveQL, which are then converted to MapReduce jobs

PIG

- Hadoop-based query language developed by Yahoo! It is
relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).

Coexistence of Hadoop and DW

  1. Use Hadoop for storing and archiving multi-structured data
  2. Use Hadoop for filtering, transforming, and/or consolidating multi-structured data
  3. Use Hadoop to analyze large volumes of multi-structured data and publish the analytical results
  4. Use a relational DBMS that provides MapReduce capabilities as an investigative computing platform
  5. Use a front-end query tool to access and analyze data

Software, Hardware, Service, …
• Big Data vendor landscape is developing very rapidly
• A representative list would include
– Cloudera – cloudera.com
– MapR – mapr.com
– Hortonworks – hortonworks.com
– Also, IBM (Netezza, InfoSphere), Oracle (Exadata, Exalogic), Microsoft, Amazon, Google, …

Stream Analytics

- Data-in-motion analytics and real-time data analytics
- One of the Vs of Big Data: Velocity


- The analytic process of extracting actionable information from continuously flowing data


- Why stream analytics?
  - The store-everything approach becomes infeasible as the number of data sources increases
  - Need for critical event processing: complex pattern variations that must be detected and acted on as soon as they happen
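The two points above can be sketched in a few lines: process each reading as it arrives, keep only a small sliding window in memory (no store-everything), and fire a critical event the moment the pattern appears. The window size, threshold, and sensor readings are invented for this toy example.

```python
from collections import deque

WINDOW = 5          # only the last 5 readings are retained
THRESHOLD = 50.0    # critical condition: window average exceeds this

def detect_events(stream):
    """Yield (time, window average) the moment the condition holds."""
    window = deque(maxlen=WINDOW)   # bounded memory: data in motion,
    for t, value in stream:         # not data at rest
        window.append(value)
        avg = sum(window) / len(window)
        if avg > THRESHOLD:         # act as soon as it happens
            yield (t, avg)

readings = enumerate([10, 20, 30, 90, 95, 99, 20, 10])
events = list(detect_events(readings))
print(events)  # [(5, 66.8), (6, 66.8), (7, 62.8)]
```

Note the event at t=5 is emitted before later readings exist, which is the essential difference from batch analytics over stored data.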

Stream Analytics Applications

- e-Commerce – use of click-stream data to make product recommendations and bundles


-Law Enforcement and Cyber Security – use video surveillance and face recognition for real-time situational awareness to improve crime prevention and law enforcement


- Financial Services – use transactional data to detect fraud and illegal activities


-Health Services – use medical data to detect anomalies so as to improve patient conditions and save lives.


-Government - use data from traffic sensors to change traffic light sequences and traffic lanes to ease the pain caused by traffic congestion problems