Week 7: Big Data Concepts and Tools
Characteristics of Big Data
Volume (Scale)
Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 zettabytes
Data volume is increasing exponentially
Variety (Complexity)
Structured data, e.g. relational data (tables/transaction/legacy data), spreadsheet data
Semi-structured data, e.g. email, logs, documents
Unstructured data, e.g. videos, images, audio files, streaming data, graphs, GPS location data, simulation data, etc.
A single application can be generating/collecting many types of data
Big Public Data (online, weather, finance, etc)
To extract knowledge -> all these types of data need to be linked together (see the sketch below)
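A minimal sketch of such linking, using the pandas library with synthetic inline records (all field names and values here are invented for illustration):

# Sketch: linking structured (tabular) and semi-structured (JSON) data.
# All records, field names, and the join key are hypothetical.
import json
import pandas as pd

# Structured data: relational-style transaction records
transactions = pd.DataFrame([
    {"customer_id": 1, "amount": 25.00},
    {"customer_id": 2, "amount": 99.50},
])

# Semi-structured data: JSON log events as an application might emit them
log_lines = [
    '{"customer_id": 1, "event": "view", "page": "/shoes"}',
    '{"customer_id": 2, "event": "click", "page": "/sale"}',
]
events = pd.DataFrame([json.loads(line) for line in log_lines])

# Linking the two on a shared key yields knowledge neither holds alone:
# which pages the highest-spending customers interact with.
linked = events.merge(transactions, on="customer_id")
print(linked.sort_values("amount", ascending=False))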
Velocity (Speed)
Data is being generated fast and needs to be processed fast
Online Data Analytics
Late decisions -> missing opportunities
Examples:
E-Promotions: based on your current location, your purchase history, and what you like -> send promotions right now for the store next to you
Healthcare monitoring: sensors monitoring your activities and body -> any abnormal measurement requires an immediate reaction
Other Vs:
Veracity: accuracy, quality, truthfulness, trustworthiness
Variability: data flows can be inconsistent, with periodic peaks
Value: provides business value
Fundamentals of Big Data Analytics
Big Data by itself, regardless of size, type or speed = worthless
Big Data + "Big" Analytics = VALUE
With the value proposition, Big Data also brought about big challenges
Effectively and efficiently capturing, storing and analyzing Big Data
New breed of technologies needed (developed, purchased, hired, or outsourced)
Limitations of Data Warehouse/Relational Database
Schema (Fixed)
Scalability
Unable to handle huge volumes (terabytes, petabytes) of new/contemporary data sources
Speed
Unable to handle speed at which big data is arriving
Others
Unable to handle sophisticated processing such as machine learning
Unable to perform queries on big data efficiently
Challenges of Big Data Analytics
Data volume
The ability to capture, store, and process the huge volume of data in a timely manner
Data integration
The ability to combine data quickly and at reasonable cost
Processing capabilities
The ability to process the data quickly, as it is captured (i.e. stream analytics)
Data governance
Security, privacy, ownership, quality issues
Skill availability: shortage of data scientists
Solution cost: Return on Investment
Critical Success Factors for Big Data Analytics
A clear business need
Strong committed sponsorship
Alignment between the business & IT strategy
A fact-based decision making culture
A strong data infrastructure
The right analytics tools
Personnel with advanced analytical skills
Business Problems Addressed by Big Data Analytics
Process efficiency and cost reduction
Brand management
Revenue maximization, cross-selling/up-selling
Enhanced customer experience
Churn identification, customer recruiting
Improved customer service
Identifying new products and market opportunities
Risk management
Regulatory compliance
Enhanced security capabilities
…these are data collected in relation to behaviour
High-Performance Computing for Big Data
In-memory analytics
Storing and processing the complete data set in RAM (contrasted with a one-pass alternative in the sketch after this list)
In-database analytics
Placing analytic procedures close to where data is stored
Grid computing & MPP
Use of many machines and processors in parallel (MPP: massively parallel processing)
Appliances
Combining hardware, software, and storage in a single unit for performance and scalability
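A minimal sketch of the trade-off behind in-memory analytics: fast when the complete data set fits in RAM, while an incremental one-pass alternative keeps memory constant; this one-pass style is what grid/MPP setups parallelize across machines (the data generator is synthetic):

# Sketch: in-memory vs. incremental aggregation over the same data.
def readings():
    # Synthetic stand-in for a large data set.
    for i in range(1_000_000):
        yield (i % 97) * 0.5

# In-memory analytics: hold the complete data set in RAM, then analyze.
values = list(readings())
print(sum(values) / len(values))

# Incremental alternative: one pass, constant memory; works even when
# the data cannot fit in RAM on a single machine.
total, count = 0.0, 0
for v in readings():
    total += v
    count += 1
print(total / count)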
Popular Big Data Technologies
Hadoop
MapReduce
NoSQL (Not only SQL)
HIVE
PIG
HADOOP
Hadoop is an open-source framework for storing and analyzing massive amounts of distributed, semi-structured and unstructured data
Open source: hundreds of contributors continuously improve the core technology
Hadoop clusters run on inexpensive commodity hardware, so projects can scale out inexpensively
MapReduce + Hadoop = Big Data core technology
Hadoop Cluster
2 types of nodes
Master: Name Node and Job Tracker
Slave: Data Node and Task Tracker
Data Nodes are referred to as storage nodes, where the data is stored
The Name Node keeps track of files and directories, provides information on where in the cluster data is stored, and knows whether any nodes have failed
The Job Tracker and Task Trackers process data and are known as compute nodes
The Job Tracker initiates and coordinates jobs (the processing of data) and dispatches compute tasks to the Task Trackers (see the toy model below)
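A toy model of that master/slave bookkeeping (an illustration in Python, not Hadoop's actual code): the Name Node holds only metadata mapping files to blocks and blocks to Data Nodes, with each block replicated three times:

# Toy model of HDFS-style bookkeeping; not real Hadoop code.
import itertools

data_nodes = ["node1", "node2", "node3", "node4"]
REPLICATION = 3  # each block is stored on 3 different data nodes

# The name node keeps only metadata: file -> blocks -> locations.
name_node = {}

def store_file(filename, blocks):
    placements = {}
    cycle = itertools.cycle(data_nodes)
    for block in blocks:
        placements[block] = [next(cycle) for _ in range(REPLICATION)]
    name_node[filename] = placements

store_file("weblog.txt", ["block-0", "block-1"])

# A client asks the name node where a block lives, then reads it
# directly from one of the (replicated) data nodes.
print(name_node["weblog.txt"]["block-0"])  # e.g. ['node1', 'node2', 'node3']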
Demystifying Facts
Hadoop consists of multiple products
Hadoop is open source but available from vendors too
Hadoop is an ecosystem, not a single product
HDFS is a file system, not a DBMS
Hadoop and MapReduce are related but not the same
MapReduce provides control for analytics
Hadoop is about data diversity, not just data volume
Hadoop complements a DW; it’s rarely a replacement
Hadoop enables many types of analytics, not just Web analytics
Hadoop Technical Components
Hadoop Distributed File System (HDFS)
Name Node (primary facilitator)
Secondary Node (backup to Name Node)
Job Tracker
Slave Nodes (the grunts of any Hadoop cluster)
Additionally, the Hadoop ecosystem is made up of a number of complementary sub-projects: NoSQL (Cassandra, HBase), DW (Hive), …
NoSQL (Not Only SQL)
A new style of database that processes large volumes of multi-structured data
Often works in conjunction with Hadoop
Serves discrete data stored among large volumes of multi-structured data to end-users and Big Data applications
Examples: Cassandra, MongoDB, CouchDB, HBase, etc. (see the MongoDB sketch below)
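A hedged example with MongoDB, one of the stores listed above, via the pymongo driver; it assumes a MongoDB server running on localhost, and the database/collection/field names are invented:

# Assumes a local MongoDB instance; database/collection names are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["bigdata_demo"]

# Multi-structured documents: records need not share a fixed schema.
db.events.insert_one({"user": "u1", "event": "click", "page": "/sale"})
db.events.insert_one({"user": "u2", "event": "gps", "lat": 1.35, "lon": 103.8})

# Serve discrete records back to an application by simple lookup.
for doc in db.events.find({"user": "u1"}):
    print(doc)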
How Does Hadoop Work?
Consists of 2 components: the Hadoop Distributed File System (HDFS) and MapReduce
Distributed with some centralization: Data is broken up into “parts,” which are then loaded into a file system (cluster) made up of multiple nodes (machines)
Each “part” is replicated multiple times and loaded into the file system for replication and failsafe processing
Jobs are distributed to the clients, and once completed, the results are collected and aggregated using MapReduce
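A minimal pure-Python simulation of that flow (word count, the classic illustration; not real Hadoop code): each "part" is mapped independently, the emitted pairs are shuffled by key, and a reduce step aggregates the results:

# Simulated MapReduce word count; illustrates the data flow only.
from collections import defaultdict

parts = [  # data broken up into "parts" across the cluster
    "big data needs big analytics",
    "big analytics creates value",
]

# Map phase: each part independently emits (key, value) pairs.
mapped = [(word, 1) for part in parts for word in part.split()]

# Shuffle phase: group all values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate each key's values into the final result.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)  # {'big': 3, 'data': 1, 'needs': 1, 'analytics': 2, ...}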
Big Data and Data Warehousing
What is the impact of Big Data on DW?
Big Data and RDBMS do not go nicely together
Will Hadoop replace data warehousing/RDBMS?
Use Cases for Hadoop
Hadoop as the repository and refinery
Hadoop as the active archive
Use Cases for Data Warehousing
Data warehouse performance
Integrating data that provides business value
Interactive BI tools
Use Hadoop for storing and archiving multi-structured data
Use Hadoop for filtering, transforming, and/or consolidating multi-structured data
Use Hadoop to analyze large volumes of multi-structured data and publish the analytical results
Use a relational DBMS that provides MapReduce capabilities as an investigative computing platform
Use a front-end query tool to access and analyze data
MapReduce
Goal: achieving high performance with “simple” computers
Developed and popularized by Google
Good at processing and analyzing large volumes of multi-structured data in a timely manner
Distributes the processing of very large multi-structured data files across a large cluster of ordinary machines/processors
Used in indexing the Web for search, graph analysis, text analysis, machine learning, etc. (see the parallel-map sketch below)
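A sketch of the "many ordinary processors" idea using Python's multiprocessing module, with local worker processes standing in for cluster machines; a real MapReduce framework distributes this same map-then-merge pattern across nodes:

# Parallel map over partitions using local processes as stand-in "machines".
from multiprocessing import Pool
from collections import Counter

def map_task(part):
    # Each worker counts words in its own partition independently.
    return Counter(part.split())

if __name__ == "__main__":
    parts = ["big data needs big analytics", "big analytics creates value"]
    with Pool(processes=2) as pool:
        partials = pool.map(map_task, parts)
    # Reduce: merge the partial counts from every worker.
    total = sum(partials, Counter())
    print(total.most_common(3))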
HIVE
Hadoop-based data warehousing-like framework developed by Facebook
Allows users to write queries in an SQL-like language called HiveQL, which are then converted into MapReduce jobs
PIG
Hadoop-based query language developed by Yahoo! It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
Skills That Define a Data Scientist
Domain Expertise, Problem Definition and Decision Modeling
Data Access and Management (both traditional and new data systems)
Programming, Scripting and Hacking
Internet and Social Media/Social Networking Technologies
Curiosity and Creativity
Communication and Interpersonal Skills
Big Data and Stream Analytics
Stream Analytics Applications
e-Commerce: use of click-stream data to make product recommendations and bundles
Law Enforcement and Cyber Security: use of video surveillance and face recognition for real-time situational awareness to improve crime prevention and law enforcement
Financial Services: use of transactional data to detect fraud and illegal activities
Health Services: use of medical data to detect anomalies so as to improve patient conditions and save lives
Government: use of data from traffic sensors to change traffic light sequences and traffic lanes to ease traffic congestion
Data-in-motion analytics and real-time data analytics
One of the Vs in Big Data = Velocity
Analytic process of extracting actionable information from continuously flowing data
Why Stream Analytics?
Store-everything approach infeasible when the number of data sources increases
Need for critical event processing: complex pattern variations that need to be detected and acted on as soon as they happen (see the sketch below)
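A minimal sketch of such critical event processing (all readings and thresholds are invented): a generator stands in for data-in-motion, and a small sliding window raises an alert the moment a sustained abnormal pattern appears, without storing the stream:

# Sketch of stream analytics: act on events as they arrive, store nothing.
from collections import deque

def sensor_stream():
    # Stand-in for a continuous feed (heart rate, traffic counts, ...).
    for value in [72, 75, 74, 120, 125, 130, 76, 74]:
        yield value

WINDOW, THRESHOLD = 3, 110
window = deque(maxlen=WINDOW)

for reading in sensor_stream():
    window.append(reading)
    # Critical event: every reading in the current window is abnormal.
    if len(window) == WINDOW and all(v > THRESHOLD for v in window):
        print(f"ALERT: sustained abnormal readings {list(window)}")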