Chapter 4: Data Munging, MapReduce vs Spark
Chapter 4: Data Munging
Data Processing
The process of collecting, manipulating, and managing data to generate meaningful information for the end user
Data capture
Data may originate from different sources in the form of transactions, observations, and so forth
Data wrangling
Data transformation and mapping: converting raw data into a more useful format for purposes such as analytics
Access
Wrangling starts with accessing raw data from a source.
Transform
The raw data is then transformed using algorithms or parsed into structured data.
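As a minimal sketch of the access-then-transform sequence, the Python snippet below takes raw comma-separated text and parses it into structured, typed records. The sample data, the field names, and the helper `parse_records` are hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical raw data, standing in for a file or feed read from a source.
RAW = "id,amount,city\n1, 19.99 ,Oslo\n2,5,  Bergen\n"

def parse_records(raw_text):
    """Parse raw CSV text into a list of typed dictionaries."""
    reader = csv.DictReader(io.StringIO(raw_text))
    records = []
    for row in reader:
        records.append({
            "id": int(row["id"]),
            # Strip stray whitespace before converting to a number.
            "amount": float(row["amount"].strip()),
            "city": row["city"].strip(),
        })
    return records

records = parse_records(RAW)
print(records)
```

Each raw line becomes a dictionary with a consistent schema, which is the kind of structured form that later analytics steps can rely on.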
Parallel data processing
Parallel processing is when several smaller tasks are executed at the same time to complete a larger task faster.
Cluster
Clusters suit Big Data processing because a large dataset can be divided into smaller datasets and then processed in parallel in a distributed manner.
A cluster typically comprises low-cost commodity nodes that collectively provide increased processing capacity.
Clusters also provide redundancy and fault tolerance, so processing and analysis remain resilient even when a network or node fails.
MapReduce
MapReduce works on the divide-and-conquer principle and provides built-in fault tolerance and redundancy.
The framework breaks down datasets into smaller parts and processes each part independently and in parallel.
Map and Reduce Task
A MapReduce job consists of a map task and a reduce task, each of which has multiple stages.
Steps of a MapReduce task (word count)
Take a text file as input.
Split the file into key-value pairs, where the key is the offset of each line and the value is the content of that line.
In the map phase, split the value of each line into words separated by spaces, and emit each word as a key with a count of 1.
Perform shuffling to group the counts from the map step by key. Keys are now strings (the words), and values are lists of numbers.
In the reduce phase, sum the list of numbers associated with each key.
The output of the reduce phase is the final result: the count of each word.
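The steps above can be sketched in plain Python as a single-process simulation of a word count. This is not Hadoop code: the function names (`split_input`, `map_phase`, `shuffle_phase`, `reduce_phase`) and the sample text are assumptions made for illustration, and offsets are counted in characters rather than bytes.

```python
from collections import defaultdict

def split_input(text):
    """Turn text into (offset, line) pairs, the input records for the map step."""
    records, offset = [], 0
    for line in text.splitlines(keepends=True):
        records.append((offset, line.rstrip("\n")))
        offset += len(line)
    return records

def map_phase(records):
    """Emit (word, 1) for every space-separated word; the offset key is unused."""
    return [(word, 1) for _offset, line in records for word in line.split()]

def shuffle_phase(pairs):
    """Group the emitted values by key: word -> list of counts."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups):
    """Sum the list of counts for each word."""
    return {key: sum(values) for key, values in groups.items()}

text = "deer bear river\ncar car river\ndeer car bear\n"
records = split_input(text)
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)
```

In a real cluster the map and reduce functions would run on many nodes at once and the shuffle would move data across the network; here everything runs sequentially so the data flow is easy to follow.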
MapReduce Algorithm
Task Parallelism
Task parallelism parallelizes data processing by dividing a task into smaller sub-tasks and running each on a separate processor, often on different nodes in a cluster.
These sub-tasks can execute different algorithms and work on their own copy of the data in parallel.
Finally, the outputs from these sub-tasks are combined to get the final results.
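A minimal sketch of task parallelism, assuming threads on one machine as a stand-in for separate processors or cluster nodes. The three sub-task functions and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Three different algorithms, one per sub-task.
def total(values):
    return sum(values)

def maximum(values):
    return max(values)

def count_even(values):
    return sum(1 for v in values if v % 2 == 0)

data = [4, 8, 15, 16, 23, 42]
tasks = {"total": total, "maximum": maximum, "even_count": count_even}

with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    # Each sub-task receives its own copy of the data.
    futures = {name: pool.submit(fn, list(data)) for name, fn in tasks.items()}
    # Combine the outputs of the sub-tasks into the final result.
    results = {name: future.result() for name, future in futures.items()}

print(results)
```

The point of the sketch is the shape of the computation: different functions, the same input, and a final step that gathers the outputs.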
Data Parallelism
Data parallelism parallelizes data processing by splitting a dataset into multiple smaller datasets and processing each one simultaneously.
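A minimal sketch of data parallelism, again assuming threads on one machine in place of cluster nodes. The helper `chunk`, the function `sum_of_squares`, and the dataset are hypothetical; the final summation is included only to show one way the partial results might be combined.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(values, n_chunks):
    """Split values into n_chunks contiguous sub-datasets of near-equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks

def sum_of_squares(values):
    """The single function applied to every sub-dataset."""
    return sum(v * v for v in values)

dataset = list(range(1, 11))
parts = chunk(dataset, 3)

with ThreadPoolExecutor(max_workers=len(parts)) as pool:
    # The same function runs on each sub-dataset at the same time.
    partials = list(pool.map(sum_of_squares, parts))

result = sum(partials)
print(parts, partials, result)
```

Unlike the task-parallel case, every worker runs the same function here; only the slice of data differs.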
Real-Time Processing
Real-time mode emphasizes speed, addressing the velocity characteristic of Big Data.
Real-time processing is also known as event or stream processing, because data arrives either continuously (stream) or at intervals (event).
While each individual data item is small, the continuous flow results in very large datasets.
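A minimal sketch of stream-style processing, in which each small reading is handled as it arrives instead of waiting for a complete dataset. The generator `sensor_stream`, its readings, and the window size are hypothetical; a production system would read from a message queue or socket.

```python
from collections import deque

def sensor_stream():
    """Hypothetical source that yields one small reading at a time."""
    for reading in [21.0, 21.5, 22.0, 30.0, 22.5]:
        yield reading

def rolling_average(stream, window=3):
    """Yield the mean of the last `window` readings as each one arrives."""
    recent = deque(maxlen=window)  # old readings fall out automatically
    for reading in stream:
        recent.append(reading)
        yield sum(recent) / len(recent)

averages = list(rolling_average(sensor_stream()))
print(averages)
```

Only the most recent readings are kept in memory, which is what makes this approach workable when the flow never ends.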
Spark
Apache Spark is a fast, general-purpose data processing framework based on cluster computing.
It utilizes in-memory distributed computing and can operate within existing Hadoop environments or as a standalone system.
It supports batch and streaming data workloads and provides an interactive programming environment through shell support.
Spark uses a master/worker (historically called master/slave) architecture, in which a driver program on the master node communicates with executors on worker nodes.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object created by the driver program.
Spark can run on its own standalone cluster manager or on an external one such as Hadoop YARN; the SparkContext connects to the cluster manager, which allocates resources across applications.
Spark's core concept is the Resilient Distributed Dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel.
RDDs are immutable, distributed collections of records that abstract away the complexity of parallel processing.
The name summarizes these properties: resilient (fault-tolerant), distributed (spread across multiple nodes), and dataset (the records of data being processed).
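To make the RDD properties concrete, here is a toy model in plain Python. It is not PySpark and does not use Spark's API: the class `ToyRDD` and all of its methods are hypothetical, partitions are processed one after another in a single process, and the only purpose is to show immutability, partitioning, and how a recorded chain of transformations allows a partition to be recomputed.

```python
class ToyRDD:
    """Toy model of an RDD: immutable partitions plus a recorded lineage."""

    def __init__(self, partitions, lineage=()):
        # Source data is stored as tuples and never mutated.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Transformations to apply, in order.
        self._lineage = tuple(lineage)

    def map(self, fn):
        # A transformation returns a new ToyRDD; nothing is computed yet.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._partitions, self._lineage + (("filter", predicate),))

    def _compute_partition(self, index):
        # A lost partition can be rebuilt by replaying the lineage on its source data.
        records = self._partitions[index]
        for kind, fn in self._lineage:
            if kind == "map":
                records = tuple(fn(r) for r in records)
            else:
                records = tuple(r for r in records if fn(r))
        return records

    def collect(self):
        # An action triggers the computation, partition by partition.
        result = []
        for i in range(len(self._partitions)):
            result.extend(self._compute_partition(i))
        return result

base = ToyRDD([[1, 2, 3], [4, 5, 6]])
squares_of_evens = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_evens.collect())
print(base.collect())
```

Because `base` is never modified, it can be reused for other computations, and any partition of `squares_of_evens` can be rebuilt independently from its source partition.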
Interactive mode, which generally refers to query processing in real time, is closely related to real-time processing.