Chapter 4: Data Munging, MapReduce vs. Spark

- - - - Data Capturing
      - Data collection from subsystem
      - Data collection from web portal
      - Data Transmission
    - - Classify
      - Sort/Merge
      - Mathematical Operations
      - Transform
    - - Storage
      - Retrieval
      - Archival
      - Governance
    - - Advanced Computing
      - Format
      - Present
- - - - Consistency
        
        Consistency refers to the accuracy and precision of results: accurate results are close to the correct values, and precise results are close to one another.
      - Volume
        
        It represents the amount of data that can be processed, especially large volumes that require distributed processing.
      - Speed
        
        It indicates how quickly data can be processed once generated, especially in real-time analytics.
        
        Processing large volumes of data while maintaining speed and consistency is challenging.
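The distinction between accuracy and precision in the consistency definition above can be made concrete with a small numeric sketch. The reference value, the two result sets, and the helper names `bias` and `spread` are all invented for illustration; they are one possible way to quantify the two ideas, not a standard from the chapter.

```python
import statistics

# Hypothetical reference value and two made-up sets of results.
true_value = 10.0
accurate_but_imprecise = [8.0, 12.0, 9.0, 11.0]    # centred on 10, but scattered
precise_but_inaccurate = [12.1, 12.0, 11.9, 12.0]  # tightly grouped, but off target


def bias(results, reference):
    """Accuracy proxy: distance between the mean result and the reference value."""
    return abs(statistics.mean(results) - reference)


def spread(results):
    """Precision proxy: population standard deviation of the results."""
    return statistics.pstdev(results)


for name, results in [
    ("accurate but imprecise", accurate_but_imprecise),
    ("precise but inaccurate", precise_but_inaccurate),
]:
    print(name, "bias:", bias(results, true_value), "spread:", spread(results))
```

A smaller `bias` means more accurate results; a smaller `spread` means more precise results. A consistent system would aim to keep both small.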
- - - - Offline (batch) processing causes delays and results in high-latency responses.
      - Queries can be complex with multiple joins.
      - Batch workloads focus on highly read-intensive tasks over large volumes of data.
      - Offline processing handles data in batches.
      - OLAP systems and strategic BI analytics are often batch-oriented.
    - - Transactional online processing operates interactively with low-latency responses.
      - Transactional workloads handle small amounts of data with random reads and writes.
      - OLTP and operational systems are typical for transactional processing, which is write-intensive.
      - These workloads contain a mix of read/write queries, but are more write-intensive.
      - Transactional workloads involve random reads/writes with fewer joins compared to BI and reporting tasks.
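The contrast between the two workload types above can be sketched with Python's built-in `sqlite3` module. SQLite is used here only because it runs in memory without setup; it is not presented as a typical OLTP or OLAP engine, and the `customers` and `orders` tables are hypothetical.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    """
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "north"), (2, "south"), (3, "north")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 10.0), (2, 2, 20.0), (3, 3, 5.0), (4, 1, 15.0)],
)

# Transactional style: a small write touching one row, committed immediately,
# followed by a point read by primary key (no joins).
with conn:
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (5, 2, 30.0))
one_order = conn.execute("SELECT amount FROM orders WHERE id = ?", (5,)).fetchone()

# Batch / analytical style: a read-only query that joins tables and
# aggregates over every row.
report = conn.execute(
    """
    SELECT c.region, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()

print(one_order)
print(report)
```

The transactional statements touch a single row each, while the report scans both tables in full. That difference in access pattern is why the two workloads are usually served by separate systems.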
- - - - Take a text file as input; each line of the file becomes one key-value record.
      - Split the file into key-value pairs, where the key is the line's offset in the file and the value is the line's text.
      - In the mapper phase, process the value of each pair by splitting it into individual words separated by spaces. Each word becomes a key associated with a count of 1.
      - Perform shuffling to associate each key with the group of numbers emitted for it in the mapping step. Keys are strings (the words), and values are lists of numbers.
      - In the reducer phase, sum the numbers associated with each key.
      - The output of the reducer phase is the final result: the count of each word.
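The word-count steps above can be sketched in plain Python. This is a single-process illustration of the split, map, shuffle, and reduce phases, not Hadoop code; the function names and the sample text are assumptions made for the example.

```python
from collections import defaultdict


def split_input(text):
    """Split: turn the file contents into (byte offset, line) records."""
    records = []
    offset = 0
    for line in text.splitlines(keepends=True):
        records.append((offset, line.rstrip("\n")))
        offset += len(line.encode("utf-8"))
    return records


def map_phase(records):
    """Map: emit (word, 1) for every space-separated word in each line."""
    for _offset, line in records:
        for word in line.split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key, giving word -> [1, 1, ...]."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Reduce: sum the list of numbers for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(text):
    return reduce_phase(shuffle_phase(map_phase(split_input(text))))


sample = "deer bear river\ncar car river\ndeer car bear\n"
counts = word_count(sample)
print(counts)
```

In a real cluster the map and reduce functions would run on many nodes and the framework would perform the shuffle over the network; here each phase is an ordinary function call so the data flow is easy to follow.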
  - - - Task parallelism divides a task into smaller sub-tasks and runs each on a separate processor, often on different nodes in a cluster.
      - These sub-tasks can execute different algorithms and work on their own copy of the data in parallel.
      - Finally, the outputs from these sub-tasks are combined to get the final results.
    - - Data parallelism splits a dataset into multiple smaller datasets and processes each one simultaneously.
      - These sub-datasets are distributed across multiple nodes and processed using the same algorithm.
      - The results from each processed sub-dataset are then combined to produce the final set of results.
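The two styles described above are commonly called task parallelism (different algorithms, each on its own copy of the data) and data parallelism (the same algorithm on separate slices of the data). The following is a minimal single-machine sketch using `concurrent.futures`. Threads stand in for cluster nodes purely for illustration, and the dataset and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))  # made-up dataset: the integers 1..100


# Task parallelism: three different algorithms, each given its own copy.
def total(values):
    return sum(values)


def maximum(values):
    return max(values)


def mean(values):
    return sum(values) / len(values)


def task_parallel(values):
    tasks = {"total": total, "maximum": maximum, "mean": mean}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, list(values)) for name, fn in tasks.items()}
        # Combine the outputs of the sub-tasks into one result.
        return {name: future.result() for name, future in futures.items()}


# Data parallelism: one algorithm applied to each slice of the dataset.
def chunk(values, n_chunks):
    size = max(1, -(-len(values) // n_chunks))  # ceiling division, at least 1
    return [values[i:i + size] for i in range(0, len(values), size)]


def sum_of_squares(values):
    return sum(v * v for v in values)


def data_parallel_sum_of_squares(values, n_chunks=4):
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(sum_of_squares, chunk(values, n_chunks)))
    # Combine the partial results from every sub-dataset.
    return sum(partials)


task_results = task_parallel(data)
data_result = data_parallel_sum_of_squares(data)
print(task_results)
print(data_result)
```

In CPython, threads do not speed up CPU-bound work like this, so the sketch shows only the structure of each approach. On a cluster, each sub-task or chunk would be sent to a separate node.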