Week4
Query Processing with Map-Reduce
Database queries operate on large-scale data.
Although the database may be large, many queries only retrieve small amounts of data, because, among other reasons, optimizers are so adept (as we have seen) at minimizing I/O.
A paradigmatic query might ask for the balance of one particular bank account.
Such queries are not good applications of map-reduce engines.
A class of queries for which applying map-reduce engines seems well motivated comprises those in extract-transform-load (ETL) systems for data warehousing.
Even for a relatively small operational database, this process can give rise to very processing-intensive queries.
Map-Reduce Semantics for RA Operators
Selection
Selection is captured by the mapper alone; the reducer has no work to do and acts as the identity function.
Define:
map : For each t in the input, if the selection condition C(t) is true, then emit the key-value pair (t, t), i.e., set both key and value to t.
reduce : For each (t, t) from any of many mappers, emit (t, t).
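The selection mapper and reducer above can be sketched in a few lines of Python. This is a minimal single-process illustration, not a distributed engine: `run_mapreduce`, the `accounts` relation, and the condition "balance above 100" are all assumptions made up for the example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy driver (hypothetical): map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


def condition(t):
    # Assumed selection condition C(t): the balance column exceeds 100.
    return t[1] > 100


def select_map(t):
    # Emit (t, t) only for tuples that satisfy C(t).
    if condition(t):
        yield (t, t)


def select_reduce(key, values):
    # Identity reducer: pass every (t, t) pair through unchanged.
    for value in values:
        yield (key, value)


accounts = [("acc1", 50), ("acc2", 500), ("acc3", 1200)]
selected = [key for key, _ in run_mapreduce(accounts, select_map, select_reduce)]
print(selected)
```

Tuples are used as keys because they are hashable, which lets the toy driver group pairs with an ordinary dictionary.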
Projection
Define:
map : For each t in the input, construct from t a tuple t’ containing only the columns in the projection list A and emit the key-value pair (t’, t’), i.e., set both key and value to t’.
reduce : For each key t’ from any of many mappers, there will be one or more (t’, t’) pairs, i.e., the reducer receives input of the form (t’, [t’, t’, … , t’]). Emit exactly one pair (t’, t’) for each key t’.
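As a rough sketch of the projection definition, the Python below keeps one column and lets the reducer collapse duplicates. The driver `run_mapreduce`, the `customers` relation, and the choice of `COLUMNS` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy driver (hypothetical): map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


COLUMNS = (1,)  # Assumed projection list A: keep only column index 1 (the city).


def project_map(t):
    # Build t' from the columns in A and emit (t', t').
    t_prime = tuple(t[i] for i in COLUMNS)
    yield (t_prime, t_prime)


def project_reduce(key, values):
    # values is [t', t', ...]; emit exactly one pair per key to drop duplicates.
    yield (key, key)


customers = [("alice", "london"), ("bob", "paris"), ("carol", "london")]
projected = [key for key, _ in run_mapreduce(customers, project_map, project_reduce)]
print(projected)
```

The grouping step is what removes the duplicate "london" row: both copies arrive under the same key, and the reducer emits that key once.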
Union
Union has a mapper that merely formats its input and a duplicate-eliminating reducer.
Define:
map : For each t in the input, emit the key-value pair (t, t),
i.e., set both key and value to t.
reduce : For each key t from any of many mappers, there will be one or two (t, t) pairs, i.e., the reducer receives input such as (t, [t]) or (t, [t, t]). Emit exactly one pair (t, t) for each key t.
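The union definition can be tried out with the small Python sketch below, which feeds the tuples of both relations through the same mapper. The driver `run_mapreduce` and the sample relations `R` and `S` are assumptions for the example, and each relation is assumed to be duplicate-free on its own.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy driver (hypothetical): map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


def union_map(t):
    # Formatting only: turn each tuple into the pair (t, t).
    yield (t, t)


def union_reduce(key, values):
    # values is [t] or [t, t]; emit the tuple once either way.
    yield (key, key)


R = [(1, "a"), (2, "b")]
S = [(2, "b"), (3, "c")]
union = [key for key, _ in run_mapreduce(R + S, union_map, union_reduce)]
print(union)
```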
Join
For R ⋈ S, where R and S are the input relations with schemas (A, B) and (B, C), respectively.
Define:
map : For each t in the input, if t ∈ R, emit the key-value pair (b, ('R', a)), otherwise emit (b, ('S', c)), where a, b, c are values in the columns A, B, C, respectively.
reduce : For each key b from any of many mappers, the associated value list will contain pairs of the form ('R', a) or ('S', c). For every combination of an ('R', a) and an ('S', c) in that list, emit the joined tuple (a, b, c).
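A hedged Python sketch of the join follows. Each input record is tagged with its relation name so the mapper can tell whether t ∈ R; that tagging, the driver `run_mapreduce`, and the sample data are assumptions for illustration. Pairing every ('R', a) with every ('S', c) under the same key and emitting (a, b, c) is assumed here as the standard way to finish the reduce step.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy driver (hypothetical): map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


def join_map(record):
    # record is (relation_name, tuple); the join column B becomes the key.
    relation, t = record
    if relation == "R":
        a, b = t
        yield (b, ("R", a))
    else:
        b, c = t
        yield (b, ("S", c))


def join_reduce(b, values):
    # Split the value list by origin, then pair every R-value with every S-value.
    r_values = [v for tag, v in values if tag == "R"]
    s_values = [v for tag, v in values if tag == "S"]
    for a in r_values:
        for c in s_values:
            yield (b, (a, b, c))


R = [(1, "x"), (2, "y"), (3, "x")]        # schema (A, B)
S = [("x", 10), ("z", 30), ("x", 20)]     # schema (B, C)
tagged = [("R", t) for t in R] + [("S", t) for t in S]
joined = [row for _, row in run_mapreduce(tagged, join_map, join_reduce)]
print(joined)
```

Keys that appear in only one relation ("y" and "z" here) produce an empty pairing, so they contribute nothing to the result.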
The Map-Reduce Model
Framework
A high-level programming paradigm that allows many data-oriented processes to be written simply.
Processes large data by:
– applying a function to each logical record in the input (map).
– categorizing and combining the intermediate results into summary values (reduce).
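The two steps above can be illustrated with a toy, single-process Python sketch that totals sales per region. The driver `run_mapreduce`, the `sales` data, and the function names are hypothetical; a real engine would distribute the map, grouping, and reduce phases across machines.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy driver (hypothetical): map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


def sales_map(record):
    # Map: apply a function to each logical record, emitting (category, value).
    region, amount = record
    yield (region, amount)


def sales_reduce(region, amounts):
    # Reduce: combine the intermediate values of one category into a summary value.
    yield (region, sum(amounts))


sales = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]
totals = dict(run_mapreduce(sales, sales_map, sales_reduce))
print(totals)
```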
Commodity Clusters
MapReduce is designed to efficiently process large volumes of data by connecting many commodity computers together to work in parallel.
A theoretical 1000-CPU machine would cost a very large amount of money, far more than 1000 single-CPU or 250 quad-core machines.
MapReduce ties smaller and more reasonably priced machines together into a single cost-effective commodity cluster.
SPARK
What is it?
Separate, fast, MapReduce-like engine
– In-memory data storage for very fast iterative queries
– General execution graphs and powerful optimizations
– Up to 40x faster than Hadoop
– Up to 100x faster (2-10x on disk)
Compatible with Hadoop’s storage APIs.
Can read from and write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
SPARQL on Spark