DP-203 - Chapter 14 - Optimizing and Troubleshooting Data Storage and Data Processing
Compacting small files
ADF
strategies
Activity:
Copy data
source
Wildcard file path: <folderpath>/*
Sink
Copy behavior:
Merge files
Incremental load:
update SQL table with small incoming changes
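Conceptually, the ADF "Merge files" copy behavior concatenates many small source files into one large sink file. A minimal Python sketch of that idea (file names and the CSV layout are hypothetical, for illustration only):

```python
# Sketch of small-file compaction: read every small file in a folder
# and write one merged output file. Paths here are temporary/hypothetical.
import pathlib
import tempfile

def compact_files(src_dir: str, dest_file: str) -> int:
    """Concatenate all small .csv files in src_dir into dest_file; return the count merged."""
    parts = sorted(pathlib.Path(src_dir).glob("*.csv"))
    with open(dest_file, "w") as out:
        for p in parts:
            out.write(p.read_text())
    return len(parts)

# demo with three tiny files
tmp = tempfile.mkdtemp()
for i in range(3):
    pathlib.Path(tmp, f"part-{i}.csv").write_text(f"row{i}\n")
merged = pathlib.Path(tmp, "merged.csv")
n = compact_files(tmp, str(merged))
print(n)  # 3 small files merged into one
```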
Synapse
Pipelines
Spark
delta lake
bin packing
OPTIMIZE command
spark.sql("OPTIMIZE delta.`abfss://path/to/delta/files`")
OPTIMIZE by default (when creating the table)
delta.autoOptimize.optimizeWrite = true
delta.autoOptimize.autoCompact = true
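A hedged sketch of setting both properties at table creation time, so writes are optimized and small files are auto-compacted without a manual OPTIMIZE (table name, schema, and path are placeholders):

```sql
-- Enable optimized writes and auto-compaction when creating the Delta table
CREATE TABLE events (id BIGINT, ts TIMESTAMP)
USING DELTA
LOCATION 'abfss://path/to/delta/files'
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);
```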
OSS storage layer on top of Data Lakes
ACID
transactions
Unified batch, interactive, and streaming system
Updates and deletes
support for automated SCD
upsert
...
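Delta Lake upserts are typically expressed with MERGE. A hedged sketch, with illustrative table and column names:

```sql
-- Upsert: update matching rows, insert new ones
MERGE INTO customers AS target
USING updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```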
Spark notebook - pyspark
df.write.mode("overwrite").format("delta").save("abfss://path/to/delta/files")
df = spark.read.format("delta").load("abfss://path/to/delta/files")
spark.sql("CREATE TABLE CUSTOMER USING DELTA LOCATION 'abfss://path/to/delta/files'")
Rewriting user-defined functions (UDFs)
SQL
CREATE FUNCTION
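A hedged T-SQL sketch of a scalar function in a dedicated SQL pool (the function name and logic are illustrative):

```sql
-- Scalar UDF: trim and upper-case a string
CREATE FUNCTION dbo.ToUpperTrim (@s NVARCHAR(100))
RETURNS NVARCHAR(100)
AS
BEGIN
    RETURN UPPER(LTRIM(RTRIM(@s)));
END;
```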
Spark
val fndouble = udf((s: Long) => 2*s)
Stream Analytics
register UDF
Handling skews in data
at storage level
select other partition key
smaller chunks
add second partition key
Opt for round-robin
at compute level
enable statistics
ignore outlier data
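Before choosing a different partition key, it helps to quantify the skew. A minimal sketch with synthetic keys (in practice you would GROUP BY the candidate key); skew is measured as the largest key's row count over the fair share per key:

```python
# Measure key skew: ratio of the biggest partition to the average partition.
# Synthetic data: one hot key ('US') dominates.
from collections import Counter

rows = ["US"] * 90 + ["DE"] * 5 + ["FR"] * 5
counts = Counter(rows)
fair_share = len(rows) / len(counts)       # 100 rows / 3 keys
skew = max(counts.values()) / fair_share   # 90 / 33.3
print(round(skew, 2))  # 2.7: 'US' holds ~3x its fair share
```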
Handling data spills
causes
data partition
too big
memory
too small
data explosion
as result of joins, unions, etc.
solutions
increase memory
reduce data partition size
remove data skews
identifying data spills
in synapse SQL
TempDB runs out of space
Monitoring Queries:
memory usage
TempDB usage
in Spark
spills are published in task summary
Tuning shuffle partitions
Spark
spark.conf.set("spark.sql.shuffle.partitions", 200)
Finding shuffling in a pipeline
Synapse SQL
EXPLAIN
WITH_RECOMMENDATIONS
SELECT ....
Query plan
look for 'SHUFFLE MOVE'
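A hedged sketch of a complete EXPLAIN request against a dedicated SQL pool (the query and table names are illustrative); the returned plan is where to look for SHUFFLE_MOVE operations:

```sql
-- Request the query plan plus distribution recommendations
EXPLAIN WITH_RECOMMENDATIONS
SELECT c.region, COUNT(*)
FROM dbo.customers AS c
JOIN dbo.orders AS o ON c.customer_id = o.customer_id
GROUP BY c.region;
```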
Spark
DataFrame.explain()
Physical plan
look for 'Exchange'
Spark DAG
look for 'Exchange' stages
Optimizing resource management
SQL pools
pause SQL pool when not in use
size of compute units (DWU)
scale-out/-in
manually
automatically (using Azure functions)
Spark
auto-scale
auto-terminate
spot-instances
type of cluster
memory-intensive
CPU-intensive
network-intensive
Tuning queries by using indexers
Synapse SQL
Clustered columnstore index
Clustered index
(+ non-clustered)
heap
Spark
Hyperspace
Tuning queries by using cache
Synapse SQL
enable cache
at db level
for a session
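A hedged sketch of both scopes for result set caching in a dedicated SQL pool (the pool name is a placeholder; the database-level statement is run against master):

```sql
-- Database level: turn result set caching on for the pool
ALTER DATABASE MySqlPool SET RESULT_SET_CACHING ON;

-- Session level: toggle it for the current session only
SET RESULT_SET_CACHING ON;
```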
Spark
df.cache()
Optimizing pipelines for analytical or transactional purposes
Hybrid Transactional/Analytical Processing (HTAP)
CosmosDB
Synapse Link
Synapse SQL
Linked Service: CosmosDB
Optimizing pipelines for descriptive versus analytical workloads
Troubleshooting a failed Spark job
Troubleshooting a failed pipeline run
:check:better to merge small files into bigger ones
skew
Uneven distribution of data
data spill
data does not fit in memory and is written to disk
:check:200: tune it to a number that suits your query
:!:how to reduce your billing expenses while using Azure analytic services