Data Engineering on Microsoft Azure — Question 118
You have an Azure subscription that contains an Azure Data Lake Storage Gen2 container named Container1 and an Azure Synapse Analytics workspace named Workspace1.
Workspace1 contains multiple Apache Spark jobs that reference a large dataset in Container1.
You need to optimize the run times of the jobs.
What should you do?
Answer options
- A. For Container1, disable hierarchical namespaces.
- B. Cache the dataset.
- C. Increase the spark.sql.autoBroadcastJoinThreshold value.
- D. Use Resilient Distributed Datasets (RDDs).
Correct answer: B
Explanation
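Caching the dataset (option B) is correct. Several Spark jobs reference the same large dataset, and without caching each one can re-read it from Container1 and recompute any shared transformations. Caching keeps the data in executor memory after the first read, spilling to local disk when it does not fit, so later jobs avoid the repeated remote storage access. Disabling the hierarchical namespace (option A) does not speed up the jobs; the hierarchical namespace is what gives Data Lake Storage Gen2 efficient directory-level operations for analytics workloads. Increasing spark.sql.autoBroadcastJoinThreshold (option C) only helps joins in which one side is small enough to broadcast, so it does not address repeated reads of a large dataset. Using RDDs (option D) bypasses the Catalyst optimizer that DataFrames and Datasets rely on, so in most scenarios it is slower rather than faster.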
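As a rough illustration of option B, the following PySpark sketch reads a dataset once, caches it, and reuses it across several actions. It is a minimal sketch under stated assumptions, not the exam's reference solution: the function name, the Parquet format, the abfss path, and the column names (region, customer_id) are all hypothetical, and `spark` is assumed to be the SparkSession that a Synapse notebook already provides.

```python
# Minimal PySpark sketch of caching a shared dataset.
# Assumptions (not from the question): the data is Parquet, and the
# columns "region" and "customer_id" exist. All names are illustrative.

def run_jobs_on_cached_dataset(spark, path):
    """Read the dataset once, cache it, and reuse it across several actions."""
    df = spark.read.parquet(path)  # lazy: nothing is read yet

    df.cache()                     # marks the DataFrame for caching (also lazy)
    total_rows = df.count()        # first action reads the data and fills the cache

    # Later actions can be served from the cached partitions instead of
    # going back to the storage container.
    per_region = df.groupBy("region").count().collect()
    distinct_customers = df.select("customer_id").distinct().count()

    df.unpersist()                 # release executor memory when finished
    return total_rows, per_region, distinct_customers


# Hypothetical usage in a Synapse notebook, where `spark` is predefined:
# run_jobs_on_cached_dataset(
#     spark, "abfss://container1@<storage-account>.dfs.core.windows.net/dataset/"
# )
```

Note that cache() is lazy: the data is only materialized by the first action, which is why the sketch calls count() before the other jobs. Calling unpersist() afterwards frees the memory for other workloads.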