Data Engineering on Microsoft Azure — Question 118

You have an Azure subscription that contains an Azure Data Lake Storage Gen2 container named Container1 and an Azure Synapse Analytics workspace named Workspace1.

Workspace1 contains multiple Apache Spark jobs that reference a large dataset in Container1.

You need to optimize the run times of the jobs.

What should you do?

Answer options

Correct answer: B

Explanation

Caching the dataset (option B) significantly improves job performance by storing data in memory, reducing the need for repeated disk access. Disabling hierarchical namespaces (option A) does not directly affect job performance, while increasing the spark.sql.autoBroadcastJoinThreshold (option C) may help with specific joins but is not as effective as caching for overall job execution. Using RDDs (option D) is less efficient than using DataFrames or Datasets in Spark for most scenarios.