Data Engineering on Microsoft Azure — Question 49
You have an Azure Databricks workspace and an Azure Data Lake Storage Gen2 account named storage1.
New files are uploaded daily to storage1.
You need to recommend a solution that configures storage1 as a structured streaming source. The solution must meet the following requirements:
• Incrementally process new files as they are uploaded to storage1.
• Minimize implementation and maintenance effort.
• Minimize the cost of processing millions of files.
• Support schema inference and schema drift.
Which should you include in the recommendation?
Answer options
- A. COPY INTO
- B. Azure Data Factory
- C. Auto Loader
- D. Apache Spark FileStreamSource
Correct answer: C
Explanation
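The correct answer is C, Auto Loader. Auto Loader provides a Structured Streaming source named cloudFiles that incrementally picks up new files as they arrive in storage1 and records which files it has already ingested in the stream checkpoint, so each file is processed once. It supports schema inference and schema evolution, which covers the schema drift requirement. For very large numbers of files it can run in file notification mode, which discovers new files from storage events instead of repeatedly listing the directory, and this keeps cost down when there are millions of files. Option A, COPY INTO, is an idempotent SQL batch command rather than a Structured Streaming source, so it does not satisfy the streaming source requirement. Option D, the built-in Apache Spark file source (FileStreamSource), is a streaming source, but it finds new files by listing the input directory on every micro-batch, which becomes slow and costly at this scale, and by default it expects a schema to be supplied up front and does not handle schema drift. Option B, Azure Data Factory, is an orchestration service rather than a Structured Streaming source, and it would add pipelines to build and maintain outside the Databricks workspace.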
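To make the recommendation concrete, the following is a minimal sketch of how an Auto Loader stream over storage1 might be configured in a Databricks notebook. The container name, paths, target table name, and the helper functions auto_loader_options and start_ingest are illustrative assumptions, not part of the question. The cloudFiles options shown are standard Auto Loader settings, but the stream itself only runs on a Databricks cluster where the cloudFiles source is available.

```python
from typing import Dict

# Hypothetical locations: the container name "landing", the folder layout,
# and the table name are assumptions made for this sketch.
SOURCE_PATH = "abfss://landing@storage1.dfs.core.windows.net/incoming/"
CHECKPOINT_PATH = "abfss://landing@storage1.dfs.core.windows.net/_checkpoints/incoming/"
TARGET_TABLE = "bronze_incoming"


def auto_loader_options(schema_location: str,
                        file_format: str = "json",
                        use_notifications: bool = False) -> Dict[str, str]:
    """Build the cloudFiles options for an Auto Loader stream (hypothetical helper)."""
    options = {
        # Format of the files being uploaded.
        "cloudFiles.format": file_format,
        # Where Auto Loader stores the inferred schema and its later versions.
        "cloudFiles.schemaLocation": schema_location,
        # Infer real column types instead of reading everything as strings.
        "cloudFiles.inferColumnTypes": "true",
        # Schema drift: add newly seen columns to the tracked schema.
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }
    if use_notifications:
        # File notification mode avoids repeated directory listing at very
        # large scale. It needs extra Azure permissions and credentials
        # that are not shown in this sketch.
        options["cloudFiles.useNotifications"] = "true"
    return options


def start_ingest(spark, source_path: str, checkpoint_path: str, target_table: str):
    """Start an incremental ingest; `spark` is the notebook's SparkSession."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options(schema_location=checkpoint_path))
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        # Let the Delta target accept columns added by schema evolution.
        .option("mergeSchema", "true")
        # Process everything that is new, then stop (suits daily uploads).
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

In a notebook, the stream could be started with start_ingest(spark, SOURCE_PATH, CHECKPOINT_PATH, TARGET_TABLE). Using the availableNow trigger on a daily job is one reasonable choice for files that arrive once a day, since the cluster does not need to run continuously; a continuously running stream would work with the same source configuration.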