Databricks Certified Data Engineer Professional — Question 73
A large company seeks to implement a near real-time solution involving hundreds of pipelines with parallel updates of many tables with extremely high volume and high velocity data.
Which of the following solutions would you implement to achieve this requirement?
Answer options
- A. Use Databricks High Concurrency clusters, which leverage optimized cloud storage connections to maximize data throughput.
- B. Partition ingestion tables by a small time duration to allow for many data files to be written in parallel.
- C. Configure Databricks to save all data to attached SSD volumes instead of object storage, increasing file I/O significantly.
- D. Isolate Delta Lake tables in their own storage containers to avoid API limits imposed by cloud vendors.
- E. Store all tables in a single database to ensure that the Databricks Catalyst Metastore can load balance overall throughput.
Correct answer: B
Explanation
The correct answer is B because partitioning ingestion tables by a small time duration allows for parallel writing of many data files, which is essential for handling high volume and velocity data. Options A, C, D, and E do not specifically address the requirement for parallel updates of multiple tables and may not effectively optimize for the high data throughput needed in this scenario.