Databricks Certified Data Engineer Professional — Question 54
All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:
key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG
There are 5 unique topics being ingested. Only the "registration" topic contains Personally Identifiable Information (PII). The company wishes to restrict access to PII. It also wishes to retain records containing PII in this table for only 14 days after initial ingestion, while retaining non-PII records indefinitely.
Which of the following solutions meets the requirements?
Answer options
- A. All data should be deleted biweekly; Delta Lake's time travel functionality should be leveraged to maintain a history of non-PII information.
- B. Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory.
- C. Because the value field is stored as binary data, this information is not considered PII and no special precautions should be taken.
- D. Separate object storage containers should be specified based on the partition field, allowing isolation at the storage level.
- E. Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries.
Correct answer: E
Explanation
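Option E is correct because partitioning by the topic field typically writes each topic's records to its own directory (for example, topic=registration), so the PII records are physically separated from those of the other four topics. Access controls can then be applied to that directory, and a delete statement filtered on topic = 'registration' with a 14-day cutoff touches only that partition, leaving non-PII records in place indefinitely. Option A fails because deleting all data biweekly violates the indefinite-retention requirement for non-PII records, and time travel is not a durable archive: older table versions become unavailable once their data files are vacuumed. Option B refers to a registration field that does not exist in the schema. Option D isolates data by the Kafka partition number, which has no relationship to which topic carries PII. Option C incorrectly assumes that binary data cannot contain PII; binary is only the encoding of the Kafka payload.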
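The approach in option E can be sketched as two SQL statements: one that creates the table partitioned by topic, and one that deletes only expired records from the "registration" partition. The Python below merely builds those statements as strings so the retention arithmetic is easy to check. The table name kafka_events and the helper names are hypothetical, and the sketch assumes the timestamp column holds epoch milliseconds and is close enough to ingestion time to drive the 14-day rule; neither assumption is stated in the question.

```python
from datetime import datetime, timedelta, timezone

TABLE = "kafka_events"      # hypothetical table name, not given in the question
PII_TOPIC = "registration"  # the only topic containing PII
RETENTION_DAYS = 14         # retention window for PII records

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def create_table_sql(table: str = TABLE) -> str:
    """Build DDL for a Delta table partitioned by the topic column."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "`key` BINARY, `value` BINARY, `topic` STRING, "
        "`partition` BIGINT, `offset` BIGINT, `timestamp` BIGINT"
        ") USING DELTA PARTITIONED BY (`topic`)"
    )


def cutoff_millis(now: datetime, days: int = RETENTION_DAYS) -> int:
    """Return the epoch-millisecond cutoff; `now` must be timezone-aware.

    Assumption: the `timestamp` column stores epoch milliseconds.
    """
    return (now - timedelta(days=days) - EPOCH) // timedelta(milliseconds=1)


def pii_delete_sql(
    now: datetime,
    table: str = TABLE,
    topic: str = PII_TOPIC,
    days: int = RETENTION_DAYS,
) -> str:
    """Build a DELETE that targets only expired rows in the PII partition."""
    return (
        f"DELETE FROM {table} "
        f"WHERE `topic` = '{topic}' AND `timestamp` < {cutoff_millis(now, days)}"
    )


if __name__ == "__main__":
    print(create_table_sql())
    print(pii_delete_sql(datetime.now(timezone.utc)))
```

On a cluster, each string could be passed to spark.sql(...) from a scheduled job. Because the delete predicate includes the partition column, records in the other four topics are never candidates for deletion. Note that a Delta delete only removes files from the current table version; the underlying data files remain readable through time travel until they are vacuumed, so a real retention job would likely need a VACUUM step as well.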