Google Cloud Professional Data Engineer — Question 292
You are building a real-time prediction engine that streams files, which may contain personally identifiable information (PII), into Cloud Storage and eventually into BigQuery. You want to ensure that the sensitive data is masked but still maintains referential integrity, because names and emails are often used as join keys.
How should you use the Cloud Data Loss Prevention API (DLP API) to ensure that the PII data is not accessible by unauthorized individuals?
Answer options
- A. Create a pseudonym by replacing the PII data with cryptographic tokens, and store the non-tokenized data in a locked-down bucket.
- B. Redact all PII data, and store a version of the unredacted data in a locked-down bucket.
- C. Scan every table in BigQuery, and mask the data it finds that has PII.
- D. Create a pseudonym by replacing PII data with a cryptographic format-preserving token.
Correct answer: D
Explanation
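The correct answer is D. A cryptographic format-preserving token replaces each PII value with a surrogate of the same length and character set, and the same input always maps to the same token under the same key. The raw names and emails are therefore masked, yet the tokens can still serve as join keys, which preserves referential integrity. Option A is incorrect because it keeps a non-tokenized copy of the PII; even in a locked-down bucket, that copy is sensitive data at rest and remains a security risk. Option B does not maintain referential integrity: once every PII value is redacted, the distinct values needed as join keys are gone. Option C is not comprehensive, because scanning tables in BigQuery acts only after the data has landed and does nothing to protect PII while files are streamed into Cloud Storage.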
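The difference between options B and D can be illustrated with a small, self-contained sketch. It does not call the DLP API. As a stand-in, it uses a keyed HMAC from the Python standard library to produce deterministic tokens, which shows the referential-integrity property but is neither reversible nor format-preserving the way DLP's format-preserving encryption is. The key, the sample rows, and the helper names (`pseudonymize`, `redact`, `join_on_email`) are all hypothetical and exist only for illustration.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real pipeline would keep
# the key in a key management service, not in source code.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Deterministic token: the same input and key always give the same token."""
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def redact(value: str) -> str:
    """Redaction: every value collapses to the same placeholder."""
    return "[REDACTED]"


# Hypothetical sample tables that share an email join key.
users = [
    {"email": "ana@example.com", "plan": "pro"},
    {"email": "bo@example.com", "plan": "free"},
]
events = [
    {"email": "ana@example.com", "action": "login"},
    {"email": "bo@example.com", "action": "purchase"},
    {"email": "ana@example.com", "action": "logout"},
]


def join_on_email(left, right, mask):
    """Mask the email in both tables, then inner-join on the masked value."""
    plans = {}
    for row in left:
        plans.setdefault(mask(row["email"]), []).append(row["plan"])
    return [
        (mask(event["email"]), plan, event["action"])
        for event in right
        for plan in plans.get(mask(event["email"]), [])
    ]


tokenized_join = join_on_email(users, events, pseudonymize)
redacted_join = join_on_email(users, events, redact)

print(len(tokenized_join))  # one row per event: the join keys survived masking
print(len(redacted_join))   # every event matches every user: the keys collapsed
```

With deterministic tokens, each event still matches exactly one user. With redaction, all emails become the same placeholder, so the join can no longer tell users apart.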