SnowPro Advanced: Data Engineer — Question 49
A company is using Snowpipe to bring in millions of rows every day of Change Data Capture (CDC) into a Snowflake staging table on a real-time basis. The CDC needs to get processed and combined with other data in Snowflake and land in a final table as part of the full data pipeline.
How can a Data Engineer MOST efficiently process the incoming CDC on an ongoing basis?
Answer options
- A. Create a stream on the staging table and schedule a task that transforms data from the stream, only when the stream has data.
- B. Transform the data during the data load with Snowpipe by modifying the related COPY INTO statement to include transformation steps such as CASE statements and JOINS.
- C. Schedule a task that dynamically retrieves the last time the task was run from information_schema.task_history and use that timestamp to process the delta of the new rows since the last time the task was run.
- D. Use a CREATE OR REPLACE TABLE AS statement that references the staging table and includes all the transformation SQL. Use a task to run the full CREATE OR REPLACE TABLE AS statement on a scheduled basis.
Correct answer: A
Explanation
Option A is the most efficient because it allows for real-time processing of changes as they occur, utilizing Snowflake's stream functionality. Options B and D would require processing the entire dataset each time, which is less efficient. Option C, while it processes delta data, may not capture changes in real time and could lead to delays in data availability.