You are analyzing customer purchases in a Fabric notebook by using PySpark. You have the…

Question

You are analyzing customer purchases in a Fabric notebook by using PySpark.
You have the following DataFrames:
transactions: Contains five columns named transaction_id, customer_id, product_id, amount, and date, and has 10 million rows, with each row representing a transaction.
customers: Contains customer details in 1,000 rows and three columns named customer_id, name, and country.
You need to join the DataFrames on the customer_id column. The solution must minimize data shuffling.
You write the following code.
from pyspark.sql import functions as F
results =
Which code should you run to populate the results DataFrame?

Accepted Answer

Correct answer: A. transactions.join(F.broadcast(customers), transactions.customer_id == customers.customer_id)
F.broadcast(customers) marks the small customers DataFrame (1,000 rows) to be broadcast to every executor, so the 10 million-row transactions DataFrame is joined where it already sits instead of being shuffled across the cluster. Options B and C do not use broadcasting, which could lead to increased shuffling, while D performs a cross join, which produces a much larger intermediate DataFrame and is inefficient and unnecessary.
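
To make the shuffling argument concrete, here is a minimal plain-Python sketch of the idea behind a broadcast hash join. It is an illustration only, not Spark's implementation: the function name broadcast_hash_join, the sample rows, and the two-partition layout are all hypothetical. The small side is turned into a lookup table once, and each partition of the large side is joined against that lookup locally, so no transaction row has to move to another partition.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join every partition of the large side against a full copy of the small side."""
    # Build the lookup once from the small side. In a cluster, a copy of this
    # table is what would be sent ("broadcast") to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition is processed on its own: rows never leave it.
        joined = []
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical sample data standing in for the two DataFrames.
customers = [
    {"customer_id": 1, "name": "Ana", "country": "PT"},
    {"customer_id": 2, "name": "Ben", "country": "US"},
]
transactions = [
    [  # partition 0
        {"transaction_id": 10, "customer_id": 1, "product_id": 7, "amount": 20.0, "date": "2024-01-01"},
        {"transaction_id": 11, "customer_id": 2, "product_id": 8, "amount": 35.5, "date": "2024-01-02"},
    ],
    [  # partition 1
        {"transaction_id": 12, "customer_id": 1, "product_id": 9, "amount": 12.0, "date": "2024-01-03"},
        # customer_id 3 has no match, so an inner join drops this row.
        {"transaction_id": 13, "customer_id": 3, "product_id": 7, "amount": 99.0, "date": "2024-01-04"},
    ],
]

result = broadcast_hash_join(transactions, customers, "customer_id")
for index, partition in enumerate(result):
    print(index, [row["transaction_id"] for row in partition])
```

The output keeps the same partition layout as the input, which is the point: only the 1,000-row side is copied around, while the 10 million-row side stays put. Without the broadcast, a join may have to repartition both sides by customer_id so that matching keys land together, and that repartitioning is the shuffle the question asks you to minimize.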

Implementing Analytics Solutions Using Microsoft Fabric — Question 41
