Implementing Analytics Solutions Using Microsoft Fabric — Question 41
You are analyzing customer purchases in a Fabric notebook by using PySpark.
You have the following DataFrames:
- transactions: Contains five columns named transaction_id, customer_id, product_id, amount, and date and has 10 million rows, with each row representing a transaction.
- customers: Contains customer details in 1,000 rows and three columns named customer_id, name, and country.
You need to join the DataFrames on the customer_id column. The solution must minimize data shuffling.
You write the following code.
from pyspark.sql import functions as F
results =
Which code should you run to populate the results DataFrame?
Answer options
- A. transactions.join(F.broadcast(customers), transactions.customer_id == customers.customer_id)
- B. transactions.join(customers, transactions.customer_id == customers.customer_id).distinct()
- C. transactions.join(customers, transactions.customer_id == customers.customer_id)
- D. transactions.crossJoin(customers).where(transactions.customer_id == customers.customer_id)
Correct answer: A
Explanation
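The correct answer is A. F.broadcast(customers) marks the small customers DataFrame (1,000 rows) with a broadcast hint, so Spark can send a full copy of it to every executor and join each partition of transactions locally, without shuffling the 10 million transaction rows by customer_id. Option C carries no hint, so Spark decides on its own: it broadcasts customers only if its size estimate falls under spark.sql.autoBroadcastJoinThreshold, and otherwise it may choose a shuffle-based join. Option B has the same problem and adds distinct(), which typically triggers an additional shuffle. Option D expresses the join as a cross join followed by a filter, which logically pairs every transaction with every customer (10 million × 1,000 combinations) before filtering; it is unnecessary and does nothing to request a broadcast.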
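To make the idea behind option A concrete, here is a minimal plain-Python sketch of a broadcast hash join. It is only an illustration of the concept, not Spark's implementation: the function name broadcast_hash_join, the two-partition layout, and the sample rows are all hypothetical. The small table is turned into a lookup once, and each partition of the large table is joined in place.

```python
from collections import defaultdict


def broadcast_hash_join(big_partitions, small_rows, key):
    """Join every partition of a large table against a full copy of a small table.

    Hypothetical helper for illustration only; it mimics the idea of a
    broadcast join and is not how Spark is implemented internally.
    """
    # Build the lookup once. On a cluster, each executor would receive a copy.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined_partitions = []
    for partition in big_partitions:
        # Rows are matched inside the partition they already belong to,
        # so no row of the large table moves to another partition.
        joined = []
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
        joined_partitions.append(joined)
    return joined_partitions


# Made-up sample data standing in for the two DataFrames.
customers = [
    {"customer_id": 1, "name": "Ada", "country": "UK"},
    {"customer_id": 2, "name": "Lin", "country": "SG"},
]
transactions_partitions = [
    [
        {"transaction_id": 10, "customer_id": 1, "amount": 5.0},
        {"transaction_id": 11, "customer_id": 2, "amount": 7.5},
    ],
    [
        {"transaction_id": 12, "customer_id": 1, "amount": 2.0},
        {"transaction_id": 13, "customer_id": 3, "amount": 9.0},  # no matching customer
    ],
]

results = broadcast_hash_join(transactions_partitions, customers, "customer_id")
for index, partition in enumerate(results):
    print(index, partition)
```

The unmatched transaction is dropped, as in an inner join, and the output keeps the same partition layout as the input. That unchanged layout is the point of the technique: only the small table is copied around, while the large one stays where it is.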