Free Microsoft DP-203 Practice Test Questions MCQs
Stop wondering if you're ready. Our Microsoft DP-203 practice test is designed to identify your exact knowledge gaps. Validate your skills with Data Engineering on Microsoft Azure questions that mirror the real exam's format and difficulty. Build a personalized study plan based on your performance on these free DP-203 exam MCQs, focusing your effort where it matters most.
Targeted practice like this helps candidates feel significantly more prepared for Data Engineering on Microsoft Azure exam day.
22100+ already prepared
Updated On: 3-Mar-2026 | 210 Questions
Data Engineering on Microsoft Azure
4.9/5.0
Topic 1, Contoso Case Study
Transactional Data
Contoso has three years of customer, transactional, operational, sourcing, and supplier data comprising 10 billion records stored across multiple on-premises Microsoft SQL Server servers. The SQL Server instances contain data from various operational systems. The data is loaded into the instances by using SQL Server Integration Services (SSIS) packages.
You estimate that combining all product sales transactions into a company-wide sales
transactions dataset will result in a single table that contains 5 billion rows, with one row
per transaction.
Most queries targeting the sales transactions data will be used to identify which products were sold in retail stores and which products were sold online during different time periods.
Sales transaction data that is older than three years will be removed monthly.
You plan to create a retail store table that will contain the address of each retail store. The
table will be approximately 2 MB. Queries for retail store sales will include the retail store
addresses.
You plan to create a promotional table that will contain a promotion ID. The promotion ID will be associated with a specific product. The product will be identified by a product ID. The table will be approximately 5 GB.
Streaming Twitter Data
The ecommerce department at Contoso develops an Azure logic app that captures trending Twitter feeds referencing the company's products and pushes the products to Azure Event Hubs.
Planned Changes
Contoso plans to implement the following changes:
* Load the sales transaction dataset to Azure Synapse Analytics.
* Integrate on-premises data stores with Azure Synapse Analytics by using SSIS packages.
* Use Azure Synapse Analytics to analyze Twitter feeds to assess customer sentiments
about products.
Sales Transaction Dataset Requirements
Contoso identifies the following requirements for the sales transaction dataset:
• Partition data that contains sales transaction records. Partitions must be designed to provide efficient loads by month. Boundary values must belong to the partition on the right.
• Ensure that queries joining and filtering sales transaction records based on product ID
complete as quickly as possible.
• Implement a surrogate key to account for changes to the retail store addresses.
• Ensure that data storage costs and performance are predictable.
• Minimize how long it takes to remove old records.
Customer Sentiment Analytics Requirement
Contoso identifies the following requirements for customer sentiment analytics:
• Allow Contoso users to use PolyBase in an Azure Synapse Analytics dedicated SQL pool to query the content of the data records that host the Twitter feeds. Data must be protected by using row-level security (RLS). The users must be authenticated by using their own Azure AD credentials.
• Maximize the throughput of ingesting Twitter feeds from Event Hubs to Azure Storage
without purchasing additional throughput or capacity units.
• Store Twitter feeds in Azure Storage by using Event Hubs Capture. The feeds will be
converted into Parquet files.
• Ensure that the data store supports Azure AD-based access control down to the object
level.
• Minimize administrative effort to maintain the Twitter feed data records.
• Purge Twitter feed data records that are older than two years.
Data Integration Requirements
Contoso identifies the following requirements for data integration:
Use an Azure service that leverages the existing SSIS packages to ingest on-premises
data into datasets stored in a dedicated SQL pool of Azure Synapse Analytics and transform
the data.
Identify a process to ensure that changes to the ingestion and transformation activities can
be version controlled and developed independently by multiple data engineers.
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You are designing an Azure Stream Analytics solution that will analyze Twitter data.
You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.
Solution: You use a tumbling window, and you set the window size to 10 seconds.
Does this meet the goal?
A. Yes
B. No
Explanation:
Azure Stream Analytics provides various windowing functions to group data streams into temporal segments. For counting tweets in fixed time intervals where each event must belong to exactly one window, the windowing choice is critical. Tumbling windows are a series of fixed-size, non-overlapping, and contiguous time intervals that effectively partition the stream into segments where each event falls into exactly one window, preventing double-counting across windows.
Correct Option:
A. Yes
The solution correctly uses a tumbling window with a 10-second size. Tumbling windows are designed specifically for scenarios requiring non-overlapping time intervals. Each tweet will be assigned to exactly one 10-second window based on its timestamp, ensuring accurate counting without duplication. This window function guarantees that every event is processed once per window segment, meeting both the temporal grouping requirement and the single-count constraint specified in the goal.
Incorrect Option:
B. No
This option is incorrect because the tumbling window perfectly satisfies the requirements. Other window types like hopping windows with non-zero hop size could cause overlapping intervals and potential double-counting, while sliding windows might generate continuous outputs. The tumbling window's non-overlapping nature ensures each tweet contributes to exactly one count within its 10-second window, making it the appropriate choice for this scenario.
Reference:
Azure Stream Analytics Windowing Functions Documentation
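The non-overlapping, exactly-once property of a tumbling window can be illustrated with a minimal Python sketch (a simulation of the windowing semantics, not the Stream Analytics engine itself; the timestamps and window size are illustrative):

```python
from collections import Counter

def tumbling_window_counts(events, window_size=10):
    """Assign each event timestamp to exactly one non-overlapping,
    fixed-size window and count events per window."""
    counts = Counter()
    for ts in events:
        # Floor each timestamp to its window's start boundary,
        # so every event lands in exactly one window.
        window_start = (ts // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

# Tweets arriving at various second offsets
tweets = [1, 4, 9, 10, 12, 19, 20, 25]
print(tumbling_window_counts(tweets))  # {0: 3, 10: 3, 20: 2}
```

Note that the per-window counts sum to the total number of events, which is exactly the "counted only once" guarantee the question requires; a hopping or sliding window would count some events in more than one window.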
You are designing the folder structure for an Azure Data Lake Storage Gen2 account.
You identify the following usage patterns:
• Users will query data by using Azure Synapse Analytics serverless SQL pools and Azure Synapse Analytics serverless Apache Spark pools.
• Most queries will include a filter on the current year or week.
• Data will be secured by data source.
You need to recommend a folder structure that meets the following requirements:
• Supports the usage patterns
• Simplifies folder security
• Minimizes query times
Which folder structure should you recommend?
A. Option A
B. Option B
C. Option C
D. Option D
E. Option E
Explanation:
When designing folder structures in Azure Data Lake Storage Gen2 for use with Synapse serverless SQL and Spark, partitioning strategy is critical for query performance and security management. The structure should align with common query filters (year/week) and enable easy security implementation at appropriate levels. The goal is to minimize data scanning during queries and simplify permission management based on data source requirements.
Correct Option:
D. Option D
/
This structure prioritizes data source at the root level, which simplifies security by allowing permissions to be set at the data source folder level. The year and week partitions align perfectly with the common query filters, enabling partition elimination in serverless SQL and Spark. This minimizes query times by reducing data scanning to relevant partitions only, meeting all requirements efficiently.
Incorrect Options:
A. Option A
/Year=
This structure buries data source at a lower level, making security implementation complex as permissions would need to be set on multiple subfolders across different year/week combinations. While it supports year/week filtering, the nested data source folders create security management overhead.
B. Option B
/Year=
This structure places year at the root and data source at the next level, partially supporting security but requiring permissions to be set under each year folder. This duplicates security configuration across multiple year folders, increasing management complexity.
C. Option C
/
While this structure supports data source security at root level, it uses month partitioning instead of week. Since most queries filter on week, this would cause additional data scanning across entire months to filter weeks, increasing query times.
E. Option E
/
This structure over-partitions with month and day levels. While it supports data source security, it does not align with the week-based query pattern. Queries filtering on week would need to scan multiple day partitions, leading to inefficient query performance.
Reference:
Azure Data Lake Storage Gen2 Partitioning Best Practices
Synapse Serverless SQL Query Optimization on Data Lake
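The partition-elimination benefit of a data-source-first folder layout can be sketched in Python. The paths and folder names below are hypothetical, assuming a /{DataSource}/{Year}/{Week}/ layout like the one the correct option describes:

```python
# Hypothetical object paths following a /{DataSource}/{Year}/{Week}/ layout
paths = [
    "sales/2024/01/part-000.parquet",
    "sales/2024/02/part-000.parquet",
    "inventory/2024/01/part-000.parquet",
    "sales/2023/52/part-000.parquet",
]

def prune(paths, source=None, year=None, week=None):
    """Keep only paths whose folder segments match the filter values,
    mimicking partition elimination for a year/week query filter."""
    kept = []
    for p in paths:
        src, yr, wk, _ = p.split("/")
        if source is not None and src != source:
            continue
        if year is not None and yr != year:
            continue
        if week is not None and wk != week:
            continue
        kept.append(p)
    return kept

print(prune(paths, source="sales", year="2024", week="01"))
# ['sales/2024/01/part-000.parquet']
```

Because the data source is the top-level folder, a single ACL on `sales/` secures all of its years and weeks, while year/week subfolders let the query engine skip files outside the filtered range.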
You have an Azure Databricks resource.
You need to log actions that relate to changes in compute for the Databricks resource.
Which Databricks services should you log?
A. clusters
B. workspace
C. DBFS
D. SSH
E. lobs
Explanation:
Azure Databricks integrates with Azure Monitor to provide logging capabilities for different services within the Databricks resource. When monitoring changes related to compute resources, specific services capture different types of operational activities. Understanding which service logs compute-related changes is essential for proper audit and monitoring configuration.
Correct Option:
B. workspace
The workspace service in Azure Databricks logs control plane operations including compute-related changes. When you create, modify, or delete clusters, resize existing clusters, or change autoscaling configurations, these actions are captured in the workspace diagnostic logs. This service provides comprehensive auditing of all compute resource management activities.
Incorrect Options:
A. clusters
While clusters are the actual compute resources being modified, there is no separate "clusters" logging service in Azure Databricks. Cluster-related actions are logged under the workspace service category. This option might cause confusion as it seems logical but is not a valid diagnostic log category.
C. DBFS
Databricks File System (DBFS) logs capture actions related to file operations, mount points, and storage interactions. These logs do not contain information about compute resource changes. They are focused on data plane operations within the Databricks environment.
D. SSH
SSH logging captures secure shell connection attempts and activities to Databricks clusters. While this relates to compute resources indirectly, it logs connection events rather than actual changes to compute configurations, cluster creation, or modifications.
E. lobs
This appears to be a typo or invalid option. The correct term would be "jobs" but even then, jobs logs capture notebook execution and job run activities, not compute resource changes. Job logs track workload execution rather than infrastructure modifications.
Reference:
Azure Databricks Diagnostic Logging Documentation
Monitor Azure Databricks with Azure Monitor
You have an Azure Data Lake Storage account that contains a staging zone.
You need to design a daily process to ingest incremental data from the staging zone, transform the data by executing an R script, and then insert the transformed data into a data warehouse in Azure Synapse Analytics.
Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that executes an Azure Databricks notebook, and then inserts the data into the data warehouse.
Does this meet the goal?
A. Yes
B. No
Explanation:
Azure Data Factory provides comprehensive orchestration capabilities for data integration workflows. The scenario requires a daily incremental ingestion process with R-based transformation and loading into Synapse Analytics. The solution must effectively coordinate these different components while supporting the R script execution requirement within the Azure ecosystem.
Correct Option:
A. Yes
The solution correctly uses Azure Data Factory with a schedule trigger to orchestrate the daily process. Azure Databricks notebooks support R execution through R kernels, making them suitable for running the required R script. The pipeline can then load transformed data into Synapse Analytics using copy activities or native connectors. This creates an end-to-end automated solution that meets all requirements.
Incorrect Option:
B. No
This option is incorrect because the proposed solution fully addresses all requirements. Azure Data Factory provides the scheduling and orchestration, Azure Databricks supports R execution through notebooks, and the integration with Synapse Analytics is well-established. The combination of these services creates a robust incremental processing solution that meets the daily ingestion, R transformation, and data warehouse loading needs.
Reference:
Azure Data Factory Scheduling Pipelines
Run R Scripts in Azure Databricks
Azure Data Factory Synapse Analytics Connector
You need to implement a Type 3 slowly changing dimension (SCD) for product category data in an Azure Synapse Analytics dedicated SQL pool.
You have a table that was created by using the following Transact-SQL statement.
Which two columns should you add to the table? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. [EffectiveStartDate] [datetime] NOT NULL,
B. [CurrentProductCategory] [nvarchar] (100) NOT NULL,
C. [EffectiveEndDate] [datetime] NULL,
D. [ProductCategory] [nvarchar] (100) NOT NULL,
E. [OriginalProductCategory] [nvarchar] (100) NOT NULL,
Explanation:
Type 3 slowly changing dimensions track changes by adding columns to store both current and previous values of specific attributes. Unlike Type 2 which creates new rows, Type 3 maintains one row per entity with separate columns for original and current values. For product category tracking, this approach preserves history for one level of change while maintaining the existing row structure.
Correct Option:
B. [CurrentProductCategory] nvarchar NOT NULL
This column will store the most recent or current product category value. In a Type 3 SCD implementation, you need columns to hold both the original and current values of the attribute being tracked. This column captures the latest category assignment for the product.
E. [OriginalProductCategory] nvarchar NOT NULL
This column stores the original or previous product category value. When combined with the current category column, it enables reporting on both the original and current category values. Type 3 SCD typically tracks only one level of change, preserving the original value while updating the current value column.
Incorrect Options:
A. [EffectiveStartDate] [datetime] NOT NULL
Effective dates are characteristic of Type 2 SCD implementations where each version of a dimension member has a validity period. Type 3 SCD does not require date columns as it maintains only original and current values without tracking multiple versions over time.
C. [EffectiveEndDate] [datetime] NULL
Similar to effective start dates, end dates are used in Type 2 SCD to indicate when a version becomes inactive. Type 3 SCD does not use date ranges for version tracking since it preserves only the original value and overwrites the current value.
D. [ProductCategory] nvarchar NOT NULL
Adding a single product category column would not provide historical tracking capability. This column would simply be overwritten each time the category changes, losing the previous value entirely. Type 3 requires separate columns for original and current values.
Reference:
Slowly Changing Dimension Type 3 Concepts
Implementing SCD in Azure Synapse Analytics
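The Type 3 update behavior described above can be sketched in Python, modeling a dimension row as a dict. The column names match the correct options; the update logic is an illustrative sketch of Type 3 semantics, not production merge code:

```python
def apply_type3_change(row, new_category):
    """Type 3 SCD: keep one row per product, preserving the first
    replaced value in OriginalProductCategory while overwriting
    CurrentProductCategory on every change."""
    updated = dict(row)
    if updated.get("OriginalProductCategory") is None:
        # First change: capture the value being replaced as the original
        updated["OriginalProductCategory"] = updated["CurrentProductCategory"]
    updated["CurrentProductCategory"] = new_category
    return updated

row = {"ProductID": 1,
       "CurrentProductCategory": "Toys",
       "OriginalProductCategory": None}

row = apply_type3_change(row, "Games")
# Current is now 'Games'; original preserved as 'Toys'
row = apply_type3_change(row, "Puzzles")
# Current is now 'Puzzles'; original still 'Toys' (one level of history)
print(row)
```

This makes the contrast with Type 2 concrete: no new rows and no effective date ranges, just two category columns on the same row, which is why options A and C are wrong.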
You plan to build a structured streaming solution in Azure Databricks. The solution will count new events in five-minute intervals and report only events that arrive during the interval. The output will be sent to a Delta Lake table.
Which output mode should you use?
A. complete
B. update
C. append
Explanation:
Structured Streaming in Azure Databricks provides different output modes that determine how results are written to sinks. The choice depends on whether you want to output all aggregated data, only updated records, or only new records. For interval-based counting where only events from the current window matter, the output mode must align with this requirement.
Correct Option:
C. append
The append output mode writes only the new rows added to the result table since the last trigger. Since the solution counts events in five-minute intervals and only needs to report events from the current interval, append mode is appropriate. Each micro-batch will produce a new row for that interval's count without modifying previous results.
Incorrect Options:
A. complete
Complete mode rewrites the entire output table with all aggregated data after each trigger. This would continuously update previous interval counts unnecessarily and create excessive write operations. It is typically used when you need to maintain a complete result table that reflects all historical aggregations.
B. update
Update mode outputs only rows that were updated since the last trigger. Since the solution counts events per interval and does not modify previous interval results, update mode would not write any data. New interval results are insertions, not updates to existing rows.
Reference:
Azure Databricks Structured Streaming Output Modes
Delta Lake Streaming with Structured Streaming
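Append mode's write pattern can be sketched in Python. This simulates the sink behavior only (each trigger emits rows for newly finalized windows and never rewrites earlier results); the window labels and counts are illustrative:

```python
def run_append_mode(triggers):
    """Simulate append output mode: each trigger appends only the rows
    for newly finalized windows; rows already in the sink are never
    updated or rewritten."""
    sink = []
    for finalized_rows in triggers:
        sink.extend(finalized_rows)  # append-only writes
    return sink

# One five-minute window count finalized per trigger
triggers = [
    [("00:00-00:05", 12)],
    [("00:05-00:10", 7)],
    [("00:10-00:15", 9)],
]
print(run_append_mode(triggers))
```

Complete mode, by contrast, would rewrite the entire result table on every trigger, and update mode would emit nothing here because finalized interval counts are inserts, never updates.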
You need to trigger an Azure Data Factory pipeline when a file arrives in an Azure Data Lake Storage Gen2 container.
Which resource provider should you enable?
A. Microsoft.Sql
B. Microsoft-Automation
C. Microsoft.EventGrid
D. Microsoft.EventHub
Explanation:
Azure Data Factory supports event-based triggers that respond to file arrival events in Azure Storage. These triggers rely on Azure Event Grid to monitor storage events and initiate pipeline executions. The integration between Event Grid and Data Factory enables automated workflows triggered by blob creation, deletion, or modification events.
Correct Option:
C. Microsoft.EventGrid
Event Grid must be enabled as a resource provider to use storage event triggers in Azure Data Factory. When files arrive in ADLS Gen2, Event Grid captures these blob created events and can automatically trigger Data Factory pipelines. This provider handles the event routing from storage to the pipeline execution engine.
Incorrect Options:
A. Microsoft.Sql
SQL resource provider manages relational database services including Azure SQL Database and SQL Server on VMs. It has no role in storage event monitoring or triggering Data Factory pipelines based on file arrival events.
B. Microsoft-Automation
This appears to be a typo (should be Microsoft.Automation). The Automation resource provider manages Azure Automation accounts, runbooks, and configuration management. While Automation can respond to events, it is not the provider needed for Data Factory storage event triggers.
D. Microsoft.EventHub
Event Hub is a big data streaming platform and event ingestion service. It handles high-throughput data ingestion but does not provide the event-driven trigger capability for Data Factory pipelines. Event Grid is specifically designed for event routing and trigger scenarios.
Reference:
Azure Data Factory Event-Based Triggers
Azure Event Grid Storage Events Integration
You are designing a financial transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will have a clustered columnstore index and will include the following columns:
• TransactionType: 40 million rows per transaction type
• CustomerSegment: 4 million rows per customer segment
• TransactionMonth: 65 million rows per month
• AccountType: 500 million rows per account type
You have the following query requirements:
Analysts will most commonly analyze transactions for a given month.
Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type.
You need to recommend a partition strategy for the table to minimize query times.
On which column should you recommend partitioning the table?
A. CustomerSegment
B. AccountType
C. TransactionType
D. TransactionMonth
Explanation:
Partitioning in Azure Synapse dedicated SQL pools helps improve query performance through partition elimination, where queries only scan relevant partitions. The choice of partition column should align with the most common query filters. For columnstore tables, partition size should balance between partition elimination benefits and row group density within partitions.
Correct Option:
D. TransactionMonth
TransactionMonth is the correct choice because analysts most commonly analyze transactions for a given month. Partitioning on this column enables partition elimination when queries include month filters, significantly reducing data scanned. With 65 million rows per month, this creates well-sized partitions that optimize columnstore compression and query performance while aligning with the primary query pattern.
Incorrect Options:
A. CustomerSegment
With only 4 million rows per segment, partitioning on CustomerSegment would create too many small partitions. Small partitions degrade columnstore performance because each partition needs minimum rows (around 1 million) for optimal compression and segment elimination benefits.
B. AccountType
AccountType has 500 million rows per value, which is too large for effective partitioning. Single large partitions provide minimal partition elimination benefits. This distribution would not help queries that filter on other columns like month while creating management overhead.
C. TransactionType
TransactionType has 40 million rows per type, which could work but does not align with the primary query pattern of analyzing by month. Partitioning on transaction type would not benefit month-based queries, which are the most common analysis requirement.
Reference:
Azure Synapse Dedicated SQL Pool Partitioning Guidance
Columnstore Index Partitioning Best Practices
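The partition-elimination arithmetic behind this answer can be sketched in Python. This is a simple model of rows scanned, assuming the engine skips partitions whose key does not match the filter; the partition sizes mirror the 65 million rows per month stated in the question:

```python
def rows_scanned(partitions, month_filter=None):
    """Return how many rows a query must scan, assuming the engine
    eliminates partitions whose key does not match the filter."""
    if month_filter is None:
        return sum(partitions.values())      # no filter: full table scan
    return partitions.get(month_filter, 0)   # only the matching partition

# Hypothetical monthly partition sizes (~65M rows each over 12 months)
monthly = {f"2024-{m:02d}": 65_000_000 for m in range(1, 13)}

print(rows_scanned(monthly))              # 780000000 rows without a filter
print(rows_scanned(monthly, "2024-06"))   # 65000000 rows with a month filter
```

A typical month-filtered query touches one partition instead of twelve, and 65 million rows per partition comfortably exceeds the roughly 1-million-row minimum that columnstore row groups need for good compression.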
You have an Azure Stream Analytics job. You need to ensure that the job has enough streaming units provisioned. You configure monitoring of the SU % Utilization metric. Which two additional metrics should you monitor? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. Out of order Events
B. Late Input Events
C. Backlogged Input Events
D. Function Events
Explanation:
Azure Stream Analytics requires appropriate streaming unit (SU) allocation to handle input event rates and query complexity. Monitoring SU utilization alone doesn't provide complete visibility into job health. Additional metrics help identify if backlog is due to processing capacity issues or data characteristics. Understanding these metrics ensures optimal resource allocation.
Correct Option:
C. Backlogged Input Events
This is one correct answer. Backlogged Input Events indicates the number of input events that are waiting to be processed. When this metric increases consistently, it suggests the job needs more streaming units to keep up with the incoming event rate. High backlog combined with high SU utilization confirms the need for more SUs.
A. Out of order Events
This is the other correct answer. Out of order events require additional processing as Stream Analytics must reorder them based on timestamps using the allowed late arrival window. High out-of-order rates can increase processing load and contribute to backlog, potentially requiring more streaming units even if input rates appear normal.
Incorrect Options:
B. Late Input Events
Late input events are those arriving after the late arrival window. These events are dropped or handled based on configuration. While important for data completeness monitoring, they don't directly indicate whether streaming units are sufficient for processing capacity.
D. Function Events
Function Events relates to Azure Machine Learning functions integration in Stream Analytics. This metric tracks calls to external ML functions but does not provide meaningful information about streaming unit provisioning requirements or processing capacity needs.
Reference:
Azure Stream Analytics Streaming Units Monitoring
Stream Analytics Performance Metrics and Tuning
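The scaling decision these metrics drive can be sketched as a small Python heuristic. The threshold values are illustrative assumptions, not official guidance, and the function name is hypothetical:

```python
def needs_more_streaming_units(su_utilization_pct, backlogged_events,
                               backlog_is_rising):
    """Heuristic sketch: sustained high SU % Utilization combined with a
    growing input backlog suggests the job cannot keep up with the input
    rate and should be scaled up. Thresholds are illustrative only."""
    return (su_utilization_pct >= 80
            and backlogged_events > 0
            and backlog_is_rising)

# High utilization plus a rising backlog: scale up
print(needs_more_streaming_units(95, 12_000, True))   # True
# Low utilization, no backlog: current SUs are sufficient
print(needs_more_streaming_units(40, 0, False))       # False
```

The point of watching backlog alongside SU % is disambiguation: high SU % alone may be a transient spike, but high SU % with a steadily growing backlog (possibly aggravated by out-of-order events that must be reordered) indicates a genuine capacity shortfall.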
You plan to perform batch processing in Azure Databricks once daily.
Which type of Databricks cluster should you use?
A. High Concurrency
B. automated
C. interactive
Explanation:
Azure Databricks offers different cluster types optimized for various workloads. Batch processing jobs that run on a schedule have different requirements compared to interactive analytics or development work. The cluster choice affects cost, performance, and resource management for scheduled production workloads.
Correct Option:
B. automated
Automated clusters (also called job clusters) are specifically designed for running scheduled production jobs. They terminate automatically after job completion, reducing costs compared to always-on clusters. Azure Databricks creates a new cluster for each job run based on the configured specifications, ensuring consistent environments and isolation between runs.
Incorrect Options:
A. High Concurrency
High concurrency clusters are optimized for multiple users running interactive queries simultaneously. They use a shared architecture with separate Spark contexts for isolation. These clusters are designed for BI analysts and data exploration, not for scheduled batch jobs where a fresh cluster per job is preferable.
C. interactive
Interactive clusters are designed for ad-hoc analysis, development, and collaborative work. They remain running until manually terminated and are shared among users. Using interactive clusters for scheduled batch jobs would keep resources running continuously, increasing costs unnecessarily and potentially causing resource contention.
Reference:
Azure Databricks Cluster Types and Configurations
Job Clusters vs All-Purpose Clusters