In this post, I’m sharing a quick review of the Perform Data Science with Azure Databricks course, along with useful insights to support your prep for the DP-100 certification exam.
Just completed this fourth course in the Microsoft Azure Data Scientist Associate Professional Certificate? You’re now deep into the path toward certification, and this course focuses on leveraging Azure Databricks to handle real-world data science workloads using Apache Spark. With powerful clusters and cloud-scale processing, it’s all about working efficiently with big data. If you’re serious about mastering the tools Azure offers for machine learning, this course will help expand your cloud data science toolkit — and I’ve got the review to help guide your next steps.
Perform Data Science with Azure Databricks
Table of Contents
Test prep Quiz Answers | Week 1
Question 1)
Azure Databricks Runtime adds several key capabilities to Apache Spark workloads that can increase performance and reduce costs. Which of the following are features of Azure Databricks? Select all that apply.
- Indexing
- Caching
- Auto-scaling and auto-termination
- Parallel Cluster Drivers
- High-speed connectors to Azure storage services
Question 2)
Apache Spark supports which of the following languages? Select all that apply.
- Scala
- Java
- Python
- ORC
Question 3)
Which of the following statements are true? Select all that apply.
- Once created, a notebook can only be connected to the original cluster.
- To use your Azure Databricks notebook to run code, you must attach it to a cluster.
- To use your Azure Databricks notebook to run code, you do not require a cluster.
- You can detach a notebook from a cluster and attach it to another cluster.
Question 4)
Which of the following Databricks features are not part of open-source Spark?
- Databricks Workflows
- Databricks Workspace
- Databricks Runtime
- MLflow
Question 5)
How many drivers does a Cluster have?
- Configurable between one and eight
- Only one
- Two, running in parallel
Question 6)
What type of process are the driver and the executors?
- Java processes
- C++ processes
- Python processes
Question 7)
You work with Big Data as a data engineer, and you must process real-time data. This is referred to as having which of the following characteristics?
- High velocity
- Variety
- High volume
Question 8)
Spark’s performance is based on parallelism. Which of the following scalability methods is limited by a finite amount of RAM, threads, and CPU speed?
- Vertical Scaling
- Diagonal Scaling
- Horizontal Scaling
Question 9)
Spark clusters use two levels of parallelization. Which of the following are levels of parallelization?
- Job
- Partition
- Executor
- Slot
Question 10)
In an Apache Spark cluster, jobs are divided into which of the following?
- Drivers
- Executors
- Slots
- Tasks
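The last two questions above come down to how Spark maps work onto a cluster: a job is broken into stages and tasks, and each task processes one partition in an executor slot. Here is a minimal sketch (assuming a Databricks notebook where `spark` is predefined) showing how partition count drives task-level parallelism:

```python
# Create a small example DataFrame; "spark" is the notebook's built-in SparkSession.
df = spark.range(0, 1_000_000)

# Each partition becomes one task, executed by a slot on an executor.
print(df.rdd.getNumPartitions())

# Repartitioning changes how many tasks the next stage will run in parallel.
df8 = df.repartition(8)
print(df8.rdd.getNumPartitions())  # 8
```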
Test prep Quiz Answers | Week 2
Question 1)
How do you list files in DBFS within a notebook?
- %fs dir /my-file-path
- ls /my-file-path
- %fs ls /my-file-path
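As a quick sketch of how the `%fs ls` magic command maps onto the Python API (the path is the placeholder from the question; `dbutils` is predefined in Databricks notebooks):

```python
# %fs ls /my-file-path is shorthand for the dbutils file-system utilities.
files = dbutils.fs.ls("/my-file-path")

# Each entry is a FileInfo with path, name, and size attributes.
for f in files:
    print(f.path, f.size)
```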
Question 2)
How do you infer the data types and column names when you read a JSON file?
- spark.read.inferSchema("true").json(jsonFile)
- spark.read.option("inferData", "true").json(jsonFile)
- spark.read.option("inferSchema", "true").json(jsonFile)
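For context, a minimal sketch of reading JSON with schema inference; `jsonFile` is a hypothetical path used only for illustration:

```python
jsonFile = "dbfs:/mnt/training/sample.json"  # hypothetical location

# Column names come from the JSON keys; types are inferred by scanning the data.
df = (spark.read
      .option("inferSchema", "true")
      .json(jsonFile))

df.printSchema()
```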
Question 3)
Which of the following SparkSession functions returns a DataFrameReader?
- read(..)
- emptyDataFrame(..)
- readStream(..)
- createDataFrame(..)
Question 4)
When using a notebook and a Spark session, we can read a CSV file. Which of the following can be used to view the first few thousand characters of a file?
- %fs ls /mnt/training/wikipedia/pageviews/
- %fs dir /mnt/training/wikipedia/pageviews/
- %fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv
Question 5)
You have created an Azure Databricks cluster, and you have access to a source file.
fileName = "dbfs:/mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv"
You need to determine the structure of the file. Which of the following commands will assist with determining what the column and data types are?
- .option("inferSchema", "false")
- .option("header", "false")
- .option("inferSchema", "true")
- .option("header", "true")
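Here is a hedged sketch of how those reader options fit together when reading the file from the question (the tab separator is an assumption based on the .tsv extension):

```python
fileName = "dbfs:/mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv"

df = (spark.read
      .option("header", "true")       # first line holds the column names
      .option("inferSchema", "true")  # scan the data to infer column types
      .option("sep", "\t")            # tab-separated values
      .csv(fileName))

df.printSchema()  # reports the inferred column names and data types
```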
Question 6)
In an Azure Databricks workspace you run the following command:
%fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv
The partial output from this command is as follows:
[Truncated to first 65536 bytes]
"timestamp" "site" "requests"
"2015-03-16T00:09:55" "mobile" 1595
"2015-03-16T00:10:39" "mobile" 1544
"2015-03-16T00:19:39" "desktop" 2460
"2015-03-16T00:38:11" "desktop" 2237
"2015-03-16T00:42:40" "mobile" 1656
"2015-03-16T00:52:24" "desktop" 2452
Which of the following pieces of information can be inferred from the command and the output?
Select all that apply.
- The columns are tab-separated
- The file has no header
- All columns are strings
- The file has a header
- Two columns are strings, and one column is a number
- The file is a comma-separated (CSV) file
Question 7)
In Azure Databricks, you wish to create a temporary view that will be accessible to multiple notebooks. Which of the following commands will provide this feature?
- createOrReplaceTempView(set_scope "Global")
- createOrReplaceGlobalTempView(..)
- createOrReplaceTempView(..)
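A minimal sketch of a global temporary view, which is registered in the `global_temp` database and is visible to other notebooks attached to the same cluster (the DataFrame and view name are placeholders):

```python
df = spark.range(10)  # placeholder DataFrame

# Register the view in the cluster-wide global_temp database.
df.createOrReplaceGlobalTempView("shared_view")

# Any notebook attached to the same cluster can now query it.
spark.sql("SELECT * FROM global_temp.shared_view").show()
```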
Question 8)
Which of the following are true with respect to Parquet files? Select all that apply.
- Efficient data compression
- Is a Row-Oriented data store
- Is a Column-Oriented data store
- Designed for performance on small data sets
- Open Source
- Is a splittable “file format”.
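For reference, a short sketch of writing and reading Parquet, the columnar, compressed, splittable format the question refers to (the path is a placeholder):

```python
parquetPath = "dbfs:/tmp/example_parquet"  # hypothetical output location

df = spark.range(100)
df.write.mode("overwrite").parquet(parquetPath)

# Parquet stores its schema with the data, so no inference options are needed.
parquetDF = spark.read.parquet(parquetPath)
parquetDF.printSchema()
```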
Test prep Quiz Answers | Week 3
Question 1)
Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for DDL modifications. What is this functionality referred to as?
- ACID Transactions
- Time Travel
- Schema Evolution
- Schema Enforcement
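As a hedged illustration of schema evolution in Delta Lake, the `mergeSchema` write option lets an append introduce new columns without any DDL (the path and `newDataDF` are hypothetical):

```python
deltaPath = "dbfs:/tmp/example_delta"  # hypothetical Delta table location

# newDataDF is a hypothetical DataFrame containing an extra column
# not present in the existing table schema.
(newDataDF.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")  # evolve the table schema automatically
    .save(deltaPath))
```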
Question 2)
One of the core features of Delta Lake is performing upserts. Which of the following statements is true regarding Upsert?
- Upsert is supported in traditional data lakes
- Upsert is literally TWO operations. Update / Insert
- Upsert is a new DML statement for SQL syntax
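A minimal sketch of a Delta Lake upsert expressed as a single MERGE statement, combining the update and insert operations (table and column names are placeholders):

```python
spark.sql("""
  MERGE INTO customers AS target
  USING updates AS source
  ON target.customer_id = source.customer_id
  WHEN MATCHED THEN UPDATE SET *     -- update rows that already exist
  WHEN NOT MATCHED THEN INSERT *     -- insert rows that do not
""")
```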
Question 3)
What is the Databricks Delta command to display metadata?
- MSCK DETAIL tablename
- SHOW SCHEMA tablename
- DESCRIBE DETAIL tableName
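For context, a one-line sketch of displaying a Delta table’s metadata (location, size, number of files, and so on); `customers` is a placeholder table name:

```python
display(spark.sql("DESCRIBE DETAIL customers"))
```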
Question 4)
What optimization does the following command perform: OPTIMIZE Customers ZORDER BY City?
- Ensures that all data backing, for example, City=’London’ is colocated, then rewrites the sorted data into new Parquet files.
- Creates an order-based index on the City field to improve filters against that field
- Ensures that all data backing, for example, City=”London” is colocated, then updates a graph that routes requests to the appropriate files.
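A short sketch of running that optimization from a notebook; it compacts the table’s files and co-locates rows with the same City values in the rewritten Parquet files:

```python
spark.sql("OPTIMIZE Customers ZORDER BY (City)")
```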
Question 5)
You are planning on registering a user-defined function, g, as g_function in a SQL namespace. How would you achieve this programmatically?
- spark.udf.register(g, "g_function")
- spark.udf.register("g_function", g)
- spark.register_udf("g_function", g)
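As a hedged sketch of registering a Python function under a SQL name (the body of `g` is invented purely for illustration):

```python
def g(value):
    return value * 2  # placeholder logic

# Register the function under the name "g_function" in the SQL namespace.
spark.udf.register("g_function", g)

# It can now be called from SQL.
spark.sql("SELECT g_function(id) AS doubled FROM range(5)").show()
```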
Question 6)
User-defined Functions cannot operate on DataFrames.
- Yes
- No
Question 7)
Suppose you already have a DataFrame that only contains the relevant columns.
The columns are: id, employee_name, age, gender.
You want to retrieve the first initial from the employee_name field by creating a local function in Python/Scala. Which of the following code snippets can be used to get the first initial from the employee_name column?
- def firstInitialFunction(name):
    return name[0]
firstInitialFunction("Steven")
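Building on the question above, here is a hedged sketch of wrapping that local function as a UDF so it can be applied to the employee_name column of a DataFrame (`employeesDF` is a hypothetical DataFrame with that column):

```python
from pyspark.sql.functions import udf

def firstInitialFunction(name):
    return name[0]

# Wrap the local function as a UDF so it can operate on a DataFrame column.
firstInitialUDF = udf(firstInitialFunction)

display(employeesDF.select(firstInitialUDF("employee_name").alias("first_initial")))
```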
Test prep Quiz Answers | Week 4
Question 1)
What are qualitative variables also known as? Select all that apply.
- Numerical
- Discrete
- Continuous
- Categorical
Question 2)
Which type of supervised learning problem tends to output quantitative values?
- Regression
- Clustering
- Classification
Question 3)
In the process of exploratory data analysis, when we calculate the number of observations in the dataset, which of the following will tell us if there are missing values?
- Standard deviation
- Count
- Mean
Question 4)
In terms of correlations, what does a negative correlation of -1 mean?
- There is no association between the variables.
- For each unit increase in one variable, the same increase is seen in the other.
- For each unit increase in one variable, the same decrease is seen in the other.
Question 5)
Regarding visualization tools, which of the following can help you visualize quantiles and outliers?
- t-SNE
- Heat maps
- Box plots
- Q-Q plots
Question 6)
You have an AirBnB dataset where one categorical variable is room type.
There are three types of rooms: private room, entire home/apt, and shared room.
You must first encode each unique string into a number so that the machine learning model knows how to handle these room types.
How should you code that?
- from pyspark.ml.feature import StringIndexer
uniqueTypesDF = airbnbDF.select("room_type").distinct()
indexer = StringIndexer(inputCol="room_type", outputCol="room_type_index")
indexerModel = indexer.fit(uniqueTypesDF)
indexedDF = indexerModel.transform(uniqueTypesDF)
display(indexedDF)
Question 7)
You have an AirBnB dataset where one categorical variable is room type.
There are three types of rooms: private room, entire home/apt, and shared room.
After you’ve encoded each unique string into a number, each room has a unique numerical value assigned.
Now you must one-hot encode each of those values to a location in an array, so that the machine learning algorithm can handle each category independently.
How should you code that?
- from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder(inputCols=["room_type_index"], outputCols=["encoded_room_type"])
encoderModel = encoder.fit(indexedDF)
encodedDF = encoderModel.transform(indexedDF)
display(encodedDF)
Test prep Quiz Answers | Week 5
Question 1)
You can query previous runs programmatically by using the MlflowClient object as the entry point.
How would you code that in Python?
- from mlflow.tracking import MlflowClient
client = MlflowClient()
client.list_experiments()
Question 2)
You can also use the search_runs method to find all runs for a given experiment.
How would you code that in Python?
- experiment_id = run.info.experiment_id
runs_df = mlflow.search_runs(experiment_id)
display(runs_df)
Question 3)
You need to retrieve the last run from the list of experiments.
How would you code that in Python?
- runs = client.search_runs(experiment_id, order_by=["attributes.start_time desc"], max_results=1)
runs[0].data.metrics
Question 4)
Knowing that each algorithm has different hyperparameters available for tuning, which method can you use to explore the hyperparameters of a model?
- showParams()
- explainParams()
- exploreParams()
- getParams()
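For context, a small sketch of inspecting an estimator’s hyperparameters; `LinearRegression` is just an example estimator:

```python
from pyspark.ml.regression import LinearRegression

lr = LinearRegression()

# Prints one line per hyperparameter: name, description, and default value.
print(lr.explainParams())
```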
Question 5)
Which method from the PySpark class can you use to string together all the different possible hyperparameters you want to test?
- ParamGridBuilder()
- ParamGridSearch()
- ParamBuilder()
- ParamSearch()
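A minimal sketch of building a grid of hyperparameter combinations for an example estimator:

```python
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder

lr = LinearRegression()

# Every combination of the listed values becomes one candidate model (3 x 3 = 9).
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.1, 1.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .build())
```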
Question 6)
Which of the following belong to the exhaustive type of cross-validation techniques?
- K-fold cross-validation
- Holdout cross-validation
- Leave-one-out cross-validation
- Leave-p-out cross-validation
Question 7)
In which of the following non-exhaustive cross-validation techniques do you randomly assign data points to the training set and the test set?
- Holdout cross-validation
- Repeated random sub-sampling validation
- K-fold cross-validation
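As a hedged sketch tying the last two questions together, PySpark’s CrossValidator implements k-fold cross-validation over a parameter grid (`trainDF` is a hypothetical DataFrame with features and label columns):

```python
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

lr = LinearRegression(featuresCol="features", labelCol="label")
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
evaluator = RegressionEvaluator(labelCol="label")

# k-fold: the data is split into numFolds folds; each fold takes a turn as the
# validation set while the remaining folds are used for training.
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=3)

cvModel = cv.fit(trainDF)  # trainDF is a hypothetical training DataFrame
```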
Test prep Quiz Answers | Week 6
Question 1)
When developing a distributed training program using HorovodRunner you would generally follow these steps:
- Create a HorovodRunner instance initialized with the number of nodes.
- Define a Horovod training method according to the methods described in Horovod usage, making sure to add any import statements inside the method.
- Pass the training method to the HorovodRunner instance.
How would you code that in Python?
- from sparkdl import HorovodRunner  # available on the Databricks ML Runtime
hr = HorovodRunner(np=2)
def train():
    import tensorflow as tf
    import horovod.tensorflow as hvd  # imports belong inside the training method
    hvd.init()
hr.run(train)
Question 2)
You’re using Horovod to train a distributed neural network using Parquet files and Petastorm.
You have a dataset of housing prices in California named cal_housing.
After loading the data, you want to create a Spark DataFrame from the Pandas DataFrame so that you can concatenate the features and labels of the model.
How would you code that in Python?
- data = pd.concat([pd.DataFrame(X_train, columns=cal_housing.feature_names), pd.DataFrame(y_train, columns=["label"])], axis=1)
trainDF = spark.createDataFrame(data)
display(trainDF)
Question 3)
You’re using Horovod to train a distributed neural network using Parquet files and Petastorm.
You have a dataset of housing prices in California named cal_housing.
After loading the data, you created a Spark DataFrame from the Pandas DataFrame so that you can concatenate the features and labels of the model.
Now you need to create Dense Vectors for the features.
How would you code that in Python?
- from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(inputCols=cal_housing.feature_names, outputCol="features")
vecTrainDF = vecAssembler.transform(trainDF).select("features", "label")
display(vecTrainDF)
Question 4)
True or false?
Petastorm requires a Vector as an input, not an Array.
- True
- False
Question 5)
You’re working with Azure Machine Learning and you want to train a Diabetes Model and build a container image for the trained model.
- You will use the scikit-learn ElasticNet linear regression model.
- You want to deploy the model to production using Azure Kubernetes Service (AKS).
- You don’t have an active AKS cluster, so you need to create one using the Azure ML SDK.
- You’ll be using the default configuration.
How would you code that?
- aks_target = ComputeTarget.create(workspace=workspace,
      name=aks_cluster_name,
      provisioning_configuration=prov_config)
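For completeness, a hedged sketch of the full flow with the Azure ML SDK (v1), where the cluster name is a placeholder and `workspace` is assumed to be an existing Workspace object:

```python
from azureml.core.compute import AksCompute, ComputeTarget

aks_cluster_name = "diabetes-aks"                      # hypothetical cluster name
prov_config = AksCompute.provisioning_configuration()  # default configuration

# workspace is the Azure ML Workspace object obtained earlier.
aks_target = ComputeTarget.create(workspace=workspace,
                                  name=aks_cluster_name,
                                  provisioning_configuration=prov_config)

aks_target.wait_for_completion(show_output=True)
```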
Question 6)
You’re working with Azure Machine Learning and you want to train a Diabetes Model and build a container image for the trained model.
You will use the scikit-learn ElasticNet linear regression model.
You want to deploy the model to production using Azure Kubernetes Service (AKS).
You’ve created an AKS cluster for model deployment.
You’ve deployed the model’s image to the specified AKS cluster.
After you’ve trained a new model with different hyperparameters, you need to deploy the new model’s image to the AKS cluster.
How would you code that?
- prod_webservice.update(image=model_image_updated)
prod_webservice.wait_for_deployment(show_output = True)
Question 7)
After working with Azure Machine Learning, you want to clean up the deployments and terminate the “dev” ACI webservice using the Azure ML SDK.
Which method should do the job?
- dev_webservice.delete()
- dev_webservice.flush()
- dev_webservice.remove()
- dev_webservice.terminate()
You might also like: Build and Operate Machine Learning Solutions with Azure Quiz Answers + Review
Review
I recently finished the “Perform Data Science with Azure Databricks” course on Coursera, and it’s a strong deep dive into scalable data science using Spark on the Azure Databricks platform. With six focused modules, the course walks you through setting up clusters, preparing and modeling data, and running distributed ML workloads—all while reinforcing key concepts for the DP-100 exam.
What I found especially helpful was the hands-on approach to managing and processing data at scale. Whether you’re building models or exploring large datasets, this course shows how to integrate Databricks seamlessly with Azure Machine Learning.
If you’ve already got Python and ML experience, this course will push your skills into real enterprise-ready territory. It’s a valuable step for anyone preparing for the certification or working on high-performance data science projects in the cloud.