The Azure Databricks repository is a set of blog posts, written as an Advent of 2020 present to readers, for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
- Dec 01: What is Azure Databricks
- Dec 02: How to get started with Azure Databricks
- Dec 03: Getting to know the workspace and Azure Databricks platform
- Dec 04: Creating your first Azure Databricks cluster
- Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and jobs
- Dec 06: Importing and storing data to Azure Databricks
- Dec 07: Starting with Databricks notebooks and loading data to DBFS
- Dec 08: Using Databricks CLI and DBFS CLI for file upload
- Dec 09: Connect to Azure Blob storage using Notebooks in Azure Databricks
- Dec 10: Using Azure Databricks Notebooks with SQL for Data engineering tasks
- Dec 11: Using Azure Databricks Notebooks with R Language for data analytics
- Dec 12: Using Azure Databricks Notebooks with Python Language for data analytics
- Dec 13: Using Python Databricks Koalas with Azure Databricks
- Dec 14: From configuration to execution of Databricks jobs
- Dec 15: Databricks Spark UI, Event Logs, Driver logs and Metrics
- Dec 16: Databricks experiments, models and MLFlow
- Dec 17: End-to-End Machine learning project in Azure Databricks
- Dec 18: Using Azure Data Factory with Azure Databricks
- Dec 19: Using Azure Data Factory with Azure Databricks for merging CSV files
- Dec 20: Orchestrating multiple notebooks with Azure Databricks
- Dec 21: Using Scala with Spark Core API in Azure Databricks
- Dec 22: Using Spark SQL and DataFrames in Azure Databricks
- Dec 23: Using Spark Streaming in Azure Databricks
- Dec 24: Using Spark MLlib for Machine Learning in Azure Databricks
- Dec 25: Using Spark GraphFrames in Azure Databricks
- Dec 26: Connecting Azure Machine Learning Services Workspace and Azure Databricks
- Dec 27: Connecting Azure Databricks with on premise environment
- Dec 28: Infrastructure as Code and how to automate, script and deploy Azure Databricks with Powershell
- Dec 29: Performance tuning for Apache Spark
Yesterday we looked into performance tuning for improving the day-to-day usage of Spark and Azure Databricks. Today we will explore monitoring (which we started on Day 15) and troubleshooting of the most common mistakes or errors a user will encounter in Azure Databricks.
Spark in Databricks is largely taken care of for you and can be monitored from the Spark UI. Since Databricks is an encapsulated platform, Azure manages many of the components for you, from the network and the JVM (Java Virtual Machine) to the hosting operating system and many of the cluster components such as Mesos, YARN or any other Spark cluster manager.
As we saw in the Day 15 post, you can monitor queries, tasks, jobs, Spark logs and the Spark UI in Azure Databricks. Spark logs will help you pinpoint the problem you are encountering. They are also useful for building a history of logs to understand the behaviour of a job or task over time and for future troubleshooting.
The Spark UI is a good visual way to monitor what is happening on your cluster and offers a wealth of metrics for troubleshooting.
It also gives you detailed information on Spark tasks and a clear visual presentation of task runs, SQL runs and all the stages in detail.
All of the tasks can also be visualized as a DAG:
Approaching Spark debugging, let me give you some common causes and symptoms of problems in your Spark jobs and in the Spark engine itself. There are many issues one can encounter; I will try to tackle a couple of those that show up as a return message in Databricks notebooks or in the Spark UI in general.
The first issue, a Spark job that does not start, can appear frequently, especially if you are a beginner, but it can also happen when Spark is running standalone (not in Azure Databricks).
Signs and symptoms:
- Spark job doesn't start
- Spark UI does not show any nodes on cluster (except the driver)
- Spark UI is reporting vague information
Potential Solution:
- the cluster is not started or is still starting up,
- this often happens with a poorly configured cluster (usually when running your own Spark deployment and (almost) never with Azure Databricks), either an IP, network or VNet misconfiguration,
- it can be a memory configuration issue, which should be reconfigured in the start-up scripts (see the sketch below).
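If you suspect a memory misconfiguration, a quick first check from a notebook is to print the memory-related settings the cluster is actually running with. A minimal sketch, assuming the `spark` session object that Databricks notebooks provide automatically:

```python
# Print the memory-related settings the cluster is actually running with.
# The keys below are standard Spark configuration properties; on Azure Databricks
# they are normally set in the cluster's Spark config rather than in the notebook.
for key in ["spark.driver.memory", "spark.executor.memory", "spark.memory.fraction"]:
    print(key, "=", spark.conf.get(key, "<not set>"))
```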
While working in notebooks on a cluster that is already running, it can happen that some part of the code or a Spark job that was previously running fine starts to fail.
Signs and symptoms:
- a job runs successfully on all clusters, but fails on one
- code blocks in a notebook normally run in sequence, but one of them fails
- a Hive SQL table or an R/Python DataFrame that used to be created normally can no longer be created
Potential Solution:
- check if your data still exists at the expected location and if it is still in the same file format (see the sketch after this list)
- if you are running a SQL query, check that the query is valid and all column names are correct
- go through the stack trace and try to figure out which component is failing
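A quick sanity check is to list the expected location and read a small sample directly from the notebook. A minimal sketch, assuming a hypothetical DBFS path and Parquet files:

```python
# Hypothetical path; replace with the location your job expects.
path = "dbfs:/mnt/raw/sales/"

# dbutils is available in Databricks notebooks; listing the folder confirms
# the files are still there and lets you spot unexpected names or sizes.
for f in dbutils.fs.ls(path):
    print(f.name, f.size)

# If the files look right, read them and verify the schema is still what the job expects.
df = spark.read.format("parquet").load(path)
df.printSchema()
```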
When running notebook commands or using Spark apps (widgets, etc.), you can get a message that the cluster is unresponsive. This is a severe error and should be addressed immediately.
Signs and symptoms:
- a code block is not executed and fails with loads of JVM responses
- you get an error message that the cluster is unresponsive
- a Spark job is running with no return value or error message
Potential Solution:
- restart the cluster and re-attach the notebook to it
- check the dataset for any inconsistencies and data size limits (size of the uploaded file or distribution of the files over DBFS)
- check the compatibility of the installed libraries with the Spark version on your cluster (see the sketch after this list)
- consider changing the cluster runtime from Standard, GPU or ML to an LTS (Long-Term Support) version; LTS runtimes tend to have a greater span of compatibility
- if you are using a high-concurrency cluster, check who else is using it and what they are doing; there may be a potential "deadlock" in some tasks that consume too many resources.
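To check library and runtime compatibility, you can print the versions your notebook is actually running against. A minimal sketch; the libraries listed are just examples of whatever your code depends on:

```python
import pyspark
import pandas

# Spark version the cluster the notebook is attached to is running.
print("Spark version:", spark.version)

# Versions of a few libraries your code depends on (examples only).
print("pyspark:", pyspark.__version__)
print("pandas:", pandas.__version__)
```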
Loading data is probably the most important task in Azure Databricks. And there are many ways in which data can fail to appear in the notebook.
Signs and symptoms:
- data is stored in Blob storage and cannot be accessed or loaded into Databricks
- data takes too long to load and you stop the load process
- data should be at the expected location, but it is not
Potential Solution:
- if you are reading data from Azure Blob storage, check that Azure Databricks has all the credentials needed for access (see the sketch after this list)
- loading data files that are wide (1000+ columns) might cause problems for Spark; load the schema first, create a DataFrame (for example with Scala) and only then insert the data into the frame
- check that the persistent data (DBFS) is at the correct location and in the expected format; it can also happen that different sample files are used, which in this case might be missing from the standard DBFS path
- the DataFrame or Dataset was created in a different language than the one you are trying to read it from; languages sit on top of the Structured API and should be interchangeable, so check your code for inconsistencies
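A minimal sketch of the first two points, assuming a hypothetical storage account, container and CSV layout; in practice the access key should come from a secret scope rather than being pasted into the notebook:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical storage account and container names.
storage_account = "mystorageaccount"
container = "raw"

# Give the cluster the credentials it needs; ideally read the key from a secret scope,
# e.g. dbutils.secrets.get(scope="my-scope", key="storage-key").
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    "<access-key>"
)

# Defining the schema up front avoids an expensive schema-inference pass over wide files.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("amount", DoubleType(), True),
    # ... remaining columns ...
])

df = (spark.read
      .format("csv")
      .option("header", "true")
      .schema(schema)
      .load(f"wasbs://{container}@{storage_account}.blob.core.windows.net/sales/"))
```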
Signs and symptoms:
- unexpected null values in Spark transformations
- scheduled jobs that used to work no longer run, or no longer produce correct results
Potential Solution:
- the underlying data may have changed format,
- use an accumulator to count the number of rows (records/observations), or to catch and count the rows where parsing or processing fails (see the sketch after this list),
- check that your data transformations return a valid SQL query plan; watch for implicit data type conversions (a "15" can be a string and not a number), which can make Spark return a strange result or no result at all.
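A minimal sketch of the accumulator idea, assuming a hypothetical string column `order_date` that sometimes fails to parse:

```python
from datetime import datetime

# Tiny example DataFrame standing in for the real data under investigation.
df = spark.createDataFrame([("2020-12-30",), ("not-a-date",)], ["order_date"])

# Accumulators are written on the executors and read on the driver after an action runs.
bad_rows = spark.sparkContext.accumulator(0)

def parse_date(value):
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        bad_rows.add(1)
        return None

parsed = df.rdd.map(lambda row: parse_date(row["order_date"]))
total = parsed.count()  # the action triggers the computation
print("total rows:", total, "unparseable dates:", bad_rows.value)
```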
This is a fairly common problem and also one of the hardest to tackle. It usually happens because of an unevenly distributed workload across the cluster or because of a hardware failure (one disk / VM is unresponsive).
Signs and symptoms:
- slow tasks caused by a .groupBy() call
- after data aggregation, jobs are still slow
Potential Solution:
- try changing the partitioning of the data so there is less data per partition (see the sketch after this list)
- try changing the partition key on your dataset
- check that your SELECT statement actually benefits from the partitions
- if you are using RDDs, try converting to a DataFrame or Dataset to get the aggregations done faster
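A minimal sketch of repartitioning before an aggregation; `customer_id` and `amount` are hypothetical column names and 200 is just an example partition count:

```python
# df is the DataFrame under investigation (assumed to exist).
# Repartitioning on the aggregation key spreads the shuffle more evenly across tasks.
repartitioned = df.repartition(200, "customer_id")

result = repartitioned.groupBy("customer_id").sum("amount")
result.explain()  # inspect the physical plan to confirm the shuffle looks reasonable
```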
Tomorrow we will finish the series by looking into sources, documentation and next learning steps; it should be a nice way to wrap things up.
The complete set of code and the notebook are available in the GitHub repository.
Happy Coding and Stay Healthy!