Databricks clusters use UTC as the default timezone, so when you run time-related code, the displayed time is not the local time, which is not ideal. In this post, I want to share how to change the timezone setting for a Databricks cluster.
Change the system timezone#
In a notebook cell, run the following to check which Linux distribution Databricks is using:
%sh
lsb_release -a
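The output will look something like the following (the exact release and codename depend on the Databricks Runtime version you use):
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.5 LTS
Release:        20.04
Codename:       focal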
It shows the underlying system is Ubuntu.
On Ubuntu, we can use the timedatectl command line tool to change the timezone.
List timezones:
%sh
timedatectl list-timezones
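The list is long. If you already know roughly which timezone you want, you can filter it, e.g.:
%sh
timedatectl list-timezones | grep -i shanghai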
Pick a timezone we want:
%sh
timedatectl set-timezone Asia/Shanghai
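To confirm that the system-level change took effect, run timedatectl without arguments and check the Time zone line (or simply run date):
%sh
timedatectl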
Note that the above command does not affect the timezone setting of Spark, since Spark has already been started.
ref:
- change timezone Ubuntu: https://askubuntu.com/a/594186/768311
Change the timezone for Databricks Spark#
For the current session#
If you only want to set the timezone for the current Spark session, just run the following statement:
spark.conf.set('spark.sql.session.timeZone', 'Asia/Shanghai')
The explanation for spark.sql.session.timeZone from the Spark documentation:
The ID of session local timezone in the format of either region-based zone IDs or zone offsets. Region IDs must have the form ‘area/city’, such as ‘America/Los_Angeles’. Zone offsets must be in the format ‘(+|-)HH’, ‘(+|-)HH:mm’ or ‘(+|-)HH:mm:ss’, e.g. ‘-08’, ‘+01:00’ or ‘-13:33:33’. Also ‘UTC’ and ‘Z’ are supported as aliases of ‘+00:00’. Other short names are not recommended to use because they can be ambiguous.
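For example, since Asia/Shanghai is UTC+8 all year round (China does not observe daylight saving time), the following fixed-offset setting is equivalent:
spark.conf.set('spark.sql.session.timeZone', '+08:00')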
Then you can run display(spark.sql("select current_timezone()")) in a Databricks notebook cell to verify the change.
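To see the effect on how timestamps are rendered, here is a small sketch (spark is the session object that Databricks notebooks already provide; the exact timestamps depend on when you run it):
# The same instant is rendered differently depending on the session timezone.
spark.conf.set('spark.sql.session.timeZone', 'UTC')
spark.sql('select current_timestamp() as ts').show(truncate=False)

spark.conf.set('spark.sql.session.timeZone', 'Asia/Shanghai')
spark.sql('select current_timestamp() as ts').show(truncate=False)
# The second timestamp is 8 hours ahead of the first, since Asia/Shanghai is UTC+8.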
However, if you create a new notebook using the same cluster, the timezone setting does not persist.
ref:
- spark configuration: https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration
- set default timezone databricks: https://stackoverflow.com/a/68268838/6064933
- use different timezone for spark session: https://stackoverflow.com/a/68957356/6064933
In the cluster config#
In the Advanced options section of the cluster settings page, under the Spark tab, you can add the following config:
spark.sql.session.timeZone Asia/Shanghai
This will make sure that every notebook attached to this cluster has the correct timezone set up.
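You can verify this from any notebook attached to the cluster:
print(spark.conf.get('spark.sql.session.timeZone'))  # should print Asia/Shanghai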
Databricks cluster init script#
You can also set up the timezone in an init script for the cluster. Just add something like the following to the cluster init script:
timedatectl set-timezone Asia/Shanghai
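A complete cluster-scoped init script could be as small as the following sketch (the file name set_timezone.sh is just an example; see the init script docs linked below). It runs on each node during cluster startup:
#!/bin/bash
# Example cluster-scoped init script, e.g. set_timezone.sh.
timedatectl set-timezone Asia/Shanghai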
Precedence of different settings#
The precedence of the above three methods for setting the timezone:
setting in notebook session > cluster config > init script
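For example, assuming the cluster config above sets spark.sql.session.timeZone to Asia/Shanghai, a notebook can still override it for its own session:
print(spark.conf.get('spark.sql.session.timeZone'))  # Asia/Shanghai, from the cluster config

# The session-level setting wins for this notebook only;
# other notebooks on the same cluster are unaffected.
spark.conf.set('spark.sql.session.timeZone', 'America/Los_Angeles')
print(spark.conf.get('spark.sql.session.timeZone'))  # America/Los_Angeles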
ref:
- what are init script: https://learn.microsoft.com/en-us/azure/databricks/init-scripts/
- cluster-scope init script: https://learn.microsoft.com/en-us/azure/databricks/init-scripts/cluster-scoped