Change Timezone in Databricks Spark
Databricks clusters use UTC as the default timezone, so when you run time-related code, the displayed time is not your local time, which is often not what you want. In this post, I share how to change the timezone settings for a Databricks cluster.
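For example, a quick check in a Python notebook cell (a minimal sketch; current_timezone() needs Spark 3.1+) shows that times are rendered in UTC by default:

display(spark.sql("select current_timestamp(), current_timezone()"))
# On a fresh cluster this shows the UTC wall-clock time and 'UTC'.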
Change the system timezone
In a notebook cell, run the following to check which Linux distribution Databricks is running on:
%sh
lsb_release -a
It shows the underlying system is Ubuntu.
On Ubuntu, we can use the timedatectl command line tool to change the timezone.
List timezones:
%sh
timedatectl list-timezones
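If you would rather search for a zone from Python, a small sketch using the standard-library zoneinfo module (Python 3.9+, which recent Databricks runtimes ship) does the same filtering:

from zoneinfo import available_timezones

# Filter the available zone names for the region we care about.
print(sorted(z for z in available_timezones() if "Shanghai" in z))
# ['Asia/Shanghai']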
Pick the timezone we want and set it:
%sh
timedatectl set-timezone Asia/Shanghai
Note that the above command does not affect the timezone setting of Spark, since Spark has already been started.
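You can confirm this from a Python cell: even after the OS-level change, Spark still reports the timezone it was started with.

# The Spark session is unaffected by the OS-level timezone switch.
spark.sql("select current_timezone()").show()
# still prints UTC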
ref:
- change timezone Ubuntu: https://askubuntu.com/a/594186/768311
Change the timezone for Databricks Spark
For the current session
If you only want to set the timezone for the current Spark session, just run the following statement:
spark.conf.set('spark.sql.session.timeZone', 'Asia/Shanghai')
The explanation for spark.sql.session.timeZone:
The ID of session local timezone in the format of either region-based zone IDs or zone offsets. Region IDs must have the form ‘area/city’, such as ‘America/Los_Angeles’. Zone offsets must be in the format ‘(+|-)HH’, ‘(+|-)HH:mm’ or ‘(+|-)HH:mm:ss’, e.g. ‘-08’, ‘+01:00’ or ‘-13:33:33’. Also ‘UTC’ and ‘Z’ are supported as aliases of ‘+00:00’. Other short names are not recommended to use because they can be ambiguous.
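To see what the setting actually changes, here is a minimal sketch (the timestamp literal and values are only illustrative): the same instant is rendered according to the session timezone.

# The same instant, first rendered in UTC ...
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("select timestamp'2024-01-01 00:00:00 UTC' as ts").show()
# shows 2024-01-01 00:00:00

# ... then rendered as Asia/Shanghai wall-clock time (UTC+8).
spark.conf.set("spark.sql.session.timeZone", "Asia/Shanghai")
spark.sql("select timestamp'2024-01-01 00:00:00 UTC' as ts").show()
# shows 2024-01-01 08:00:00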
Then you can run display(spark.sql("select current_timezone()")) in a Databricks notebook cell to verify the change.
However, if you create a new notebook using the same cluster, the timezone setting does not persist.
ref:
- spark configuration: https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration
- set default timezone databricks: https://stackoverflow.com/a/68268838/6064933
- use different timezone for spark session: https://stackoverflow.com/a/68957356/6064933
In the cluster config
In the Advanced options section of the cluster settings page, under the Spark tab, you can add the following config:
spark.sql.session.timeZone Asia/Shanghai
This makes sure every notebook attached to this cluster has the correct timezone set up.
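As a quick sanity check (a sketch; run it in any notebook attached to the cluster), the session timezone should now default to the configured value:

# New sessions pick up the cluster-level Spark config.
print(spark.conf.get("spark.sql.session.timeZone"))
# Asia/Shanghai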
Databricks cluster init script
You can also set the timezone in an init script for the cluster. Just add something like the following to the cluster init script:
#!/bin/bash
timedatectl set-timezone Asia/Shanghai
Precedence of different settings
The precedence of the three methods above for setting the timezone, from highest to lowest:
setting in the notebook session > cluster config > init script
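For example (a sketch that assumes the cluster config already sets Asia/Shanghai), a notebook-level setting still wins for that session:

# Cluster config: spark.sql.session.timeZone Asia/Shanghai (assumed).
# A notebook-level override takes precedence for the current session:
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
print(spark.conf.get("spark.sql.session.timeZone"))
# America/Los_Angeles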
ref:
- what are init script: https://learn.microsoft.com/en-us/azure/databricks/init-scripts/
- cluster-scope init script: https://learn.microsoft.com/en-us/azure/databricks/init-scripts/cluster-scoped