
Databricks Init Scripts


General

An init script is just a shell script that runs on each node in the cluster before the Apache Spark driver or executor JVM starts.

A cluster can have multiple init scripts if you want. These init scripts are executed in the order provided.

Cluster-scoped init script

If your cluster is not in edit mode, you cannot see the button to add an init script. Click Edit on the cluster configuration page first, and then you can add init scripts to the cluster settings.

Note that if you use a workspace folder to store the init script, you do not need to specify the top-level /Workspace in the script path. For example, if your init script path is /Workspace/Shared/cluster-conf.sh and you choose Workspace as the source in the cluster init setup, the path you need to fill in is /Shared/cluster-conf.sh.

You can also store the init script in a DBFS or ABFSS path, but DBFS-backed init scripts are being deprecated by Databricks. Here is how you can create and update the content of the init script if you use DBFS:

# mkdirs creates the directory (and any missing parents) if it does not exist
dbutils.fs.mkdirs("dbfs:/databricks/cluster-init/")

# Write the script content; keep the shebang on the very first line of the file
dbutils.fs.put(
    "dbfs:/databricks/cluster-init/cluster-init.sh",
    """#!/bin/bash
timedatectl set-timezone Asia/Shanghai
""",
    overwrite=True,
)

For an ABFSS path, the setup is the same.
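For instance, a minimal sketch of writing the same script to an ABFSS location; the storage account, container, and file names below are placeholders, and the cluster must already have credentials configured for the storage account:

dbutils.fs.put(
    # Hypothetical ABFSS destination for the init script
    "abfss://init-scripts@examplestorage.dfs.core.windows.net/cluster-init.sh",
    """#!/bin/bash
timedatectl set-timezone Asia/Shanghai
""",
    overwrite=True,
)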

Environment variables

Databricks exposes some environment variables to init scripts. For the details, refer to the Databricks documentation.
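For example, two documented variables are DB_IS_DRIVER (whether the script is running on the driver node) and DB_CLUSTER_ID. Below is a minimal sketch of an init script that uses them to run a step only on the driver; the script path and the log file it writes to are just placeholders:

dbutils.fs.put(
    "dbfs:/databricks/cluster-init/driver-only.sh",
    """#!/bin/bash
# DB_IS_DRIVER is "TRUE" only on the driver node
if [[ "$DB_IS_DRIVER" = "TRUE" ]]; then
  echo "driver node of cluster $DB_CLUSTER_ID" >> /tmp/init-info.log
fi
""",
    overwrite=True,
)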

Global init script

Databricks officially does not recommend using global init scripts.

Init script logs

There may be errors when running the init scripts. The logging path for init scripts can be configured in the advanced configuration section of the cluster: go to the cluster's edit mode, and under Advanced options, in the Logging tab, configure the log path. For example, the default option is dbfs:/cluster-logs.

The init script log path has the following format:

dbfs:/cluster-logs/<cluster-id>/init_scripts/<cluster-id>_<node-ip>

Under the above path, there will be logs for both stdout and stderr. You can check them using the %fs magic or the dbutils.fs utilities like this:

%fs
head dbfs:/cluster-logs/<cluster-id>/init_scripts/<cluster-id>_<node-ip>/<time-stamp>_<init-script-name>.stderr.log
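If you are not sure which log files exist, you can list the directory first from a notebook, for example (the cluster id and node IP are placeholders, as above):

# List all init script log files for one node of the cluster
for f in dbutils.fs.ls("dbfs:/cluster-logs/<cluster-id>/init_scripts/<cluster-id>_<node-ip>/"):
    print(f.path)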

