Some observations and findings from working with Databricks workspace files.

How to read/access workspace files

For regular Python

How you access a workspace file depends on the Databricks Runtime (DBR) version. Take the following code:

with open('/Workspace/Users/<user-email>/path/to/file') as f:
    content = f.readlines()

print(content)

In DBR 10.4, I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/Workspace/Users/<user-email>/path/to/file'

Since DBR 11.3, we can access files under the Databricks workspace using their absolute paths (source here), so the code above should print the file content as expected. However, this does not apply to notebooks under the workspace (source here). I think this is fine, because most people don't need to read notebooks directly.

Since DBR 14.0, as discussed later, the current working directory is the folder where the notebook runs, so you can additionally use a relative path to access workspace files. For example, if there is a test.py in the same folder as the notebook, you can run the following code without error:

with open('./test.py', 'r') as f:
    content = f.readlines()
print(content)
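
If the same notebook has to run on clusters both older and newer than DBR 14.0, one option is to not rely on the current working directory at all and build the absolute path yourself. Below is a minimal sketch of that idea. The names workspace_path and read_lines are made up for illustration, and it assumes you pass in the notebook folder by hand.

```python
import posixpath


def workspace_path(notebook_dir, relative_path):
    """Join a notebook folder and a relative path into one absolute path.

    Hypothetical helper: notebook_dir is the absolute workspace folder of
    the notebook, supplied by the caller.
    """
    if not posixpath.isabs(notebook_dir):
        raise ValueError("notebook_dir must be an absolute path")
    # normpath collapses './' and '../' segments in the joined path
    return posixpath.normpath(posixpath.join(notebook_dir, relative_path))


def read_lines(path):
    """Return the lines of a file, or None if the file cannot be found."""
    try:
        with open(path, "r") as f:
            return f.readlines()
    except FileNotFoundError:
        # This is the error DBR 10.4 raises for /Workspace paths
        return None
```

With this, read_lines(workspace_path('/Workspace/Users/<user-email>/<folder>', './test.py')) reads the same file whether or not the cwd is the notebook folder, and returns None instead of raising when the file is not reachable.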

For Spark code

Spark code can also access workspace files. However, there are two requirements:

  • you must use the fully-qualified path for the workspace file, e.g., something like file:/Workspace/Users/<user-name>/<folder-name>/MOCK_DATA.csv
  • the cluster can't be in shared access mode; otherwise, you will see the following error when trying to access workspace files:

    java.lang.SecurityException: Cannot use com.databricks.backend.daemon.driver.WorkspaceLocalFileSystem - local filesystem access is forbidden

If both conditions are satisfied, you should be able to run the following code without error:

df = spark.read.csv("file:/Workspace/Users/<user-name>/<folder-name>/MOCK_DATA.csv", header=True)
display(df)
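
Since it is easy to forget the file: prefix, a tiny helper can add it for you. This is only a sketch: to_spark_file_uri is a hypothetical name, and all it does is string handling. It does not check the access mode of the cluster.

```python
def to_spark_file_uri(path):
    """Prefix an absolute local path with the 'file:' scheme.

    Hypothetical helper for illustration; it only manipulates the string.
    """
    if path.startswith("file:"):
        # Already fully qualified, return unchanged
        return path
    if not path.startswith("/"):
        raise ValueError("expected an absolute path, got: " + path)
    return "file:" + path
```

You would then call spark.read.csv(to_spark_file_uri('/Workspace/Users/<user-name>/<folder-name>/MOCK_DATA.csv'), header=True). Rejecting relative paths up front gives a clearer error than letting Spark resolve them somewhere unexpected.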

Comparison

Yeah, Databricks just makes things f*king complicated. I was tearing my hair out trying to figure out all these rules and cases. Here is a comparison table (hopefully it makes things easier to understand):

| DBR version | open() with absolute path | open() with relative path | spark.read with absolute path | spark.read with relative path |
| --- | --- | --- | --- | --- |
| DBR 11.3 single user | | not supported, cwd is not workspace folder | | ❌, path must be absolute |
| DBR 11.3 shared | | not supported, cwd is not workspace folder | | ❌, path must be absolute |
| DBR 14.1 single user | | | | ❌, path must be absolute |
| DBR 14.1 shared | | | | ❌, path must be absolute |

Current working directory

In older DBR versions, when you run Python code, the current working directory is /databricks/driver. To check the DBR version and your current working directory, use this:

import os

print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))
print(os.path.abspath('./'))

On my DBR 10.4 cluster (single user access mode; the directory is different in shared access mode), I see the following output:

10.4.x-scala2.12
/databricks/driver

Starting in DBR 14.0, the current working directory is the directory where the notebook runs (source here). On a DBR 14.1 cluster, I see the following output:

14.1.x-scala2.12
/Workspace/Users/<user-email>/<current-folder-name>
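
If your code needs to branch on this behavior, you can parse the version string instead of eyeballing it. Here is a rough sketch. It assumes the string always starts with major.minor like the two outputs above (other formats may exist), and parse_dbr_version and cwd_is_notebook_folder are names invented for this example.

```python
import re


def parse_dbr_version(spark_version):
    """Extract (major, minor) from a string like '14.1.x-scala2.12'.

    Assumes the string starts with '<major>.<minor>'.
    """
    match = re.match(r"(\d+)\.(\d+)", spark_version)
    if match is None:
        raise ValueError("unrecognized version string: " + spark_version)
    return int(match.group(1)), int(match.group(2))


def cwd_is_notebook_folder(spark_version):
    """True if the cwd should be the notebook folder (DBR 14.0 or later)."""
    # Tuple comparison handles major and minor in one step
    return parse_dbr_version(spark_version) >= (14, 0)
```

In a notebook you would feed it the value of spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion").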

You can use a relative path to write and read files, but where the file ends up depends on the DBR version. For example, take the following code:

with open('./demo.txt', 'w') as f:
    f.write("hello world\n")

The file ends up in a different place on each version:

  • On 10.4, the file is saved to /databricks/driver/demo.txt, on the driver node.
  • On 11.3, the file is saved to /home/spark-<some-random-string>/demo.txt.
  • On 14.1, the file is saved to /Workspace/Users/<user-email>/<current-folder>/demo.txt.
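
To keep the output location the same across DBR versions, you can write to an explicit directory instead of a relative path. This is a minimal sketch using only the standard library. write_text is a hypothetical helper and target_dir is whatever absolute folder you choose; whether that folder is writable on your cluster is something you need to check yourself.

```python
import os


def write_text(target_dir, file_name, text):
    """Write text to target_dir/file_name and return the full path.

    Hypothetical helper: the caller picks target_dir explicitly, so the
    result does not depend on the current working directory.
    """
    # Create the folder if it does not exist yet
    os.makedirs(target_dir, exist_ok=True)
    path = os.path.join(target_dir, file_name)
    with open(path, "w") as f:
        f.write(text)
    return path
```

Returning the path makes it easy to print where the file actually went, which is handy when comparing clusters.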
