Working with Databricks Workspace Files

Some observations and findings from working with Databricks workspace files.

How to read/access workspace files

For regular Python

How you can access workspace files depends on the Databricks Runtime (DBR) version. Take the following code:

with open('/Workspace/Users/<user-email>/path/to/file') as f:
    content = f.readlines()

print(content)

In DBR 10.4, I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/Workspace/Users/<user-email>/path/to/file'

Since DBR 11.3, files under the Databricks workspace can be accessed by their absolute paths (source here), so the code above prints the file content as expected. However, this does not apply to notebooks under the workspace (source here). I think this is fine, because most people don’t need to read notebooks directly.

Since DBR 14.0, as discussed later, the current working directory is the folder that contains the running notebook, so you can also use a relative path to access workspace files. For example, if test.py is in the same folder as the notebook, you can run the following code without error:

with open('./test.py', 'r') as f:
    content = f.readlines()
print(content)

For spark code

Spark code can also access workspace files. However, there are two requirements:

You must use the fully-qualified path for the workspace file, e.g., file:/Workspace/Users/<user-name>/<folder-name>/MOCK_DATA.csv.
The cluster can’t be in shared access mode. Otherwise, you will see the following error when trying to access workspace files:
java.lang.SecurityException: Cannot use com.databricks.backend.daemon.driver.WorkspaceLocalFileSystem - local filesystem access is forbidden

If both conditions are satisfied, you should be able to run the following code without error:

df = spark.read.csv("file:/Workspace/Users/<user-name>/<folder-name>/MOCK_DATA.csv", header=True)
display(df)
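
Since the `file:` prefix is easy to forget, a small helper can build the fully-qualified path from a plain workspace path. This is only a sketch: the function name `to_spark_file_uri` is made up for illustration, and the check that the path starts with /Workspace is my own assumption about what counts as a workspace file.

```python
from pathlib import PurePosixPath


def to_spark_file_uri(workspace_path: str) -> str:
    """Turn an absolute workspace path into a fully-qualified file: path.

    Hypothetical helper, not part of any Databricks API.
    """
    path = PurePosixPath(workspace_path)
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got {workspace_path!r}")
    # Assumption: workspace files always live under /Workspace.
    if path.parts[1:2] != ("Workspace",):
        raise ValueError(f"not a workspace path: {workspace_path!r}")
    return f"file:{path}"


# In a notebook on a single user cluster, the result can be passed to Spark:
# df = spark.read.csv(to_spark_file_uri("/Workspace/Users/<user-name>/<folder-name>/MOCK_DATA.csv"), header=True)
```

Relative paths are rejected on purpose, because spark.read does not resolve them against the notebook folder.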

Comparison

Yeah, Databricks just makes things f*king complicated. I am tearing my hair out trying to figure out all these rules and cases. Here is a comparison table (hopefully it makes things easier to understand):

DBR version and access mode	open() with absolute path	open() with relative path	spark.read with absolute path (file:/ prefix)	spark.read with relative path
DBR 11.3 single user	✅	not supported, cwd is not workspace folder	✅	❌, path must be absolute
DBR 11.3 shared	✅	not supported, cwd is not workspace folder	❌	❌, path must be absolute
DBR 14.1 single user	✅	✅	✅	❌, path must be absolute
DBR 14.1 shared	✅	✅	❌	❌, path must be absolute
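
The same table can be written as a small lookup function, which may be handy for guarding code that has to run on several clusters. This is a sketch that encodes only what I observed above: the function name, the dictionary keys, and the access mode labels "single_user" and "shared" are made up for illustration. Treating everything below 11.3 as unsupported is an assumption, since I only tested open() on 10.4.

```python
from typing import Dict, Tuple


def workspace_file_support(dbr_version: Tuple[int, int], access_mode: str) -> Dict[str, bool]:
    """Encode the comparison table as booleans (hypothetical helper)."""
    if access_mode not in ("single_user", "shared"):
        raise ValueError(f"unknown access mode: {access_mode!r}")
    # Absolute workspace paths work since DBR 11.3.
    absolute_ok = dbr_version >= (11, 3)
    return {
        "open_absolute": absolute_ok,
        # Since DBR 14.0 the cwd is the notebook folder, so relative paths work.
        "open_relative": dbr_version >= (14, 0),
        # Spark needs file:/Workspace/... and a cluster that is not in shared access mode.
        "spark_read_absolute": absolute_ok and access_mode == "single_user",
        # Spark never accepted a relative path in my tests.
        "spark_read_relative": False,
    }


print(workspace_file_support((14, 1), "shared"))
```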

ref:

how to access databricks workspace files: https://stackoverflow.com/q/77498069/6064933
read workspace files: https://learn.microsoft.com/en-us/azure/databricks/files/workspace-interact#read-data-workspace-files

Current working directory

In older DBR versions, the current working directory for Python code is /databricks/driver. To check the DBR version and your current working directory, use this:

import os

print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))
print(os.path.abspath('./'))

On my DBR 10.4 cluster in single user access mode, I see the following output (the directory is different in shared access mode):

10.4.x-scala2.12
/databricks/driver

Starting with DBR 14.0, the current working directory is the directory that contains the running notebook (source here). On a DBR 14.1 cluster, I see the following output:

14.1.x-scala2.12
/Workspace/Users/<user-email>/<current-folder-name>
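
If the same code has to run on several DBR versions, the version string printed above can be parsed to decide whether relative paths will resolve against the notebook folder. A minimal sketch, assuming the string always starts with `<major>.<minor>` as in the two outputs above; both function names are made up for illustration.

```python
import re
from typing import Tuple


def parse_dbr_version(spark_version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as '14.1.x-scala2.12'."""
    match = re.match(r"(\d+)\.(\d+)", spark_version)
    if match is None:
        raise ValueError(f"unrecognized DBR version string: {spark_version!r}")
    return int(match.group(1)), int(match.group(2))


def cwd_is_notebook_folder(spark_version: str) -> bool:
    """True when the cwd is expected to be the notebook folder (DBR 14.0 and later)."""
    return parse_dbr_version(spark_version) >= (14, 0)


# In a notebook, the input would come from:
# spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
print(parse_dbr_version("10.4.x-scala2.12"), cwd_is_notebook_folder("10.4.x-scala2.12"))
print(parse_dbr_version("14.1.x-scala2.12"), cwd_is_notebook_folder("14.1.x-scala2.12"))
```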

You can use a relative path to write and read files, but where the file ends up depends on the DBR version. For example, take the following code:

with open('./demo.txt', 'w') as f:
    f.write("hello world\n")

If you use 10.4, the file is saved in /databricks/driver/demo.txt, on the driver node. If you use 11.3, the file is saved in /home/spark-<some-random-string>/demo.txt. If you use 14.1, the file is saved in /Workspace/Users/<user-email>/<current-folder>/demo.txt.
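
Because the location of a relative path changes between DBR versions, one way to avoid surprises is to resolve the target to an absolute path before writing and to return that path. The sketch below is not Databricks-specific: the helper name `write_text` and its `base_dir` parameter are made up for illustration, and the demo writes to a temporary directory rather than the workspace.

```python
import os
import tempfile
from typing import Optional


def write_text(filename: str, text: str, base_dir: Optional[str] = None) -> str:
    """Write text under base_dir (default: current working directory).

    Returns the absolute path, so it is always clear where the file landed.
    """
    target_dir = os.path.abspath(base_dir if base_dir is not None else os.getcwd())
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, filename)
    with open(target, "w") as f:
        f.write(text)
    return target


# Demo in a temporary directory; in a notebook, base_dir could be an
# explicit /Workspace/Users/<user-email>/<current-folder> path instead.
with tempfile.TemporaryDirectory() as tmp:
    saved = write_text("demo.txt", "hello world\n", base_dir=tmp)
    print(saved)
```

Passing an explicit `base_dir` keeps the output location the same no matter which working directory the cluster uses.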

ref: