Some observations and findings from working with Databricks workspace files.
How to read/access workspace files#
For regular Python#
The behavior when accessing workspace files also differs based on the Databricks Runtime (abbreviated DBR) version. Consider the following code:
```python
with open('/Workspace/Users/<user-email>/path/to/file') as f:
    content = f.readlines()
print(content)
```
In DBR 10.4, I get the following error:
```
FileNotFoundError: [Errno 2] No such file or directory: '/Workspace/Users/<user-email>/path/to/file'
```
Since DBR 11.3, we can access files under the Databricks workspace using their absolute paths (source here). So the above code should work as expected and print the file content. However, this does not apply to notebooks under the workspace (source here). I think this is fine, because most people don't need to read notebooks directly.
Since DBR 14.0, as discussed later, the current working directory is changed to the folder where the notebook runs, so you can additionally use relative paths to access workspace files. For example, if there is test.py in the same folder as the notebook, you can run the following code without error:
```python
with open('./test.py', 'r') as f:
    content = f.readlines()
print(content)
```
For Spark code#
For Spark code, it is also possible to access workspace files. However, there are two requirements:
- You must use the fully-qualified path for the workspace files, e.g., the path should be something like `file:/Workspace/Users/<user-name>/<folder-name>/MOCK_DATA.csv`.
- The cluster can't be in shared access mode; otherwise, you will see the following error when trying to access the workspace files:

```
java.lang.SecurityException: Cannot use com.databricks.backend.daemon.driver.WorkspaceLocalFileSystem - local filesystem access is forbidden
```
If both conditions are satisfied, you should be able to run the following code without error:

```python
df = spark.read.csv("file:/Workspace/Users/<user-name>/<folder-name>/MOCK_DATA.csv", header=True)
display(df)
```
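The first requirement (the `file:` scheme prefix) can be wrapped in a tiny helper. This is a hypothetical sketch — `to_file_uri` is my own name, not a Spark or Databricks API:

```python
def to_file_uri(workspace_path):
    """Prefix a workspace path with the file: scheme that spark.read requires.

    Hypothetical helper for illustration; Spark does not provide this itself.
    """
    if not workspace_path.startswith("/Workspace/"):
        raise ValueError(f"not a workspace path: {workspace_path}")
    return "file:" + workspace_path
```

You would then pass `to_file_uri("/Workspace/Users/...")` to `spark.read.csv()` instead of hand-writing the prefix each time.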
Comparison#
Yeah, Databricks just makes things f*king complicated. I am tearing my hair out trying to figure out these complicated rules and cases.
Here is a comparison table (hopefully it makes it easier to understand):
DBR versions | `open()` with absolute path | `open()` with relative path | `spark.read` with absolute path | `spark.read` with relative path |
---|---|---|---|---|
DBR 11.3 single user | ✅ | ❌, cwd is not the workspace folder | ✅ | ❌, path must be absolute |
DBR 11.3 shared | ✅ | ❌, cwd is not the workspace folder | ❌ | ❌, path must be absolute |
DBR 14.1 single user | ✅ | ✅ | ✅ | ❌, path must be absolute |
DBR 14.1 shared | ✅ | ✅ | ❌ | ❌, path must be absolute |
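These version-dependent rules can be captured in a small helper. The following is my own sketch (`supports_relative_open` and `dbr_version` are made-up names), parsing the version string returned by `spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")` as shown later in this post:

```python
def dbr_version(spark_version_string):
    """Parse (major, minor) from a DBR version string like '14.1.x-scala2.12'."""
    major, minor = spark_version_string.split("-")[0].split(".")[:2]
    return int(major), int(minor)

def supports_relative_open(spark_version_string):
    """Per the table above: open() with a relative workspace path works from DBR 14.0 on."""
    return dbr_version(spark_version_string) >= (14, 0)
```

On a real cluster you would feed in the live version string; here the parsing logic is all that matters.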
ref:
- how to access databricks workspace files: https://stackoverflow.com/q/77498069/6064933
- read workspace files: https://learn.microsoft.com/en-us/azure/databricks/files/workspace-interact#read-data-workspace-files
Current working directory#
In older DBR versions, when you run Python code, the current working directory is /databricks/driver.
To check the DBR version and your current working directory, use this:
```python
import os

print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))
print(os.path.abspath('./'))
```
In my DBR 10.4 cluster (single user access mode; the directory is different if you use shared access mode), I see the following output:

```
10.4.x-scala2.12
/databricks/driver
```
Starting in DBR 14.0, the current working directory is changed to the directory where the notebook runs (source here). On a DBR 14.1 cluster, I see the following output:

```
14.1.x-scala2.12
/Workspace/Users/<user-email>/<current-folder-name>
```
You can use relative paths to write and read files, but the file location differs across DBR versions. For example, consider the following code:

```python
with open('./demo.txt', 'w') as f:
    f.write("hello world\n")
```
If you use 10.4, the file is saved in /databricks/driver/demo.txt, on the driver node.
If you use 11.3, the file is saved in /home/spark-<some-random-string>/demo.txt.
If you use 14.1, the file is saved in /Workspace/Users/<user-email>/<current-folder>/demo.txt.
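The three cases above can be summarized in one function. This is purely an illustration (`demo_txt_location` is my own helper, and the 11.3 / 14.x return values contain placeholders, since the real paths depend on a random suffix and the notebook's folder, respectively):

```python
def demo_txt_location(spark_version_string):
    """Where open('./demo.txt', 'w') lands, per the observations above.

    Illustration only: the returned 11.3 / 14.x paths keep the placeholders
    from the text because the actual values vary per cluster and notebook.
    """
    major, minor = spark_version_string.split("-")[0].split(".")[:2]
    version = (int(major), int(minor))
    if version >= (14, 0):
        return "/Workspace/Users/<user-email>/<current-folder>/demo.txt"
    if version >= (11, 3):
        return "/home/spark-<some-random-string>/demo.txt"
    return "/databricks/driver/demo.txt"
```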
ref:
- default current working directory in dbr 14.0: https://learn.microsoft.com/en-us/azure/databricks/files/cwd-dbr-14