A summary of different file systems in Databricks.
DBFS#
DBFS (the Databricks File System) is a distributed file system that maps cloud object storage to file system paths for ease of use.
There are two different styles of notation when we work with file paths in DBFS.
The first is the Spark API format: dbfs:/some/path/.
The second is the file API format: /dbfs/some/path/.
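The two notations refer to the same locations, so converting between them is a pure string operation. A minimal sketch (the helper names are my own, not a Databricks API):

```python
def spark_api_path(path):
    """Convert a file API path (/dbfs/...) to the Spark API form (dbfs:/...)."""
    if path.startswith('/dbfs/'):
        return 'dbfs:/' + path[len('/dbfs/'):]
    return path

def file_api_path(path):
    """Convert a Spark API path (dbfs:/...) to the file API form (/dbfs/...)."""
    if path.startswith('dbfs:/'):
        return '/dbfs/' + path[len('dbfs:/'):]
    return path

print(spark_api_path('/dbfs/some/path'))  # dbfs:/some/path
print(file_api_path('dbfs:/some/path'))   # /dbfs/some/path
```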
Spark API format#
We use the Spark API format when:
- we are using Spark operations (note that write goes through a DataFrame, not the SparkSession):
df.write.format('csv').save('dbfs:/path/to/my-csv')
spark.read.format('csv').load('dbfs:/path/to/my-csv')
- we are using the dbutils.fs interface:
dbutils.fs.ls('dbfs:/some/path')
- we are using the %fs magic for DBFS operations:
%fs ls dbfs:/path/to/file
- we run a Spark SQL statement that selects directly from a file:
SELECT * FROM json.`dbfs:/FileStore/shared_uploads/my-json-table`
ref:
- select from a file: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-file.html
File API format#
We use the file API format when:
- we use shell commands:
%sh ls -l /dbfs/path/to/file
- we use Python code that is not related to Spark:
f = open('/dbfs/path/to/file')
- we use other packages that do not support Spark.
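Because the file API format is a regular POSIX path, any plain-Python library can consume it directly. A sketch using the standard csv module (the /dbfs path in the comment is illustrative):

```python
import csv

def read_csv_rows(path):
    """Read a CSV file with plain Python (no Spark involved).

    On Databricks, pass a file API path such as '/dbfs/tmp/my_file.csv'.
    """
    with open(path, newline='') as f:
        return list(csv.reader(f))

# On Databricks (hypothetical path):
# rows = read_csv_rows('/dbfs/tmp/my_file.csv')
```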
Local driver file system#
When we want to specify a file path on the local driver node in bash or Python code, we just use the path as is:
- shell:
%sh ls /usr/include/zlib.h
- python:
f = open('/usr/include/zlib.h')
However, if we want to access the local driver file system through the dbutils.fs interface or the %fs magic, we need to use the file:/path/to/file notation:
dbutils.fs.ls('file:/usr/include/zlib.h')
%fs ls file:/usr/include/zlib.h
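Since omitting the file: prefix makes dbutils.fs interpret the path as a DBFS path, it can help to wrap the scheme in a tiny helper (my own convenience function, not part of dbutils):

```python
def to_file_uri(local_path):
    """Prefix a driver-local path with the file: scheme so that
    dbutils.fs operates on the local file system instead of DBFS."""
    if local_path.startswith('file:'):
        return local_path
    return 'file:' + local_path

# On Databricks:
# dbutils.fs.ls(to_file_uri('/usr/include/zlib.h'))
print(to_file_uri('/usr/include/zlib.h'))  # file:/usr/include/zlib.h
```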
Move a file from the local file system to DBFS#
We can also move a file from the local file system to DBFS:
dbutils.fs.mv("file:/tmp/some_file.csv", "dbfs:/tmp/my_file.csv")
ref:
- https://learn.microsoft.com/en-us/azure/databricks/files/download-internet-files#moving-data-with-dbutils
- access file in local driver file system: https://learn.microsoft.com/en-us/azure/databricks/files/#--access-files-on-the-driver-filesystem
Workspace files#
To read a file under a workspace folder in Spark, we need to use the full path with the file: notation.
For example:
df = spark.read.csv('file:/some/csv/file/path/under/workspace', header=True)
references#
- DBFS: https://learn.microsoft.com/en-gb/azure/databricks/dbfs/
- DBFS default locations: https://learn.microsoft.com/en-gb/azure/databricks/dbfs/root-locations
- DBFS vs files on local driver: https://www.youtube.com/watch?v=ThjvQt3QBvI
- how to specify DBFS path: https://kb.databricks.com/dbfs/how-to-specify-dbfs-path