File Systems in Databricks
A summary of different file systems in Databricks.
DBFS
DBFS (Databricks File System) is a distributed file system that maps cloud object storage to file-system paths for ease of use.
There are two different styles of notation when we work with file paths in DBFS.
The first is the Spark API format: dbfs:/some/path/.
The second is the file API format: /dbfs/some/path/.
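The two notations refer to the same location, so converting between them is purely mechanical. As a rough sketch (these helper names are my own, not part of any Databricks API):

```python
def to_file_api(path: str) -> str:
    """Convert a Spark API path (dbfs:/...) to the file API form (/dbfs/...)."""
    if path.startswith('dbfs:/'):
        return '/dbfs/' + path[len('dbfs:/'):].lstrip('/')
    return path  # already a plain path; leave untouched

def to_spark_api(path: str) -> str:
    """Convert a file API path (/dbfs/...) to the Spark API form (dbfs:/...)."""
    if path.startswith('/dbfs/'):
        return 'dbfs:/' + path[len('/dbfs/'):]
    return path

print(to_file_api('dbfs:/some/path'))   # /dbfs/some/path
print(to_spark_api('/dbfs/some/path'))  # dbfs:/some/path
```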
Spark API format
We use the Spark API format when:
- we are using Spark operations:
  spark.write.format('csv').save('dbfs:/path/to/my-csv')
  spark.read.load('dbfs:/path/to/my-csv')
- we are using the dbutils.fs interface:
  dbutils.fs.ls('dbfs:/some/path')
- we are using the %fs magic for DBFS operations:
  %fs ls dbfs:/path/to/file
- we run a Spark SQL statement:
  SELECT * FROM JSON.`dbfs:/FileStore/shared_uploads/my-json-table`
ref:
- select from a file: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-file.html
File API format
We use the file API format when:
- we use a shell command:
  %sh ls -l /dbfs/path/to/file
- we use Python code that is not related to Spark:
  f = open('/dbfs/path/to/file')
- we use other packages that do not support Spark.
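Because DBFS is FUSE-mounted at /dbfs on the driver, non-Spark code can treat such paths as ordinary files. A minimal sketch (the DBFS path is hypothetical, and the code falls back to a local temp file when run outside Databricks):

```python
import os
import tempfile

# Hypothetical DBFS path in file API format; plain open() works on Databricks
# because the DBFS root is FUSE-mounted at /dbfs on the driver node.
path = '/dbfs/tmp/example.txt'

# Fallback so the sketch also runs outside Databricks, where /dbfs is absent.
if not os.path.isdir('/dbfs'):
    path = os.path.join(tempfile.gettempdir(), 'example.txt')

# Ordinary Python file I/O, no Spark involved.
with open(path, 'w') as f:
    f.write('hello from the file API format\n')

with open(path) as f:
    content = f.read().strip()

print(content)  # hello from the file API format
```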
Local driver file system
When we want to specify a file path on the local driver node, in Bash or Python code, we just use the path as is:
- shell:
  %sh ls /usr/include/zlib.h
- Python:
  f = open('/usr/include/zlib.h')
However, if we want to access the local driver file system using the dbutils.fs interface,
we need to use the file:/path/to/file notation:
  dbutils.fs.ls('file:/usr/include/zlib.h')
  %fs ls file:/usr/include/zlib.h
Move a file from the local file system to DBFS
We can also move a file from the local file system to DBFS:
dbutils.fs.mv("file:/tmp/some_file.csv", "dbfs:/tmp/my_file.csv")
ref:
- https://learn.microsoft.com/en-us/azure/databricks/files/download-internet-files#moving-data-with-dbutils
- access file in local driver file system: https://learn.microsoft.com/en-us/azure/databricks/files/#--access-files-on-the-driver-filesystem
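Putting the pieces together, a common pattern from the refs above is: create (or download) a file on the local driver disk, then move it into DBFS with dbutils.fs.mv. The sketch below uses hypothetical paths, and falls back to shutil.move when dbutils is not defined (i.e. outside a Databricks runtime):

```python
import os
import shutil
import tempfile

# Stage a small file on the local driver file system (path is hypothetical).
local_path = os.path.join(tempfile.gettempdir(), 'some_file.csv')
with open(local_path, 'w') as f:
    f.write('a,b\n1,2\n')

try:
    # On Databricks: note the file:/ prefix for the local source path.
    dbutils.fs.mv('file:' + local_path, 'dbfs:/tmp/my_file.csv')
    target = '/dbfs/tmp/my_file.csv'
except NameError:
    # Outside Databricks there is no dbutils; mimic the move locally.
    target = os.path.join(tempfile.gettempdir(), 'my_file.csv')
    shutil.move(local_path, target)

# After a move the source is gone and the destination exists.
print(os.path.exists(local_path), os.path.exists(target))  # False True
```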
Workspace files
To read a file under a workspace folder with Spark, we need to use the full path with the file: notation.
For example:
df = spark.read.csv('file:/some/csv/file/path/under/workspace', header=True)
References
- DBFS: https://learn.microsoft.com/en-gb/azure/databricks/dbfs/
- DBFS default locations: https://learn.microsoft.com/en-gb/azure/databricks/dbfs/root-locations
- DBFS vs files on local driver: https://www.youtube.com/watch?v=ThjvQt3QBvI
- how to specify DBFS path: https://kb.databricks.com/dbfs/how-to-specify-dbfs-path