A summary of different file systems in Databricks.

DBFS
#

DBFS (Databricks File System) is a distributed file system in Databricks that maps cloud object storage to a file system for ease of use.

There are two notations for file paths in DBFS. The first is the Spark API format: dbfs:/some/path/. The second is the file API format: /dbfs/some/path/. Both notations point to the same location; which one to use depends on the tool that reads the path.
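Since the two notations differ only in their prefix, converting between them is a small string operation. The sketch below illustrates this; the helper names to_file_api_path and to_spark_api_path are hypothetical and not part of any Databricks API.

```python
# Hypothetical helpers (not a Databricks API): convert between the two
# DBFS path notations by swapping the prefix.

SPARK_PREFIX = "dbfs:/"   # Spark API format, e.g. dbfs:/some/path/
FILE_PREFIX = "/dbfs/"    # file API format,  e.g. /dbfs/some/path/


def to_file_api_path(path: str) -> str:
    """Convert a Spark API path (dbfs:/...) to a file API path (/dbfs/...)."""
    if not path.startswith(SPARK_PREFIX):
        raise ValueError(f"not a Spark API DBFS path: {path!r}")
    return FILE_PREFIX + path[len(SPARK_PREFIX):]


def to_spark_api_path(path: str) -> str:
    """Convert a file API path (/dbfs/...) to a Spark API path (dbfs:/...)."""
    if not path.startswith(FILE_PREFIX):
        raise ValueError(f"not a file API DBFS path: {path!r}")
    return SPARK_PREFIX + path[len(FILE_PREFIX):]
```

A helper like this can be handy when the same location has to be handed first to Spark and then to plain Python code.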

Spark API format
#

We use the Spark API format when:

we are using Spark operations (df below is an existing DataFrame):
- df.write.format('csv').save('dbfs:/path/to/my-csv')
- spark.read.format('csv').load('dbfs:/path/to/my-csv')
we are using the dbutils.fs interface:
- dbutils.fs.ls('dbfs:/some/path')
- we can also use the %fs magic for DBFS operations: %fs ls dbfs:/path/to/file
we run a Spark SQL statement: SELECT * FROM JSON.`dbfs:/FileStore/shared_uploads/my-json-table`

ref:

select from a file: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-file.html

file API format
#

We use the file API format when:

we use shell commands: %sh ls -l /dbfs/path/to/file
we use Python code that is not related to Spark: f = open('/dbfs/path/to/file')
we use other packages that do not support Spark.
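As a minimal sketch of the non-Spark case, the function below reads a CSV file with the Python standard library only. The function name read_csv_rows is hypothetical; on Databricks the argument would be a file API path such as /dbfs/path/to/file.

```python
import csv


def read_csv_rows(path):
    """Read a CSV file with plain Python (no Spark) and return its rows.

    On Databricks, pass a file API path, e.g. '/dbfs/path/to/file'.
    """
    with open(path, newline="") as f:
        return list(csv.reader(f))


# Example call on Databricks (path is a placeholder):
# rows = read_csv_rows('/dbfs/path/to/file')
```

Because open() goes through the ordinary file system, a dbfs:/ path would not work here; the /dbfs/ form is needed.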

Local driver file system
#

To specify a file path on the local driver node in bash or Python code, we just use the path as is:

shell: %sh ls /usr/include/zlib.h
Python: f = open('/usr/include/zlib.h')

However, if we want to access the local driver file system through the dbutils.fs interface, we need to use the file:/path/to/file notation:

dbutils.fs.ls('file:/usr/include/zlib.h')
%fs ls file:/usr/include/zlib.h
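The file: notation is simply the absolute local path with a file: prefix. A small sketch of that mapping, using a hypothetical helper name (to_dbutils_local_path is not part of dbutils):

```python
def to_dbutils_local_path(path: str) -> str:
    """Prefix an absolute driver-local path with 'file:' for dbutils.fs.

    Hypothetical helper for illustration only.
    """
    if path.startswith("file:/"):
        return path  # already in file: notation
    if not path.startswith("/"):
        raise ValueError(f"expected an absolute local path: {path!r}")
    return "file:" + path


# Example call on Databricks:
# dbutils.fs.ls(to_dbutils_local_path('/usr/include/zlib.h'))
```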

move file from local file system to DBFS
#

We can also move a file from the local driver file system to DBFS by combining the file: and dbfs: notations:

dbutils.fs.mv("file:/tmp/some_file.csv", "dbfs:/tmp/my_file.csv")
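The same move can be wrapped in a small function that builds both URIs from plain absolute paths. This is only a sketch: move_local_file_to_dbfs is a hypothetical name, and the dbutils object is passed in explicitly so the function does not rely on the notebook's global.

```python
def move_local_file_to_dbfs(dbutils, local_path, dbfs_path):
    """Move a driver-local file to DBFS via dbutils.fs.mv.

    local_path and dbfs_path are absolute paths without a scheme,
    e.g. '/tmp/some_file.csv' and '/tmp/my_file.csv'.
    Returns the (source, destination) URIs that were used.
    """
    src = "file:" + local_path   # local driver file system
    dst = "dbfs:" + dbfs_path    # DBFS, Spark API format
    dbutils.fs.mv(src, dst)
    return src, dst


# Example call in a Databricks notebook:
# move_local_file_to_dbfs(dbutils, '/tmp/some_file.csv', '/tmp/my_file.csv')
```

Note that mv removes the source file; dbutils.fs.cp takes the same two arguments and keeps it.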

ref:

https://learn.microsoft.com/en-us/azure/databricks/files/download-internet-files#moving-data-with-dbutils
access file in local driver file system: https://learn.microsoft.com/en-us/azure/databricks/files/#--access-files-on-the-driver-filesystem

Workspace files
#

To read a file under a workspace folder in Spark, we need to use the full path with the file: notation. For example:

df = spark.read.csv('file:/some/csv/file/path/under/workspace', header=True)

ref:

workspace file: https://learn.microsoft.com/en-us/azure/databricks/files/workspace-interact

references
#

DBFS: https://learn.microsoft.com/en-gb/azure/databricks/dbfs/
DBFS default locations: https://learn.microsoft.com/en-gb/azure/databricks/dbfs/root-locations
DBFS vs files on local driver: https://www.youtube.com/watch?v=ThjvQt3QBvI
how to specify DBFS path: https://kb.databricks.com/dbfs/how-to-specify-dbfs-path
