A summary of different file systems in Databricks.

DBFS

DBFS (Databricks File System) is a distributed file system in Databricks that maps cloud object storage to a file-system-like interface for ease of use.

There are two different notations for file paths in DBFS. The first is the Spark API format: dbfs:/some/path/. The second is the file API format: /dbfs/some/path/.
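The two notations point at the same location; converting between them is just a prefix swap. A minimal sketch of that conversion (the helper names here are my own for illustration, not a Databricks API):

```python
def to_file_api(path: str) -> str:
    """Convert a Spark API path (dbfs:/...) to the file API form (/dbfs/...)."""
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    return path  # already a plain path; leave it alone

def to_spark_api(path: str) -> str:
    """Convert a file API path (/dbfs/...) to the Spark API form (dbfs:/...)."""
    if path.startswith("/dbfs/"):
        return "dbfs:/" + path[len("/dbfs/"):]
    return path

print(to_file_api("dbfs:/some/path"))   # /dbfs/some/path
print(to_spark_api("/dbfs/some/path"))  # dbfs:/some/path
```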

Spark API format

We use the spark API format when:

  • we use Spark read/write operations:
    • df.write.format('csv').save('dbfs:/path/to/my-csv')
    • spark.read.format('csv').load('dbfs:/path/to/my-csv')
  • we use the dbutils.fs interface:
    • dbutils.fs.ls('dbfs:/some/path')
    • we can also use the %fs magic for DBFS operations: %fs ls dbfs:/path/to/file
  • we run a Spark SQL statement: SELECT * FROM json.`dbfs:/FileStore/shared_uploads/my-json-table`

File API format

We use the file API format when:

  • we use shell commands: %sh ls -l /dbfs/path/to/file
  • we use Python code that is not related to Spark: f = open('/dbfs/path/to/file')
  • we use other packages that do not support Spark.
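Because /dbfs/... behaves like an ordinary local path, plain Python I/O works on it unchanged. A hedged sketch, using a temporary directory to stand in for /dbfs (the real mount only exists on a Databricks cluster):

```python
import csv
import tempfile
from pathlib import Path

# On a cluster this would be something like '/dbfs/path/to/file.csv';
# here a temp dir stands in, since /dbfs only exists on Databricks.
base = Path(tempfile.mkdtemp())
target = base / "file.csv"

# Write with plain Python -- no Spark involved.
with open(target, "w", newline="") as f:
    csv.writer(f).writerows([["id", "name"], ["1", "alice"]])

# Read it back the same way any non-Spark package would.
with open(target) as f:
    rows = list(csv.reader(f))
print(rows)  # [['id', 'name'], ['1', 'alice']]
```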

Local driver file system

When we want to specify a file path on the local driver node in Bash or Python code, we just use the path as is:

  • shell: %sh ls /usr/include/zlib.h
  • python: f = open('/usr/include/zlib.h')

However, if we want to access the local driver file system using the dbutils.fs interface, we need to use the file:/path/to/file notation:

  • dbutils.fs.ls('file:/usr/include/zlib.h')
  • %fs ls file:/usr/include/zlib.h

Move a file from the local file system to DBFS

We can also move a file from the local file system to DBFS:

dbutils.fs.mv("file:/tmp/some_file.csv", "dbfs:/tmp/my_file.csv")
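dbutils is only available on a cluster, but the move itself is an ordinary file move between the two path styles. A local sketch of the same operation with the standard library, using temp paths as stand-ins for file:/tmp/some_file.csv and /dbfs/tmp/my_file.csv:

```python
import shutil
import tempfile
from pathlib import Path

# Stand-ins for the source ('file:/tmp/some_file.csv') and the DBFS
# destination ('/dbfs/tmp/my_file.csv'); the real /dbfs mount only
# exists on a Databricks cluster.
src = Path(tempfile.mkdtemp()) / "some_file.csv"
dst = Path(tempfile.mkdtemp()) / "my_file.csv"

src.write_text("a,b\n1,2\n")
shutil.move(str(src), str(dst))  # local analogue of dbutils.fs.mv

print(dst.read_text())  # a,b\n1,2\n
print(src.exists())     # False -- the source is gone after the move
```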

Workspace files

To read a file under a workspace folder in Spark, we need to use the full path with the file:/ notation. For example:

df = spark.read.csv('file:/some/csv/file/path/under/workspace', header=True)
