
File Systems in Databricks


A summary of different file systems in Databricks.

DBFS
#

DBFS (Databricks File System) is a distributed file system in Databricks. It maps cloud storage to file system paths for ease of use.

There are two styles of notation for file paths in DBFS. The first is the Spark API format: dbfs:/some/path/. The second is the file API format: /dbfs/some/path/. Both refer to the same location; which one to use depends on the tool that consumes the path.
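Since the two notations differ only in their prefix, converting between them is a plain string operation. The following is a minimal sketch; the helper names to_file_api_path and to_spark_api_path are hypothetical and are not part of any Databricks API.

```python
SPARK_PREFIX = "dbfs:/"  # Spark API format
FILE_PREFIX = "/dbfs/"   # file API format


def to_file_api_path(path: str) -> str:
    """Convert 'dbfs:/some/path' to '/dbfs/some/path' (hypothetical helper)."""
    if path.startswith(FILE_PREFIX):
        return path  # already in file API format
    if not path.startswith(SPARK_PREFIX):
        raise ValueError(f"not a DBFS path: {path!r}")
    return FILE_PREFIX + path[len(SPARK_PREFIX):]


def to_spark_api_path(path: str) -> str:
    """Convert '/dbfs/some/path' to 'dbfs:/some/path' (hypothetical helper)."""
    if path.startswith(SPARK_PREFIX):
        return path  # already in Spark API format
    if not path.startswith(FILE_PREFIX):
        raise ValueError(f"not a DBFS path: {path!r}")
    return SPARK_PREFIX + path[len(FILE_PREFIX):]


print(to_file_api_path("dbfs:/some/path/"))   # /dbfs/some/path/
print(to_spark_api_path("/dbfs/some/path/"))  # dbfs:/some/path/
```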

Spark API format
#

We use the Spark API format when:

  • we use Spark read and write operations (df is a DataFrame):
    • df.write.format('csv').save('dbfs:/path/to/my-csv')
    • spark.read.format('csv').load('dbfs:/path/to/my-csv')
  • we use the dbutils.fs interface:
    • dbutils.fs.ls('dbfs:/some/path')
    • the %fs magic works the same way for DBFS operations: %fs ls dbfs:/path/to/file
  • we run a Spark SQL statement: SELECT * FROM JSON.`dbfs:/FileStore/shared_uploads/my-json-table`
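The Spark and dbutils.fs cases above can be combined into one short round trip. This is only a sketch: the function name spark_api_roundtrip, the sample rows, and the column names are made up for illustration, and spark and dbutils are passed in as arguments so that nothing depends on notebook globals.

```python
def spark_api_roundtrip(spark, dbutils, path="dbfs:/path/to/my-csv"):
    """Write a small CSV to a dbfs:/ path, read it back, and list the output."""
    # Build a tiny DataFrame (sample data is illustrative only).
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # Writing goes through the DataFrame, using the Spark API path format.
    df.write.format("csv").mode("overwrite").option("header", True).save(path)

    # Reading uses the same dbfs:/ path.
    loaded = spark.read.format("csv").option("header", True).load(path)

    # dbutils.fs accepts the same notation.
    files = dbutils.fs.ls(path)
    return loaded, files
```

In a Databricks notebook, where spark and dbutils are already defined, this would be called as spark_api_roundtrip(spark, dbutils).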


file API format
#

We use the file API format when:

  • we use shell commands: %sh ls -l /dbfs/path/to/file
  • we use Python code that is not related to Spark: f = open('/dbfs/path/to/file')
  • we use other packages that do not support Spark.
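As an illustration of the non-Spark Python case, the sketch below reads a CSV file with only the standard library. The function name count_csv_rows is hypothetical; on Databricks the argument would be a file API path such as /dbfs/path/to/file, but the function works on any local path.

```python
import csv
from pathlib import Path


def count_csv_rows(path: str) -> int:
    """Count the data rows of a CSV file, skipping the first (header) row."""
    with Path(path).open(newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row if the file is not empty
        return sum(1 for _ in reader)
```

Because the standard library knows nothing about the dbfs:/ scheme, the path passed in must use the /dbfs/ form.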

Local driver file system
#

When we want to specify a file path on the local driver node in shell or Python code, we use the path as is:

  • shell: %sh ls /usr/include/zlib.h
  • python: f = open('/usr/include/zlib.h')

However, if we want to access the local driver file system through the dbutils.fs interface, we need to use the file:/path/to/file notation:

  • dbutils.fs.ls('file:/usr/include/zlib.h')
  • %fs ls file:/usr/include/zlib.h

Move a file from the local file system to DBFS
#

We can also move a file from the local driver file system to DBFS. The source uses the file: notation and the destination uses the dbfs: notation:

dbutils.fs.mv("file:/tmp/some_file.csv", "dbfs:/tmp/my_file.csv")
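The same move can be wrapped in a small helper that builds both URIs from plain absolute paths. This is a sketch under assumptions: move_local_to_dbfs is a hypothetical name, and dbutils is passed in explicitly rather than taken from the notebook environment.

```python
def move_local_to_dbfs(dbutils, local_path: str, dbfs_path: str) -> str:
    """Move a driver-local file into DBFS and return the dbfs:/ destination."""
    if not local_path.startswith("/") or not dbfs_path.startswith("/"):
        raise ValueError("both paths must be absolute, e.g. /tmp/some_file.csv")

    src = "file:" + local_path  # e.g. file:/tmp/some_file.csv
    dst = "dbfs:" + dbfs_path   # e.g. dbfs:/tmp/my_file.csv

    # mv removes the source after the transfer; dbutils.fs.cp would keep it.
    dbutils.fs.mv(src, dst)
    return dst
```

In a notebook this could be called as move_local_to_dbfs(dbutils, "/tmp/some_file.csv", "/tmp/my_file.csv").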


Workspace files
#

To read a file under a workspace folder in Spark, we need to use the full path with the file: notation. For example:

df = spark.read.csv('file:/some/csv/file/path/under/workspace', header=True)


references
#