
File Systems in Databricks

A summary of different file systems in Databricks.

DBFS

DBFS (the Databricks File System) is a distributed file system that maps cloud object storage onto a file-system interface for ease of use.

There are two styles of notation for file paths in DBFS. The first is the Spark API format: dbfs:/some/path/. The second is the file API format: /dbfs/some/path/. Both notations point to the same location; which one to use depends on the tool that reads the path.
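Since the two notations differ only in their prefix, converting between them is a small string operation. The following is a minimal sketch; the helper names to_file_api and to_spark_api are hypothetical and not part of any Databricks API.

```python
SPARK_PREFIX = "dbfs:/"   # Spark API format prefix
FILE_PREFIX = "/dbfs/"    # file API format prefix


def to_file_api(path: str) -> str:
    """Convert 'dbfs:/some/path' to '/dbfs/some/path' (hypothetical helper)."""
    if not path.startswith(SPARK_PREFIX):
        raise ValueError(f"not a Spark API DBFS path: {path!r}")
    # Drop the 'dbfs:/' prefix and any extra leading slashes, then re-root under /dbfs/.
    return FILE_PREFIX + path[len(SPARK_PREFIX):].lstrip("/")


def to_spark_api(path: str) -> str:
    """Convert '/dbfs/some/path' to 'dbfs:/some/path' (hypothetical helper)."""
    if not path.startswith(FILE_PREFIX):
        raise ValueError(f"not a file API DBFS path: {path!r}")
    return SPARK_PREFIX + path[len(FILE_PREFIX):]


print(to_file_api("dbfs:/some/path/"))
print(to_spark_api("/dbfs/some/path/"))
```

A helper like this can be handy when one notebook cell writes with Spark and a later cell reads the same file with plain Python.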

Spark API format

We use the Spark API format when:

  • we use Spark read and write operations:
    • df.write.format('csv').save('dbfs:/path/to/my-csv') (where df is an existing DataFrame)
    • spark.read.format('csv').load('dbfs:/path/to/my-csv')
  • we use the dbutils.fs interface:
    • dbutils.fs.ls('dbfs:/some/path')
    • we can also use the %fs magic for DBFS operations: %fs ls dbfs:/path/to/file
  • we run a Spark SQL statement: SELECT * FROM JSON.`dbfs:/FileStore/shared_uploads/my-json-table`

File API format

We use the file API format when:

  • we run shell commands: %sh ls -l /dbfs/path/to/file
  • we use Python code that does not go through Spark: f = open('/dbfs/path/to/file')
  • we use other packages that do not support Spark.
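As an illustration of the last two cases, here is a minimal sketch that reads a CSV file with Python's standard csv module instead of Spark. The function name count_csv_rows is hypothetical, and the path in the final comment is the placeholder from the list above; on a cluster it would be a file API path under /dbfs/.

```python
import csv


def count_csv_rows(path: str) -> int:
    """Count the data rows (excluding the header) of a CSV file without Spark."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if there is one
        return sum(1 for _ in reader)


# On a Databricks cluster, the call would use the file API format, e.g.:
# count_csv_rows("/dbfs/path/to/file")
```

Because the path is passed to open() directly, the function works on any local path as well, which makes it easy to try outside Databricks.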

Local driver file system

To refer to a file on the local file system of the driver node from bash or Python code, we just use the path as is:

  • shell: %sh ls /usr/include/zlib.h
  • Python: f = open('/usr/include/zlib.h')

However, to access the local driver file system through the dbutils.fs interface or the %fs magic, we need to use the file:/path/to/file notation:

  • dbutils.fs.ls('file:/usr/include/zlib.h')
  • %fs ls file:/usr/include/zlib.h
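The rule above amounts to prefixing an absolute local path with file:. A minimal sketch, where to_driver_uri is a hypothetical helper name and not a Databricks function:

```python
def to_driver_uri(local_path: str) -> str:
    """Prefix an absolute driver-local path with 'file:' for dbutils.fs or %fs."""
    if not local_path.startswith("/"):
        raise ValueError(f"expected an absolute path, got {local_path!r}")
    return "file:" + local_path


print(to_driver_uri("/usr/include/zlib.h"))
```

The result can then be passed to calls such as dbutils.fs.ls in a notebook.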

Move a file from the local file system to DBFS

We can also move a file from the local file system to DBFS:

dbutils.fs.mv("file:/tmp/some_file.csv", "dbfs:/tmp/my_file.csv")
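Note that dbutils.fs.mv removes the source file, while dbutils.fs.cp keeps it. The sketch below wraps both behind one function; move_local_file_to_dbfs and its keep_source parameter are hypothetical names, and the dbutils object is passed in explicitly (in a notebook it is the built-in dbutils).

```python
def move_local_file_to_dbfs(dbutils, local_path, dbfs_path, keep_source=False):
    """Move (or copy) a driver-local file to DBFS and return the two URIs used.

    local_path and dbfs_path are absolute paths without a scheme,
    e.g. '/tmp/some_file.csv' and '/tmp/my_file.csv'.
    """
    src = "file:" + local_path   # driver-local notation for dbutils.fs
    dst = "dbfs:" + dbfs_path    # Spark API notation for DBFS
    if keep_source:
        dbutils.fs.cp(src, dst)  # copy: the local file stays in place
    else:
        dbutils.fs.mv(src, dst)  # move: the local file is removed
    return src, dst


# In a notebook:
# move_local_file_to_dbfs(dbutils, "/tmp/some_file.csv", "/tmp/my_file.csv")
```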

Workspace files

To read a file under a workspace folder with Spark, we need to use the full path with the file: notation. For example:

df = spark.read.csv('file:/some/csv/file/path/under/workspace', header=True)
