
How to Download Files from Google Cloud Storage in the Databricks Workspace Notebook


In this post, I want to share the complete process and setup for downloading files from Google Cloud Storage (GCS) in a Databricks workspace notebook. Since shell commands run non-interactively in a notebook, the setup differs a bit from the normal GCP authentication flow.

Install the gcloud CLI

For my Databricks clusters, I checked that the underlying system is Ubuntu Linux (version 20.04), so we can use apt-get to update and install packages as on a normal Ubuntu system.

First, we need to refresh the package index and install the dependencies:

sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates gnupg curl sudo

Set up the correct public GPG key for the Google Cloud deb source:

# download the Google Cloud GPG key, convert it to binary format and save it under directory /usr/share/keyrings/.
# you can use the `file` command to check it: `file /usr/share/keyrings/cloud.google.gpg`
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor | sudo tee /usr/share/keyrings/cloud.google.gpg > /dev/null

# create the Google Cloud source entry under directory /etc/apt/sources.list.d/,
# the signed-by field specifies the key file to use for authentication,
# which is exactly the keyring we created in the first command.
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list > /dev/null

In the above commands, we import the Google Cloud repo keyring and register the metadata about the repo.

Now we should be able to install the gcloud CLI without issues:

sudo apt-get update && sudo apt-get install google-cloud-cli
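
If everything worked, the gcloud binary is now on the PATH; a quick check:

gcloud --version

Since cluster storage is ephemeral, this installation disappears on every cluster restart, so you may want to move the commands above into a cluster init script.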


Authentication

User or service account credentials

If you want to use user or service account credentials, you need to authenticate interactively in a shell. This is difficult to do in a Databricks notebook, because shell commands run from notebooks cannot accept input. However, if you have access to the web terminal provided by Databricks, you can do the authentication interactively:

gcloud auth application-default login
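
One way to verify that the credentials were stored is to request a token; this prints a short-lived OAuth access token when application-default credentials are in place:

gcloud auth application-default print-access-token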

Note that the authentication is not permanent: it is lost when you restart your cluster, so you need to authenticate again each time before running your code.


Service account keys

First, we need to create a service account key (see the official GCP documentation). After getting the key file (JSON format), we can upload it to the Databricks workspace. Usually, we would just set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the key file path. But this does not work in Databricks if you use shell commands (%sh export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key_file"), because each %sh cell runs in its own subprocess and the exported variable does not persist. Instead, we can set this environment variable directly in the Python code:

import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key_file"


Download files from Cloud Storage

First, install the google-cloud-storage package (in a notebook cell, you can run this with the %pip magic):

pip install --upgrade google-cloud-storage

Then you can download a file from a Cloud Storage bucket like this:

import os

from google.cloud import storage

# this is for authentication
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key_file"

# Initialise a client
storage_client = storage.Client("<project-name>")
# Create a bucket object for our bucket
bucket = storage_client.get_bucket("<bucket-name>")

# Create a blob object from the filepath
blob = bucket.blob("<sub-directory>/file_name.ext")

# Download the file to a destination
blob.download_to_filename("./my_file.ext")
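
If you need a whole directory rather than a single file, the same client can iterate over a prefix. A minimal sketch, assuming the credentials are configured as above (the bucket name and prefix are placeholders):

import os

from google.cloud import storage

storage_client = storage.Client("<project-name>")

# list_blobs yields every object whose name starts with the given prefix
for blob in storage_client.list_blobs("<bucket-name>", prefix="<sub-directory>/"):
    local_name = os.path.basename(blob.name)
    if local_name:  # skip zero-byte "directory" placeholder objects
        blob.download_to_filename(local_name)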

