In this post, I want to share the complete process and setup to download files from GCP in a Databricks workspace notebook. Since the notebook itself is non-interactive when you run the shell command, the setup process is a bit different from the normal GCP authentication.
Install gcloud cli#
For my Databricks clusters, I checked that the underlying system is Ubuntu Linux (version 20.04).
So we can use apt-get
to update and install packages as on a normal Ubuntu system.
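As a sanity check, you can confirm the distribution from the notebook itself. Here is a minimal sketch that parses the standard /etc/os-release file (the helper name is my own, not a Databricks API):

```python
def os_release(path="/etc/os-release"):
    # parse the os-release file into a dict, e.g. {"ID": "ubuntu", "VERSION_ID": "20.04", ...}
    info = {}
    try:
        with open(path) as f:
            for line in f:
                if "=" in line:
                    key, value = line.rstrip("\n").split("=", 1)
                    info[key] = value.strip('"')
    except FileNotFoundError:
        pass  # not a Linux system, or the file is missing
    return info

print(os_release().get("ID"), os_release().get("VERSION_ID"))
```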
First we need to update the dependency libraries:
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates gnupg curl sudo
Set up the correct public GPG key for the google cloud deb source:
# download the google cloud gpg key, convert it to binary format, and save it under /usr/share/keyrings/.
# you can use the `file` command to check it: `file /usr/share/keyrings/cloud.google.gpg`
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor | sudo tee -a /usr/share/keyrings/cloud.google.gpg > /dev/null
# create the google cloud source entry under directory /etc/apt/sources.list.d/,
# the signed-by field specifies the key file to use for authentication,
# which is exactly the keyring we created in the first command.
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list > /dev/null
In the above commands, we import the Google Cloud repo keyring and set up the metadata about the repo.
Now we should be able to install gcloud cli without issue:
sudo apt-get update && sudo apt-get install google-cloud-cli
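To confirm the install succeeded from a notebook cell, you can look the binary up on PATH (a small sketch; on a cluster without gcloud it simply returns nothing):

```python
import shutil
import subprocess

def gcloud_version():
    # return the `gcloud --version` output, or None if gcloud is not on PATH
    gcloud = shutil.which("gcloud")
    if gcloud is None:
        return None
    result = subprocess.run([gcloud, "--version"], capture_output=True, text=True)
    return result.stdout

print(gcloud_version())
```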
ref:
- add repo and public key in Debian/Ubuntu: https://unix.stackexchange.com/q/332672/221410
- why apt-key add is deprecated: https://askubuntu.com/a/1307181/768311
- format for sources.list file:
- install gcloud cli: https://cloud.google.com/sdk/docs/install#deb
Authentication#
User or service account credential#
If you want to use user or service account credentials, you need to do it interactively in a shell. This is difficult in Databricks notebooks, because the shell commands you run in notebooks cannot accept input. However, if you have access to the web terminal provided by Databricks, you can do the authentication interactively:
gcloud auth application-default login
Note that the authentication is not permanent: it is lost when you restart your cluster, so you need to authenticate again each time before running your code.
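Since the credential is lost on restart, it can be handy to check whether application default credentials are already in place before opening the web terminal again. By default the credential file lands under ~/.config/gcloud/ (a sketch; the location can differ if CLOUDSDK_CONFIG is set):

```python
import os

def adc_exists():
    # default location where `gcloud auth application-default login`
    # stores the application default credential
    adc_path = os.path.expanduser(
        "~/.config/gcloud/application_default_credentials.json"
    )
    return os.path.exists(adc_path)

print(adc_exists())
```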
ref:
- enable web terminal in databricks: https://learn.microsoft.com/en-us/azure/databricks/administration-guide/clusters/web-terminal
Service account keys#
First, we need to create a service account key; the steps are covered in the official documentation.
After getting the key file (in JSON format), we can upload it to the Databricks workspace.
Usually, we just set the environment variable GOOGLE_APPLICATION_CREDENTIALS
to point to the key file path.
But this does not work in Databricks if you use a shell command such as %sh export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key_file"
, because each %sh cell runs in its own shell process, so the exported variable does not persist into the Python session.
Instead, we can directly set this environment variable in the Python code:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key_file"
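A wrong path or a malformed key file only fails later, at the first API call, so a small guard before setting the variable can save debugging time. This helper is my own addition (not part of any Google library) and assumes a standard service account key JSON, which carries a "type" field:

```python
import json
import os

def set_gcp_credentials(key_path):
    # sanity-check that the file exists and looks like a service account key,
    # then export it for the current Python process
    with open(key_path) as f:
        info = json.load(f)
    if info.get("type") != "service_account":
        raise ValueError(f"{key_path} is not a service account key file")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path
```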
ref:
- set up application credential: https://cloud.google.com/docs/authentication/provide-credentials-adc#local-dev
- how application default credential works: https://cloud.google.com/docs/authentication/application-default-credentials#personal
Download file from cloud storage#
First, install the Google Cloud Storage Python client:
pip install --upgrade google-cloud-storage
Then you can download a file from cloud storage bucket like this:
from google.cloud import storage
import os
# this is for authentication
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key_file"
# Initialise a client
storage_client = storage.Client("<project-name>")
# Create a bucket object for our bucket
bucket = storage_client.get_bucket("<bucket-name>")
# Create a blob object from the filepath
blob = bucket.blob("<sub-directory>/file_name.ext")
# Download the file to a destination
blob.download_to_filename('./my_file.ext')
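If you need a whole sub-directory rather than a single file, `list_blobs` with a `prefix` does the job. Here is a sketch (the function name and the lazy import are my own choices; it assumes the same credentials setup as above):

```python
import os

def download_prefix(project, bucket_name, prefix, dest_dir="."):
    # import inside the function so the sketch can be read
    # even where google-cloud-storage is not installed
    from google.cloud import storage

    client = storage.Client(project)
    # list_blobs accepts a bucket name and an optional prefix filter
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):
            continue  # skip "directory" placeholder objects
        dest = os.path.join(dest_dir, os.path.basename(blob.name))
        blob.download_to_filename(dest)
        print(f"downloaded {blob.name} -> {dest}")
```

Flattening the blob names with `os.path.basename` keeps the example short; mirror the full blob path locally if name collisions are a concern.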
ref:
- cloud storage python client examples: https://github.com/googleapis/python-storage/tree/main/samples
- cloud storage download objects: https://cloud.google.com/storage/docs/downloading-objects#storage-download-object-python