Upload Pandas DataFrame to GCP Bucket for Dataproc


I have been working on a Spark cluster on Google Cloud Dataproc for machine learning modelling. I can successfully load data from a Google Storage bucket, but I am not sure how to write a pandas DataFrame or a Spark DataFrame back to the Cloud Storage bucket as a CSV.

When I use the command below, it gives me an error:

FileNotFoundError: [Errno 2] No such file or directory: 'gs://mybucket/'

However, the following command works, but I am not sure where it saves the file:


I also followed the article "Write a Pandas DataFrame to Google Cloud Storage or BigQuery", but it gives the following error:

import google.datalab.storage as storage
ModuleNotFoundError: No module named 'google.datalab'

I am relatively new to Google Cloud Dataproc and Spark, and I was hoping someone could help me understand how to save my output pandas DataFrame to a gcloud bucket.

Thanks in advance!


1 Answer


If you have the gsutil tool available, you can invoke it through subprocess once your CSV has been written locally.

import subprocess
import os

# Save the pandas DataFrame as a CSV on the local filesystem first.
df.to_csv("data.csv", index=False)
# Invoke gsutil to upload the CSV to your GCP bucket.
subprocess.check_output("gsutil -m cp data.csv gs://mybucket/", shell=True)
# Remove the CSV from your local machine once the upload succeeds.
os.remove("data.csv")
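A slightly safer variant of the same idea passes the command to subprocess as an argument list rather than a shell string, which avoids shell=True and any quoting pitfalls with unusual file names. This is only a sketch (the helper names are hypothetical, and it assumes gsutil is installed and authenticated on the cluster):

```python
import subprocess

def gsutil_cp_command(local_path, bucket_uri):
    # Build the gsutil copy command as an argument list; passing a list
    # to subprocess avoids invoking a shell and any quoting issues.
    return ["gsutil", "-m", "cp", local_path, bucket_uri]

def upload_to_bucket(local_path, bucket_uri):
    # Runs the command; requires gsutil to be installed and authenticated.
    subprocess.check_output(gsutil_cp_command(local_path, bucket_uri))
```

With the argument-list form, a path containing spaces is passed through to gsutil as a single argument instead of being split by the shell.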

