Upload Pandas DataFrame to GCP Bucket for Dataproc


I have been working on a Spark cluster on Google Cloud Dataproc for machine learning modelling. I have been able to load data from a Google Cloud Storage bucket. However, I am not sure how to write a pandas DataFrame and a Spark DataFrame to the Cloud Storage bucket as CSV.

When I use the command below, it gives me an error:

df.to_csv("gs://mybucket/")
FileNotFoundError: [Errno 2] No such file or directory: 'gs://mybucket/'

However, the following command works, but I am not sure where it saves the file:

df.to_csv("data.csv")

I also followed the article Write a Pandas DataFrame to Google Cloud Storage or BigQuery, and it gives the following error:

import google.datalab.storage as storage
ModuleNotFoundError: No module named 'google.datalab'

I am relatively new to Google Cloud Dataproc and Spark, and I was hoping someone could help me understand how to save my output pandas DataFrame to a GCS bucket.

Thanks in advance!


1 Answer


If the gsutil tool is available on the machine, you can write the CSV locally and then invoke gsutil through subprocess to upload it to your bucket.

import subprocess
import os

# Save CSV file to local machine. 
df.to_csv('data.csv')
# Invoke gsutil and upload CSV to your GCP bucket.
subprocess.check_output("gsutil -m cp data.csv gs://mybucket/", shell=True)
# Remove CSV from your local machine.
os.remove('data.csv')
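
As an alternative that skips the local file and gsutil call, here is a minimal sketch assuming the gcsfs package is installed (pip install gcsfs) and that the code runs on a Dataproc cluster, which ships with the GCS connector. The bucket and object names are placeholders; spark_output/ will be a directory of part files rather than a single CSV.

import pandas as pd
from pyspark.sql import SparkSession

# Hypothetical frame standing in for your modelling output.
pdf = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# With gcsfs installed, pandas can write directly to a gs:// object path.
# "gs://mybucket/data.csv" is a placeholder path.
pdf.to_csv("gs://mybucket/data.csv", index=False)

# A Spark DataFrame can be written straight to the bucket via the GCS
# connector that Dataproc provides by default.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
sdf.write.csv("gs://mybucket/spark_output/", header=True, mode="overwrite")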

