I have been working on a Spark cluster using the Google Cloud Dataproc service for machine learning modelling. I was able to load the data from a Google Storage bucket. However, I am not sure how to write a pandas DataFrame or a Spark DataFrame to the Cloud Storage bucket as a CSV.
When I use the command below, it gives me an error:

```python
df.to_csv("gs://mybucket/")
```

```
FileNotFoundError: [Errno 2] No such file or directory: 'gs://mybucket/'
```
However, the following command works, but I am not sure where it is saving the file.
I also followed the article "Write a Pandas DataFrame to Google Cloud Storage or BigQuery", and it gives the following error:

```python
import google.datalab.storage as storage
```

```
ModuleNotFoundError: No module named 'google.datalab'
```
I am relatively new to Google Cloud Dataproc and Spark, and I was hoping someone could help me understand how I can save my output pandas DataFrame to a gcloud bucket.

Thanks in advance!
If you have the gsutil tool, you can invoke it through subprocess once your CSV has been written:

```python
import subprocess
import os

# Save the CSV file to the local filesystem first.
df.to_csv('data.csv')

# Invoke gsutil to upload the CSV to your GCP bucket.
subprocess.check_output("gsutil -m cp data.csv gs://mybucket/", shell=True)

# Remove the local copy once the upload succeeds.
os.remove('data.csv')
```
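If you prefer to avoid the temporary file, you can also serialize the DataFrame in memory and upload it with the official google-cloud-storage client. This is a sketch, not the only way to do it: the bucket and object names (`mybucket`, `data.csv`) are placeholders, and `upload_csv` / `df_to_csv_bytes` are helper names invented here, not part of any library:

```python
import io

import pandas as pd


def df_to_csv_bytes(df):
    """Serialize a DataFrame to CSV bytes entirely in memory (no temp file)."""
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    return buf.getvalue().encode("utf-8")


def upload_csv(df, bucket_name, blob_name):
    """Upload a DataFrame as a CSV object.

    Requires the google-cloud-storage package and valid GCP credentials
    (on a Dataproc node the default service account is usually enough).
    """
    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(df_to_csv_bytes(df), content_type="text/csv")


# Example (assumes you have write access to gs://mybucket/):
# upload_csv(df, "mybucket", "data.csv")
```

As a side note, with the optional gcsfs package installed, newer pandas versions can write to `gs://` paths directly (use a full object path such as `gs://mybucket/data.csv`, not just the bucket); without it, pandas treats the path as a local file, which is likely why you saw the `FileNotFoundError` above.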