Writing a large dataset from a Spark DataFrame
1 vote · 0 answers · 210 views
We have an Azure Databricks job that retrieves a large dataset with PySpark. The DataFrame has about 11 billion rows. We are currently writing this out to a PostgreSQL database (also in Azure) using the JDBC connector, inserting rows in batches into the existing table (batch size 10,000,000). The table has a handful of indexes on it, so inserts take a while. The whole operation takes dozens of hours to complete (assuming it finishes successfully at all).
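For context, the current write is essentially the standard Spark JDBC sink, along these lines (simplified; the URL, credentials, and table name are placeholders):

    # Roughly the current approach: Spark's JDBC sink appending in large batches.
    # URL, credentials, and table name are placeholders.
    (
        df.write
          .format("jdbc")
          .option("url", "jdbc:postgresql://<host>:5432/<database>")
          .option("dbtable", "public.target_table")
          .option("user", "<user>")
          .option("password", "<password>")
          .option("driver", "org.postgresql.Driver")
          .option("batchsize", "10000000")  # current batch size (10 million)
          .mode("append")
          .save()
    )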
I feel like it would make more sense to use COPY to load the data into the database, but I don't see any well-established patterns for doing that in Databricks.
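The only thing I can imagine is streaming each partition through psycopg2's COPY ... FROM STDIN with foreachPartition, something like the sketch below, but I'm not sure it is a sensible pattern at this scale (untested; host, credentials, table, and column names are placeholders, and the naive CSV serialization does no quoting/escaping and buffers a whole partition in memory):

    # Untested sketch: push each Spark partition to Postgres via COPY ... FROM STDIN.
    import io
    import psycopg2

    def copy_partition(rows):
        # Serialize one partition to CSV in memory (no quoting/escaping handled here).
        buf = io.StringIO()
        for row in rows:
            buf.write(",".join("" if v is None else str(v) for v in row) + "\n")
        buf.seek(0)
        conn = psycopg2.connect(
            host="<host>", dbname="<database>", user="<user>", password="<password>"
        )
        try:
            with conn.cursor() as cur:
                cur.copy_expert(
                    "COPY public.target_table (col_a, col_b) FROM STDIN WITH (FORMAT csv)",
                    buf,
                )
            conn.commit()
        finally:
            conn.close()

    df.foreachPartition(copy_partition)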
I don't have a ton of Spark or Databricks experience, so any tips are appreciated.
Asked by Kyle Chamberlin (13 rep), Feb 16, 2024, 12:57 AM