Writing a large dataset from a Spark DataFrame
1 vote · 0 answers · 210 views
We have an Azure Databricks job that retrieves a large dataset with PySpark. The DataFrame has about 11 billion rows. We are currently writing this out to a PostgreSQL database (also in Azure) using the JDBC connector, inserting rows in batches into the existing table (batch size 10,000,000). The table has a handful of indexes on it, so inserts take a while. The whole operation takes dozens of hours to complete (assuming it finishes successfully at all).
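For context, the current write is essentially the standard Spark JDBC sink, along these lines (simplified; the URL, credentials, and table name are placeholders):

    # Roughly the current approach: Spark's JDBC sink appending in large batches.
    # URL, credentials, and table name are placeholders.
    (
        df.write
          .format("jdbc")
          .option("url", "jdbc:postgresql://<host>:5432/<database>")
          .option("dbtable", "public.target_table")
          .option("user", "<user>")
          .option("password", "<password>")
          .option("driver", "org.postgresql.Driver")
          .option("batchsize", "10000000")  # current batch size (10 million)
          .mode("append")
          .save()
    )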
I feel like it would make more sense to use COPY to load the data into the database, but I don't see any well-established patterns for doing that in Databricks.
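The only thing I can imagine is streaming each partition through psycopg2's COPY ... FROM STDIN with foreachPartition, something like the sketch below, but I'm not sure it is a sensible pattern at this scale (untested; host, credentials, table, and column names are placeholders, and the naive CSV serialization does no quoting/escaping and buffers a whole partition in memory):

    # Untested sketch: push each Spark partition to Postgres via COPY ... FROM STDIN.
    import io
    import psycopg2

    def copy_partition(rows):
        # Serialize one partition to CSV in memory (no quoting/escaping handled here).
        buf = io.StringIO()
        for row in rows:
            buf.write(",".join("" if v is None else str(v) for v in row) + "\n")
        buf.seek(0)
        conn = psycopg2.connect(
            host="<host>", dbname="<database>", user="<user>", password="<password>"
        )
        try:
            with conn.cursor() as cur:
                cur.copy_expert(
                    "COPY public.target_table (col_a, col_b) FROM STDIN WITH (FORMAT csv)",
                    buf,
                )
            conn.commit()
        finally:
            conn.close()

    df.foreachPartition(copy_partition)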
I don't have a ton of Spark or Databricks experience, so any tips are appreciated.
Asked by Kyle Chamberlin (13 rep), Feb 16, 2024, 12:57 AM