
Writing a large dataset from a Spark DataFrame

1 vote
0 answers
210 views
We have an Azure Databricks job that retrieves a large dataset with PySpark. The DataFrame has about 11 billion rows. We currently write it out to a PostgreSQL database (also in Azure) using the JDBC connector, inserting rows in batches of 10,000,000 into an existing table. That table has a handful of indexes on it, so inserts take a while; the whole operation takes dozens of hours to complete (assuming it finishes successfully at all).

I feel like it would make more sense to use COPY to load the data into the database, but I don't see any well-established patterns for doing that from Databricks. I don't have a ton of Spark or Databricks experience, so any tips are appreciated.
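
For reference, the current write looks roughly like this (a minimal sketch; the connection URL, credentials, table name, and partition count are placeholders, not our real settings):

    # Minimal sketch of the JDBC batch write described above.
    # `df` is the ~11-billion-row DataFrame; all connection details are placeholders.
    (
        df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://<host>:5432/<database>")
        .option("dbtable", "<target_table>")
        .option("user", "<user>")
        .option("password", "<password>")
        .option("batchsize", 10_000_000)   # batch size we currently use
        .option("numPartitions", 32)       # assumed value; controls parallel JDBC connections
        .mode("append")
        .save()
    )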
Asked by Kyle Chamberlin (13 rep)
Feb 16, 2024, 12:57 AM