Best practices for large JOINs - Warehouse or External Compute (e.g. Spark)

2 votes
1 answer
677 views
I am working on a problem that requires a very large join. The JOIN itself is pretty straightforward, but the amount of data I am processing is very large. For very large JOINs, is there a preferred type of technology? For example, is it more effective to use a Data Warehouse (like Snowflake) or some other MPP system like Spark?

To make the problem more concrete, I created a hypothetical problem similar to my actual one. Assume I have a table that looks like this:

(screenshot of an `accounts` table with `account_id` and `name` columns)

I am working on some logic that requires account pairs that have the same name. To find pairs of accounts with the same name, I can easily do something like this:
SELECT
    account1.name,
    account2.name
FROM accounts AS account1
JOIN accounts AS account2
  ON account1.name = account2.name
 AND account1.account_id != account2.account_id
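To sanity-check what this query returns, here is a minimal runnable sketch of its semantics using Python's built-in sqlite3 and a few made-up sample rows (the rows are hypothetical, for illustration only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [(1, "Acme"), (2, "Acme"), (3, "Globex")],  # hypothetical sample data
)

# Self-join: distinct account pairs that share the same name.
pairs = conn.execute("""
    SELECT account1.account_id, account2.account_id, account1.name
    FROM accounts AS account1
    JOIN accounts AS account2
      ON account1.name = account2.name
     AND account1.account_id != account2.account_id
""").fetchall()

# Note each matching pair appears in both orders, e.g. (1, 2) and (2, 1);
# at trillions of rows that doubling matters.
print(pairs)
```

On this sample the join returns the two "Acme" accounts paired with each other and nothing for "Globex", which only appears once.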
The problem I am facing is due to the amount of data I am processing: there are roughly 2 trillion records I am trying to self-JOIN on. Obviously, this will take some time and some pretty serious compute. I have run a similar query in Snowflake using XL and 3XL warehouses, but after several hours of running, I canceled it. I am hoping there is a more cost-effective or time-efficient way. Has anyone had success with massive JOINs? Are there any other tricks I could deploy? What tool did you find most effective?
Asked by Arthur Putnam (553 rep)
Feb 24, 2022, 09:06 PM
Last activity: Dec 18, 2024, 12:01 PM