How can I perform a GROUP BY in *SQL* when the group_name values are similar but not exactly the same?
In my dataset, the group_name values may differ slightly (e.g., "Apple Inc.", "AAPL", "Apple"), but conceptually they refer to the same entity. The similarity might not be obvious or consistent, so I might need to define a custom rule or function like is_similar() to cluster them.
For simple cases, I can extract a common pattern using regex or string functions (e.g., strip suffixes, lowercase, take prefixes). But how should I handle more complex scenarios, like fuzzy or semantic similarity?
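To make the "simple case" concrete, here is a minimal sketch of the kind of string normalization I mean. It assumes SQLite (driven from Python's sqlite3 module so it can be run as-is) and a single hard-coded ' inc.' suffix rule; both are assumptions for illustration only, not part of my real setup.

```python
import sqlite3

# In-memory database with the sample rows from this question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb1 (group_name TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO tb1 VALUES (?, ?)",
    [("Apple Inc.", 100), ("AAPL", 50), ("Apple", 30),
     ("Microsoft", 80), ("MSFT", 70)],
)

# Simple normalization: lowercase the name and strip a trailing ' inc.'
# (5 characters). This is one illustrative rule, not a general solution.
simple_sql = """
SELECT CASE
         WHEN lower(group_name) LIKE '% inc.'
           THEN substr(lower(group_name), 1, length(group_name) - 5)
         ELSE lower(group_name)
       END AS new_group_name,
       SUM(val) AS total_val
FROM tb1
GROUP BY new_group_name
ORDER BY new_group_name
"""
simple_rows = conn.execute(simple_sql).fetchall()
for row in simple_rows:
    print(row)
```

This merges 'Apple Inc.' and 'Apple' into one group, but 'AAPL' and 'MSFT' still end up in groups of their own, which is exactly the gap I am asking about.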
Case:
group_name   | val
-------------|-----
'Apple Inc.' | 100
'AAPL'       | 50
'Apple'      | 30
'Microsoft'  | 80
'MSFT'       | 70
What I want to achieve:
new_group_name | total_val
---------------|----------
'Apple'        | 180
'Microsoft'    | 150
What are the best approaches to achieve this in *SQL*?
And how would I write a query like this:
SELECT some_characteristic(group_name) AS new_group_name,
       SUM(val) AS total_val
FROM tb1
GROUP BY new_group_name;
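One option I have considered for the cases that string functions cannot cover is an explicit lookup table in place of some_characteristic(). The sketch below assumes SQLite (run through Python's sqlite3 module) and a hypothetical name_map table that I would have to maintain by hand; the table and column names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tb1 (group_name TEXT, val INTEGER);
INSERT INTO tb1 VALUES
  ('Apple Inc.', 100), ('AAPL', 50), ('Apple', 30),
  ('Microsoft', 80), ('MSFT', 70);

-- Hypothetical alias table: each known variant points to one canonical name.
CREATE TABLE name_map (alias TEXT PRIMARY KEY, canonical TEXT NOT NULL);
INSERT INTO name_map VALUES
  ('Apple Inc.', 'Apple'), ('AAPL', 'Apple'), ('MSFT', 'Microsoft');
""")

# Names without an entry in name_map fall back to themselves via COALESCE.
# The expression is repeated in GROUP BY because not every SQL dialect
# accepts a SELECT alias there.
mapped_sql = """
SELECT COALESCE(m.canonical, t.group_name) AS new_group_name,
       SUM(t.val) AS total_val
FROM tb1 AS t
LEFT JOIN name_map AS m ON m.alias = t.group_name
GROUP BY COALESCE(m.canonical, t.group_name)
ORDER BY new_group_name
"""
mapped_rows = conn.execute(mapped_sql).fetchall()
for row in mapped_rows:
    print(row)
```

This produces the totals I want for the sample data, but it only works for variants I have already listed, so it does not answer the fuzzy or semantic part of the question.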
Asked by Ahamad
(1 rep)
May 14, 2025, 08:59 AM
Last activity: May 15, 2025, 05:31 AM