Aggregate sampler function for programmer express predictions

0 votes
2 answers
316 views
                          Are there any SQL-standard or DBMS implementation that offers "sample aggregator" for predictably repeated columns?

I will explain, see bold text at the end.

PS: this question is about "little performance gain" in big tables, and about "semantic enhancements" in the *SQL language*, in general.

----

As typical table, we can imagine a SQL-view v1 of some piece of data:

    Organization   | City           |  Country | Info1 | Info2
    ---            | ---            | ---      | ---   | ---
    LocalOrg1      | San Francisco  |  US      | 10    | 23
    LocalOrg1      | San Francisco  |  US      | 2     | 24
    Armarinhos Fer | São Paulo      |  BR      | 11    | 55
    Armarinhos Fer | São Paulo      |  BR      | 12    | 56

My (programmer's) prediction is that all organizations of v1 exists in only one city (eg. there are only one "San Francisco City Hall"). So,on basis of my prediction, the query


    SELECT organization, city, country, 
           sum(info1) as tot1, avg(info2) as avg2, count(*) as n
    FROM v1
    GROUP BY organization, city, country  -- city,country predicted repeat


have some redundancy and some semantic limitation: 
the grouping criteria is  organization, there is no need to spend CPU checking city  and country. And I can say "Hey John look that query grouped by organization", because it does not make sense to say to another human   "organization, city and country" when he have the same prediction in mind. It was a SQL obligation but is not the semantic essence of the query.

Of course, the semantic may be better with GROUP by organization, 2,3 syntax sugar, or using max(),


    SELECT organization, max(city) as city, max(country) as country, 
           sum(info1) as tot1, avg(info2) as avg2, count(*) as n
    FROM v1
    GROUP BY organization  -- better semantic, worse performance 

but max() lost time "comparing nothing" (because as I predicted there are only repeated things), and perhaps the query cost will be bigger than the first query.

Some DBMS also offer first() and last() which might be more efficient (!) then max(), and I guess that this is the simplest solution today: use some kind of first() function optimize performance and semantic, in basis of my prediction that city and country repeats when grouping by organization.

But no matter if the sample of a repeated column comes from first or last sampled row. The first/last can be also an internal optimization choice, so, the real need in this context is a kind of aggsample() function: 
 
    SELECT organization, 
           aggsample(city), aggsample(country) -- better semantic and perfornace
           sum(info1) as tot1, avg(info2) as avg2, count(*) as n
    FROM v1
    GROUP BY organization  -- better performance 
 

So, putting in this detailed context: **Are there are any *SQL language variation* where this kind of function (aggsample) was defined?**

-----

EDITED after comments and some homework...

## NOTES

Candidates and limitations: 

* [PostgreSQL's DISTINCT ON](http://www.postgresql.org/docs/current/static/sql-select.html#SQL-DISTINCT) . Seems to address the similar problem, but not solves the use in usual GROUP BY summarizations.

* [MySQL's ANY_VALUE](https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html) , as suggested by @AndriyM. Seems perfect (!), but I never used... Will try later, install mysql and do some tests.

Similar discussions:

* https://stackoverflow.com/a/8373384/287948 
* https://stackoverflow.com/q/36134657/287948 
* ... https://stackoverflow.com/a/20347763/287948 

### Trying


    CREATE TABLE v1 (
      Organization text,  City text,
      Country text, Info1  int, Info2 int
    );
    INSERT INTO v1 VALUES 
     ('LocalOrg1', 'San Francisco', ' US', '10  ', '23'),
     ('LocalOrg1', 'San Francisco', ' US', '2   ', '24'),
     ('Armarinhos Fer', 'São Paulo', ' BR', '11  ', '55'),
     ('Armarinhos Fer', 'São Paulo', ' BR', '12  ', '56');

Extrange, PostgreSQL offer as valid syntax "DISTINCT ON + GROUP BY", 

    SELECT DISTINCT ON (organization) organization, city, country, 
           sum(info1) as tot1, avg(info2) as avg2, count(*) as n
    FROM v1
    GROUP BY organization, city, country

but not works better (worst performance) and is not valid the reduction ou grouping with only GROUP BY organization.

... no other test.

PS: the only valid syntax (and result) with DISTINCT ON of postgresql is

    SELECT DISTINCT ON (organization) organization, City, Country, 
    	sum(Info1) OVER w AS tot1,
    	avg(info2) OVER w AS avg2,
    	count(*)   OVER w as n
    FROM v1
    WINDOW w AS (PARTITION BY organization);

as we see, ugly syntax, and with worst performance.

                        
Asked by Peter Krauss (476 rep)
Mar 28, 2016, 09:01 AM
Last activity: Apr 28, 2025, 01:00 AM
Aggregate sampler function for programmer express predictions

Related Questions