
Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

4 votes
1 answer
493 views
Pagination - Text comparison with greater than and less than with DESC
I am implementing a seek method for pagination and am wondering how best to query on a text column with DESC. The queries for this seek approach use a less than or greater than depending on whether you are sorting ASC or DESC. This works great for integers and dates, but I am wondering how best to do it with text columns, specifically for the first page.

For example, for the first page when sorting by name it would be:

SELECT * FROM users WHERE first_name > '' ORDER BY first_name ASC LIMIT 5;

Then the next page would be:

SELECT * FROM users WHERE first_name > 'Caal' ORDER BY first_name ASC LIMIT 5;

This works great. I am unsure about DESC order though. This seems to work, but I am unsure if it is 'correct':

SELECT * FROM users WHERE last_name < 'ZZZ' ORDER BY last_name DESC LIMIT 5;

Second page:

SELECT * FROM users WHERE last_name < 'Smith' ORDER BY last_name DESC LIMIT 5;

P.S. I am using the jOOQ support for the seek method and prefer not to hack around the native support, so ideally there is a proper parameter to put in the 'ZZZ' place above; i.e. the WHERE part of the clause is mandatory.
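For reference, seek predicates are often written with a unique tie-breaker column so that the same WHERE shape works on every page and ties sort deterministically. A minimal sketch, assuming an id tie-breaker and a database that supports row-value comparisons (the column and the literal values are assumptions, not from the question):

SELECT * FROM users
WHERE (last_name, id) < ('Smith', 42)
ORDER BY last_name DESC, id DESC
LIMIT 5;

With that shape, the first DESC page is usually the same statement with the WHERE predicate omitted rather than faked with a sentinel such as 'ZZZ'.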
Collin Peters (775 rep)
Aug 12, 2015, 07:28 PM • Last activity: May 6, 2025, 07:05 AM
1 vote
2 answers
1989 views
MySQL: Data pagination according data on another table
I have a database with the following structure:

- Items table: itemID, itemName
- Ads table: AdsId, itemID

I would like to paginate the Items table, 10 items per page, but first I must retrieve the items whose IDs are in the Ads table, and then retrieve the other items. I know I must use LIMIT, such as:

SELECT * FROM Items LIMIT $offset, $no_of_records_per_page

And maybe I should join it with the Ads table, such as:

SELECT * FROM Items INNER JOIN Ads ON Ads.itemID = Items.itemID LIMIT $offset, $no_of_records_per_page

But how can I achieve what I described? Thanks in advance.
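A sketch of one way this is sometimes done in a single statement, assuming each item has at most one row in Ads (otherwise DISTINCT or an EXISTS test would be needed); the ORDER BY expression is an assumption, not from the question:

SELECT Items.itemID, Items.itemName
FROM Items
LEFT JOIN Ads ON Ads.itemID = Items.itemID
ORDER BY (Ads.itemID IS NULL), Items.itemID
LIMIT 10 OFFSET 0;

Items with a matching ad evaluate to 0 in the IS NULL test and sort first, so increasing the OFFSET pages through advertised items and then the rest.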
Ali A. Jalil (113 rep)
Feb 4, 2021, 05:38 PM • Last activity: Apr 4, 2025, 12:06 PM
2 votes
1 answer
82 views
Optimizing Pagination Query with Multiple User Filters on Large Table in SQL Server
I have a table with over 10 million rows tracking user activities. I created a nonclustered index on (UserID, DtModified DESC), which performs well for queries filtering a single user. However, when querying multiple users, SQL Server first joins on the UserActivities table, then sorts the results by last modified before selecting the rows.

Since I’m using this query for pagination, if the combined users have 10,000 rows, SQL Server retrieves all of them, sorts them, and then selects only the first 50 rows. This approach becomes inefficient when searching for users with a large number of records. Is there a way to improve performance with better indexing? Any advice would be greatly appreciated. Thanks!

**P.S.** Erik Darling previously suggested columnstore indexes in another post, but that isn’t an option for me right now.

**Plan with single user:** https://www.brentozar.com/pastetheplan/?id=uvYVLDaq9D

**Plan with multiple users:** https://www.brentozar.com/pastetheplan/?id=EmkR1GGa3p
-- Create a temporary table to demonstrate the issue
CREATE TABLE #UserActivities (
    ActivityID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
    UserID INT NOT NULL,
    ActivityType VARCHAR(50) NOT NULL,
    DtModified DATETIME2 NOT NULL,
    Details NVARCHAR(MAX) NULL
);

-- Create the index we want to evaluate
CREATE NONCLUSTERED INDEX IX_UserID_DtModified 
ON #UserActivities(UserID, DtModified DESC);

-- Insert sample data (10,000 rows for demonstration)
INSERT INTO #UserActivities (UserID, ActivityType, DtModified, Details)
SELECT 
    ABS(CHECKSUM(NEWID())) % 1000 + 1 AS UserID,  -- 1,000 distinct users
    CASE WHEN n % 10 = 0 THEN 'Login' 
         WHEN n % 5 = 0 THEN 'Purchase'
         ELSE 'PageView' END AS ActivityType,
    DATEADD(MINUTE, -ABS(CHECKSUM(NEWID())) % 525600, GETDATE()) AS DtModified,
    'Sample activity details for user ' + CAST(ABS(CHECKSUM(NEWID())) % 1000 + 1 AS VARCHAR(10)) AS Details
FROM (
    SELECT TOP 10000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.objects a
    CROSS JOIN sys.objects b  -- Cross join to get enough rows
) AS Numbers;

-- Demonstrate the index behavior
-- Good case: Single user (uses seek)
SELECT UserID, DtModified
FROM #UserActivities
WHERE UserID = 42
ORDER BY DtModified DESC;

-- Problem case: Multiple users (often uses scan)
SELECT UserID, DtModified
FROM #UserActivities
WHERE UserID IN (42, 100, 789, 1024)
ORDER BY DtModified DESC;

-- Clean up
DROP TABLE #UserActivities;
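Not an authoritative fix, but one rewrite that is often tried for this shape of query is a per-user APPLY, so each UserID becomes its own index seek that returns at most one page of rows before the final sort over the small combined set. The VALUES list and the page size of 50 are assumptions for illustration:

SELECT TOP (50) ua.ActivityID, ua.UserID, ua.DtModified
FROM (VALUES (42), (100), (789), (1024)) AS u (UserID)
CROSS APPLY (
    SELECT TOP (50) a.ActivityID, a.UserID, a.DtModified
    FROM #UserActivities AS a
    WHERE a.UserID = u.UserID
    ORDER BY a.DtModified DESC
) AS ua
ORDER BY ua.DtModified DESC;

Paging past the first 50 rows would still need an OFFSET or a keyset predicate on (DtModified, ActivityID) applied inside the APPLY.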
lifeisajourney (751 rep)
Apr 3, 2025, 02:14 PM • Last activity: Apr 4, 2025, 09:33 AM
8 votes
3 answers
896 views
Most cost efficient way to page through a poorly ordered table?
I have a table that has three columns: HashUID1, HashUID2, Address_Name (which is a textual email address; the previous two hash columns are of some crazy creation to link event participant tables to email addresses. It's ugly, it barely works, and it's out of my control. Focus on the Address_Name index). It has 78 million rows, not properly sorted. Regardless, this index is split onto a lot of fast LUNs and performs REALLY fast index seeks.

I need to create a series of queries to extract only 20,000 "rows per page" at a time, but avoid conflicts or dupes. Since there is no identity column, **or easily ordered column,** is there an easy way to select all, and page through it?

Am I correct in saying that if I do a select * from hugetablewithemails into a temp table, then select through it by row_number, that the table remains in memory for the duration of the transaction, which, to me, is an excessive amount of memory resources? This seems the preferred method of paging. I'd rather page by statistical percentages. :(

There is one index which maintains the address_name email address in order, and is well maintained. For the past week I have been meaning to help this other developer by spending some time on looking into building a proc that spits out ranges based on windowing functions based on statistics (which I am not great at, but this query really interested me) to provide a range of characters 1 through (variable) LEFT LIKE chars of the index, that meets 20,000 rows -- but I have not had time to even start the query...

Couple questions:

1. Any suggestions? Not looking for actual code, just some hints or suggestions based on experiences, maybe caveats. I want to avoid additional index scans after the initial scan.
2. Is this the right approach?
3. I'm thinking of breaking the sum of the index of all email addresses, gathering rowcount(*), /20,000, and using that as a windowing function to group min/max substring(1,5) values based on percentages of total rowcount to build grouping ranges. Thoughts?

This is for an ETL process that cannot modify source databases. I am hoping with one full index scan I can do a:

* Query to get a histograph based on index usage (alphabetically sorted) and break it out (windowed) using min/max to create some ranges like this, so as to easily seek the needed index:
* A-> AAAX, (20k rows for example) AAA-Z, B-> (another 20k), B->BAAR -> BAAR-> CDEFG -> CDEFH > FAAH, etc.

We run read committed in these databases for this ETL process. We are only attempting to batch it out in scores of 20k rows because the DBAs say we are using too much network resources by grabbing tables in full. If the data has changed (which is a concern) we update our DW and staging tables on the fly.

I would *love* to use temp tables, but if I did, I'd spill into tempdb and get lashings via e-mail from the DBAs regarding it, and that the database is too big.
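For what it's worth, a keyset-style sketch over the existing Address_Name index, where each batch starts after the last address returned by the previous batch. The variable, its type, and the lack of a tie-breaker for duplicate addresses are all assumptions here, so treat it as a sketch rather than the answer:

DECLARE @last_address varchar(320) = '';  -- carried over from the previous batch

SELECT TOP (20000) HashUID1, HashUID2, Address_Name
FROM hugetablewithemails
WHERE Address_Name > @last_address
ORDER BY Address_Name;

Each batch then becomes one index seek plus a short range scan, at the cost of one seek per batch.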
beeks (1251 rep)
Nov 15, 2014, 04:23 AM • Last activity: Mar 23, 2025, 03:41 AM
3 votes
1 answer
932 views
Efficient Pagination In Postgresql (Keyset Pagination)
I'm working on implementing pagination for a PostgreSQL database in which I have a table that stores users for my application. I have a query that is intended to fetch the next page of users based on their forename and creation timestamp. However, I'm encountering some difficulties and would appreciate some guidance.

Types for the columns: userId (string, stores UUIDv5), userForename (string), userCreatedAt (timestamp).

Here's the query I'm using for fetching the next page; the LIMIT value is just a placeholder:
SELECT "userId", "userForename", "userCreatedAt"
FROM iatropolis."User"
WHERE
	LOWER("userForename") > 'aaliyah'
	OR (LOWER("userForename") = 'aaliyah' AND
     "userCreatedAt" > TIMESTAMP '2024-03-25 14:50:39.481197')
ORDER BY "userForename" ASC, "userCreatedAt" ASC
LIMIT ;
The goal of this query is to retrieve the first user after 'aaliyah' alphabetically, and if there are multiple users with the same forename, to select the one with a creation timestamp later than '2024-03-25 14:50:39.481197'. However, I'm unsure if this query is correctly achieving the desired pagination. Even though I tested it and it outputted correct results, I don't know if it will behave as desired in the long run (not sure if I am missing something). I have also tried to go to the previous page by creating a similar query.
SELECT "userId", "userForename", "userCreatedAt"
FROM iatropolis."User"
WHERE
	LOWER("userForename") ;
This is some ascending random data that I am working with:

| userId | userForename | userCreatedAt |
| ------------------------------------ | ------------ | -------------------------- |
| 38ee48f9-6d79-5690-9e51-df4b1cc5d428 | Aaliyah | 2024-03-25 14:50:39.481197 |
| ba2c1a86-280d-573f-ad35-ed1d023e3e5d | Aaliyah | 2024-03-25 14:51:35.52505 |
| 4f40dd7a-8f0f-54e4-bf14-3de3626bde67 | Abagail | 2024-03-25 19:27:10.985665 |
| e7b68316-188d-509e-9fd3-7fc780986c71 | Abbey | 2024-03-25 17:45:43.584704 |
| 1084f33e-9183-501d-b1af-70d61f1479d7 | Abbigail | 2024-03-25 19:27:29.720356 |
| 561e71fd-f04a-54d2-9864-2324e28cdfbd | Abbigail | 2024-03-25 19:27:35.5478 |
| ab5ad018-1866-50a4-8bf6-568f74988d21 | Abbigail | 2024-03-25 17:46:35.309003 |
| 596e13f2-e413-5b50-bdfb-9e68f0edf3db | Abdiel | 2024-03-25 17:47:08.576102 |
| d0d5bff9-f782-5451-bda4-e4d4a3e49f04 | Abdiel | 2024-03-25 14:51:24.257638 |
| 33a02ded-b354-5a6b-b506-cad3595d19be | Abdul | 2024-03-25 17:46:49.809666 |
| 6e1f68ff-f799-511e-ba91-83b35349e652 | Abdul | 2024-03-25 14:51:02.533247 |
| f67226da-a560-5913-9770-c23499046dc5 | Abe | 2024-03-25 14:49:34.342507 |
| 4e345a84-144d-55e2-a568-ce2064cb91fc | Abigale | 2024-03-25 19:26:32.349998 |
| ecab4167-8275-5c45-a6ce-5f616edeb60b | Abigale | 2024-03-25 19:26:24.406462 |
| 3689dc88-b57c-572b-90cd-3bd0c4603656 | Abigayle | 2024-03-25 14:50:41.608976 |
| db26cecb-0532-5cd4-b45e-04c14ce48b97 | Abigayle | 2024-03-25 19:26:54.081197 |
| 03d9e144-ba2e-5d08-b8bc-abaa42b59848 | Abner | 2024-03-25 17:47:12.313267 |
| e7d3fd8b-9187-5e78-8178-93eef85c3edc | Abner | 2024-03-25 19:27:35.016205 |
| 3a0c5db0-a19b-57b2-be99-6c0e0d33fb4c | adadadFN | 2024-03-22 11:33:03.466138 |
| daada509-45b3-58e8-920f-2441395af0f8 | Adalberto | 2024-03-25 17:46:31.150524 |
| dd3781a8-45ae-5ecb-9d5d-b4230347b0d8 | Adaline | 2024-03-25 17:46:05.821796 |
| c2122ddb-fd31-59ea-b5db-3e8696f0a2fc | Adan | 2024-03-25 14:51:06.880816 |
| 19ac3589-47ac-53ba-8e8c-c1537babbe97 | Addie | 2024-03-25 17:47:04.303399 |
| c61c8d3a-02b7-5db3-8b84-dba2fbf522b8 | Addie | 2024-03-25 17:46:35.627732 |
| 6800cc09-113a-5383-a9ec-463bfeb1f3aa | Addison | 2024-03-25 17:46:24.314788 |

This is some descending random data that I am working with:

| userId | userForename | userCreatedAt |
| ------------------------------------ | ------------ | -------------------------- |
| d64ce86e-1bf4-56d8-86dc-e0ffd9bae972 | Zula | 2024-03-25 14:51:53.062856 |
| e6f98b6c-ae16-58e9-ba3a-979dac63058d | Zora | 2024-03-25 17:46:15.697886 |
| 951b1493-0f86-5741-8b3d-5b7db7466f25 | Zora | 2024-03-25 14:51:38.315643 |
| a2c4455e-5333-57ea-9e7e-6a8ca03a7cf4 | Zita | 2024-03-25 19:26:44.119388 |
| d00c4951-de6a-5290-b956-13bf3b3e11eb | Zita | 2024-03-25 14:51:28.445206 |
| 390b6dae-b47e-57d7-8a7f-10a002d9b3a3 | Zion | 2024-03-25 14:50:18.578468 |
| 6f6371a0-b5bd-5839-aed2-81e8b8e0bff7 | Zetta | 2024-03-25 19:27:22.998847 |
| 4750df40-ffda-55e2-b26b-7c0c2ecf87e5 | Zetta | 2024-03-25 19:26:01.682206 |
| 14d137e7-83aa-5463-b47e-1a2376b53376 | Zelma | 2024-03-25 14:50:15.690687 |
| 0defcb46-486c-5273-9c81-e553500de614 | Zelma | 2024-03-25 14:49:32.577772 |
| 1a28a48b-bc3b-5f84-bbae-1ac0839bf311 | Zechariah | 2024-03-25 19:26:04.410123 |
| 574bf44e-2604-5ff1-95a9-164287d3c880 | Zane | 2024-03-25 19:27:15.455109 |
| e7b0e413-2eff-5240-8c18-321664a42416 | Zander | 2024-03-25 19:26:45.864325 |
| 2a737734-2c45-54c0-b911-315c8d333884 | Zakary | 2024-03-25 14:50:34.264419 |
| 782ddfaa-df67-50e3-841d-fa21b525e75c | Zackery | 2024-03-25 17:47:17.542445 |
| d24c75f2-7152-50aa-96ef-3a2d3847f1d3 | Zackary | 2024-03-25 17:46:58.029806 |
| c2d6d8c5-3891-57fb-bbea-57e22cb51da4 | Zack | 2024-03-25 17:46:27.203011 |
| 3333d2de-8028-5f00-8e4c-db821e8fe1a5 | Zachary | 2024-03-25 17:46:25.890929 |
| f11dfc68-78f1-5290-b203-9e04449c0538 | Zachary | 2024-03-25 14:50:27.447767 |
| d88c05bc-5fe4-5597-94b5-ca29c99a2df4 | Zachariah | 2024-03-25 17:46:52.98815 |
| cf8e2a99-e308-53ae-8908-8c2ff0188143 | Zachariah | 2024-03-25 14:51:47.465403 |
| da1ed730-9965-56c2-9317-e1726f172afc | Yvonne | 2024-03-25 19:26:36.714501 |
| 97a4213a-5fd5-5d0d-a860-567295ab9a23 | Yvonne | 2024-03-25 14:51:20.7755 |
| 41ce8c6b-5cca-57b9-a74b-d9fbf8f32ade | Yvette | 2024-03-25 14:50:18.973821 |
| 9f47693b-bcfd-5e2c-b92f-56aefb79ce03 | Yoshiko | 2024-03-25 19:27:33.647082 |

My question is: would somebody that knows SQL properly and has used it for years agree that this keyset pagination is implemented properly? Also, for my case, being able to order the columns is very important, which is why I included the userForename in the queries. In the event that my queries are not correct, could somebody show me how to do them properly? Additionally, I am open to exploring alternative approaches if there are more efficient solutions available.
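For comparison, the "previous page" query under this ordering usually reverses both the comparisons and the ORDER BY, and the application (or an outer query) restores the display order afterwards. This is only a sketch of that common pattern, with the LIMIT as a placeholder, not a verified fix:

SELECT "userId", "userForename", "userCreatedAt"
FROM iatropolis."User"
WHERE
	LOWER("userForename") < 'aaliyah'
	OR (LOWER("userForename") = 'aaliyah' AND
     "userCreatedAt" < TIMESTAMP '2024-03-25 14:50:39.481197')
ORDER BY "userForename" DESC, "userCreatedAt" DESC
LIMIT 5;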
odyssey (33 rep)
Mar 25, 2024, 07:09 PM • Last activity: Feb 10, 2025, 10:02 AM
3 votes
1 answer
2384 views
What are the methods to paginate a table 100 rows at a time?
Let's say we have a table with a large number of records. Now to show all the data using pagination in a basic framework using a MySQL query, we can use limits to get a subset of rows:

SELECT * FROM TABLE_NAME WHERE CONDITION ORDER BY COLUMN1 LIMIT 0,100;

and so on. To my understanding, LIMIT will work once the temporary result table is generated; in other words, the search will go through all the rows once and then the final result will be populated. Am I right?

The query takes much time to return data with LIMIT, and sometimes the system just reaches execution time limits. Is there any better solution for this?
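For reference, the usual alternative to a large LIMIT offset is keyset (seek) pagination, where each page starts after the last row of the previous one instead of counting rows from the beginning. A minimal sketch, assuming the table also has a unique id column and an index on (COLUMN1, id); all names are the question's placeholders or assumptions:

SELECT * FROM TABLE_NAME
WHERE CONDITION
  AND (COLUMN1 > @last_col1
       OR (COLUMN1 = @last_col1 AND id > @last_id))
ORDER BY COLUMN1, id
LIMIT 100;

The trade-off is that you can only step to the next page from the current one; jumping straight to an arbitrary page number still needs OFFSET.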
Anant (131 rep)
Jan 10, 2020, 06:26 AM • Last activity: Feb 4, 2025, 12:05 PM
1 vote
1 answer
824 views
effectiveness of creating index for ORDER BY column when used with many WHERE clauses
Say I am trying to build pagination for a simple e-commerce app where users can search, filter, and sort items, and the results are displayed in an infinite scroll UI. I'm planning to use the cursor pagination method. When a user wants to sort by lowest price (a non-unique column), the cursor will comprise the ID and the price. Something like:
SELECT * FROM items
WHERE price > {last_price} OR (price = {last_price} AND id > {id})
ORDER BY price ASC, id ASC LIMIT 25
To optimize this query, I can create a compound index on (price, id): https://dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html

Now my question is, will this index still be effective if the query contains many WHERE clauses? Because users can filter based on the item's attributes, e.g.
...
WHERE (locationId = 30 AND categoryId = 20 AND price > 5000 ...)     // filtering
AND (price > {last_price} OR (price = {last_price} AND id > {id}))   // cursor pagination clause
...
Based on my understanding, if we don't have the index, then the WHERE clause filtering will be done first to reduce the number of rows, and then ORDER BY is done on the remaining rows. But what happens when there's an index on the ORDER BY columns? Will it be useful at all, or will it cause a performance issue instead?
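As a sketch of the kind of index usually considered here, an index that leads with the equality filters and ends with the sort keys can satisfy both the WHERE clause and the ORDER BY; the index name and exact column list are assumptions, not a recommendation for the real schema:

ALTER TABLE items
  ADD INDEX idx_location_category_price_id (locationId, categoryId, price, id);

A bare (price, id) index, by contrast, is read in sort order while the other predicates are applied as filters during the scan, which may or may not be cheaper depending on selectivity.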
hskris (111 rep)
Dec 21, 2022, 06:04 AM • Last activity: Jan 7, 2025, 06:03 PM
0 votes
0 answers
43 views
Postgres: How to select rows, grouping by rows with same column field into one row, with pagination
I've a table with fields like:

| id | batchId | senderId | recipientId |
| -------- | -------------- | ------- | ------ |
| uuid1 | uuid | uuid5 | uuid7 |
| uuid2 | uuid | uuid5 | uuid8 |
| uuid3 | uuid2 | uuid6 | uuid9 |

I need to select rows with pagination, but handling rows with the same batchId as one row, joining them into an array, like:

| batchId | senderId | recipientIds |
| -------- | -------------- | ------- |
| {uuid} | {uuid5} | {uuid7, uuid8} |
| {uuid2} | {uuid6} | {uuid9} |

It's a simplified version of the table. Currently I'm doing it with a query, but it's quite heavy and takes about 5 seconds to execute on a DB with about 2M records. I'm using indexes, but that doesn't help. Here below is the query:
SELECT "sendData"."id"                                                                   AS "id"
     , "sendData"."pictureLink"                                                          AS "pictureLink"
     , "sendData"."isPaid"                                                               AS "isPaid"
     , "sendData"."createdAt"                                                            AS "createdAt"
     , "sendData"."updatedAt"                                                            AS "updatedAt"
     , "sendData"."senderId"                                                             AS "senderId"
     , "sendData"."recipient"                                                            AS "recipient"
     , "sendData"."isAnonymous"                                                          AS "isAnonymous"
     , "sendData"."recipientAccountId"                                                   AS "recipientAccountId"
     , "sendData"."senderAccountId"                                                      AS "senderAccountId"
     , "sendData"."senderPhotoUrl"                                                       AS "senderPhotoUrl"
     , "sendData"."recipientPhotoUrl"                                                    AS "recipientPhotoUrl"
     , "sendData"."isDeleted"                                                            AS "isDeleted"
     , "sendData"."oldCategory"                                                          AS "oldCategory"
     , "sendData"."batchId"                                                              AS "batchId"
     , CONCAT("sender"."firstName", ' ', "sender"."lastName", ' ', "sender"."patronymic") AS sender
     , json_build_object('id', "category"."id"
    , 'createdAt', "category"."createdAt"
    , 'updatedAt', "category"."updatedAt"
    , 'name', "category"."name"
    , 'baseName', "category"."baseName"
    , 'emojiId', "category"."emojiId"
    , 'text', "category"."text"
    , 'isVisible', "category"."isVisible"
    , 'isLocked', "category"."isLocked"
    , 'widgetText', "category"."widgetText"
    , 'emoji', json_build_object('id', "emoji"."id"
                             , 'createdAt', "emoji"."createdAt"
                             , 'updatedAt', "emoji"."updatedAt"
                             , 'unicode', "emoji"."unicode")
       )                                                                                  as category

     , (SELECT COALESCE(json_agg(
                                json_build_object(
                                        'id', "reaction"."id"
                                    , 'userAccountId', "reaction"."userAccountId"
                                    , 'sendDataId', "reaction"."sendDataId"
                                    , 'type', "reaction"."type"
                                )
                        ), '[]')
        FROM sendData_reactions as reaction
        where reaction."sendDataId" = ANY (ids))                                         as reactions
     , "recipientIds"
     , "recipients"
     , "ids"
FROM sendData as sendData
          JOIN (SELECT "batchId",
                      "senderId",
                      json_agg(tableName."recipientAccountId") AS "recipientIds",
                      json_agg(tableName."recipient")          AS "recipients",
                      array_agg(tableName.id)                  as "ids",
                      min(tableName.id::text)::uuid            as "id",
                      tableName."createdAt"
               FROM sendData as tableName
                        LEFT JOIN "user" as "sender"
                                  ON "sender"."id" = tableName."senderId"
                 AND "sender"."groupId" = 'id-here'
                 and sender."deletedAt" is null
               WHERE tableName."isDeleted" = false
                 AND "sender"."groupId" = 'id-here'
                 and sender."deletedAt" is null
                 AND tableName."isAnonymous" = false
               GROUP BY "batchId", tableName."createdAt", "senderId"
               ORDER BY tableName."createdAt" DESC
               OFFSET 0 LIMIT 100) as groupped
              ON sendData."id" = groupped."id"
                  AND sendData."createdAt" = groupped."createdAt"
         LEFT JOIN "user" as "sender"
                   ON "sender"."id" = sendData."senderId"
         LEFT JOIN "sendData_categories" as category
                   ON "category"."id" = sendData."categoryId"
         LEFT JOIN "sendData_categories_emoji" as "emoji"
                   ON "emoji"."id" = "category"."emojiId"
WHERE sendData."groupId" = 'uuid'
  AND sendData."isAnonymous" = false
  AND sendData."isDeleted" = false
ORDER BY sendData."createdAt" DESC
OFFSET 0 LIMIT 10;
Sendoff74 (1 rep)
Nov 15, 2024, 03:09 AM • Last activity: Nov 15, 2024, 09:06 AM
1 vote
0 answers
39 views
Is it even possible to create a scalable rhyming dictionary for 10 million words in a single language like English?
I'm going in circles brainstorming ideas and TypeScript or SQL code to implement basically a "rhyming database". The goal of the rhyming database is to find rhymes for all words, not just exact rhymes but "nearby or close rhymes" too (like Rap music, etc.). Here are some of the facts and features: 1. Estimate 10 million English words for now (but realistically I'm thinking about doing this for ~30 languages). 2. I think rhymes would be like a reverse exponential curve (let's just imagine), so many short words rhyme long words, but it tapers down as words get longer. 3. We only will support up to 3 syllables of word-end rhyming. 4. Don't worry about the system for capturing the phonetic information of words, we can use something like the [CMU pronunciation format/dictionary](https://en.wikipedia.org/wiki/CMU_Pronouncing_Dictionary#Database_format) . I have a system for computing phonetic information (too involved to describe for this post). 5. In a not-worse-but-bad-case, let's say there are 1000 rhymes for every word, that is 10m x 1k = 10 billion relationships between words. At 10,000 rhymes, that is 100 billion, so the database might start running into scaling problems. 6. Ideally, we compute a "similarity score", _comparing each word to every other word_ (cross-product), and have a threshold things must score higher than to count as a rhyme. 7. We then sort by the similarity score. 8. We allow _pagination_ based on an input pronunciation text, and you can jump to specific pages in the rhyme query results. Well, all these features together seem like an impossible ask so far: **pagination**, **complex/robust similarity scoring** (not just hacky extremely simplified SQL functions for calculating basic scores, but advanced cosineSimilarity scoring, or even more custom stuff taking into account sound-sequences in each word), **10 million words**, **up to 3 syllables of rhyming**, **fast query time**, and ideally not requiring a huge memory-intensive server. I have been essentially cycling through 5 or 10 solutions to this problem with ClaudeAI (either backed by SQL, or just in-memory), but it can't seem to solve all those problems at once, it leaves one key problem unsolved, so everything won't work. - First solution was in-memory, for every word, compute a robust vector similarity score based on the pronunciation/phonemes/sounds of each word, cross-product style. This seems like the ideal solution (which would give you 100% accurate results), but it won't scale, because 10m x 10m is trillions and beyond that. Especially not possible in the DB. By precomputing all similarities between every pair of words, search is easy, as there is a map from input to array of rhymes, already sorted by score. Pagination is easy too. But it won't scale. - Next "solution" was a SQL version, with an **extremely primitive** phoneme_similarity SQL function. Then a query for all rhymes would be something like: const query = ` WITH scored_rhymes AS ( SELECT w.word, ( phoneme_similarity(w.last_vowel, ?) * 3 + phoneme_similarity(w.penultimate_vowel, ?) * 1.5 + CASE WHEN w.final_consonant_cluster = ? THEN 2 ELSE 0 END + CASE WHEN substr(w.stress_pattern, -1) = substr(?, -1) THEN 1 ELSE 0 END + CASE WHEN substr(w.stress_pattern, -2) = substr(?, -2) THEN 0.5 ELSE 0 END ) AS score FROM words w WHERE w.word != ? AND w.last_vowel = ? ) SELECT word, score FROM scored_rhymes WHERE score > 0 ORDER BY score DESC, word LIMIT ? OFFSET ? `; While it seems to handle pagination, the scoring logic is severely lacking. 
This won't give quality rhyme results, we need much more advanced phonetic sequence clustering and scoring logic. But it would scale, as there is just a single words table, with some phonetic columns. It's just not going to be accurate/robust enough scoring / rhyming-wise. - A third solution it came up with, did the advanced scoring, but _after_ it made a DB query (DB-level pagination). This will not result in quality pagination, because a page worth of words are fetched based on non-scored data, then scores are computed on that subset in-memory, and then they are sorted. This is completely inaccurate. - Then the fourth solution, after saying how it didn't meet all the constraints/criteria, it did a SQL version, with storing the cross product of every word pair, precomputing the score! Again, we did that already in memory, and it definitely won't scale storing 10m x 10m links in the DB. So then it is basically cycling through these answers with small variations that don't have a large effect or improvement on the solution. _BTW using AI to help think through this has gotten me way deeper into the weeds of solving this problem and making it a reality. I can think for days and weeks about a problem like this on my own, reading a couple papers, browsing a few GitHub repos, ... but then I think in my head "oh yeah I got something that is fast, scalable, and quality". Yeah right haha. Learning through AI back and forth helps getting working data structures and algorithms, and brings new insights and pros/cons lists to my attention which I would otherwise not have figured out in a timely manner._ So my question for you now is, after a few days of working on this rhyming dictionary idea is, is there a way to solve this to get all the constraints of the system satisfied (pagination/scoring/10m-words/3-syllables/fast-query/scalable)? An [answer to my StackOverflow question](https://stackoverflow.com/questions/79101873/how-to-build-a-trie-for-finding-exact-phonetic-matches-sorted-globally-by-weigh/79102113?noredirect=1#comment139481345_79102113) about finding phonetic matches in detail suggested I use a [succinct indexable dictionary](https://en.wikipedia.org/wiki/Succinct_data_structure#Succinct_indexable_dictionaries) , or even the [Roaring Compressed Bitmap](https://roaringbitmap.org/) data structure. But from my understanding so far, this requires still computing the cross product and scoring, it just might save on some memory. I don't know though if it would efficiently story trillions and quadrillions of associations though (in-memory even, on large machine). So I'm at a loss. Is it impossible to solve my problem as described? If so, what should I cut out to make this solvable? Either what constraints/desires should I cut out, or what other things can I cut corners on? _I tagged this as PostgreSQL because that's what I'm using for the app in general, if that helps._
Lance Pollard (221 rep)
Oct 19, 2024, 06:08 AM • Last activity: Oct 19, 2024, 06:14 AM
2 votes
1 answer
442 views
Emulate Loose Index Scan for multiple columns with alternating sort direction
A while back I asked [this question](https://dba.stackexchange.com/questions/320064/use-skip-scan-index-to-efficiently-select-unique-permutations-of-columns-in-post) about efficiently selecting unique permutations of columns in Postgres. Now I have a follow-up question regarding how to do so, with the addition of being able to order any of the columns with any combination of ASC/DESC across the columns. The table contains hundreds of millions of rows, and while the accepted answer to my previous question is orders of magnitude faster than traditional approaches, not being able to order the columns in an ad-hoc way prevents me from putting this query to good use (I really need it to 'paginate', with LIMIT/OFFSET in small chunks). Is there a way to do this? The author of the previous answer kindly suggested a workaround (changing the row comparison for an explicit where clause), which I tried, but it doesn't seem to work (or I misunderstand it). Given the following generic query:
WITH RECURSIVE cte AS (
   (
   SELECT col1, col2, col3, col4
   FROM   tbl
   ORDER  BY 1,2,3,4
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT t.col1, t.col2, t.col3, t.col4
      FROM   tbl t
      WHERE (t.col1, t.col2, t.col3, t.col4) > (c.col1, c.col2, c.col3, c.col4)
      ORDER  BY 1,2,3,4
      LIMIT  1
      ) l
   )
SELECT * FROM cte
Is there a way to order the columns in an ad-hoc way, whilst maintaining the performance? For example:

ORDER BY col1 DESC, col2 ASC, col3 ASC, col4 DESC

Assume an index on each column, as well as a combined index across all 4 columns. Postgres version is 15.4. The table is read-only in the sense that the data can't / won't be modified; however, it will be added to. Following is a CREATE TABLE script to replicate my problematic table (more or less):
CREATE TABLE tbl (id SERIAL primary key, col1 integer NOT NULL, col2 integer NOT NULL, col3 integer NOT NULL, col4 integer NOT NULL);

INSERT INTO tbl (col1, col2, col3, col4) SELECT (random()*1000)::int AS col1, (random()*1000)::int AS col2, (random()*1000)::int AS col3, (random()*1000)::int AS col4 FROM generate_series(1,10000000);

CREATE INDEX ON tbl (col1);
CREATE INDEX ON tbl (col2);
CREATE INDEX ON tbl (col3);
CREATE INDEX ON tbl (col4);
CREATE INDEX ON tbl (col1, col2, col3, col4);
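One thing sometimes tried for a fixed mixed ordering (a sketch, not a tested answer for the recursive CTE above) is an index whose key directions match the ORDER BY; the index name is an assumption:

CREATE INDEX tbl_mixed_idx ON tbl (col1 DESC, col2 ASC, col3 ASC, col4 DESC);

Note that the row-value comparison in the CTE only expresses a uniform direction, so for mixed ASC/DESC it would have to be rewritten as explicit per-column conditions (col1 < ... OR (col1 = ... AND col2 > ...), and so on) to stay correct.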
hunter (217 rep)
May 2, 2024, 06:52 PM • Last activity: May 4, 2024, 03:49 AM
12 votes
2 answers
15139 views
MySQL - UUID/created_at cursor based pagination?
For a large dataset, paginating with an OFFSET is known to be slow and not the best way to paginate. A much better way to paginate is with a cursor, which is just a unique identifier on the row so we know where to continue paginating from where we last left off at the last cursor position.

When the cursor is an auto-incrementing id value, it's fairly easy to implement:

SELECT * FROM users
WHERE id <= %cursor // cursor is the auto incrementing id, ex. 100000
ORDER BY id DESC
LIMIT %limit

What we're not certain about is: what if, instead of an auto-incrementing id cursor, the only unique sequential identifiers for the cursor are uuid and created_at on the table rows? We can certainly query based on the uuid to get the created_at, and then select all users whose created_at is <= that value, but the issue is: what if there are multiple instances of the same created_at timestamp in the users table?

Any idea how to query the users table based on a uuid/created_at cursor combination to ensure we get the correct datasets (just as if we were using an auto-incrementing id)? Again, the only unique field is uuid, since created_at may be duplicated, but their combination would be unique per row.
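A minimal sketch of the usual two-column cursor, assuming an index on (created_at, uuid), a descending feed, and placeholder cursor values:

SELECT * FROM users
WHERE created_at < @last_created_at
   OR (created_at = @last_created_at AND uuid < @last_uuid)
ORDER BY created_at DESC, uuid DESC
LIMIT 20;

The uuid only has to break ties among rows sharing one created_at value, so it does not need to be globally sequential for this to stay consistent.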
Wonka (145 rep)
Apr 30, 2018, 06:27 PM • Last activity: Feb 14, 2024, 10:00 PM
12 votes
1 answer
23084 views
Efficient pagination for big tables
Using **PostgreSQL 10.5**. I'm trying to create a pagination system where the user can go back and forth between various results. In an attempt to not use OFFSET, I pass the id from the last row in the previous page in a parameter called p (prevId). I then select the first three rows whose id is higher than the number passed in the p parameter (as described in this article). For example, if the id for the last row in the previous page was 5, I'd select the first 3 rows with an id higher than 5:

SELECT id, firstname, lastname
FROM people
WHERE firstname = 'John' AND id > 5
ORDER BY ID ASC
LIMIT 3;

This works great and the timing isn't very bad either:

Limit (cost=0.00..3.37 rows=3 width=17) (actual time=0.046..0.117 rows=3 loops=1)
  -> Seq Scan on people (cost=0.00..4494.15 rows=4000 width=17) (actual time=0.044..0.114 rows=3 loops=1)
       Filter: ((id > 5) AND (firstname = 'John'::text))
       Rows Removed by Filter: 384
Planning time: 0.148 ms
Execution time: 0.147 ms

Although, if the user, on the other hand, would like to return to the previous page, things look a bit different. First, I'd pass the id for the first row and then put a minus sign in front of it to indicate that I should select the rows with an id that's less than the (positive) p parameter. Namely, if the id for the first row is 6, the p parameter would be -6. Similarly, my query would look like the following:

SELECT * FROM (
  SELECT id, firstname, lastname
  FROM people
  WHERE firstname = 'John' AND id

Limit (cost=4252.73..4252.73 rows=1 width=17) (actual time=194.460..194.460 rows=0 loops=1)
  -> Sort (cost=4252.73..4252.73 rows=1 width=17) (actual time=194.459..194.459 rows=0 loops=1)
       Sort Key: people.id DESC
       Sort Method: quicksort Memory: 25kB
       -> Gather (cost=1000.00..4252.72 rows=1 width=17) (actual time=194.448..212.010 rows=0 loops=1)
            Workers Planned: 1
            Workers Launched: 1
            -> Parallel Seq Scan on people (cost=0.00..3252.62 rows=1 width=17) (actual time=18.132..18.132 rows=0 loops=2)
                 Filter: ((id < 13) AND (firstname = 'John'::text))
                 Rows Removed by Filter: 100505
Planning time: 0.116 ms
Execution time: 212.057 ms

With this being said, I appreciate that you have taken the time to read this far, and my question is, **how can I make the pagination more efficient?**
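For reference, the pattern usually used for the backwards direction is to seek in reverse with a flipped ORDER BY and LIMIT in a subquery, then restore the display order outside; a sketch with placeholder values, not a verified rewrite of the query above:

SELECT * FROM (
    SELECT id, firstname, lastname
    FROM people
    WHERE firstname = 'John' AND id < 6
    ORDER BY id DESC
    LIMIT 3
) AS prev_page
ORDER BY id ASC;

An index on (firstname, id) is what typically lets both directions run as index seeks instead of the sequential scans shown in the plans.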
David (123 rep)
Oct 10, 2018, 03:47 PM • Last activity: Feb 10, 2024, 12:15 AM
1 vote
1 answer
614 views
Prevent duplicated data when paginated through records that has the same value in MySQL
I have a MySQL database, for example:

| ID | Title | Purchase_At |
| -------- | -------------- | ------ |
| 1 | Title A | 2023-12-01 |
| 2 | Title B | 2023-08-22 |
| 3 | Title C | 2023-12-01 |
| 4 | Title D | 2023-08-23 |
| 5 | Title E | 2023-12-01 |
| 6 | Title F | 2023-06-22 |
| 7 | Title G | 2023-12-01 |
| 8 | Title H | 2023-08-02 |

I'm building infinite loading in both directions. Say my initial retrieval is SELECT * FROM table ORDER BY Purchase_At DESC LIMIT 3 and it returns ID 1, 3 and 5. If I want to load anything before ID 1, do I do SELECT * FROM table WHERE Purchase_At '2023-12-01' ORDER BY Purchase_At DESC LIMIT 3? As you can see, there is a chance that I might encounter repeated data, since the ordering of multiple records with the same value isn't reliable. I can't do WHERE ID 5 either. I can't use a pagination query like LIMIT or OFFSET either, because during user viewing new records will be added, and that'll mess up the pagination.

Purchase_At is something the user entered via a datepicker for the receipt date, and that's it; there is no millisecond etc... Year, month and day is all I got. In my case, there could be a user with 10 receipts on the same day. If I were to paginate it 3 per page, how can I make sure my infinite loading doesn't produce repeated data?

---

To make my question easier to understand, think of a FB chat system, where users scroll down to load new messages and scroll up to load old messages. How can I achieve the same when my sort column can have multiple records with the same Purchase_At value, with just year, month and day?
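A sketch of the usual fix: carry both the Purchase_At and the ID of the boundary row, and use the primary key purely as a tie-breaker so rows sharing a date still have a stable position. The table name, the scroll direction, and the literal values here are placeholders/assumptions:

SELECT * FROM receipts
WHERE Purchase_At < '2023-12-01'
   OR (Purchase_At = '2023-12-01' AND ID < 1)
ORDER BY Purchase_At DESC, ID DESC
LIMIT 3;

For the opposite scroll direction the comparisons and the ORDER BY both flip, and newly inserted rows no longer shift the boundary the way OFFSET does.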
Gummi (13 rep)
Jan 14, 2024, 04:27 AM • Last activity: Jan 15, 2024, 11:56 PM
0 votes
1 answer
171 views
Optimized count for large table using triggers, views, or external cache
I have a public API method that calls a Postgres (14) database and returns a paginated list of rows belonging to a user, along with a total count and page index. The count is very costly to perform (according to pg_stat_statements) and I wish to optimize it. Would creating a trigger that executes on insert/delete of the table and adjusts a count for each user be a conventional way of solving this issue? Or should I consider a view or a simple external cache like Redis? Of note: the table has very high write and read rates.
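A minimal sketch of the trigger-maintained counter idea; every name here (the orders table, the user_id column, the counts table) is an assumption, and contention on the per-user counter row under a very high write rate is the usual caveat:

CREATE TABLE user_row_counts (
    user_id   bigint PRIMARY KEY,
    row_count bigint NOT NULL DEFAULT 0
);

CREATE FUNCTION maintain_user_row_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO user_row_counts (user_id, row_count)
        VALUES (NEW.user_id, 1)
        ON CONFLICT (user_id)
        DO UPDATE SET row_count = user_row_counts.row_count + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE user_row_counts
        SET row_count = row_count - 1
        WHERE user_id = OLD.user_id;
    END IF;
    RETURN NULL;  -- AFTER row trigger, the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_user_row_count
AFTER INSERT OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION maintain_user_row_count();

An approximate count (reltuples-based, or cached in Redis with a TTL) is the common alternative when the number does not have to be exact.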
JackMahoney (101 rep)
Jan 7, 2024, 09:01 AM • Last activity: Jan 7, 2024, 01:38 PM
2 votes
1 answer
5909 views
Optimal way to get a total count of rows in a paged query in Postgres?
I need to improve the performance of a paged query for customer orders in a Type2 Postgres db (always insert a new record with a new ID; the newest ID is the current version of the record). Changing away from Type2 is not an option at this time. What I have is two queries with the same CTE in both:
WITH customer_orders AS (
    select id, order_id, customer_id,
    "name", country, state, county, source_system, 
    is_deleted, created_at, updated_at, deleted_at,
    created_by, updated_by, deleted_by, 
    rank() over (partition by order_id order by id desc) as entity_rank 
    from orders WHERE customer_id = $1 and is_deleted= $2
  )
SELECT * FROM customer_orders where entity_rank = 1 ORDER BY id DESC LIMIT $3 OFFSET $4;


WITH customer_orders AS (
    select id, order_id, customer_id,
    "name", country, state, county, source_system, 
    is_deleted, created_at, updated_at, deleted_at,
    created_by, updated_by, deleted_by, 
    rank() over (partition by order_id order by id desc) as entity_rank 
    from orders WHERE customer_id = $1 and is_deleted= $2
    )
SELECT count(id) FROM customer_orders where entity_rank = 1;
But I wonder if there's a better way to do this: can I select from the CTE twice, once for the paging (limit + offset) and once for the total number of records? I'll be running this as two separate queries from a Node process. It seems like it should be doable in one query, but I can't get it. Indexes: id (PK), order_id, customer_id, is_deleted (one on each of those).
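For what it's worth, one common way to fold both into a single statement is a window count over the filtered CTE, so the total rides along with every page row; this is just a sketch reusing the question's own CTE and parameters:

WITH customer_orders AS (
    select id, order_id, customer_id,
    "name", country, state, county, source_system,
    is_deleted, created_at, updated_at, deleted_at,
    created_by, updated_by, deleted_by,
    rank() over (partition by order_id order by id desc) as entity_rank
    from orders WHERE customer_id = $1 and is_deleted= $2
  )
SELECT *, count(*) OVER () AS total_rows
FROM customer_orders
WHERE entity_rank = 1
ORDER BY id DESC LIMIT $3 OFFSET $4;

The window count is evaluated after the entity_rank filter but before LIMIT/OFFSET, so total_rows is the full filtered count on every returned row.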
jcollum (229 rep)
Aug 18, 2023, 11:26 PM • Last activity: Aug 29, 2023, 08:07 PM
0 votes
1 answer
887 views
Can you make Postgres execute ORDER BY after OFFSET and LIMIT?
Consider this pagination scenario:

| id | name | last_name |
|--------:|----------:|--------:|
| 1 | Mickey | Mouse |
| 2 | Donald | Duck |
| 3 | Scrooge | McDuck |
| 4 | Minerva | Mouse |
| 5 | Goofus | Dawg |
| 6 | Daisy | Duck |
SELECT * FROM users
ORDER BY id DESC
LIMIT 3;
The result set is going to be:

| id | name | last_name |
|--------:|----------:|--------:|
| 6 | Daisy | Duck |
| 5 | Goofus | Dawg |
| 4 | Minerva | Mouse |

It may make sense from SQL's standpoint, but it makes little sense from the human standpoint. After all, the query was meant as "give me the first page of three in reversed order". "The first page" is clearly the *first* three rows of the table, not the *last* three rows, so the human way of executing that query would result in:

| id | name | last_name |
|--------:|----------:|--------:|
| 3 | Scrooge | McDuck |
| 2 | Donald | Duck |
| 1 | Mickey | Mouse |

The original result set, on the other hand, would be a better fit for this query, "give me the *second* page of three in reversed order":
SELECT * FROM users
ORDER BY id DESC
LIMIT 3
OFFSET 3;
Can you make Postgres execute ORDER BY after OFFSET and LIMIT?
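Postgres applies ORDER BY before LIMIT/OFFSET within a single query level, so the usual way to get the "human" result is to pick the page at one level and re-order it at another; a sketch of that wrapper for the first page:

SELECT * FROM (
    SELECT * FROM users
    ORDER BY id ASC
    LIMIT 3
) AS first_page
ORDER BY id DESC;

The inner query chooses which three rows form the page; the outer ORDER BY only re-sorts those three for display.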
Sergey Zolotarev (243 rep)
Jun 4, 2023, 10:20 PM • Last activity: Jun 5, 2023, 02:03 AM
2 votes
1 answer
264 views
How can I paginate when ordering by `date_bin`?
I have the following query
SELECT u.update_time, about_me
FROM users u
ORDER BY date_bin('14 days', u.update_time, '2023-04-07 23:11:56.471560Z') DESC, LENGTH(u.about_me) DESC, u.user_id;
I get the following:

| update_time | about_me |
| -------- | -------------- |
| 2023-04-06 19:59:56.771388 +00:00 | Hello! How are you? |
| 2023-04-02 03:31:09.833925 +00:00 | Hello!!! |
| 2023-04-06 00:36:26.822102 +00:00 | Hello! |
| 2023-04-05 19:16:20.968274 +00:00 | Hey! |

I now want to get only everything after the 3rd row. So I would do the following:
SELECT u.update_time, about_me
FROM users u
WHERE (date_bin('14 days', u.update_time, '2023-04-07 23:11:56.471560Z'), LENGTH(u.about_me)) <
      ('2023-04-07 03:05:24.990233 +00:00', 6)
ORDER BY date_bin('14 days', u.update_time, '2023-04-07 23:11:56.471560Z') DESC, LENGTH(u.about_me) DESC, u.user_id;
But the issue is that I'm still getting the exact same results; it's as if the WHERE isn't working. How can I paginate the query?
DanMossa (145 rep)
Apr 7, 2023, 06:31 AM • Last activity: Apr 11, 2023, 12:26 AM
0 votes
1 answer
657 views
Best way of SELECT on large table with filter to filter out around 1m rows with indexes
I have a table billing_billcycleorders that contains a foreign key billing_cycle_id. There are around 0.9M records for a particular billing_cycle_id. I want to select the data in chunks of ~5000 (or any best way possible). The query time increases exponentially for values of billing_cycle_id with a higher number of records. I have run EXPLAIN ANALYZE on the query:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, format text) 
SELECT DISTINCT 
    "billing_billcycleorders"."id", 
    "billing_billcycleorders"."created", 
    "billing_billcycleorders"."updated", 
    "billing_billcycleorders"."deleted", 
    "billing_billcycleorders"."deleted_date", 
    "billing_billcycleorders"."billing_cycle_id", 
    "billing_billcycleorders"."order_id", 
    "billing_billcycleorders"."pickrr_awb", 
    "billing_billcycleorders"."amount", 
    "billing_billcycleorders"."delivery_bill", 
    "billing_billcycleorders"."rto_bill", 
    "billing_billcycleorders"."pickup_bill", 
    "billing_billcycleorders"."surcharge", 
    "billing_billcycleorders"."cod_bill", 
    "billing_billcycleorders"."qc_bill", 
    "billing_billcycleorders"."qcf_bill", 
    "billing_billcycleorders"."secure_shipment_charge", 
    "billing_billcycleorders"."cod_amount", 
    "billing_billcycleorders"."meta_details" 
FROM 
    "billing_billcycleorders" 
WHERE 
    (
        "billing_billcycleorders"."billing_cycle_id" = 685081 AND "billing_billcycleorders"."id" > 0
    ) 
ORDER BY "billing_billcycleorders"."id" ASC 
LIMIT 1000
Explain Analyze: https://explain.depesz.com/s/k9rq7
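One thing usually checked for this access pattern (a sketch; the index name is an assumption): a composite index matching the filter plus the ORDER BY, so each chunk becomes a single range scan instead of a filter followed by a sort:

CREATE INDEX billing_billcycleorders_cycle_id_idx
    ON billing_billcycleorders (billing_cycle_id, id);

With that in place, each subsequent chunk would typically change the id > 0 bound to the last id of the previous chunk rather than relying on larger offsets.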
sirajalam049 (353 rep)
Dec 24, 2022, 12:36 PM • Last activity: Dec 25, 2022, 02:25 AM
5 votes
1 answer
1788 views
Inconsistent keyset pagination when using (timestamp, uuid) fields
I am using the keyset pagination method for uuids on my Postgres database, described in this post:

- https://dba.stackexchange.com/questions/267794/how-to-do-pagination-with-uuid-v4-and-created-time-on-concurrent-inserted-data

However, I have noticed that when I have two records where the date is the same, rows are being skipped from the result. For example, when I run the query:
SELECT id, created_at FROM collection
ORDER BY created_at DESC, id DESC
I get the records back as I expect them, with created_at being the primary order, then id acting as a tiebreaker:

|id|created_at|
|--|----------|
|e327847a-7058-49cf-bd91-f562412aedd9|2022-05-23 23:07:22.592|
|d35c6bb8-06dd-4b86-b5c6-d123340520e2|2022-05-23 23:07:22.592|
|5167cf95-953f-4f7b-9881-03ef07adcf3c|2022-05-23 23:07:22.592|
|d14f48dc-df22-4e98-871a-a14a91e8e3c1|2022-05-23 23:07:21.592|

However, when I run a query to paginate through, like:
SELECT id, created_at
FROM collection
WHERE (created_at, id) < ('2022-05-23 23:07:22.592','d35c6bb8-06dd-4b86-b5c6-d123340520e2')
ORDER BY created_at DESC, id DESC
LIMIT 3
I would expect to get back the last two records, but my result set is instead:

|id|created_at|
|--|----------|
|d14f48dc-df22-4e98-871a-a14a91e8e3c1|2022-05-23 23:07:21.592|

I've also tried some variations on the query to try to fix it, such as:
SELECT id, created_at
FROM collection
WHERE created_at < '2022-05-23 23:07:22.592' OR
     (created_at = '2022-05-23 23:07:22.592' AND id < 'd35c6bb8-06dd-4b86-b5c6-d123340520e2')
ORDER BY created_at DESC, id DESC
But I still get back the same result set. What's going on with my query?
Daniel (51 rep)
May 24, 2022, 05:53 PM • Last activity: May 24, 2022, 07:53 PM
14 votes
1 answer
837 views
Why am I seeing key lookups for all rows read, instead of all rows matching the where clause?
I have a table such as the following:
create table [Thing]
(
	[Id] int constraint [PK_Thing_Id] primary key,
	[Status] nvarchar(20),
	[Timestamp] datetime2,
	[Foo] nvarchar(100)
)
with a non-clustered, non-covering index on the Status and Timestamp fields:
create nonclustered index [IX_Status_Timestamp] on [Thing] ([Status], [Timestamp] desc)
If I query for a 'page' of these rows, using offset/fetch as follows,
select * from [Thing]
where Status = 'Pending'
order by [Timestamp] desc
offset 2000 rows
fetch next 1000 rows only
I understand that the query will need to read a total of 3000 rows to find the 1000 that I'm interested in. I would then expect it to perform key lookups for each of those 1000 rows to fetch the fields not included in the index. However, the execution plan indicates that it is doing key lookups for all 3000 rows. I don't understand why, when the only criteria (filter by [Status] and order by [Timestamp]) are both in the index.

If I rephrase the query with a CTE, as follows, I get more or less what I expected the first query to do:
with ids as
(
	select Id from [Thing]
	where Status = 'Pending'
	order by [Timestamp] desc
	offset 2000 rows
	fetch next 1000 rows only
)

select t.* from [Thing] t
join ids on ids.Id = t.Id
order by [Timestamp] desc
Some statistics from SSMS to compare the 2 queries:

| | Original | With CTE |
|---------------|----------|----------|
| Logical reads | 12265 | 4140 |
| Subtree cost | 9.79 | 3.33 |
| Memory grant | 0 | 3584 KB |

The CTE version seems 'better' at first glance, although I don't know how much weight to place on the fact that it incurs a memory grant for a worktable. (The messages from set statistics io on indicate that there were zero reads of any kind on the worktable.)

Am I wrong in saying that the first query should be able to isolate the relevant 1000 rows first (even though that requires reading past 2000 other rows first), and then only do key lookups on those 1000? It seems a bit odd to have to try and 'force' that behaviour with the CTE query.

(As a minor second question: I'm assuming that the last part of the CTE approach needs to do its own order by on the results of the join, even though the CTE itself had an order by, as the ordering might be lost during the join. Is this correct?)
Twicetimes (263 rep)
Feb 7, 2022, 05:35 AM • Last activity: Feb 7, 2022, 12:28 PM
Showing page 1 of 20 total questions