Is it even possible to create a scalable rhyming dictionary for 10 million words in a single language like English?

1 vote
0 answers
39 views
I'm going in circles brainstorming ideas and TypeScript or SQL code to implement what is basically a "rhyming database". The goal of the rhyming database is to find rhymes for all words, not just exact rhymes but "nearby or close rhymes" too (as in rap music, etc.). Here are the facts and features:

1. Estimate 10 million English words for now (but realistically I'm thinking about doing this for ~30 languages).
2. I think the rhyme counts would follow something like a reverse exponential curve (let's just imagine): short words have many rhymes, and the count tapers off as words get longer.
3. We will only support up to 3 syllables of word-end rhyming.
4. Don't worry about the system for capturing the phonetic information of words; we can use something like the [CMU pronunciation format/dictionary](https://en.wikipedia.org/wiki/CMU_Pronouncing_Dictionary#Database_format) (a sketch of extracting a "rhyme tail" from that format is at the end of this post). I have a system for computing phonetic information (too involved to describe in this post).
5. In a bad (though not worst) case, say there are 1,000 rhymes for every word; that is 10m × 1k = 10 billion relationships between words. At 10,000 rhymes per word, that is 100 billion, so the database might start running into scaling problems.
6. Ideally, we compute a "similarity score", _comparing each word to every other word_ (cross product), and have a threshold a pair must score above to count as a rhyme.
7. We then sort by the similarity score.
8. We allow _pagination_ based on an input pronunciation text, and you can jump to specific pages in the rhyme query results.

Well, all these features together seem like an impossible ask so far: **pagination**, **complex/robust similarity scoring** (not just hacky, extremely simplified SQL functions for calculating basic scores, but advanced cosine-similarity scoring, sketched at the end of this post, or even more custom logic taking the sound sequences in each word into account), **10 million words**, **up to 3 syllables of rhyming**, **fast query time**, and ideally not requiring a huge memory-intensive server.

I have been essentially cycling through 5 or 10 solutions to this problem with ClaudeAI (either backed by SQL, or just in-memory), but it can't seem to solve all these problems at once; it always leaves one key problem unsolved, so nothing works end to end.

- The first solution was in-memory: for every word, compute a robust vector similarity score based on the pronunciation/phonemes/sounds of each word, cross-product style. This seems like the ideal solution (it would give you 100% accurate results), but it won't scale, because 10m × 10m is trillions of pairs and beyond, and especially not possible in the DB. By precomputing all similarities between every pair of words, search is easy: there is a map from input to an array of rhymes, already sorted by score. Pagination is easy too. But it won't scale.
- The next "solution" was a SQL version, with an **extremely primitive** `phoneme_similarity` SQL function. A query for all rhymes would then be something like:

```ts
const query = `
  WITH scored_rhymes AS (
    SELECT
      w.word,
      (
        phoneme_similarity(w.last_vowel, ?) * 3
        + phoneme_similarity(w.penultimate_vowel, ?) * 1.5
        + CASE WHEN w.final_consonant_cluster = ? THEN 2 ELSE 0 END
        + CASE WHEN substr(w.stress_pattern, -1) = substr(?, -1) THEN 1 ELSE 0 END
        + CASE WHEN substr(w.stress_pattern, -2) = substr(?, -2) THEN 0.5 ELSE 0 END
      ) AS score
    FROM words w
    WHERE w.word != ?
      AND w.last_vowel = ?
  )
  SELECT word, score
  FROM scored_rhymes
  WHERE score > 0
  ORDER BY score DESC, word
  LIMIT ? OFFSET ?
`;
```

While it seems to handle pagination, the scoring logic is severely lacking.
This won't give quality rhyme results; we need much more advanced phonetic sequence clustering and scoring logic. But it would scale, as there is just a single `words` table with some phonetic columns. It's just not accurate/robust enough, scoring- and rhyming-wise.

- A third solution it came up with did the advanced scoring, but _after_ it made a DB query (DB-level pagination). This does not result in quality pagination, because a page's worth of words is fetched based on non-scored data, scores are then computed on that subset in memory, and only then are they sorted. This is completely inaccurate.
- Then, for the fourth solution, after admitting the others didn't meet all the constraints/criteria, it produced a SQL version that stores the cross product of every word pair, precomputing the score! Again, we already did that in memory, and it definitely won't scale to storing 10m × 10m links in the DB.

So it is basically cycling through these answers with small variations that don't meaningfully improve the solution.

_BTW, using AI to help think through this has gotten me way deeper into the weeds of solving this problem and making it a reality. I can think for days and weeks about a problem like this on my own, reading a couple of papers, browsing a few GitHub repos... and then I think in my head, "oh yeah, I've got something that is fast, scalable, and quality". Yeah right, haha. Learning through AI back-and-forth helps get to working data structures and algorithms, and brings new insights and pros/cons lists to my attention that I would otherwise not have figured out in a timely manner._

So my question, after a few days of working on this rhyming dictionary idea, is: is there a way to solve this such that all the constraints of the system are satisfied (pagination / scoring / 10m words / 3 syllables / fast queries / scalable)?

An [answer to my StackOverflow question](https://stackoverflow.com/questions/79101873/how-to-build-a-trie-for-finding-exact-phonetic-matches-sorted-globally-by-weigh/79102113?noredirect=1#comment139481345_79102113) about finding phonetic matches suggested I use a [succinct indexable dictionary](https://en.wikipedia.org/wiki/Succinct_data_structure#Succinct_indexable_dictionaries), or even the [Roaring compressed bitmap](https://roaringbitmap.org/) data structure. But from my understanding so far, this still requires computing the cross product and scoring; it just might save on some memory. And I don't know whether it could efficiently store trillions or quadrillions of associations (even in memory, on a large machine).

So I'm at a loss. Is it impossible to solve my problem as described? If so, what should I cut to make it solvable: which constraints/desires should I drop, or where else can I cut corners?

_I tagged this as PostgreSQL because that's what I'm using for the app in general, if that helps._
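For concreteness, here is roughly what I mean in point 4 about working from CMU-format pronunciations: a small TypeScript sketch (the function names are mine and purely illustrative) that extracts the up-to-3-syllable "rhyme tail" that two words would be compared on.

```ts
// Sketch: extract the rhyme tail (everything from the 3rd-to-last vowel
// onward) from a CMU-format pronunciation like "K AH0 M P Y UW1 T ER0".
// Helper names are my own, not from any library.

// In the CMU format, vowels carry a stress digit (0/1/2), e.g. "UW1";
// consonants do not, so a trailing digit identifies a vowel.
const isVowel = (phone: string): boolean => /[012]$/.test(phone);

function rhymeTail(pronunciation: string, maxSyllables = 3): string[] {
  const phones = pronunciation.trim().split(/\s+/);
  let vowelsSeen = 0;
  // Walk backwards until we have passed maxSyllables vowels.
  for (let i = phones.length - 1; i >= 0; i--) {
    if (isVowel(phones[i])) {
      vowelsSeen++;
      if (vowelsSeen === maxSyllables) return phones.slice(i);
    }
  }
  return phones; // word has fewer syllables: the whole word is the tail
}

// rhymeTail("K AH0 M P Y UW1 T ER0") -> ["AH0","M","P","Y","UW1","T","ER0"]
```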
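And this is the flavor of scoring I mean by "advanced cosineSimilarity scoring", as opposed to the primitive `phoneme_similarity` SQL function above. The encoder here is a toy bag-of-phones placeholder (my real phonetic system is too involved to post); only the overall shape matters.

```ts
// Sketch: score two rhyme tails by cosine similarity of feature vectors.
// A real encoder would use articulatory features (vowel height, voicing,
// place of articulation, ...); this toy one just counts phones, weighting
// phones nearer the word end more heavily.

const PHONES = ["AA","AE","AH","AO","AW","AY","B","CH","D","DH","EH","ER",
  "EY","F","G","HH","IH","IY","JH","K","L","M","N","NG","OW","OY","P","R",
  "S","SH","T","TH","UH","UW","V","W","Y","Z","ZH"];

function encode(tail: string[]): number[] {
  const v = new Array(PHONES.length).fill(0);
  tail.forEach((phone, i) => {
    const base = phone.replace(/[012]$/, ""); // strip the stress digit
    const idx = PHONES.indexOf(base);
    if (idx >= 0) v[idx] += (i + 1) / tail.length; // later phones weigh more
  });
  return v;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

const RHYME_THRESHOLD = 0.8; // the tunable cutoff from point 6

function isRhyme(tailA: string[], tailB: string[]): boolean {
  return cosineSimilarity(encode(tailA), encode(tailB)) >= RHYME_THRESHOLD;
}
```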
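Finally, here is the shape of the first (non-scaling) solution's lookup, to show why precomputing everything makes pagination trivial; it's exactly this query-time simplicity I'd like to keep at 10m words.

```ts
// Sketch of solution 1: every word maps to its rhymes, presorted by score.
// Lookup and pagination are O(page size), but building the index is an
// O(n^2) pass over all word pairs, which is what kills it at 10m words.

type Scored = { word: string; score: number };

// word -> rhymes sorted by descending score (filled by the O(n^2) pass)
const rhymeIndex = new Map<string, Scored[]>();

function getRhymePage(word: string, page: number, pageSize: number): Scored[] {
  const all = rhymeIndex.get(word) ?? [];
  return all.slice(page * pageSize, (page + 1) * pageSize);
}
```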
Asked by Lance Pollard (221 rep)
Oct 19, 2024, 06:08 AM
Last activity: Oct 19, 2024, 06:14 AM