Filtering out duplicate domains from URL column using Postgres full-text search parsers

I have a PostgreSQL database containing pages and links downloaded by a web crawler, with the following tables:

    pages
    ----------
    id:          Integer (primary key)
    url:         String  (unique)
    title:       String
    text:        String
    html:        String
    last_visit:  DateTime
    word_pos:    TSVECTOR
    
    
    links
    ----------
    id:        Integer (primary key)
    source:    String
    target:    String  
    link_text: String
    UNIQUE(source,target)
    
    
    crawls
    ---------
    id:         Integer (primary key)
    query:      String
    
    
    crawl_results
    -------------
    id:       Integer (primary key)
    score:    Integer (constraint 0<=score<=1)
    crawl_id: Integer (foreign key, crawls.id)
    page_id:  Integer (foreign key, pages.id)

The source and target fields in the links table contain URLs. I am running the following query to extract scored links from the top-ranking search results, for pages that haven't been fetched yet:

    WITH top_results AS 
        (SELECT page_id, score FROM crawl_results 
         WHERE crawl_id=$1 
         ORDER BY score LIMIT 100)
    SELECT top_results.score, l.target
    FROM top_results 
        JOIN pages p ON top_results.page_id=p.id
        JOIN links l ON p.url=l.source 
    WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)

However, ***I would like to filter these results so that only one row is returned for a given domain (the one with the lowest score)***. For instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8, 'http://www.foo.com/zor'), I only want the first, because both share the domain foo.com and the first has the lower score.

I was able to find documentation for the built-in full-text search parsers, which can parse URLs and extract the hostname. For instance, I can extract the hostname from a URL as follows:

    SELECT token FROM ts_parse('default', 'http://www.foo.com') WHERE tokid = 6;
    
        token    
    -------------
     www.foo.com
    (1 row)
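
For reference, the same call can be applied to every row of a set of URLs rather than to a single literal by using a lateral join. This is only a sketch: the three URLs in the `VALUES` list are made-up sample data, and it relies on token type 6 being `host` in the default parser, as in the query above.

```sql
-- Hypothetical sample URLs standing in for a real URL column.
SELECT u.url, t.token AS host
FROM (VALUES ('http://www.foo.com/bar'),
             ('http://www.foo.com/zor'),
             ('http://example.org/page')) AS u(url)
-- ts_parse() is set-returning, so it is called once per input row.
CROSS JOIN LATERAL ts_parse('default', u.url) AS t
-- Keep only 'host' tokens (tokid 6 in the default parser).
WHERE t.tokid = 6;
```

Note that the parser returns the full hostname (`www.foo.com`), not the registered domain (`foo.com`), and a URL whose path or query string embeds another hostname may yield more than one host token.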


However, I can't figure out how to integrate this into the above query to filter out duplicate domains from the results. And because that documentation is the section on "testing and debugging text search", I don't know whether this use of ts_parse() reflects how the URL parser is intended to be used in practice.

***How would I use the host parser in my query above to return one row per domain? Also, how would I appropriately index the links table for host and url lookup?***
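
One possible shape for such a query, offered as a sketch rather than a tested solution, is to wrap the parser call in a helper function and combine it with `DISTINCT ON`. The function name `url_host` and the index name `links_target_host_idx` are hypothetical. The function is declared `IMMUTABLE` so that it can be used in an expression index, which assumes the behaviour of the `default` parser never changes; the index itself only helps queries that filter on the host, not the `DISTINCT ON` step.

```sql
-- Hypothetical helper: the first 'host' token (tokid 6) that the default
-- parser finds in the URL, or NULL if it finds none.
-- IMMUTABLE is an assumption needed for the expression index below.
CREATE FUNCTION url_host(url text) RETURNS text
LANGUAGE sql IMMUTABLE AS $$
    SELECT token FROM ts_parse('default', url) WHERE tokid = 6 LIMIT 1
$$;

-- Optional expression index for lookups by host (hypothetical name).
CREATE INDEX links_target_host_idx ON links (url_host(target));

-- Original query, reduced to one row per host: DISTINCT ON keeps the first
-- row of each host group, and ORDER BY puts the lowest score first.
WITH top_results AS (
    SELECT page_id, score
    FROM crawl_results
    WHERE crawl_id = $1
    ORDER BY score
    LIMIT 100
)
SELECT DISTINCT ON (url_host(l.target))
       top_results.score, l.target
FROM top_results
    JOIN pages p ON top_results.page_id = p.id
    JOIN links l ON p.url = l.source
WHERE NOT EXISTS (SELECT 1 FROM pages pp WHERE pp.url = l.target)
ORDER BY url_host(l.target), top_results.score;
```

With this sketch, `www.foo.com` and `foo.com` count as different hosts, and all targets for which no host token is found fall into a single NULL group, so some normalisation of the hostname may still be needed.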

                        

Asked by J. Taylor (379 rep)

Apr 6, 2019, 08:34 AM
Last activity: Jun 27, 2022, 10:00 AM
