Filtering out duplicate domains from URL column using Postgres full-text search parsers
1
vote
1
answer
867
views
I have a PostgreSQL database containing
pages
and links
downloaded by a web crawler, with the following tables:
pages
----------
id: Integer (primary key)
url: String (unique)
title: String
text: String
html: String
last_visit: DateTime
word_pos: TSVECTOR
links
----------
id Integer (primary key)
source: String
target: String
link_text: String
UNIQUE(source,target)
crawls
---------
id: Integer (primary key)
query: String
crawl_results
-------------
id: Integer (primary key)
score: Integer (constraint 0<=score<=1)
crawl_id: Integer (foreign key, crawls.id)
page_id: Integer (foreign key, pages.id)
The source
and target
fields in the links
table contain URLs. I am running the following query to extract scored links from the top-ranking search results, for pages that haven't been fetched yet:
WITH top_results AS
(SELECT page_id, score FROM crawl_results
WHERE crawl_id=$1
ORDER BY score LIMIT 100)
SELECT top_results.score, l.target
FROM top_results
JOIN pages p ON top_results.page_id=p.id
JOIN links l on p.url=l.source
WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)
However, ***I would like to filter these results so that only one row is returned for a given domain (the one with the lowest score)***. So for instance, if I get (0.3, 'http://www.foo.com/bar ')
and (0.8, 'http://www.foo.com/zor ')
, I only want the first because it has same domain foo.com
and has the lower score.
I was able to find documentation for the builtin full text search parsers, which can parse URLS and extract the hostname. For instance, I can extract the hostname from a URL as follows:
SELECT token FROM ts_parse('default', 'http://www.foo.com ') WHERE tokid = 6;
token
-------------
www.foo.com
(1 row)
However, I can't figure out how I would integrate this into the above query to filter out duplicate domains from the results. And because this is the docs for "testing and debugging text search", I don't know if this use of ts_parse()
is even related to how the URL parser is intended to be used in practice.
***How would I use the host
parser in my query above to return one row per domain? Also, how would I appropriately index the links
table for host
and url
lookup?***
Asked by J. Taylor
(379 rep)
Apr 6, 2019, 08:34 AM
Last activity: Jun 27, 2022, 10:00 AM
Last activity: Jun 27, 2022, 10:00 AM