Optimizing regex based query in sqlite

0 votes
1 answer
496 views
I'm using an SQLite database to store manually created labels for data automatically queried from a live system. The data from the live system consists primarily of an address made up of three parts. Let's use URLs as an example, the three parts being the protocol, the domain, and the path. Initially, I would load a couple hundred thousand addresses into a table, with each part of the address being a column and the three columns together forming the primary key. The labels are then in additional columns.
CREATE TABLE OldWebsites (
    protocol VARCHAR (255) NOT NULL,
    domain   VARCHAR (255) NOT NULL,
    path     VARCHAR (255) NOT NULL,
    label1   INTEGER,
    label2   TEXT,
    CONSTRAINT address PRIMARY KEY (
        protocol,
        domain,
        path
    )
);
I found myself repeating labels over and over based on certain patterns that the address would match. Since I constantly extend this table with new data and remove old data, this became too much of a hassle, so I tried a different approach: loading just the addresses into one table and keeping the labels in separate tables, where I write regex matchers for the address components.
CREATE TABLE Websites (
    protocol VARCHAR (255) NOT NULL,
    domain   VARCHAR (255) NOT NULL,
    path     VARCHAR (255) NOT NULL,
    CONSTRAINT address PRIMARY KEY (
        protocol,
        domain,
        path
    )
);

CREATE TABLE Label1 (
    protocol_re VARCHAR (255) NOT NULL,
    domain_re   VARCHAR (255) NOT NULL,
    path_re     VARCHAR (255) NOT NULL,
    label1   INTEGER,
    CONSTRAINT address_matcher PRIMARY KEY (
        protocol_re,
        domain_re,
        path_re
    )
);

CREATE TABLE Label2 (
    protocol_re VARCHAR (255) NOT NULL,
    domain_re   VARCHAR (255) NOT NULL,
    path_re     VARCHAR (255) NOT NULL,
    label2   TEXT,
    CONSTRAINT address_matcher PRIMARY KEY (
        protocol_re,
        domain_re,
        path_re
    )
);
Assume that I have already guaranteed (using other queries) that there is exactly one match in each label table for each address in the Websites table. I would now like to write a query that reconstructs a table like the original OldWebsites by matching the labels to the automatically queried data. Something like this:
SELECT Websites.*,
       Label1.label1,
       Label2.label2
  FROM Websites
       JOIN
       Label1 ON (Websites.protocol REGEXP '^' || Label1.protocol_re || '$' AND 
                  Websites.domain REGEXP '^' || Label1.domain_re || '$' AND 
                  Websites.path REGEXP '^' || Label1.path_re || '$') 
       JOIN
       Label2 ON (Websites.protocol REGEXP '^' || Label2.protocol_re || '$' AND 
                  Websites.domain REGEXP '^' || Label2.domain_re || '$' AND 
                  Websites.path REGEXP '^' || Label2.path_re || '$');
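For context, this is roughly how I drive the query from Python. The REGEXP function here is only a sketch using Python's built-in re module with cached compilation (in reality I load the PCRE extension), and the database file name is a placeholder:

import re
import sqlite3
from functools import lru_cache

@lru_cache(maxsize=None)
def compile_pattern(pattern):
    # Compile each distinct pattern only once per process.
    return re.compile(pattern)

def regexp(pattern, value):
    # SQLite rewrites "X REGEXP Y" as regexp(Y, X), so the pattern comes first.
    return value is not None and compile_pattern(pattern).search(value) is not None

conn = sqlite3.connect("websites.db")  # placeholder database file
conn.create_function("REGEXP", 2, regexp)

rows = conn.execute("""
    SELECT Websites.*, Label1.label1, Label2.label2
      FROM Websites
      JOIN Label1 ON Websites.protocol REGEXP '^' || Label1.protocol_re || '$'
                 AND Websites.domain   REGEXP '^' || Label1.domain_re   || '$'
                 AND Websites.path     REGEXP '^' || Label1.path_re     || '$'
      JOIN Label2 ON Websites.protocol REGEXP '^' || Label2.protocol_re || '$'
                 AND Websites.domain   REGEXP '^' || Label2.domain_re   || '$'
                 AND Websites.path     REGEXP '^' || Label2.path_re     || '$'
""").fetchall()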
Now, this is really slow, especially with more label tables, using the PCRE sqlite3 extension for the REGEXP function. I would like to know if there's a way to optimize this query, either through parallelization (the query should ideally run from Python) or by using the knowledge that there is exactly one match in each Label table. From my understanding, multiple inner joins should take at most the sum of the individual joins' time, correct? Perhaps indexes are also helpful, but I have only a basic idea of what they are and no idea whether they would be of help here.
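One idea I've toyed with is doing the matching in Python instead, one process per label table, and only joining the results at the end. This is just a rough sketch of what I mean (the database file name is a placeholder, and it exploits the one-match guarantee by breaking out of the loop after the first hit), not something I've benchmarked:

import re
import sqlite3
from concurrent.futures import ProcessPoolExecutor

DB = "websites.db"  # placeholder database file

def match_label_table(label_table, label_column):
    """Scan Websites against one label table and return {address: label}."""
    conn = sqlite3.connect(DB)
    patterns = [
        # Pre-compile every matcher once, with the same ^...$ anchoring as the SQL query.
        (re.compile(f"^{p}$"), re.compile(f"^{d}$"), re.compile(f"^{q}$"), label)
        for p, d, q, label in conn.execute(
            f"SELECT protocol_re, domain_re, path_re, {label_column} FROM {label_table}"
        )
    ]
    result = {}
    for protocol, domain, path in conn.execute(
        "SELECT protocol, domain, path FROM Websites"
    ):
        for p_re, d_re, q_re, label in patterns:
            if p_re.search(protocol) and d_re.search(domain) and q_re.search(path):
                result[(protocol, domain, path)] = label
                break  # exactly one match is guaranteed, so stop at the first hit
    return result

if __name__ == "__main__":
    # One worker per label table; each opens its own SQLite connection.
    with ProcessPoolExecutor() as pool:
        f1 = pool.submit(match_label_table, "Label1", "label1")
        f2 = pool.submit(match_label_table, "Label2", "label2")
        label1, label2 = f1.result(), f2.result()

    conn = sqlite3.connect(DB)
    rows = [
        (protocol, domain, path,
         label1[(protocol, domain, path)],
         label2[(protocol, domain, path)])
        for protocol, domain, path in conn.execute(
            "SELECT protocol, domain, path FROM Websites"
        )
    ]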
Asked by Xaser (115 rep)
Feb 4, 2021, 03:48 PM
Last activity: Feb 21, 2024, 04:01 PM