MariaDB FULLTEXT search with short / mandatory words
1
vote
1
answer
1553
views
Just checking I'm understanding this correctly:
CREATE TABLE customer (
id INT NOT NULL AUTO_INCREMENT,
name TINYTEXT NOT NULL,
PRIMARY KEY (id),
FULLTEXT (name)
) ENGINE = InnoDB;
INSERT INTO customer VALUES (1, "ABC.DEF");
INSERT INTO customer VALUES (2, "ABC_DEF");
INSERT INTO customer VALUES (3, "ABC'DEF");
INSERT INTO customer VALUES (4, "ABC.DE");
INSERT INTO customer VALUES (5, "ABC.DFF");
Where I've got
innodb_ft_min_token_size
set to 3 (default is 4).
---
When running:
SELECT
c.*,
MATCH (name) AGAINST ("+ABC +DEF" IN BOOLEAN MODE) AS m
FROM
customer AS c
Customers 1 and 3 match, because the .
and '
are seen as word separators ([annoying for O'Brien](https://dba.stackexchange.com/q/250524)) .
For customer 2, because the underscore gets the whole name treated as a single word, the "DEF" word cannot be found.
---
If I change the MATCH to "+ABC +DE"
.
1, 2, or 3 do not match because this is using a full word match ("+DE"
does not match "DEF"
).
4 does not match because... innodb_ft_min_token_size
is set to 3?
As in, the 2 letter "DE" word is not in the FULLTEXT INDEX?
---
If I change the MATCH to use asterisks (e.g. "+ABC* +DE*"
), that will use *prefix* matching.
But will only add customers 1 and 3 to the selection.
Because the 2 letter "DE" word for customer 4 is not in the FULLTEXT INDEX?
---
If I change the MATCH to use "+ABC.DE*"
, it matches all of them.
Note how they all get the same rank (even customer 5), and this is no different to "+ABC*"
, where MATCH seems to be seeing the "DE*" as a separate word, and not matching it against anything.
Whereas "+ABC DE*"
is explicitly keeping it as a separate word, and the scores are handled appropriately.
---
While the individual points make sense, I'm not sure this creates a good system.
For a bit more consistency, I'm wondering if the database should ignore short words (tokens) in the MATCH query, in the same way it does when building the FULLTEXT INDEX.
Only because I don't think "+DE"
will ever do anything useful when the min token size is 3; and it's not exactly easy for the developer to identify what the individual words in the FULLTEXT INDEX will be (i.e. to remove them).
Asked by Craig Francis
(135 rep)
Feb 5, 2021, 06:25 PM
Last activity: Jun 25, 2025, 11:01 AM
Last activity: Jun 25, 2025, 11:01 AM