How do I deduplicate Drupal taxonomies on a PostgreSQL database?
# Background
I've been tasked with deduplicating a Drupal 7 install. There are about 15 thousand articles published, which have been imported directly into the database using some custom in-house migration script, rather than via the usual Drupal workflow.
Every article has a certain taxonomy item associated with it. There are around 315 different possible values for this taxonomy, and it is *always* present on all articles, having been inserted using the migration script.
# Issues
However, it appears that some data duplication happened during the initial import. There are 15020 rows in the `node` table, and 11751 rows in the `taxonomy_term_data` table when filtered by the `vid` of this taxonomy (2).
I ran the following SQL query to try to find repetitions:

    SELECT name, count(name) AS repetitions
    FROM taxonomy_term_data
    WHERE vid = 2
    GROUP BY name
    ORDER BY repetitions DESC, name;

The result shows 335 rows, which is, coincidentally, the expected number of unique terms. The "repetitions" column shows that nearly every term appears dozens or even hundreds of times.
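To put a single number on the duplication, a summary query along these lines could be used. It is a minimal sketch that only assumes the `taxonomy_term_data` columns listed further down; the column aliases are illustrative.

```sql
-- Summarise the duplication in vocabulary 2 in one row.
SELECT count(*)                        AS total_rows,
       count(DISTINCT name)            AS distinct_names,
       count(*) - count(DISTINCT name) AS surplus_rows
FROM taxonomy_term_data
WHERE vid = 2;
```

With the figures quoted above (11751 rows, 335 distinct names), `surplus_rows` would come out as 11416.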
# Database structure
The `node` table contains the articles, the `taxonomy_term_data` table holds the taxonomy terms, and the `taxonomy_index` table holds the relationship between them.
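As an illustration of how the three tables fit together, the following sketch joins each article to its term through `taxonomy_index`. It uses only the columns shown in the table definitions in this question; the `LIMIT` is there purely to keep the sample small.

```sql
-- List a sample of articles together with the term each one points to.
SELECT n.nid, n.title, t.tid, t.name
FROM node AS n
JOIN taxonomy_index     AS ti ON ti.nid = n.nid
JOIN taxonomy_term_data AS t  ON t.tid  = ti.tid
WHERE t.vid = 2
ORDER BY n.nid
LIMIT 20;
```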
## `node` table

    nid serial NOT NULL, -- The primary identifier for a node.
    vid bigint, -- The current node_revision.vid version identifier.
    type character varying(32) NOT NULL DEFAULT ''::character varying, -- The node_type.type of this node.
    language character varying(12) NOT NULL DEFAULT ''::character varying, -- The languages.language of this node.
    title character varying(255) NOT NULL DEFAULT ''::character varying, -- The title of this node, always treated as non-markup plain text.
    uid integer NOT NULL DEFAULT 0, -- The users.uid that owns this node; initially, this is the user that created it.
    status integer NOT NULL DEFAULT 1, -- Boolean indicating whether the node is published (visible to non-administrators).
    created integer NOT NULL DEFAULT 0, -- The Unix timestamp when the node was created.
    changed integer NOT NULL DEFAULT 0, -- The Unix timestamp when the node was most recently saved.
    comment integer NOT NULL DEFAULT 0, -- Whether comments are allowed on this node: 0 = no, 1 = closed (read only), 2 = open (read/write).
    promote integer NOT NULL DEFAULT 0, -- Boolean indicating whether the node should be displayed on the front page.
    sticky integer NOT NULL DEFAULT 0, -- Boolean indicating whether the node should be displayed at the top of lists in which it appears.
    tnid bigint NOT NULL DEFAULT 0, -- The translation set id for this node, which equals the node id of the source post in each set.
    translate integer NOT NULL DEFAULT 0, -- A boolean indicating whether this translation page needs to be updated.
    CONSTRAINT node_pkey PRIMARY KEY (nid),
    CONSTRAINT node_vid_key UNIQUE (vid),
    CONSTRAINT node_nid_check CHECK (nid >= 0),
    CONSTRAINT node_tnid_check CHECK (tnid >= 0),
    CONSTRAINT node_vid_check CHECK (vid >= 0)

## `taxonomy_term_data` table

    tid serial NOT NULL, -- Primary Key: Unique term ID.
    vid bigint NOT NULL DEFAULT 0, -- The taxonomy_vocabulary.vid of the vocabulary to which the term is assigned.
    name character varying(255) NOT NULL DEFAULT ''::character varying, -- The term name.
    description text, -- A description of the term.
    format character varying(255), -- The filter_format.format of the description.
    weight integer NOT NULL DEFAULT 0, -- The weight of this term in relation to other terms.
    CONSTRAINT taxonomy_term_data_pkey PRIMARY KEY (tid),
    CONSTRAINT taxonomy_term_data_tid_check CHECK (tid >= 0),
    CONSTRAINT taxonomy_term_data_vid_check CHECK (vid >= 0)

## `taxonomy_index` table

    nid bigint NOT NULL DEFAULT 0, -- The node.nid this record tracks.
    tid bigint NOT NULL DEFAULT 0, -- The term ID.
    sticky smallint DEFAULT 0, -- Boolean indicating whether the node is sticky.
    created integer NOT NULL DEFAULT 0, -- The Unix timestamp when the node was created.
    weight integer NOT NULL DEFAULT 0, -- A user-defined weight for each node in its respective category.
    CONSTRAINT taxonomy_index_nid_check CHECK (nid >= 0),
    CONSTRAINT taxonomy_index_tid_check CHECK (tid >= 0)

# What I need
- A way to determine whether any of the repeated `taxonomy_term_data` rows are actually referenced in `taxonomy_index`.
- A way to, if necessary, set all occurrences in `taxonomy_index` to point to just one of each repeated `taxonomy_term_data` row.
- Finally, a way to delete all the `taxonomy_term_data` entries not in use in `taxonomy_index`.
I suppose one or more well-written queries would do the trick, but my SQL knowledge is terribly low.
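One possible shape for those three steps is sketched below. It is untested against this database and rests on several assumptions: the temporary table `term_keep` and its columns `old_tid` and `keep_tid` are illustrative names, keeping the lowest `tid` per name is an arbitrary choice, and only `taxonomy_index` is assumed to reference the terms. A stock Drupal 7 site normally also stores term references in per-field tables (`field_data_<field_name>` and `field_revision_<field_name>`) and a row per term in `taxonomy_term_hierarchy`; those are not handled here and would need the same treatment if the migration script populated them.

```sql
-- Step 1: per term name, compare how many tid values exist with how many
-- of them are actually referenced from taxonomy_index.
SELECT t.name,
       count(DISTINCT t.tid)  AS term_rows,
       count(DISTINCT ti.tid) AS referenced_term_rows
FROM taxonomy_term_data AS t
LEFT JOIN taxonomy_index AS ti ON ti.tid = t.tid
WHERE t.vid = 2
GROUP BY t.name
HAVING count(DISTINCT t.tid) > 1
ORDER BY term_rows DESC, t.name;

BEGIN;

-- Map every term in vocabulary 2 to the lowest tid that shares its name.
-- (term_keep, old_tid and keep_tid are hypothetical names.)
CREATE TEMPORARY TABLE term_keep ON COMMIT DROP AS
SELECT tid AS old_tid,
       min(tid) OVER (PARTITION BY name) AS keep_tid
FROM taxonomy_term_data
WHERE vid = 2;

-- Step 2: repoint index rows that reference a duplicate to the kept term.
UPDATE taxonomy_index AS ti
SET tid = k.keep_tid
FROM term_keep AS k
WHERE ti.tid = k.old_tid
  AND k.old_tid <> k.keep_tid;

-- Step 3: delete the duplicate terms, guarded so that nothing still
-- referenced from taxonomy_index is removed.
DELETE FROM taxonomy_term_data AS t
USING term_keep AS k
WHERE t.tid = k.old_tid
  AND k.old_tid <> k.keep_tid
  AND NOT EXISTS (
      SELECT 1 FROM taxonomy_index AS ti WHERE ti.tid = t.tid
  );

-- Inspect the results first; switch to COMMIT once they look right.
ROLLBACK;
```

This variant keeps one row per term name even if no article uses it, which is slightly more conservative than deleting every unreferenced term. Since `taxonomy_index` has no unique constraint in the definition above, an article that pointed at two duplicates of the same term would end up with two identical rows after step 2. Taking a database backup beforehand and clearing Drupal's caches afterwards would be sensible.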
Asked by That Brazilian Guy
(111 rep)
Sep 30, 2016, 09:00 PM
Last activity: Oct 1, 2016, 12:54 PM