
How do I deduplicate Drupal taxonomies on a PostgreSQL database?

1 vote
0 answers
64 views
# Background

I've been tasked with deduplicating a Drupal 7 install. There are about 15 thousand published articles, which were imported directly into the database by a custom in-house migration script rather than through the usual Drupal workflow.

Every article has a certain taxonomy item associated with it. There are around 315 different possible values for this taxonomy, and it is *always* present on all articles, having been inserted by the migration script.

# Issues

However, it appears that some data duplication happened during the initial import. There are 15020 rows in the node table, and 11751 rows in the taxonomy_term_data table when filtered by the vid for this taxonomy (2).

I ran the following SQL query to try to find repetitions:

    SELECT name, count(name) AS repetitions
    FROM taxonomy_term_data
    WHERE vid = 2
    GROUP BY name
    ORDER BY repetitions DESC, name;

The result shows 335 rows, which is, coincidentally, the expected amount of unique terms. The "repetitions" column indicates that nearly every term shows up dozens or even hundreds of times.

# Database structure

The table node contains the articles, the table taxonomy_term_data holds the taxonomy terms, and the table taxonomy_index holds the relationship between the two.

## node table

    nid serial NOT NULL, -- The primary identifier for a node.
    vid bigint, -- The current node_revision.vid version identifier.
    type character varying(32) NOT NULL DEFAULT ''::character varying, -- The node_type.type of this node.
    language character varying(12) NOT NULL DEFAULT ''::character varying, -- The languages.language of this node.
    title character varying(255) NOT NULL DEFAULT ''::character varying, -- The title of this node, always treated as non-markup plain text.
    uid integer NOT NULL DEFAULT 0, -- The users.uid that owns this node; initially, this is the user that created it.
    status integer NOT NULL DEFAULT 1, -- Boolean indicating whether the node is published (visible to non-administrators).
    created integer NOT NULL DEFAULT 0, -- The Unix timestamp when the node was created.
    changed integer NOT NULL DEFAULT 0, -- The Unix timestamp when the node was most recently saved.
    comment integer NOT NULL DEFAULT 0, -- Whether comments are allowed on this node: 0 = no, 1 = closed (read only), 2 = open (read/write).
    promote integer NOT NULL DEFAULT 0, -- Boolean indicating whether the node should be displayed on the front page.
    sticky integer NOT NULL DEFAULT 0, -- Boolean indicating whether the node should be displayed at the top of lists in which it appears.
    tnid bigint NOT NULL DEFAULT 0, -- The translation set id for this node, which equals the node id of the source post in each set.
    translate integer NOT NULL DEFAULT 0, -- A boolean indicating whether this translation page needs to be updated.
    CONSTRAINT node_pkey PRIMARY KEY (nid),
    CONSTRAINT node_vid_key UNIQUE (vid),
    CONSTRAINT node_nid_check CHECK (nid >= 0),
    CONSTRAINT node_tnid_check CHECK (tnid >= 0),
    CONSTRAINT node_vid_check CHECK (vid >= 0)

## taxonomy_term_data table

    tid serial NOT NULL, -- Primary Key: Unique term ID.
    vid bigint NOT NULL DEFAULT 0, -- The taxonomy_vocabulary.vid of the vocabulary to which the term is assigned.
    name character varying(255) NOT NULL DEFAULT ''::character varying, -- The term name.
    description text, -- A description of the term.
    format character varying(255), -- The filter_format.format of the description.
    weight integer NOT NULL DEFAULT 0, -- The weight of this term in relation to other terms.
    CONSTRAINT taxonomy_term_data_pkey PRIMARY KEY (tid),
    CONSTRAINT taxonomy_term_data_tid_check CHECK (tid >= 0),
    CONSTRAINT taxonomy_term_data_vid_check CHECK (vid >= 0)

## taxonomy_index table

    nid bigint NOT NULL DEFAULT 0, -- The node.nid this record tracks.
    tid bigint NOT NULL DEFAULT 0, -- The term ID.
    sticky smallint DEFAULT 0, -- Boolean indicating whether the node is sticky.
    created integer NOT NULL DEFAULT 0, -- The Unix timestamp when the node was created.
    weight integer NOT NULL DEFAULT 0, -- A user-defined weight for each node in its respective category.
    CONSTRAINT taxonomy_index_nid_check CHECK (nid >= 0),
    CONSTRAINT taxonomy_index_tid_check CHECK (tid >= 0)

# What I need

- A way to determine whether any of the repeated taxonomy_term_data rows are actually referenced in taxonomy_index.
- A way to, if necessary, make all occurrences in taxonomy_index point to just one row of each repeated taxonomy_term_data term.
- Finally, a way to delete all the taxonomy_term_data entries that are not in use in taxonomy_index.

I suppose one or more well-written queries would do the trick, but my SQL knowledge is very limited.
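
For illustration, the three steps listed above could be sketched roughly as follows. This is an untested outline, not a verified fix. It assumes that duplicates are rows in vocabulary 2 with exactly the same name, and that keeping the lowest tid of each name is acceptable (an arbitrary choice). The names term_map and keep_tid are hypothetical helpers introduced here. Other Drupal tables that are not shown in the question (for example term hierarchy or field data tables) may also reference tid and are not handled by this sketch.

```sql
BEGIN;

-- Helper mapping: every tid in vocabulary 2 -> the lowest tid sharing its name.
-- term_map and keep_tid are hypothetical names used only in this sketch.
CREATE TEMP TABLE term_map ON COMMIT DROP AS
SELECT tid,
       MIN(tid) OVER (PARTITION BY name) AS keep_tid
FROM taxonomy_term_data
WHERE vid = 2;

-- Step 1: list the names whose redundant copies are referenced in taxonomy_index.
SELECT d.name,
       COUNT(DISTINCT m.tid) AS redundant_tids_in_use,
       COUNT(*)              AS index_rows
FROM term_map m
JOIN taxonomy_term_data d ON d.tid = m.tid
JOIN taxonomy_index i     ON i.tid = m.tid
WHERE m.tid <> m.keep_tid
GROUP BY d.name
ORDER BY index_rows DESC, d.name;

-- Step 2: repoint those references to the single kept tid for each name.
UPDATE taxonomy_index i
SET tid = m.keep_tid
FROM term_map m
WHERE i.tid = m.tid
  AND m.tid <> m.keep_tid;

-- Step 3: delete terms in vocabulary 2 that taxonomy_index no longer references.
DELETE FROM taxonomy_term_data d
WHERE d.vid = 2
  AND NOT EXISTS (SELECT 1 FROM taxonomy_index i WHERE i.tid = d.tid);

-- Dry run by default: change ROLLBACK to COMMIT once the results look right.
ROLLBACK;
```

Running everything inside one transaction that ends in ROLLBACK lets the row counts reported by each statement be checked before anything is made permanent. Note that step 3 also removes any term whose copies were never referenced at all, and that step 2 can leave repeated (nid, tid) pairs in taxonomy_index if an article was linked to more than one copy of the same term.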
Asked by That Brazilian Guy (111 rep)
Sep 30, 2016, 09:00 PM
Last activity: Oct 1, 2016, 12:54 PM