
How to delete a div with a specific class from XHTML using xmlstarlet?

5 votes
2 answers
580 views
I have several hundred .xhtml files in a sub-directory(*) and I want to delete all DIVs with a specific class (and the entire contents of those DIVs - including other divs, spans, image and paragraph elements) from them. The DIV may appear zero, one, or more times at any arbitrary depth within each .xhtml file. The specific DIVs I want to delete are:
.....
Using the `xml_grep` utility from the perl [XML::Twig](https://metacpan.org/release/XML-Twig) module, I can run `xml_grep -v 'div[@class="portlet solid author-note-portlet"]' file*.xhtml` and it will remove all instances of that div from the .xhtml files and display the result on stdout. Exactly what I want, except for "display on stdout".

If `xml_grep` had some kind of in-place edit option, that would be fine, I'd just use that... but it doesn't, so I'd have to write a wrapper script that used a temporary file or `sponge` and run `xml_grep` against each .xhtml file individually, which would be slow and tedious. Or I could hack a copy of `xml_grep` so that it could edit its input file(s).

But I don't want to do either of these things. I want to use the existing tool which can already do this: I want to use `xmlstarlet`. It'll be faster, has in-place edit, and I won't have to run it once per filename.

The trouble is that no matter what I try (and I have tried dozens of variations), I cannot figure out the correct xpath specification to delete a div with this class. For example, I have tried:

`xmlstarlet ed -d "div[@class='portlet solid author-note-portlet']" file.xhtml`

and (with different quoting)

`xmlstarlet ed -d 'div[@class="portlet solid author-note-portlet"]' file.xhtml`

and

`xmlstarlet ed -d '//html/body/div/div/div[@class="portlet solid author-note-portlet"]'`

and dozens of other variations. None of them have resulted in any change to the xhtml output.

This is the point at which I usually give up on xmlstarlet and write a perl script, but this time I'm determined to do it with xmlstarlet. So, what's the correct way to specify this div class for xmlstarlet?

BTW, for one example .xhtml file (with two instances of this div, which happen to be at the same depth, which is fairly typical but not universal), `xmlstarlet el -v` says:
$ xmlstarlet el -v OEBPS/file0007.xhtml | grep author-note-portlet
html/body/div/div[@class='portlet solid author-note-portlet']
html/body/div/div[@class='portlet solid author-note-portlet']
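
One possible reason the attempts above match nothing, offered as an assumption rather than something the question confirms: XPath 1.0 treats an unprefixed name such as `div` as belonging to no namespace, so if the files declare the XHTML default namespace (`http://www.w3.org/1999/xhtml`) on the root element, every element name in the expression needs a prefix bound to that URI. A minimal sketch under that assumption follows; the prefix `x`, the variable names, and the function name `delete_author_notes` are all illustrative.

```shell
#!/bin/sh
# Assumption: each file's root element declares
#   xmlns="http://www.w3.org/1999/xhtml"
# "x" is an arbitrary prefix; only the namespace URI has to match.
XHTML_NS='http://www.w3.org/1999/xhtml'
NOTE_XPATH='//x:div[@class="portlet solid author-note-portlet"]'

# delete_author_notes FILE...
#   -L  edit the named files in place
#   -N  bind the prefix "x" to the XHTML namespace
#   -d  delete every node matched by the XPath (children included)
delete_author_notes() {
    xmlstarlet ed -L -N "x=$XHTML_NS" -d "$NOTE_XPATH" "$@"
}

# Intended use (hypothetical path, not run here):
#   delete_author_notes OEBPS/*.xhtml
```

The leading `//` matches the div at any depth, so the expression does not depend on the `html/body/div/div` path reported by `xmlstarlet el -v`. If the files turn out not to use a default namespace, this sketch would not apply as written.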
---

(*) Not that it matters, but these .xhtml files are inside a .epub file(**) generated by the [FanFicFare](https://github.com/JimmXinu/FanFicFare) plugin for [Calibre](https://calibre-ebook.com), which downloads all chapters from books on various fiction web sites and turns them into an epub file (which is basically a zip archive containing XHTML and CSS files and maybe jpeg or gif files, along with a bunch of metadata files).

This div is used by one site (Royal Road) for authors to include a note with a chapter. Some authors use it sparingly, and insert short notes about either the chapter or the book or brief announcements about random stuff, with maybe a link to their patreon page... fine, no big deal. Others use it to add a half-page note with links to 10 of their other books at the start of **each** chapter, and again to add three and a half pages of links (with cover images) to those books at the end of each chapter. Which is kind of OK-ish if you're reading it in serial form chapter-by-chapter on the web site, but not if you're reading it as a book: ~4 pages of self-promotion for every 6-10 or so pages of story is excessive and distracting. And, BTW, that's 4 "pages" on my 10 inch android tablet; it's more than double that on my phone.

I can easily add `display: none` to the epub's style sheet for this class, but I want to actually delete the divs from the .xhtml files. They noticeably inflate the .epub file size.

(**) Extracting the contents of the .epub with unzip and rebuilding it afterwards are way outside the scope of this question, so please don't get distracted by irrelevant details. Already handled.

---

Sample .xhtml file, edited down to the bare minimum (and story/chapter/author name anonymised to protect the "guilty" :-):





Chapter Five - Chapter Name







Chapter Five - Chapter Name

A note from Author Name

About a dozen or so p, span, img, and br tags here

story text here. a few hundreds p, br, etc tags

A note from Author Name

several dozen more p, span, br, img, etc tags here
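
For comparison, the temporary-file wrapper that the question mentions (and would rather avoid) could look roughly like the sketch below. The function name `filter_in_place` is hypothetical, and it assumes the filter command takes the file name as its last argument and writes the edited document to stdout, as the question describes for `xml_grep -v`.

```shell
#!/bin/sh
# filter_in_place FILE CMD [ARGS...]
# Runs "CMD ARGS FILE", captures stdout in a temporary file next to FILE,
# and replaces FILE only if the command exited successfully.
# Note: mktemp creates the temp file with restrictive permissions, so the
# replaced file may end up with a different mode than the original.
filter_in_place() {
    file=$1
    shift
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if "$@" "$file" > "$tmp"; then
        mv -- "$tmp" "$file"
    else
        rm -f -- "$tmp"
        return 1
    fi
}

# Intended use (hypothetical path, not run here), one xml_grep run per file:
#   for f in OEBPS/*.xhtml; do
#       filter_in_place "$f" xml_grep -v 'div[@class="portlet solid author-note-portlet"]'
#   done
```

This starts one process per file, which is the per-file overhead the question is trying to avoid by using a tool with built-in in-place editing.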

Asked by cas (81957 rep)
Sep 15, 2022, 12:55 AM
Last activity: Sep 15, 2022, 01:39 PM