How to delete a div with a specific class from XHTML using xmlstarlet?
5 votes, 2 answers, 580 views
I have several hundred .xhtml files in a sub-directory(*) and I want to delete all DIVs with a specific class (and the entire contents of those DIVs - including other divs, spans, image and paragraph elements) from them. The DIV may appear zero, one, or more times at any arbitrary depth within each .xhtml file.
The specific DIVs I want to delete are those with the class `portlet solid author-note-portlet`.
Using the `xml_grep` utility from the perl [XML::Twig](https://metacpan.org/release/XML-Twig) module, I can run `xml_grep -v 'div[@class="portlet solid author-note-portlet"]' file*.xhtml` and it will remove all instances of that div from the .xhtml files and display the result on stdout. Exactly what I want, except for "display on stdout".
If `xml_grep` had some kind of in-place edit option, that would be fine, I'd just use that... but it doesn't, so I'd have to write a wrapper script that used a temporary file or `sponge` and run `xml_grep` against each .xhtml file individually, which would be slow and tedious. Or I could hack a copy of `xml_grep` so that it could edit its input file(s).
But I don't want to do either of these things. I want to use an existing tool that can already do this: `xmlstarlet`. It'll be faster, it has in-place edit, and I won't have to run it once per filename.
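For reference, the wrapper approach dismissed above could look roughly like the following. This is only a minimal sketch: `rewrite_in_place` is a hypothetical helper name, and it assumes the filter takes the file name as its last argument, writes the result to stdout, and exits with status 0 on success.

```shell
#!/bin/sh
# Hypothetical helper (not part of XML::Twig or xmlstarlet).
# Usage: rewrite_in_place FILE COMMAND [ARGS...]
# Runs "COMMAND ARGS... FILE", captures stdout in a temporary file, and
# copies it back over FILE only if the command succeeded.
rewrite_in_place() {
    file=$1
    shift
    tmp=$(mktemp) || return 1
    if "$@" "$file" > "$tmp"; then
        # cat (rather than mv) keeps FILE's permissions and inode
        cat "$tmp" > "$file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Example use with the xml_grep command from the question (one run per file):
# for f in file*.xhtml; do
#     rewrite_in_place "$f" xml_grep -v 'div[@class="portlet solid author-note-portlet"]'
# done
```

Writing back with `cat` is not atomic, so an interrupted run could leave a truncated file; working on a copy of the directory avoids that risk.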
The trouble is that no matter what I try (and I have tried dozens of variations), I cannot figure out the correct XPath expression to delete a div with this class. For example, I have tried:
xmlstarlet ed -d "div[@class='portlet solid author-note-portlet']" file.xhtml
and (with different quoting)
xmlstarlet ed -d 'div[@class="portlet solid author-note-portlet"]' file.xhtml
and
xmlstarlet ed -d '//html/body/div/div/div[@class="portlet solid author-note-portlet"]'
and dozens of other variations. None of them have resulted in any change to the xhtml output. This is the point at which I usually give up on xmlstarlet and write a perl script, but this time I'm determined to do it with xmlstarlet.
So, what's the correct way to specify this div class for xmlstarlet?
BTW, for one example .xhtml file (with two instances of this div, which happen to be at the same depth... which is fairly typical but not universal), `xmlstarlet el -v` says:
$ xmlstarlet el -v OEBPS/file0007.xhtml | grep author-note-portlet
html/body/div/div[@class='portlet solid author-note-portlet']
html/body/div/div[@class='portlet solid author-note-portlet']
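One thing worth checking, offered as an assumption rather than a confirmed diagnosis: XHTML files normally declare a default namespace (`xmlns="http://www.w3.org/1999/xhtml"`) on the `html` element, and the `el -v` paths above do not show it. In XPath 1.0, an unprefixed name such as `div` matches only elements in no namespace, so none of the expressions tried would match a namespaced `div`. A sketch of a namespace-aware deletion, where the helper name `delete_author_notes` and the prefix `x` are arbitrary choices:

```shell
#!/bin/sh
# Assumption: the files use the standard XHTML namespace. Check the xmlns
# attribute on the <html> element of one of the files to confirm.
XHTML_NS='http://www.w3.org/1999/xhtml'

# Hypothetical helper: delete every matching div, at any depth, from all
# files given as arguments.
#   -L  edit the files in place
#   -N  bind the prefix "x" to the XHTML namespace for use in the XPath
delete_author_notes() {
    xmlstarlet ed -L -N x="$XHTML_NS" \
        -d '//x:div[@class="portlet solid author-note-portlet"]' "$@"
}

# Example use:
# delete_author_notes OEBPS/*.xhtml
```

The `@class` comparison is an exact string match, so a div whose class attribute lists the same names in a different order, or with extra names, would not be deleted.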
---
(*) Not that it matters, but these .xhtml files are inside a .epub file(**) generated by the [FanFicFare](https://github.com/JimmXinu/FanFicFare) plugin for [Calibre](https://calibre-ebook.com) - which downloads all chapters from books on various fiction web sites and turns them into an epub file (which is basically a zip archive containing XHTML and CSS files and maybe jpeg or gif files, along with a bunch of metadata files).
Asked by cas
(81957 rep)
Sep 15, 2022, 12:55 AM
Last activity: Sep 15, 2022, 01:39 PM