How to check consistency of a generated web site using recursive HTML parsing
2 votes, 1 answer, 81 views
I have a FOSS project whose web site is generated by asciidoc and some custom scripts as a horde (thousands) of static files locally in the source files' repo; these are copied into another workspace, uploaded to a github.io-style repository, and eventually served over HTTP for browsers around the world to see.
Users occasionally report that some of the links between site pages end up broken (lead nowhere).
The website build platform is generally POSIX-ish, although most often the agent doing the regular work is a Debian/Linux one. *Maybe* the platform differences cause the "page outages"; maybe this bug is platform-independent.
I had a thought about crafting a check over the two local directories as well as the resulting site: crawl all relative links (and/or absolute ones starting with the site's domain name(s)) and report any broken pages, so I could focus on finding out why they fail and/or avoid publishing "bad" iterations - the same way compiler and debugger warnings help elsewhere.
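For the local directories, even a naive shell pass over the generated HTML might catch most breakage before publication. A rough sketch along those lines, assuming the generated pages live under `./output` (a placeholder path); it only looks at double-quoted `href` attributes and skips external links and bare anchors, so it is a starting point rather than a full parser:

```sh
#!/bin/sh
# Naive local link check: scan href="..." attributes in generated HTML
# and verify that relative targets exist on disk. SITE_ROOT is a placeholder.
SITE_ROOT=./output

find "$SITE_ROOT" -name '*.html' | while read -r page; do
    dir=$(dirname "$page")
    grep -o 'href="[^"]*"' "$page" | sed 's/^href="//; s/"$//' |
    grep -vE '^(https?:|mailto:|#)' | while read -r link; do
        link=${link%%#*}                    # drop any #fragment
        [ -n "$link" ] || continue
        case $link in
            /*) target=$SITE_ROOT$link ;;   # root-relative link
            *)  target=$dir/$link ;;        # relative to the current page
        esac
        [ -e "$target" ] || printf '%s: broken link %s\n' "$page" "$link"
    done
done
```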
The general train of thought is to use some wget spider mode, though any other command-line tool (curl, lynx, ...), a Python script, shell with sed, etc. would do as well. Surely this particular wheel has been invented too many times already for me to even think about making my own? However, a quick and cursory googling session while on my commute did not come up with any good fit.
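For the published site, GNU wget's recursive spider mode looks like the closest ready-made fit to that train of thought; something along these lines, with the URL being a placeholder (exact log wording differs between wget versions, so treat the grep as a starting point):

```sh
# Crawl the live site without saving anything, following links to any depth,
# and keep the log so broken links can be pulled out afterwards.
wget --spider -r -l inf -nd -nv -o spider.log https://example.github.io/

# Recent GNU wget versions append a "Found N broken links." summary to the log:
grep -i 'broken link' spider.log
```

By default a recursive wget stays on the starting host, which roughly matches the "relative and same-domain links only" requirement; if the site spans several hostnames, adding -H with --domains=... should widen the crawl accordingly.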
So, suggestions are welcome :)
Asked by Jim Klimov
(131 rep)
May 7, 2024, 07:04 AM
Last activity: Jan 12, 2025, 09:01 AM