Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
0
votes
0
answers
20
views
Detecting Messy Markup in EPUB Files (e.g. PDF Conversions)
I'm trying to identify EPUB files with messy or problematic markup—typically the kind that results from automatic PDF-to-EPUB conversions. These often lead to unreadable or structurally broken files that aren't suitable for further processing or reading.
I've written scripts to analyze EPUB contents by counting HTML tags per word or checking for tags that interrupt sentence flow. But these metrics don’t correlate reliably with actual messiness. Some clean files rank at the top of the list.
I need a way to automatically score or rank EPUBs by how likely they are to contain poor or overly complex markup, so I can prioritize which files to manually review or discard.
A chatbot suggested analyzing how styles are applied, especially if styles mimic visual layout from PDFs. That seems like a good angle, but I don’t have enough knowledge of EPUB styling conventions or CSS patterns to define robust detection criteria.
I could use suggestions for:
- Better heuristics or metrics to detect messy EPUB structure
Any guidance is appreciated. Manually checking hundreds of files is not an option, and I’d like to automate this as much as possible.
JohnBig
(121 rep)
Apr 24, 2025, 12:30 PM
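A rough sketch of the style-based angle mentioned in the question above, not an established metric. It assumes that PDF-derived markup tends to pin text to page coordinates with inline styles (`position:absolute`, `left:`/`top:` offsets), and it scores a file by the share of opening tags that carry such a style. The function name `layout_style_ratio` and the regular expressions are illustrative choices.

```python
import re
import zipfile

# Hypothetical heuristic: treat inline styles that position text on a page
# grid as a hint that the markup was generated from a PDF.
TAG_RE = re.compile(r"<[a-zA-Z][^>]*>")  # opening tags only
LAYOUT_RE = re.compile(
    r"style\s*=\s*[\"'][^\"']*"
    r"(?:position\s*:\s*absolute|(?<![-\w])(?:left|top)\s*:\s*-?\d)",
    re.IGNORECASE,
)


def layout_style_ratio(epub):
    """Return the share of opening tags whose inline style looks layout-driven."""
    tags = 0
    layout_tags = 0
    with zipfile.ZipFile(epub) as zf:
        for name in zf.namelist():
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            text = zf.read(name).decode("utf-8", errors="replace")
            for tag in TAG_RE.findall(text):
                tags += 1
                if LAYOUT_RE.search(tag):
                    layout_tags += 1
    return layout_tags / tags if tags else 0.0
```

Sorting a collection by this ratio gives a review order. It only catches inline styles, so files that move the same positioning into a stylesheet would need a separate check.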
1
votes
1
answers
591
views
Getting more metadata about epub documents
The only commands I know for this are file, stat and mediainfo. They give some idea about an .epub document, but not the whole picture. For example, file just prints the filename and declares it to be an EPUB document. mediainfo is slightly better in that it gives the following info:
Format : ZIP
File size : 93.9 MiB
FileExtension_Invalid : zip docx odt xlsx ods
So apart from the name of the document, the above is all I know. The most crucial bits are missing, though: when the book was published, which EPUB version was used, which application was used to create the .epub document, and so on. There are many versions (2, 2.0.1, 3, 3.2 and so on), and having this information would make troubleshooting easier.
I'm looking for something along the lines of pdfinfo.
shirish
(12954 rep)
May 4, 2023, 10:11 PM
• Last activity: May 5, 2023, 07:16 AM
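A minimal sketch of a pdfinfo-like lookup for the question above. An EPUB is a ZIP container whose `META-INF/container.xml` points at the package (OPF) file; the package element carries the EPUB `version` attribute and the Dublin Core metadata. The function name `epub_info` and the selection of fields are illustrative. The creating application is not a standard field, so the sketch only collects whatever `<meta name="..." content="...">` pairs the file happens to contain.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def epub_info(epub):
    """Collect the EPUB version and basic metadata from the package file."""
    with zipfile.ZipFile(epub) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    info = {"opf": opf_path, "epub_version": package.get("version")}
    for field in ("title", "creator", "publisher", "date", "language"):
        node = package.find(f"opf:metadata/dc:{field}", NS)
        info[field] = node.text if node is not None else None
    # Tool-specific hints, if any, usually live in name/content meta pairs.
    info["meta"] = {
        m.get("name"): m.get("content")
        for m in package.iterfind("opf:metadata/opf:meta", NS)
        if m.get("name")
    }
    return info
```

Calling `epub_info("book.epub")` returns a plain dictionary, which can be printed field by field for troubleshooting.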
0
votes
1
answers
202
views
Find list of words and replace with one word
I have some `.epub` files of books I want to edit swear words out of for my younger kids to read. I've read `sed` is the right tool for the job (I am open to different solutions as well), but am new to it.
Example original text:
ant bat cat
dog eagle fish
Modified text (post-sed):
ant XXX cat
XXX eagle XXX
I am on a Mac, and have got this to work:
LC_ALL=C sed -E 's/bat|dog|fish/XXX/ig' temp1.txt > temp2.txt
ant XXX cat
XXX eagle XXX
But I can't get this to work with the .epub file format:
LC_ALL=C sed -E 's/bat|dog|fish/XXX/ig' file1.epub > file2.epub
Here's a link to an example .epub file.
timothy.s.lau
(103 rep)
Nov 22, 2022, 06:22 PM
• Last activity: Nov 28, 2022, 06:10 PM
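An EPUB is a ZIP archive, so a text tool run over the `.epub` file itself sees compressed bytes rather than the book text. A hedged sketch of one alternative: rewrite each XHTML member and write a new archive. The function name `censor_epub` is hypothetical, the members are assumed to be UTF-8, and, unlike the sed command in the question, the pattern only matches whole words.

```python
import re
import zipfile


def censor_epub(src, dst, words, replacement="XXX"):
    """Copy src to dst, replacing whole-word matches in the (X)HTML members."""
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE
    )
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        # infolist() keeps the archive order, so "mimetype" stays first.
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename.lower().endswith((".xhtml", ".html", ".htm")):
                # Assumption: content documents are UTF-8 encoded.
                data = pattern.sub(replacement, data.decode("utf-8")).encode("utf-8")
            # The mimetype entry should be stored without compression.
            method = (
                zipfile.ZIP_STORED
                if item.filename == "mimetype"
                else zipfile.ZIP_DEFLATED
            )
            zout.writestr(item.filename, data, compress_type=method)
```

Usage would look like `censor_epub("file1.epub", "file2.epub", ["bat", "dog", "fish"])`. The regex runs over the raw markup, so a listed word that also appears in a tag or attribute would be replaced too; parsing the XHTML and editing only text nodes would avoid that.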
0
votes
1
answers
124
views
extract TOC from epub
I'm trying to learn scripting and found the post "Extract TOC from epub file", which gives me part of the solution I need, but when I tested it I got the error `error: Extra content at the end of the document`.
A little bit of background: I have two EPUB files, `1.epub` and `2.epub`. Tested separately, each one worked fine (I got the TOC from each EPUB), but when I tried to process both files in a `for` ... `do` loop, I got the above error.
I'm still learning scripting, so I'm not sure whether I made a mistake somewhere. Can anyone point out my mistake?
PS: my script:
#! /usr/bin/bash
EPUB_LIST="1.epub 2.epub"
for f in "$EPUB_LIST"
do
    echo "$f:"
    unzip -p "$f" OEBPS/toc.ncx |
    xml2 |
    sed -n -e 's:^/ncx/navMap/navPoint/navLabel/text=: :p'
    echo
done
michaelbr
(111 rep)
Aug 27, 2022, 12:19 PM
• Last activity: Aug 27, 2022, 01:13 PM
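One thing that stands out in the script above: the quoted `"$EPUB_LIST"` expands to a single word, so the loop body runs once with `1.epub 2.epub` as the file name. Whether that fully explains the reported error is not certain. For comparison, a sketch of the same loop in Python, iterating over a real list and reading only the top-level `navPoint` labels, as the `sed` pattern does. The default path `OEBPS/toc.ncx` is taken from the script; the function names are illustrative.

```python
import zipfile
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def toc_labels(epub, ncx_path="OEBPS/toc.ncx"):
    """Return the labels of the top-level navPoint entries in the NCX file."""
    with zipfile.ZipFile(epub) as zf:
        root = ET.fromstring(zf.read(ncx_path))
    nodes = root.iterfind("ncx:navMap/ncx:navPoint/ncx:navLabel/ncx:text", NCX_NS)
    return [(node.text or "").strip() for node in nodes]


def print_tocs(paths):
    """Print one TOC block per file; each list element is handled separately."""
    for path in paths:
        print(f"{path}:")
        for label in toc_labels(path):
            print(f"  {label}")
        print()
```

`print_tocs(["1.epub", "2.epub"])` would then process the two files one after the other.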
0
votes
0
answers
160
views
command/app to compare epub files
Is there an easy way to compare the first line of text of EPUB files using the command line or an app in Linux?
Details: I have 15 different EPUB files, each with over 10 "chapters". I'd like to compare the first few words of each chapter to see if there's a duplicate chapter. Most comparison commands in Linux seem to be meant for plain text, and unfortunately I'm no expert in Linux.
Currently I'm using Sigil to do the comparison, which is time-consuming. I'd prefer to use the command line or an app, if possible.
michaelbr
(111 rep)
Aug 26, 2022, 12:31 PM
• Last activity: Aug 26, 2022, 12:34 PM
16
votes
3
answers
8085
views
Downloading a .epub from a .acsm
I want to transfer books I buy on Google Play (downloads a .acsm) to my Kobo reader device. Everything I can find on internet about it:
- has you running Adobe Digital Editions
- aims at removing the DRM
I want to avoid both: I'm fine with the DRM as long as I can read the book on my device, and I don't want to run ADE through wine or otherwise (already lost hours trying that).
I guess acsm -> epub "conversion" is mostly a download, but are there conversions/encryptions along the way ? There is an url in the `` tag in the .acsm, but also a lot of other parameters. Is there a way to download "manually" (without ADE) ?
Gnurfos
(283 rep)
Oct 12, 2015, 12:35 PM
• Last activity: Jun 5, 2022, 04:18 AM
1
votes
1
answers
267
views
How to cat the first page of an epub file?
epubcat book.epub 1 3
# outputs plain text of pages 1 through 3
I don’t know if epubs have the concept of “pages.” If not, perhaps we can say each 400 chars are a page?
A general solution that works for other ebook formats is better (mobi, azw3, etc).
My own thoughts are currently on first converting the book to text via ebook-convert or pandoc and then extracting the needed amount, but this seems awfully inefficient as I intend to only get a little of the beginning of the content.
You can download an example file that can be used for testing [here](http://82.102.11.148:8080//tmp/Time%20to%20Put%20Your%20Galleons%20Where%20Your%20Mouth%20Is%20-%20Tsume%20Yuki.epub) .
HappyFace
(1694 rep)
Apr 10, 2020, 09:32 AM
• Last activity: May 5, 2022, 11:46 PM
1
votes
1
answers
357
views
Convert Blog to PDF or Epub Book
I want a command or script to collect all the posts on a given blog and convert them into a PDF and/or Epub book without needing to be the owner of the blog. This [website](http://blog2book.pothi.com/) allows users to convert blogs to PDF without needing to be the blog owner, but it will only convert up to 100 posts. Most of the blogs I want to convert have 200+ posts. I'd like the published date of the posts be included at the top or bottom of each post, and graphics and images retained if possible.
whitewings
(2527 rep)
May 1, 2015, 08:35 PM
• Last activity: Jan 12, 2022, 12:39 PM
1
votes
0
answers
232
views
XMLstarlet to fix image tags and replace path for images
I have multiple .XHTML files in a folder. The top declaration part is as follows:
First, I don't want to alter the top head part.
I want to process the files in batch and fix two things:
1) terminate image tags properly with '/>', same for ` ` and ` ` tags.
2) replace the path in all images (preserving the name), i.e. from

to

I tried xmlstarlet (v1.6.1):
xmlstarlet fo --recover --html file.xhtml
but it alters the top declaration part, adding extra stuff at the top:
It also warns about an invalid tag:
file.xhtml:8.54: Tag section invalid
^
What are the correct commands? First I need a 'dry run' to see the changes; if they are OK, I want to apply them in place.
minto
(575 rep)
Sep 27, 2021, 09:19 PM
• Last activity: Sep 27, 2021, 10:07 PM
0
votes
3
answers
4073
views
How to extract files recursively but keep them in their own folders?
This is how I'm extracting all the files in a folder (recursively):
find -iname \*.epub -exec unzip -o {} \;
But the extracted files end up all in the parent folder:
Parent (Extracted Epub files)
Child (Epub files)
Child (Epub files)
How to change that command, so that they are extracted in their own folders?
Parent
Child (Epub files and extracted Epub files)
Child (Epub files and extracted Epub Files)
wyc
(143 rep)
Sep 20, 2021, 05:32 AM
• Last activity: Sep 20, 2021, 09:44 AM
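A sketch of the behaviour asked for above, written in Python rather than as a `find`/`unzip` one-liner. The function name `extract_all` and the `per_book` switch are illustrative. By default each archive is unpacked into a sibling folder named after the book, on the assumption that several EPUBs in one folder would otherwise overwrite each other's `mimetype` and `META-INF` entries; with `per_book=False` the files land directly next to the archive.

```python
import zipfile
from pathlib import Path


def extract_all(root, per_book=True):
    """Unpack every .epub below root next to where the archive lives."""
    targets = []
    # sorted() materialises the listing before any new files are created.
    for epub in sorted(Path(root).rglob("*")):
        if not (epub.is_file() and epub.suffix.lower() == ".epub"):
            continue
        target = epub.parent / epub.stem if per_book else epub.parent
        with zipfile.ZipFile(epub) as zf:
            zf.extractall(target)  # creates the folder and overwrites silently
        targets.append(target)
    return targets
```

`extract_all(".")` mirrors the recursive search in the question, and the returned list shows where each book ended up.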
0
votes
1
answers
1126
views
How to automatically replace mimetype when unzipping?
I'm using the following command to unzip epub files recursively inside a folder:
find -iname \*.epub -exec unzip {} \;
It works. But the terminal asks me this each time a file is being extracted:
> replace mimetype? [y]es, [n]o, [A]ll, [N]one, [r]ename:
Is there a flag that I can add to the command so it automatically selects [A]ll?
wyc
(143 rep)
Sep 19, 2021, 06:25 AM
• Last activity: Sep 19, 2021, 07:09 AM
4
votes
1
answers
4438
views
Convert EPUB to TXT and preserve original formatting
I have a programming book in EPUB format and I'm trying to convert it to TXT.
For that I'm using the utility **ebook-convert** from **calibre**.
The problem is that the standard usage:
ebook-convert book.epub book.txt
removes indentation in source code samples.
E.g. a sample in the book looks so:
class A {
    private int a;
}
But in the resulted TXT:
class A {
private int a;
}
After reading the utility's man page I've tried the following options:
--keep-ligatures
--pretty-print
--change-justification=original
but with no result. How to achieve it?
ka3ak
(1275 rep)
May 2, 2021, 10:15 AM
• Last activity: May 2, 2021, 10:50 AM
4
votes
5
answers
4600
views
EPUB reader for *BSD/Linux
What are the best native EPUB readers for *BSD/Linux? Browser add-ons are not an option.
I prefer non-Qt applications but you can share Qt applications if you want. If possible, I want a program that remembers the page I was last viewing.
Rufo El Magufo
(3224 rep)
Oct 19, 2012, 07:13 PM
• Last activity: Oct 17, 2019, 10:57 AM
6
votes
3
answers
8098
views
Recommendation for an eBook reader for Gnome
There are eBook readers for Android, there's Okular for KDE, and stuff like that, but what I want, is an eBook (ePub format) reader for my normal Linux desktop.
I know there's Calibre, which goes way beyond being just an eBook reader, and there's FBReader, which doesn't really work yet. Given that eBooks have been around for several years now, I'd have assumed more software would have sprung up by now.
polemon
(11921 rep)
Oct 20, 2012, 05:14 AM
• Last activity: Oct 17, 2019, 10:31 AM
2
votes
1
answers
1637
views
How to convert PDF to EPUB in a fixed layout in Calibre
I am trying to use [Calibre](https://calibre-ebook.com/help) to convert a PDF file to EPUB format with a fixed layout, but I have not been able to do it. Can somebody tell me the steps to convert to a fixed layout in Calibre?
mohan rathour
(121 rep)
May 28, 2019, 07:53 AM
• Last activity: May 31, 2019, 06:38 PM
2
votes
1
answers
2261
views
ebook-convert for all .epub files in the folder
This code converts epub file to txt file:
ebook-convert "book.epub" "book.txt"
How can I use it to convert all .epub files in the directory?
I am using Ubuntu.
### Code
from os import listdir, rename
from os.path import isfile, join
import subprocess

# return name of file to be kept after conversion.
# we are just changing the extension. azw3 here.
def get_final_filename(f):
    f = f.split(".")
    filename = ".".join(f[0:-1])
    processed_file_name = filename+".azw3"
    return processed_file_name

# return file extension. pdf or epub or mobi
def get_file_extension(f):
    return f.split(".")[-1]

# list of extensions that needs to be ignored.
ignored_extensions = ["pdf"]
# here all the downloaded files are kept
mypath = "/home/user/Downloads/ebooks/"
# path where converted files are stored
mypath_converted = "/home/user/Downloads/ebooks/kindle/"
# path where processed files will be moved to, clearing the downloaded folder
mypath_processed = "/home/user/Downloads/ebooks/processed/"

raw_files = [f for f in listdir(mypath) if isfile(join(mypath, f))]
converted_files = [f for f in listdir(mypath_converted) if isfile(join(mypath_converted, f))]

for f in raw_files:
    final_file_name = get_final_filename(f)
    extension = get_file_extension(f)
    if final_file_name not in converted_files and extension not in ignored_extensions:
        print("Converting : "+f)
        try:
            subprocess.call(["ebook-convert",mypath+f,mypath_converted+final_file_name])
            s = rename(mypath+f, mypath_processed+f)
            print(s)
        except Exception as e:
            print(e)
    else:
        print("Already exists : "+final_file_name)
silver
(61 rep)
Mar 9, 2019, 03:07 PM
• Last activity: Mar 9, 2019, 10:09 PM
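For the narrower task stated at the top of the question (EPUB to TXT for every file in one directory), a shorter sketch is possible. It assumes `ebook-convert` is on the PATH and uses the same `ebook-convert input output` form as the question; the function names are illustrative. Building the command list separately from running it makes a dry run easy.

```python
import subprocess
from pathlib import Path


def conversion_commands(folder, out_ext=".txt"):
    """Build one ebook-convert command per .epub that has no output file yet."""
    commands = []
    for src in sorted(Path(folder).glob("*.epub")):
        dst = src.with_suffix(out_ext)
        if not dst.exists():
            commands.append(["ebook-convert", str(src), str(dst)])
    return commands


def convert_all(folder, out_ext=".txt"):
    """Run the conversions one by one; stops at the first failing command."""
    for command in conversion_commands(folder, out_ext):
        subprocess.run(command, check=True)
```

Printing the result of `conversion_commands(".")` shows what would run, and `convert_all(".")` actually converts. Files that already have a matching `.txt` next to them are skipped.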
5
votes
0
answers
11254
views
Lightweight PDF to mobi and epub converter for Ubuntu
By lightweight I mean NOT Calibre. Please. I do not need a cataloging/library management software- which would not only consume unnecessary disk space but also ignore my current cataloging which I have maintained for years.
I just need a quick and dirty batch convert to epub or mobi without having to deal with the myriad issues of using calibre.
**Are there any simple PDF-to-epub and PDF-to-mobi conversion tools for Ubuntu?**
There seems to be several for Windows based machines but strictly only calibre for Ubuntu.
NVAR
(51 rep)
Jan 4, 2018, 07:02 AM
• Last activity: Nov 21, 2018, 09:58 PM
4
votes
1
answers
130
views
Safely handling PDFs and other ebook formats on Linux
I'm running Arch Linux and using Okular for opening PDF files and FBReader for other ebook formats (Epub, Mobi, etc.). Simply put, here's my question: Assuming some of those documents come from unreliable sources and contain malicious code what can I do to mitigate the risk of compromising the system and opening it for invasion (which can be a common occurrence in this country if you even smell like someone who holds opinions the government disapprove of)?
A few more specific questions:
Is just opening the referred files enough to put my setup at serious risk? The user I use for this is on the sudoers list, so, if compromised, it could be used for escalation.
Suppose I only open the files using a more limited user account, would that at least help?
Outside of setting up a virtual machine only for reading (which wouldn't be practical for a few reasons) or using another computer just for that (same thing), is there anything I can do?
Dave
(41 rep)
Sep 1, 2018, 04:53 PM
• Last activity: Sep 9, 2018, 08:28 AM
2
votes
2
answers
804
views
Recursively grep through epub files
I tried the answers here , but without luck.
find . -name "*.epub" -exec zipgrep pattern {} \;
showed me "matched", but didn't give me the matching epub file back. Also, it returned huge blobs of data, which were hard to grep through.
`grep -a` didn't work at all.
I want something like `grep -R`, but for epub files.
JJ Abrams
(185 rep)
Jun 28, 2018, 03:47 PM
• Last activity: Jun 28, 2018, 05:37 PM
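A sketch of a `grep -R`-like search for the question above. It opens each EPUB as a ZIP archive, strips tags from the (X)HTML members with a simple regex, and reports the archive, the member and the matching line. The function name `grep_epubs` is illustrative, the tag stripping is deliberately crude, and `rglob("*.epub")` matches case-sensitively on most Linux file systems.

```python
import re
import zipfile
from pathlib import Path

TAG_RE = re.compile(r"<[^>]+>")


def grep_epubs(root, pattern):
    """Yield (epub path, member name, line) for every matching text line."""
    regex = re.compile(pattern)
    for epub in sorted(Path(root).rglob("*.epub")):
        try:
            zf = zipfile.ZipFile(epub)
        except zipfile.BadZipFile:
            continue  # skip files that are not valid ZIP archives
        with zf:
            for name in zf.namelist():
                if not name.lower().endswith((".xhtml", ".html", ".htm")):
                    continue
                text = zf.read(name).decode("utf-8", errors="replace")
                for line in TAG_RE.sub("", text).splitlines():
                    if regex.search(line):
                        yield epub, name, line.strip()
```

Looping over `grep_epubs(".", "pattern")` and printing each tuple gives output close to `file:member:line`, which keeps the matching EPUB visible and avoids the large blobs of markup.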
4
votes
2
answers
2345
views
Extract TOC of epub file
Lately I hit upon the command that prints the TOC of a `pdf` file:
mutool show file.pdf outline
I'd like a command for the `epub` format with similar simplicity of usage and as nice a result as the above for the `pdf` format.
Is there something like that?
xralf
(15189 rep)
May 19, 2016, 09:53 PM
• Last activity: Feb 1, 2018, 03:58 PM
Showing page 1 of 20 total questions