Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

0 votes

0 answers

20 views

Detecting Messy Markup in EPUB Files (e.g. PDF Conversions)

epub

I'm trying to identify EPUB files with messy or problematic markup, typically the kind that results from automatic PDF-to-EPUB conversions. These often lead to unreadable or structurally broken files that aren't suitable for further processing or reading.

I've written scripts to analyze EPUB contents by counting HTML tags per word or checking for tags that interrupt sentence flow. But these metrics don’t correlate reliably with actual messiness. Some clean files rank at the top of the list.

I need a way to automatically score or rank EPUBs by how likely they are to contain poor or overly complex markup, so I can prioritize which files to manually review or discard.

A chatbot suggested analyzing how styles are applied, especially if styles mimic visual layout from PDFs. That seems like a good angle, but I don’t have enough knowledge of EPUB styling conventions or CSS patterns to define robust detection criteria.

I could use suggestions for:
- Better heuristics or metrics to detect messy EPUB structure

Any guidance is appreciated. Manually checking hundreds of files is not an option, and I’d like to automate this as much as possible.
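One way to make the "styles that mimic PDF layout" idea concrete is to count layout-style signals per 1000 words instead of raw tags per word. The following is only a minimal sketch under stated assumptions: the three signals (absolute positioning, fixed `px`/`pt` geometry, inline `style=` attributes) and the function name `messiness_score` are hypothetical choices, not an established metric, and they would need tuning against files that have already been reviewed by hand.

```python
import re
import zipfile

# Assumed signals of layout-mimicking markup; tune them against reviewed files.
ABSOLUTE_POS = re.compile(r"position\s*:\s*absolute", re.I)
FIXED_UNITS = re.compile(r"(?:left|top|width|height|font-size)\s*:\s*[\d.]+(?:px|pt)", re.I)
INLINE_STYLE = re.compile(r"\sstyle\s*=", re.I)
TAG = re.compile(r"<[^>]+>")


def messiness_score(epub_path):
    """Return layout-style hits per 1000 words over all (X)HTML and CSS entries."""
    hits = 0
    words = 0
    with zipfile.ZipFile(epub_path) as zf:
        for name in zf.namelist():
            lower = name.lower()
            if not lower.endswith((".xhtml", ".html", ".htm", ".css")):
                continue
            text = zf.read(name).decode("utf-8", errors="replace")
            hits += len(ABSOLUTE_POS.findall(text))
            hits += len(FIXED_UNITS.findall(text))
            if not lower.endswith(".css"):
                # Inline styles and word counts only make sense for content documents.
                hits += len(INLINE_STYLE.findall(text))
                words += len(TAG.sub(" ", text).split())
    return 1000.0 * hits / max(words, 1)
```

Sorting a collection by this score in descending order would put the most suspicious files first. Regular expressions are used instead of an XML parser on purpose, so that files with broken markup can still be scored.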

JohnBig (121 rep)

Apr 24, 2025, 12:30 PM

1 votes

1 answers

591 views

Getting more metadata about epub documents

bash command-line epub

The only commands I know for inspecting a file are file, stat and mediainfo. They give some idea about an .epub document, but not the whole picture. For example, file just prints the filename and declares it to be an EPUB document. Mediainfo is slightly better in that it gives the following info:

    Format                                   : ZIP
    File size                                : 93.9 MiB
    FileExtension_Invalid                    : zip docx odt xlsx ods

So apart from the name of the document, I know only the above. The most crucial bits are missing, though: when the book was published, which EPUB version was used, which application was used to create the .epub document, and so on. There are many versions (2, 2.0.1, 3, 3.2 and so on), and having this information would make troubleshooting easier.

Something on the lines of pdfinfo.
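Since an EPUB is a ZIP container, the version and publication metadata asked about here live in the package document (the `.opf` file) that `META-INF/container.xml` points to. The sketch below shows one way to read them with the Python standard library only. The function name `epub_info` is hypothetical, and the `generator` meta entry is an optional convention that many, but not all, authoring tools follow.

```python
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def epub_info(epub_path):
    """Return the EPUB version and basic Dublin Core metadata as a dict."""
    with zipfile.ZipFile(epub_path) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    info = {"opf": opf_path, "version": package.get("version")}
    for field in ("title", "creator", "publisher", "date", "language"):
        node = package.find(f"opf:metadata/dc:{field}", OPF_NS)
        info[field] = node.text if node is not None else None
    # Optional: many tools record themselves in <meta name="generator" content="...">.
    for meta in package.findall("opf:metadata/opf:meta", OPF_NS):
        if meta.get("name") == "generator":
            info["generator"] = meta.get("content")
    return info
```

Calling `epub_info("book.epub")` and printing the dict gives a rough pdfinfo-like summary. Fields that the book does not declare come back as `None`.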

shirish (12954 rep)

May 4, 2023, 10:11 PM • Last activity: May 5, 2023, 07:16 AM

0 votes

1 answers

202 views

Find list of words and replace with one word

sed macos epub

I have some `.epub` files of books that I want to edit swear words out of so that my younger kids can read them. I've read that `sed` is the right tool for the job (I am open to different solutions as well), but I am new to it.

Example original text:

ant bat cat
dog eagle fish

Modified text (post-sed):

ant XXX cat
XXX eagle XXX

I am on a Mac, and have got this to work:

    LC_ALL=C sed -E 's/bat|dog|fish/XXX/ig' temp1.txt > temp2.txt

ant XXX cat
XXX eagle XXX

But I can't get this to work with the .epub file format:

    LC_ALL=C sed -E 's/bat|dog|fish/XXX/ig' file1.epub > file2.epub

Here's a link to an example .epub file.
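For context, an .epub file is a ZIP archive, so a stream edit on the archive itself operates on compressed bytes rather than on the book text. A rough sketch of the alternative, rewriting the text entries inside the archive, could look like this in Python. The function name `censor_epub` is hypothetical, and the sketch makes two assumptions worth checking: the content documents are UTF-8, and whole-word matching is wanted (unlike the `sed` pattern above, which also matches inside longer words). It also replaces matches that happen to sit inside tag or attribute names.

```python
import re
import zipfile

TEXT_SUFFIXES = (".xhtml", ".html", ".htm")


def censor_epub(src, dst, words, replacement="XXX"):
    """Copy src to dst, replacing whole-word matches in the (X)HTML entries."""
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.I)
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename.lower().endswith(TEXT_SUFFIXES):
                text = data.decode("utf-8")
                data = pattern.sub(replacement, text).encode("utf-8")
            # EPUB expects the "mimetype" entry to be stored uncompressed;
            # infolist() keeps the original order, so it stays the first entry.
            method = zipfile.ZIP_STORED if item.filename == "mimetype" else zipfile.ZIP_DEFLATED
            zout.writestr(item.filename, data, compress_type=method)
```

With the example from the question, `censor_epub("file1.epub", "file2.epub", ["bat", "dog", "fish"])` would write a new archive and leave the original untouched.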

timothy.s.lau (103 rep)

Nov 22, 2022, 06:22 PM • Last activity: Nov 28, 2022, 06:10 PM

0 votes

1 answers

124 views

extract TOC from epub

scripting epub

I'm learning shell scripting and found the post "Extract TOC from epub file", which gives me part of the solution I need. But when I tested it, I got the error `error: Extra content at the end of the document`. A little background: I have two EPUB files, `1.epub` and `2.epub`. When I tested each one separately, it worked fine (I got the TOC from each EPUB), but when I tried to process both files in a `for ... do` loop, I got the above error. I'm not sure whether I made a mistake somewhere. Can anyone point out my mistake? My script:

#! /usr/bin/bash

EPUB_LIST="1.epub 2.epub" 

for f in "$EPUB_LIST"
do
    echo "$f:"
    unzip -p "$f" OEBPS/toc.ncx |
        xml2 |
        sed -n -e 's:^/ncx/navMap/navPoint/navLabel/text=:  :p'
    echo
done

michaelbr (111 rep)

Aug 27, 2022, 12:19 PM • Last activity: Aug 27, 2022, 01:13 PM

0 votes

0 answers

160 views

command/app to compare epub files

command-line file-comparison epub

Is there an easy way to compare the first line of text of EPUB files using the command line or an app in Linux?

Details: I have 15 different EPUB files, each with over 10 "chapters". I'd like to compare the first few words of each chapter to see if there's a duplicate chapter. Most of the compare commands in Linux seem to be meant for plain-text comparison, and unfortunately I'm not an expert in Linux.

Currently I'm using Sigil to do the comparison, which is time-consuming. I'd prefer to use the command line or an app, if possible.

michaelbr (111 rep)

Aug 26, 2022, 12:31 PM • Last activity: Aug 26, 2022, 12:34 PM

16 votes

3 answers

8085 views

Downloading a .epub from a .acsm

ebooks epub

I want to transfer books I buy on Google Play (which downloads a .acsm file) to my Kobo reader device. Everything I can find on the internet about it:

 - has you running Adobe Digital Editions
 - aims at removing the DRM

I want to avoid both: I'm fine with the DRM as long as I can read the book on my device, and I don't want to run ADE through wine or otherwise (already lost hours trying that).

I guess the acsm -> epub "conversion" is mostly a download, but are there conversions or encryption steps along the way? There is a URL in the `` tag in the .acsm, but also a lot of other parameters. Is there a way to download "manually" (without ADE)?

Gnurfos (283 rep)

Oct 12, 2015, 12:35 PM • Last activity: Jun 5, 2022, 04:18 AM

1 votes

1 answers

267 views

How to cat the first page of an epub file?

shell-script file-format pandoc calibre epub

    epubcat book.epub 1 3
    # outputs plain text of pages 1 through 3

I don’t know if epubs have the concept of “pages.” If not, perhaps we can say each 400 chars are a page?
A general solution that works for other ebook formats is better (mobi, azw3, etc).

My own thoughts are currently on first converting the book to text via ebook-convert or pandoc and then extracting the needed amount, but this seems awfully inefficient as I intend to only get a little of the beginning of the content.

You can download an example file that can be used for testing [here](http://82.102.11.148:8080//tmp/Time%20to%20Put%20Your%20Galleons%20Where%20Your%20Mouth%20Is%20-%20Tsume%20Yuki.epub) .
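On the efficiency concern: an EPUB stores its chapters as separate entries in a ZIP archive, listed in reading order in the package document's spine, so the beginning of the text can be read without converting the whole book. The sketch below illustrates that idea for EPUB only (it does not cover mobi or azw3). The function name `epub_head` and the "page = fixed number of characters" convention are assumptions taken from the question, and the tag stripping is deliberately crude.

```python
import html
import posixpath
import re
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}
TAG = re.compile(r"<[^>]+>")
BODY_OPEN = re.compile(r"<body[^>]*>", re.I)


def epub_head(epub_path, max_chars=400):
    """Return roughly the first max_chars characters of text, in reading order."""
    with zipfile.ZipFile(epub_path) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        package = ET.fromstring(zf.read(opf_path))
        base = posixpath.dirname(opf_path)
        hrefs = {i.get("id"): i.get("href")
                 for i in package.findall("opf:manifest/opf:item", NS)}
        out = ""
        for ref in package.findall("opf:spine/opf:itemref", NS):
            entry = posixpath.normpath(posixpath.join(base, unquote(hrefs[ref.get("idref")])))
            raw = zf.read(entry).decode("utf-8", errors="replace")
            body = BODY_OPEN.split(raw, maxsplit=1)[-1]  # skip <head> content such as <title>
            out += " ".join(html.unescape(TAG.sub(" ", body)).split()) + " "
            if len(out) >= max_chars:
                break  # stop early: later spine items are never decompressed
    return out[:max_chars].strip()
```

An `epubcat book.epub 1 3` style interface could then be approximated by requesting `3 * 400` characters and slicing, at the cost of reading only as many chapters as are needed to fill that amount.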
                                

HappyFace (1694 rep)

Apr 10, 2020, 09:32 AM • Last activity: May 5, 2022, 11:46 PM

1 votes

1 answers

357 views

Convert Blog to PDF or Epub Book

pdf conversion epub

I want a command or script to collect all the posts on a given blog and convert them into a PDF and/or EPUB book without needing to be the owner of the blog. This [website](http://blog2book.pothi.com/) allows users to convert blogs to PDF without needing to be the blog owner, but it will only convert up to 100 posts. Most of the blogs I want to convert have 200+ posts. I'd like the published date of each post to be included at the top or bottom of the post, and graphics and images to be retained if possible.
                                

whitewings (2527 rep)

May 1, 2015, 08:35 PM • Last activity: Jan 12, 2022, 12:39 PM

1 votes

0 answers

232 views

XMLstarlet to fix image tags and replace path for images

xml html xmlstarlet epub

I have multiple .XHTML files in a folder. The top declaration part is as follows:

First, I don't want to alter the top head part.

I want to process the files in batch and fix two things:
1) terminate image tags properly with '/>', and do the same for ` ` and ` ` tags.
2) replace path in all images (preserving name), i.e. from
 
to

I tried xmlstarlet (v1.6.1):

    xmlstarlet fo --recover --html file.xhtml

but it alters the top declaration part, adding extra stuff at the top:

It also warns about an invalid tag:

    file.xhtml:8.54: Tag section invalid
    
                                                         ^

What are the correct commands? First I need a 'dry run' to see the changes; if they are OK, I then want to apply them in place.

minto (575 rep)

Sep 27, 2021, 09:19 PM • Last activity: Sep 27, 2021, 10:07 PM

0 votes

3 answers

4073 views

How to extract files recursively but keep them in their own folders?

command-line terminal zip epub

This is how I'm extracting all the files in a folder (recursively):

    find -iname \*.epub -exec unzip -o {} \;

But the extracted files end up all in the parent folder:

    Parent (Extracted Epub files)
      Child (Epub files)
      Child (Epub files)

How can I change that command so that they are extracted into their own folders?

    Parent
      Child (Epub files and extracted Epub files)
      Child (Epub files and extracted Epub Files)
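The layout above (each archive unpacked next to where it lives) can also be sketched in Python, where the extraction target is chosen per file instead of being the current directory. This is only an illustration of the idea, not the `find`/`unzip` answer itself. The function name `extract_all` and its `per_book_dir` switch are hypothetical; the switch is there because several EPUBs in one folder contain same-named entries such as `mimetype` and would overwrite each other.

```python
import zipfile
from pathlib import Path


def extract_all(root, per_book_dir=False):
    """Unpack every .epub under root beside itself and return the target folders."""
    # Collect the archives first so freshly extracted files are not walked again.
    epubs = [p for p in sorted(Path(root).rglob("*"))
             if p.is_file() and p.suffix.lower() == ".epub"]
    targets = []
    for epub in epubs:
        # per_book_dir=True unpacks book.epub into book/ to avoid name collisions.
        target = epub.with_suffix("") if per_book_dir else epub.parent
        with zipfile.ZipFile(epub) as zf:
            zf.extractall(target)  # existing files are overwritten, like unzip -o
        targets.append(target)
    return targets
```

Running `extract_all("Parent")` leaves the extracted files in each `Child` folder, while `extract_all("Parent", per_book_dir=True)` gives every book its own subfolder.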

wyc (143 rep)

Sep 20, 2021, 05:32 AM • Last activity: Sep 20, 2021, 09:44 AM

0 votes

1 answers

1126 views

How to automatically replace mimetype when unzipping?

command-line terminal zip epub

I'm using the following command to unzip epub files recursively inside a folder:

    find -iname \*.epub -exec unzip {} \;

It works. But the terminal asks me this each time a file is being extracted:

> replace mimetype? [y]es, [n]o, [A]ll, [N]one, [r]ename:

Is there a flag that I can add to the command so it automatically selects [A]ll?

wyc (143 rep)

Sep 19, 2021, 06:25 AM • Last activity: Sep 19, 2021, 07:09 AM

4 votes

1 answers

4438 views

Convert EPUB to TXT and preserve original formatting

text calibre epub

I have a programming book in EPUB format and I'm trying to convert it to TXT.
For that I'm using the utility **ebook-convert** from **calibre**.
The problem is that the standard usage:

    ebook-convert book.epub book.txt

removes indentation in source code samples.
For example, a sample in the book looks like this:

    class A {
      private int a;
    }

But in the resulting TXT:

    class A {
    private int a;
    }

After reading the utility's man page I've tried the following options:

    --keep-ligatures
    --pretty-print
    --change-justification=original

but with no result. How can I achieve this?

                                

ka3ak (1275 rep)

May 2, 2021, 10:15 AM • Last activity: May 2, 2021, 10:50 AM

4 votes

5 answers

4600 views

EPUB reader for *BSD/Linux

linux software-rec bsd epub

What are the best native EPUB readers for *BSD/Linux? Browser add-ons are not an option.

I prefer non-Qt applications but you can share Qt applications if you want. If possible, I want a program that remembers the page I was last viewing.

Rufo El Magufo (3224 rep)

Oct 19, 2012, 07:13 PM • Last activity: Oct 17, 2019, 10:57 AM

6 votes

3 answers

8098 views

Recommendation for an eBook reader for Gnome

gnome software-rec epub ebooks

There are eBook readers for Android, there's Okular for KDE, and so on, but what I want is an eBook (ePub format) reader for my normal Linux desktop.

I know there's Calibre, which goes way beyond being just an eBook reader, and there's FBReader, which doesn't really work yet. Given that eBooks have been around for several years now, I'd have assumed that more software would have sprung up by now.

polemon (11921 rep)

Oct 20, 2012, 05:14 AM • Last activity: Oct 17, 2019, 10:31 AM

2 votes

1 answers

1637 views

How to convert PDF to EPUB in a fixed layout in Calibre

pdf calibre epub

I am trying to use [Calibre](https://calibre-ebook.com/help) to convert a PDF file to EPUB format with a fixed layout, but I am not able to convert it. Can somebody tell me the steps to convert to a fixed layout in Calibre?
                                

mohan rathour (121 rep)

May 28, 2019, 07:53 AM • Last activity: May 31, 2019, 06:38 PM

2 votes

1 answers

2261 views

ebook-convert for all .epub files in the folder

ubuntu conversion calibre epub

This command converts an EPUB file to a TXT file:

    ebook-convert "book.epub" "book.txt"

How can I use it to convert all .epub files in the directory?

I am using Ubuntu.

### Code

    from os import listdir, rename
    from os.path import isfile, join
    import subprocess
    
    
    # return name of file to be kept after conversion.
    # we are just changing the extension. azw3 here.
    def get_final_filename(f):
        f = f.split(".")
        filename = ".".join(f[0:-1])
        processed_file_name = filename+".azw3"
        return processed_file_name
    
    
    # return file extension. pdf or epub or mobi
    def get_file_extension(f):
        return f.split(".")[-1]
    
    
    # list of extensions that needs to be ignored.
    ignored_extensions = ["pdf"]
    
    # here all the downloaded files are kept
    mypath = "/home/user/Downloads/ebooks/"
    
    # path where converted files are stored
    mypath_converted = "/home/user/Downloads/ebooks/kindle/"
    
    # path where processed files will be moved to, clearing the downloaded folder
    mypath_processed = "/home/user/Downloads/ebooks/processed/"
    
    raw_files = [f for f in listdir(mypath) if isfile(join(mypath, f))]
    converted_files =  [f for f in listdir(mypath_converted) if isfile(join(mypath_converted, f))]
    
    for f in raw_files:
        final_file_name = get_final_filename(f)
        extension = get_file_extension(f)
        if final_file_name not in converted_files and extension not in ignored_extensions:
            print("Converting : "+f)
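            try:
                # check=True raises CalledProcessError if ebook-convert fails,
                # so the source file is only moved after a successful conversion
                subprocess.run(["ebook-convert", mypath+f, mypath_converted+final_file_name], check=True)
                rename(mypath+f, mypath_processed+f)
            except Exception as e:
                print(e)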
        else:
            print("Already exists : "+final_file_name)
                                

silver (61 rep)

Mar 9, 2019, 03:07 PM • Last activity: Mar 9, 2019, 10:09 PM

5 votes

0 answers

11254 views

Lightweight PDF to mobi and epub converter for Ubuntu

ubuntu pdf conversion ebooks epub

By lightweight I mean NOT Calibre. Please. I do not need cataloging/library management software, which would not only consume unnecessary disk space but also ignore the cataloging I have maintained for years.

I just need a quick and dirty batch conversion to epub or mobi without having to deal with the myriad issues of using calibre.

**Are there any simple PDF-to-epub and PDF-to-mobi conversion tools for Ubuntu?**

There seem to be several for Windows-based machines, but only calibre for Ubuntu.

NVAR (51 rep)

Jan 4, 2018, 07:02 AM • Last activity: Nov 21, 2018, 09:58 PM

4 votes

1 answers

130 views

Safely handling PDFs and other ebook formats on Linux

security pdf documents ebooks epub

I'm running Arch Linux and using Okular for opening PDF files and FBReader for other ebook formats (Epub, Mobi, etc.). Simply put, here's my question: assuming some of those documents come from unreliable sources and contain malicious code, what can I do to mitigate the risk of compromising the system and opening it up to invasion (which can be a common occurrence in this country if you even smell like someone who holds opinions the government disapproves of)?

A few more specific questions:

Is just opening the referred files enough to put my setup at serious risk? The user I use for this is on the sudoers list, so, if compromised, it could be used for escalation.

Suppose I only open the files using a more limited user account, would that at least help?

Outside of setting up a virtual machine only for reading (which wouldn't be practical for a few reasons) or using another computer just for that (same thing), is there anything I can do?

Dave (41 rep)

Sep 1, 2018, 04:53 PM • Last activity: Sep 9, 2018, 08:28 AM

2 votes

2 answers

804 views

Recursively grep through epub files

grep search recursive epub

I tried the answers here, but without luck.

    find . -name "*.epub" -exec zipgrep pattern {} \;

showed me "matched", but didn't give me the matching epub file back. Also, it returned huge blobs of data, which were hard to grep through.

grep -a didn't work at all.

I want something like grep -R but for epub files.
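To illustrate what a "grep -R for EPUBs" would have to do, here is a minimal Python sketch that walks a directory tree, opens each .epub as a ZIP archive, and reports the archive name alongside each matching line while skipping binary entries such as images. The function name `grep_epubs` and the list of text suffixes are assumptions for illustration, and matching is done on the raw markup, so a pattern can also hit tag or attribute text.

```python
import re
import zipfile
from pathlib import Path

TEXT_SUFFIXES = (".xhtml", ".html", ".htm", ".ncx", ".opf")


def grep_epubs(root, pattern, flags=0):
    """Yield (epub path, entry name, line number, line) for each matching text line."""
    regex = re.compile(pattern, flags)
    for epub in sorted(Path(root).rglob("*.epub")):
        if not epub.is_file():
            continue
        try:
            zf = zipfile.ZipFile(epub)
        except zipfile.BadZipFile:
            continue  # not a readable ZIP container; skip it
        with zf:
            for name in zf.namelist():
                if not name.lower().endswith(TEXT_SUFFIXES):
                    continue  # ignore images, fonts and other binary entries
                text = zf.read(name).decode("utf-8", errors="replace")
                for lineno, line in enumerate(text.splitlines(), 1):
                    if regex.search(line):
                        yield epub, name, lineno, line.strip()
```

Printing each tuple as `f"{epub}:{name}:{lineno}: {line}"` gives output in the familiar grep style, including the name of the matching EPUB file.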

JJ Abrams (185 rep)

Jun 28, 2018, 03:47 PM • Last activity: Jun 28, 2018, 05:37 PM

4 votes

2 answers

2345 views

Extract TOC of epub file

command-line file-format ebooks epub

I recently came across a command that prints the TOC of a `pdf` file:

    mutool show file.pdf outline

I'd like a command for the `epub` format that is as simple to use and gives as nice a result as the one above does for the `pdf` format.

Is there something like that?
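For reference, EPUB 2 files keep their table of contents in an NCX file inside the ZIP container, as nested `navPoint` elements, so an outline similar to the `mutool` one can be produced by walking that tree. The following is a rough sketch with the Python standard library. The function name `epub_toc` is hypothetical, it simply takes the first `.ncx` entry it finds, and it does not handle EPUB 3 books that ship only a nav document.

```python
import zipfile
import xml.etree.ElementTree as ET

NCX = "{http://www.daisy.org/z3986/2005/ncx/}"


def epub_toc(epub_path):
    """Return a list of (depth, title) pairs from the first .ncx file in the archive."""
    with zipfile.ZipFile(epub_path) as zf:
        ncx_names = [n for n in zf.namelist() if n.lower().endswith(".ncx")]
        if not ncx_names:
            return []  # EPUB 3 files may ship only a nav document instead
        root = ET.fromstring(zf.read(ncx_names[0]))

    entries = []

    def walk(node, depth):
        # Only direct navPoint children, so nesting depth is tracked correctly.
        for point in node.findall(NCX + "navPoint"):
            label = point.find(f"{NCX}navLabel/{NCX}text")
            title = (label.text or "").strip() if label is not None else ""
            entries.append((depth, title))
            walk(point, depth + 1)

    nav_map = root.find(NCX + "navMap")
    if nav_map is not None:
        walk(nav_map, 0)
    return entries
```

Printing `"  " * depth + title` for each pair yields an indented outline, one line per TOC entry.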

xralf (15189 rep)

May 19, 2016, 09:53 PM • Last activity: Feb 1, 2018, 03:58 PM

Showing page 1 of 20 total questions