# The Problem
I have a lot of old books that I want to scan and digitize. For this, I use a flatbed scanner, xsane, and GImageReader, a setup that works great.
A few years ago, when I was still using Windows for such things, I used ABBYY FineReader, which I was also very happy with but which is not available on Linux.
Comparing the PDFs I create today with the ones I created back then, the files are much larger now.
With ABBYY, I used to get 50-60-page PDFs with file sizes between 10 and 50 MB, which I find acceptable.
Nowadays, a 50-60-page PDF comes out at 150+ MB, which is not very usable when reading on a smartphone, for example.
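Taking rough midpoints of the figures above, that is about 30 MB / 55 pages ≈ 0.5 MB per page with ABBYY versus 150 MB / 55 pages ≈ 2.7 MB per page now.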
I use the same settings for scanning, namely:
- A4 pages
- jpg compression
- 300 dpi
- color scans for the covers
- greyscale scans for all interior pages
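For reference, this is roughly what those settings look like as a plain scanimage call instead of the xsane GUI. Treat it as a sketch only: the exact option names and allowed values depend on the SANE backend, and I actually do the scanning interactively in xsane.

```sh
# Interior pages: greyscale, 300 dpi, A4 (210 x 297 mm).
scanimage --mode Gray --resolution 300 -x 210 -y 297 --format=tiff > page-001.tiff

# Covers: the same, but in colour.
scanimage --mode Color --resolution 300 -x 210 -y 297 --format=tiff > cover.tiff
```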
I assume that the size difference comes from ABBYY using some commercial magic to be smart about image compression, which GImageReader doesn't have. Perhaps it identifies non-empty areas (illustrations and text blocks) and saves them at higher quality while aggressively compressing the "background image", or something like that. Or maybe it simply detects that some pages are greyscale while others are in colour, a distinction that might be lost on GImageReader. I don't really know, and I'd love to understand it.
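One way I could at least test the greyscale hypothesis would be to inspect the page images before they go into the PDF, for example with ImageMagick. The file name below is just a placeholder, and I haven't verified that this is actually where the size is lost:

```sh
# Report how the page image is actually stored;
# %[type] prints e.g. "Grayscale" or "TrueColor".
identify -format '%f: %[type] %[colorspace]\n' page-042.jpg

# If a page merely *looks* grey but is stored as RGB, force it down to a
# single-channel greyscale JPEG before building the PDF (quality is a guess).
convert page-042.jpg -colorspace Gray -quality 75 page-042-grey.jpg
```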
# What I have tried
Since then, I have played a bit with various methods of PDF compression. Most online guides suggest using gs or pdftk, both of which I have tried. In my specific case, I observe the following:
- Option 1: gs for pdf->pdf (the kind of invocation I mean is sketched after this list). The /printer and /prepress settings do not reduce the file size at all, while the /screen and /ebook settings lead to notable degradation of image quality.
- Option 2: gs for pdf->ps and then ps->pdf. This leads to a notable reduction in file size (I don't understand why this differs from Option 1, but whatever), and I was happy with that option until I noticed that the glyphs of the text apparently get lost in translation: when I copy & paste text segments from the resulting PDF, I get wingdings-type gibberish, whereas copy & paste from the original PDF works fine. So this is a no-no.
- Option 3: pdftk for pdf->pdf. This does not seem to reduce the file size at all.
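For completeness, the first command below is roughly the kind of invocation I mean in Option 1; the second is a variant with explicit distiller downsampling parameters that I have seen suggested as a middle ground between /printer and /ebook, but which I have not seriously explored yet. File names and the 150 dpi target are placeholders:

```sh
# Option 1: one preset (/screen, /ebook, /printer, /prepress) controls everything.
gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dBATCH -dQUIET \
   -sOutputFile=out.pdf in.pdf

# Variant: skip the preset and set the image downsampling explicitly, so
# colour and greyscale images can be targeted independently.
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
   -dDownsampleColorImages=true -dColorImageResolution=150 \
   -dDownsampleGrayImages=true  -dGrayImageResolution=150 \
   -dNOPAUSE -dBATCH -dQUIET \
   -sOutputFile=out.pdf in.pdf
```

Either way, I assume pdffonts and pdftotext (from poppler-utils) are a quick way to check whether the text layer survives such a round trip, which is exactly what broke in Option 2.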
# What to do now
I'm a bit lost as to how these PDF compression techniques can yield such vastly different results. I am looking for a tool that runs under Linux (preferably FOSS, but I'd settle for an affordable commercial product as well) and provides significant compression of scanned & OCRed PDFs without notable loss of quality compared to a 300 dpi A4 JPG.
Asked by carsten (375 rep)
Jan 3, 2021, 10:05 AM
Last activity: Jan 3, 2021, 04:21 PM