# The Problem
I have a lot of old books that I want to scan and digitize. For this, I use a flatbed scanner, xsane, and GImageReader, a setup that works great.
A few years ago, when I was still using Windows for such things, I used ABBYY FineReader, which I was also very happy with but which is not available on Linux.
Comparing the PDFs I create today with the ones I created back then, the files are much larger now.
With ABBYY, I used to get 50-60-page PDFs with file sizes between 10 and 50 MB, which I find acceptable.
Nowadays, a 50-60-page PDF comes out at 150+ MB, which is not very usable when reading on a smartphone, for example.
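Taking rough midpoints of the figures above, that is about 30 MB / 55 pages ≈ 0.5 MB per page with ABBYY versus 150 MB / 55 pages ≈ 2.7 MB per page now.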
I use the same settings for scanning, namely:
- A4 pages
- jpg compression
- 300 dpi
- color scans for the covers
- greyscale scans for all interior pages
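For reference, this is roughly what those settings look like as a plain scanimage call instead of the xsane GUI. Treat it as a sketch only: the exact option names and allowed values depend on the SANE backend, and I actually do the scanning interactively in xsane.

```sh
# Interior pages: greyscale, 300 dpi, A4 (210 x 297 mm).
scanimage --mode Gray --resolution 300 -x 210 -y 297 --format=tiff > page-001.tiff

# Covers: the same, but in colour.
scanimage --mode Color --resolution 300 -x 210 -y 297 --format=tiff > cover.tiff
```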
I assume that the size difference comes from ABBYY using some commercial magic to be smart about image compression, which GImageReader doesn't have. Perhaps it identifies non-empty areas (illustrations and text blocks) and saves them at higher quality while aggressively compressing the "background image", or something like that. Or maybe it simply detects that some pages are greyscale while others are in colour, a distinction that might be lost on GImageReader. I don't really know, and I'd love to understand it.
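One way I could at least test the greyscale hypothesis would be to inspect the page images before they go into the PDF, for example with ImageMagick. The file name below is just a placeholder, and I haven't verified that this is actually where the size is lost:

```sh
# Report how the page image is actually stored;
# %[type] prints e.g. "Grayscale" or "TrueColor".
identify -format '%f: %[type] %[colorspace]\n' page-042.jpg

# If a page merely *looks* grey but is stored as RGB, force it down to a
# single-channel greyscale JPEG before building the PDF (quality is a guess).
convert page-042.jpg -colorspace Gray -quality 75 page-042-grey.jpg
```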
# What I have tried
Since then, I have played a bit with various methods of PDF compression. Most online guides suggest using gs or pdftk, both of which I have tried. In my specific case, I observe the following:
- Option 1: gs for pdf->pdf (the kind of invocation I mean is sketched after this list). The /printer and /prepress settings do not reduce the file size at all, while the /screen and /ebook settings lead to notable degradation of image quality.
- Option 2: gs for pdf->ps and then ps->pdf. This leads to a notable reduction in file size (I don't understand why this differs from Option 1, but whatever), and I was happy with that option until I noticed that the glyphs of the text apparently get lost in translation: when I copy & paste text segments from the resulting PDF, I get wingdings-type gibberish, whereas copy & paste from the original PDF works fine. So this is a no-no.
- Option 3: pdftk for pdf->pdf. This does not seem to reduce the file size at all.
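For completeness, the first command below is roughly the kind of invocation I mean in Option 1; the second is a variant with explicit distiller downsampling parameters that I have seen suggested as a middle ground between /printer and /ebook, but which I have not seriously explored yet. File names and the 150 dpi target are placeholders:

```sh
# Option 1: one preset (/screen, /ebook, /printer, /prepress) controls everything.
gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dBATCH -dQUIET \
   -sOutputFile=out.pdf in.pdf

# Variant: skip the preset and set the image downsampling explicitly, so
# colour and greyscale images can be targeted independently.
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
   -dDownsampleColorImages=true -dColorImageResolution=150 \
   -dDownsampleGrayImages=true  -dGrayImageResolution=150 \
   -dNOPAUSE -dBATCH -dQUIET \
   -sOutputFile=out.pdf in.pdf
```

Either way, I assume pdffonts and pdftotext (from poppler-utils) are a quick way to check whether the text layer survives such a round trip, which is exactly what broke in Option 2.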
# What to do now
I'm a bit lost as to how these PDF compression techniques can yield such vastly different results. I am looking for a tool that runs under Linux (preferably FOSS, but I'd settle for an affordable commercial product as well) and provides significant compression of scanned & OCRed PDFs without notable loss of quality compared to a 300 dpi A4 JPG.
Asked by carsten (375 rep)
Jan 3, 2021, 10:05 AM
Last activity: Jan 3, 2021, 04:21 PM