Relinking OCR data to downscaled images

0 votes
0 answers
33 views
I have a PDF consisting of scanned pages with OCR done by tesseract. I want to downscale the images (by around 4x) and retain the OCR. What would be an automatic way to relink the OCR data to the new coordinates on the pages?

---

Original question (*Sudden surge in the size of PDF generated by Ghostscript 9.55.0*):

I have a large PDF and am trying, unsuccessfully, to reduce its size. (The file is copyrighted material, so I cannot post it; I am doing this for personal use. It must be a common problem, though: I have had the same experience with other files from different sources.)

The file is a scan of a book with 1500 B&W A4 pages of text, no pictures at all. The individual pages were mogrify-ed into PNG images of equal height (around 1000px) and cleaned up via scantailor-advanced. Then each of the pages (now in TIFF) was tesseracted, and the results were pdfunited into a 200MB file. This is way too large for this kind of book. I would like to shrink it to around 30MB, perhaps 50. (The total text size extracted by pdftext is 9MB.)

Most of the PDF compression methods I found on StackExchange and other sites boil down to gs with varying parameters. On my machine they all behave in a very similar way. I start gs in the terminal and switch to a GUI file manager. The size of the output file grows slowly and steadily from 0 to around 15MB (no matter the settings), and then it gallops in the last split second, as if gs gives up and just dumps the input into the output verbatim. (I attributed this to a memory shortage, but the program behaves the same way on a relatively small, 100-page part of this file.)

If gs is not told to change the DPI (300), the output file becomes as large as the input was. If the DPI is changed to 72, the file becomes 70MB; this is still too much for such a loss in image quality.

Is there an explanation for this surge? Should I perhaps use some other toolchain on the raw scans, or a different optimization tool?

pdfsizeopt is very slow and seems to give about a 10% reduction. tiff2pdf -j 50 saves 5% (which will be re-added during OCR).
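
For the relinking step asked about at the top, here is a minimal sketch of what rescaling OCR coordinates could look like. It assumes the OCR is available as hOCR, where word and line boxes are stored in image pixels as `bbox x0 y0 x1 y1`; the question only mentions a PDF text layer, so this is an assumption. The function name `scale_hocr_bboxes` and the sample string are hypothetical. Only `bbox` values are handled; other pixel-valued hOCR properties would likely need the same treatment. If the text layer already lives in the PDF and the page size in points is unchanged, the text positions do not depend on the image's pixel dimensions, so rescaling may not be needed at all.

```python
import re

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"\bbbox (\d+) (\d+) (\d+) (\d+)")


def scale_hocr_bboxes(hocr, factor):
    """Return the hOCR text with every bbox coordinate multiplied by factor.

    Hypothetical helper: it rescales only bbox values and rounds each
    coordinate to the nearest pixel of the downscaled image.
    """

    def _scale(match):
        coords = [round(int(value) * factor) for value in match.groups()]
        return "bbox " + " ".join(str(c) for c in coords)

    return BBOX_RE.sub(_scale, hocr)


# Made-up hOCR fragment for a word found on the full-size scan.
sample = (
    '<span class="ocrx_word" '
    'title="bbox 400 800 1200 1000; x_wconf 95">Hello</span>'
)

# Downscaling the image by 4x corresponds to a factor of 0.25.
print(scale_hocr_bboxes(sample, 0.25))
```

The rescaled hOCR would then have to be combined with the downscaled images by whatever tool builds the final PDF; that step is not shown here.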
Asked by Dilettante (101 rep)
Jul 25, 2025, 07:17 PM
Last activity: Jul 26, 2025, 07:19 PM