
Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

0 votes
0 answers
33 views
Relinking OCR data to downscaled images
I have a PDF consisting of scanned pages with OCR done by `tesseract`. I want to downscale the images (by around 4x) and retain the OCR. What would be an automatic way to relink the OCR data to the new coordinates on the pages?

---

Original question (*Sudden surge in the size of PDF generated by Ghostscript 9.55.0*):

I have a large PDF and am unsuccessfully trying to reduce its size. (The file is copyrighted material, so I cannot post it; I am doing this for personal use. It must be a common problem, though: I had the same experience earlier with other files from different sources.)

The file is a scan of a book with 1500 B&W A4 pages of text, no pictures at all. The individual pages were `mogrify`-ed into PNG images of equal height (around 1000px) and cleaned up via `scantailor-advanced`. Then each of the pages (now in TIFF) was processed with `tesseract`. The results were merged with `pdfunite` into a 200MB file. This is way too large for this kind of book. I would like to be able to shrink it to around 30MB, perhaps 50. (The total text size extracted by `pdftotext` is 9MB.)

Most of the PDF compression methods I found on StackExchange and other sites boil down to `gs` with varying parameters. On my machine they all behave in a very similar way. I start `gs` in the terminal and switch to a GUI file manager. The size of the output file grows slowly and steadily from 0 to around 15MB (no matter the settings), and then it gallops in the last split second, as if `gs` gives up and just dumps the input into the output verbatim. (I attributed this to memory shortage, but the program also exhibits similar behavior on a relatively small, 100-page part of this file.)

If `gs` is not told to change the DPI (300), the output file becomes as large as the input was. If the DPI is changed to 72, the file becomes 70MB; this is still too much for such a loss in image quality.

Is there an explanation of this surge? Should I perhaps use some other toolchain on the raw scans, or a different optimization tool?

`pdfsizeopt` is very slow and seems to lead to a 10% reduction. `tiff2pdf -j 50` saves 5% (which will be re-added during OCR).
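If the OCR data is available as hOCR (a format `tesseract` can emit), the relinking asked about above amounts to multiplying every `bbox x0 y0 x1 y1` value by the scale factor. The following is only a minimal sketch under that assumption; the function name `scale_hocr_bboxes` and the sample markup are made up for illustration, and it does not touch a text layer that is already embedded in a PDF. Note also that text in a PDF is positioned in page units rather than image pixels, so if the page size stays the same, the existing text layer may not need relinking at all.

```python
import re

# hOCR stores word/line boxes as "bbox x0 y0 x1 y1" inside title attributes.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def scale_hocr_bboxes(hocr: str, factor: float) -> str:
    """Return the hOCR markup with every bbox multiplied by `factor`.

    Hypothetical helper: use factor=0.25 for a 4x downscale.
    """
    def _scale(match: re.Match) -> str:
        coords = [round(int(value) * factor) for value in match.groups()]
        return "bbox " + " ".join(str(c) for c in coords)

    return BBOX_RE.sub(_scale, hocr)


# Made-up sample line in hOCR style; other properties are left untouched.
sample = "<span class='ocrx_word' title='bbox 400 800 1200 880; x_wconf 93'>Relinking</span>"
print(scale_hocr_bboxes(sample, 0.25))
```

Rounding to integers keeps the output valid hOCR; for small boxes this can shift an edge by up to one pixel, which is usually harmless for text selection.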
Dilettante (101 rep)
Jul 25, 2025, 07:17 PM • Last activity: Jul 26, 2025, 07:19 PM
102 votes
4 answers
79759 views
How to OCR a PDF file and get the text stored within the PDF?
First, apologies if this has been asked before - I searched for a while through the existing posts, but could not find support. I am interested in a solution for Fedora to OCR a multipage non-searchable PDF and to turn this PDF into a new PDF file that contains the text layer on top of the image. On Mac OSX or Windows we could use Adobe Acrobat, but is there a solution on Linux, specifically on Fedora? [This](https://web.archive.org/web/20190807064639/https://snippets.webaware.com.au/howto/pdf-ocr-linux/) seems to describe a solution - but unfortunately I am already lost when retrieving exact-image.
ingli (2029 rep)
Aug 4, 2016, 03:39 PM • Last activity: Mar 27, 2025, 11:54 AM
0 votes
0 answers
17 views
Is there something which lets tesseract tell some OCR-quality measure?
I am on Ubuntu. Most of my scanned documents are German, English or French. This question is related to my other question at https://unix.stackexchange.com/questions/792095/is-there-an-option-to-let-pdfsandwich-try-90-rotations-automatically-for-scanne Is there a way to let tesseract tell us how well its OCR worked, something like a quality measure: x% of everything that looks like characters could be clearly identified, y% were identified as characters but with a doubtful distance to the traineddata. If there were something like this, one might run tesseract (possibly time-constrained for each page), then run it again with the same page rotated by 180° and check whether OCR works better for the upside-down orientation. Or one could run it again with the document turned by 90°, 180° or 270° and then do the full OCR for the orientation which works best.
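One rough quality measure can be derived from tesseract's TSV output (for example `tesseract page.png out tsv`), which carries a per-word `conf` column between 0 and 100. The sketch below averages that column over word rows; the function name `mean_word_confidence` and the sample data are invented for illustration, and a mean confidence is only a proxy for OCR quality, not a guarantee. Running it on the output for each candidate rotation and keeping the highest score would be one way to pick an orientation.

```python
import csv
import io


def mean_word_confidence(tsv_text: str) -> float:
    """Average the `conf` column over word rows (level 5) of tesseract TSV output.

    Rows with conf -1 are structural (page, block, line) and are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    scores = [float(row["conf"]) for row in reader
              if row["level"] == "5" and float(row["conf"]) >= 0]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up sample in the column layout that tesseract's TSV output uses.
header = "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext"
sample_tsv = "\n".join([
    header,
    "4\t1\t1\t1\t1\t0\t10\t10\t200\t20\t-1\t",
    "5\t1\t1\t1\t1\t1\t10\t10\t90\t20\t96\tGuten",
    "5\t1\t1\t1\t1\t2\t110\t10\t90\t20\t84\tTag",
])
print(mean_word_confidence(sample_tsv))
```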
Adalbert Hanßen (303 rep)
Mar 7, 2025, 04:20 PM
0 votes
0 answers
14 views
Is there an option to let pdfsandwich try 90° rotations automatically for scanned pages when necessary?
I am on Ubuntu. Most of my scanned documents are German, English or French. Some scans have to be rotated before doing OCR on them, otherwise pdfsandwich returns nonsense OCR. Is there any preprocessor parameter which lets pdfsandwich automatically try rotated input files if the OCR result from tesseract for an unrotated input file is of low quality? Is there any preprocessing which first detects the general orientation of the lines and rotates the page accordingly to support better OCR?
Adalbert Hanßen (303 rep)
Mar 7, 2025, 04:07 PM
0 votes
0 answers
14 views
Is it possible to integrate pdfsandwich into XSane?
Is it possible to integrate pdfsandwich into XSane? This command adds an OCR sandwiched plane to a scanned file (the option behind `-lang` is for German):

```
pdfsandwich -lang deu -grayfilter file.pdf
```

Compared with other solutions, the generated PDF file is relatively small. Therefore I would like to integrate pdfsandwich into the XSane workflow. It would be best if this were possible for multipage scanned PDF files too. The language should not be fixed once and for all. Perhaps this might be achieved by changing the letters for the `-lang` option stored in some file which is read when a PDF file is OCRed after scanning. If the language has to be asked for each page separately, it should be possible to just press RETURN to keep the most recent value for `-lang`.
Adalbert Hanßen (303 rep)
Feb 17, 2025, 08:30 PM
3 votes
1 answers
462 views
MacOS-like OCR for Linux?
How can one set up the same ubiquitous OCR capabilities on Linux, in a manner similar to how one can copy text from *any image* in *any software* on macOS and iOS? I am using EndeavourOS with the GNOME desktop environment.
Pushp Vashisht (131 rep)
Apr 13, 2023, 06:04 PM • Last activity: Jan 26, 2025, 07:20 PM
56 votes
7 answers
44888 views
Is there some sort of PDF-to-text converter?
I need PDF files as text so I can search over them in bulk from the command line. Is there some converter for Ubuntu, OBSD or a similar distro? Perhaps related post, OCR with Ubuntu [here](https://askubuntu.com/questions/8792/optical-character-recognition-software-on-ubuntu).
otto (661 rep)
Dec 11, 2010, 02:46 PM • Last activity: Dec 30, 2024, 06:23 PM
0 votes
0 answers
300 views
What happened to Tesseract's "Math / equation detection module"?
I was able to get Tesseract to run via a Python script on my Windows machine to turn non-searchable PDFs into searchable ones. When downloading Tesseract onto Windows, it asked me which languages I wanted and I selected them; this was when I learned about the math module to begin with. I am not sure how effective the math module was, but I could see that it was downloaded when I checked the languages.

Now I am trying to install Tesseract on Debian. To install Tesseract I used the command:

    sudo apt install -y tesseract-ocr

Then, to ensure I had the math module, I would always follow that up with:

    sudo apt install tesseract-ocr-equ

And I am pretty sure that would install the math module. I remember using that command successfully several times, including earlier this morning. However, now, when I use that command, I get the following messages:

    Reading package lists... Done
    Building dependency tree... Done
    Reading state information... Done
    E: Unable to locate package tesseract-ocr-equ

Just to make sure I wasn't crazy, I looked up the language codes used by Tesseract, [according to Debian.org](https://manpages.debian.org/testing/tesseract-ocr/tesseract.1.en.html#LANGUAGES_AND_SCRIPTS:~:text=Math%20/%20equation%20detection%20module), and they say that "equ" belongs to the "Math / equation detection module"; admittedly, that is an earlier version.

So, I tried the following command:

    sudo apt-get install -y tesseract-ocr-equ

Among the several lines of output that I got in response were the following:

    Note, selecting 'tesseract-ocr-uzb-cyrl' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-ell' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-eng' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-enm' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-epo' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-est' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-eus' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-que' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-uig' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-ukr' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-urd' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-uzb' for regex 'tesseract-ocr-[equ]'
    tesseract-ocr-eng is already the newest version (1:4.1.0-2).
    tesseract-ocr-eng set to manually installed.

So, this made me wonder if there was a different math module for different languages, and whether the math module is automatically downloaded with the language you download. I just really remember using the command initially without any problem. That being said, I have had several head injuries, so my memory is not entirely reliable. It's just that if I turn out to have been mistaken here and I have not been using that command as I remember, this will be one of those deeply troubling times due to how vividly I remember this working.

So, the primary question is: how do I download the "Math / equation detection module" for Tesseract onto the Linux Beta on my Chromebook? Secondarily, could someone tell me if the functionality of the `sudo apt install tesseract-ocr-equ` command changed recently? This is frustrating me quite a bit.

I am hoping that someone just changed the functionality this morning and math modules are now built into the languages.
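The "Note, selecting ... for regex 'tesseract-ocr-[equ]'" lines suggest that the name was evaluated as a pattern, in which `[equ]` is a character class meaning "one of e, q or u" rather than the literal suffix `equ`. The short sketch below reproduces that selection with Python's `re` module, which treats this simple class the same way; the package names are taken from the output quoted above, except `tesseract-ocr-deu`, which is added here only as a non-matching contrast. It does not explain why apt saw brackets in the first place.

```python
import re

# "[equ]" matches exactly one character: e, q, or u.
pattern = re.compile(r"tesseract-ocr-[equ]")

names = [
    "tesseract-ocr-eng",       # matches: "e" follows the prefix
    "tesseract-ocr-que",       # matches: "q" follows the prefix
    "tesseract-ocr-ukr",       # matches: "u" follows the prefix
    "tesseract-ocr-uzb-cyrl",  # matches: "u" follows the prefix
    "tesseract-ocr-deu",       # no match: "d" follows the prefix
]

selected = [name for name in names if pattern.search(name)]
print(selected)
```

This is consistent with every selected package in the quoted output having a language code that starts with e, q or u.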
Curious Layman (101 rep)
May 16, 2024, 04:17 PM • Last activity: May 21, 2024, 09:06 AM
8 votes
4 answers
5544 views
How can I rasterize all of the text in a PDF?
You know when you have a pdf, which is a scan of a document and it's a really huge file, because it just stores the picture of the scanned document? And there are OCR tools which can help you to make a proper document which just stores the text? Well, I need the reverse of that! Let's say I have a perfect pdf document generated with pdflatex and I need to turn it into such a "huge" pdf, which looks exactly the same when printed on paper (with a certain dpi value), but is just a picture of the original. My initial idea is to turn the pdf into a series of JPGs and then back into a PDF, but perhaps there is some canonical way for that? --- In case you wonder why I would want to do such a thing: I'm currently stuck with a network printer, which is not maintained by me, and which randomly drops characters in printed files! So until someone figures out what's wrong there, I want this as workaround.
Dimitri Schachmann (183 rep)
Apr 26, 2015, 02:09 PM • Last activity: Feb 18, 2024, 01:40 PM
1 votes
2 answers
666 views
How to scan with ocr bash script
To streamline the scan process I intend to create a script that scans and applies OCR in one step. However, my bash skills are rather poor, so I would be very thankful for a bit of help. Here is my attempt:

    #!/bin/bash
    mydate="$(date +"%Y%m%d-%H%M%S")"
    image="$(scanimage --device "brother4:net1;dev0" --progress --verbose --resolution=600 -l 0 -t 0 -x 210 -y 297 --format=pdf)"
    ocrmypdf --deskew "$image" "$mydate".pdf

The command which works well, but without creating a date-specific filename, is:

    scanimage --device "brother4:net1;dev0" --progress --verbose --resolution=600 -l 0 -t 0 -x 210 -y 297 --format=pdf > scan.pdf && ocrmypdf --deskew scan.pdf scan.pdf

Since the OCR process takes some time, the filename containing the time (up to seconds) has to be stored at scan time and then applied to the final file. Or maybe it is possible (I did not find out how) to pipe the file to ocrmypdf without naming it and then save the file with date and time information.
alex (1023 rep)
Apr 26, 2023, 02:16 AM • Last activity: Feb 13, 2024, 09:30 AM
4 votes
1 answers
2845 views
Delete OCR from PDF
I have a PDF file containing corrupted OCR. It is a bunch of handwritten pages with a lot of symbols and abbreviations, and I got this file with an automatically generated OCR layer. How can I remove the text layer in order to get a lighter file (and to get rid of the unnecessary OCR)?
Seninha (1065 rep)
Jun 11, 2017, 10:46 PM • Last activity: Nov 24, 2023, 04:26 PM
0 votes
0 answers
175 views
Making badly scanned public domain books legible with OCR
I've obtained soft copies of some very old public domain books. The illustrations are clear enough, but the text is somewhat blurry. I've experimented with Tesseract OCR and it can recognize a surprising amount of the words with some errors, but it spits them out into a jumbled mess in a separate file. **Questions:** 1. Is there a way to have Tesseract or another OCR recognize the text and then place it over the original blurry text without changing other elements such as lines and illustrations? 2. And, if this is possible, is it also possible to have Tesseract or another OCR mimic the varying sizes, fonts, and colors of the original text? Thank you!
YQ002lc2 (145 rep)
Jul 24, 2023, 06:11 AM
2 votes
0 answers
50 views
OCR high res images & combine OCR data later, after image compression?
I have a large number of .tif's coming out of ScanTailor. Is there a way that I might OCR those .tif's with tesseract, holding the OCR data separate from the images; then compress the images, and finally combine the OCR data with the compressed images? The point is that I don't want to compress before I OCR, and the tools for compressing the pdf's later, preserving the OCR, are not great.
Diagon (740 rep)
Jul 7, 2023, 10:50 PM
0 votes
3 answers
1559 views
OCR software for handwritten equations to get LaTeX file
First of all, I apologize if this is not the right place to ask this, but I couldn't think of anywhere else (maybe Stack Overflow?). Anyway, I'm looking for Optical Character Recognition (OCR) software to process my notes. The thing is that occasionally there is an equation in the middle, so I am looking for software that can process the text and the equations together and that I can run on my Linux system. Ultimately my goal is to create a LaTeX file from that, so it wouldn't hurt if the output was already in LaTeX, but I guess that would be asking too much. I couldn't find anything online that did that, but I think that's mainly because I'm not using the right search terms (English is not my main language). I did find [this question](https://tex.stackexchange.com/questions/1443/what-is-the-status-of-generating-latex-from-handwriting-i-e-ocr) but it's from 4 years ago and I think things have changed since then. If I could get one good program to process the text part of the notes, and another to process the equation part, I'd already be able to put them together. Does anybody know a way of doing this?
TomCho (529 rep)
Dec 18, 2016, 05:59 PM • Last activity: Jan 11, 2023, 07:33 AM
0 votes
1 answers
200 views
Make (`ocrmypdf`) command run in terminal AND include input name in that of the output
I have this line inside a Dolphin service-menu file that contains many other commands for PDF processing: `Exec=bash -c 'f="%u"; ocrmypdf "$f" "${f%.pdf}_ocr.pdf";'` It has the advantage of giving an output file of the form `MY_PDF_ocr.pdf`, thus keeping the name of the input file. But I would prefer to have the command running in a terminal (konsole) so that I see the process. For that, I can use the line `Exec=konsole --noclose -e ocrmypdf "%u" ocr_en.pdf`, but then the output does not keep the name of the input. A line like `Exec=konsole --noclose -e ocrmypdf "%u" "${%u}_ocr.pdf"` does nothing. How can I adjust the `ocrmypdf` line so that the command is run in konsole and the output includes the name of the input?
cipricus (1779 rep)
Nov 30, 2022, 02:43 PM • Last activity: Dec 1, 2022, 01:08 PM
1 votes
1 answers
594 views
Best command-line OCR software for recognizing typed text over colorful background
I need to extract text from images like the one below:

[image: example image]

As you can see, the text is typed, not handwritten. Moreover, the background is colorful. I've tried Tesseract OCR, and while it works some of the time, it fails miserably on some inputs. For the above example, it produces "Due CoN aicomrBi em Cela RTL". Which command-line OCR software would you recommend? If Tesseract is my best bet, can I transform these images to make it easier for Tesseract to recognize the characters?

**EDIT**: Based on @MarcusMüller's suggestion, I used `convert -threshold 55%` to better separate the foreground text from the background. The resulting images are much better!

[image: binarized image]

Alas, Tesseract is still useless. On this new image, it produces: "Bim KM ioes Bm Meme e Cera". As such, the question remains open.
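For readers unfamiliar with the thresholding step mentioned in the edit, here is a rough pure-Python illustration of what a 55% threshold does to grayscale values: everything brighter than the cutoff becomes white and everything else becomes black. This is only a sketch of the idea on a nested list of 8-bit pixel values; the function name `binarize` is made up, and ImageMagick's actual handling of color channels and bit depths is more involved.

```python
def binarize(gray_rows, threshold_percent=55.0, max_value=255):
    """Map each grayscale pixel to 0 or max_value around a percentage cutoff.

    Hypothetical helper: pixels strictly above the cutoff become white.
    """
    cutoff = max_value * threshold_percent / 100.0
    return [[max_value if pixel > cutoff else 0 for pixel in row]
            for row in gray_rows]


# With 8-bit pixels the 55% cutoff is 140.25, so 140 turns black and 141 white.
sample = [[30, 140, 141, 250]]
print(binarize(sample))  # [[0, 0, 255, 255]]
```

Trying a few different percentages per image may matter here, since a colorful background can sit on either side of a fixed cutoff.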
user549392
Nov 15, 2022, 07:35 PM • Last activity: Nov 15, 2022, 09:26 PM
0 votes
0 answers
48 views
How do I format texts that were processed by OCR?
Let's say that I want to connect all the paragraphs that are broken by the citations that start with (1), (2), (3), (4), (5). How would I express/automate this in bash? Keep in mind there are at most 5 citations in a single page, which means I have to keep in mind all the combinations of 1, 1-2, 1-2-3, 1-2-3-4, 1-2-3-4-5. Example:

> parler en usage dans tous les temps et dans tous les pays, formaient avec les nouveaux principes le contraste le plus frappant. Suivant le système dont nous parlons, le pouvoir souverain, ou du moins la source du pouvoir, est dans le peuple, c'est-à-dire, dans le corps des sujets; car ce sont eux qui doivent avoir fondé l'État par leur réunion. La masse du peuple est le véritable souverain , le maître réel, le summus imperans ; c'est en elle seule que la majesté réside tout entière (1). Les princes que l?on regardait jadis comme des seigneurs indépendans, se trouvent changés en simples serviteurs ou employés de leurs peuples (2); car celui qui recoit un pouvoir est nécessairement subordonné à celui qui le confère (3). L'autorité
>
> (1) Du Contrat social, L. I. ch. 1, et mille autres ou vrages pareils. On est forcé de soutenir cette proposition , du moment où l?on considère le peuple (l'agrégation des hommes tenus à s'acquitter de certains devoirs, à rendre de certains services ), comme une bourgeoïsie ou une corporation libre , dont tout pouvoir dérive.
>
> (2) I y a donc aussi , suivant ce système, des maîtres et des serviteurs dans le monde; seulement les nouveaux philosophes veulent mettre les uns à la place des autres.
>
> (3) Constituens est superior constituto. Grotius et Pufendorf s'élèvent déjà fortement contre l'application de cette règle , de crainte de passer pour révolutionnaires. Mais il n'y a pas moyen de la réfuter, dès qu'on part du principe de la délégation du pouvoir. Ils allèguent , il est vrai, pour exemple, le tuteur : il est, disent-ils , nommé dans l'intérét du pupille , qui cependant est au dessous de lui. Mais la comparaison est fausse ; le tuteur n'a point été nommé par le pupille, mais par les parens , ou par quelqu'autre autorité qui certainement est au dessus de lui.
>
>
> leur étant confiée par le peuple, ils n?en doivent faire usage que pour les intérêts du peuple et jamais pour les leurs propres. L'empire même le plus juste exercé par les princes , et sans aucun abus du pouvoir, n'est plus un droit, mais une fonction ou un devoir (1), non point, comme on le croyait jadis, envers le législateur divin, qui est. aussi leur maître, mais envers le peuple, auquel seul ils sont responsables de leur administration. La loi, c?est-à-dire, ce qui, joint aux devoirs naturels, doit servir à tous ou à plusieurs de règle obligatoire dans le lien social, n?est pas la volonté du seigneur ou du chef, mais la volonté générale, la volonté de tous les sujets. D'après les mêmes principes, les princes ne possèdent plus rien en propre (2). Tous leurs
>
>
> (1) Cest pour cela que les écrivains modernes parlent sans cesse des devoirs des princes et des droits des peuples, jamais ils ne disent autrement. Ce langage a été transporté même dans les rapports de famille; il n'y est à présent question que des devoirs des parens et des droits des enfans , comme si les parens n'avaient aucun droit propre, et qu'ils eussent été établis par les enfans.
>
> (2) On peut dire d?un souverain, « qu?il ne possède rien » (en propre ). Il ne peut avoir des domaines. » Kant. Elém. métaph. de jurisp. p. 183. Il ajoute immédiatement
>
>
> biens , tous leurs revenus viennent également du peuple, et demeurent essentiellement la propriété de la nation. Ce sont des contributions directes ou indirectes des membres de l'État, uniquement destinées à des intérêts nationaux, à des besoins communs, et non point aux dépenses particulières des princes. Ce dont ils ont besoin, eux et leurs familles, pour jouir d'une existence décente et honorable, ne doit être regardé que comme un traitement que le peuple leur accorde, en vertu de leur charge. Tous les fonctionnaires et les serviteurs que les princes emploient à l'instar des autres hommes, soit pour la sûreté ou le soulagement de leur personne, soit pour l'administration de leurs biens et de leurs revenus, soit pour la direction de diverses autres affaires, deviennent des fonctionnaires publics, des serviteurs de l'État , ou du peuple, et c'est à ce nouveau maître fictif qu'ils sont responsables de leur conduite. En un mot, tous les États ne sont plus que des républiques sous une autre forme, et la chose privée d?un prince devient une chose publique (1).
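One possible approach, sketched in Python rather than bash, is to split the text into paragraphs on blank lines and treat every paragraph that begins with a marker such as `(1) ` as a footnote, so the number of footnotes per page no longer needs to be enumerated. The function name `split_body_and_notes` and the shortened sample are made up for illustration, and the sketch assumes footnotes always start a paragraph and body paragraphs never start with a parenthesized number. A footnote that continues across a page break would still need manual handling.

```python
import re

# A footnote paragraph starts with "(1) ", "(2) ", ... at its very beginning.
FOOTNOTE_START = re.compile(r"^\(\d+\)\s")


def split_body_and_notes(text: str) -> tuple[str, list[str]]:
    """Join body paragraphs into one running text and collect the footnotes."""
    paragraphs = [" ".join(chunk.split())
                  for chunk in re.split(r"\n\s*\n", text) if chunk.strip()]
    body = [p for p in paragraphs if not FOOTNOTE_START.match(p)]
    notes = [p for p in paragraphs if FOOTNOTE_START.match(p)]
    return " ".join(body), notes


# Shortened, made-up excerpt in the shape of the example above.
sample = """tout entière (1). Les princes se trouvent
changés (2). L'autorité

(1) Du Contrat social, L. I. ch. 1.

(2) I y a donc aussi des maîtres.


leur étant confiée par le peuple."""

body, notes = split_body_and_notes(sample)
print(body)
print(notes)
```

In-text markers such as "(1)." are left alone because they never begin a paragraph, so the joined body keeps its references to the collected notes.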
Jean (1 rep)
Oct 1, 2022, 12:48 PM • Last activity: Oct 1, 2022, 04:59 PM
1 votes
0 answers
44 views
Can I transform colors of scanned pdf files and reduce the scan resolution to save memory keeping an existing text layer from OCR?
I have a pile of PDF files which were scanned long ago and which are already searchable (i.e. they went through OCR). However, the light level and contrast settings were not optimal. **Is it possible to reduce the bits per pixel of the existing files to some reasonably low level** in order to save storage space (apply color-curve transformations, posterize or even binarize to black and white, as in Gimp or other image-manipulation programs)**?** The files were scanned at 600 dpi and are already searchable, i.e. in addition to the scanned image there is a text layer. The scan resolution was probably chosen this high in order to obtain better OCR results, but it makes the files excessively large. I think a scan at 200 dpi would have given good visual quality with much lower storage requirements. I want to keep the OCR-generated text layer with its good OCR quality. **What are the proper commands?**
Adalbert Hanßen (303 rep)
Sep 14, 2022, 07:19 PM
0 votes
0 answers
96 views
NormCap OCR via Awesome Window Manager
One of the coolest programs I've come across recently is an Optical Character Recognition (OCR) program called NormCap. I have it tied to a hot key, and anytime I want to copy un-highlightable text to my clipboard, I'm a hot key away from grabbing that (formerly uncooperative) text and getting it into the clipboard (as text). Take that, you bad user interface! Take that, you text-image!

However, recently, I've installed Awesome. And ever since, this very same AppImage can no longer function. It either locks up Awesome completely, or sometimes it will attempt to function as normal, but you can't see the text you want to select during NormCap's attempt to allow you to do so. It's a little hard to explain, but I think it is because NormCap uses transparency during your selection steps, and you cannot see through the intended transparency via Awesome. I feel like Awesome is saying to me: "Take that, you newbie!"

There's probably some prerequisite startup application that I need to run prior to attempting a NormCap OCR capture. Please advise. When NormCap doesn't completely lock up Awesome, this [issue](https://github.com/dynobo/normcap/issues/154) is similar to what I see.

I used this command to see my compositor: `inxi -Gxx | grep compositor`

Output: `Display: server: X.Org 1.20.11 compositor: picom driver: loaded: modesetting`

Then, I installed Compton instead, and replaced that in my Awesome config. After a reboot, from within Awesome, it reported: `Display: server: X.Org 1.20.11 compositor: compton driver: loaded: modesetting`

However, this had no effect on the issue. Awesome just locks up each time I run NormCap and I have to hit ctrl+super+r to reload Awesome.
Lonnie Best (5415 rep)
Dec 25, 2021, 11:46 PM • Last activity: Jan 25, 2022, 07:24 PM
1 votes
0 answers
568 views
Using tesseract for character recognition, result is not as expected (much worse). How to get better results?
I wanted to add the output of a Linux boot to my question and decided to try optical character recognition, thinking that now, in 2022, there should surely be decent open source options (I have not tried OCR for a long time). Links found via Web search "praise" `tesseract`:

- https://www.linuxlinks.com/ocrtools/ (second best on the chart)
- https://askubuntu.com/questions/16268/whats-the-best-simplest-ocr-solution

> Tesseract is probably the most accurate open source OCR engine available.

I installed it from the distro via `apt-get` and ran it. The out-of-the-box result is, in my opinion, awful. Why? Can it perhaps be easily fixed? Or please advise another package that does the job. The page I tried to recognize contains no pictures, so as I see it this is a rather easy task. See the result below.

Edit: in fact, the results were much better when only a small part was processed, but when the whole image is processed the results are not OK. I understand that making the lines more horizontal and less skewed might help a lot; still, I was hoping software had become good at recognizing imperfectly aligned text.

[image]

oon usb 1-@: | “3792661 usb 1-8: New USB device found, idVendor=1343, idProduct: 7.983163] usb 1-8: New USB dev bs P luct=5662, bedDevice=16.6? re eh peeled haibbetaia a : new high-speed USB device number 5 PhS | i Per Samm SCR Can) t pela ee rcpt PP cay : 2.998668) usb 1-8: er t Ct

When only a small part is processed:

2.837811) usb 1-8: new high-speed USB device number 5 using xhei_hed 2.979266] usb 1-8: New USB device ECU CREME Cnt ttc cain Tt teen Td 7.983163] usb 1-8: New USB device strings: Mfr=1, Product=2, SerialNumbers@ ?.9869291 usb 1-8: Product: Integrated Camera

Added 1: I tried again with a smaller and less skewed picture. I guess the software treats the time stamps as a separate column; I have not seen options on the man page to tweak that:

[image]

f a eg | 7.849264] Device= 6.44 f 7 .6492961 | 7.849355] f 7.849415] [ 7.849492] | Van eos fl 7.861846] if Va ACB | 7.864776] if eel Be Ha Bs) bs 4 if be A be ge C ie BD LB ce B) te] Bs] rage lb eae 8.962076) ie Ke Lb 9.600567) 9.696957) 9 .6970371 YS SF SS Se usb 1-8: new high-speed USB device number 4 using xhci_hcd usb 1-8: New USB device found, idVendor=04f2, idProduct=b449, bed usb 1-8: New USB device strings: Mfr=3, Product=1, SerialNumber=2 usb 1-8: Product: Integrated Camera usb 1-8: Manufacturer: Chicony Electronics Co.,Ltd. usb 1-8: SerialNumber: 6x0001 usb-storage 1-1:1.6: USB Mass Storage device detected scsi host3: usb-storage 1-1:1.6 usbcore: registered new interface driver usb-storage usbcore: registered new interface driver uas scsi 3:0:6:@: Direct-fAccess General UDisk eg sd 3:0:0:0: Attached scsi generic sgi type @ eM Pee PM eA PA ed) te) ae Py Me ee dd Py ee ee eee dm sd 3:0:0:0: [sdb] Assuming drive cache: write through sdb: sdbi sdb2 sdb3 sd 3:0:0:0: [sdb] Attached SCSI removable disk squashfs: version 4.6 (2609/01/31) Phillip Lougher Copying live image to RAM... Ca ewe te Mae
Martian2020 (1443 rep)
Jan 10, 2022, 06:35 AM • Last activity: Jan 10, 2022, 07:13 AM