Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

0 votes
0 answers
33 views
Relinking OCR data to downscaled images
I have a PDF consisting of scanned pages with OCR done by tesseract. I want to downscale the images (by around 4x) and retain the OCR. What would be an automatic way to relink the OCR data to the new coordinates on the pages?
---
Original question (*Sudden surge in the size of PDF generated by Ghostscript 9.55.0*):
I have a large PDF and am unsuccessfully trying to reduce its size. (The file is copyrighted material, so I cannot post it; I am doing this for personal use. It must be a common problem, though: I had the same experience earlier with other files from different sources.)
The file is a scan of a book with 1500 B&W A4 pages of text, no pictures at all. The individual pages were mogrify-ed into PNG images of equal height (around 1000px) and cleaned up via scantailor-advanced. Then each of the pages (now in TIFF) was tesseracted. The results were pdfunited into a 200MB file. This is way too large for this kind of book. I would like to be able to shrink it to around 30MB, perhaps 50. (The total text size extracted by pdftotext is 9MB.)
Most of the PDF compression methods I found on StackExchange and other sites boil down to gs with varying parameters. On my machine they all behave in a very similar way. I start gs in the terminal and switch to a GUI file manager. The size of the output file grows slowly and steadily from 0 to around 15MB (no matter the settings), and then it gallops in the last split second, as if gs gives up and just dumps the input into the output verbatim. (I attributed this to memory shortage, but the program also exhibits similar behavior on a relatively small, 100-page part of this file.)
If gs is not told to change the DPI (300), the output file becomes as large as the input was. If the DPI is changed to 72, the file becomes 70MB; this is still too much for such a loss in image quality. Is there an explanation for this surge? Should I perhaps use some other toolchain on the raw scans, or a different optimization tool?
pdfsizeopt is very slow and seems to lead to a 10% reduction. tiff2pdf -j 50 saves 5% (which will be re-added during OCR).
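One way to picture the "relinking" step, assuming the OCR layer is kept as hOCR text (tesseract can write it with its hocr output config) rather than only inside the PDF: every bbox is divided by the downscale factor. This is only a sketch under that assumption. The function name scale_bbox is hypothetical, only bbox fields are rewritten (baseline, x_size and similar properties are left alone), and the integer truncation can be off by a pixel.

```sh
#!/bin/sh
# scale_bbox FACTOR: read hOCR on stdin and divide the four integers of
# every "bbox x0 y0 x1 y1" field by FACTOR (hypothetical helper).
scale_bbox() {
    awk -v f="$1" '
    {
        out = ""
        rest = $0
        # Rewrite each bbox group in turn; copy everything else unchanged.
        while (match(rest, /bbox [0-9]+ [0-9]+ [0-9]+ [0-9]+/)) {
            split(substr(rest, RSTART + 5, RLENGTH - 5), c, " ")
            out = out substr(rest, 1, RSTART - 1)
            out = out sprintf("bbox %d %d %d %d", c[1] / f, c[2] / f, c[3] / f, c[4] / f)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}
```

A call such as scale_bbox 4 < page.hocr > page-small.hocr (both file names are placeholders) would produce coordinates for an image downscaled 4x; combining the result with the smaller image into a PDF is a separate step not shown here.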
Dilettante (101 rep)
Jul 25, 2025, 07:17 PM • Last activity: Jul 26, 2025, 07:19 PM
1 votes
0 answers
74 views
Pdfsandwich does not work
I use Arch Linux. When running pdfsandwich, I get the following error:
ERROR: Command "OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmpae6a1b/pdfsandwich76f2d2.tif>/dev/null 2>&1 /tmp/pdfsandwich_tmpae6a1b/pdfsandwich6ff0e9  -l eng pdf " failed. 
Terminating pdfsandwich. All temporary files are kept.
after receiving the error message
WARNING: The convert command is deprecated in IMv7, use "magick" instead of "convert" or "magick convert"
Does anyone know what the problem is? I do not know where to begin to try fixing it. EDIT: When running
tesseract /tmp/pdfsandwich_tmpae6a1b/pdfsandwich76f2d2.tif /tmp/pdfsandwich_tmpae6a1b/pdfsandwich6ff0e9  -l eng pdf
I get
Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
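The message above points at a missing or unreadable language file. A minimal sketch of that check, assuming the directory from the error message (/usr/share/tessdata) unless TESSDATA_PREFIX is set; the helper name has_traineddata is hypothetical:

```sh
#!/bin/sh
# has_traineddata DIR LANG: succeed if DIR/LANG.traineddata is a readable
# regular file (hypothetical helper, not part of tesseract).
has_traineddata() {
    [ -f "$1/$2.traineddata" ] && [ -r "$1/$2.traineddata" ]
}

# Directory taken from the error message; TESSDATA_PREFIX overrides it if set.
dir=${TESSDATA_PREFIX:-/usr/share/tessdata}
if has_traineddata "$dir" eng; then
    echo "eng.traineddata found in $dir"
else
    echo "eng.traineddata missing from $dir"
fi
```

If the file turns out to be missing, the language data most likely has to be installed separately from the tesseract program itself.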
fish_monster (111 rep)
Sep 18, 2024, 07:24 PM • Last activity: Sep 20, 2024, 05:46 PM
0 votes
0 answers
300 views
What happened to Tesseract's "Math / equation detection module"?
I was able to get Tesseract to run via a Python script on my Windows machine to turn non-searchable PDFs into searchable ones. When I downloaded Tesseract on Windows, the installer asked me which languages I wanted and I selected them; this was when I first learned about the math module. I am not sure how effective the math module was, but I could see that it was downloaded when I checked the languages.
Now I am trying to install Tesseract on Debian. To install Tesseract I used the command:
sudo apt install -y tesseract-ocr
Then, to ensure I had the math module, I would always follow that up with:
sudo apt install tesseract-ocr-equ
I am pretty sure that would install the math module. I remember using that command successfully several times, including earlier this morning. However, now, when I use that command, I get the following messages:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package tesseract-ocr-equ
Just to make sure I wasn't crazy, I looked up the language codes used by Tesseract, [according to Debian.org](https://manpages.debian.org/testing/tesseract-ocr/tesseract.1.en.html#LANGUAGES_AND_SCRIPTS:~:text=Math%20/%20equation%20detection%20module), and they say that "equ" belongs to the "Math / equation detection module"; admittedly, that is an earlier version.
So, I tried the following command:
sudo apt-get install -y tesseract-ocr-equ
Among the several lines of output that I got in response were the following:
Note, selecting 'tesseract-ocr-uzb-cyrl' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-ell' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-eng' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-enm' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-epo' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-est' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-eus' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-que' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-uig' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-ukr' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-urd' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-uzb' for regex 'tesseract-ocr-[equ]'
tesseract-ocr-eng is already the newest version (1:4.1.0-2).
tesseract-ocr-eng set to manually installed.
So, this made me wonder if there is a different math module for each language, and whether the math module is automatically downloaded with the language you download. I just really remember using the command initially without any problem. That being said, I have had several head injuries, so my memory is not entirely reliable. It's just that if I turn out to have been mistaken here and I have not been using that command as I remember, this will be one of those deeply troubling times, given how vividly I remember this working.
So, the primary question is: how do I download the "Math / equation detection module" for Tesseract onto the Linux Beta on my Chromebook? Secondarily, could someone tell me whether the functionality of the "sudo apt install tesseract-ocr-equ" command changed recently? This is frustrating me quite a bit.
I am hoping that someone just changed the functionality this morning and math modules are now built into the languages.
Curious Layman (101 rep)
May 16, 2024, 04:17 PM • Last activity: May 21, 2024, 09:06 AM
2 votes
0 answers
50 views
OCR high res images & combine OCR data later, after image compression?
I have a large number of .tif's coming out of ScanTailor. Is there a way that I might OCR those .tif's with tesseract, holding the OCR data separate from the images; then compress the images, and finally combine the OCR data with the compressed images? The point is that I don't want to compress before I OCR, and the tools for compressing the pdf's later, preserving the OCR, are not great.
Diagon (740 rep)
Jul 7, 2023, 10:50 PM
1 votes
1 answers
594 views
Best command-line OCR software for recognizing typed text over colorful background
I need to extract text from images like the one below:
(image: example image)
As you can see, the text is typed, not handwritten. Moreover, the background is colorful. I've tried Tesseract OCR, and while it works some of the time, it fails miserably on some inputs. For the above example, it produces "Due CoN aicomrBi em Cela RTL". Which command-line OCR software would you recommend? If Tesseract is my best bet, can I transform these images to make it easier for Tesseract to recognize the characters?
**EDIT**: Based on @MarcusMüller's suggestion, I used convert -threshold 55% to better separate the foreground text from the background. The resulting images are much better!
(image: binarized image)
Alas, Tesseract is still useless. On this new image, it produces: "Bim KM ioes Bm Meme e Cera". As such, the question remains open.
user549392
Nov 15, 2022, 07:35 PM • Last activity: Nov 15, 2022, 09:26 PM
0 votes
1 answers
161 views
Tesseract doesn't accept process substitution
I'm making a quick script that is supposed to use an OCR tool (tesseract) on an image in the clipboard to convert it to text and output it. It looks like this:
#!/bin/sh

temp="$(mktemp tmpXXX.png)"
xclip -selection clipboard -t image/png -o > "$temp"
tesseract "$temp" stdout 2>/dev/null
rm "$temp"
What I'm wondering is why this one-liner doesn't work:
tesseract <(xclip -selection clipboard -t image/png -o) stdout
From what I know, process substitution is supposed to make a temporary file (similar to my full script) that tesseract uses as its input file. Alas, this leads to an error:
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read
Error during processing.
Does anybody have an idea why this happens? Thanks in advance.
Fedja (125 rep)
Apr 4, 2022, 05:08 PM • Last activity: Apr 4, 2022, 06:10 PM
0 votes
1 answers
129 views
Scripting tesseract for file manager context menu
File manager context menu scripts sometimes do the job far quicker than using a GUI utility. So I've been using dozens of simple and more complex scripts for a long time in the file managers Dolphin, Nautilus and Nemo, although I have only elementary scripting skills. However, this time I'm stuck with a very simple loop to OCR selected image file(s) using **tesseract** in **Dolphin**, a loop which works in many other scripts:
for filename in "${@}"; do
    tesseract -l eng "$filename" "${filename%.*}"
done
This should normally be executed for a selected image (or each of the selected images) like the following command, which works in Terminal and gives me a text file named "image.txt":
tesseract -l eng "image.png" "image"
Any ideas, please?
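The second argument in that loop relies on the POSIX "${filename%.*}" expansion, which strips the shortest trailing ".something" from the name. A small sketch of what it yields; the function name output_base and the sample paths are made up for illustration:

```sh
#!/bin/sh
# output_base FILE: print FILE without its last extension, mirroring the
# "${filename%.*}" expansion used in the loop (hypothetical helper).
output_base() {
    printf '%s\n' "${1%.*}"
}

output_base "image.png"                  # prints: image
output_base "/home/user/scan.page1.tif"  # prints: /home/user/scan.page1
```

One caveat: for a name without an extension inside a directory whose name contains a dot, the expansion cuts at the directory's dot instead, so the output base may not be what the loop expects.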
Sadi (515 rep)
Feb 25, 2022, 02:20 PM • Last activity: Feb 27, 2022, 06:05 PM
1 votes
0 answers
568 views
Using tesseract for character recognition, result is not as expected (much worse). How to get better?
I wanted to add the output of a Linux boot to my question and decided to try optical character recognition, thinking that now, in 2022, there should surely be decent open source options (I have not tried OCR for a long time). Links found via Web search "praise" tesseract.
https://www.linuxlinks.com/ocrtools/ (second best on chart)
https://askubuntu.com/questions/16268/whats-the-best-simplest-ocr-solution
> Tesseract is probably the most accurate open source OCR engine available.
I've installed it from the distro via apt-get and run it. The out-of-the-box result is IMO awful. Why? Maybe it can be easily fixed? Or please advise another package that does the job. The page I've tried to recognize lacks pictures; as I see it, it is a rather easy task. See the result below.
Edit: in fact, the result when only a small part is processed was much better, but when the whole page is processed the results are not OK. I understand that making the lines more horizontal and not skewed might help a lot; still, I was hoping software had become good at recognizing imperfectly aligned text.
(image)
oon usb 1-@: | “3792661 usb 1-8: New USB device found, idVendor=1343, idProduct: 7.983163] usb 1-8: New USB dev bs P luct=5662, bedDevice=16.6? re eh peeled haibbetaia a : new high-speed USB device number 5 PhS | i Per Samm SCR Can) t pela ee rcpt PP cay : 2.998668) usb 1-8: er t Ct
When only a small part is processed:
2.837811) usb 1-8: new high-speed USB device number 5 using xhei_hed 2.979266] usb 1-8: New USB device ECU CREME Cnt ttc cain Tt teen Td 7.983163] usb 1-8: New USB device strings: Mfr=1, Product=2, SerialNumbers@ ?.9869291 usb 1-8: Product: Integrated Camera
Added 1: I tried again with a smaller and less skewed picture. I guess the software treats the time stamps as a separate column; I have not seen options on the man page to tweak that:
(image)
f a eg | 7.849264] Device= 6.44 f 7 .6492961 | 7.849355] f 7.849415] [ 7.849492] | Van eos fl 7.861846] if Va ACB | 7.864776] if eel Be Ha Bs) bs 4 if be A be ge C ie BD LB ce B) te] Bs] rage lb eae 8.962076) ie Ke Lb 9.600567) 9.696957) 9 .6970371 YS SF SS Se usb 1-8: new high-speed USB device number 4 using xhci_hcd usb 1-8: New USB device found, idVendor=04f2, idProduct=b449, bed usb 1-8: New USB device strings: Mfr=3, Product=1, SerialNumber=2 usb 1-8: Product: Integrated Camera usb 1-8: Manufacturer: Chicony Electronics Co.,Ltd. usb 1-8: SerialNumber: 6x0001 usb-storage 1-1:1.6: USB Mass Storage device detected scsi host3: usb-storage 1-1:1.6 usbcore: registered new interface driver usb-storage usbcore: registered new interface driver uas scsi 3:0:6:@: Direct-fAccess General UDisk eg sd 3:0:0:0: Attached scsi generic sgi type @ eM Pee PM eA PA ed) te) ae Py Me ee dd Py ee ee eee dm sd 3:0:0:0: [sdb] Assuming drive cache: write through sdb: sdbi sdb2 sdb3 sd 3:0:0:0: [sdb] Attached SCSI removable disk squashfs: version 4.6 (2609/01/31) Phillip Lougher Copying live image to RAM... Ca ewe te Mae
Martian2020 (1443 rep)
Jan 10, 2022, 06:35 AM • Last activity: Jan 10, 2022, 07:13 AM
2 votes
0 answers
105 views
Is there software to manually OCR / teach OCR for handwritten (non-English) texts?
I have a problem that Tesseract, ABBYY FineReader, etc. cannot solve: they can't recognize handwritten Russian, for example. So I am looking for:
1. OCR software for such things,
2. or a way to manually OCR my PDFs (create layers, draw rectangles, fill them with text by hand),
3. or maybe a way to train an OCR engine locally, so that it can automate the work after some manual effort.
PDD (21 rep)
Oct 15, 2021, 04:19 AM
0 votes
1 answers
232 views
How do you save the text in the terminal to various text formats?
I'm playing around a bit with OCR software, in particular I'm spending a bit of time with tesseract. I got it to where I can load an image and get tesseract to rip the text from the image, in Linux terminal. I'm now trying to figure out how I can automatically save that ripped text to pdf, odf, txt and word formats, from the terminal.
Neil Meyer (149 rep)
Mar 8, 2021, 09:21 AM • Last activity: Mar 9, 2021, 10:50 AM
10 votes
2 answers
13787 views
Tesseract: High CPU Usage and slow speed, only when running multiple processes in parallel
# Problem
pytesseract.image_to_string() takes too much time when I run the script through supervisord, but executes almost instantaneously when run directly in a shell (on the same server and simultaneously with the supervisor scripts). Apart from taking too much time, the processes also show high CPU usage.
Time taken by pytesseract.image_to_string() when run via Supervisord: ~30s
Time taken by pytesseract.image_to_string() when run via Bash: 0.1s
This problem only occurs if a lot of processes executing pytesseract.image_to_string() are being run via supervisord (around 22 instances). If I reduce the number of instances (to around 10), the scripts executed via supervisord also run smoothly.
### System Information
OS: Ubuntu 18.04.2 LTS (bionic)
Supervisord: Version 3.3.1
Tesseract: Version 4.0.0-beta.1
Python: Version 3.6
PyTesseract: Version 0.2.5
ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127357
max locked memory       (kbytes, -l) 16384
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127357
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Let me know if you need any more information.
## Edit 1 (or I know what's NOT the source of this problem)
I am fairly certain that it is not an issue with Supervisord. When I run one instance from an ssh shell, the function (pytesseract.image_to_string()) is executed smoothly (i.e. takes only 0.1s), while there are 10 instances being run via Supervisord. When I start another instance from a new ssh shell, both the instances (the ones started from ssh) run smoothly most of the time. When I start yet another instance from a new ssh shell, all three instances start choking, taking around 10s to execute the function. This time keeps on increasing as I add more instances via shell. So the problem can be replicated even with a shell.
### More Information
I ran the program with strace -T -f but I could not figure out what exactly is causing the spike in time.
For a function call that takes 1s
Top 10 system calls sorted by time taken
1.504530    [pid 29921]  [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 30166
0.503915    [pid 29932]  )      = 0 (Timeout)
0.503472    [pid 29932]  )      = 0 (Timeout)
0.500524    [pid 29933]  )      = 0 (Timeout)
0.500515    [pid 29933]  )      = 0 (Timeout)
0.500514    [pid 29932]  )      = 0 (Timeout)
0.500512    [pid 29933]  )      = 0 (Timeout)
0.069869    [pid 30169]  )       = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
0.035989    [pid 30167]  )       = 0
0.016002    [pid 30168]  )       = 0
For a function call that takes 9s
Top 10 system calls sorted by time taken
9.795787    [pid 29921]  [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 30106
0.515960    [pid 29933]  )      = 0 (Timeout)
0.511955    [pid 29933]  )      = 0 (Timeout)
0.507979    [pid 29932]  )      = 0 (Timeout)
0.507968    [pid 29932]  )      = 0 (Timeout)
0.505257    [pid 29932]  )      = 0 (Timeout)
0.503988    [pid 29932]  )      = 0 (Timeout)
0.503978    [pid 29932]  )      = 0 (Timeout)
0.503975    [pid 29932]  )      = 0 (Timeout)
0.503974    [pid 29932]  )      = 0 (Timeout)
Ashish (270 rep)
Jul 18, 2019, 08:29 AM • Last activity: Jan 28, 2021, 05:56 PM
0 votes
1 answers
1583 views
Install tesseract offline in RHEL
I have a RHEL-based server that does not connect to the internet. I need to install Tesseract >4.0 on this server. Therefore, my option was to download RPM packages on another machine, move them to the server and install them using the rpm command. I have used (https://build.opensuse.org/project/show/home:Alexander_Pozdnyakov ) from the official tesseract documentation to download the RPMs. The issue is that when I try to install those RPMs, they have a lot of other dependencies, which are very difficult to get one by one. Are there any other alternatives to install tesseract without connecting to the internet? Or any other source to download all RPMs at once?
Sathindu (101 rep)
Aug 19, 2020, 10:21 AM • Last activity: Aug 19, 2020, 11:40 AM
3 votes
0 answers
345 views
Debian Buster: Tesseract not supporting URL as argument
I'm trying to parse text from a hosted image, but it looks like I've misconfigured Tesseract. I'm using Debian Buster; tesseract-ocr, libtesseract-dev and a Ruby wrapper are installed.
#  $ tesseract -v
tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE
Inside a terminal, tesseract returns Error, cannot read input : No such file or directory. The same error message is raised using the Ruby gem. Did I miss something after installing the packages? The doc talks about manually placing the traineddata directory on Ubuntu; should it also be done on Debian?
> The traineddata is currently not shipped with the snap package and must be placed manually to ~/snap/tesseract/current.
I can get it working by using curl and a local path as argument, but it should support a URL as argument.
Thanks
**EDIT** I've tested both v4.1.1 and v5.0.0 by following these instructions and setting up the tessdata directory. They both explicitly report that they don't support URLs:
Tesseract Open Source OCR Engine v5.0.0-alpha-647-g4a00 with Leptonica
Error, this tesseract has no URL support
Error during processing.
I'm obviously missing something, because the release notes say it has supported URLs since 4.1.1.
Sumak (273 rep)
May 6, 2020, 05:05 PM • Last activity: May 6, 2020, 06:40 PM
1 votes
0 answers
51 views
script run via keyboard binding does not write to file
The following bash script interprets text in an image file and writes it to a .txt file.
#!/usr/bin/env bash
LD_LIBRARY_PATH="/usr/local/lib"
export LD_LIBRARY_PATH
/usr/local/bin/tesseract /home/martin/work/textpic.png /home/martin/work/tesseract-out
When I run it from the terminal, tesseract-out.txt is created, but when I run it via a custom keyboard shortcut nothing is written. I have ensured that the correct script is run by putting echo "test" > /home/martin/work/test.txt in it, which creates the file. I have run sudo chmod 777 on tesseract in case it was some permission issue. I have an inkling that tesseract needs some lib files which are not in the paths when the script is run by shortcut, so I put these lines at the top of my script file (I know that some of the lib files it needs are in /usr/local/lib):
LD_LIBRARY_PATH="/usr/local/lib"
export LD_LIBRARY_PATH
But it did not do the trick. How can I debug what is going wrong? If I could obtain some kind of error message somehow, that would go a long way.
My Linux version:
DISTRIB_RELEASE=18.3
DISTRIB_CODENAME=sylvia
DISTRIB_DESCRIPTION="Linux Mint 18.3 Sylvia"
NAME="Linux Mint"
VERSION="18.3 (Sylvia)"
ID=linuxmint
ID_LIKE=ubuntu
PRETTY_NAME="Linux Mint 18.3"
VERSION_ID="18.3"
HOME_URL="http://www.linuxmint.com/"
SUPPORT_URL="http://forums.linuxmint.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/linuxmint/"
VERSION_CODENAME=sylvia
UBUNTU_CODENAME=xenial
The keyboard-shortcuts manager I use is the standard GUI one in Mint 18. Maybe I could use something else to get a better error message?
(image: keyboard-shortcuts GUI)
EDIT: I verified that all needed libs are in path by putting /sbin/ldconfig -N -v $(sed 's/:/ /g' /home/martin/work/libs-in-path.txt at the bottom of my script and crosschecking the output against readelf -d /usr/local/bin/tesseract | grep NEEDED.
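A shortcut launcher usually discards whatever the script prints, so one way to get at the error message asked about above is to capture stdout, stderr and the exit status in a file. A minimal sketch; the function name run_logged and the log path used in the note below are hypothetical:

```sh
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]: run COMMAND, append its stdout and
# stderr plus its exit status to LOGFILE, and return that exit status.
run_logged() {
    log=$1
    shift
    "$@" >>"$log" 2>&1
    status=$?
    printf 'exit status: %s\n' "$status" >>"$log"
    return "$status"
}
```

In the script from the question, the tesseract line could then be wrapped as run_logged /home/martin/work/shortcut.log /usr/local/bin/tesseract /home/martin/work/textpic.png /home/martin/work/tesseract-out, and the log inspected after pressing the shortcut.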
MyrionSC2 (111 rep)
Dec 24, 2019, 09:41 AM • Last activity: Dec 26, 2019, 10:51 AM
0 votes
1 answers
296 views
Leptonica compilation error
I'm trying to install leptonica v1.78 on Ubuntu 16, but it's not working for some reason. After running ./configure and make, I keep getting this error:
make[2]: Entering directory '/home/user/Documents/leptonica/leptonica-1.78.0/prog'
  CC       convertfilestopdf.o
  CCLD     convertfilestopdf
../src/.libs/liblept.so: undefined reference to `lzham_z_version'
../src/.libs/liblept.so: undefined reference to `lzham_z_deflateInit'
../src/.libs/liblept.so: undefined reference to `lzham_z_inflate'
../src/.libs/liblept.so: undefined reference to `lzham_z_deflate'
../src/.libs/liblept.so: undefined reference to `lzham_z_deflateEnd'
../src/.libs/liblept.so: undefined reference to `lzham_z_inflateInit'
../src/.libs/liblept.so: undefined reference to `lzham_z_inflateEnd'
collect2: error: ld returned 1 exit status
Makefile:2603: recipe for target 'convertfilestopdf' failed
make[2]: *** [convertfilestopdf] Error 1
make[2]: Leaving directory '/home/user/Documents/leptonica/leptonica-1.78.0/prog'
Makefile:476: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/user/Documents/leptonica/leptonica-1.78.0'
Makefile:385: recipe for target 'all' failed
make: *** [all] Error 2
I think I installed all the dependencies needed, am I missing something?
Gyakenji (101 rep)
Jun 10, 2019, 07:37 AM • Last activity: Jun 11, 2019, 02:11 AM
2 votes
1 answers
700 views
Where can I get Tesseract binaries for Debian 6 64bit?
I used apt-get to install Tesseract but it's not really working. Maybe I could just download binaries somewhere, put them in a directory and use them that way? What's wrong with my Tesseract now:
tesseract --help
tesseract:Error:Usage:tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]
and
tesseract test.tif out2.txt -l pol
Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/pol.unicharset
I have downloaded and unpacked the Polish language data into the directory above, but the only pol.* file is pol.traineddata.
buikoto (21 rep)
Jan 23, 2015, 10:05 PM • Last activity: Mar 7, 2018, 09:47 PM
5 votes
1 answers
2029 views
tesseract: is it possible to change font output in OCRed pdf?
Following up on https://unix.stackexchange.com/questions/301318/how-to-ocr-a-pdf-file-and-get-the-text-stored-within-pdf/301319#301319 I have successfully produced OCRed PDF pages. In Evince, however, the letters are not shown; by this I mean that I cannot see the characters, but I can select them, copy them and paste them elsewhere successfully. This does not seem to be a bug in Evince: https://bugzilla.redhat.com/show_bug.cgi?id=1364201
When initiating an OCR of a PDF page with pdfsandwich, tesseract produces a page that
> contains a font which doesn't have any usable glyphs (they named it GlyphLessFont). It has only .notdef and .null replacements (the squares). Evince uses the .notdef glyph if there is no glyph for the character. The reason that Okular highlight the text is because it does it in the image not as a regular text as evince does.
pdftotext recognises the characters. Now, the question is: can tesseract be told to use a different font?
ingli (2029 rep)
Aug 27, 2016, 08:14 AM • Last activity: Mar 22, 2017, 09:39 PM
Showing page 1 of 17 total questions