|
Author |
Message |
stinkwheel |
|
stinkwheel Bovine Proctologist
Joined: 12 Jul 2004
|
Posted: 17:45 - 17 Jan 2023 Post subject: Text recognition software? |
|
|
I have a scanned workshop manual for a Belarusian bike that I'd like to translate. It's got detailed information on crank rebuilds and suchlike.
It's in Russian, in Cyrillic script. Google Translate can't cope with the scans.
I had some success running a couple of pages through an online OCR site and then running the result through Google Translate, but it's very time-consuming and they only do so many pages free. I'm looking at about 90 pages.
Ideally I want to split it into individual pages, paste the diagrams into a publishing suite, then run the text parts through a local text recognition app, translate the output and paste that in too, so I end up with a proper English workshop manual.
Any recommendations for decent text recognition software that will handle Cyrillic? Searching suggests Tesseract is probably a good option.
Ideally it would run on Linux (Debian). I have a Windows PC, but it's a small notebook, so it will probably take ages to churn through graphics-related stuff. It also has no mouse.
____________________
“Rule one: Always stick around for one more drink. That's when things happen. That's when you find out everything you want to know.”
I did the 2010 Round Britain Rally on my 350 Bullet. 89 landmarks, 3 months, 9,500 miles. |
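For anyone wanting to try the local route described above, here is a rough Python sketch of the "one page image in, one text file out" step using the Tesseract command line with its Russian language data. It is only a sketch under assumptions, not something from the thread: the `pages` and `text` directory names and both function names are made up for illustration, it assumes the scan has already been split into one PNG per page (on Debian, `pdftoppm -r 300 -png` from poppler-utils is one way to do that), and it assumes the `tesseract-ocr` and `tesseract-ocr-rus` packages are installed.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image: Path, out_dir: Path, lang: str = "rus", psm: int = 6) -> list[str]:
    """Build the command line that OCRs one page image into out_dir/<stem>.txt."""
    # Tesseract adds the ".txt" extension to the output base name itself.
    out_base = out_dir / image.stem
    # -l selects the language data; --psm 6 treats the image as one uniform block of text.
    return ["tesseract", str(image), str(out_base), "-l", lang, "--psm", str(psm)]


def ocr_pages(page_dir: Path, out_dir: Path) -> list[Path]:
    """Run Tesseract over every PNG in page_dir (hypothetical layout), in name order."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(page_dir.glob("*.png")):
        subprocess.run(tesseract_command(image, out_dir), check=True)
        written.append(out_dir / f"{image.stem}.txt")
    return written
```

Calling `ocr_pages(Path("pages"), Path("text"))` would leave one `.txt` file per page, ready to paste into a translator. The page segmentation mode is a guess; pages that mix diagrams and text may do better cropped to the text blocks first.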
|
|
|
WD Forte |
|
WD Forte World Chat Champion
Joined: 17 Jun 2010
|
|
|
|
stinkwheel |
|
stinkwheel Bovine Proctologist
Joined: 12 Jul 2004
|
Posted: 14:15 - 23 Jan 2023 Post subject: |
|
|
Google Translate actually does a fairly good job of the translation itself once you have the text. There's the occasional "When passing a sporting horse, do not loudly tootle your horn parper." moment but on the whole, you can tell what it means.
It's the text recognition bit it couldn't cope with. I managed to get Tesseract up and running but it really struggled. Even when I broke the pages down into blocks of text and ran them through some image manipulation to clean them up, I was still getting more gaps than text.
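The post doesn't say which image clean-up was tried. One common step before OCR is binarisation, turning a grey scan into pure black ink on white paper, and a rough sketch of Otsu's method for picking the cut-off is below. This is an illustration only, not what was actually run here: the function names are made up, and the pixels are assumed to be a flat list of 0-255 grey levels. In practice an image library would do this far faster.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the grey level (0-255) that best separates dark ink from light paper."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of their grey levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # nothing on the dark side yet
        w_light = total - w_dark
        if w_light == 0:
            break  # everything is on the dark side; no further split possible
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Otsu's criterion: maximise the between-class variance.
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarise(pixels: list[int], threshold: int) -> list[int]:
    """Map each pixel to black (0) or white (255) around the threshold."""
    return [0 if p <= threshold else 255 for p in pixels]
```

A global threshold like this assumes fairly even lighting across the page. Smeared or faded print, as described here, may still defeat it.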
It looks like the online one is going to be the way forward. I may have to just pay for some usage and rattle it all through. I'll get it all broken down into sections ahead of time and laid out onto pages, because they charge per page.
Convertio has been the best tool so far. It seems remarkably good at intelligently filling in any blank or smeared text. |
|
|
|
stinkwheel |
|
stinkwheel Bovine Proctologist
Joined: 12 Jul 2004
|
|
|
|
hellkat |
|
hellkat Super Spammer
Joined: 12 Jul 2004
|
Posted: 14:50 - 05 Feb 2023 Post subject: |
|
|
I love that we can always rely on you to have some interesting off-the-wall shit going on.
Can you post a link to the online OCR page, please?
____________________
Not nearly as interesting in real life. |
|
|
|
stinkwheel |
|
stinkwheel Bovine Proctologist
Joined: 12 Jul 2004
|
|
|
|
Nobby the Bastard |
|
Nobby the Bastard Harley Gaydar
Joined: 16 Aug 2013
|
|
|
|
AlexArcvarden |
This post has a low rating (Spam).
|
AlexArcvarden Derestricted Danger
Joined: 03 Mar 2023
|
Posted: 10:32 - 03 Mar 2023 Post subject: |
|
|
It sounds like you're on the right track with OCR software, but personally I don't find Tesseract the most convenient tool.
Since you're looking to run the software on Linux, I'd also recommend checking out Smart Engines. It's an OCR engine suited to the task you described. It has some additional features that may be helpful for your project, such as the ability to automatically segment pages into regions of text and images. I hope that helps. |
|
|
|
|
|
|