OCR Tools: Coping with PDF Files

Picture source: https://openclipart.org/detail/198124/mono_ocr-select-by-dannya

As a medical translator, I work with a lot of PDF files. I probably use my OCR tool up to 10 times per day, and I’m fairly certain that at this point I couldn’t work without it. However, it took some time before I figured out how to get the most out of it, and I’m certain that I haven’t even scratched the surface.

In case you are not familiar with OCR, it stands for “Optical Character Recognition” and is used to turn “dead” (non-editable) documents of all kinds, including pictures and PDFs, into editable Word documents that preserve the formatting of the original. This sometimes works better in theory than in practice: a bad fax, for example, can ruin the OCR tool’s ability to recreate the formatting properly.

Fixing the strange formatting produced by an OCR tool can be more difficult than recreating the formatting from scratch. That said, OCR still has plenty of uses. I paste the text from the OCR file as unformatted text into a clean new file, which I then format from scratch. I find this to be the easiest way to get around the strange formatting while still taking advantage of the benefits below.

Quality: When editing translations of PDF documents, I often find that translators have omitted text. Although this is an unacceptable translation error, it does happen. OCR helps ensure that all of the text gets translated, just as it would be when working from a Word file.

Computer-Assisted Translation tools: OCR enables you to use your favorite CAT tool with a dead PDF file. This helps speed up the translation process by taking advantage of the matches and repetitions that are generally inaccessible in PDF translations. You can also increase consistency by always ensuring that segments and terminology are translated the same way throughout a document.

Numbers, names and lists: Have you ever waded through pages and pages of a lab report? Ever painfully retyped tables full of numbers? An OCR tool will recreate all of those numbers for you. That means all you need to do is proofread them! Or, how about a list of names with phone numbers? Don’t type the whole list from scratch—OCR the list and proofread instead!

Tables: Although OCR tools can create strange formatting, they are great with simple tables and lines that they can read well. You may just need to correct the cell alignment and font.

Word counts: Most translators estimate how long a project will take based on the number of words in the document. With a PDF, the word count is usually estimated in a variety of ways, and the accuracy varies. I recently had a client ask me to translate a very technical medical document with 2,000 words in 24 hours. No problem, right? It looked a little longer than that to me, so I sent the file through my OCR tool, and it turned out that the file was 7,000 words. No, I’m not kidding. That would have been a long night.

Flat rates: Having an accurate word count also allows you to give clients a flat rate if you so choose and/or helps provide a more accurate quote up front so no one is surprised.

Just remember that OCR tools only give an estimate. If you use one to check the word count of a document, be sure to scroll through and make sure that all or most of the text was picked up. Anything the tool can’t read will be inserted as a picture, and maybe a picture is worth a thousand words, but not to a translator!
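
For anyone who likes to script the word-count check, here is a minimal sketch in Python. It assumes the OCR output has already been saved or pasted as plain text. The function names and the 400-words-per-hour rate are illustrative assumptions, not figures from any particular tool, and the count will differ slightly from what a word processor reports.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-separated tokens that contain at least one letter or digit.

    Stray punctuation left behind by OCR (for example a lone "-") is skipped.
    """
    return sum(1 for token in text.split() if re.search(r"\w", token))


def estimate_hours(word_count: int, words_per_hour: int = 400) -> float:
    """Rough turnaround estimate. The default rate is an assumed placeholder."""
    return word_count / words_per_hour


if __name__ == "__main__":
    # A line of the kind found in a lab report, as it might come out of OCR.
    sample = "Hemoglobin 13.5 g/dL (reference range: 12.0-15.5)"
    print(count_words(sample))

    # Compare the time needed for the quoted count and the OCR count.
    print(estimate_hours(2000))
    print(estimate_hours(7000))
```

Swap in your own hourly rate, and keep in mind that any text the OCR tool inserted as a picture will not be counted at all.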

How do you use your OCR tool?