PDF looks crap when OCR'd

Xavier · #1 10-23-2024, 10:49 PM

I've been working on a very large project and part of it involves converting some court documents into editable Word Documents (RTF files).

I've seen and read "Cleaning up Text Pasted from Websites, E-mails, PDFs etc"
https://www.msofficeforums.com/word/...-pdfs-etc.html but found that hasn't helped.

I've never had problems like this when OCR-ing PDFs. If someone can take a look at the attached files and point me to something that explains how I can automate at least part of the cleanup of the Word Docs, I would be really grateful.

I've attached the Word Docs as .docx files, but the ones I am working with are RTF files. Unfortunately, the forum's file uploader doesn't allow RTF files to be uploaded.

I've tried removing column and section breaks, and selecting everything and pressing Ctrl+Spacebar to clear the character formatting, which are tricks I've been taught in the past. But the result still looks like a dog's breakfast in my Word Doc.
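
On the automation question, here is a minimal sketch in Python of the kind of cleanup that can be scripted. It assumes the OCR output has first been saved from Word as plain text, so it only tidies the text and does not touch RTF formatting, column breaks or section breaks. The function name clean_ocr_text is made up for illustration, and the rules (rejoin words hyphenated across a line break, join hard line breaks inside a paragraph, collapse runs of spaces) are assumptions about what typically goes wrong, not something taken from the attached files.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy plain text exported from an OCR'd document (hypothetical helper)."""
    # Normalize Windows/Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words split by a hyphen at a line end: "docu-\nment" -> "document".
    # Caveat: this also joins genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)

    cleaned = []
    for para in paragraphs:
        # Join hard line breaks inside the paragraph and collapse whitespace.
        para = re.sub(r"\s+", " ", para).strip()
        if para:
            cleaned.append(para)

    # One blank line between paragraphs in the result.
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "The  court docu-\nment was\nfiled.\n\n\nSecond   para."
    print(clean_ocr_text(sample))
```

A script like this only removes mechanical clutter. Misread characters still have to be found by proofreading against the scan.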

macropod · #2 10-23-2024, 11:30 PM

Why are you OCRing the PDF when you can open it in Word? After all, the text is all there. You could even just copy/paste the text from the PDF into Word.

Xavier · #3 10-23-2024, 11:39 PM

Quote:

Originally Posted by macropod

Why are you OCRing the PDF when you can open it in Word? After all, the text is all there. You could even just copy/paste the text from the PDF into Word.

No. The text is visible, but most of it exists only as an image, not as actual text in the PDF.
The small snippet of the document I posted had some of it as text, but the entire document is more complicated.

You can't copy and paste all the text from the PDF. I wish I could.

If there was a way I could send you more I would, but the forum only allows small parts of the document to be uploaded.

macropod · #4 10-24-2024, 12:32 AM

The PDF you posted already has the text, though, suggesting it's already been subject to an OCR process and the result seems to be better than what you're getting.

Charles Kenyon · #5 10-24-2024, 08:23 AM

OCR is an imperfect process, although it has certainly improved over the years. It still REQUIRES human proofreading, especially for court documents.

It also very much depends on the quality of the original scan. The more stray marks, the more anomalies in the text.

There are different OCR engines out there. Try different ones. Word does not have an internal OCR engine.

Xavier · #6 10-24-2024, 03:33 PM

Quote:

Originally Posted by macropod

The PDF you posted already has the text, though, suggesting it's already been subject to an OCR process and the result seems to be better than what you're getting.

Yeah, unfortunately it isn't representative of the entire document. Much of the rest is stored only as images, and those don't OCR easily.

macropod · #7 10-24-2024, 05:00 PM

Do you have a link to the full document?