Search Scanned Documents (OCR)

PDF images

Project Center automatically attempts OCR scanning of images within PDF files (and only PDF files) and indexes any text that it finds. That text is then searchable and appears in the search results.

Be aware that Project Center scans only images that fill the entire page; the page cannot contain any other text. Scanning runs in the background as the server's resources become available.
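The page rule above can be pictured with a small sketch. This is not Project Center's actual detection logic, which is not documented here. The `PageSummary` type, the `is_ocr_candidate` function, and the 2% size tolerance are all hypothetical and exist only to illustrate the rule.

```python
from dataclasses import dataclass


@dataclass
class PageSummary:
    """Hypothetical description of one PDF page (not a Project Center type)."""
    page_width: float
    page_height: float
    image_width: float   # size of the largest image on the page
    image_height: float
    has_text: bool       # True if the page carries any text of its own


def is_ocr_candidate(page: PageSummary, tolerance: float = 0.02) -> bool:
    """Return True if an image fills the page and the page has no other text.

    The tolerance is an assumption; the real product may apply a different
    threshold or none at all.
    """
    if page.has_text:
        return False
    fills_width = page.image_width >= page.page_width * (1 - tolerance)
    fills_height = page.image_height >= page.page_height * (1 - tolerance)
    return fills_width and fills_height


# A full-page scan qualifies; a page that mixes a small image with text does not.
scanned = PageSummary(612, 792, 612, 792, has_text=False)
mixed = PageSummary(612, 792, 300, 200, has_text=True)
print(is_ocr_candidate(scanned), is_ocr_candidate(mixed))
```

Under this model, a page with a title block typed as text next to a scanned drawing would be skipped, which matches the restriction described above.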

The accuracy of OCR indexing is limited by the clarity of the image and of the text within it. If the quality of the image is poor, it will not scan accurately.

OCR scanning does not modify the PDF files in any way.

The Newforma Viewer Find feature does not support finding text discovered during the OCR process.

Search text in PDF files from AutoCAD .SHX files

Project Center cannot OCR .PDF files that contain annotations created from AutoCAD .SHX font files, because .SHX text is converted to vector graphics (lines) rather than stored as text. To work around this, do one of the following: replace the .SHX fonts in your AutoCAD file with TrueType fonts and then create your PDFs, or use the raster option when creating a .PDF from AutoCAD so that the entire file can be scanned and indexed.

Scanning paper documents and drawings

If your firm scans paper documents and drawings for archiving and you want to run OCR manually to make those scans searchable in Project Center, a variety of solutions use OCR (optical character recognition) technology to do this.

Most scanners include software that performs OCR processing during the scanning process, so the documents can be made searchable after they are scanned. The scanned documents are typically saved as “hybrid” .PDF files that contain the original scanned image overlaid on a hidden but selectable and searchable text layer.

The resulting .PDF files are searchable from Project Center, but search accuracy depends directly on the quality of the scanner, the OCR software, and the scanned document. With recent advancements in OCR software, accuracy is now more a function of the quality of the original paper document. OCR tends to produce great results with crisp scans of documents and drawings that originated from a word processing or CAD application, but it does not perform as well on tattered, faded, gray-scale, hand-faxed, or hand-written documents.

If you already have a large archive of scanned raster files, or your scanner’s software lacks effective OCR processing capabilities, there are a variety of software applications that provide OCR processing on existing raster files. Some are listed below.

Adobe Acrobat

Recent versions of Adobe Acrobat have powerful OCR capabilities built in. They allow you to convert any raster-based .PDF to a hybrid .PDF that is searchable. If you need to convert a collection of files, you can use Acrobat’s Batch Processing functionality to build sequences that will process a selection of supported file formats.

ScanSoft OmniPage

OmniPage combines scanning with OCR capabilities similar to those in Acrobat. It can also convert scanned documents to a variety of other formats, such as Microsoft Office documents or web pages.

ABBYY FineReader and Recognition Server

FineReader uses OCR to convert and edit a variety of paper documents and electronic files, including .PDF files and scanned pages, producing searchable .PDF, Microsoft Office, or other file formats. Recognition Server is a server-based solution for automating the recognition and .PDF conversion process in enterprise environments.
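Before running any of the tools above against a large archive, it can help to know how many scanned files of each type you have. The following is a minimal sketch using only the Python standard library. The `inventory_scans` function and the list of extensions are assumptions made for illustration, not part of Project Center or of any of the products listed.

```python
from collections import Counter
from pathlib import Path

# Assumed set of scan-related extensions; adjust it for your own archive.
SCAN_EXTENSIONS = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".png", ".bmp"}


def inventory_scans(archive_root: str) -> Counter:
    """Count files under archive_root by lowercased extension.

    Only extensions in SCAN_EXTENSIONS are counted. The function reads no
    file contents, so it cannot tell whether a .pdf already has a text layer.
    """
    counts = Counter()
    for path in Path(archive_root).rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in SCAN_EXTENSIONS:
            counts[suffix] += 1
    return counts
```

Calling `inventory_scans` with the path to your archive folder returns a count per extension, which gives a rough idea of how much batch OCR processing the archive would need.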
