Open a scanned contract and press Ctrl+F. Nothing. The word you are looking for is right there on the page — you can see it — but the file has no idea it exists. That gap between visible and machine-readable is what separates an image-only PDF from a searchable one.
What Makes a PDF Searchable
A PDF is searchable when it contains a text layer: character data embedded alongside (or behind) the visual page that a viewer can map to Unicode. When you press Ctrl+F, the viewer queries that text layer, not the pixels you see. If the layer is absent, no search tool can find anything, however clearly the words appear on screen.
In the PDF specification (ISO 32000), each page is described by a content stream of drawing operators. Text-showing operators carry character codes tied to a font; image operators simply paint pixels. A purely image-based page contains only painted images. A text-based page, or an OCR’d scan, contains text operators as well.
Searchable vs. Image-Only PDFs: The Differences That Matter
| Feature | Searchable (text-layer) PDF | Image-only PDF |
|---|---|---|
| Text selection | Yes — click and drag to highlight | No — entire page selects as a block |
| Ctrl+F search | Works in any viewer | Returns zero results |
| Copy–paste | Pastes as editable text | Pastes as an image (if at all) |
| Screen reader access | Readable by assistive technology | Inaccessible without OCR post-processing |
| Indexing by search engines | Google can read and index the content | No text for the crawler to read; indexing depends on the search engine's own OCR, if any |
| File size (same content) | Smaller when born-digital, since text and vector data compress well; an OCR’d scan is the image plus a small text layer | Large, since every page is stored as a bitmap |
| Typical origin | Exported from Word, Google Docs, InDesign, LaTeX | Flatbed scanner, camera photograph of a document |
The quick test: open the file, press Ctrl+A (Select All). If text highlights character-by-character, the text layer is present. If the whole page highlights as a single rectangle — or nothing highlights — the file is image-only and will need OCR before it is searchable.
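The same check can be approximated from a script. The sketch below is a heuristic rather than a definitive test: it assumes that a page with a text layer must declare at least one font, so it looks for a `/Font` entry in the raw file and inside any Flate-compressed streams. The function name `declares_fonts` is made up for this example, and a scan that carries a font for some other reason (a stamped watermark, say) would be a false positive.

```python
import re
import zlib

# Raw bytes between "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# "/Font" as a whole name, so "/FontDescriptor" alone does not count.
FONT_RE = re.compile(rb"/Font\b")


def declares_fonts(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the file appears to declare any font resource."""
    if FONT_RE.search(pdf_bytes):
        return True
    # Newer PDFs may pack their dictionaries into compressed object streams,
    # so try to inflate each stream and search the result as well.
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not zlib data (for example a JPEG image)
        if FONT_RE.search(inflated):
            return True
    return False


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        found = declares_fonts(handle.read())
    print("probably has a text layer" if found else "probably image-only")
```

A result of "probably image-only" is a reason to run OCR, not proof; the Ctrl+A test in a viewer remains the more reliable check.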
How OCR Turns a Scan into a Searchable PDF
OCR — Optical Character Recognition — is the process that bridges image and text. The software analyses pixels, identifies patterns that correspond to characters, and outputs a Unicode string. That string is then embedded as a hidden text layer behind the original image, so the scan still looks like a scan but the text is now machine-readable.
When I first ran OCR on a batch of scanned insurance documents, the part that surprised me was how invisible the process is in the final file: the pages look identical before and after. The only difference is what happens when you press Ctrl+F.
What OCR Engines Actually Do (Step by Step)
- Pre-processing. The engine straightens skewed scans (deskew), removes background noise, and converts the image to a high-contrast grayscale representation. Scan quality has an outsize effect on accuracy here — 300 dpi is the accepted minimum for reliable OCR results.
- Page segmentation. The image is divided into regions: columns, paragraphs, headings, tables, images. Each region is processed separately so a two-column newspaper layout does not get scrambled into a single stream of text.
- Character recognition. Each segmented region is passed to trained recognition models. Modern engines such as Tesseract (version 4 and later) use LSTM (Long Short-Term Memory) neural networks that read a whole line of text at a time, which dramatically improved accuracy over the older approach of matching individual character shapes.
- Post-processing. A dictionary pass corrects common misreadings (rn mistaken for m, 0 for O). Confidence scores flag low-certainty characters for review.
- Text layer embedding. The recognised text is written into the PDF as an invisible overlay, aligned precisely to the character positions visible on the image layer.
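To make the post-processing step concrete, here is a minimal sketch of a dictionary pass. It is an illustration under stated assumptions, not how Tesseract or any particular engine implements it: the `CONFUSIONS` table holds only the two misreadings mentioned above, and `correct_word` and the vocabulary are hypothetical names for this example.

```python
# Misread sequence -> what was probably printed (the two examples from the text).
CONFUSIONS = {"rn": "m", "0": "O"}


def candidates(word):
    """Yield variants of `word` with exactly one confusable sequence swapped."""
    for misread, intended in CONFUSIONS.items():
        start = word.find(misread)
        while start != -1:
            yield word[:start] + intended + word[start + len(misread):]
            start = word.find(misread, start + 1)


def correct_word(word, vocabulary):
    """Return a dictionary word reachable by one swap, else the word unchanged."""
    if word in vocabulary:
        return word  # already valid: never "correct" a real word
    for candidate in candidates(word):
        if candidate in vocabulary:
            return candidate
    return word  # no confident fix; leave it for manual review


if __name__ == "__main__":
    vocabulary = {"modern", "Ohio", "corner"}
    for raw in ["rnodern", "0hio", "corner"]:
        print(raw, "->", correct_word(raw, vocabulary))
```

Checking the vocabulary before attempting any swap matters: "corner" contains "rn" but is a real word, so it must pass through untouched.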
OCR Accuracy: What Affects It
Accuracy varies more than most people expect. A clean, 300 dpi scan of a printed English document will typically give character error rates below 1% with Tesseract. A faded photocopy of a handwritten form will be far worse — sometimes unusable. The main variables:
- Scan resolution. 300 dpi is the practical minimum. 400–600 dpi improves accuracy on small fonts and fine print.
- Image contrast. Low-contrast grey-on-grey text confuses recognition models. A good scanner lid (or blank white paper behind a loose page) makes a measurable difference.
- Font type. Printed serif and sans-serif fonts perform well. Handwriting, decorative scripts, and dot-matrix printer output are significantly harder.
- Language model. Tesseract supports over 100 languages but requires the correct trained data file. Running English OCR on a French document lowers accuracy noticeably.
- Page cleanliness. Coffee stains, bleed-through from the reverse side, and torn edges all introduce noise the pre-processing step has to strip out.
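If you want to measure accuracy on your own scans rather than trust a typical figure, character error rate is usually computed as the edit distance between the OCR output and a hand-checked transcription, divided by the length of the transcription. The sketch below assumes that definition; the function names are chosen for this example.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest single-character edits between two strings."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # reference character missing from output
                current[j - 1] + 1,      # extra character in output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edits needed to turn the OCR output into the reference, per reference character."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # "m" misread as "rn": 2 edits over 6 reference characters, about 0.33.
    print(character_error_rate("modern", "rnodern"))
```

At a 1% error rate, a page of roughly 3,000 characters would still contain about 30 wrong characters, which is worth keeping in mind when searching for exact names or numbers.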
Common OCR Tools
| Tool | Type | OCR engine | Handles image-only PDFs | Cost |
|---|---|---|---|---|
| ocrmypdf | Command-line | Tesseract (configurable) | Yes | Free (open source) |
| Adobe Acrobat Pro | Desktop GUI | Proprietary | Yes, via the “Recognize Text” tool | Subscription (see Adobe pricing) |
| Google Drive | Cloud | Proprietary (Google) | Partly: Open with Google Docs extracts the text into a Doc rather than adding a text layer to the PDF | Free (storage limits apply) |
| Tesseract CLI | Command-line | Tesseract | Requires pdftoppm pre-step | Free (open source) |
| ABBYY FineReader | Desktop GUI | Proprietary | Yes | One-time purchase or subscription |
ocrmypdf is worth singling out: it wraps Tesseract into a purpose-built PDF pipeline that handles deskew, page segmentation, and text-layer embedding in one command. It is the tool I reach for whenever I need to batch-process a folder of scanned documents.
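For contrast, the manual route listed for the Tesseract CLI looks roughly like the sketch below: rasterise the PDF with pdftoppm, then run Tesseract on each page image with its `pdf` output option. The function names and the `page` file prefix are assumptions for this example, and the result is one searchable PDF per page that you would still need to merge yourself.

```python
import subprocess
from pathlib import Path


def rasterise_command(pdf: Path, image_root: Path, dpi: int = 300) -> list:
    """pdftoppm writes image_root-1.png, image_root-2.png, ... (numbers may be zero-padded)."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(image_root)]


def ocr_command(image: Path, language: str = "eng") -> list:
    """Tesseract takes an output base name and appends .pdf itself."""
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", language, "pdf"]


def ocr_pdf_manually(pdf: Path, workdir: Path) -> list:
    """Run both steps; returns the per-page searchable PDFs in page order."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterise_command(pdf, workdir / "page"), check=True)
    outputs = []
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(ocr_command(image), check=True)
        outputs.append(image.with_suffix(".pdf"))
    return outputs
```

Everything this sketch leaves out (deskew, merging pages, skipping pages that already have text) is what ocrmypdf adds on top.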
ocrmypdf: Practical Example
Install on Ubuntu/Debian:
sudo apt install ocrmypdf
Add a text layer to a single scanned PDF:
# Convert scan.pdf to a searchable PDF
ocrmypdf scan.pdf searchable-output.pdf
# With deskew and language specified
ocrmypdf --deskew --language eng scan.pdf searchable-output.pdf
# Skip pages that already have a text layer (safe for mixed archives)
ocrmypdf --skip-text scan.pdf searchable-output.pdf
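The --skip-text flag is particularly useful when processing an archive where some files or pages are already searchable and others are not: pages that already have a text layer are passed through untouched, and only the remaining pages are OCR’d. Without it, ocrmypdf by default refuses to process a file that already contains text. Full option details are in the ocrmypdf documentation.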
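For a whole folder of scans, the same commands can be driven from a short script. This is a minimal sketch that only uses the flags shown above; the function names and folder layout are assumptions for the example, and it calls the ocrmypdf command-line tool rather than any Python API.

```python
import subprocess
from pathlib import Path


def build_commands(pdfs, output_dir, language="eng"):
    """Return one ocrmypdf command (as an argument list) per input PDF."""
    return [
        ["ocrmypdf", "--skip-text", "--deskew", "--language", language,
         str(pdf), str(Path(output_dir) / Path(pdf).name)]
        for pdf in pdfs
    ]


def run_batch(source_dir, output_dir, language="eng"):
    """OCR every *.pdf in source_dir; return the inputs that failed."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(Path(source_dir).glob("*.pdf"))
    failed = []
    for pdf, command in zip(pdfs, build_commands(pdfs, output_dir, language)):
        # A non-zero exit code means ocrmypdf could not produce the output.
        if subprocess.run(command).returncode != 0:
            failed.append(pdf)
    return failed


if __name__ == "__main__":
    for pdf in run_batch("scans", "searchable"):
        print("failed:", pdf)
```

Writing the results to a separate folder keeps the original scans untouched, so a failed or poor OCR run can simply be repeated.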
PDF/A: Searchability Plus Long-Term Preservation
PDF/A is an ISO-standardised subset of PDF (ISO 19005) designed for long-term archiving. It adds constraints intended to make the document render the same way decades from now: all fonts must be embedded, colour spaces must be device-independent, encryption is prohibited, and metadata must follow specific schemas.
PDF/A-1 (the original 2005 standard, based on PDF 1.4) and PDF/A-2 (2011, based on PDF 1.7) preserve any text layer that is present, but neither requires one: conformance level b only guarantees that the pages reproduce visually. In practice, conversion tools such as ocrmypdf add the text layer via OCR when the source is scanned. A scanned document that is OCR’d and saved as PDF/A-2b, a common archival target, is both searchable and stored in a format that courts and government archives commonly accept as a reliable record.
ocrmypdf can output directly to PDF/A-2b:
ocrmypdf --output-type pdfa-2 scan.pdf archive-output.pdf
If you are building a document archive that needs to stand up in legal or regulatory contexts, PDF/A is worth the small extra step.
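One quick sanity check on the output is to read back the PDF/A level the file claims in its XMP metadata. The sketch below assumes the XMP packet is stored uncompressed and in an 8-bit encoding, which this code does not verify, and it only reports the claim: confirming real conformance needs a dedicated PDF/A validator. The function name is hypothetical.

```python
import re
from typing import Optional

# XMP may store the values as elements (<pdfaid:part>2</pdfaid:part>)
# or as attributes (pdfaid:part="2"); accept both forms.
PART_RE = re.compile(rb"pdfaid:part(?:=[\"']|>)\s*(\d)")
CONFORMANCE_RE = re.compile(rb"pdfaid:conformance(?:=[\"']|>)\s*([A-Za-z])")


def claimed_pdfa_level(pdf_bytes: bytes) -> Optional[str]:
    """Return e.g. 'PDF/A-2b' if the file declares a PDF/A level, else None."""
    part = PART_RE.search(pdf_bytes)
    if part is None:
        return None
    conformance = CONFORMANCE_RE.search(pdf_bytes)
    level = conformance.group(1).decode().lower() if conformance else ""
    return f"PDF/A-{part.group(1).decode()}{level}"


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(claimed_pdfa_level(handle.read()) or "no PDF/A claim found")
```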
When You Actually Need a Searchable PDF
Not every document needs OCR. Here is a practical breakdown of when it matters and when it does not:
- Scanned contracts or legal documents. If you will ever need to find a clause, a date, or a party name quickly, OCR is not optional. A folder of unsearchable scans is a liability.
- Archiving physical records. Any paper document you are digitising for long-term storage should be OCR’d at the time of scanning. Doing it later is more work and introduces a second-generation quality loss if you re-scan.
- PDFs you intend to publish on a website. Google can read text-layer PDFs and index their content. An image-only PDF gives a crawler no text to read, so whether its content is indexed at all depends on the search engine running its own OCR, which you should not rely on.
- Accessibility requirements. Screen readers cannot read image-only PDFs. If your documents are intended for a public audience or must comply with accessibility standards (WCAG, Section 508, EN 301 549), they need a text layer as a minimum. Full compliance usually also requires a tagged document structure, which OCR alone does not provide.
- When you do not need it. A photograph saved as a PDF, a scanned drawing, a cover image — none of these have text to extract. Running OCR on them produces garbage output and wastes processing time.
For document management specifically — searching across many files at once — a searchable PDF is only the first step. The indexing layer on top is what makes large collections actually useful. That process is covered in detail in How to Build a Searchable PDF Archive.
Searchable PDFs and HTML Files: A Parallel
There is a structural parallel worth noting if you spend time with web files. An HTML file stores its content as plain readable text — you can open it in a text editor and search its source directly. A PDF with a text layer works on the same principle: the content exists as characters, not just pixels. Understanding how one format stores its text helps you reason about the other. If you are working with HTML files alongside PDFs, the guide to opening HTML files covers the equivalent tooling decisions across platforms.
Frequently Asked Questions
How do I tell if a PDF is searchable without opening it?
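Strictly speaking, some tool has to read the file, but it does not have to be a viewer. A command-line text extractor such as pdftotext (part of Poppler, the same suite that provides pdftoppm) prints whatever text layer it finds; empty or whitespace-only output means the document is almost certainly image-only. In a viewer, press Ctrl+F, type a word you know appears on the first page, and check whether it is highlighted. Alternatively, press Ctrl+A to select all: if the text highlights character-by-character, the text layer exists. If the whole page selects as a single object, or nothing selects, the document is image-only.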
Does making a PDF searchable change how it looks?
No. The OCR process adds an invisible text layer behind the existing image. The visual appearance of the page, the scanned image itself, is untouched unless you explicitly ask the tool to deskew or clean it. The only other difference is a very slight increase in file size from the embedded text data, which for most documents is negligible compared to the image data.
Can I make a PDF searchable for free?
Yes. ocrmypdf and Tesseract are both open-source and free. Google Drive also offers free OCR through “Open with Google Docs” for individual files, though it requires a Google account, stores the file in the cloud, and produces an editable Doc containing the extracted text rather than a PDF with a text layer.
What is the difference between a searchable PDF and PDF/A?
A searchable PDF has a text layer — that is all. PDF/A is an archival standard (ISO 19005) that adds additional constraints: embedded fonts, no encryption, no external dependencies, specific metadata schemas. PDF/A files made from digital sources or via OCR carry a searchable text layer, but a searchable PDF is not necessarily PDF/A — it may lack the archival constraints. Use PDF/A when documents must be preserved reliably over long periods or accepted by legal and government systems.
Does OCR work on handwritten documents?
Standard OCR engines perform poorly on handwriting. Tesseract and similar tools are trained primarily on printed fonts. Handwriting recognition (ICR — Intelligent Character Recognition) is a separate specialisation with significantly lower accuracy for unconstrained handwriting. For historical handwritten documents, dedicated ICR tools or manual transcription remain the practical options.
Will a searchable PDF be indexed by Google?
Google can crawl and index the text layer of a PDF hosted publicly on the web, provided the file is accessible (no password, not blocked by robots.txt). The content can then appear in search results much as a web page would. An image-only PDF, by contrast, gives the crawler no text to read; anything indexed from it would have to come from the search engine's own OCR, which you should not count on.
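If you are unsure whether robots.txt is blocking a PDF, Python's standard library can evaluate the rules for a given URL. This is a rough sketch: the URLs and rules are placeholders, `crawler_may_fetch` is a name made up for the example, and the standard-library parser follows the basic robots.txt rules, so it may not match every extension a particular search engine supports.

```python
from urllib.robotparser import RobotFileParser


def crawler_may_fetch(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Apply the given robots.txt text to `url` for the named crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder rules: everything under /private/ is off limits to all crawlers.
    rules = "User-agent: *\nDisallow: /private/\n"
    for url in ("https://example.com/docs/report.pdf",
                "https://example.com/private/scan.pdf"):
        print(url, "->", "allowed" if crawler_may_fetch(rules, url) else "blocked")
```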



