
How to OCR a Scanned PDF for Free — Without Uploading It

You scan a contract, save it as a PDF, and try to copy a paragraph — only to discover the entire document is a flat image. The text is there, visually, but you can't select it, search it, or copy it. That's where OCR (Optical Character Recognition) comes in.

This guide explains how to OCR a scanned PDF or image for free, what makes a scan work well or badly, and the privacy traps to watch out for with the popular online tools.

What OCR actually does

OCR is a technique that looks at an image of text and figures out the actual letters and words. The output is real text data you can:

  • Copy and paste anywhere
  • Search inside Word, Google Docs, or your file system
  • Feed into ChatGPT, Claude, or any other AI tool
  • Translate with DeepL or Google Translate
  • Index for full-text search across your archive

Until OCR runs over a scanned PDF, the document is essentially a photo. After OCR, it's a real text document.

Two kinds of OCR output

There are two useful outputs you'll see:

  1. Plain text (.txt) — Just the recognized words, no formatting. Great for feeding into other tools.
  2. Searchable PDF — The original page image is preserved exactly, but an invisible text layer is added behind it. The PDF looks identical to the scan, but you can now select, copy, and search the text. This is what you want if you need to keep the document looking the same while making it searchable.

A good OCR tool gives you both options.
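
To make the searchable-PDF idea concrete, here is a minimal TypeScript sketch of one piece of it: converting a recognized word's pixel box into PDF coordinates so the invisible text can sit over the same spot as the scanned word. It assumes the OCR engine reports boxes in image pixels with a top-left origin and that the scan DPI is known. `WordBox`, `PdfRect`, and `toPdfRect` are illustrative names, not part of any library.

```typescript
// Hypothetical shape for one recognized word, in image pixels (origin: top-left).
interface WordBox {
  text: string;
  x: number;      // left edge, px
  y: number;      // top edge, px
  width: number;  // px
  height: number; // px
}

// Rectangle in default PDF user space (origin: bottom-left, 72 units per inch).
interface PdfRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

const POINTS_PER_INCH = 72;

function toPdfRect(box: WordBox, imageHeightPx: number, dpi: number): PdfRect {
  const scale = POINTS_PER_INCH / dpi; // converts pixels to points
  return {
    x: box.x * scale,
    // Flip the vertical axis: PDF y grows upward from the bottom of the page.
    y: (imageHeightPx - (box.y + box.height)) * scale,
    width: box.width * scale,
    height: box.height * scale,
  };
}
```

At 300 DPI, a word that starts 300 px from the left edge of the image lands about one inch (72 points) from the left edge of the PDF page. A PDF writer would then draw the word's text inside that rectangle using an invisible rendering mode.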

The privacy trap with online OCR

Most "free" online OCR tools work by uploading your file to their server, running OCR there, and sending the result back to you. That's fine for a meme, but it's a problem for:

  • Tax returns and W-2s
  • Medical records
  • Contracts with confidential terms
  • ID documents (passport, driver's license)
  • Bank statements
  • Anything covered by NDA

You have no way to verify what they do with the file after processing. Some services may keep it indefinitely, use it to train models, or sell anonymized versions. Read any free tool's privacy policy carefully before uploading sensitive scans.

The browser-based alternative

Modern browsers can run OCR locally, with no upload required. The trick is Tesseract.js, a JavaScript port of the open-source Tesseract OCR engine that runs as WebAssembly. The first time you use it for a given language, your browser downloads a language model of roughly 10 MB. After that, OCR runs entirely on your machine, with no network connection needed.

That's the approach we use in the OCR PDF & Image tool. Drop a scanned PDF or image, pick the language, and recognized text appears in the browser. Nothing leaves your computer.

If your input is more often a phone photo, screenshot, or single image than a multi-page PDF, the Image to Text Converter runs the same engine with the same privacy model and accuracy, behind a UI tuned for photos and screenshots.

Step-by-step: OCR a scanned PDF in your browser

### 1. Open the OCR tool

Go to yourpdftools.com/ocr. It's a single page. No signup, no email, no card.

### 2. Upload your file

You can drop in:

  • A scanned PDF (single page or multi-page)
  • An image — PNG, JPG, WebP, or BMP
  • Files up to 50 MB

If your file is large or has many pages, the tool processes one page at a time so the UI stays responsive.
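
As a rough illustration of the input rules above, a pre-flight check might look like the TypeScript sketch below. This is not the tool's actual code. The name `checkUpload`, the treatment of "50 MB" as 50 × 1024 × 1024 bytes, and the acceptance of `.jpeg` alongside `.jpg` are all assumptions.

```typescript
// Assumed limit: "50 MB" interpreted as 50 * 1024 * 1024 bytes.
const MAX_BYTES = 50 * 1024 * 1024;
// Extensions from the list above; "jpeg" is added as an assumed alias of "jpg".
const ALLOWED_EXTENSIONS = ["pdf", "png", "jpg", "jpeg", "webp", "bmp"];

type UploadCheck = { ok: true } | { ok: false; reason: string };

function checkUpload(fileName: string, sizeBytes: number): UploadCheck {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  if (!ALLOWED_EXTENSIONS.includes(ext)) {
    return { ok: false, reason: `Unsupported file type: ${ext || "none"}` };
  }
  if (sizeBytes > MAX_BYTES) {
    return { ok: false, reason: "File is larger than 50 MB" };
  }
  return { ok: true };
}
```

Checking the extension is only a first filter. A stricter version would also inspect the file's leading bytes to confirm it really is a PDF or image.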

### 3. Pick the language

This step matters more than you'd expect. Tesseract supports many languages, but it reads a page reliably only when it is loaded with the matching language data. If you select English on a German document, you'll get garbled text. The tool ships with 14 common languages: English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Arabic, Hindi, Japanese, Korean, and Chinese (Simplified and Traditional).

For mixed-language documents, run OCR once per language and combine the results.
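
One way to combine the passes, sketched in TypeScript, is to keep each line from whichever language pass was most confident about it. This assumes every pass splits the page into the same lines in the same order, which will not always hold. `OcrLine` and `mergeByConfidence` are hypothetical names for illustration.

```typescript
// Hypothetical result shape: one entry per text line, with a 0-100 confidence.
interface OcrLine {
  text: string;
  confidence: number;
}

// For each line position, keep the reading from the most confident pass.
// Assumes all passes produced the same line segmentation in the same order.
function mergeByConfidence(passes: OcrLine[][]): OcrLine[] {
  if (passes.length === 0) return [];
  const lineCount = Math.max(...passes.map((p) => p.length));
  const merged: OcrLine[] = [];
  for (let i = 0; i < lineCount; i++) {
    let best: OcrLine | undefined;
    for (const pass of passes) {
      const candidate = pass[i];
      if (candidate && (!best || candidate.confidence > best.confidence)) {
        best = candidate;
      }
    }
    if (best) merged.push(best);
  }
  return merged;
}
```

If the line counts differ between passes, the safer fallback is to pick the best pass per page rather than per line.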

### 4. Run OCR

Click Run OCR. The first time you use a language, the browser downloads a ~10 MB language model. The model is cached, so later runs in the same language skip the download and start right away. Once the model is loaded, each page is rendered to a canvas and recognized.

You'll see live progress per page: rendering → recognizing.
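
The rendering step comes down to choosing how many pixels to give each page. Here is a rough TypeScript sketch, assuming page sizes arrive in PDF points (72 per inch) and that you pick the target DPI yourself. The resolution this tool renders at is not stated here, and `canvasSizeForPage` is a made-up helper name.

```typescript
const PDF_POINTS_PER_INCH = 72;

interface CanvasSize {
  scale: number;    // multiplier from PDF points to pixels
  widthPx: number;
  heightPx: number;
}

// Page dimensions are in PDF points; targetDpi is the resolution to rasterize at.
function canvasSizeForPage(widthPt: number, heightPt: number, targetDpi: number): CanvasSize {
  const scale = targetDpi / PDF_POINTS_PER_INCH;
  return {
    scale,
    widthPx: Math.round(widthPt * scale),
    heightPx: Math.round(heightPt * scale),
  };
}
```

A US Letter page is 612 × 792 points, so rendering it at 300 DPI needs a canvas of 2550 × 3300 pixels. Memory use grows with the square of the DPI, which is one reason to process pages one at a time.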

### 5. Use the output

When OCR finishes, you have three things you can do:

  • Copy the text to your clipboard with one click
  • Download .txt to save the plain text
  • Download searchable PDF — this is the magic option: a brand-new PDF that looks identical to your scan but with an invisible text layer added. Open it in any PDF reader (Preview, Adobe, Chrome) and you can select, search, and copy the text.

The OCR confidence score is shown so you know how trustworthy the recognition is. Above 90% is usually clean. Below 70% usually means the source scan is too low-quality for reliable OCR; try rescanning at a higher DPI.
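
If you post-process OCR output yourself, the same thresholds can be turned into a simple triage rule. The TypeScript sketch below is illustrative only. The middle "review" band for scores between 70 and 90 is an assumption, and `judgeConfidence` is a hypothetical name.

```typescript
type Verdict = "clean" | "review" | "rescan";

// Thresholds follow the guidance above; the middle "review" band is an assumption.
function judgeConfidence(confidence: number): Verdict {
  if (confidence > 90) return "clean";
  if (confidence < 70) return "rescan";
  return "review"; // 70 to 90 inclusive: proofread against the scan
}
```

For a multi-page document, applying this per page rather than to the overall average makes it easier to spot the one bad page in an otherwise clean scan.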

What makes OCR accurate (or not)

OCR quality is almost entirely determined by the source scan. The tool can only work with what you give it. Use these rules:

  • 300 DPI is the sweet spot. Many scanners default to 200 DPI, which is fine for archiving but borderline for OCR. Set the scanner to 300 DPI for best results.
  • Black text on a white background works best. Colored backgrounds, low-contrast text, and highlighter marks all reduce accuracy.
  • Straight, deskewed pages. A page tilted by even 5 degrees can confuse the line detector. Most scanners auto-deskew; if yours doesn't, run the page through a deskew filter first.
  • Standard fonts. Times New Roman, Arial, and Helvetica are recognized almost perfectly. OCR struggles with decorative and handwriting-style fonts.
  • Avoid two-column layouts when possible. The OCR engine reads top-to-bottom, so lines from adjacent columns can get mixed up unless the layout is very clean.
  • Resolution matters more than file format. A high-res JPEG with mild compression is better than a low-res PNG.
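
For phone photos and images of unknown origin, you can estimate whether the resolution is in the right range by dividing the pixel width by the physical width of the page. This is a small TypeScript sketch of that arithmetic. The function names are illustrative, and the 300 DPI default simply echoes the first rule above.

```typescript
// Effective DPI = pixels along one edge / physical length of that edge in inches.
function effectiveDpi(pixels: number, inches: number): number {
  if (inches <= 0) throw new RangeError("inches must be positive");
  return pixels / inches;
}

// True when the image has at least targetDpi pixels per inch across the page width.
function meetsOcrTarget(widthPx: number, pageWidthInches: number, targetDpi = 300): boolean {
  return effectiveDpi(widthPx, pageWidthInches) >= targetDpi;
}
```

A US Letter page is 8.5 inches wide, so an image 1700 px across works out to 200 DPI and falls short of the target, while 2550 px across reaches 300 DPI. The estimate only holds when the page fills the frame; a photo with wide margins around the paper has a lower effective DPI than the numbers suggest.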

Handwriting?

Tesseract is trained primarily on printed text. It will attempt handwriting, but accuracy drops to roughly 50–70% even when the writing is clean. Neat block printing is usable; for cursive, expect to retype large parts. Specialized handwriting OCR (like Google Cloud Vision) is significantly better but requires uploading to a server.

Common use cases

A few real-world scenarios this is useful for:

  • Receipts and invoices. Snap a photo, OCR it, paste numbers into your spreadsheet.
  • Old PDF archives. A folder of scanned documents from years ago becomes fully searchable in your file system once converted to searchable PDFs.
  • Books and articles. A scanned journal article you want to highlight or quote.
  • Notes from meetings. Photo of a whiteboard or notebook, converted to editable text.
  • ID and form data extraction. Scan a form, OCR it, and feed the text into a spreadsheet for data entry — without typing.
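
For the receipt case, getting numbers out of the recognized text can be as simple as a regular expression. The TypeScript sketch below assumes US-style amounts (comma as thousands separator, dot as decimal point, exactly two decimals). `extractAmounts` is an illustrative helper, not part of the tool.

```typescript
// Pull decimal amounts such as "12.50" or "1,299.00" out of OCR text.
// Assumes US-style formatting with exactly two decimal places.
function extractAmounts(ocrText: string): number[] {
  const pattern = /\d{1,3}(?:,\d{3})*\.\d{2}|\d+\.\d{2}/g;
  const matches = ocrText.match(pattern) ?? [];
  // Strip thousands separators before converting to a number.
  return matches.map((m) => Number(m.replace(/,/g, "")));
}
```

The pattern also matches things that merely look like amounts, such as the first part of a date written as 12.05.2024, so check the extracted values against the receipt before trusting a total.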

Combining OCR with other tools

OCR fits in with the rest of the toolkit, both before and after recognition:

  • Use Extract Pages to pull just the pages you need before OCR-ing — saves time on large PDFs
  • Use PDF to Text on the resulting searchable PDF to get the plain text out separately
  • Use Compress PDF to shrink the searchable PDF for emailing
  • Use Protect PDF to password-protect sensitive scans before sharing

Privacy summary

The browser-based OCR approach has one big advantage worth restating: your file never leaves your computer. There's no upload, no server processing, and no copy stored anywhere. You can OCR a scanned tax return, a medical record, or an employment contract without trusting anyone but your own browser. For anything sensitive, that's the only acceptable choice.

Ready to try it? Open the OCR tool and drop in a scanned PDF or photo. The first language model takes a few seconds to load; after that, runs start without the download.

Working with a single photo or screenshot instead? Use the Image to Text Converter: same engine and same privacy guarantee, with a UI built for images.