How to Convert PDF to Text — Extract and Reuse PDF Content

PDFs look great but lock content away. You can read a PDF on screen, but the moment you want to copy a paragraph into an email, search across documents, or paste content into a writing tool, the rigid layout fights back. Converting a PDF to plain text breaks that lock and gives you raw, editable content.

When to extract text from a PDF

Common scenarios:

Reusing content — Quoting a long passage in a report, an email, or a presentation
Searching across documents — Plain text indexes faster and is searchable by tools that don't read PDFs
Feeding into AI tools — ChatGPT, Claude, and similar tools work better with clean text than raw PDFs
Translation workflows — Translation tools usually want plain text input
Accessibility — Screen readers and assistive tech handle text better than complex PDF layouts
Data extraction — Pulling structured information (names, dates, amounts) for spreadsheets
Plain-text archives — Long-term archives that survive future format changes

Two kinds of PDFs (this matters)

Not every PDF is equal when it comes to text extraction:

1. Text-based PDFs — Created by Word, Google Docs, LaTeX, web "Print to PDF", or any tool that produces real text content. The text is stored as actual characters and extraction is fast and accurate.

2. Image-based (scanned) PDFs — Created by scanners, "snap a photo of a doc" apps, or older fax tools. Each page is essentially a picture; there are no real characters to extract. You'll need OCR (Optical Character Recognition) to convert the images to text first.

A quick test: open the PDF and try to select a paragraph with your cursor. If the text highlights cleanly, it's text-based. If you get a blue rectangle around the whole page, it's image-based and needs OCR.
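The same check can be approximated in a script once some extraction tool has returned the text of each page. The sketch below is a rough heuristic rather than a standard test: the function name `looks_image_based` and both thresholds are assumptions chosen for illustration.

```python
def looks_image_based(page_texts, min_chars=20, min_text_ratio=0.5):
    """Guess whether a PDF needs OCR from the text extracted per page.

    page_texts: list of strings, one per page, from any extraction tool.
    min_chars and min_text_ratio are illustrative thresholds, not standards.
    """
    if not page_texts:
        return True  # nothing was extracted at all
    # Count pages that yielded a meaningful amount of text.
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars
    )
    # Flag the document when fewer than half the pages contain text.
    return pages_with_text / len(page_texts) < min_text_ratio
```

A document where most pages come back empty is probably scanned. A mixed result usually means some pages are scans embedded in an otherwise text-based file.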

This article focuses on text-based PDFs. For scanned PDFs, you'll want a dedicated OCR tool first.

Free methods to convert PDF to text

Method 1: Copy and paste

Open the PDF in any reader, select all (⌘/Ctrl + A), copy, paste into a text editor. Works for short documents but:

Page breaks usually disappear
Formatting such as columns and tables gets garbled
Headers and footers get inlined into the body
Hyphenated line-end words may stay split

Good for a paragraph or two; painful for an entire document.

Method 2: macOS Preview

Preview in current macOS versions has no plain-text export option; its Export dialog offers PDF and image formats only. Workaround: open the PDF in Preview, select the text you need (⌘ + A for everything), copy it, and paste it into a text editor. Or use a third-party tool.

Method 3: Adobe Acrobat (paid)

File → Export To → Text (Plain) — produces a `.txt` file. Free Acrobat Reader doesn't include this feature.

Method 4: Command line (pdftotext)

The Poppler suite includes `pdftotext`:

```
pdftotext input.pdf output.txt
```

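Add `-layout` to preserve the original physical layout (columns stay side by side), or `-raw` to keep text in the order it appears in the PDF's content stream. Output quality is excellent, and the tool is well suited to scripting batch jobs.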
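For batch jobs, `pdftotext` can be driven from a short script. The following is a minimal sketch, assuming Poppler is installed and `pdftotext` is on the PATH; the helper names `build_command` and `convert_folder` are hypothetical, and only the `-layout` and `-raw` flags mentioned above are used.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path, mode=None):
    """Return the pdftotext argument list for one file.

    mode is None (default extraction), "layout", or "raw".
    """
    if mode not in (None, "layout", "raw"):
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["pdftotext"]
    if mode:
        cmd.append("-" + mode)
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_folder(folder, mode=None):
    """Convert every .pdf in folder, writing a .txt file next to each one."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install Poppler first")
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        # check=True raises CalledProcessError if pdftotext fails on a file.
        subprocess.run(build_command(pdf_path, txt_path, mode), check=True)
        written.append(txt_path)
    return written
```

Calling `convert_folder("reports", mode="layout")` would then produce one `.txt` file per PDF in a folder named `reports` (the folder name is only an example).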

Method 5: Browser-based tools

The most accessible option for non-technical users. Our PDF to Text tool extracts the text from a text-based PDF and lets you preview the result, copy it to your clipboard, or download it as a `.txt` file. The PDF is processed entirely in your browser; nothing is uploaded.

Choosing how pages are joined

When a multi-page PDF becomes a text file, you have to decide what happens at page boundaries:

Double newline (recommended) — Adds a blank line between pages. Keeps reading flow but makes pages distinguishable.
Single newline — Smaller separation; treats the document as one continuous stream.
Form-feed character (`\f`) — The Unix-traditional page separator. Preserved by many text editors and useful when you'll process the file programmatically.
Custom separator — Insert your own marker like `--- Page Break ---` for visual clarity.

If you're going to feed the text into an AI or search index, a single or double newline is best. If you'll print it or process it as a structured document, page numbers plus a clear separator work better.
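The joining options above can be sketched in a few lines of Python. This is an illustration only: `join_pages`, `split_formfeed`, the mode names, and the `[Page N]` label format are all assumptions, and the input is a list of per-page strings from whatever extraction tool you use.

```python
SEPARATORS = {
    "double": "\n\n",   # blank line between pages
    "single": "\n",     # one continuous stream
    "formfeed": "\f",   # traditional page separator
}


def join_pages(pages, mode="double", custom=None, numbered=False):
    """Join per-page text into one string.

    custom, if given, overrides mode and is placed on its own line.
    numbered prefixes each page with a "[Page N]" line for citing back.
    """
    separator = f"\n{custom}\n" if custom is not None else SEPARATORS[mode]
    # Trim stray newlines so the separator alone controls the spacing.
    cleaned = [page.strip("\n") for page in pages]
    if numbered:
        cleaned = [
            f"[Page {n}]\n{text}" for n, text in enumerate(cleaned, start=1)
        ]
    return separator.join(cleaned)


def split_formfeed(text):
    """Split form-feed-separated text back into a list of pages."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty final chunk; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages
```

Keeping the form-feed version around as an intermediate file is a reasonable design choice, because it can be split back into pages and re-joined with any other separator later.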

Why text extraction sometimes looks weird

Even with a perfectly text-based PDF, output can have quirks:

Column ordering — A two-column document may have all of column 1 followed by all of column 2, or alternating lines, depending on how the PDF stores text positions
Reading order — Sidebars, captions, and footnotes may appear in unexpected places
Hyphenation — Words split at line ends with a `-` may stay split (`exam-ple` rather than `example`)
Ligatures — `fi`, `fl`, and `ffi` sometimes extract as single ligature characters (such as U+FB01 for `fi`) that may not render or match a plain-text search
Tables — Complex tables flatten into linear text, losing structure
Page headers/footers — Repeat on every page in the output unless you filter them
Special characters — Math symbols, accented characters, and CJK text may need a Unicode-aware viewer

These are limitations of how PDFs store text, not of the extraction tool. For pristine output, start from the source format (Word, Markdown, etc.) whenever it is available.

Filtering specific pages

If you only need text from certain pages, extract just those. Range syntax like `1-3, 5, 8-10` is supported by most modern tools. This is faster than extracting everything and trimming, especially for long documents where you only care about an abstract or conclusion.
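As a rough illustration of how such a range string can be interpreted, here is a small parser. The function name `parse_page_ranges` and its validation rules are assumptions; individual tools may accept slightly different syntax.

```python
def parse_page_ranges(spec, page_count):
    """Expand a spec like "1-3, 5, 8-10" into sorted 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        start_text, dash, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if dash else start
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

For a 12-page document, `parse_page_ranges("1-3, 5, 8-10", 12)` returns `[1, 2, 3, 5, 8, 9, 10]`. Duplicates collapse because pages are collected in a set.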

Tips for best results

Extract per chapter or section — Long documents are easier to use as several smaller text files
Include page numbers as inline headers if you need to cite back to the original
Strip headers and footers with a quick find-and-replace in your editor
Run a spell-check pass — Catches OCR errors and ligature artifacts
Save the original PDF too — Text extraction is one-way; you can't reconstruct the layout
Use Markdown for structure — If you're extracting to feed into an AI, lightly format with `#` headers and `-` bullets after extraction
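When find-and-replace gets tedious, repeated headers and footers can be stripped with a short script. This is a heuristic sketch under stated assumptions: the function name `strip_repeated_lines` and the 60% threshold are illustrative, the input is a list of per-page strings, and footers that change on every page (such as page numbers) are not caught.

```python
import math
from collections import Counter


def strip_repeated_lines(pages, min_share=0.6):
    """Drop lines that open or close most pages (likely headers/footers)."""

    def edge_lines(page):
        # First and last non-blank line of a page, as a set.
        lines = [ln.strip() for ln in page.splitlines() if ln.strip()]
        return {lines[0], lines[-1]} if lines else set()

    counts = Counter()
    for page in pages:
        counts.update(edge_lines(page))

    # A line must recur on at least two pages and on min_share of all pages.
    threshold = max(2, math.ceil(len(pages) * min_share))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        # Note: a matching line is removed wherever it appears on the page.
        kept = [ln for ln in page.splitlines() if ln.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned
```

Run it on a copy of the text first and check a few pages by eye, since a short body line that happens to match a header would be removed too.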

Common use cases

Quoting research papers — Pull the abstract and key paragraphs for a literature review
Building searchable archives — Convert a folder of PDFs to text and index with a desktop search tool
AI summarization — Feed the extracted text into an LLM for a summary or Q&A
Translation — Get text into a translator that doesn't accept PDFs
Spreadsheet imports — Pull tabular data from PDF reports into a CSV
Proofreading — Read your own PDFs in a focused, distraction-free text editor

Privacy considerations

Text extraction tools that run on a server have full access to every word in your PDF. For confidential documents — contracts, medical records, legal filings, internal reports — the safe choice is a client-side tool where the file is read and parsed entirely in your browser. Nothing is sent over the network, nothing is logged, and the extracted text never leaves your device.
