How to Convert PDF to Text — Extract and Reuse PDF Content
PDFs look great but lock content away. You can read it on screen, but the moment you want to copy a paragraph into an email, search across documents, or paste content into a writing tool, the rigid layout fights back. Converting a PDF to plain text breaks that lock and gives you raw, editable content.
When to extract text from a PDF
Common scenarios:
- Reusing content — Quoting a long passage in a report, an email, or a presentation
- Searching across documents — Plain text indexes faster and is searchable by tools that don't read PDFs
- Feeding into AI tools — ChatGPT, Claude, and similar tools work better with clean text than raw PDFs
- Translation workflows — Translation tools usually want plain text input
- Accessibility — Screen readers and assistive tech handle text better than complex PDF layouts
- Data extraction — Pulling structured information (names, dates, amounts) for spreadsheets
- Plain-text archives — Long-term archives that survive future format changes
Two kinds of PDFs (this matters)
Not every PDF is equal when it comes to text extraction:
1. Text-based PDFs — Created by Word, Google Docs, LaTeX, web "Print to PDF", or any tool that produces real text content. The text is stored as actual characters and extraction is fast and accurate.
2. Image-based (scanned) PDFs — Created by scanners, "snap a photo of a doc" apps, or older fax tools. Each page is essentially a picture; there are no real characters to extract. You'll need OCR (Optical Character Recognition) to convert the images to text first.
A quick test: open the PDF and try to select a paragraph with your cursor. If the text highlights cleanly, it's text-based. If you get a blue rectangle around the whole page, it's image-based and needs OCR.
This article focuses on text-based PDFs. For scanned PDFs, you'll want a dedicated OCR tool first.
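The cursor test above can be roughly approximated in a script: extract the text with any tool and check whether most pages come back nearly empty. The sketch below assumes you already have the extracted text as one string per page. The function name `looks_image_based` and both thresholds are illustrative choices, not part of any library.

```python
def looks_image_based(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is scanned, given the text extracted from each page.

    Returns True when more than half of the pages yielded almost no characters.
    The 20-character threshold is an arbitrary illustrative value.
    """
    if not page_texts:
        return True  # nothing was extracted at all
    sparse = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    return sparse / len(page_texts) > 0.5
```

A PDF that mixes scanned and text-based pages can land on either side of this heuristic, so treat the result as a hint to try OCR rather than a verdict.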
Free methods to convert PDF to text
Method 1: Copy and paste
Open the PDF in any reader, select all (⌘/Ctrl + A), copy, and paste into a text editor. This works for short documents, but:
- Page breaks usually disappear
- Formatting such as columns and tables gets garbled
- Headers and footers get inlined into the body
- Hyphenated line-end words may stay split
Good for a paragraph or two; painful for an entire document.
Method 2: macOS Preview (Export As Text)
Preview can export PDFs, but its plain-text export was removed in newer macOS versions. Workaround: open the file in Preview → File → Export → choose PDF (with text annotations), then copy the text from the result. Alternatively, use a third-party tool.
Method 3: Adobe Acrobat (paid)
File → Export To → Text (Plain) — produces a `.txt` file. Free Acrobat Reader doesn't include this feature.
Method 4: Command line (pdftotext)
The Poppler suite includes `pdftotext`:
```
pdftotext input.pdf output.txt
```
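Add `-layout` to keep the original physical layout (useful for columns and tables), or `-raw` to emit text in the order it appears in the PDF's content stream. Output quality is excellent, and the tool is well suited to scripted batch jobs.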
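For batch jobs, a small wrapper script can call `pdftotext` once per file. The following is a minimal Python sketch, assuming Poppler's `pdftotext` is installed and on your `PATH`. The helper names `build_pdftotext_cmd` and `convert_folder` are hypothetical; only the `-layout` and `-raw` flags come from the tool itself.

```python
import subprocess
from pathlib import Path


def build_pdftotext_cmd(pdf_path, txt_path, layout=False, raw=False):
    """Return the argument list for a single pdftotext call."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical page layout
    if raw:
        cmd.append("-raw")  # keep content-stream order
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_folder(folder, layout=False):
    """Convert every .pdf in `folder` to a .txt file beside it."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        # check=True raises CalledProcessError if pdftotext reports a failure
        subprocess.run(build_pdftotext_cmd(pdf, txt, layout=layout), check=True)
```

Calling `convert_folder("reports", layout=True)` would write `reports/a.txt` next to `reports/a.pdf`, and so on. Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.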
Method 5: Browser-based tools
The most accessible option for non-technical users. Our PDF to Text tool extracts text from text-based PDFs and lets you preview the result, copy it to your clipboard, or download it as a `.txt` file. The PDF is processed entirely in your browser; nothing is uploaded.
Choosing how pages are joined
When a multi-page PDF becomes a text file, you have to decide what happens at page boundaries:
- Double newline (recommended) — Adds a blank line between pages. Keeps reading flow but makes pages distinguishable.
- Single newline — Smaller separation; treats the document as one continuous stream.
- Form-feed character (`\f`) — The Unix-traditional page separator. Preserved by many text editors and useful when you'll process the file programmatically.
- Custom separator — Insert your own marker like `--- Page Break ---` for visual clarity.
If you plan to feed the text into an AI tool or a search index, a single or double newline works best. If you plan to print it or process it as a structured document, page numbers plus a clear separator work better.
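The joining options above can be sketched in a few lines of Python. This assumes the extractor has already given you one string per page. The function `join_pages`, its parameters, and the `[Page N]` marker format are hypothetical names chosen for illustration.

```python
SEPARATORS = {
    "double": "\n\n",  # blank line between pages
    "single": "\n",    # one continuous stream
    "formfeed": "\f",  # traditional page-break character
}


def join_pages(pages, mode="double", custom=None, number_pages=False):
    """Join per-page text with the chosen page separator."""
    if custom is not None:
        # Put a custom marker on its own line between pages
        separator = f"\n{custom}\n"
    else:
        separator = SEPARATORS[mode]
    # Drop leading/trailing newlines so the separator alone controls spacing
    cleaned = [page.strip("\n") for page in pages]
    if number_pages:
        cleaned = [
            f"[Page {n}]\n{text}" for n, text in enumerate(cleaned, start=1)
        ]
    return separator.join(cleaned)
```

If your extractor already writes form feeds between pages, splitting its output with `text.split("\f")` is one way to get the per-page list this function expects.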
Why text extraction sometimes looks weird
Even with a perfectly text-based PDF, output can have quirks:
- Column ordering — A two-column document may have all of column 1 followed by all of column 2, or alternating lines, depending on how the PDF stores text positions
- Reading order — Sidebars, captions, and footnotes may appear in unexpected places
- Hyphenation — Words split at line ends with a `-` may stay split (`exam-ple` rather than `example`)
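- Ligatures — `fi`, `fl`, and `ffi` are sometimes extracted as single ligature characters (for example U+FB01 for `fi`) that some fonts don't render and that plain-text searches for the separate letters won't match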
- Tables — Complex tables flatten into linear text, losing structure
- Page headers/footers — Repeat on every page in the output unless you filter them
- Special characters — Math symbols, accented characters, and CJK text may need a Unicode-aware viewer
These are limitations of how PDFs store text, not of the extraction tool. For pristine output, the source format (Word, Markdown, etc.) is almost always the better starting point.
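Two of these quirks, ligatures and line-end hyphenation, can often be cleaned up after extraction. The sketch below is one possible approach, not a complete fix. The function names are hypothetical, and the dehyphenation rule (merge only when the next line starts with a lowercase letter) is a simplifying assumption.

```python
import re

# Unicode presentation-form ligatures mapped to their plain spellings
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def expand_ligatures(text):
    """Replace single-character ligatures with ordinary letters."""
    return text.translate(LIGATURES)


def dehyphenate(text):
    """Rejoin words split across lines, such as 'exam-' + newline + 'ple'.

    Merges only when the next line starts with a lowercase letter, which
    leaves list markers and most proper nouns alone.
    """
    return re.sub(r"(\w)-\n([a-z])", r"\1\2", text)
```

The dehyphenation rule cannot tell a line-break hyphen from a real one, so a compound such as "well-known" split across two lines would come out as "wellknown". A dictionary check or a manual review pass is the usual safeguard.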
Filtering specific pages
If you only need text from certain pages, extract just those. Range syntax like `1-3, 5, 8-10` is supported by most modern tools. This is faster than extracting everything and trimming, especially for long documents where you only care about an abstract or conclusion.
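For readers scripting their own extraction, the range syntax is straightforward to parse. Here is a minimal Python sketch. The function `parse_page_ranges` is a hypothetical helper, and treating out-of-bounds or reversed ranges as errors is a design assumption rather than a standard.

```python
def parse_page_ranges(spec, page_count):
    """Expand a spec like '1-3, 5, 8-10' into sorted 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

Using a set means overlapping ranges such as `1-3, 2-4` yield each page only once. For a single contiguous range, `pdftotext` accepts `-f` (first page) and `-l` (last page) directly, so no parsing is needed.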
Tips for best results
- Extract per chapter or section — Long documents are easier to use as several smaller text files
- Include page numbers as inline headers if you need to cite back to the original
- Strip headers and footers with a quick find-and-replace in your editor
- Run a spell-check pass — Catches OCR errors and ligature artifacts
- Save the original PDF too — Text extraction is one-way; you can't reconstruct the layout
- Use markdown for structure — If you're extracting to feed into an AI, lightly format with `#` headers and `-` bullets after extraction
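When find-and-replace is too tedious, repeating headers and footers can be stripped with a short script. This Python sketch assumes you have one string per page and that headers and footers occupy the first or last non-blank line of a page. The function name, the 60% threshold, and the digit-masking trick are illustrative assumptions.

```python
import re
from collections import Counter


def _key(line):
    # Mask digits so "Page 3" and "Page 4" count as the same line
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages, min_fraction=0.6):
    """Remove first/last lines that recur on at least `min_fraction` of pages."""
    counts = Counter()
    for page in pages:
        lines = [ln for ln in page.splitlines() if ln.strip()]
        if lines:
            # A set counts each candidate once per page
            counts.update({_key(lines[0]), _key(lines[-1])})
    # Require at least two occurrences so short documents are left alone
    threshold = max(2, min_fraction * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        lines = page.splitlines()
        non_blank = [i for i, ln in enumerate(lines) if ln.strip()]
        drop = set()
        if non_blank:
            for i in (non_blank[0], non_blank[-1]):
                if _key(lines[i]) in repeated:
                    drop.add(i)
        cleaned.append(
            "\n".join(ln for i, ln in enumerate(lines) if i not in drop)
        )
    return cleaned
```

Multi-line headers, or footers that sit above a trailing footnote, will slip past this check, so it is worth skimming the result before discarding the unfiltered text.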
Common use cases
- Quoting research papers — Pull the abstract and key paragraphs for a literature review
- Building searchable archives — Convert a folder of PDFs to text and index with a desktop search tool
- AI summarization — Feed the extracted text into an LLM for a summary or Q&A
- Translation — Get text into a translator that doesn't accept PDFs
- Spreadsheet imports — Pull tabular data from PDF reports into a CSV
- Proofreading — Read your own PDFs in a focused, distraction-free text editor
Privacy considerations
Text extraction tools that run on a server have full access to every word in your PDF. For confidential documents — contracts, medical records, legal filings, internal reports — the safe choice is a client-side tool where the file is read and parsed entirely in your browser. Nothing is sent over the network, nothing is logged, and the extracted text never leaves your device.