PDF to Text — Free & Private

How to extract text from a PDF

Drop your PDF into the zone above, or click to browse and select it.
Click Extract text. The tool processes each page in turn — you will see the progress as it works.
The text appears in the box above, organised by page. Click Copy all to copy it to your clipboard, or Download .txt to save it as a plain text file.

Everything runs in your browser using pdf.js. No file is sent to a server — open DevTools (F12) → Network while extracting to confirm zero upload requests.

Text PDFs vs. scanned PDFs: understanding the difference

Not all PDFs contain extractable text. Before running the tool, it helps to know which type you have.

A text-based PDF has actual text data embedded in the file. When you open it in a PDF viewer and can click-drag to select words, the text is extractable. This includes PDFs created by word processors (Word, Google Docs), design tools (InDesign, Illustrator), presentation software, and most modern office applications. These extract cleanly.

An image-based PDF — commonly called a scanned PDF — is a sequence of page images with no text layer. When you scan a paper document and save as PDF, each page is a photograph of the paper, not a structured text document. Opening a scanned PDF in a viewer means you are looking at a picture of text, not text itself. You cannot select words, because there are no words to select — only pixels. This tool returns empty output for scanned PDFs.

If you are unsure which type you have, try selecting a word in your PDF viewer. If you can highlight it, the PDF has a text layer. If the cursor does not respond to selection, it is likely scanned. The PDF to JPG tool can convert scanned pages to images for use with external OCR services.
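The same check can be done programmatically. The sketch below is illustrative only: `hasTextLayer` is a hypothetical helper, and it assumes its input has the shape that pdf.js returns from page.getTextContent(), namely an object with an `items` array whose entries carry a `str` field.

```javascript
// Returns true when at least one text item contains non-whitespace characters.
// Assumed input shape (as produced by pdf.js page.getTextContent()):
//   { items: [{ str: "..." }, ...] }
function hasTextLayer(textContent) {
  return textContent.items.some(
    (item) => typeof item.str === "string" && item.str.trim().length > 0
  );
}

// Hypothetical sample pages: a scan has no text items, a text PDF has some.
const scannedPage = { items: [] };
const textPage = { items: [{ str: "Quarterly report" }] };

console.log(hasTextLayer(scannedPage)); // false
console.log(hasTextLayer(textPage)); // true
```

A page that yields no text items is most likely an image-only scan, although a genuinely blank page would look the same to this check.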

When to extract text from a PDF

Copying content into another document — extract the text from a report or article so you can paste it into a Word document, Google Doc, or email.
Searching inside a large PDF — extract and paste into a text editor with better search than your PDF viewer.
Feeding text to an AI tool — extract the content of a PDF so you can paste it into ChatGPT, Claude, or another AI for summarisation or analysis.
Data extraction from reports — pull tables and figures from PDFs for further processing in a spreadsheet.
Accessibility — convert a PDF to plain text for easier reading in a text-to-speech tool or screen reader.

How it works under the hood

pdf.js loads the PDF and parses its page content streams. Each page in a PDF contains drawing instructions: text is stored as positioned character sequences, not as flowing paragraphs. The tool calls pdf.js's page.getTextContent() to retrieve each text item with its position data, then joins adjacent items into lines and pages.
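As a rough illustration of the joining step, here is a minimal sketch that groups text items into lines by their vertical position. It is not the tool's actual code: `itemsToLines` and the `tolerance` parameter are hypothetical, and the items are assumed to follow the pdf.js shape where `transform[4]` is x and `transform[5]` is y.

```javascript
// Group pdf.js-style text items into lines by baseline (y) position.
// Assumed item shape: { str, transform: [a, b, c, d, x, y] }.
// `tolerance` (hypothetical) is how far apart two baselines may be
// while still counting as the same line.
function itemsToLines(items, tolerance = 2) {
  const lines = [];
  for (const item of items) {
    const x = item.transform[4];
    const y = item.transform[5];
    let line = lines.find((l) => Math.abs(l.y - y) <= tolerance);
    if (!line) {
      line = { y, parts: [] };
      lines.push(line);
    }
    line.parts.push({ x, str: item.str });
  }
  // PDF y coordinates grow upward, so a larger y is nearer the top of the page.
  lines.sort((a, b) => b.y - a.y);
  return lines.map((l) =>
    l.parts
      .sort((p, q) => p.x - q.x)
      .map((p) => p.str)
      .join(" ")
      .replace(/\s+/g, " ")
      .trim()
  );
}

// Hypothetical items, deliberately out of order.
const items = [
  { str: "world", transform: [1, 0, 0, 1, 60, 700] },
  { str: "Hello", transform: [1, 0, 0, 1, 20, 700] },
  { str: "Second line", transform: [1, 0, 0, 1, 20, 686] },
];

console.log(itemsToLines(items).join("\n"));
// Hello world
// Second line
```

Joining every item with a space is a simplification. A real implementation would compare the gap between items to decide whether a space belongs there, since a single word can be split across several items.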

Because PDF does not natively store reading order, the tool reconstructs it from character positions. Simple single-column documents extract cleanly. Complex multi-column layouts, tables, and rotated text may produce text in an unexpected order — this is a structural limitation of the PDF format.

Limits and what to expect

Scanned PDFs: a PDF that is just a scan (image of a page) contains no text data — only pixels. The tool will return empty output. To extract text from a scanned PDF, you need OCR (optical character recognition) — a feature we plan to add via optional cloud processing.
Complex layouts: multi-column layouts, text in tables, and rotated text may extract in a different order than they visually appear. For clean extraction, straightforward single-column documents work best.
Ligatures and special characters: some fonts use ligatures (like "fi" rendered as a single glyph) that may not extract correctly. This depends on how the font is embedded in the PDF.
Password-protected PDFs: the PDF must be openable without a password. PDFs that restrict content viewing cannot be extracted.
Large documents: each page is processed in sequence. Very large PDFs (100+ pages) take a few seconds — you will see progress as each page completes.
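For the ligature case above, one possible clean-up step is to expand ligature characters after extraction. This is a sketch under an assumption: it only helps when the PDF maps the glyph to a Unicode ligature code point such as U+FB01, and it cannot repair fonts whose character mapping is missing. `expandLigatures` is a hypothetical helper name.

```javascript
// Expand Latin ligature characters (U+FB00 to U+FB06, e.g. the single-character
// "fi" ligature U+FB01) into their plain letters using Unicode NFKC normalization.
// Only that range is touched, so other characters are left exactly as extracted.
function expandLigatures(text) {
  return text.replace(/[\uFB00-\uFB06]/g, (ch) => ch.normalize("NFKC"));
}

console.log(expandLigatures("\uFB01nancial of\uFB01ce")); // financial office
```

Restricting the replacement to the ligature range is a deliberate choice here: normalizing the whole string with NFKC would also rewrite superscripts, full-width forms, and similar characters.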

Why multi-column layouts extract poorly

PDF does not store text in reading order. Text is stored as a series of positioned drawing instructions: draw this character at coordinate (x, y), draw this character at (x+8, y), and so on. The format was designed for visual rendering, not for text reuse. When extracting text, a parser has to reconstruct reading order from those coordinates — and for complex layouts, that reconstruction is imperfect.

A two-column academic paper, for example, may extract with alternating lines from each column: the first line of column one, then the first line of column two, then the second line of column one, and so on. Positionally correct — but unreadable as prose. Tables present the same problem: cells may extract row by row, or in column order, or in the order they were drawn, which may match neither. This is not a flaw in any particular extraction tool. It is a structural limitation of the PDF format itself.
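The interleaving effect is easy to reproduce with a toy example. In the sketch below the fragments, their coordinates, and the gutter position are all hypothetical; it compares a purely positional sort with a column-aware one that assumes the column boundary is already known.

```javascript
// Hypothetical text fragments from a two-column page (y grows upward).
const fragments = [
  { str: "Left 1", x: 50, y: 700 },
  { str: "Left 2", x: 50, y: 686 },
  { str: "Right 1", x: 320, y: 700 },
  { str: "Right 2", x: 320, y: 686 },
];

// Naive order: top to bottom, then left to right within the same height.
function naiveOrder(frags) {
  return [...frags].sort((a, b) => b.y - a.y || a.x - b.x).map((f) => f.str);
}

// Column-aware order: split at an assumed gutter x, then read each column
// top to bottom, left column first.
function columnOrder(frags, gutterX) {
  const byTop = (a, b) => b.y - a.y;
  const left = frags.filter((f) => f.x < gutterX).sort(byTop);
  const right = frags.filter((f) => f.x >= gutterX).sort(byTop);
  return [...left, ...right].map((f) => f.str);
}

console.log(naiveOrder(fragments).join(" | "));
// Left 1 | Right 1 | Left 2 | Right 2
console.log(columnOrder(fragments, 300).join(" | "));
// Left 1 | Left 2 | Right 1 | Right 2
```

The column-aware version only works because the gutter position is handed to it. Finding that boundary automatically, on arbitrary pages, is the hard part.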

For documents with complex layouts, the extracted text is still useful: it contains all the content, just in a non-ideal order. For data extraction purposes (pulling figures or identifiers from a table), searching and parsing the raw extracted text is usually workable. For feeding a large PDF to an AI tool, the reading-order imperfection rarely matters, because language models usually cope well with minor ordering issues in long documents.
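As an example of parsing raw extracted text, the sketch below pulls identifiers and totals out with a regular expression. The sample text, the "INV-" identifier format, and the `findInvoices` helper are all hypothetical; a real document would need its own pattern.

```javascript
// Hypothetical extracted text. The order of the lines does not matter to the parser.
const extracted = [
  "Invoice INV-2041 Total 1,250.00",
  "Notes: net 30",
  "Invoice INV-2042 Total 980.50",
].join("\n");

// Find every "INV-<digits> ... Total <amount>" pair that sits on a single line.
function findInvoices(text) {
  const pattern = /\b(INV-\d+)\b.*?\bTotal\s+([\d,]+\.\d{2})/g;
  return [...text.matchAll(pattern)].map((m) => ({
    id: m[1],
    total: Number(m[2].replace(/,/g, "")), // drop thousands separators
  }));
}

for (const inv of findInvoices(extracted)) {
  console.log(inv.id, inv.total);
}
// INV-2041 1250
// INV-2042 980.5
```

Because the pattern matches within a line, it keeps working when whole lines come out in an unexpected order, but not when a table row is split across several lines.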

What you can do with extracted text

Once you have plain text from a PDF, a range of workflows become available:

AI summarisation: Paste the extracted text into Claude, ChatGPT, or Gemini to get a summary, answer specific questions, or extract structured data. This is often faster than reading a long report. Extracting the text first removes the need to upload the PDF to the AI service: you paste clean text instead of sending the file.
Search and find: If your PDF viewer's search is slow or unreliable, extract the text and open it in a text editor with fast search. Code editors like VS Code can search across thousands of lines instantly.
Translation: Machine translation services (DeepL, Google Translate) accept pasted plain text, so you can translate a document without uploading the PDF file itself. Extract first, then translate. The resulting translation is clean text you can paste back into a document.
Accessibility: Plain text is the most compatible format for screen readers and text-to-speech software. Extracting a PDF's text and reformatting it slightly can make a document accessible to people who cannot use a PDF viewer effectively.
Data pipelines: Developers processing reports, invoices, or filings programmatically often need a quick way to inspect a PDF's text content before writing a parser. This tool lets you see what the text layer contains without writing any code.

Privacy compared to other PDF-to-text tools

Online PDF-to-text converters upload your document to a server for processing. For PDFs containing sensitive information — client reports, legal documents, financial statements, medical records — that upload is a real risk.

keptlocal extracts text entirely inside your browser. Your PDF never leaves your device. The extracted text is displayed in the browser and, if you choose, saved as a .txt file generated on your device. No intermediate server is involved at any point.

Frequently asked questions

Are my files uploaded to a server?

No. Text extraction runs entirely in your browser using pdf.js. Your PDF never leaves your device — open DevTools → Network while processing to confirm zero upload requests.

Why does the extracted text look scrambled or out of order?

PDF does not store text in reading order; it stores it as positioned drawing instructions. The tool reconstructs reading order from the position data that pdf.js returns, but complex multi-column layouts, tables, and rotated text can produce unexpected ordering. This is a limitation of the PDF format, not the tool.

Can I extract text from a scanned PDF?

Only if the PDF was OCR-processed and contains an embedded text layer. A scanned PDF that is just an image of a page contains no extractable text — the tool will return empty output for those pages.

What encoding is the downloaded .txt file?

UTF-8, which can represent every Unicode character, so text in any language extracted from the source PDF is preserved.
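To see what UTF-8 encoding means in practice, the short sketch below encodes a mixed-language string to bytes and decodes it back. It uses the standard TextEncoder and TextDecoder APIs; the sample string is hypothetical and this is not the tool's download code.

```javascript
// Encode a sample string as UTF-8 bytes and decode it again.
// In a browser, new Blob([text], { type: "text/plain;charset=utf-8" })
// would apply the same encoding when building a downloadable file.
const text = "R\u00E9sum\u00E9: \u6587\u66F8 \u2713";
const bytes = new TextEncoder().encode(text); // TextEncoder always produces UTF-8
const roundTrip = new TextDecoder("utf-8").decode(bytes);

console.log(roundTrip === text); // true
console.log(bytes.length > text.length); // true: non-ASCII characters take several bytes
```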

Is there a page limit?

No hard limit. Large PDFs take a few seconds as each page is processed in turn — you will see the progress update as pages are extracted.
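A page-by-page loop with progress reporting might look like the sketch below. It is an assumption-laden illustration, not the tool's code: `extractPages` and `onProgress` are hypothetical names, and the stand-in `fakePdf` object only mimics the `numPages`, `getPage(n)`, and `getTextContent()` members that pdf.js document and page objects expose.

```javascript
// Extract pages one at a time, calling onProgress after each page completes.
async function extractPages(pdf, onProgress = () => {}) {
  const pages = [];
  for (let n = 1; n <= pdf.numPages; n++) { // pdf.js pages are numbered from 1
    const page = await pdf.getPage(n);
    const content = await page.getTextContent();
    pages.push(content.items.map((item) => item.str ?? "").join(" "));
    onProgress(n, pdf.numPages);
  }
  return pages;
}

// Stand-in document so the sketch runs without pdf.js. A real document would
// come from pdfjsLib.getDocument(...).promise.
const fakePdf = {
  numPages: 2,
  getPage: async (n) => ({
    getTextContent: async () => ({ items: [{ str: `Page ${n} text` }] }),
  }),
};

extractPages(fakePdf, (done, total) => console.log(`${done}/${total}`)).then(
  (pages) => console.log(pages.join("\n"))
);
```

Processing pages sequentially keeps memory use predictable and makes the progress counter meaningful, at the cost of not overlapping work across pages.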

Can I extract text from a password-protected PDF?

Only PDFs that can be opened without a password. If the PDF restricts content viewing, extraction will fail — unlock it in your PDF reader first.
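One way a page could surface this failure is sketched below. It rests on an assumption that should be checked against the pdf.js version in use: that loading a protected file rejects with an error whose `name` is "PasswordException". `describeLoadError` and both messages are hypothetical.

```javascript
// Turn a PDF load failure into a user-facing message.
// Assumption: pdf.js rejects with an error named "PasswordException"
// when the document cannot be opened without a password.
function describeLoadError(err) {
  if (err && err.name === "PasswordException") {
    return "This PDF is password-protected. Unlock it in your PDF reader first.";
  }
  return "Could not open this PDF.";
}

// Hypothetical errors standing in for real load failures.
console.log(describeLoadError({ name: "PasswordException" }));
console.log(describeLoadError(new Error("bad file")));
```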