Compressing PDFs: File Size vs. Quality Explained

Hiten Mahalwar
Founder, keptlocal · Technical Lead, Healthcare IT

A one-page PDF from Microsoft Word can be 500 KB. A scanned document of the same page might be 3 MB. A slide deck exported from PowerPoint can reach 50 MB. Compressing a PDF is rarely as simple as "make it smaller" — the right approach depends on what is making it large in the first place.

Why are PDFs so large?

A PDF is a container format. It can hold text (as actual characters), vector graphics (as drawing instructions), raster images (as pixel data), fonts, metadata, and more. The dominant contributor to file size varies by document type:

  • Image-heavy documents: embedded images — photographs, screenshots, scanned pages — are almost always the largest component. A slide deck with high-resolution stock photos is large because those photos are large.
  • Scanned PDFs: a scanned page is a photograph of paper. A colour scan of one A4 page at 300 DPI produces an image of 2480 × 3508 pixels, roughly 8.7 megapixels, or about 26 MB as uncompressed 24-bit colour. Stored as JPEG at reasonable quality, that is 500 KB–2 MB per page. A 100-page scanned document is easily 50–200 MB before any further compression.
  • Embedded fonts: PDFs embed font data so the document renders identically on every device. A document using three different font families, each with multiple weights, may carry 1–3 MB of font data before a single word of content is counted. Subsetting — embedding only the characters actually used — is a common optimisation that reduces this significantly.
  • Uncompressed image data: some tools export PDFs with images stored as raw uncompressed bitmaps rather than as JPEG or losslessly compressed (Flate) data. The same photograph at the same resolution is typically 10–20× larger uncompressed than JPEG-compressed.
  • Duplicate resources: PDFs assembled by merging other PDFs sometimes contain duplicated fonts, colour profiles, or images — the same resource embedded multiple times because each source document carried its own copy.
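
The scanned-page figures follow directly from the paper size and the scan resolution. The sketch below works through that arithmetic for an A4 page, assuming 24-bit RGB (3 bytes per pixel); the function names are illustrative and not part of any particular tool.

```python
def scan_pixels(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page scanned at the given resolution."""
    mm_per_inch = 25.4
    return (round(width_mm / mm_per_inch * dpi),
            round(height_mm / mm_per_inch * dpi))


def raw_size_bytes(width_px: int, height_px: int, bytes_per_pixel: int = 3) -> int:
    """Uncompressed size, assuming 8 bits per channel (3 bytes for RGB)."""
    return width_px * height_px * bytes_per_pixel


width, height = scan_pixels(210, 297, 300)  # A4 is 210 x 297 mm
megapixels = width * height / 1_000_000
raw_mb = raw_size_bytes(width, height) / 1_000_000
print(f"{width} x {height} px, {megapixels:.1f} MP, {raw_mb:.1f} MB uncompressed")
# prints: 2480 x 3508 px, 8.7 MP, 26.1 MB uncompressed
```

Multiplying the per-page figure by the page count gives a quick upper bound for a scanned document before any image compression is applied.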

What compression actually does

"Compressing a PDF" can mean several different things depending on the tool:

Re-encoding embedded images

This is the most impactful type of compression for most documents. Images in the PDF are decoded, re-encoded as JPEG (or JPEG 2000) at a lower quality setting, and re-embedded. A photograph stored as a JPEG at 95% quality might drop from 2 MB to 300 KB at 75% quality with little visible difference on screen.

The caveat: if the original image was already lossily compressed before being embedded, re-compressing it introduces generation loss. Each lossy re-encode discards additional data, and the degradation is cumulative.

Reducing image resolution (downsampling)

A photograph exported at 600 DPI is overkill for a document that will only be read on screen or printed on a standard office printer. Downsampling reduces the image dimensions before re-encoding — a 600 DPI image scaled to 150 DPI has one-sixteenth the pixel count and therefore roughly one-sixteenth the file size (before compression). The trade-off is visible when zooming or printing at high quality.
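
The one-sixteenth figure comes from the fact that resolution scales both width and height, so the pixel count shrinks with the square of the DPI ratio. A minimal sketch of that calculation, with illustrative function names:

```python
def downsample_ratio(source_dpi: int, target_dpi: int) -> float:
    """Fraction of the original pixel count left after downsampling."""
    linear = target_dpi / source_dpi
    return linear * linear  # width and height both shrink by the same factor


def downsampled_size(width_px: int, height_px: int,
                     source_dpi: int, target_dpi: int) -> tuple[int, int]:
    """New pixel dimensions, never smaller than 1 x 1."""
    scale = target_dpi / source_dpi
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))


print(downsample_ratio(600, 150))              # 0.0625, i.e. one-sixteenth
print(downsampled_size(4960, 7016, 600, 150))  # (1240, 1754)
```

The saving on disk after JPEG encoding will not match this ratio exactly, since compressed size also depends on image content and quality settings.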

Font subsetting

If a PDF embeds a complete font, including every glyph the font contains, but the document uses only a fraction of those characters, the tool can strip the unused glyph data. A document that sets English text in a multi-script font does not need that font's Cyrillic or Arabic glyphs. Subsetting can often reduce font data by 50–80%.
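
To get a feel for how little of a font a typical document touches, the sketch below counts the distinct characters in a piece of text. It is a simplification: real subsetters work on glyphs rather than characters (ligatures and composite glyphs complicate the mapping), and `FULL_FONT_GLYPHS` is a hypothetical figure, not a measurement of any real font.

```python
def glyphs_needed(text: str) -> set[str]:
    """Distinct printable characters the text uses (line breaks excluded)."""
    return {ch for ch in text if ch.isprintable()}


sample = "Compressing PDFs: file size vs. quality"
needed = glyphs_needed(sample)

# Hypothetical glyph count for a full font; real fonts vary widely.
FULL_FONT_GLYPHS = 2000
print(f"{len(needed)} of {FULL_FONT_GLYPHS} glyphs needed")
```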

Removing metadata and hidden content

PDFs can contain author comments, revision history, embedded thumbnails, document properties, and other metadata that add to file size without contributing to the visible content. Stripping these is lossless — the visible document is unchanged.

Lossless stream compression

PDF streams (the internal data containers) can be compressed using algorithms like Flate (zlib/deflate). Applying or improving this compression is lossless: the bytes in the file change, but every stream decodes to exactly the same content as before, so the document is unchanged and only the file is smaller.
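
The lossless round trip can be demonstrated with Python's standard `zlib` module, which implements the same deflate algorithm that Flate refers to. The payload below is a made-up, repetitive stand-in for a PDF content stream, not data taken from a real file.

```python
import zlib

# A repetitive, text-like payload standing in for a PDF content stream.
stream = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET\n" * 200

compressed = zlib.compress(stream, 9)  # 9 = strongest (slowest) deflate level
restored = zlib.decompress(compressed)

assert restored == stream  # decoding gives back exactly the original bytes
print(len(stream), "->", len(compressed), "bytes")
```

Repetitive content like this compresses very well; streams that already hold JPEG data gain almost nothing from a second, lossless pass.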

Why browser-based PDF compression has limits

True PDF compression — particularly image downsampling and re-encoding at the byte level — requires native code execution for practical performance. Tools like Ghostscript and MuPDF are compiled binaries that can process a 100-page scanned document in seconds. Reimplementing that in pure JavaScript, running inside a browser tab, is computationally prohibitive for large files.

What browser-based tools like keptlocal's Compress Image can do: compress individual images using the Canvas API and JPEG encoder built into the browser. This works well for image files and for PDFs where the dominant content is photographs.

For aggressive PDF-level compression — downsampling embedded images, applying Ghostscript-style linearisation, stripping metadata across a complex document — a desktop tool or a server-side compressor is the right choice. When using a server-side compressor, be aware of the privacy implications of uploading discussed elsewhere on this site.
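
As an illustration of the desktop route, the sketch below builds a Ghostscript command that rewrites a PDF with its colour and greyscale images downsampled. It assumes Ghostscript is installed and available as `gs` (on Windows the executable is usually named differently); the function names and the 150 DPI default are choices made for this example.

```python
import shutil
import subprocess
from pathlib import Path


def ghostscript_args(src: Path, dst: Path, dpi: int = 150) -> list[str]:
    """Build a Ghostscript command that downsamples images to `dpi`."""
    if src.resolve() == dst.resolve():
        raise ValueError("refusing to overwrite the source file")
    return [
        "gs",                      # assumed executable name
        "-sDEVICE=pdfwrite",       # write a new PDF
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        "-dDownsampleColorImages=true",
        "-dColorImageDownsampleType=/Bicubic",
        f"-dColorImageResolution={dpi}",
        "-dDownsampleGrayImages=true",
        "-dGrayImageDownsampleType=/Bicubic",
        f"-dGrayImageResolution={dpi}",
        f"-sOutputFile={dst}",
        str(src),
    ]


def compress_pdf(src: Path, dst: Path, dpi: int = 150) -> None:
    """Run Ghostscript; raises if it is missing or exits with an error."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) not found on PATH")
    subprocess.run(ghostscript_args(src, dst, dpi), check=True)
```

Calling `compress_pdf(Path("scan.pdf"), Path("scan-150dpi.pdf"))` would write a downsampled copy next to the original. Everything runs locally, so nothing is uploaded.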

Choosing a target file size

The right target depends on how the document will be used:

  • Email attachment: most email providers enforce limits of 10–25 MB. A practical target for an attachment is under 5 MB to ensure delivery across varied mail servers. For a document with photos, 150 DPI at 75% JPEG quality typically achieves this without perceptible quality loss at normal reading size.
  • Web upload / CMS: depends on the platform. Many WordPress installations inherit PHP's default 2 MB upload limit. Government portals often impose 5 MB or 10 MB limits. A 150 DPI, 75% quality setting usually clears the 5 MB and 10 MB limits.
  • Print-ready output: do not compress a document you intend to send to a commercial printer. Print requires 300 DPI minimum; many specifications require 300–600 DPI for images. Compressing below 300 DPI produces visible pixelation when printed.
  • Screen-only display: at normal zoom, 96–150 DPI is hard to tell apart from 300 DPI on most screens. Aggressive downsampling to 96 DPI is appropriate for documents that will only ever be read digitally.
  • Archival storage: do not compress archival documents. Store originals at full quality and make compressed copies for distribution if needed.
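
The guidance above can be captured as a small lookup table. This is only a sketch: the preset names are invented for the example, and the 75% quality paired with 96 DPI for screen-only use is an assumption, since the list gives no quality figure for that case.

```python
from typing import NamedTuple, Optional


class Preset(NamedTuple):
    dpi: Optional[int]           # None means "leave images untouched"
    jpeg_quality: Optional[int]  # None means "do not re-encode"


PRESETS = {
    "email": Preset(150, 75),
    "web_upload": Preset(150, 75),
    "screen": Preset(96, 75),      # quality value is an assumption
    "print": Preset(None, None),   # do not compress print-ready files
    "archive": Preset(None, None), # keep archival originals at full quality
}


def preset_for(use_case: str) -> Preset:
    """Look up settings for a use case, failing loudly on unknown names."""
    try:
        return PRESETS[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None


print(preset_for("email"))
```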

What you lose when you compress

Lossless compression (stream compression, metadata removal, font subsetting) loses nothing visible. The document looks and prints identically.

Lossy compression (image re-encoding, downsampling) loses data permanently. The discarded data cannot be recovered from the compressed file. Specific things that degrade:

  • Fine text on image-based pages: scanned documents with small text at the margins may become unreadable after aggressive compression. 75% JPEG quality on a dense text scan can introduce blocking artefacts.
  • Photographs with subtle gradients: skin tones, skies, and gradients show JPEG banding at lower quality settings. On screen this may be acceptable; in a printed portfolio or report it is not.
  • Line art and diagrams: JPEG is terrible at compressing sharp-edged content like text, diagrams, and graphs. The compression algorithm blurs edges. For documents with this content, PNG or lossless PDF compression preserves quality better.

A practical workflow

  1. Identify what is making the file large. Open the PDF properties (File → Properties in Adobe Reader, or inspect via our PDF Info Viewer). If the file is under 5 MB, it may not need compression at all — check the target limit first.
  2. Try lossless options first. Remove unnecessary metadata and apply stream compression before touching image quality. Many tools can reduce file size 20–40% losslessly.
  3. Set the minimum quality that meets your use case. If the document will be read on screen only, 150 DPI and 75% JPEG quality are a good starting point. Aggressive settings below 100 DPI or below 60% quality can introduce visible degradation, especially when zooming in.
  4. Test the result before distributing. Open the compressed file, zoom to 100%, and look at a text-heavy region and an image-heavy region. If both look acceptable, the compression is fine. If either shows artefacts, reduce the compression level.
  5. Keep the original. Never overwrite the source file with the compressed version. Compression is irreversible — you cannot recover quality once discarded.
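
Two of these steps are easy to automate: checking whether the file exceeds the target limit at all, and making sure the compressed copy never replaces the source. A minimal sketch, with illustrative function names and a `-compressed` suffix chosen for the example:

```python
from pathlib import Path


def needs_compression(size_bytes: int, limit_mb: float) -> bool:
    """True if the file is larger than the target limit (decimal megabytes)."""
    return size_bytes > limit_mb * 1_000_000


def compressed_name(src: Path) -> Path:
    """Output path next to the source, never the source itself."""
    return src.with_name(f"{src.stem}-compressed{src.suffix}")


print(compressed_name(Path("report.pdf")))  # report-compressed.pdf
print(needs_compression(6_000_000, 5))      # True
```

In practice the size would come from `Path("report.pdf").stat().st_size`, and the limit from whatever the destination (mail server, upload form) enforces.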

Working with individual images rather than a whole PDF? Try the keptlocal Compress Image tool, which runs in your browser with no upload required.