In this post, I want to share how to convert a PDF to images using the command-line tool pdftoppm.

Install Poppler

pdftoppm is provided by the poppler project.

Install on Windows

On Windows, we can install the latest version of poppler via conda:

conda install -c conda-forge poppler

On Windows, the pdftoppm tool will be installed in ANACONDA_ROOT/Library/bin. We should add this directory to the Windows PATH.
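A quick way to confirm that the directory really is on PATH is to ask Python where it finds the executable. This is a minimal sketch using only the standard library; the helper name `find_tool` is my own and not part of poppler.

```python
import shutil


def find_tool(name="pdftoppm"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("pdftoppm not found; check that its directory is on PATH")
    else:
        print("pdftoppm found at", path)
```

If this reports that pdftoppm is not found, open a new terminal after editing PATH and try again, since already-running shells keep the old value.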

We need a newer version of pdftoppm to use some of its features, for example, exporting to JPEG format¹. Note that the poppler provided by this page is too old to be useful.
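The JPEG check can also be scripted. The sketch below assumes that the help text lists a `-jpeg` option on its own line whenever JPEG export is available; both function names are hypothetical helpers, not part of poppler.

```python
import re
import subprocess


def help_mentions_jpeg(help_text):
    """True if the help text has an option line starting with -jpeg."""
    return re.search(r"^\s*-jpeg\b", help_text, flags=re.MULTILINE) is not None


def pdftoppm_supports_jpeg(exe="pdftoppm"):
    """Run `pdftoppm --help` and look for the -jpeg option in its output."""
    try:
        result = subprocess.run([exe, "--help"], capture_output=True, text=True)
    except FileNotFoundError:
        return False  # the executable is not on PATH
    # The help text may go to stdout or stderr, so check both.
    return help_mentions_jpeg(result.stdout + result.stderr)


if __name__ == "__main__":
    print("JPEG export supported:", pdftoppm_supports_jpeg())
```

Keeping the text check separate from the subprocess call makes it easy to try the parsing logic on a saved copy of the help output.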

Install on Ubuntu

To install poppler on Ubuntu, use apt-get:

apt-get update && apt-get install -y poppler-utils

This package installs the poppler command line utilities, such as pdftoppm, which we are going to use.

Install on macOS

On macOS, poppler can be easily installed via homebrew:

brew install poppler

How to use

To convert a single page of a PDF to an image, we can run the following command:

pdftoppm -singlefile -f 4 -r 72 -jpeg -jpegopt quality=90 presentation.pdf test_poppler

The PDF file we want to convert to images is presentation.pdf. The generated image name prefix is test_poppler. The image extension is decided by the exported image format. An explanation of the options used:

-singlefile: convert only one page of the PDF. It is used together with the -f option to select which page to convert.
-f: index of the PDF page you want to convert. The page index starts at 1.
-r: image DPI in both the x and y directions. To set the DPI for each direction separately, use -rx and -ry instead.
-jpeg: convert the PDF page to JPEG format.
-jpegopt: options used when converting PDF pages to JPEG images. For the options and their meanings, see here.
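If you need to run the same command from a script, the options above map directly onto an argument list. The sketch below rebuilds the example command in Python; the function and parameter names are illustrative and not part of poppler.

```python
import os
import shutil
import subprocess


def build_pdftoppm_cmd(pdf_path, prefix, page, dpi=72, quality=90):
    """Build the argument list for converting one PDF page to JPEG."""
    return [
        "pdftoppm",
        "-singlefile",                     # write a single image, <prefix>.jpg
        "-f", str(page),                   # page to convert (index starts at 1)
        "-r", str(dpi),                    # DPI in both x and y direction
        "-jpeg",                           # export as JPEG
        "-jpegopt", f"quality={quality}",  # JPEG options
        pdf_path,
        prefix,
    ]


if __name__ == "__main__":
    cmd = build_pdftoppm_cmd("presentation.pdf", "test_poppler", page=4)
    print(" ".join(cmd))
    # Only run the conversion when the tool and the input file are present.
    if shutil.which("pdftoppm") and os.path.exists("presentation.pdf"):
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than one shell string avoids quoting problems when the PDF file name contains spaces.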

According to my test, pdftoppm works great and can produce the needed images quickly.

Using pdf2image

If you want to use Python, there is also a package named pdf2image, which is a thin wrapper around pdftoppm. Make sure you have installed pdftoppm and set its PATH correctly.

The following script shows an example of how to use the package.

from pdf2image import convert_from_path

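def main():
    # Convert only page 2 (page numbers start at 1). single_file=True
    # corresponds to pdftoppm's -singlefile option.
    pages = convert_from_path("presentation.pdf", first_page=2,
                              single_file=True)
    # The result is a list, so the converted page is its first element.
    pages[0].save("test_pdf2image.jpg", quality=85)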

if __name__ == "__main__":
    main()

The function convert_from_path() converts the PDF to a list of PIL Image objects. You can then manipulate the images with the powerful functionality provided by the Pillow package.
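Since the result is an ordinary list of Pillow images, saving every page is a short loop. The following is a minimal sketch; `page_filename` and `save_pages` are hypothetical helper names, and the zero-padded naming scheme is my own choice rather than anything pdf2image prescribes.

```python
def page_filename(prefix, page_number, total_pages, ext="jpg"):
    """Build a zero-padded file name such as 'slide-03.jpg'."""
    width = len(str(total_pages))
    return f"{prefix}-{page_number:0{width}d}.{ext}"


def save_pages(pdf_path, prefix, quality=85):
    """Convert every page of a PDF and save each one as a JPEG."""
    # Imported here so page_filename() works without pdf2image installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path)
    names = []
    for number, page in enumerate(pages, start=1):
        name = page_filename(prefix, number, len(pages))
        # JPEG has no alpha channel, so make sure the image is RGB.
        page.convert("RGB").save(name, quality=quality)
        names.append(name)
    return names
```

Calling `save_pages("presentation.pdf", "slide")` would then write one JPEG per page and return the list of file names.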

There are also a few important parameters to note:

dpi: this changes the size and quality of the generated images. If you want to generate high-quality images, use a large dpi, e.g., 300.
thread_count: use multi-threading to accelerate image generation. The author suggests no more than 4 threads; however, I found that more threads lead to slightly faster conversion. You may tweak it to fit your needs.
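To get a feel for what dpi does to the image size: PDF page dimensions are measured in points (1/72 inch), so the pixel dimensions are roughly the page size in points multiplied by dpi / 72. The sketch below works this out for a US Letter page (612 x 792 points, an assumed example page size) and shows how the two parameters are passed; the function names are my own.

```python
def pixel_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def convert_with_settings(pdf_path, dpi=300, thread_count=4):
    """Convert a PDF with explicit dpi and thread_count settings."""
    # Imported here so pixel_size() works without pdf2image installed.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi, thread_count=thread_count)


if __name__ == "__main__":
    for dpi in (72, 200, 300):
        print(dpi, pixel_size(612, 792, dpi))
```

Going from 72 to 300 dpi multiplies each side by about 4.2, so the pixel count (and memory use) grows by a factor of about 17. That is worth keeping in mind when converting long documents with several threads.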

I have also written a more detailed script to generate images directly from a PPT file on the command line. You can find the script here.

References

Note that older versions of pdftoppm only support the PPM and PNG formats. Newer versions also support exporting to JPEG and TIFF. You can check whether exporting to JPEG is supported by running pdftoppm --help on the command line.