
This how-to tutorial shows how to extract text from PDF files in R. It covers text extraction from digital PDFs, optical character recognition (OCR) for scanned documents, and batch processing of PDF collections for downstream text analysis. It is aimed at researchers in corpus linguistics and the digital humanities who need to convert PDF documents into plain text for computational analysis.

This tutorial is part of the Language Technology and Data Analysis Laboratory (LADAL), a free, open-access research infrastructure at the University of Queensland. LADAL provides tutorials, tools, and courses for researchers working with language data. All materials are freely available at https://ladal.edu.au and are part of the Language Data Commons of Australia (LDaCA), funded by ARDC and NCRIS.
