How can PDF files be processed automatically by a program? When you click a PDF file, its text and images are displayed the same way on any machine, and the result looks and feels as if it had been printed on paper. This article looks at PDF files from a programmer’s point of view and offers some ideas for automating PDF processing programmatically.
Basic Framework of PDF Technology
Before thinking about how to process PDF files automatically, we need to understand the basic framework of PDF technology. A PDF file contains the elements needed to “reproduce the same appearance of the document in various machine environments”. A file format developed specifically for this purpose, together with an application designed specifically to interpret and display it, is arguably the most basic framework of “PDF” technology.
[Figure: the internal structure of a PDF file produces the same display in a PDF display application on one machine and in an application on another machine.]
I want to automate the processing of PDF files programmatically
Earlier, I summarized the purpose of the technology called PDF as follows.
Reproduce the same look and feel of documents in various machine environments.
However, if you are used to batch processing multiple files by writing your own programs or combining Unix tools, you may want to use PDF files for other purposes. For example, many people want to do the following with PDF files.
I want to write a program that converts PDF files to documents in other formats.
I would like to use the data provided as a PDF file as a source for statistical processing.
I want to search the full text of multiple PDF files using something like grep.
For these purposes, the PDF’s original goal of a consistent visual appearance is not very important. What matters instead is being able to extract “text data” (see the note below) efficiently from a PDF file without launching an application to view it. A PDF file appears to contain text data that could serve these purposes, so it seems natural to open the file directly in a general-purpose programming language and process only the text data.
In practice, however, extracting text data from PDF files is very difficult. This article explains why it is hard and what approaches are possible if you actually try, mainly for readers with programming experience.
Note: In this article, “text data” refers to the general data representation used when handling information that can be expressed in characters. The exact representation does not matter, but think of it as a “human-meaningful sequence of Unicode characters” (encoded on a computer as a string of bytes). It is data that can be manipulated as strings in general-purpose programming languages, or used as input and output for grep and other basic Unix tools.
The PDF file itself does not contain text data
You cannot get text data just by unraveling the structure of a PDF file. Indeed, depending on the PDF file, the “characters that make up text data” may not be included at all.
Instead, the PDF file contains information about which character of which font should be placed where on the screen. This information is sufficient for PDF’s purpose of “reproducing the same appearance in various machine environments”.
In other words, it can be said that what is needed to display a PDF file is “characters that are pictures”, not “characters that make up text data” (see the figure below). Conversely, text data is not necessary to display the PDF file. In short, this is the main reason why extracting text data from PDF files is so difficult.
[Figure: “characters” as pictures versus “characters” as codes.]
I still want to extract text data from PDF files
Displaying a PDF does not require text data, but that does not mean the display application never needs text data. Text data is required for functions such as selecting displayed characters and copying them to the clipboard, and for searching within a file. For that reason, the PDF specification also has a section on retrieving text data (Section 9.10 of ISO 32000-1:2008).
However, this section does not describe how to read a PDF file to obtain text data. It only explains how to extract text data once the display application has the necessary information ready.
A brief outline of how PDF files work, with an example of the process:
1. Parse binary data to find the content stream
First, the binary data is parsed to find the data structure that will become the page when viewing the PDF file. This data structure, called a “content stream”, is scattered throughout the PDF file (as mentioned earlier, this article does not discuss how to find a content stream in a PDF file).
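As a rough illustration of this step, the sketch below scans raw bytes for `stream` ... `endstream` pairs and tries to inflate each one with zlib. This is a brute-force shortcut under stated assumptions: it assumes the streams use the common Flate compression, and it ignores the cross-reference table that a real reader would follow. The function name `iter_inflated_streams` and the in-memory sample object are made up for the example.

```python
import re
import zlib

# Matches the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def iter_inflated_streams(pdf_bytes):
    """Yield the inflated bytes of every stream that looks Flate-compressed."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes left before
            # "endstream"; they end up in unused_data and are dropped.
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate data (uncompressed, an image, another filter): skip.
            continue


# A tiny hand-made object, so the sketch runs without a real PDF file.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
fake_pdf = (
    b"4 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(content)
    + b"\nendstream\nendobj\n"
)
streams = list(iter_inflated_streams(fake_pdf))
```

Not every stream found this way is a page content stream; fonts and images are stored as streams too, so a real tool still has to walk the page structure to know which ones to read.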
This is easily confused with “text data”, but in the PDF specification the characters displayed on the page (that is, the sequence of “characters as pictures”) are simply called “text”. The basic strategy from here on is to read the text placed on the page from the content stream and interpret it as text data.
2. Read the content stream
At least the following four types of PDF operators need to be implemented to extract text data from a content stream.
The BT and ET operators, which mark the beginning and end of text in the content stream.
The Tm and Td operators, among others, which position text on the page.
The Tf operator, which selects the font.
The TJ and Tj operators, among others, which draw text.
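The four operator groups above could be picked out of an already-decoded content stream roughly as follows. This is a minimal sketch, not a complete parser: it assumes literal strings without nested parentheses, ignores dictionaries and inline images, and treats every token that is not a string, name, number, or array bracket as an operator. The names `TOKEN_RE` and `text_operations` are made up for the example.

```python
import re

TOKEN_RE = re.compile(
    rb"\((?:\\.|[^\\()])*\)"   # literal string (no nested parentheses)
    rb"|<[0-9A-Fa-f\s]*>"      # hex string
    rb"|\[|\]"                 # array delimiters
    rb"|/[^\s/\[\]()<>]+"      # name such as /F1
    rb"|[^\s/\[\]()<>]+",      # number or operator keyword
    re.DOTALL,
)
NUMBER_RE = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")
TEXT_OPERATORS = {b"BT", b"ET", b"Tf", b"Td", b"Tm", b"Tj", b"TJ"}


def text_operations(stream):
    """Return (operator, operands) pairs for text operators inside BT ... ET."""
    operations = []
    operands = []
    in_text = False
    for match in TOKEN_RE.finditer(stream):
        token = match.group(0)
        starts_operand = token[:1] in (b"(", b"<", b"[", b"]", b"/")
        if starts_operand or NUMBER_RE.fullmatch(token):
            # Operands come first; the operator that uses them follows.
            operands.append(token)
            continue
        if token == b"BT":
            in_text = True
        if in_text and token in TEXT_OPERATORS:
            operations.append((token.decode("ascii"), operands))
        if token == b"ET":
            in_text = False
        operands = []
    return operations


sample = b"BT /F1 12 Tf 72 720 Td (Hello) Tj [(Wor) -20 (ld)] TJ ET"
ops = text_operations(sample)
```

The operands are kept as raw bytes on purpose, since at this stage nothing is known yet about how the string arguments should be decoded.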
3. Get text data from the parameters of the text drawing operator
If you view a content stream as plain text in an editor, the arguments to the TJ and Tj operators look like text data. However, they cannot be used as text data as they stand. Searching the content stream with a regular expression for TJ/Tj operators and extracting their arguments does not give you text data.
Why can’t we? There are three main reasons:
The format and encoding of the parameters depend on the PDF generation tool implementation and font type.
What can be directly understood from the parameters is how to find the information to draw characters as pictures from a certain font, not necessarily text data.
The order of text data cannot be determined solely by the positional relationship of TJ/Tj operators in the content stream.
The first reason concerns how to read the parameters of the TJ/Tj operators. By design, the arguments to the PDF operators that draw text can be either “literal strings” or “hex strings”, which have completely different formats. In addition, the encoding of these strings depends on the font. In some cases the encoding can only be determined by looking inside the embedded font.
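To make the two formats concrete, here is a small sketch that turns either kind of string token into raw bytes. The helper name `decode_pdf_string` is made up for the example, and the result is deliberately `bytes`, not text: which characters those bytes stand for still depends on the font.

```python
# Single-character escapes allowed after a backslash in a literal string.
ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("b"): 0x08,
    ord("f"): 0x0C, ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
}


def _unescape_literal(body):
    out = bytearray()
    i = 0
    while i < len(body):
        byte = body[i]
        i += 1
        if byte != 0x5C:            # ordinary byte, copied as-is
            out.append(byte)
            continue
        if i >= len(body):          # stray backslash at the very end
            break
        nxt = body[i]
        if nxt in ESCAPES:
            out.append(ESCAPES[nxt])
            i += 1
        elif 0x30 <= nxt <= 0x37:   # octal escape, one to three digits
            j = i
            while j < len(body) and j < i + 3 and 0x30 <= body[j] <= 0x37:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif nxt in (0x0A, 0x0D):   # backslash + end of line: continuation
            i += 1
            if nxt == 0x0D and i < len(body) and body[i] == 0x0A:
                i += 1
        else:                       # unknown escape: drop the backslash
            out.append(nxt)
            i += 1
    return bytes(out)


def decode_pdf_string(token):
    """Decode a '(literal)' or '<hex>' string token into raw bytes."""
    if token.startswith(b"<") and token.endswith(b">"):
        digits = b"".join(token[1:-1].split())   # whitespace is ignored
        if len(digits) % 2:
            digits += b"0"                       # odd length: pad with 0
        return bytes.fromhex(digits.decode("ascii"))
    if token.startswith(b"(") and token.endswith(b")"):
        return _unescape_literal(token[1:-1])
    raise ValueError("not a PDF string token")
```

For example, `decode_pdf_string(b"(Hello)")` and `decode_pdf_string(b"<48656C6C6F>")` give the same five bytes even though the two tokens look nothing alike in the file.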
The second reason is that the parameters read this way are usually not text data themselves. Especially with Japanese fonts, in many cases the parameter is nothing more than an identifier for finding the character in the font.
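Turning such identifiers into text requires a separate lookup table. One place a PDF can carry such a table is a ToUnicode CMap attached to the font, which this article does not cover in detail. The sketch below assumes that such a CMap is available, that it uses only `beginbfchar` ... `endbfchar` entries, and that every character code is two bytes wide; the function names and the sample CMap are made up for the example.

```python
import re

BFCHAR_BLOCK_RE = re.compile(rb"beginbfchar(.*?)endbfchar", re.DOTALL)
PAIR_RE = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap):
    """Build {character code: Unicode string} from bfchar entries."""
    mapping = {}
    for block in BFCHAR_BLOCK_RE.finditer(cmap):
        for src, dst in PAIR_RE.findall(block.group(1)):
            # The destination is UTF-16BE written as hex digits.
            text = bytes.fromhex(dst.decode("ascii")).decode("utf-16-be")
            mapping[int(src, 16)] = text
    return mapping


def codes_to_text(codes, mapping, code_size=2):
    """Map fixed-width character codes to text; unknown codes become U+FFFD."""
    chars = []
    for i in range(0, len(codes), code_size):
        code = int.from_bytes(codes[i:i + code_size], "big")
        chars.append(mapping.get(code, "\ufffd"))
    return "".join(chars)


# Hypothetical CMap fragment mapping two glyph identifiers to hiragana.
sample_cmap = b"""
2 beginbfchar
<0001> <3042>
<0002> <3044>
endbfchar
"""
mapping = parse_bfchar(sample_cmap)
text = codes_to_text(bytes.fromhex("00010002"), mapping)
```

When a font has no such table, the identifiers cannot be mapped this way, which is one reason text extraction sometimes yields nothing usable.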
So how can PDF files be processed automatically? Frankly, you do not need to read the binary files yourself to extract text data from PDF files. There are existing tools for extracting text data from PDF files. Building on this knowledge, a group of developers launched a PDF editing tool, Online PDF Editor, on the AbcdPDF platform, although that is off topic here.
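As one possible way to lean on an existing tool, the sketch below calls the `pdftotext` command-line program and then searches its output in a grep-like way. It assumes `pdftotext` (shipped with Poppler and Xpdf) is installed and on the PATH; the article does not name a specific tool, so this choice and the helper names are only an illustration.

```python
import re
import subprocess


def pdftotext_command(pdf_path):
    """Build the command line; '-' sends the extracted text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext and return its output as a Python string."""
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def grep_text(text, pattern):
    """Return (line number, line) pairs whose line matches the pattern."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]
```

A typical use would be `grep_text(extract_text("report.pdf"), r"invoice")`, where `report.pdf` is a placeholder file name. The quality of the result still depends on what the PDF contains, for the reasons described above.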