Chapter 12 Parameters for Commands - 12-25 -extractText: Extracting Text - Antenna House PDF Tool API V8.0 Command Line Manual

12-25 -extractText: Extracting Text

Processing

Extracts the text from the input PDF file and writes it to a text file.

The text file is written to the path given by the <outTextFilePath> parameter.
The output file setting (-o) is not required for this command.

Example of commands

[Executing example commands]

Extracts the text of test.pdf and outputs it to out.txt.

The extraction uses the following page and text-order settings:
- Page setting: 1st page and 3rd to 5th pages (-pageNo 0,2-4; page numbers are 0-origin).
- Text extraction order: Sort the text on each page by coordinates (-sort).

[Windows]

AHPDFToolCmd80.exe -extractText C:\sav\out.txt -pageNo 0,2-4 -sort -d C:\test\test.pdf

[Linux]

AHPDFToolCmd80 -extractText /home/antenna/sav/out.txt -pageNo 0,2-4 -sort -d /home/antenna/test/test.pdf
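Because -pageNo is 0-origin, the "1st page, 3rd to 5th page" selection above becomes "0,2-4". The following is a minimal Python sketch of that conversion; the function name to_page_no is hypothetical and is not part of the PDF Tool API, and it assumes that a hyphen range is accepted for any run of consecutive pages, as in the example.

```python
def to_page_no(pages):
    """Convert 1-based page numbers to a 0-origin -pageNo value such as "0,2-4"."""
    # Shift to 0-origin, drop duplicates, and sort ascending.
    zero_based = sorted({p - 1 for p in pages})
    if any(p < 0 for p in zero_based):
        raise ValueError("page numbers must be 1 or greater")
    parts = []
    i = 0
    while i < len(zero_based):
        start = zero_based[i]
        end = start
        # Extend the current run while the next page is consecutive.
        while i + 1 < len(zero_based) and zero_based[i + 1] == end + 1:
            i += 1
            end = zero_based[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


# 1st page and 3rd to 5th pages, as in the example commands.
print(to_page_no([1, 3, 4, 5]))  # 0,2-4
```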

Folder settings: supported

You can perform batch processing by specifying an input folder with the -d parameter.

If a folder is specified, text is extracted from each PDF file in the input folder. Specify the output folder with the <outTextFilePath> parameter.

Each output file is written to the specified folder under the input file's name, with the extension changed to ".txt".
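The naming rule for folder processing can be sketched in Python as follows. This only predicts the output path from the rule described here; the function name expected_output_path and the windows flag are hypothetical, and how the tool treats unusual file names (for example, names containing several dots) is not confirmed by this manual.

```python
from pathlib import PurePosixPath, PureWindowsPath


def expected_output_path(input_pdf, output_folder, windows=False):
    """Predict the text file path for one PDF processed in folder mode."""
    # Pure paths only manipulate strings; nothing is read from disk.
    path_cls = PureWindowsPath if windows else PurePosixPath
    # Keep the input file name and change only the extension to ".txt".
    name = path_cls(input_pdf).with_suffix(".txt").name
    return str(path_cls(output_folder) / name)


print(expected_output_path("/home/antenna/test/test.pdf", "/home/antenna/sav"))
```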

Parameters

Parameter	Content
<outTextFilePath>	[required] Sets the file path for the text output. When an input folder is specified with -d, set the output folder here. If there are multiple pages to be extracted, "pageX" is output in the first line.
-pageNo <Val>	Sets the page numbers to extract text from. Can be omitted. If not specified, text is extracted from all pages. Page numbers are 0-origin, so the first page is "0". To specify multiple pages, separate them with commas. (Example) "0,2-4"
-sort	Sorts the text on each page by coordinates.
-rect <left> <bottom> <right> <top>	Can be omitted. Sets the region from which text is obtained (unit: mm). If not specified, text is obtained from the entire page. If -sort is also specified, text is sorted within the specified region. If -rect is specified more than once, the first one is used and any subsequent ones are ignored.
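When the command is launched from a script, the parameters above can be assembled into an argument list. The following Python sketch is one way to do so. The function build_extract_text_args is hypothetical, the argument order mirrors the example commands, and the position of -rect before -d is an assumption that this manual does not confirm.

```python
def build_extract_text_args(out_text_path, input_path, page_no=None,
                            sort=False, rect=None, exe="AHPDFToolCmd80"):
    """Assemble an -extractText command line as a list of arguments."""
    # <outTextFilePath> is required and directly follows -extractText.
    args = [exe, "-extractText", out_text_path]
    if page_no is not None:
        # 0-origin page specification such as "0,2-4".
        args += ["-pageNo", page_no]
    if sort:
        args.append("-sort")
    if rect is not None:
        # Only one -rect is emitted, since subsequent ones are ignored.
        left, bottom, right, top = rect  # values in mm
        args += ["-rect"] + [str(v) for v in (left, bottom, right, top)]
    # -d names the input PDF file or input folder.
    args += ["-d", input_path]
    return args


print(build_extract_text_args("/home/antenna/sav/out.txt",
                              "/home/antenna/test/test.pdf",
                              page_no="0,2-4", sort=True))
```

On a machine where the tool is installed, the resulting list could be passed to subprocess.run(args, check=True). On Windows, pass exe="AHPDFToolCmd80.exe".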