About Antenna House PDFXML Conversion Library

Convert PDF to XML

Antenna House PDFXML Conversion Library lets you unlock the content of your legacy PDFs. To reuse content from old PDFs, you no longer need to retype it or reconstruct it from the PDF binary format. Antenna House PDFXML is designed for organizations that need to convert large volumes of PDFs into XML, HTML5, XSL-FO, DocBook, or other formats. The library extracts text, tables, and images from PDFs and converts them to an XML format called "AHPDFXML". That XML can then be transformed into the desired output by applying XSLT stylesheets.

Benefits and uses for XML include:

  • Content Reusability.
  • Improved Searchability.
  • Good for Accessibility.
  • Promotes Interoperability and Data Integration.
  • Platform Independent.
  • Vendor Independent.

The Antenna House PDFXML Conversion Library is a C/C++ library, bundled with a command-line program, that generates a richly structured XML document from PDFs using Antenna House’s PDF Analyzer Technology.

How it works:

  • Loads the information for each page from the PDF.
  • Extracts vertical and horizontal lines from line drawings.
  • Analyzes the tables.
  • Creates the text in each table cell.
  • Creates text lines of the body.
  • Creates paragraphs from lines.
  • Creates the area information from paragraphs.
  • Creates sections (columns).
  • Outputs the information for each page to AHPDFXML.
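
As an illustration of one step in the list above (creating paragraphs from lines), here is a minimal sketch. It is not Antenna House's PDF Analyzer algorithm, which the text does not describe; it only assumes that each text line carries a vertical position and a height, and it starts a new paragraph whenever the gap to the previous line is large. The `TextLine` type, the `group_into_paragraphs` function, and the `gap_factor` threshold are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextLine:
    """Hypothetical text line: not a type from the PDFXML library."""
    text: str
    top: float     # y coordinate of the top edge, increasing down the page
    height: float  # line height in the same units


def group_into_paragraphs(lines: List[TextLine], gap_factor: float = 0.6) -> List[str]:
    """Join consecutive lines into paragraphs.

    A new paragraph starts when the vertical gap to the previous line
    exceeds gap_factor times the previous line's height (an assumed heuristic).
    """
    paragraphs: List[List[str]] = []
    previous: Optional[TextLine] = None
    for line in sorted(lines, key=lambda item: item.top):
        if previous is None:
            paragraphs.append([line.text])
        else:
            gap = line.top - (previous.top + previous.height)
            if gap > gap_factor * previous.height:
                paragraphs.append([line.text])      # large gap: new paragraph
            else:
                paragraphs[-1].append(line.text)    # small gap: same paragraph
        previous = line
    return [" ".join(parts) for parts in paragraphs]


if __name__ == "__main__":
    page = [
        TextLine("First line of a paragraph", top=0, height=10),
        TextLine("continues here.", top=12, height=10),
        TextLine("A new paragraph.", top=40, height=10),
    ]
    for paragraph in group_into_paragraphs(page):
        print(paragraph)
```

A real analyzer would also have to weigh indentation, font changes, and column boundaries; the single threshold here only shows the general shape of the step.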

What is AHPDFXML?

The XML format output by this conversion library is called the Antenna House PDFXML format (AHPDFXML). It is a verbose, intermediate XML format defined by Antenna House that represents the content of a PDF: the text, tables, and images in the PDF are converted into XML expressions.

Antenna House PDFXML consists of multiple files:

  • Catalog File (input file for stylesheets) – manages the AHPDFXML files.
  • Document File – stores the main body of the PDF document's structure.
  • Style File – defines the styles applied to the respective elements of a document.
  • External Files – images and graphics output as JPEG, PNG, BMP, SVG, etc.

See the Antenna House PDFXML Schema Documentation for more detail.

The resulting XML can then be transformed with XSLT into any format that can represent the document structure, such as XSL-FO, DocBook, HTML5, or plain text. With Antenna House PDFXML, you can use PDF content in a wide range of environments. Converting PDF content to XML makes the data much easier to reuse, transform, manipulate, and search. Applying an XSLT stylesheet gives you the flexibility to process the data according to how it will be used.
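
To make the transformation step concrete, here is a minimal sketch of turning an intermediate XML document into HTML. Two assumptions are worth stating loudly: the element names `document`, `page`, and `paragraph` are hypothetical and are not taken from the AHPDFXML schema, and the sketch walks the tree with Python's standard `xml.etree.ElementTree` as a stand-in for an XSLT stylesheet, because the Python standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical intermediate XML; NOT the real AHPDFXML structure.
SAMPLE = """<document>
  <page number="1">
    <paragraph>Hello, PDF &amp; XML.</paragraph>
    <paragraph>Second paragraph.</paragraph>
  </page>
</document>"""


def to_html(xml_text: str) -> str:
    """Map each <page> to a <section> and each <paragraph> to a <p>."""
    root = ET.fromstring(xml_text)
    parts = []
    for page in root.iter("page"):
        number = escape(page.get("number", ""))
        parts.append(f'<section data-page="{number}">')
        for para in page.iter("paragraph"):
            # Collect all text inside the paragraph, including nested elements.
            text = "".join(para.itertext()).strip()
            parts.append(f"<p>{escape(text)}</p>")
        parts.append("</section>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(to_html(SAMPLE))
```

In an actual workflow the same mapping would normally be written as XSLT templates and applied to the catalog file, with the real element names taken from the schema documentation.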

PDF Support

Antenna House PDFXML Conversion Library supports:

  • PDF 1.3 to 1.7.
  • PDFs compliant with ISO 32000-1:2008.
  • PDFs created with Antenna House software.
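
Before submitting a large batch, it can help to pre-check which files declare a version in the range listed above. The sketch below is not part of the PDFXML library; it only reads the `%PDF-M.N` marker that PDF files carry at the start of the file. The function names are hypothetical, and the check is approximate, since from PDF 1.4 onward a `/Version` entry in the document catalog can override the header.

```python
import re
from typing import Optional, Tuple

# The PDF header looks like "%PDF-1.7" and sits at (or near) the start of the file.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the header, or None if no header is found."""
    match = HEADER_RE.search(data[:1024])  # look only near the start of the file
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def in_listed_range(version: Optional[Tuple[int, int]]) -> bool:
    """True when the header version falls within PDF 1.3 to 1.7."""
    return version is not None and (1, 3) <= version <= (1, 7)


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            version = pdf_header_version(handle.read(1024))
        status = "ok" if in_listed_range(version) else "check manually"
        print(f"{path}: header version {version} ({status})")
```

Files flagged for a manual check are not necessarily unsupported; the flag only means the header alone does not place them within 1.3 to 1.7.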