About GroupDocs.Parser for Java

A Java API for parsing and extracting images and text (with metadata) from documents, presentations, archives, and emails

GroupDocs.Parser for Java is a text, image and metadata extraction API for applications that need to parse raw, structured and formatted text. It also retrieves metadata from supported file formats. GroupDocs.Parser for Java can extract text and metadata from password-protected files in all popular formats, including word processing documents, Microsoft Excel spreadsheets, Microsoft PowerPoint presentations, Microsoft OneNote files, PDF files and ZIP archives.

Supported file formats

Microsoft Office formats

  • Word: DOCX, DOC, DOCM, DOT, DOTX, DOTM, RTF
  • Excel: XLSX, XLS, XLSM, XLSB, XLT, XLTM, XLTX, XLAM, SXC, SpreadsheetML
  • PowerPoint: PPT, PPTX, PPS, PPSX, PPSM, POT, POTM, POTX, PPTM

Images and other formats

  • Portable: PDF
  • Images: JPG, BMP, PNG, TIFF, GIF, DICOM, WEBP
  • Other office formats: ODT, OTT, OTS, ODS, ODP, OTP, ODG

Other formats

  • Web: HTML, MHTML
  • Archives: ZIP, TAR, 7Z
  • Ebooks: CHM, EPUB, FB2, MOBI
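
The format lists above can be turned into a simple pre-check before a file is handed to a parser. The sketch below is a minimal illustration that uses only the Java standard library. The `FormatLookup` class and its `familyOf` method are hypothetical names introduced here and are not part of the GroupDocs.Parser API. It covers only a subset of the listed extensions and assumes the file extension reflects the real format.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper (not part of GroupDocs.Parser): maps a file name
// to one of the format families listed above, based on its extension.
final class FormatLookup {

    // Subset of the extensions from the lists above, grouped by family.
    private static final Map<String, Set<String>> FAMILIES = Map.of(
        "Word", Set.of("docx", "doc", "docm", "dot", "dotx", "dotm", "rtf"),
        "Excel", Set.of("xlsx", "xls", "xlsm", "xlsb", "xlt", "xltm", "xltx", "xlam"),
        "PowerPoint", Set.of("ppt", "pptx", "pps", "ppsx", "ppsm", "pot", "potm", "potx", "pptm"),
        "Portable", Set.of("pdf"),
        "Web", Set.of("html", "mhtml"),
        "Archives", Set.of("zip", "tar", "7z"),
        "Ebooks", Set.of("chm", "epub", "fb2", "mobi"));

    private FormatLookup() {
    }

    // Returns the family name, or an empty Optional when the file has no
    // extension or the extension is not in the table.
    static Optional<String> familyOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return FAMILIES.entrySet().stream()
            .filter(entry -> entry.getValue().contains(ext))
            .map(Map.Entry::getKey)
            .findFirst();
    }
}
```

A caller could use `FormatLookup.familyOf("report.docx")` to skip unsupported files early. A production check would more likely inspect the file content rather than trust the extension.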

GroupDocs.Parser for Java features

  • Extract text - Extract textual information from various file formats such as office documents, PDF files and images for easy readability and analysis.
  • Extract images - Retrieve visual content from diverse sources such as office documents and PDF files for convenient access and use.
  • Scan QR Codes - Detect and decode QR codes present within office documents, PDF files, or visual content for efficient information retrieval.
  • Extract data from email attachments and archives - Gather valuable information from email messages, file attachments, and compressed data sources for effective analysis and utilization.
  • Extract tables - Identify and extract tabular data from PDF documents for organized analysis and use.
  • Extract hyperlinks - Locate and extract hyperlinks and email addresses within office documents or PDF files for efficient access.
  • Parse PDF forms - PDF forms are digital documents with fillable fields that let users enter information electronically. The Java API can be used to extract data from these forms for efficient processing.
  • Parse data by templates - Create custom templates and use them with the Java API to parse specific information from PDF files, simplifying data extraction.
  • Search text in documents - Quickly locate specific words or patterns within documents.
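
To make the idea of searching for words or patterns concrete, here is a minimal sketch that runs a case-insensitive regular-expression search over text that has already been extracted into a `String`. It uses only `java.util.regex` from the Java standard library. The `TextSearch` class and its `find` method are hypothetical names for illustration and do not represent the GroupDocs.Parser search API, which may work differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not the GroupDocs.Parser API): finds where a
// pattern occurs in text that was already extracted from a document.
final class TextSearch {

    private TextSearch() {
    }

    // Returns the zero-based character offset of every match of the
    // regular expression, ignoring case, in order of appearance.
    static List<Integer> find(String text, String regex) {
        List<Integer> offsets = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        while (matcher.find()) {
            offsets.add(matcher.start());
        }
        return offsets;
    }
}
```

For example, `TextSearch.find("Invoice 42, invoice 43", "invoice \\d+")` returns the offsets 0 and 12. Returning offsets rather than the matched strings lets a caller show surrounding context for each hit.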