Document Reading Module

A unified interface for extracting text and rendering images from a wide variety of document formats. Parsing is delegated to established libraries: Apache POI, Apache PDFBox, Jsoup, Jakarta Mail, and ODF Toolkit.

📄 Unified API 📑 Pagination 🖼️ Rendering 📎 Attachments ✂️ Smart Splitting

Supported Formats

| Format | Extension | Reader Class | Features |
|---|---|---|---|
| PDF | .pdf | PDFReader | Text, Pagination, Rendering |
| Word | .docx | DocxReader | Text, Tables |
| Word (Legacy) | .doc | DocReader | Text |
| Excel | .xlsx | XlsxReader | Text, Sheet-aware |
| Excel (Legacy) | .xls | XlsReader | Text, Sheet-aware |
| PowerPoint | .pptx | PptxReader | Text, Slide-aware, Notes |
| PowerPoint (Legacy) | .ppt | PptReader | Text, Slide-aware |
| OpenDocument | .odt | OdtReader | Text |
| Rich Text | .rtf | RtfReader | Text |
| HTML | .html, .htm | HTMLReader | Text, Pagination |
| Email | .eml | EmlReader | Text, Headers, Attachments |
| Plain Text | .txt (default) | TextReader | Text, Pagination |
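
The table above can be read as a lookup from file extension to reader class. The sketch below is only an illustration of that mapping: `readerByExtension` and `readerNameFor` are hypothetical names, and treating TextReader as the fallback for unknown extensions is an assumption based on the "(default)" note in the table. The module's real dispatch logic may differ.

```kotlin
// Hypothetical lookup mirroring the Supported Formats table.
val readerByExtension: Map<String, String> = mapOf(
  "pdf" to "PDFReader",
  "docx" to "DocxReader",
  "doc" to "DocReader",
  "xlsx" to "XlsxReader",
  "xls" to "XlsReader",
  "pptx" to "PptxReader",
  "ppt" to "PptReader",
  "odt" to "OdtReader",
  "rtf" to "RtfReader",
  "html" to "HTMLReader",
  "htm" to "HTMLReader",
  "eml" to "EmlReader",
  "txt" to "TextReader"
)

// Returns the reader class name for a file name. Unknown or missing
// extensions fall back to TextReader (an assumption, see above).
fun readerNameFor(fileName: String): String {
  val extension = fileName.substringAfterLast('.', "").lowercase()
  return readerByExtension[extension] ?: "TextReader"
}
```

The lookup is case-insensitive, so `readerNameFor("Report.PDF")` resolves to the same entry as `readerNameFor("report.pdf")`.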

Core Interfaces

DocumentReader

Base interface for all readers. Extends AutoCloseable for proper resource management.

interface DocumentReader : AutoCloseable {
  fun getText(): String
}
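
As a minimal sketch of this contract, the snippet below implements the interface with a hypothetical in-memory reader and shows that `use` calls `close()` when its block exits. `InMemoryReader` and `readAll` are illustrative names that are not part of the module, and the interface is repeated so the snippet compiles on its own.

```kotlin
// Copy of the interface shown above, repeated for self-containment.
interface DocumentReader : AutoCloseable {
  fun getText(): String
}

// Hypothetical reader for illustration only; not part of the module.
class InMemoryReader(private val content: String) : DocumentReader {
  var closed = false
    private set

  override fun getText(): String {
    check(!closed) { "Reader is already closed" }
    return content
  }

  override fun close() {
    closed = true
  }
}

// Reads the text inside a use block; the reader is closed when the block
// exits, whether it returns normally or throws.
fun readAll(reader: DocumentReader): String = reader.use { it.getText() }
```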

PaginatedDocumentReader

Extends DocumentReader for formats that support or simulate pagination.

interface PaginatedDocumentReader : DocumentReader {
  fun getPageCount(): Int
  fun getText(startPage: Int, endPage: Int): String
}
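
The usage section of this document reads the first page with `getText(0, 1)`, which suggests zero-based page indices and an exclusive end page. The sketch below works under that assumption with a hypothetical list-backed reader (`ListBackedReader` is not part of the module, and the interfaces are repeated so the snippet compiles on its own).

```kotlin
interface DocumentReader : AutoCloseable {
  fun getText(): String
}

interface PaginatedDocumentReader : DocumentReader {
  fun getPageCount(): Int
  fun getText(startPage: Int, endPage: Int): String
}

// Hypothetical reader for illustration; each list element is one page.
class ListBackedReader(private val pages: List<String>) : PaginatedDocumentReader {
  override fun getPageCount(): Int = pages.size

  // The whole document is simply the full page range.
  override fun getText(): String = getText(0, pages.size)

  // Assumption: startPage is zero-based and inclusive, endPage is exclusive.
  override fun getText(startPage: Int, endPage: Int): String {
    require(startPage in 0..endPage && endPage <= pages.size) {
      "Invalid page range $startPage..$endPage for ${pages.size} pages"
    }
    return pages.subList(startPage, endPage).joinToString("\n")
  }

  override fun close() {} // Nothing to release for in-memory data.
}
```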

RenderableDocumentReader

Extends DocumentReader for formats that can be rendered as images.

interface RenderableDocumentReader : DocumentReader {
  fun getPageCount(): Int
  fun renderImage(pageIndex: Int, dpi: Float): BufferedImage
}

Usage

Getting a Reader

The easiest way to obtain a reader is via the getDocumentReader() extension function on File:

kotlin
import com.simiacryptus.cognotik.docs.getDocumentReader
import com.simiacryptus.cognotik.docs.isDocumentFile
import java.io.File

val file = File("document.pdf")
if (file.isDocumentFile()) {
  file.getDocumentReader().use { reader ->
    val text = reader.getText()
    println(text)
  }
}

Checking File Support

Use the isDocumentFile() extension function to check if a file is supported:

kotlin
val file = File("example.docx")
if (file.isDocumentFile()) {
  // File format is supported
}

Handling Paginated Documents

kotlin
// Wrap the reader in use so it is closed after the pages are read
file.getDocumentReader().use { reader ->
  if (reader is PaginatedDocumentReader) {
    val pageCount = reader.getPageCount()
    println("Document has $pageCount pages")

    // Get text from first page only
    val firstPageText = reader.getText(0, 1)
    println(firstPageText)
  }
}
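
For long documents it can help to read a few pages at a time rather than the whole text at once. The helper below is a hypothetical sketch (`pageRanges` is not part of the module) that splits a page count into consecutive ranges, assuming `getText(startPage, endPage)` uses a zero-based start and an exclusive end as the example above suggests.

```kotlin
// Splits [0, pageCount) into consecutive (start, endExclusive) ranges of at
// most batchSize pages each. Illustrative helper, not part of the module.
fun pageRanges(pageCount: Int, batchSize: Int): List<Pair<Int, Int>> {
  require(pageCount >= 0) { "pageCount must not be negative" }
  require(batchSize > 0) { "batchSize must be positive" }
  return (0 until pageCount step batchSize).map { start ->
    start to minOf(start + batchSize, pageCount)
  }
}
```

Each pair can then be passed straight to a paginated reader, for example `for ((start, end) in pageRanges(reader.getPageCount(), 10)) reader.getText(start, end)`.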

Rendering PDF Pages

kotlin
// Wrap the reader in use so the underlying document is closed after rendering
file.getDocumentReader().use { reader ->
  if (reader is RenderableDocumentReader) {
    val pageCount = reader.getPageCount()
    for (i in 0 until pageCount) {
      val image = reader.renderImage(i, 150f) // 150 DPI
      // Process the BufferedImage
    }
  }
}
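
One common way to process a rendered page is to write it to disk. The sketch below uses the JDK's `ImageIO` to save a `BufferedImage` as PNG; the `savePage` helper and its file naming scheme are assumptions made for illustration and are not part of the module.

```kotlin
import java.awt.image.BufferedImage
import java.io.File
import javax.imageio.ImageIO

// Writes one rendered page as page-001.png, page-002.png, and so on.
// Hypothetical helper; the naming scheme is an illustrative choice.
fun savePage(image: BufferedImage, outputDir: File, pageIndex: Int): File {
  outputDir.mkdirs()
  val target = File(outputDir, "page-%03d.png".format(pageIndex + 1))
  // ImageIO.write returns false when no writer supports the requested format.
  check(ImageIO.write(image, "png", target)) { "No PNG writer available" }
  return target
}
```

Inside the rendering loop above this could be called as `savePage(image, File("out"), i)`.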

Configuration

The Settings data class allows you to configure behavior for certain readers:

kotlin
data class Settings(
  val dpi: Float = 120f,
  val maxPages: Int = Int.MAX_VALUE,
  val outputFormat: String = "PNG",
  val fileInputs: List<String>? = null,
  val showImages: Boolean = true,
  val pagesPerBatch: Int = 1,
  val saveImageFiles: Boolean = false,
  val saveTextFiles: Boolean = false,
  val saveFinalJson: Boolean = true,
  val fastMode: Boolean = true,
  val addLineNumbers: Boolean = false
)

Example with TextReader

kotlin
val settings = Settings(addLineNumbers = true)
// Close the reader once the text has been read
TextReader(file).use { reader ->
  reader.configure(settings)
  val textWithLineNumbers = reader.getText()
  println(textWithLineNumbers)
}
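
To give a feel for what line-numbered output can look like, here is a small standalone sketch. The `numberLines` helper and its `n: ` prefix format are assumptions; this document does not specify the exact format TextReader produces.

```kotlin
// Prefixes each line with a right-aligned, 1-based line number.
// The "n: " format is an assumption; the module's actual output may differ.
fun numberLines(text: String): String {
  val lines = text.lines()
  // Pad to the width of the largest line number so the text stays aligned.
  val width = lines.size.toString().length
  return lines
    .mapIndexed { index, line -> "${(index + 1).toString().padStart(width)}: $line" }
    .joinToString("\n")
}
```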

Reader Details

PDFReader

Implements both PaginatedDocumentReader and RenderableDocumentReader. Uses Apache PDFBox for text extraction and image rendering.

  • Supports page-range text extraction
  • Renders pages to BufferedImage at configurable DPI
  • Automatically registers ImageIO service providers for proper image format support

DocxReader / DocReader

Extracts text from Microsoft Word files using Apache POI.

  • DocxReader: Extracts text from paragraphs and tables (tab-separated cells)
  • DocReader: Extracts text from legacy .doc files using POI's HWPF component

XlsxReader / XlsReader

Extracts text from Excel spreadsheets.

  • Processes all sheets in the workbook
  • Handles STRING, NUMERIC, BOOLEAN, and FORMULA cell types
  • Separates cell values within a row with tabs

PptxReader / PptReader

Extracts text from PowerPoint presentations.

  • Extracts slide titles and text from all text shapes
  • PptxReader also extracts speaker notes

HTMLReader

Parses HTML files using Jsoup.

  • Implements PaginatedDocumentReader with smart page splitting
  • Configurable line numbering via Settings
  • Splits large documents into ~16KB pages at paragraph boundaries
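
The paragraph-boundary splitting described above can be approximated with a greedy packing loop. This is an illustrative sketch only, not the module's actual code: `splitIntoPages` is a hypothetical name, and measuring size in UTF-8 bytes is an assumption.

```kotlin
// Greedily packs paragraphs into pages of at most maxBytes UTF-8 bytes.
// A single paragraph larger than maxBytes becomes its own oversized page,
// because this sketch never splits inside a paragraph.
fun splitIntoPages(paragraphs: List<String>, maxBytes: Int = 16 * 1024): List<String> {
  require(maxBytes > 0) { "maxBytes must be positive" }
  val pages = mutableListOf<String>()
  val current = StringBuilder()
  var currentBytes = 0
  for (paragraph in paragraphs) {
    val size = paragraph.toByteArray(Charsets.UTF_8).size + 1 // +1 for the newline
    // Start a new page when adding this paragraph would exceed the limit.
    if (currentBytes > 0 && currentBytes + size > maxBytes) {
      pages += current.toString()
      current.setLength(0)
      currentBytes = 0
    }
    current.append(paragraph).append('\n')
    currentBytes += size
  }
  if (currentBytes > 0) pages += current.toString()
  return pages
}
```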

TextReader

Reads plain text files with pagination support.

  • Implements PaginatedDocumentReader
  • Uses entropy-based splitting algorithm to find optimal page breaks
  • Prefers splitting at empty lines
  • Configurable line numbering via Settings
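
The entropy-based algorithm itself is not described here, so the sketch below swaps in a much simpler heuristic that only illustrates the preference for empty lines: it looks for the blank line closest to a target page length. `chooseBreak`, `targetLines`, and `window` are hypothetical names introduced for this example.

```kotlin
import kotlin.math.abs

// Returns how many lines the first page should contain when a page should
// hold about targetLines lines. Searches up to `window` lines either side of
// the target and prefers the closest blank line; otherwise breaks at the
// target. A simplified stand-in, not the entropy-based algorithm.
fun chooseBreak(lines: List<String>, targetLines: Int, window: Int = 5): Int {
  require(targetLines > 0) { "targetLines must be positive" }
  require(window >= 0) { "window must not be negative" }
  if (lines.size <= targetLines) return lines.size
  val from = maxOf(1, targetLines - window)
  val to = minOf(lines.size - 1, targetLines + window)
  val blank = (from..to)
    .filter { lines[it].isBlank() }
    .minByOrNull { abs(it - targetLines) }
  // Break after the blank line so it stays with the preceding page.
  return if (blank != null) blank + 1 else targetLines
}
```

The returned count can be used directly with `lines.subList(0, count)` to obtain the first page.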

EmlReader

Parses email files using Jakarta Mail.

  • Extracts email headers (From, To, CC, Subject, Date)
  • Processes message body (text/plain and text/html)
  • Recursively processes attachments using appropriate document readers
  • Automatically cleans up temporary files on close
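
The cleanup-on-close behavior in the last bullet can be sketched with a small `AutoCloseable` that remembers the temporary files it creates. `TempFileTracker` and `store` are hypothetical names; how EmlReader actually stages attachments is not shown in this document.

```kotlin
import java.io.File

// Hypothetical helper: tracks temporary files and deletes them on close.
class TempFileTracker : AutoCloseable {
  private val files = mutableListOf<File>()

  // Writes the bytes to a new temp file that keeps the original extension,
  // so an extension-based reader lookup would still work on it.
  fun store(name: String, bytes: ByteArray): File {
    val suffix = "." + name.substringAfterLast('.', "tmp")
    val file = File.createTempFile("attachment-", suffix)
    file.writeBytes(bytes)
    files += file
    return file
  }

  // Deletes every tracked file; safe to call more than once.
  override fun close() {
    files.forEach { it.delete() }
    files.clear()
  }
}
```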

OdtReader / RtfReader

  • OdtReader: Reads OpenDocument Text files using ODF Toolkit
  • RtfReader: Reads Rich Text Format files using Java's built-in RTF support (RTFEditorKit)
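
Because RTFEditorKit ships with the JDK, RTF-to-text conversion can be sketched without any third-party dependency. The `rtfToText` helper below is an illustrative assumption about how such a conversion might look, not the module's RtfReader code.

```kotlin
import java.io.ByteArrayInputStream
import javax.swing.text.rtf.RTFEditorKit

// Converts RTF markup to plain text with the JDK's RTFEditorKit.
fun rtfToText(rtf: String): String {
  val kit = RTFEditorKit()
  val document = kit.createDefaultDocument()
  // RTF markup is 7-bit ASCII, so ISO-8859-1 bytes carry it unchanged.
  val input = ByteArrayInputStream(rtf.toByteArray(Charsets.ISO_8859_1))
  kit.read(input, document, 0)
  return document.getText(0, document.length)
}
```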

Dependencies

| Library | Used for |
|---|---|
| Apache POI | .doc, .docx, .xls, .xlsx, .ppt, .pptx |
| Apache PDFBox | PDF processing & rendering |
| Jsoup | HTML parsing & extraction |
| Jakarta Mail | .eml file parsing |
| ODF Toolkit | .odt files |
| Java Swing | RTF support (built-in) |

Resource Management

All readers implement AutoCloseable. Always wrap them in a Kotlin use block (or try-with-resources in Java) so that resources are released even when an exception is thrown:

kotlin
file.getDocumentReader().use { reader ->
  // Work with reader
} // Automatically closed

Note: The PDFReader temporarily sets the thread's context class loader during image rendering to ensure proper ImageIO service provider discovery. This is handled internally and transparent to users.
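
The general pattern behind that note, setting the context class loader temporarily and restoring it afterwards, can be sketched as follows. `withContextClassLoader` is a hypothetical helper shown for illustration; the PDFReader's internal code may be structured differently.

```kotlin
// Runs block with the given class loader as the current thread's context
// class loader, then restores the previous one even if block throws.
fun <T> withContextClassLoader(loader: ClassLoader?, block: () -> T): T {
  val thread = Thread.currentThread()
  val previous = thread.contextClassLoader
  thread.contextClassLoader = loader
  try {
    return block()
  } finally {
    thread.contextClassLoader = previous
  }
}
```

Restoring in `finally` matters because the context class loader is per-thread state that outlives the call, for example on pooled threads.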