Supported Formats
| Format | Extension | Reader Class | Features |
|---|---|---|---|
| PDF | .pdf | PDFReader | Text, Pagination, Rendering |
| Word | .docx | DocxReader | Text, Tables |
| Word (Legacy) | .doc | DocReader | Text |
| Excel | .xlsx | XlsxReader | Text, Sheet-aware |
| Excel (Legacy) | .xls | XlsReader | Text, Sheet-aware |
| PowerPoint | .pptx | PptxReader | Text, Slide-aware, Notes |
| PowerPoint (Legacy) | .ppt | PptReader | Text, Slide-aware |
| OpenDocument | .odt | OdtReader | Text |
| Rich Text | .rtf | RtfReader | Text |
| HTML | .html, .htm | HTMLReader | Text, Pagination |
| Email | .eml | EmlReader | Text, Headers, Attachments |
| Plain Text | .txt (default) | TextReader | Text, Pagination |
Core Interfaces
DocumentReader
Base interface for all readers. Extends AutoCloseable for proper resource management.
interface DocumentReader : AutoCloseable {
    fun getText(): String
}
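A minimal sketch of what an implementation might look like. `InMemoryReader` and `readAll` are hypothetical names used for illustration only and are not part of the library; the interface is repeated so the snippet compiles on its own.

```kotlin
// Interface repeated from above so the sketch compiles on its own.
interface DocumentReader : AutoCloseable {
    fun getText(): String
}

// Hypothetical reader, for illustration only: serves text held in memory.
class InMemoryReader(private val content: String) : DocumentReader {
    var closed = false
        private set

    override fun getText(): String {
        check(!closed) { "Reader is already closed" }
        return content
    }

    override fun close() {
        closed = true
    }
}

// `use` calls close() when the block finishes, even if it throws.
fun readAll(reader: DocumentReader): String = reader.use { it.getText() }
```

Because the interface extends AutoCloseable, any implementation works with Kotlin's `use` and with Java's try-with-resources.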
PaginatedDocumentReader
Extends DocumentReader for formats that support or simulate pagination.
interface PaginatedDocumentReader : DocumentReader {
    fun getPageCount(): Int
    fun getText(startPage: Int, endPage: Int): String
}
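The page-range contract is not spelled out here. The sketch below assumes zero-based indices with an inclusive `startPage` and an exclusive `endPage`, which matches the `getText(0, 1)` call used for the first page in the Usage section. `ListPagedReader` is a hypothetical class, not part of the library.

```kotlin
// Interfaces repeated from above so the sketch compiles on its own.
interface DocumentReader : AutoCloseable {
    fun getText(): String
}

interface PaginatedDocumentReader : DocumentReader {
    fun getPageCount(): Int
    fun getText(startPage: Int, endPage: Int): String
}

// Hypothetical reader backed by a list of page strings.
class ListPagedReader(private val pages: List<String>) : PaginatedDocumentReader {
    override fun getPageCount(): Int = pages.size

    override fun getText(): String = getText(0, pages.size)

    // ASSUMPTION: startPage is inclusive, endPage is exclusive, both zero-based.
    override fun getText(startPage: Int, endPage: Int): String {
        require(startPage in 0..endPage && endPage <= pages.size) {
            "Invalid page range $startPage..$endPage for ${pages.size} pages"
        }
        return pages.subList(startPage, endPage).joinToString("\n")
    }

    override fun close() {
        // Nothing to release for an in-memory list.
    }
}
```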
RenderableDocumentReader
Extends DocumentReader for formats that can be rendered as images.
import java.awt.image.BufferedImage

interface RenderableDocumentReader : DocumentReader {
    fun getPageCount(): Int
    fun renderImage(pageIndex: Int, dpi: Float): BufferedImage
}
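The `dpi` argument determines the pixel size of the returned image: a page's physical size in inches multiplied by the DPI. The sketch below illustrates that relationship with a blank US Letter page. `inchesToPixels` and `renderBlankLetterPage` are hypothetical helpers; how a real reader rounds or scales may differ.

```kotlin
import java.awt.image.BufferedImage
import kotlin.math.roundToInt

// Converts a physical length in inches to pixels at the given resolution.
fun inchesToPixels(inches: Float, dpi: Float): Int = (inches * dpi).roundToInt()

// Hypothetical stand-in for a renderer: returns a white US Letter page (8.5 x 11 in).
fun renderBlankLetterPage(dpi: Float): BufferedImage {
    require(dpi > 0f) { "dpi must be positive" }
    val width = inchesToPixels(8.5f, dpi)
    val height = inchesToPixels(11f, dpi)
    val image = BufferedImage(width, height, BufferedImage.TYPE_INT_RGB)
    // Fill every pixel with white (0xFFFFFF).
    val white = IntArray(width * height) { 0xFFFFFF }
    image.setRGB(0, 0, width, height, white, 0, width)
    return image
}
```

Doubling the DPI doubles both dimensions, so memory use grows fourfold. That is worth keeping in mind when rendering many pages.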
Usage
Getting a Reader
The easiest way to obtain a reader is via the File extension function:
import com.simiacryptus.cognotik.docs.getDocumentReader
import com.simiacryptus.cognotik.docs.isDocumentFile
import java.io.File

val file = File("document.pdf")
if (file.isDocumentFile()) {
    file.getDocumentReader().use { reader ->
        val text = reader.getText()
        println(text)
    }
}
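Conceptually, the extension function picks a reader from the file's extension. The lookup below is only a sketch built from the Supported Formats table. `knownExtensions` and `readerNameFor` are hypothetical names, and the fallback to TextReader is one reading of the "(default)" note in that table; the library's actual dispatch may differ.

```kotlin
import java.io.File

// ASSUMPTION: extensions copied from the Supported Formats table.
val knownExtensions: Map<String, String> = mapOf(
    "pdf" to "PDFReader",
    "docx" to "DocxReader", "doc" to "DocReader",
    "xlsx" to "XlsxReader", "xls" to "XlsReader",
    "pptx" to "PptxReader", "ppt" to "PptReader",
    "odt" to "OdtReader", "rtf" to "RtfReader",
    "html" to "HTMLReader", "htm" to "HTMLReader",
    "eml" to "EmlReader", "txt" to "TextReader"
)

// Hypothetical helper: case-insensitive lookup that falls back to TextReader.
fun readerNameFor(file: File): String =
    knownExtensions[file.extension.lowercase()] ?: "TextReader"
```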
Checking File Support
Use the isDocumentFile() extension function to check if a file is supported:
val file = File("example.docx")
if (file.isDocumentFile()) {
    // File format is supported
}
Handling Paginated Documents
file.getDocumentReader().use { reader ->
    if (reader is PaginatedDocumentReader) {
        val pageCount = reader.getPageCount()
        println("Document has $pageCount pages")
        // Get text from the first page only
        val firstPageText = reader.getText(0, 1)
    }
}
Rendering PDF Pages
file.getDocumentReader().use { reader ->
    if (reader is RenderableDocumentReader) {
        val pageCount = reader.getPageCount()
        for (i in 0 until pageCount) {
            val image = reader.renderImage(i, 150f) // 150 DPI
            // Process the BufferedImage
        }
    }
}
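A rendered BufferedImage can be encoded with the standard `javax.imageio` API. The sketch below shows one way to get PNG bytes from a page image; `toPngBytes` is a hypothetical helper, not part of the library.

```kotlin
import java.awt.image.BufferedImage
import java.io.ByteArrayOutputStream
import javax.imageio.ImageIO

// Encodes an image as PNG bytes. ImageIO.write returns false when no
// suitable writer is registered, so that case is surfaced as an error.
fun toPngBytes(image: BufferedImage): ByteArray {
    val out = ByteArrayOutputStream()
    check(ImageIO.write(image, "png", out)) { "No PNG writer available" }
    return out.toByteArray()
}
```

To write to disk instead, `ImageIO.write` also accepts a `File` as its third argument.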
Configuration
The Settings data class allows you to configure behavior for certain readers:
data class Settings(
    val dpi: Float = 120f,
    val maxPages: Int = Int.MAX_VALUE,
    val outputFormat: String = "PNG",
    val fileInputs: List<String>? = null,
    val showImages: Boolean = true,
    val pagesPerBatch: Int = 1,
    val saveImageFiles: Boolean = false,
    val saveTextFiles: Boolean = false,
    val saveFinalJson: Boolean = true,
    val fastMode: Boolean = true,
    val addLineNumbers: Boolean = false
)
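Because Settings is a data class with defaults for every field, you can override only the values you need, either with named arguments or with `copy`. The class is repeated below so the snippet compiles on its own; the variable names are illustrative.

```kotlin
// Repeated from above so the sketch compiles on its own.
data class Settings(
    val dpi: Float = 120f,
    val maxPages: Int = Int.MAX_VALUE,
    val outputFormat: String = "PNG",
    val fileInputs: List<String>? = null,
    val showImages: Boolean = true,
    val pagesPerBatch: Int = 1,
    val saveImageFiles: Boolean = false,
    val saveTextFiles: Boolean = false,
    val saveFinalJson: Boolean = true,
    val fastMode: Boolean = true,
    val addLineNumbers: Boolean = false
)

// Start from the defaults and change only what a given job needs.
val defaults = Settings()
val highResPreview = defaults.copy(dpi = 300f, maxPages = 5)
```

`copy` leaves the original instance untouched, so one base configuration can be shared safely.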
Example with TextReader
val settings = Settings(addLineNumbers = true)
val textWithLineNumbers = TextReader(file).use { reader ->
    reader.configure(settings)
    reader.getText()
}
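The exact line-number format that TextReader emits is not documented here. As a rough illustration of what line numbering does, the sketch below prefixes each line with a one-based index. `numberLines` and its "N: " prefix are assumptions, not the library's output format.

```kotlin
// ASSUMPTION: the "N: " prefix is illustrative; the library's format may differ.
fun numberLines(text: String): String =
    text.lineSequence()
        .mapIndexed { index, line -> "${index + 1}: $line" }
        .joinToString("\n")
```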
Reader Details
PDFReader
Implements both PaginatedDocumentReader and RenderableDocumentReader. Uses Apache PDFBox for text extraction and image rendering.
- Supports page-range text extraction
- Renders pages to BufferedImage at a configurable DPI
- Automatically registers ImageIO service providers for proper image format support
DocxReader / DocReader
Extracts text from Microsoft Word files using Apache POI.
- DocxReader: Extracts text from paragraphs and tables (tab-separated cells)
- DocReader: Extracts text from legacy .doc files using the HWPF library
XlsxReader / XlsReader
Extracts text from Excel spreadsheets.
- Processes all sheets in the workbook
- Handles STRING, NUMERIC, BOOLEAN, and FORMULA cell types
- Tab-separated cell values within rows
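The cell handling described above can be sketched without a spreadsheet library. `Cell`, `cellText`, and `rowText` are a hypothetical model standing in for the library's cell types. How formulas are emitted (expression text or computed value) is not stated here, so this sketch assumes the expression text.

```kotlin
// Hypothetical cell model standing in for the spreadsheet library's cell types.
sealed class Cell {
    data class Str(val value: String) : Cell()
    data class Num(val value: Double) : Cell()
    data class Bool(val value: Boolean) : Cell()
    data class Formula(val expression: String) : Cell()
}

// ASSUMPTION: formulas are emitted as their expression text; a real reader
// might emit the cached result instead.
fun cellText(cell: Cell): String = when (cell) {
    is Cell.Str -> cell.value
    is Cell.Num -> cell.value.toString()
    is Cell.Bool -> cell.value.toString()
    is Cell.Formula -> cell.expression
}

// Joins the cells of one row with tab characters.
fun rowText(row: List<Cell>): String = row.joinToString("\t") { cellText(it) }
```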
PptxReader / PptReader
Extracts text from PowerPoint presentations.
- Extracts slide titles and text from all text shapes
- PptxReader also extracts speaker notes
HTMLReader
Parses HTML files using Jsoup.
- Implements PaginatedDocumentReader with smart page splitting
- Configurable line numbering via Settings
- Splits large documents into ~16KB pages at paragraph boundaries
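One way to split at paragraph boundaries is greedy packing: add paragraphs to the current page until the next one would exceed the size limit. The sketch below is an assumption about the general approach, not the reader's actual algorithm. `splitIntoPages` is a hypothetical function, and an oversized paragraph simply becomes its own page.

```kotlin
// ASSUMPTION: greedy packing by UTF-8 byte size; paragraphs are never split.
fun splitIntoPages(paragraphs: List<String>, maxBytes: Int = 16 * 1024): List<String> {
    val pages = mutableListOf<String>()
    val current = StringBuilder()
    var currentBytes = 0
    for (paragraph in paragraphs) {
        val size = paragraph.toByteArray(Charsets.UTF_8).size
        // Flush the current page if adding the separator and paragraph would overflow it.
        if (current.isNotEmpty() && currentBytes + 2 + size > maxBytes) {
            pages += current.toString()
            current.setLength(0)
            currentBytes = 0
        }
        if (current.isNotEmpty()) {
            current.append("\n\n") // blank line between paragraphs
            currentBytes += 2
        }
        current.append(paragraph)
        currentBytes += size
    }
    if (current.isNotEmpty()) pages += current.toString()
    return pages
}
```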
TextReader
Reads plain text files with pagination support.
- Implements PaginatedDocumentReader
- Uses an entropy-based splitting algorithm to find optimal page breaks
- Prefers splitting at empty lines
- Configurable line numbering via Settings
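The entropy-based scoring is not described in this document, so the sketch below does not attempt to reproduce it. It illustrates only the stated preference for empty lines, using a simpler nearest-blank-line heuristic. `chooseBreak` is a hypothetical function.

```kotlin
import kotlin.math.abs

// Returns the line index to break at: the blank line closest to targetLine,
// or targetLine itself when the text has no blank lines.
// ASSUMPTION: simple nearest-empty-line heuristic, NOT the library's entropy-based scoring.
fun chooseBreak(lines: List<String>, targetLine: Int): Int {
    val target = targetLine.coerceIn(0, lines.size)
    val blankIndexes = lines.indices.filter { lines[it].isBlank() }
    return blankIndexes.minByOrNull { abs(it - target) } ?: target
}
```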
EmlReader
Parses email files using Jakarta Mail.
- Extracts email headers (From, To, CC, Subject, Date)
- Processes message body (text/plain and text/html)
- Recursively processes attachments using appropriate document readers
- Automatically cleans up temporary files on close
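Cleaning up temporary files on close can be sketched with a small AutoCloseable that remembers every file it creates. `TempFileScope` is a hypothetical class used for illustration; the reader's internal bookkeeping may look different.

```kotlin
import java.io.File

// Hypothetical helper: tracks temporary files and deletes them on close().
class TempFileScope : AutoCloseable {
    private val files = mutableListOf<File>()

    // Creates a temp file in the default temp directory and remembers it.
    // Note: File.createTempFile requires a prefix of at least three characters.
    fun create(prefix: String, suffix: String): File =
        File.createTempFile(prefix, suffix).also { files += it }

    override fun close() {
        files.forEach { it.delete() }
        files.clear()
    }
}
```

Used inside a `use` block, the scope removes extracted attachments even when processing fails partway through.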
OdtReader / RtfReader
- OdtReader: Reads OpenDocument Text files using ODF Toolkit
- RtfReader: Reads Rich Text Format files using Java's built-in RTF support (RTFEditorKit)
Dependencies
Resource Management
All readers implement AutoCloseable. Always wrap them in a Kotlin use block (or try-with-resources in Java) to ensure proper cleanup:
file.getDocumentReader().use { reader ->
    // Work with reader
} // Automatically closed
Note: The PDFReader temporarily sets the thread's context class loader during image rendering to ensure proper ImageIO service provider discovery. This is handled internally and transparent to users.
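The usual pattern for temporarily swapping a context class loader is a try/finally that restores the previous value. The sketch below shows that general pattern. `withContextClassLoader` is a hypothetical helper, and which loader PDFReader actually installs is not stated here.

```kotlin
// Runs `block` with the given context class loader, then restores the previous one,
// even if `block` throws.
fun <T> withContextClassLoader(loader: ClassLoader?, block: () -> T): T {
    val thread = Thread.currentThread()
    val previous = thread.contextClassLoader
    thread.contextClassLoader = loader
    try {
        return block()
    } finally {
        thread.contextClassLoader = previous
    }
}
```

The swap affects only the calling thread, and the previous loader is restored before control returns to the caller.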