OCRTask
Convert documents (PDF, Images) to Markdown text using Vision models. Supports high-DPI rendering, figure extraction, and metadata parsing.
Category: File
Model: Vision-Capable (GPT-4o/Claude 3.5)
Output: Markdown + Assets
OCRTaskExecutionConfigData.json
● Config
{
"files": ["internal/q3_report.pdf"],
"dpi": 200,
"extract_figures": true,
"extract_metadata": true,
"extract_text": true
}
→
Session Output
● Result
✔ Processed 12 pages
Generated Files:
- 📄 q3_report.md
- 📄 q3_report_metadata.json
- 📄 q3_report_text.txt
- 📁 q3_report_figures/
- └─ p1_1_Revenue_Chart.png
# Quarterly Report
## Financial Highlights
The revenue for Q3 exceeded expectations...
Configuration Parameters
| Field | Type | Default | Description |
|---|---|---|---|
| files * | List<String> | - | List of workspace-relative paths to PDF or image files. |
| dpi | Float | 150.0 | Rendering resolution for PDF pages before Vision processing. |
| extract_figures | Boolean | false | If true, identifies and crops charts/images into a sub-directory. |
| extract_metadata | Boolean | false | Extracts form fields and key-value pairs into a JSON file. |
| extract_text | Boolean | false | Extracts raw embedded text from PDFs (non-OCR) to a .txt file. |
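Only `files` is required; the other fields fall back to the defaults listed above. A minimal sketch of the two extremes, assuming the constructor shown in the Kotlin Usage section below exposes these parameters with the table's defaults:

```kotlin
import com.simiacryptus.cognotik.plan.tools.file.OCRTask

// Minimal config: only the required `files` list is set; dpi stays at 150.0
// and figure/metadata/text extraction remain disabled (per the table above).
val minimalConfig = OCRTask.OCRTaskExecutionConfigData(
    files = listOf("internal/q3_report.pdf")
)

// Full config mirroring the JSON example at the top of this page.
val fullConfig = OCRTask.OCRTaskExecutionConfigData(
    files = listOf("internal/q3_report.pdf"),
    dpi = 200f,
    extract_figures = true,
    extract_metadata = true,
    extract_text = true
)
```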
Task Lifecycle
- Initialization: Validates file existence and prepares the TabbedDisplay.
- Rendering: Converts PDF pages to high-resolution images using RenderableDocumentReader.
- Vision OCR: Sends page images to the LLM with a system prompt optimized for Markdown conversion.
- Asset Analysis: If enabled, runs ParsedImageAgent to detect bounding boxes for figures and metadata.
- Persistence: Saves the final Markdown, JSON metadata, and cropped figure PNGs to the workspace.
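Read linearly, the lifecycle is a pipeline from input file to workspace artifacts. The sketch below is illustrative only: `renderPages`, `ocrPage`, `analyzeAssets`, and `persist` are hypothetical stand-ins for the RenderableDocumentReader, Vision call, and ParsedImageAgent stages named above, not Cognotik APIs.

```kotlin
import java.awt.image.BufferedImage
import java.io.File

// Illustrative pipeline mirroring the five lifecycle stages.
fun runOcrPipeline(input: File, workspace: File): File {
    require(input.exists()) { "Input file not found: $input" }               // Initialization
    val pages: List<BufferedImage> = renderPages(input, dpi = 200f)          // Rendering
    val markdown = pages.joinToString("\n\n") { ocrPage(it) }                // Vision OCR
    val figures = pages.flatMap { analyzeAssets(it) }                        // Asset Analysis
    return persist(workspace, input.nameWithoutExtension, markdown, figures) // Persistence
}

// Hypothetical stage signatures, shown only to make the data flow explicit.
fun renderPages(input: File, dpi: Float): List<BufferedImage> = TODO()
fun ocrPage(page: BufferedImage): String = TODO()
fun analyzeAssets(page: BufferedImage): List<File> = TODO()
fun persist(workspace: File, baseName: String, markdown: String, figures: List<File>): File = TODO()
```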
Kotlin Usage
Invoke as a standalone tool using the UnifiedHarness.
import com.simiacryptus.cognotik.plan.tools.file.OCRTask
import com.simiacryptus.cognotik.plan.tools.file.OCRTask.Companion.OCR
// 1. Define Runtime Input
val executionConfig = OCRTask.OCRTaskExecutionConfigData(
files = listOf("test_document.pdf"),
dpi = 300f,
extract_figures = true,
extract_metadata = true,
extract_text = true
)
// 2. Run via Harness
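// Note: `harness` is assumed to be a pre-configured UnifiedHarness instance.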
harness.runTask(
taskType = OCR,
typeConfig = TaskTypeConfig(name = "DocumentProcessor"),
executionConfig = executionConfig,
workspace = File("./workspaces/test-20260110_143732"),
autoFix = true // Automatically save results to disk
)
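Once the run completes, the generated artifacts (the .md file, optional metadata JSON, text dump, and figures directory) land in the workspace directory passed above. A quick way to inspect them using only the standard library, with no Cognotik APIs assumed:

```kotlin
import java.io.File

// Print every file the task wrote into the workspace, e.g. test_document.md
// and its companion metadata/figure outputs.
val workspace = File("./workspaces/test-20260110_143732")
workspace.walkTopDown()
    .filter { it.isFile }
    .forEach { println(it.relativeTo(workspace)) }
```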
Prompt Segment
The internal description used by the AI orchestrator:
OCR - Convert documents (PDF, Images) to Markdown text.
* Extracts text from images and PDFs using Vision models.
* Preserves formatting as Markdown.
* Optionally extracts figures as images and metadata/form fields.
* Saves output to a .md file with the same name.