🌐 Request Configuration
{
  "url": "https://example.com/api/docs",
  "headers": {
    "User-Agent": "CognotikBot/1.0"
  },
  "options": {
    "followRedirects": "NORMAL",
    "timeout": "60s",
    "trustAllCerts": true
  }
}
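
As a rough illustration, a configuration like the one above could be mapped onto a Python standard-library request as sketched below. The helper names (parse_seconds, build_request) are hypothetical, the "60s" duration format is assumed from the sample, and the followRedirects policy is not modeled here. Nothing is sent over the network until urlopen is called.

```python
import json
import ssl
import urllib.request

# The sample configuration from above, inlined so the sketch is self-contained.
CONFIG = json.loads("""
{
  "url": "https://example.com/api/docs",
  "headers": {"User-Agent": "CognotikBot/1.0"},
  "options": {"followRedirects": "NORMAL", "timeout": "60s", "trustAllCerts": true}
}
""")


def parse_seconds(value):
    """Convert a duration such as "60s" to float seconds (assumed format)."""
    if not value.endswith("s"):
        raise ValueError(f"unsupported duration: {value!r}")
    return float(value[:-1])


def build_request(config):
    """Return (request, timeout_seconds, ssl_context) without sending anything."""
    request = urllib.request.Request(
        config["url"], headers=config["headers"], method="GET"
    )
    timeout = parse_seconds(config["options"]["timeout"])
    context = ssl.create_default_context()
    if config["options"].get("trustAllCerts"):
        # Disables certificate verification, so only use with trusted hosts.
        # check_hostname must be cleared before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return request, timeout, context


request, timeout, context = build_request(CONFIG)
# To actually fetch the page:
# urllib.request.urlopen(request, timeout=timeout, context=context)
```

Note that trustAllCerts turns off TLS certificate checks entirely, which is why the sketch isolates it behind an explicit flag.
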
📄 Processed Content
✔ HTTP 200 OK [text/html]
// Simplified HTML Output
<main>
  <h1>API Documentation</h1>
  <p>This endpoint provides...</p>
  <section id="auth">...</section>
</main>
Reduced from 145KB to 12KB (91% compression)

Supported Content Types

Type                 Handler                       Output Path
text/html            HtmlSimplifier (Scrubbing)    reduced_pages/
application/pdf      DocumentReader (Extraction)   extracted_text/
application/msword   Apache Tika / POI             extracted_text/
text/plain           Direct Pass-through           text_pages/
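
The table above amounts to a lookup from media type to handler and output directory. A minimal sketch of that lookup follows; the HANDLERS dictionary and the route function are hypothetical names, and the handler labels are plain strings copied from the table rather than references to real classes.

```python
# Hypothetical lookup table mirroring the content-type table above.
HANDLERS = {
    "text/html": ("HtmlSimplifier", "reduced_pages/"),
    "application/pdf": ("DocumentReader", "extracted_text/"),
    "application/msword": ("Apache Tika / POI", "extracted_text/"),
    "text/plain": ("Direct Pass-through", "text_pages/"),
}


def route(content_type_header):
    """Return (handler, output_dir) for a Content-Type header, or None."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return HANDLERS.get(media_type)
```

Returning None for unlisted types leaves the decision about unsupported content to the caller, since the text does not say how such responses are treated.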

Technical Constraints

  • HTML Limit: 5MB raw (simplified output is truncated to 1MB if exceeded).
  • Document Limit: 10MB (Skipped if exceeded).
  • Timeout: 30s Connect / 60s Request.
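
The size limits listed above could be enforced along the following lines. This is a sketch under stated assumptions: 1MB is taken as 1024 * 1024 bytes, the HTML rule is read as "when the raw page exceeds 5MB, cut the simplified output to 1MB", and both function names are hypothetical.

```python
MB = 1024 * 1024  # assumption: binary megabytes

HTML_RAW_LIMIT = 5 * MB         # raw HTML size that triggers truncation
HTML_SIMPLIFIED_LIMIT = 1 * MB  # cap applied to the simplified output
DOCUMENT_LIMIT = 10 * MB        # documents above this size are skipped


def should_process_document(size_bytes):
    """Return True if a document is within the limit, False if it is skipped."""
    return size_bytes <= DOCUMENT_LIMIT


def cap_simplified_html(raw_size_bytes, simplified):
    """Truncate simplified HTML bytes when the raw page exceeded the raw limit."""
    if raw_size_bytes > HTML_RAW_LIMIT:
        return simplified[:HTML_SIMPLIFIED_LIMIT]
    return simplified
```

Truncating bytes can split a multi-byte UTF-8 character, so a caller decoding the result may want errors="ignore".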

  2. Request Execution

     Dispatches a GET request with the CognotikBot User-Agent and standard browser headers to minimize bot-detection triggers.

  3. Content Transformation

     HTML is scrubbed of scripts, styles, and interactive elements. Documents are parsed into plain text using specialized readers.

  4. Workspace Persistence

     Saves the raw, reduced, and extracted versions to the webSearchDir for auditability and caching.
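
The transformation and persistence steps above can be approximated with the Python standard library, as in the sketch below. It is not the HtmlSimplifier itself. The set of tags treated as "interactive" is a guess, the raw_pages directory name is hypothetical (only reduced_pages/ appears in the table), and the scrub and save_versions names are introduced here for illustration.

```python
import html
import tempfile
from html.parser import HTMLParser
from pathlib import Path

# Assumption: the text does not list which tags count as interactive.
DROPPED_TAGS = {"script", "style", "button", "form", "input", "select", "textarea"}
VOID_DROPPED = {"input"}  # dropped tags that never have a closing tag


class Scrubber(HTMLParser):
    """Re-emit HTML while skipping dropped elements and their contents."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # >0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            if tag not in VOID_DROPPED:
                self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag not in DROPPED_TAGS and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            if tag not in VOID_DROPPED and self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # The parser unescapes character references, so escape again.
            self.out.append(html.escape(data, quote=False))


def scrub(html_text):
    parser = Scrubber()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


def save_versions(web_search_dir, name, raw, reduced):
    """Write raw and reduced copies under web_search_dir and return their paths."""
    paths = {}
    for subdir, text in (("raw_pages", raw), ("reduced_pages", reduced)):
        target = Path(web_search_dir) / subdir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
        paths[subdir] = target
    return paths


# Usage: scrub a small page and persist both versions to a temporary directory.
raw_page = "<main><h1>API</h1><script>alert(1)</script><p>a &amp; b</p></main>"
with tempfile.TemporaryDirectory() as workdir:
    saved = save_versions(workdir, "docs.html", raw_page, scrub(raw_page))
    reduced_on_disk = saved["reduced_pages"].read_text(encoding="utf-8")
```

Keeping the raw copy next to the reduced one matches the auditability goal stated above: the reduced output can always be compared against what was actually fetched.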