Cognotik | HttpClientFetch

🌐 Request Configuration

{
  "url": "https://example.com/api/docs",
  "headers": {
    "User-Agent": "CognotikBot/1.0"
  },
  "options": {
    "followRedirects": "NORMAL",
    "timeout": "60s",
    "trustAllCerts": true
  }
}

→

📄 Processed Content

✔ HTTP 200 OK [text/html]

// Simplified HTML Output

<main>
  <h1>API Documentation</h1>
  <p>This endpoint provides...</p>
  <section id="auth">...</section>
</main>

Reduced from 145KB to 12KB (91% compression)

Supported Content Types

Type	Handler	Output Path
`text/html`	HtmlSimplifier (Scrubbing)	`reduced_pages/`
`application/pdf`	DocumentReader (Extraction)	`extracted_text/`
`application/msword`	Apache Tika / POI	`extracted_text/`
`text/plain`	Direct Pass-through	`text_pages/`
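
The table above amounts to a dispatch on the response's media type. A minimal Kotlin sketch of that routing follows. The `Handler` enum and `routeContentType` function are hypothetical names used for illustration, not the tool's actual API, and the behavior for unlisted types is an assumption.

```kotlin
// Hypothetical routing table mirroring the content-type table above.
// Enum and function names are illustrative, not the tool's real API.
enum class Handler(val outputDir: String) {
    HTML_SIMPLIFIER("reduced_pages/"),
    DOCUMENT_READER("extracted_text/"),
    TIKA_POI("extracted_text/"),
    PASS_THROUGH("text_pages/")
}

fun routeContentType(contentTypeHeader: String): Handler? {
    // Drop parameters such as "; charset=utf-8" and normalize case.
    val mediaType = contentTypeHeader.substringBefore(';').trim().lowercase()
    return when (mediaType) {
        "text/html" -> Handler.HTML_SIMPLIFIER
        "application/pdf" -> Handler.DOCUMENT_READER
        "application/msword" -> Handler.TIKA_POI
        "text/plain" -> Handler.PASS_THROUGH
        else -> null // Assumption: unlisted types are not handled.
    }
}
```

For example, `routeContentType("text/html; charset=UTF-8")` resolves to `Handler.HTML_SIMPLIFIER`, whose `outputDir` is `reduced_pages/`.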

Technical Constraints

HTML Limit: 5MB raw (truncated to 1MB simplified if exceeded).
Document Limit: 10MB (skipped if exceeded).
Timeout: 30s Connect / 60s Request.
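
The size limits above could be enforced along these lines. This is a sketch under stated assumptions: 1MB is taken as 1024 * 1024 bytes, the 1MB cap is read as applying to the simplified output when the raw HTML exceeds 5MB, and all names (`SizeDecision`, `decideBySize`, `truncateUtf8`) are hypothetical.

```kotlin
// Assumption: "MB" means 1024 * 1024 bytes. All names are illustrative.
val HTML_RAW_LIMIT_BYTES = 5L * 1024 * 1024
val HTML_SIMPLIFIED_CAP_BYTES = 1 * 1024 * 1024
val DOCUMENT_LIMIT_BYTES = 10L * 1024 * 1024

enum class SizeDecision { PROCESS, TRUNCATE_SIMPLIFIED, SKIP }

fun decideBySize(isHtml: Boolean, rawBytes: Long): SizeDecision = when {
    // Oversized HTML is still processed, but its simplified form is capped.
    isHtml && rawBytes > HTML_RAW_LIMIT_BYTES -> SizeDecision.TRUNCATE_SIMPLIFIED
    // Oversized documents are skipped entirely.
    !isHtml && rawBytes > DOCUMENT_LIMIT_BYTES -> SizeDecision.SKIP
    else -> SizeDecision.PROCESS
}

// Truncate to at most maxBytes of UTF-8 without splitting a character.
fun truncateUtf8(text: String, maxBytes: Int): String {
    val bytes = text.toByteArray(Charsets.UTF_8)
    if (bytes.size <= maxBytes) return text
    var end = maxBytes
    // Back up while the first excluded byte is a continuation byte (10xxxxxx).
    while (end > 0 && (bytes[end].toInt() and 0xC0) == 0x80) end--
    return String(bytes, 0, end, Charsets.UTF_8)
}
```

A caller would apply `truncateUtf8(simplified, HTML_SIMPLIFIED_CAP_BYTES)` only when `decideBySize` returns `TRUNCATE_SIMPLIFIED`.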

Request Execution

Dispatches a GET request with the CognotikBot User-Agent and standard browser headers to reduce the chance of triggering bot detection.
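
As a rough illustration, the request could be assembled with the JDK's `java.net.http` client, using the 30s connect and 60s request timeouts and the `NORMAL` redirect policy shown on this page. The function names and the particular `Accept` and `Accept-Language` values are assumptions, and the `trustAllCerts` option is deliberately left out of this sketch.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.time.Duration

// Sketch only: function names and the extra browser-style headers are assumptions.
fun buildClient(): HttpClient =
    HttpClient.newBuilder()
        .followRedirects(HttpClient.Redirect.NORMAL) // matches "followRedirects": "NORMAL"
        .connectTimeout(Duration.ofSeconds(30))      // 30s connect timeout
        .build()

fun buildRequest(url: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(url))
        .GET()
        .timeout(Duration.ofSeconds(60))             // 60s request timeout
        .header("User-Agent", "CognotikBot/1.0")
        .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
        .header("Accept-Language", "en-US,en;q=0.9")
        .build()
```

Sending is then a single call such as `buildClient().send(buildRequest(url), HttpResponse.BodyHandlers.ofByteArray())`, which additionally needs `java.net.http.HttpResponse` imported.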

Content Transformation

HTML is scrubbed of scripts, styles, and interactive elements. Documents are parsed into plain text using specialized readers.
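
To make the scrubbing step concrete, here is a deliberately simple regex-based sketch. It is not the HtmlSimplifier itself, which may well use a real HTML parser. The list of tags treated as "interactive" and the `scrubHtml` name are assumptions.

```kotlin
// Assumption: these are the elements considered scripts, styles, or interactive.
val REMOVED_ELEMENTS = listOf("script", "style", "noscript", "form", "button", "iframe")

fun scrubHtml(html: String): String {
    var out = html
    for (tag in REMOVED_ELEMENTS) {
        // Remove each element together with everything inside it.
        val pattern = Regex(
            "<$tag\\b[^>]*>.*?</$tag\\s*>",
            setOf(RegexOption.IGNORE_CASE, RegexOption.DOT_MATCHES_ALL)
        )
        out = out.replace(pattern, "")
    }
    // <input> is a void element, so it has no closing tag to match.
    out = out.replace(Regex("<input\\b[^>]*>", RegexOption.IGNORE_CASE), "")
    // Drop lines left empty by the removals.
    return out.lines().map { it.trimEnd() }.filter { it.isNotBlank() }.joinToString("\n")
}
```

Regexes are fragile on malformed or nested markup, so a parser-based approach would be the safer choice outside of a sketch like this.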

Workspace Persistence

Saves the raw, reduced, and extracted versions of each response to `webSearchDir` for auditability and caching.
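
One way the persistence step might look, assuming `webSearchDir` is a filesystem directory and that files are keyed by a hash of the URL. The subdirectory names `reduced_pages`, `extracted_text`, and `text_pages` come from the content-type table. The `cacheKey` and `saveVersion` helpers and the hashing scheme are hypothetical.

```kotlin
import java.nio.file.Files
import java.nio.file.Path
import java.security.MessageDigest

// Hypothetical cache key: first 16 hex characters of the URL's SHA-256 digest.
fun cacheKey(url: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(url.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }
        .take(16)

// Writes one version of the content under webSearchDir/<subDir>/<name>.
fun saveVersion(webSearchDir: Path, subDir: String, name: String, content: String): Path {
    val dir = webSearchDir.resolve(subDir)
    Files.createDirectories(dir) // no-op if the directory already exists
    val target = dir.resolve(name)
    Files.writeString(target, content)
    return target
}
```

A later fetch of the same URL could check for `webSearchDir/reduced_pages/<cacheKey>.html` before going to the network, which is the kind of caching the paragraph above alludes to.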
