HttpClientFetch
HTTP retrieval engine with optional SSL certificate bypass, automatic HTML scrubbing, and text extraction from documents, intended for preparing LLM context.
Side-Effect: Safe
SSL-Tolerant
Multi-Format
🌐 Request Configuration
{
"url": "https://example.com/api/docs",
"headers": {
"User-Agent": "CognotikBot/1.0"
},
"options": {
"followRedirects": "NORMAL",
"timeout": "60s",
"trustAllCerts": true
}
}
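As a rough illustration of how a configuration like the one above might be consumed, the following Python sketch maps the fields onto the standard library's `urllib` and `ssl` modules. It is not the tool's implementation: the helper names (`parse_duration`, `build_ssl_context`, `build_request`) are hypothetical, the duration format is assumed to be whole or fractional seconds with an `s` suffix, and `followRedirects` is not modeled because `urllib` follows redirects by default.

```python
import ssl
import urllib.request

# Sample configuration copied from the request above.
CONFIG = {
    "url": "https://example.com/api/docs",
    "headers": {"User-Agent": "CognotikBot/1.0"},
    "options": {"followRedirects": "NORMAL", "timeout": "60s", "trustAllCerts": True},
}


def parse_duration(text: str) -> float:
    """Convert a duration such as "60s" into seconds (assumed format)."""
    text = text.strip().lower()
    if not text.endswith("s"):
        raise ValueError(f"unsupported duration: {text!r}")
    return float(text[:-1])


def build_ssl_context(trust_all_certs: bool) -> ssl.SSLContext:
    """Return a default TLS context, or one that skips verification."""
    context = ssl.create_default_context()
    if trust_all_certs:
        # Hostname checking must be disabled before verify_mode is relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def build_request(config: dict):
    """Translate the config into (Request, timeout_seconds, SSLContext)."""
    options = config.get("options", {})
    request = urllib.request.Request(config["url"], headers=config.get("headers", {}))
    timeout = parse_duration(options.get("timeout", "60s"))
    context = build_ssl_context(bool(options.get("trustAllCerts", False)))
    return request, timeout, context
```

The returned values can be passed to `urllib.request.urlopen(request, timeout=timeout, context=context)`. Disabling certificate verification exposes the connection to interception, so a setting like `trustAllCerts` is best reserved for hosts you already trust.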
📄 Processed Content
✔ HTTP 200 OK [text/html]
// Simplified HTML Output
<main>
<h1>API Documentation</h1>
<p>This endpoint provides...</p>
<section id="auth">...</section>
</main>
Reduced from 145KB to 12KB (91% compression)
Supported Content Types
| Type | Handler | Output Path |
|---|---|---|
| text/html | HtmlSimplifier (Scrubbing) | reduced_pages/ |
| application/pdf | DocumentReader (Extraction) | extracted_text/ |
| application/msword | Apache Tika / POI | extracted_text/ |
| text/plain | Direct Pass-through | text_pages/ |
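
The mapping from content type to handler and output path listed above can be pictured as a simple lookup. The Python sketch below is illustrative only: the `HANDLERS` dictionary and `route` function are hypothetical names, the handler labels are copied from the table as plain strings, and the treatment of unknown types (returning `None`) is an assumption, since the source does not say what happens to unsupported content.

```python
from typing import Optional, Tuple

# Content type -> (handler label, output path), as listed in the table.
HANDLERS = {
    "text/html": ("HtmlSimplifier", "reduced_pages/"),
    "application/pdf": ("DocumentReader", "extracted_text/"),
    "application/msword": ("Apache Tika / POI", "extracted_text/"),
    "text/plain": ("Direct Pass-through", "text_pages/"),
}


def route(content_type_header: str) -> Optional[Tuple[str, str]]:
    """Pick a handler from a Content-Type header value.

    Parameters such as "; charset=utf-8" are stripped before the lookup.
    Returns None for types that are not in the table (assumed behavior).
    """
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return HANDLERS.get(media_type)
```

For example, `route("text/html; charset=utf-8")` selects the HTML simplifier and the `reduced_pages/` output path.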
Technical Constraints
- HTML Limit: 5MB raw (Truncated to 1MB simplified if exceeded).
- Document Limit: 10MB (Skipped if exceeded).
- Timeout: 30s Connect / 60s Request.
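
The size limits above could be enforced along the following lines. This Python sketch rests on several assumptions that the source does not confirm: 1MB is taken as 1024 * 1024 bytes, the HTML rule is read as "when the raw page exceeds 5MB, the simplified output is cut to at most 1MB", and the function names are hypothetical.

```python
MB = 1024 * 1024  # assumption: binary megabytes

HTML_RAW_LIMIT = 5 * MB        # raw HTML threshold
HTML_SIMPLIFIED_CAP = 1 * MB   # cap applied to simplified output
DOCUMENT_LIMIT = 10 * MB       # documents above this are skipped
CONNECT_TIMEOUT_S = 30
REQUEST_TIMEOUT_S = 60


def apply_html_limit(raw_size: int, simplified: str) -> str:
    """Truncate simplified HTML when the raw page exceeded the limit."""
    if raw_size <= HTML_RAW_LIMIT:
        return simplified
    data = simplified.encode("utf-8")[:HTML_SIMPLIFIED_CAP]
    # Drop any multi-byte character that was split by the byte cut.
    return data.decode("utf-8", errors="ignore")


def should_extract_document(size: int) -> bool:
    """Return False for documents that exceed the document limit."""
    return size <= DOCUMENT_LIMIT
```

Truncating on bytes rather than characters keeps the cap meaningful for non-ASCII pages, at the cost of possibly dropping one trailing character.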