Scrape raw HTML with AgentQL's REST API
Query HTML without a URL. Extract structured data from raw HTML with AgentQL's REST API. Perfect for crawlers, snapshots, and firewalled pages.
Sometimes you need to extract data from HTML but you don't have a URL to pass to AgentQL's REST API. Fret not: our REST API endpoint now supports querying raw HTML directly!
With this functionality, you can scrape data from pages even if you're working behind a firewall, fetching pages with a custom crawler, or integrating with internal tools. Pass the HTML as a string along with your AgentQL query, and AgentQL will return structured data in JSON.
You asked for it: scraping web pages without a URL
You asked if it was possible to scrape data without Playwright. You told us you were already fetching HTML using custom crawlers. We heard you! This new capability is perfect for querying data from:
- Private and internal network pages
- Previously crawled pages and HTML dumps
- Archived HTML files and snapshots
It can even be used to scrape difficult-to-reach and heavily anti-botted pages. You can navigate to the page using a stealth crawler or your own browser, save the page's HTML or copy it as a string, and follow the steps below!
How to extract data from an HTML string
You can pass HTML directly in your API request like so:
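As a rough illustration, here is a minimal Python sketch of such a request using only the standard library. The endpoint URL, the `X-API-Key` header, and the `html` and `query` field names are assumptions on our part here, so confirm them against the REST API Reference. The helper names `build_request` and `query_html` are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint; confirm it in the REST API Reference.
AGENTQL_ENDPOINT = "https://api.agentql.com/v1/query-data"


def build_request(html: str, query: str, api_key: str) -> urllib.request.Request:
    """Build a POST request that carries raw HTML instead of a URL."""
    # json.dumps escapes quotes and newlines inside the HTML string.
    # The "query" and "html" field names are assumptions.
    body = json.dumps({"query": query, "html": html}).encode("utf-8")
    return urllib.request.Request(
        AGENTQL_ENDPOINT,
        data=body,
        # The "X-API-Key" header name is an assumption.
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def query_html(html: str, query: str, api_key: str) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(html, query, api_key)) as resp:
        return json.load(resp)


# Example call (needs a real API key and network access):
# html = '<html><body><h1 class="title">Hello</h1></body></html>'
# print(query_html(html, "{ page_title }", "YOUR_API_KEY"))
```

Building the body with `json.dumps` rather than string concatenation means double quotes in attributes like `class="title"` cannot break the JSON.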
AgentQL will process the HTML and return structured JSON:
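Reading the result might look something like the sketch below. It assumes the extracted fields sit under a top-level `data` key; the response body shown is a made-up placeholder for illustration, not real API output, and `extract_data` is a hypothetical helper.

```python
import json


def extract_data(response_body: str) -> dict:
    """Return the structured result from a JSON response body.

    Assumes the result sits under a top-level "data" key.
    """
    parsed = json.loads(response_body)
    if "data" not in parsed:
        raise KeyError("no 'data' key in response; check the REST API Reference")
    return parsed["data"]


# Hypothetical response body, for illustration only (not real API output).
example_body = '{"data": {"page_title": "Hello"}, "metadata": {}}'
print(extract_data(example_body))  # prints {'page_title': 'Hello'}
```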
Got a large, unwieldy chunk of HTML? Or local files you want to send without copy-pasting all the HTML every time? Most HTML will run into JSON formatting errors if you pass it through raw, anyway. Try this out:
This combines reading the file with cat alongside jq's power to properly format HTML for a JSON context (escaping double quotes, etc.).
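If you would rather stay in Python, the same idea (read the file, then let a JSON encoder handle the escaping) can be sketched as follows. The `query` and `html` field names are assumptions, and `payload_from_file` is a hypothetical helper, not part of AgentQL.

```python
import json
from pathlib import Path


def payload_from_file(path: str, query: str) -> str:
    """Read an HTML file and return a JSON request body as a string."""
    # Path.read_text plays the role of cat: it loads the whole file.
    html = Path(path).read_text(encoding="utf-8")
    # json.dumps plays the role of jq: it escapes double quotes,
    # backslashes, and newlines so the HTML is a valid JSON string.
    # The "query" and "html" field names are assumptions.
    return json.dumps({"query": query, "html": html})


# Example (assumes a local file named page.html exists):
# body = payload_from_file("page.html", "{ page_title }")
```

The returned string can be sent as the body of a POST request with a `Content-Type: application/json` header.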
Get started extracting data with HTML and AgentQL
This feature is available now, with no opt-in or special flag required. Learn more in our guide to getting data from HTML with AgentQL or the REST API Reference.
If you have any questions, join our Discord, and we will help you out. We love hearing from you! Find us on X or Bluesky, too!
—The TinyFish team building AgentQL