The Scrape API lets you get the data you want from web pages with a single call. You can scrape page content and capture its data in various formats.
Hyperbrowser exposes endpoints for starting a scrape request and for getting its status and results. By default, scraping is asynchronous: you start the job, then check its status until it completes. The SDKs, however, provide a single function that handles the whole flow and returns the data once the job is complete.
Installation
npm install @hyperbrowser/sdk
or
yarn add @hyperbrowser/sdk
pip install hyperbrowser
Usage
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  // Handles both starting and waiting for scrape job response
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
  });
  console.log("Scrape result:", scrapeResult);
};

main();
The Start Scrape Job endpoint (POST /scrape) returns a jobId in the response, which can be used to get information about the job in subsequent requests.
{"jobId":"962372c4-a140-400b-8c26-4ffe21d9fb9c"}
The Get Scrape Job endpoint (GET /scrape/{jobId}) returns the following data:
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "data": {
    "metadata": {
      "title": "Example Page",
      "description": "A sample webpage"
    },
    "markdown": "# Example Page\nThis is content..."
  }
}
The status of a scrape job can be one of pending, running, completed, or failed. The response may also include optional fields, such as error with an error message if an error was encountered, and html and links in the data object, depending on which formats were requested.
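If you call the endpoints directly rather than using the SDK helper, the start-then-check flow described above could be sketched roughly as follows. This is a minimal illustration, not the SDK's implementation: `waitForScrape`, `GetJob`, `isTerminal`, the polling interval, and the attempt limit are all hypothetical names and values, and the caller is assumed to supply a function that performs GET /scrape/{jobId}. The `ScrapeJob` shape is inferred from the sample response, and the `string[]` type for links is a guess.

```typescript
// Status values as listed in the text above.
type ScrapeStatus = "pending" | "running" | "completed" | "failed";

// Shape inferred from the sample response; field types are assumptions.
interface ScrapeJob {
  jobId: string;
  status: ScrapeStatus;
  data?: {
    metadata?: Record<string, unknown>;
    markdown?: string;
    html?: string;
    links?: string[];
  };
  error?: string;
}

// Hypothetical: the caller provides a function that performs GET /scrape/{jobId}.
type GetJob = (jobId: string) => Promise<ScrapeJob>;

// A job stops changing once it is completed or failed.
const isTerminal = (status: ScrapeStatus): boolean =>
  status === "completed" || status === "failed";

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the job reaches a terminal status or the attempt limit is hit.
async function waitForScrape(
  getJob: GetJob,
  jobId: string,
  intervalMs = 2000, // assumed default, not taken from the docs
  maxAttempts = 30, // assumed default, not taken from the docs
): Promise<ScrapeJob> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await getJob(jobId);
    if (job.status === "failed") {
      throw new Error(job.error ?? `Scrape job ${jobId} failed`);
    }
    if (isTerminal(job.status)) {
      return job; // only "completed" can reach this point
    }
    await sleep(intervalMs);
  }
  throw new Error(`Scrape job ${jobId} did not finish after ${maxAttempts} checks`);
}
```

Passing the fetching function in as an argument keeps the loop independent of any particular HTTP client or authentication scheme, and makes it easy to exercise with a stub.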
You can also provide configurations for the session that will be used to execute the scrape job just as you would when creating a new session itself. These could include using a proxy or solving CAPTCHAs.
You can also provide optional parameters for the scrape job itself such as the formats to return, only returning the main content of the page, setting the maximum timeout for navigating to a page, etc.
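As a rough illustration of how session configuration and scrape parameters might be combined in one request, here is a sketch of a request object. The field names (`sessionOptions`, `useProxy`, `solveCaptchas`, `scrapeOptions`, `formats`, `onlyMainContent`, `timeout`), the format values, and the millisecond unit are assumptions made for this example; the API Reference is the authoritative source for the actual names and units.

```typescript
// Assumed format names, based on the fields mentioned in the response description.
type ScrapeFormat = "markdown" | "html" | "links";

// Hypothetical request shape; confirm every field name against the API Reference.
interface ScrapeRequest {
  url: string;
  sessionOptions?: {
    useProxy?: boolean; // route the session through a proxy
    solveCaptchas?: boolean; // attempt to solve CAPTCHAs automatically
  };
  scrapeOptions?: {
    formats?: ScrapeFormat[]; // which representations to return
    onlyMainContent?: boolean; // return only the main content of the page
    timeout?: number; // max navigation time, assumed to be in milliseconds
  };
}

// Builds a request that asks for the given formats with a proxy and CAPTCHA solving enabled.
function buildScrapeRequest(url: string, formats: ScrapeFormat[]): ScrapeRequest {
  return {
    url,
    sessionOptions: { useProxy: true, solveCaptchas: true },
    scrapeOptions: { formats, onlyMainContent: true, timeout: 15000 },
  };
}

const request = buildScrapeRequest("https://example.com", ["markdown", "links"]);
console.log(JSON.stringify(request, null, 2));
```

If these names match the API, an object like this would take the place of the `{ url }` argument in the Usage example.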
For a full reference on the scrape endpoint, check out the API Reference, or read the Advanced Scraping Guide to see more advanced options for scraping.