Extract
Extract data from pages using AI
The Extract API allows you to get data in a structured format for any provided URLs with a single call.
Hyperbrowser exposes endpoints for starting an extract request and for getting its status and results. By default, extraction is handled asynchronously: you first start the job and then check its status until it is completed. However, our SDKs provide a simple function that handles the whole flow and returns the data once the job is completed.
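Conceptually, the manual flow is: start a job, then poll its status until it reaches a terminal state. A minimal sketch of that polling loop in Python, with a stubbed status source standing in for the Get Extract Job endpoint (the helper and its names are illustrative, not part of the SDK):

```python
import time
from typing import Callable

def wait_for_extract_job(
    get_status: Callable[[], dict],
    poll_interval: float = 2.0,
    timeout: float = 120.0,
) -> dict:
    """Poll a 'get extract job' callable until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = get_status()
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(poll_interval)
    raise TimeoutError("extract job did not finish in time")

# Stubbed responses standing in for successive GET /extract/{jobId} calls:
_responses = iter([
    {"status": "pending"},
    {"status": "running"},
    {"status": "completed", "data": {"title": "Example"}},
])
result = wait_for_extract_job(lambda: next(_responses), poll_interval=0)
```

The SDK convenience function wraps exactly this start-then-poll pattern for you.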
You can configure the extract request with the following parameters:
`urls` - A required list of URLs you want to extract data from. To allow crawling for any of the URLs provided in the list, simply add `/*` to the end of the URL (e.g. `https://hyperbrowser.ai/*`). This will crawl other pages on the site with the same origin and find relevant pages to use for the extraction context.
`schema` - A strict JSON schema you want the returned data to be structured as. Gives the best results.
`prompt` - A prompt describing how you want the data structured. Useful if you don't have a specific schema in mind.
`maxLinks` - The maximum number of links to look for when performing a crawl for any given URL.
`waitFor` - A delay in milliseconds to wait after the page loads before scraping it for extraction. This can be useful for allowing dynamic content to fully render, or for detecting CAPTCHAs on the page if you have `solveCaptchas` set to true in the `sessionOptions`.
You must provide either a `schema` or a `prompt` in your request; if both are provided, the schema takes precedence.
For the Node SDK, you can pass in a Zod schema for ease of use, or an actual JSON schema. For the Python SDK, you can pass in a Pydantic model or an actual JSON schema. Ensure that the root level of the schema is `type: "object"`.
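For example, a plain JSON schema with the required object root might look like the following (the fields here are hypothetical, chosen just to illustrate the shape):

```python
# A hypothetical schema for extracting product data. Note that the root
# level is type "object", as the Extract API requires.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["name", "price"],
}
```

A Zod schema or Pydantic model serializes to an equivalent JSON schema with an object root.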
The Start Extract Job `POST /extract` endpoint will return a `jobId` in the response, which can be used to get information about the job in subsequent requests.
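As a sketch, the body of the start request can be assembled as below; the helper is illustrative (not part of any SDK) and mirrors the schema-or-prompt rule described earlier:

```python
def build_extract_payload(urls, schema=None, prompt=None, max_links=None, wait_for=None):
    """Assemble a request body for POST /extract (illustrative helper)."""
    if schema is None and prompt is None:
        raise ValueError("provide either a schema or a prompt")
    payload = {"urls": urls}
    if schema is not None:
        payload["schema"] = schema  # schema takes precedence over prompt
    else:
        payload["prompt"] = prompt
    if max_links is not None:
        payload["maxLinks"] = max_links
    if wait_for is not None:
        payload["waitFor"] = wait_for
    return payload

payload = build_extract_payload(
    ["https://hyperbrowser.ai/*"],
    prompt="List the product names and prices on this site.",
    max_links=10,
)
```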
The Get Extract Job `GET /extract/{jobId}` endpoint will return the following data:
The status of an extract job can be one of `pending`, `running`, `completed`, or `failed`. There can also be an optional `error` field with an error message if an error was encountered.
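A small sketch of how a client might interpret a terminal job's status and optional `error` field (the helper name is illustrative):

```python
def extract_result(job: dict):
    """Return extracted data from a terminal job, raising if it failed."""
    if job["status"] == "failed":
        # The optional "error" field carries the failure message, if any.
        raise RuntimeError(job.get("error", "extract job failed"))
    if job["status"] != "completed":
        raise ValueError(f"job is still {job['status']}")
    return job.get("data")

data = extract_result({"status": "completed", "data": {"title": "Example"}})
```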
`sessionOptions` - Configurations for the session that will be used to execute the extract job, just as you would provide when creating a new session itself. These could include using a proxy or solving CAPTCHAs. To see the full list of session configurations, check out the .
To see the full schema, check out the .
For a full reference on the extract endpoint, check out the .