Scraping
Advanced Options for Hyperbrowser Scraping
Scraping a web page
By supplying just a URL, you can easily extract the contents of a page in Markdown format with the /scrape endpoint.
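As a rough sketch (the base URL and auth header below are assumptions for illustration; only the /scrape path and the url field come from this page), the simplest request could look like this in Python:

```python
import os
import requests

# Assumed base URL and auth header; check the Hyperbrowser API reference for the exact values.
API_BASE = "https://api.hyperbrowser.ai"
headers = {"x-api-key": os.environ["HYPERBROWSER_API_KEY"]}

# Supplying just a URL returns the page content, in Markdown by default.
resp = requests.post(f"{API_BASE}/scrape", json={"url": "https://example.com"}, headers=headers)
resp.raise_for_status()
print(resp.json())
```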
Now, let's take an in-depth look at all the provided options for scraping.
Session Options
All Scraping APIs, like Scrape, Crawl, and Extract, support the session parameters. You can see the session parameters listed here.
Scrape Options
formats
Type: array
Items: string
Enum: ["html", "links", "markdown", "screenshot"]
Description: Choose the formats to include in the API response:
html - Returns the scraped content as HTML.
links - Includes a list of links found on the page.
markdown - Provides the content in Markdown format.
screenshot - Provides a screenshot of the page.
Default: ["markdown"]
includeTags
Type: array
Items: string
Description: Provide an array of HTML tags, classes, or IDs to include in the scraped content. Only elements matching these selectors will be returned.
Default: undefined
excludeTags
Type: array
Items: string
Description: Provide an array of HTML tags, classes, or IDs to exclude from the scraped content. Elements matching these selectors will be omitted from the response.
Default: undefined
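For instance, to keep just an article body while dropping navigation and ad elements, the relevant portion of the scrape options might look like this (the selectors are purely illustrative):

```python
# Illustrative scrapeOptions fragment: keep the article content, drop navigation and ads.
scrape_options = {
    "includeTags": ["article", ".post-content"],
    "excludeTags": ["nav", ".ad-banner", "footer"],
}
```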
onlyMainContent
Type: boolean
Description: When set to true (default), the API will attempt to return only the main content of the page, excluding common elements like headers, navigation menus, and footers. Set to false to return the full page content.
Default: true
waitFor
Type: number
Description: Specify a delay in milliseconds to wait after the page loads before initiating the scrape. This can be useful for allowing dynamic content to fully render. This is also useful for waiting to detect CAPTCHAs on the page if you have solveCaptchas set to true in the sessionOptions.
Default: 0
timeout
Type: number
Description: Specify the maximum time in milliseconds to wait for the page to load before timing out; this is roughly equivalent to the browser-level navigation timeout sketched below.
Default: 30000 (30 seconds)
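The service applies this timeout internally; as an illustration only, it behaves roughly like setting the navigation timeout in a browser-automation tool such as Playwright:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Comparable to the default above: give navigation up to 30000 ms before timing out.
    page.goto("https://example.com", timeout=30000)
    browser.close()
```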
waitUntil
Type: string
Enum: ["load", "domcontentloaded", "networkidle"]
Description: Specify the condition to wait for the page to load:
domcontentloaded - Wait until the HTML is fully parsed and the DOM is ready.
load - Wait until the DOM and all resources are completely loaded.
networkidle - Wait until no more network requests occur for a certain period of time.
Default: load
screenshotOptions
Type: object
Description: Configurations for the returned screenshot. Only applicable if screenshot is provided in the formats array.
Properties:
fullPage - Take a screenshot of the full page beyond the viewport. Type: boolean. Default: false.
format - The image type of the screenshot. Type: string. Enum: ["webp", "jpeg", "png"]. Default: webp.
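For example, requesting a full-page PNG screenshot alongside the Markdown content could use scrape options like these:

```python
# Illustrative scrapeOptions fragment: Markdown plus a full-page PNG screenshot.
scrape_options = {
    "formats": ["markdown", "screenshot"],
    "screenshotOptions": {
        "fullPage": True,
        "format": "png",
    },
}
```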
Example
By configuring these options when making a scrape request, you can control the format and content of the scraped data, as well as the behavior of the scraper itself.
For example, to scrape a page with the following:
In stealth mode
With CAPTCHA solving
Return only the main content as HTML
Exclude any <span> elements
Wait 2 seconds after the page loads and before scraping
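A hedged sketch of such a request (the base URL, auth header, and the useStealth field name are assumptions; the remaining option names come from this page):

```python
import os
import requests

API_BASE = "https://api.hyperbrowser.ai"  # assumed; see the API reference
headers = {"x-api-key": os.environ["HYPERBROWSER_API_KEY"]}  # assumed header name

payload = {
    "url": "https://example.com",
    "sessionOptions": {
        "useStealth": True,       # assumed field name for stealth mode
        "solveCaptchas": True,    # enable CAPTCHA solving
    },
    "scrapeOptions": {
        "formats": ["html"],      # return HTML instead of the Markdown default
        "onlyMainContent": True,  # strip headers, navigation menus, and footers
        "excludeTags": ["span"],  # drop all <span> elements
        "waitFor": 2000,          # wait 2 seconds after load before scraping
    },
}

resp = requests.post(f"{API_BASE}/scrape", json=payload, headers=headers)
resp.raise_for_status()
print(resp.json())
```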
Crawl a Site
Instead of just scraping a single page, you might want to get all the content across multiple pages on a site. The /crawl endpoint is perfect for such a task. You can use the same sessionOptions and scrapeOptions as before for this endpoint as well. The crawl endpoint does have some extra parameters that are used to tailor the crawl to your scraping needs.
Crawl Options
Limiting the Number of Pages to Crawl with maxPages
Type: integer
Minimum: 1
Description: The maximum number of pages to crawl before stopping.
Following Links with followLinks
Type: boolean
Default: true
Description: When set to true, the crawler will follow links found on the pages it visits, allowing it to discover new pages and expand the scope of the crawl. When set to false, the crawler will only visit the starting URL and any explicitly specified pages, without following any additional links.
Ignoring the Sitemap with ignoreSitemap
Type: boolean
Default: false
Description: When set to true, the crawler will not pre-generate a list of URLs from potential sitemaps it finds. The crawler will try to locate sitemaps beginning at the base URL of the URL provided in the url param.
Excluding Pages with excludePatterns
Type: array
Items: string
Description: An array of regular expressions or wildcard patterns specifying which URLs should be excluded from the crawl. Any pages whose URL path matches one of these patterns will be skipped.
Including Pages with includePatterns
Type: array
Items: string
Description: An array of regular expressions or wildcard patterns specifying which URLs should be included in the crawl. Only pages whose URL path matches one of these patterns will be visited.
Example
By configuring these options when initiating a crawl, you can control the scope and behavior of the crawler to suit your specific needs.
For example, to crawl a site with the following:
Maximum of 5 pages
Only include /blog pages
Return only the main content as HTML
Exclude any <span> elements
Wait 2 seconds after the page loads and before scraping
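A sketch of the corresponding /crawl request, under the same assumptions about the base URL and auth header as in the scrape example (the /blog pattern syntax is illustrative):

```python
import os
import requests

API_BASE = "https://api.hyperbrowser.ai"  # assumed; see the API reference
headers = {"x-api-key": os.environ["HYPERBROWSER_API_KEY"]}  # assumed header name

payload = {
    "url": "https://example.com",
    "maxPages": 5,                   # stop after 5 pages
    "includePatterns": ["/blog/*"],  # only visit /blog pages (pattern syntax illustrative)
    "scrapeOptions": {
        "formats": ["html"],
        "onlyMainContent": True,
        "excludeTags": ["span"],
        "waitFor": 2000,
    },
}

resp = requests.post(f"{API_BASE}/crawl", json=payload, headers=headers)
resp.raise_for_status()
print(resp.json())
```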
Structured Extraction
The Extract API allows you to fetch data in a well-defined structure from any webpage or website with just a few lines of code. You can provide a list of web pages, and Hyperbrowser will collate all the information together and extract the data that best fits the provided schema (or prompt). You have access to the same sessionOptions available here as well.
Extract Options
Specifying all pages to collect data from with urls
Type: array
Items: string
Required: Yes
Description: List of URLs to extract data from. To crawl a site, add /* to a URL (e.g., https://example.com/*). This will crawl other pages on the site with the same origin and find relevant pages to use for the extraction context.
Specify the extraction schema with schema
Type: object
Required: No
Description: JSON schema defining the structure of the data you want to extract. Gives the best results with clear data structure requirements.
Note: You must provide either a schema or a prompt. If both are provided, the schema takes precedence.
Default: undefined
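For example, a simple schema asking for a product name and price could be expressed as a plain JSON Schema object (the field names are purely illustrative):

```python
# Illustrative JSON Schema for structured extraction: a product name and price.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["name", "price"],
}
```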
Specify the data to be extracted with a prompt
Type: string
Required: No
Description: A prompt describing how you want the data structured. Useful if you don't have a specific schema in mind.
Note: You must provide either a schema or a prompt. If both are provided, the schema takes precedence.
Default: undefined
Further specify the extraction process with a systemPrompt
Type: string
Required: No
Description: Additional instructions for the extraction process to guide the AI's behavior.
Default: undefined
Specify the number of pages to collect information from with maxLinks
Type: number
Description: Maximum number of links to follow when crawling a site for any given URL with a /* suffix.
Default: undefined
Time to wait on a page before extraction using waitFor
Type: number
Description: Time in milliseconds to wait after page load before extraction. This can be useful for allowing dynamic content to fully render or for waiting to detect CAPTCHAs if you have solveCaptchas set to true.
Default: 0
Set options for the session with sessionOptions
Type: object
Default: undefined
One of schema or prompt must be defined.
Example
By configuring these options when initiating a structured extraction, you can control the scope and behavior to suit your specific needs.
For example, to run an extraction with the following:
Maximum of 5 pages
Include /products on example.com, and as many subsequent pages as possible on test.com
Return the extracted data in the specified schema
Wait 2 seconds after the page loads and before scraping
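A sketch of the corresponding request (the /extract path, base URL, and auth header are assumptions; the option names mirror the parameters described above, and the schema is the illustrative one from earlier):

```python
import os
import requests

API_BASE = "https://api.hyperbrowser.ai"  # assumed; see the API reference
headers = {"x-api-key": os.environ["HYPERBROWSER_API_KEY"]}  # assumed header name

payload = {
    "urls": [
        "https://example.com/products",  # a single page
        "https://test.com/*",            # crawl same-origin pages for extra context
    ],
    "maxLinks": 5,  # cap on pages followed for any /* URL
    "schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
        },
        "required": ["name", "price"],
    },
    "waitFor": 2000,  # wait 2 seconds after load before extracting
}

resp = requests.post(f"{API_BASE}/extract", json=payload, headers=headers)  # path assumed
resp.raise_for_status()
print(resp.json())
```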
For more detail, check out the Extract page.