Scraping
Advanced Options for Hyperbrowser Scraping
Basic Usage
By supplying just a URL, you can easily extract the contents of a page in Markdown format with the /scrape endpoint.
Now, let's take an in-depth look at all of the options available for scraping.
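As a minimal sketch, the request body for that basic case is just the URL. The endpoint path comes from this page; the base URL and auth header shown in the comment are assumptions, so check the API reference for the exact values:

```python
import json

# Minimal scrape request body: only the URL is required.
payload = {"url": "https://example.com"}

# POST this as JSON to the /scrape endpoint, e.g. (base URL and
# header names are assumptions -- see the API reference):
#   curl -X POST https://<api-base>/scrape \
#     -H "Content-Type: application/json" \
#     -H "x-api-key: $HYPERBROWSER_API_KEY" \
#     -d '{"url": "https://example.com"}'
print(json.dumps(payload))
```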
Scrape Options
Configuring Response Formats with formats
Type:
array
Items:
string
Enum:
["html", "links", "markdown", "screenshot"]
Description: Choose the formats to include in the API response:
- html: Returns the scraped content as HTML.
- links: Includes a list of links found on the page.
- markdown: Provides the content in Markdown format.
- screenshot: Provides a screenshot of the page.
Default:
["markdown"]
Specifying Tags to Include with includeTags
Type:
array
Items:
string
Description: Provide an array of HTML tags, classes, or IDs to include in the scraped content. Only elements matching these selectors will be returned.
Default:
undefined
Specifying Tags to Exclude with excludeTags
Type:
array
Items:
string
Description: Provide an array of HTML tags, classes, or IDs to exclude from the scraped content. Elements matching these selectors will be omitted from the response.
Default:
undefined
Focusing on Main Content with onlyMainContent
Type:
boolean
Description: When set to true (default), the API will attempt to return only the main content of the page, excluding common elements like headers, navigation menus, and footers. Set to false to return the full page content.
Default:
true
Delaying Scraping with waitFor
Type:
number
Description: Specify a delay in milliseconds to wait after the page loads before initiating the scrape. This can be useful for allowing dynamic content to fully render.
Default:
0
Setting a Timeout with timeout
Type:
number
Description: Specify the maximum time in milliseconds to wait for the page to load before timing out.
Default:
30000
(30 seconds)
Session Options
Enabling Stealth Mode with useStealth
Type:
boolean
Description: When set to true, the session will be launched in stealth mode, which employs various techniques to make the browser harder to detect as an automated tool.
Default:
false
Using a Proxy with useProxy
Type:
boolean
Description: When set to true, the session will be launched with a proxy server.
Default:
false
Specifying a Custom Proxy Server with proxyServer
Type:
string
Description: The hostname or IP address of the proxy server to use for the session. This option is only used when useProxy is set to true.
Default:
undefined
Providing Proxy Server Authentication with proxyServerUsername and proxyServerPassword
Type:
string
Description: The username and password to use for authenticating with the proxy server, if required. These options are only used when useProxy is set to true and the proxy server requires authentication.
Default:
undefined
Selecting a Proxy Location with proxyCountry
Type:
string
Enum:
["US", "GB", "CA", ...]
Description: The country where the proxy server should be located.
Default:
"US"
Specifying Operating Systems with operatingSystems
Type:
array
Items:
string
Enum:
["windows", "android", "macos", "linux", "ios"]
Description: An array of operating systems to use for fingerprinting.
Default:
undefined
Choosing Device Types with device
Type:
array
Items:
string
Enum:
["desktop", "mobile"]
Description: An array of device types to use for fingerprinting.
Default:
undefined
Selecting Browser Platforms with platform
Type:
array
Items:
string
Enum:
["chrome", "firefox", "safari", "edge"]
Description: An array of browser platforms to use for fingerprinting.
Default:
undefined
Setting Browser Locales with locales
Type:
array
Items:
string
Enum:
["en", "es", "fr", ...]
Description: An array of browser locales to specify the language for the browser.
Default:
["en"]
Customizing Screen Resolution with screen
Type:
object
Properties:
- width (number, default 1280): The screen width in pixels.
- height (number, default 720): The screen height in pixels.
Description: An object specifying the screen resolution to emulate in the session.
Solving CAPTCHAs Automatically with solveCaptchas
Type:
boolean
Description: When set to true, the session will attempt to automatically solve any CAPTCHAs encountered during the session.
Default:
false
Blocking Ads with adblock
Type:
boolean
Description: When set to true, the session will attempt to block ads and other unwanted content during the session.
Default:
false
Blocking Trackers with trackers
Type:
boolean
Description: When set to true, the session will attempt to block web trackers and other privacy-invasive technologies during the session.
Default:
false
Blocking Annoyances with annoyances
Type:
boolean
Description: When set to true, the session will attempt to block common annoyances like pop-ups, overlays, and other disruptive elements during the session.
Default:
false
Example
By configuring these options when making a scrape request, you can control the format and content of the scraped data, as well as the behavior of the scraper itself.
For example, to scrape a page with the following:
In stealth mode
With CAPTCHA solving
Return only the main content as HTML
Exclude any <span> elements
Wait 2 seconds after the page loads and before scraping
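Putting those requirements together, the request body might look like the following sketch. The field names match the options documented above; the target URL is a placeholder, and the exact endpoint and auth are covered in the API reference:

```python
# Scrape in stealth mode with CAPTCHA solving, returning only the
# main content as HTML, excluding <span> elements, and waiting
# 2 seconds (2000 ms) after page load before scraping.
payload = {
    "url": "https://example.com",  # placeholder target
    "sessionOptions": {
        "useStealth": True,
        "solveCaptchas": True,
    },
    "scrapeOptions": {
        "formats": ["html"],
        "onlyMainContent": True,
        "excludeTags": ["span"],
        "waitFor": 2000,
    },
}
```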
Crawl a Site
Instead of just scraping a single page, you might want to get all the content across multiple pages on a site. The /crawl endpoint is perfect for such a task. You can use the same sessionOptions and scrapeOptions as before for this endpoint as well. The crawl endpoint also has some extra parameters that tailor the crawl to your scraping needs.
Crawl Options
Limiting the Number of Pages to Crawl with maxPages
Type:
integer
Minimum: 1
Description: The maximum number of pages to crawl before stopping.
Following Links with followLinks
Type:
boolean
Default:
true
Description: When set to true, the crawler will follow links found on the pages it visits, allowing it to discover new pages and expand the scope of the crawl. When set to false, the crawler will only visit the starting URL and any explicitly specified pages, without following any additional links.
Ignoring the Sitemap with ignoreSitemap
Type:
boolean
Default:
false
Description: When set to true, the crawler will not pre-generate a list of URLs from any sitemaps it finds. By default, the crawler tries to locate sitemaps beginning at the base URL of the URL provided in the url param.
Excluding Pages with excludePatterns
Type:
array
Items:
string
Description: An array of regular expressions or wildcard patterns specifying which URLs should be excluded from the crawl. Any page whose URL path matches one of these patterns will be skipped.
Including Pages with includePatterns
Type:
array
Items:
string
Description: An array of regular expressions or wildcard patterns specifying which URLs should be included in the crawl. Only pages whose URL path matches one of these patterns will be visited.
Example
By configuring these options when initiating a crawl, you can control the scope and behavior of the crawler to suit your specific needs.
For example, to crawl a site with the following:
Maximum of 5 pages
Only include /blog pages
Return only the main content as HTML
Exclude any <span> elements
Wait 2 seconds after the page loads and before scraping
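A crawl request along those lines might look like the sketch below. Placing maxPages and includePatterns at the top level (alongside the shared scrapeOptions) is an assumption based on the "extra parameters" note above; the start URL and pattern are illustrative:

```python
# Crawl up to 5 pages, restricted to /blog pages, returning only the
# main content as HTML without <span> elements, and waiting 2 seconds
# (2000 ms) after each page load before scraping.
payload = {
    "url": "https://example.com",   # placeholder start URL
    "maxPages": 5,
    "includePatterns": ["/blog/*"],  # wildcard path pattern
    "scrapeOptions": {
        "formats": ["html"],
        "onlyMainContent": True,
        "excludeTags": ["span"],
        "waitFor": 2000,
    },
}
```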