Scraping

Advanced Options for Hyperbrowser Scraping

Basic Usage

By supplying just a URL, you can easily extract the contents of a page in Markdown format with the /scrape endpoint.

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  // Handles both starting and waiting for scrape job response
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
  });
  console.log("Scrape result:", scrapeResult);
};

main();

Now, let's take an in-depth look at all of the available options for scraping.

Scrape Options

Configuring Response Formats with formats

  • Type: array

  • Items: string

  • Enum: ["html", "links", "markdown", "screenshot"]

  • Description: Choose the formats to include in the API response:

    • html: Returns the scraped content as HTML.

    • links: Includes a list of links found on the page.

    • markdown: Provides the content in Markdown format.

    • screenshot: Provides a screenshot of the page.

  • Default: ["markdown"]
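For instance, a request for several formats at once might be sketched like this (assuming the `client` from the Basic Usage example; the URL is a placeholder):

```typescript
// Sketch: request HTML, Markdown, and page links in a single response.
const scrapeOptions = {
  formats: ["html", "markdown", "links"],
};

// Passed to the client from the Basic Usage example:
// const result = await client.scrape.startAndWait({
//   url: "https://example.com",
//   scrapeOptions,
// });
```

Each requested format is then included as its own field in the scrape response.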

Specifying Tags to Include with includeTags

  • Type: array

  • Items: string

  • Description: Provide an array of HTML tags, classes, or IDs to include in the scraped content. Only elements matching these selectors will be returned.

  • Default: undefined

Specifying Tags to Exclude with excludeTags

  • Type: array

  • Items: string

  • Description: Provide an array of HTML tags, classes, or IDs to exclude from the scraped content. Elements matching these selectors will be omitted from the response.

  • Default: undefined
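The two options can be combined. A sketch (the selectors below are illustrative, not required values):

```typescript
// Sketch: keep only <article> elements and elements with class "content",
// while dropping <nav> elements and anything with id "footer".
const scrapeOptions = {
  includeTags: ["article", ".content"],
  excludeTags: ["nav", "#footer"],
};
```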

Focusing on Main Content with onlyMainContent

  • Type: boolean

  • Description: When set to true (default), the API will attempt to return only the main content of the page, excluding common elements like headers, navigation menus, and footers. Set to false to return the full page content.

  • Default: true

Delaying Scraping with waitFor

  • Type: number

  • Description: Specify a delay in milliseconds to wait after the page loads before initiating the scrape. This can be useful for allowing dynamic content to fully render.

  • Default: 0

Setting a Timeout with timeout

  • Type: number

  • Description: Specify the maximum time in milliseconds to wait for the page to load before timing out. This would be like doing:

await page.goto("https://example.com", { waitUntil: "load", timeout: 30000 })

  • Default: 30000 (30 seconds)
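The two timing options are often used together. A sketch, with illustrative values:

```typescript
// Sketch: wait 3 seconds after load for dynamic content to render,
// and give up if the page itself takes longer than 60 seconds to load.
const scrapeOptions = {
  waitFor: 3000,
  timeout: 60000,
};
```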

Session Options

Enabling Stealth Mode with useStealth

  • Type: boolean

  • Description: When set to true, the session will be launched in stealth mode, which employs various techniques to make the browser harder to detect as an automated tool.

  • Default: false

Using a Proxy with useProxy

  • Type: boolean

  • Description: When set to true, the session will be launched with a proxy server.

  • Default: false

Specifying a Custom Proxy Server with proxyServer

  • Type: string

  • Description: The hostname or IP address of the proxy server to use for the session. This option is only used when useProxy is set to true.

  • Default: undefined

Providing Proxy Server Authentication with proxyServerUsername and proxyServerPassword

  • Type: string

  • Description: The username and password to use for authenticating with the proxy server, if required. These options are only used when useProxy is set to true and the proxy server requires authentication.

  • Default: undefined
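Put together, a session routed through a custom authenticated proxy might be sketched as follows (the hostname and credentials are placeholders):

```typescript
// Sketch: route the session through a custom proxy that requires auth.
// All values below are placeholders, not real servers or credentials.
const sessionOptions = {
  useProxy: true,
  proxyServer: "proxy.example.com:8080",
  proxyServerUsername: "user",
  proxyServerPassword: "pass",
};
```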

Selecting a Proxy Location with proxyCountry

  • Type: string

  • Enum: ["US", "GB", "CA", ...]

  • Description: The country where the proxy server should be located.

  • Default: "US"

Specifying Operating Systems with operatingSystems

  • Type: array

  • Items: string

  • Enum: ["windows", "android", "macos", "linux", "ios"]

  • Description: An array of operating systems to use for fingerprinting.

  • Default: undefined

Choosing Device Types with device

  • Type: array

  • Items: string

  • Enum: ["desktop", "mobile"]

  • Description: An array of device types to use for fingerprinting.

  • Default: undefined

Selecting Browser Platforms with platform

  • Type: array

  • Items: string

  • Enum: ["chrome", "firefox", "safari", "edge"]

  • Description: An array of browser platforms to use for fingerprinting.

  • Default: undefined

Setting Browser Locales with locales

  • Type: array

  • Items: string

  • Enum: ["en", "es", "fr", ...]

  • Description: An array of browser locales to specify the language for the browser.

  • Default: ["en"]

Customizing Screen Resolution with screen

  • Type: object

  • Properties:

    • width (number, default 1280): The screen width in pixels.

    • height (number, default 720): The screen height in pixels.

  • Description: An object specifying the screen resolution to emulate in the session.
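The fingerprinting options above can be combined to constrain the generated browser profile. A sketch, with illustrative values:

```typescript
// Sketch: constrain the fingerprint to a desktop Chrome profile on
// Windows or macOS, with a Spanish locale and a 1080p screen.
const sessionOptions = {
  operatingSystems: ["windows", "macos"],
  device: ["desktop"],
  platform: ["chrome"],
  locales: ["es"],
  screen: { width: 1920, height: 1080 },
};
```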

Solving CAPTCHAs Automatically with solveCaptchas

  • Type: boolean

  • Description: When set to true, the session will attempt to automatically solve any CAPTCHAs encountered during the session.

  • Default: false

Blocking Ads with adblock

  • Type: boolean

  • Description: When set to true, the session will attempt to block ads and other unwanted content during the session.

  • Default: false

Blocking Trackers with trackers

  • Type: boolean

  • Description: When set to true, the session will attempt to block web trackers and other privacy-invasive technologies during the session.

  • Default: false

Blocking Annoyances with annoyances

  • Type: boolean

  • Description: When set to true, the session will attempt to block common annoyances like pop-ups, overlays, and other disruptive elements during the session.

  • Default: false
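Taken together, a hardened session that enables all of these protections might be sketched as:

```typescript
// Sketch: stealth mode plus automatic CAPTCHA solving, with ads,
// trackers, and pop-up annoyances all blocked.
const sessionOptions = {
  useStealth: true,
  solveCaptchas: true,
  adblock: true,
  trackers: true,
  annoyances: true,
};
```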

Example

By configuring these options when making a scrape request, you can control the format and content of the scraped data, as well as the behavior of the scraper itself.

For example, to scrape a page with the following:

  • In stealth mode

  • With CAPTCHA solving

  • Return only the main content as HTML

  • Exclude any <span> elements

  • Wait 2 seconds after the page loads and before scraping

curl -X POST https://app.hyperbrowser.ai/api/scrape \
    -H 'Content-Type: application/json' \
    -H 'x-api-key: <YOUR_API_KEY>' \
    -d '{
            "url": "https://example.com",
            "sessionOptions": {
                    "useStealth": true,
                    "solveCaptchas": true
            },
            "scrapeOptions": {
                    "formats": ["html"],
                    "onlyMainContent": true, 
                    "excludeTags": ["span"],
                    "waitFor": 2000
            }
    }'
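The same request can be expressed as a payload for the Node SDK (a sketch, using the `client` from the Basic Usage example):

```typescript
// Sketch: the same request as the curl example above, as an SDK payload.
const request = {
  url: "https://example.com",
  sessionOptions: {
    useStealth: true,
    solveCaptchas: true,
  },
  scrapeOptions: {
    formats: ["html"],
    onlyMainContent: true,
    excludeTags: ["span"],
    waitFor: 2000,
  },
};

// const scrapeResult = await client.scrape.startAndWait(request);
```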

Crawl a Site

Instead of just scraping a single page, you might want to get all the content across multiple pages on a site. The /crawl endpoint is perfect for such a task. You can use the same sessionOptions and scrapeOptions as before for this endpoint as well. The crawl endpoint does have some extra parameters that are used to tailor the crawl to your scraping needs.

Crawl Options

Limiting the Number of Pages to Crawl with maxPages

  • Type: integer

  • Minimum: 1

  • Description: The maximum number of pages to crawl before stopping.

Following Links with followLinks

  • Type: boolean

  • Default: true

  • Description: When set to true, the crawler will follow links found on the pages it visits, allowing it to discover new pages and expand the scope of the crawl. When set to false, the crawler will only visit the starting URL and any explicitly specified pages, without following any additional links.

Ignoring the Sitemap with ignoreSitemap

  • Type: boolean

  • Default: false

  • Description: By default, the crawler tries to locate sitemaps starting from the base URL of the URL provided in the url param, and uses them to pre-generate a list of URLs to visit. When set to true, the crawler skips this step and ignores any sitemaps it finds.

Excluding Pages with excludePatterns

  • Type: array

  • Items: string

  • Description: An array of regular expressions or wildcard patterns specifying which URLs should be excluded from the crawl. Any page whose URL path matches one of these patterns will be skipped.

Including Pages with includePatterns

  • Type: array

  • Items: string

  • Description: An array of regular expressions or wildcard patterns specifying which URLs should be included in the crawl. Only pages whose URL path matches one of these patterns will be visited.
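The two pattern options can be combined to scope a crawl precisely. A sketch (the paths below are illustrative):

```typescript
// Sketch: crawl only documentation pages, skipping anything under /blog.
// Patterns are matched against the URL path.
const crawlOptions = {
  includePatterns: ["/docs/*"],
  excludePatterns: ["/blog/*"],
};
```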

Example

By configuring these options when initiating a crawl, you can control the scope and behavior of the crawler to suit your specific needs.

For example, to crawl a site with the following:

  • Maximum of 5 pages

  • Only include /blog pages

  • Return only the main content as HTML

  • Exclude any <span> elements

  • Wait 2 seconds after the page loads and before scraping

curl -X POST https://app.hyperbrowser.ai/api/crawl \
    -H 'Content-Type: application/json' \
    -H 'x-api-key: <YOUR_API_KEY>' \
    -d '{
            "url": "https://example.com",
            "maxPages": 5,
            "includePatterns": ["/blog/*"],
            "scrapeOptions": {
                    "formats": ["html"],
                    "onlyMainContent": true, 
                    "excludeTags": ["span"],
                    "waitFor": 2000
            }
    }'
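The same crawl can be expressed as a payload for the Node SDK (a sketch, assuming the crawl endpoint is exposed as `client.crawl.startAndWait`, mirroring the scrape example):

```typescript
// Sketch: the same crawl as the curl example above, as an SDK payload.
const crawlRequest = {
  url: "https://example.com",
  maxPages: 5,
  includePatterns: ["/blog/*"],
  scrapeOptions: {
    formats: ["html"],
    onlyMainContent: true,
    excludeTags: ["span"],
    waitFor: 2000,
  },
};

// const crawlResult = await client.crawl.startAndWait(crawlRequest);
```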

© 2025 S2 Labs, Inc. All rights reserved.