Scraping
Advanced Options for Hyperbrowser Scraping
Scraping a web page
By supplying just a URL, you can easily extract the contents of a page in Markdown format with the /scrape endpoint.
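As a rough sketch (the base URL and auth header below are assumptions for illustration; only the /scrape path and the url field come from this page), the simplest request could look like this in Python:

```python
import os
import requests

# Assumed base URL and auth header; check the Hyperbrowser API reference for the exact values.
API_BASE = "https://api.hyperbrowser.ai"
headers = {"x-api-key": os.environ["HYPERBROWSER_API_KEY"]}

# Supplying just a URL returns the page content, in Markdown by default.
resp = requests.post(f"{API_BASE}/scrape", json={"url": "https://example.com"}, headers=headers)
resp.raise_for_status()
print(resp.json())
```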
Now, let's take an in-depth look at all the provided options for scraping.
Session Options
All Scraping APIs, like Scrape, Crawl, and Extract, support the session parameters. You can see the session parameters listed here.
Scrape Options
formats
Type: array
Items: string
Enum: ["html", "links", "markdown", "screenshot"]
Description: Choose the formats to include in the API response:
html - Returns the scraped content as HTML.
links - Includes a list of links found on the page.
markdown - Provides the content in Markdown format.
screenshot - Provides a screenshot of the page.
Default: ["markdown"]
includeTags
Type: array
Items: string
Description: Provide an array of HTML tags, classes, or IDs to include in the scraped content. Only elements matching these selectors will be returned.
Default: undefined
excludeTags
Type: array
Items: string
Description: Provide an array of HTML tags, classes, or IDs to exclude from the scraped content. Elements matching these selectors will be omitted from the response.
Default: undefined
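For instance, to keep just an article body while dropping navigation and ad elements, the relevant portion of the scrape options might look like this (the selectors are purely illustrative):

```python
# Illustrative scrapeOptions fragment: keep the article content, drop navigation and ads.
scrape_options = {
    "includeTags": ["article", ".post-content"],
    "excludeTags": ["nav", ".ad-banner", "footer"],
}
```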
onlyMainContent
Type: boolean
Description: When set to true (default), the API will attempt to return only the main content of the page, excluding common elements like headers, navigation menus, and footers. Set to false to return the full page content.
Default: true
waitFor
Type: number
Description: Specify a delay in milliseconds to wait after the page loads before initiating the scrape. This can be useful for allowing dynamic content to fully render. This is also useful for waiting to detect CAPTCHAs on the page if you have solveCaptchas set to true in the sessionOptions.
Default: 0
timeout
Type: number
Description: Specify the maximum time in milliseconds to wait for the page to load before timing out; this is roughly equivalent to the browser-level navigation timeout sketched below.
Default: 30000 (30 seconds)
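The service applies this timeout internally; as an illustration only, it behaves roughly like setting the navigation timeout in a browser-automation tool such as Playwright:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Comparable to the default above: give navigation up to 30000 ms before timing out.
    page.goto("https://example.com", timeout=30000)
    browser.close()
```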
waitUntil
Type: string
Enum: ["load", "domcontentloaded", "networkidle"]
Description: Specify the condition to wait for the page to load:
domcontentloaded - Wait until the HTML is fully parsed and the DOM is ready.
load - Wait until the DOM and all resources are completely loaded.
networkidle - Wait until no more network requests occur for a certain period of time.
Default: load
screenshotOptions
Type: object
Description: Configurations for the returned screenshot. Only applicable if screenshot is provided in the formats array.
Properties:
fullPage - Take a screenshot of the full page beyond the viewport. Type: boolean. Default: false.
format - The image type of the screenshot. Type: string. Enum: ["webp", "jpeg", "png"]. Default: webp.
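For example, requesting a full-page PNG screenshot alongside the Markdown content could use scrape options like these:

```python
# Illustrative scrapeOptions fragment: Markdown plus a full-page PNG screenshot.
scrape_options = {
    "formats": ["markdown", "screenshot"],
    "screenshotOptions": {
        "fullPage": True,
        "format": "png",
    },
}
```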
Example
By configuring these options when making a scrape request, you can control the format and content of the scraped data, as well as the behavior of the scraper itself.
For example, to scrape a page with the following:
In stealth mode
With CAPTCHA solving
Return only the main content as HTML
Exclude any <span> elements
Wait 2 seconds after the page loads and before scraping
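A hedged sketch of such a request (the base URL, auth header, and the useStealth field name are assumptions; the remaining option names come from this page):

```python
import os
import requests

API_BASE = "https://api.hyperbrowser.ai"  # assumed; see the API reference
headers = {"x-api-key": os.environ["HYPERBROWSER_API_KEY"]}  # assumed header name

payload = {
    "url": "https://example.com",
    "sessionOptions": {
        "useStealth": True,       # assumed field name for stealth mode
        "solveCaptchas": True,    # enable CAPTCHA solving
    },
    "scrapeOptions": {
        "formats": ["html"],      # return HTML instead of the Markdown default
        "onlyMainContent": True,  # strip headers, navigation menus, and footers
        "excludeTags": ["span"],  # drop all <span> elements
        "waitFor": 2000,          # wait 2 seconds after load before scraping
    },
}

resp = requests.post(f"{API_BASE}/scrape", json=payload, headers=headers)
resp.raise_for_status()
print(resp.json())
```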
Crawl a Site
Instead of just scraping a single page, you might want to get all the content across multiple pages on a site. The /crawl endpoint is perfect for such a task. You can use the same sessionOptions and scrapeOptions as before for this endpoint as well. The crawl endpoint does have some extra parameters that are used to tailor the crawl to your scraping needs.
Crawl Options
Limiting the Number of Pages to Crawl with maxPages
Type: integer
Minimum: 1
Description: The maximum number of pages to crawl before stopping.
Following Links with followLinks
Type: boolean
Default: true
Description: When set to true, the crawler will follow links found on the pages it visits, allowing it to discover new pages and expand the scope of the crawl. When set to false, the crawler will only visit the starting URL and any explicitly specified pages, without following any additional links.
Ignoring the Sitemap with ignoreSitemap
Type: boolean
Default: false
Description: When set to true, the crawler will not pre-generate a list of URLs from potential sitemaps it finds. The crawler will try to locate sitemaps beginning at the base URL of the URL provided in the url param.
Excluding Pages with excludePatterns
Type: array
Items: string
Description: An array of regular expressions or wildcard patterns specifying which URLs should be excluded from the crawl. Any pages whose URL path matches one of these patterns will be skipped.
Including Pages with includePatterns
Type: array
Items: string
Description: An array of regular expressions or wildcard patterns specifying which URLs should be included in the crawl. Only pages whose URL path matches one of these patterns will be visited.
Example
By configuring these options when initiating a crawl, you can control the scope and behavior of the crawler to suit your specific needs.
For example, to crawl a site with the following:
Maximum of 5 pages
Only include /blog pages
Return only the main content as HTML
Exclude any <span> elements
Wait 2 seconds after the page loads and before scraping
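A sketch of the corresponding /crawl request, under the same assumptions about the base URL and auth header as in the scrape example (the /blog pattern syntax is illustrative):

```python
import os
import requests

API_BASE = "https://api.hyperbrowser.ai"  # assumed; see the API reference
headers = {"x-api-key": os.environ["HYPERBROWSER_API_KEY"]}  # assumed header name

payload = {
    "url": "https://example.com",
    "maxPages": 5,                   # stop after 5 pages
    "includePatterns": ["/blog/*"],  # only visit /blog pages (pattern syntax illustrative)
    "scrapeOptions": {
        "formats": ["html"],
        "onlyMainContent": True,
        "excludeTags": ["span"],
        "waitFor": 2000,
    },
}

resp = requests.post(f"{API_BASE}/crawl", json=payload, headers=headers)
resp.raise_for_status()
print(resp.json())
```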
Structured Extraction
The Extract API allows you to fetch data in a well-defined structure from any webpage or website with just a few lines of code. You can provide a list of web pages, and Hyperbrowser will collate all the information together and extract the data that best fits the provided schema (or prompt). You have access to the same sessionOptions available here as well.
Extract Options
Specifying all pages to collect data from with urls
Type: array
Items: string
Required: Yes
Description: List of URLs to extract data from. To crawl a site, add /* to a URL (e.g., https://example.com/*). This will crawl other pages on the site with the same origin and find relevant pages to use for the extraction context.
Specify the extraction schema with schema
Type: object
Required: No
Description: JSON schema defining the structure of the data you want to extract. Gives the best results with clear data structure requirements.
Note: You must provide either a schema or a prompt. If both are provided, the schema takes precedence.
Default: undefined
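For example, a simple schema asking for a product name and price could be expressed as a plain JSON Schema object (the field names are purely illustrative):

```python
# Illustrative JSON Schema for structured extraction: a product name and price.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["name", "price"],
}
```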
Specify the data to be extracted with a prompt
Type: string
Required: No
Description: A prompt describing how you want the data structured. Useful if you don't have a specific schema in mind.
Note: You must provide either a schema or a prompt. If both are provided, the schema takes precedence.
Default: undefined
Further specify the extraction process with a systemPrompt
Type: string
Required: No
Description: Additional instructions for the extraction process to guide the AI's behavior.
Default: undefined
Specify the number of pages to collect information from with maxLinks
Type: number
Description: Maximum number of links to follow when crawling a site for any given URL with a /* suffix.
Default: undefined
Time to wait on a page before extraction using waitFor
Type: number
Description: Time in milliseconds to wait after page load before extraction. This can be useful for allowing dynamic content to fully render or for waiting to detect CAPTCHAs if you have solveCaptchas set to true.
Default: 0
Set options for the session with sessionOptions
Type: object
Default: undefined
One of schema or prompt must be defined.
Example
By configuring these options when initiating a structured extraction, you can control the scope and behavior to suit your specific needs.
For example, to run an extraction with the following:
Maximum of 5 pages
Include /products on example.com, and as many subsequent pages as possible on test.com
Return the extracted data in the specified schema
Wait 2 seconds after the page loads and before scraping
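A sketch of the corresponding request (the /extract path, base URL, and auth header are assumptions; the option names mirror the parameters described above, and the schema is the illustrative one from earlier):

```python
import os
import requests

API_BASE = "https://api.hyperbrowser.ai"  # assumed; see the API reference
headers = {"x-api-key": os.environ["HYPERBROWSER_API_KEY"]}  # assumed header name

payload = {
    "urls": [
        "https://example.com/products",  # a single page
        "https://test.com/*",            # crawl same-origin pages for extra context
    ],
    "maxLinks": 5,  # cap on pages followed for any /* URL
    "schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
        },
        "required": ["name", "price"],
    },
    "waitFor": 2000,  # wait 2 seconds after load before extracting
}

resp = requests.post(f"{API_BASE}/extract", json=payload, headers=headers)  # path assumed
resp.raise_for_status()
print(resp.json())
```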
For more detail, check out the Extract page.