Crawl

Crawl a website and its links to extract all its data

Last updated 1 month ago

The Crawl API allows you to crawl through an entire website and get all its data with a single call.

For detailed usage, check out the

Hyperbrowser exposes endpoints for starting a crawl request and for getting its status and results. By default, crawling is asynchronous: you first start the job and then check its status until it completes. However, our SDKs provide a single function that handles the whole flow and returns the data once the job is completed.

Installation

npm install @hyperbrowser/sdk

yarn add @hyperbrowser/sdk

pip install hyperbrowser

Usage

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  // Handles both starting and waiting for crawl job response
  const crawlResult = await client.crawl.startAndWait({
    url: "https://hyperbrowser.ai",
  });
  console.log("Crawl result:", crawlResult);
};

main();

import os
from dotenv import load_dotenv
from hyperbrowser import Hyperbrowser
from hyperbrowser.models import StartCrawlJobParams

# Load environment variables from .env file
load_dotenv()

# Initialize Hyperbrowser client
client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


# Start crawling and wait for completion
crawl_result = client.crawl.start_and_wait(
    StartCrawlJobParams(url="https://hyperbrowser.ai")
)
print("Crawl result:", crawl_result)

Start Crawl Job

curl -X POST https://app.hyperbrowser.ai/api/crawl \
    -H 'Content-Type: application/json' \
    -H 'x-api-key: <YOUR_API_KEY>' \
    -d '{
        "url": "https://hyperbrowser.ai"
    }'

Get Crawl Job Status and Data

curl https://app.hyperbrowser.ai/api/crawl/{jobId} \
    -H 'x-api-key: <YOUR_API_KEY>'
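
The two requests above can be chained into a simple polling loop. The following is a minimal sketch using only Python's standard library. The helper names (`build_start_request`, `build_status_request`, `send`, `wait_for_crawl`), the polling interval, and the timeout are illustrative assumptions, not part of the Hyperbrowser API. The endpoint URLs and the `x-api-key` header come from the cURL examples, and `completed` and `failed` are treated as the terminal statuses listed in the Response section.

```python
import json
import time
import urllib.request
from typing import Callable

API_BASE = "https://app.hyperbrowser.ai/api"  # base URL from the cURL examples


def build_start_request(api_key: str, url: str) -> urllib.request.Request:
    """Build the POST /crawl request that starts a crawl job."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/crawl",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "x-api-key": api_key},
    )


def build_status_request(api_key: str, job_id: str) -> urllib.request.Request:
    """Build the GET /crawl/{jobId} request that fetches status and data."""
    return urllib.request.Request(
        f"{API_BASE}/crawl/{job_id}",
        method="GET",
        headers={"x-api-key": api_key},
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_crawl(
    api_key: str,
    url: str,
    send_fn: Callable[[urllib.request.Request], dict] = send,
    interval: float = 2.0,   # assumed polling interval, in seconds
    timeout: float = 300.0,  # assumed overall deadline, in seconds
) -> dict:
    """Start a crawl job, then poll until it is completed or failed."""
    job_id = send_fn(build_start_request(api_key, url))["jobId"]
    deadline = time.monotonic() + timeout
    while True:
        job = send_fn(build_status_request(api_key, job_id))
        if job["status"] in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl job {job_id} did not finish in {timeout}s")
        time.sleep(interval)
```

Passing the transport in as `send_fn` keeps the loop testable without network access; in normal use the default `send` performs the real HTTP calls.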

Response

The Start Crawl Job endpoint (POST /crawl) returns a jobId in the response, which can be used to get information about the job in subsequent requests.

{
    "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}

The Get Crawl Job endpoint (GET /crawl/{jobId}) returns the following data:

{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "totalCrawledPages": 2,
  "totalPageBatches": 1,
  "currentPageBatch": 1,
  "batchSize": 20,
  "data": [
    {
      "url": "https://example.com",
      "status": "completed",
      "metadata": {
        "title": "Example Page",
        "description": "A sample webpage"
      },
      "markdown": "# Example Page\nThis is content..."
    },
    ...
  ]
}

The status of a crawl job can be one of pending, running, completed, or failed. The response can also include optional fields: error, with an error message if an error was encountered, and html and links in each data object, depending on which formats were requested.
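
To make the optional fields concrete, here is a small sketch that parses a Get Crawl Job response into typed Python objects. The class and function names are hypothetical. The field names come from the sample response above, and `error`, `html`, and `links` are treated as optional because they appear only in some responses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CrawledPage:
    url: str
    status: str  # "completed" or "failed"
    metadata: dict = field(default_factory=dict)
    markdown: Optional[str] = None
    html: Optional[str] = None
    links: Optional[list] = None
    error: Optional[str] = None


@dataclass
class CrawlJob:
    job_id: str
    status: str  # "pending", "running", "completed", or "failed"
    pages: list = field(default_factory=list)
    error: Optional[str] = None


def parse_crawl_job(payload: dict) -> CrawlJob:
    """Convert the raw JSON dictionary into a CrawlJob with CrawledPage items."""
    pages = [
        CrawledPage(
            url=item["url"],
            status=item["status"],
            metadata=item.get("metadata", {}),
            markdown=item.get("markdown"),
            html=item.get("html"),
            links=item.get("links"),
            error=item.get("error"),
        )
        # "data" may be absent while the job is still pending or running
        for item in payload.get("data") or []
    ]
    return CrawlJob(
        job_id=payload["jobId"],
        status=payload["status"],
        pages=pages,
        error=payload.get("error"),
    )
```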

Unlike the scrape endpoint, the crawl endpoint returns a list in the data field with all the pages that were crawled in the current page batch. The SDKs also provide a function that starts the crawl job, waits until it's complete, and returns all the crawled pages for the entire crawl.
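
Because each response carries only one page batch, collecting every page over the raw API means walking the batches. The sketch below shows that loop in isolation. `fetch_batch` is a hypothetical callable that returns the Get Crawl Job response for a given batch number; how a specific batch is requested is not covered on this page, so that detail is left to the caller. Batch numbering is assumed to start at 1, matching `currentPageBatch` in the sample response.

```python
from typing import Callable


def collect_all_pages(fetch_batch: Callable[[int], dict]) -> list:
    """Gather the `data` entries from every page batch of a finished crawl."""
    first = fetch_batch(1)
    pages = list(first.get("data", []))
    # totalPageBatches tells us how many more batches remain to be fetched
    total_batches = first.get("totalPageBatches", 1)
    for batch_number in range(2, total_batches + 1):
        pages.extend(fetch_batch(batch_number).get("data", []))
    return pages
```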

Each crawled page has its own status of completed or failed and can have its own error field, so check each page's status rather than relying only on the job-level status.
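
A job-level status of completed therefore does not guarantee that every page succeeded. As a minimal sketch, the helper below splits the data list into successful and failed pages. The function name is illustrative, and the field names follow the sample response above.

```python
def split_pages_by_status(job: dict) -> tuple:
    """Return (succeeded, failed) lists of page dictionaries from a crawl job."""
    succeeded, failed = [], []
    for page in job.get("data", []):
        if page.get("status") == "completed":
            succeeded.append(page)
        else:
            failed.append(page)
    return succeeded, failed


def report_failures(job: dict) -> list:
    """Build one human-readable line per failed page, including its error."""
    _, failed = split_pages_by_status(job)
    return [f"{page['url']}: {page.get('error', 'unknown error')}" for page in failed]
```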

Additional Crawl Configurations

The crawl endpoint accepts additional parameters to tailor the crawl to your needs. You can narrow down the pages crawled by limiting the maximum number of pages visited, including only paths that match a certain pattern, excluding paths that match another pattern, and so on.

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  // Handles both starting and waiting for crawl job response
  const crawlResult = await client.crawl.startAndWait({
    url: "https://hyperbrowser.ai",
    maxPages: 5,
    includePatterns: ["/blogs/*"],
  });
  console.log("Crawl result:", crawlResult);
};

main();

import os
from dotenv import load_dotenv
from hyperbrowser import Hyperbrowser
from hyperbrowser.models import StartCrawlJobParams

# Load environment variables from .env file
load_dotenv()

# Initialize Hyperbrowser client
client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


# Start crawling and wait for completion
crawl_result = client.crawl.start_and_wait(
    StartCrawlJobParams(
        url="https://hyperbrowser.ai",
        max_pages=5,
        include_patterns=["/blogs/*"],
    )
)
print("Crawl result:", crawl_result)
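
Before spending a crawl on a pattern, it can help to preview which paths it is likely to select. The sketch below approximates include and exclude patterns with Python's `fnmatch` module. This is an assumption for illustration only: the exact matching rules the crawl endpoint applies are not specified on this page, and the function name `path_allowed` and its `include` and `exclude` arguments are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def path_allowed(url: str, include=(), exclude=()) -> bool:
    """Approximate whether a URL's path passes the include/exclude patterns."""
    path = urlparse(url).path or "/"
    # An exclude match always wins in this sketch (assumed precedence)
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    # With no include patterns, every non-excluded path is allowed
    if include:
        return any(fnmatchcase(path, pattern) for pattern in include)
    return True
```

For example, `path_allowed("https://hyperbrowser.ai/blogs/post-1", include=["/blogs/*"])` is true under these assumptions, while a URL whose path is `/pricing` would be filtered out.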

Session Configurations

You can also provide configurations for the session that will be used to execute the crawl job, just as you would when creating a new session. These could include using a proxy or solving CAPTCHAs. To see all the available session parameters, check out the or .

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const crawlResult = await client.crawl.startAndWait({
    url: "https://example.com",
    sessionOptions: {
      useProxy: true,
      solveCaptchas: true,
      proxyCountry: "US",
      locales: ["en"],
    },
  });
  console.log("Crawl result:", crawlResult);
};

main();

import os
from dotenv import load_dotenv
from hyperbrowser import Hyperbrowser
from hyperbrowser.models import StartCrawlJobParams, CreateSessionParams

# Load environment variables from .env file
load_dotenv()

# Initialize Hyperbrowser client
client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


# Start crawling and wait for completion
crawl_result = client.crawl.start_and_wait(
    StartCrawlJobParams(
        url="https://example.com",
        session_options=CreateSessionParams(use_proxy=True, solve_captchas=True),
    )
)
print("Crawl result:", crawl_result)

Using a proxy and solving CAPTCHAs will slow down the crawl, so use them only if necessary.
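
One way to follow this advice is to try a plain crawl first and enable the proxy and CAPTCHA solving only when that attempt fails. The sketch below illustrates the idea. `run_crawl` is a hypothetical callable that takes session options (or `None`) and returns a job dictionary with a `status` field, for example a thin wrapper around the SDK call shown above. Treating a `failed` status as the trigger for escalation is an assumption.

```python
from typing import Callable, Optional


def crawl_with_fallback(run_crawl: Callable[[Optional[dict]], dict]) -> dict:
    """Try a fast crawl first; retry with proxy and CAPTCHA solving on failure."""
    result = run_crawl(None)  # no session options: the fastest path
    if result.get("status") != "failed":
        return result
    # Escalate only when needed, since these options slow the crawl down
    return run_crawl({"useProxy": True, "solveCaptchas": True})
```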

Scrape Configurations

You can also provide optional scrape options for the crawl job, such as the formats to return, whether to return only the main content of the page, and the maximum timeout for navigating to a page.

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const crawlResult = await client.crawl.startAndWait({
    url: "https://example.com",
    scrapeOptions: {
      formats: ["markdown", "html", "links"],
      onlyMainContent: false,
      timeout: 10000,
    },
  });
  console.log("Crawl result:", crawlResult);
};

main();

import os
from dotenv import load_dotenv
from hyperbrowser import Hyperbrowser
from hyperbrowser.models import ScrapeOptions, StartCrawlJobParams

# Load environment variables from .env file
load_dotenv()

# Initialize Hyperbrowser client
client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


# Start crawling and wait for completion
crawl_result = client.crawl.start_and_wait(
    StartCrawlJobParams(
        url="https://example.com",
        scrape_options=ScrapeOptions(
            formats=["html", "links", "markdown"], only_main_content=False, timeout=10000
        ),
    )
)
print("Crawl result:", crawl_result)
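
Once the requested formats come back, a common next step is to write each page to disk. The following sketch saves the markdown of every completed page to a file named after its URL. The function names and the file-naming scheme are illustrative assumptions; pages are plain dictionaries shaped like the entries in the data list of the Get Crawl Job response.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def slug_for(url: str) -> str:
    """Turn a URL into a filesystem-friendly name based on host and path."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    return re.sub(r"[^A-Za-z0-9._-]+", "-", raw) or "index"


def save_markdown(pages: list, out_dir: str) -> list:
    """Write the markdown of each completed page to out_dir; return the paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for page in pages:
        markdown = page.get("markdown")
        # Skip failed pages and pages where markdown was not requested
        if page.get("status") != "completed" or not markdown:
            continue
        path = target / f"{slug_for(page['url'])}.md"
        path.write_text(markdown, encoding="utf-8")
        written.append(path)
    return written
```

Two URLs that differ only in their query string map to the same file name in this sketch, so a real pipeline may want a more collision-resistant naming scheme.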

For a full reference on the crawl endpoint, check out the , or read the to see more advanced options for scraping.