🦙LlamaIndex

Using Hyperbrowser's Web Reader Integration

Installation and Setup

To get started with LlamaIndex and Hyperbrowser, you can install the necessary packages using pip:

pip install llama-index-core llama-index-readers-web hyperbrowser

Then configure your credentials by setting the following environment variable:

HYPERBROWSER_API_KEY=<your-api-key>

You can get an API key from the dashboard. Once you have it, add it to your .env file as HYPERBROWSER_API_KEY, or pass it via the api_key argument of the HyperbrowserWebReader constructor.
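As a minimal sketch of the two options above, the helper below prefers an explicitly passed key and otherwise falls back to the HYPERBROWSER_API_KEY environment variable. The names resolve_api_key and make_reader are hypothetical and are not part of either library; if you keep the key in a .env file, this assumes something has already loaded that file into the process environment.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Return the explicit key if given, else read HYPERBROWSER_API_KEY.

    Hypothetical helper for illustration; not part of LlamaIndex or Hyperbrowser.
    """
    key = explicit_key or os.environ.get("HYPERBROWSER_API_KEY")
    if not key:
        raise RuntimeError(
            "No API key found: set HYPERBROWSER_API_KEY or pass a key explicitly."
        )
    return key


def make_reader(explicit_key: Optional[str] = None):
    """Build a reader using the resolved key (hypothetical convenience wrapper)."""
    # Imported lazily so resolve_api_key can be used without the package installed.
    from llama_index.readers.web import HyperbrowserWebReader

    return HyperbrowserWebReader(api_key=resolve_api_key(explicit_key))
```

Failing early with a clear message when no key is available tends to be easier to debug than an authentication error surfacing on the first request.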

Usage

Once you have your API key and have installed the packages, you can load webpages into LlamaIndex using HyperbrowserWebReader.

from llama_index.readers.web import HyperbrowserWebReader

reader = HyperbrowserWebReader(api_key="your_api_key_here")

To load data, specify the operation the loader should perform. The default operation is scrape. For scrape, you can provide a single URL or a list of URLs. For crawl, you can provide only a single URL; the operation crawls that page and its subpages and returns a document for each page. HyperbrowserWebReader supports both regular and lazy loading, in sync and async modes.

documents = reader.load_data(
    urls=["https://example.com"],
    operation="scrape",
)
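The URL rules and lazy loading described above can be sketched with a small wrapper. Both normalize_urls and iter_documents are hypothetical helpers, not part of the reader. The sketch also assumes the reader exposes the lazy_load_data method that LlamaIndex readers conventionally provide, and that it accepts the same urls and operation arguments as load_data.

```python
from typing import Iterator, List, Union


def normalize_urls(urls: Union[str, List[str]], operation: str = "scrape") -> List[str]:
    """Return the URLs as a list, enforcing the one-URL rule for crawl.

    Hypothetical helper for illustration; not part of HyperbrowserWebReader.
    """
    url_list = [urls] if isinstance(urls, str) else list(urls)
    if not url_list:
        raise ValueError("At least one URL is required.")
    if operation == "crawl" and len(url_list) != 1:
        raise ValueError("The crawl operation accepts exactly one URL.")
    return url_list


def iter_documents(reader, urls: Union[str, List[str]], operation: str = "scrape") -> Iterator:
    """Validate the arguments up front, then stream documents one at a time."""
    url_list = normalize_urls(urls, operation)
    # Assumption: lazy_load_data takes the same arguments as load_data.
    return iter(reader.lazy_load_data(urls=url_list, operation=operation))
```

With a real reader, `for doc in iter_documents(reader, "https://example.com", operation="crawl")` would let you process each page as it is yielded instead of building the full list first.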

Optional parameters for the loader can be passed in the params argument. For details on the supported parameters, see the scraping guide.

Parameter names use snake_case in Python code, so for example the option is max_pages rather than maxPages.

# Scrape
documents = reader.load_data(
    urls=["https://example.com"],
    operation="scrape",
    params={"scrape_options": {"include_tags": ["h1", "h2", "p"]}},
)

# Crawl
documents = reader.load_data(
    urls=["https://example.com"],
    operation="crawl",
    params={
        "max_pages": 10,
        "scrape_options": {
            "formats": ["markdown"],
        },
        "session_options": {
            "use_stealth": True,
        }
    }
)
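If you are copying options from camelCase examples, a small converter can apply the snake_case convention for you. This is only a sketch: camel_to_snake and snake_case_params are hypothetical helpers, and apart from maxPages the camelCase spellings in the usage note are assumed by analogy with the snake_case names used above.

```python
import re
from typing import Any


def camel_to_snake(name: str) -> str:
    """Convert a camelCase key such as maxPages to max_pages."""
    # Insert an underscore before every uppercase letter except a leading one.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def snake_case_params(value: Any) -> Any:
    """Recursively rename dictionary keys to snake_case, leaving values untouched."""
    if isinstance(value, dict):
        return {camel_to_snake(k): snake_case_params(v) for k, v in value.items()}
    if isinstance(value, list):
        return [snake_case_params(v) for v in value]
    return value
```

For instance, `snake_case_params({"maxPages": 10, "sessionOptions": {"useStealth": True}})` yields a dictionary with the keys max_pages, session_options, and use_stealth, which could then be passed as params. Keys that are already snake_case pass through unchanged.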
