🦙 LlamaIndex
Using Hyperbrowser's Web Reader Integration
Installation and Setup
To get started with LlamaIndex and Hyperbrowser, you can install the necessary packages using pip:
pip install llama-index-core llama-index-readers-web hyperbrowser
Then configure your credentials by setting the following environment variable:
HYPERBROWSER_API_KEY=<your-api-key>
You can get an API Key easily from the dashboard. Once you have your API Key, add it to your .env file as HYPERBROWSER_API_KEY, or pass it via the api_key argument in the HyperbrowserWebReader constructor.
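If you keep the key in a .env file, you can load it into the environment before constructing the reader. The following is a minimal sketch assuming the python-dotenv package (pip install python-dotenv), which is a separate dependency and not part of the integration itself:

import os

from dotenv import load_dotenv

load_dotenv()  # reads HYPERBROWSER_API_KEY from .env into the process environment
api_key = os.environ["HYPERBROWSER_API_KEY"]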
Usage
Once you have your API Key and have installed the packages, you can load webpages into LlamaIndex using HyperbrowserWebReader.
from llama_index.readers.web import HyperbrowserWebReader
reader = HyperbrowserWebReader(api_key="your_api_key_here")
To load data, specify the operation to be performed by the loader. The default operation is scrape. For scrape, you can provide a single URL or a list of URLs to be scraped. For crawl, you can only provide a single URL; the crawl operation will crawl the provided page and its subpages and return a document for each page. HyperbrowserWebReader supports loading and lazy loading data in both sync and async modes, as sketched after the example below.
documents = reader.load_data(
    urls=["https://example.com"],
    operation="scrape",
)
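For lazy or async workflows, a minimal sketch follows. The lazy_load_data and aload_data method names here are assumptions based on the standard LlamaIndex reader interface rather than confirmed Hyperbrowser documentation:

import asyncio

# Lazy loading: yields documents one at a time instead of materializing the full list.
for doc in reader.lazy_load_data(urls=["https://example.com"], operation="scrape"):
    print(doc.metadata)

# Async loading: assumed async counterpart of load_data, per LlamaIndex conventions.
async def main():
    documents = await reader.aload_data(
        urls=["https://example.com"],
        operation="scrape",
    )
    print(len(documents))

asyncio.run(main())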
Optional params for the loader can also be provided in the params argument. For more information on the supported params, see the scraping guide.
# Scrape
documents = reader.load_data(
    urls=["https://example.com"],
    operation="scrape",
    params={"scrape_options": {"include_tags": ["h1", "h2", "p"]}},
)
# Crawl
documents = reader.load_data(
    urls=["https://example.com"],
    operation="crawl",
    params={
        "max_pages": 10,
        "scrape_options": {
            "formats": ["markdown"],
        },
        "session_options": {
            "use_stealth": True,
        },
    },
)
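The returned documents drop into the usual LlamaIndex pipeline. As an illustrative sketch, the snippet below builds a vector index over the loaded pages and queries it; it assumes a default embedding model and LLM are configured (for example via OPENAI_API_KEY):

from llama_index.core import VectorStoreIndex

# Build a vector index over the scraped/crawled documents and run a query.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is this page about?")
print(response)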