Use Hyperbrowser to scrape a Wikipedia page and extract information with an LLM
In this guide, we'll use Hyperbrowser's Node.js SDK to scrape a Wikipedia page as clean markdown and then feed it into an LLM like ChatGPT to extract the information we want. Our goal is to get a list of the world's most populous cities.
Setup
First, let's create a new Node.js project.
mkdir wiki-scraper && cd wiki-scraper
npm init -y
Installation
Next, let's install the necessary dependencies to run our script.
npm install @hyperbrowser/sdk dotenv openai zod
Setup your Environment
To use Hyperbrowser with your code, you will need an API key. You can get one easily from the dashboard. Once you have your API key, add it to your .env file as HYPERBROWSER_API_KEY. You will also need an OPENAI_API_KEY so ChatGPT can extract information from the scraped data.
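Your .env file should look something like this (the values below are placeholders; substitute your own keys):
HYPERBROWSER_API_KEY=your-hyperbrowser-api-key
OPENAI_API_KEY=your-openai-api-key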
Code
Next, create a new file scraper.js and add the following code. Because the script uses ES module import syntax, make sure your package.json includes "type": "module" (or name the file scraper.mjs instead):
import fs from "fs";
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
import { z } from "zod";
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const CitySchema = z.object({
  city: z.string(),
  country: z.string(),
  population: z.number(),
  rank: z.number(),
});

const ResponseSchema = z.object({ cities: z.array(CitySchema) });

const SYSTEM_PROMPT = `You are a helpful assistant that can extract information from markdown and convert it into a structured format.

Ensure the output adheres to the following:
- city: The name of the city
- country: The name of the country
- population: The population of the city
- rank: The rank of the city

Provide the extracted data as a JSON object. Parse the Markdown content carefully to identify and categorize the city details accurately.`;

const main = async () => {
  console.log("Started scraping");
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://en.wikipedia.org/wiki/List_of_largest_cities",
    scrapeOptions: {
      // Only return the markdown for the scraped data
      formats: ["markdown"],
      // Only include the table element with class `wikitable` from the page
      includeTags: [".wikitable"],
      // Remove any img tags from the table
      excludeTags: ["img"],
    },
  });
  console.log("Finished scraping");

  if (scrapeResult.status === "failed") {
    console.error("Scrape failed:", scrapeResult.error);
    return;
  }
  if (!scrapeResult.data.markdown) {
    console.error("No markdown data found in the scrape result");
    return;
  }

  console.log("Extracting data from markdown");
  const completion = await openai.beta.chat.completions.parse({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: SYSTEM_PROMPT,
      },
      { role: "user", content: scrapeResult.data.markdown },
    ],
    response_format: zodResponseFormat(ResponseSchema, "cities"),
  });
  console.log("Finished extracting data from markdown");

  const cities = completion.choices[0].message.parsed;
  const data = JSON.stringify(cities, null, 2);
  fs.writeFileSync("cities.json", data);
};

main();
With just a single call to the SDK's scrape.startAndWait function, we get back exactly the content we need from the page as properly formatted markdown. To narrow the response down to just the data we care about, we only include elements with the wikitable class and exclude any unnecessary image tags.
Once we have the markdown text, we can pass it to the parse helper from the openai library along with the response_format we want, and we get back our list of the most populous cities.
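Structured outputs can occasionally fail or be refused by the model, in which case parsed will be null. As an optional addition (a small sketch reusing the variable names from the script above), you could guard against that before writing the file:
const message = completion.choices[0].message;
if (!message.parsed) {
  // The model may refuse or fail to produce output matching the schema
  console.error("No structured data returned:", message.refusal ?? "unknown reason");
  return;
}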
Run the Scraper
Once you have the code copied, you can run the script with:
node scraper.js
If everything completes successfully, you should see a cities.json file in your project directory with the data in this format:
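The exact figures depend on the current state of the Wikipedia table, but the file follows the shape defined by ResponseSchema, roughly like this (values shown are illustrative only):
{
  "cities": [
    {
      "city": "Tokyo",
      "country": "Japan",
      "population": 37400068,
      "rank": 1
    },
    ...
  ]
}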