Use Hyperbrowser to scrape a Wikipedia page and extract information with an LLM
In this guide, we'll use Hyperbrowser's Node.js SDK to scrape a Wikipedia page as clean markdown and then feed it into an LLM like ChatGPT to extract the information we want. Our goal is to get a list of the world's most populous cities.
Setup
First, let's create a new Node.js project.
mkdir wiki-scraper && cd wiki-scraper
npm init -y
Installation
Next, let's install the necessary dependencies to run our script.
npm install @hyperbrowser/sdk dotenv openai zod
Setup your Environment
To use Hyperbrowser with your code, you will need an API key. You can get one easily from the dashboard. Once you have your API key, add it to your .env file as HYPERBROWSER_API_KEY. You will also need an OPENAI_API_KEY so ChatGPT can extract information from the scraped data.
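Your .env file should look something like this (the values below are placeholders; substitute your own keys):
HYPERBROWSER_API_KEY=your-hyperbrowser-api-key
OPENAI_API_KEY=your-openai-api-key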
Code
Next, create a new file scraper.js and add the following code. Because the script uses ES module import syntax, make sure your package.json includes "type": "module" (or name the file scraper.mjs instead):
import fs from "fs";
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
import { z } from "zod";
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const CitySchema = z.object({
  city: z.string(),
  country: z.string(),
  population: z.number(),
  rank: z.number(),
});

const ResponseSchema = z.object({ cities: z.array(CitySchema) });

const SYSTEM_PROMPT = `You are a helpful assistant that can extract information from markdown and convert it into a structured format.

Ensure the output adheres to the following:
- city: The name of the city
- country: The name of the country
- population: The population of the city
- rank: The rank of the city

Provide the extracted data as a JSON object. Parse the Markdown content carefully to identify and categorize the city details accurately.`;

const main = async () => {
  console.log("Started scraping");
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://en.wikipedia.org/wiki/List_of_largest_cities",
    scrapeOptions: {
      // Only return the markdown for the scraped data
      formats: ["markdown"],
      // Only include the table element with class `wikitable` from the page
      includeTags: [".wikitable"],
      // Remove any img tags from the table
      excludeTags: ["img"],
    },
  });
  console.log("Finished scraping");

  if (scrapeResult.status === "failed") {
    console.error("Scrape failed:", scrapeResult.error);
    return;
  }
  if (!scrapeResult.data.markdown) {
    console.error("No markdown data found in the scrape result");
    return;
  }

  console.log("Extracting data from markdown");
  const completion = await openai.beta.chat.completions.parse({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: SYSTEM_PROMPT,
      },
      { role: "user", content: scrapeResult.data.markdown },
    ],
    response_format: zodResponseFormat(ResponseSchema, "cities"),
  });
  console.log("Finished extracting data from markdown");

  const cities = completion.choices[0].message.parsed;
  const data = JSON.stringify(cities, null, 2);
  fs.writeFileSync("cities.json", data);
};

main();
With just a single call to the SDK's scrape.startAndWait function, we get back exactly the content we need from the page as properly formatted markdown. To narrow the response down to just the data we care about, we only include elements with the wikitable class and exclude any unnecessary image tags.
Once we have the markdown text, we can pass it to the parse helper from the openai library along with the response_format we want, and we get back our list of the most populous cities.
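Structured outputs can occasionally fail or be refused by the model, in which case parsed will be null. As an optional addition (a small sketch reusing the variable names from the script above), you could guard against that before writing the file:
const message = completion.choices[0].message;
if (!message.parsed) {
  // The model may refuse or fail to produce output matching the schema
  console.error("No structured data returned:", message.refusal ?? "unknown reason");
  return;
}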
Run the Scraper
Once you have the code copied, you can run the script with:
node scraper.js
If everything completes successfully, you should see a cities.json file in your project directory with the data in this format:
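The exact figures depend on the current state of the Wikipedia table, but the file follows the shape defined by ResponseSchema, roughly like this (values shown are illustrative only):
{
  "cities": [
    {
      "city": "Tokyo",
      "country": "Japan",
      "population": 37400068,
      "rank": 1
    },
    ...
  ]
}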