Web Scraping with ChatGPT API and Browserbear (Python Example)

ChatGPT's web browsing feature is exclusive to ChatGPT Plus subscribers and isn't available through the API. If you want to extract information from a URL in your own application, there is an alternative solution. Let's learn how to do it with Browserbear, then use GPT-4 to perform tasks on the data in Python.
by Josephine Loo · December 2023


    Since the rollout of ChatGPT, there has been growing interest in using it for all kinds of tasks to improve efficiency. The current paid version, ChatGPT Plus, has introduced many features that are exclusive to subscribers, like custom GPTs, optimized performance, web browsing, and more. However, when integrating ChatGPT into your application, certain features, like web browsing, are not directly available through the API.

    To overcome this limitation, we can browse the target website and scrape data from it using Browserbear, a cloud-based browser automation tool, and then use the ChatGPT API to read the scraped data and perform tasks like data analysis, text summarization, sentiment analysis, and more. In this tutorial, we'll show you how to scrape data from a job board using Browserbear, and how to use the ChatGPT API to analyze the data and get insights like the number of jobs, average salary, etc.

    What Is Browserbear?

    Browserbear is a scalable, cloud-based browser automation tool that helps you automate browser tasks. You can use it to automate website testing, scrape data from websites, take scheduled screenshots for archiving, and more without writing a single line of code.

    Unlike other web automation tools that require coding, Browserbear is incredibly simple to set up and use. You can easily locate your target HTML element (the blue box in the screenshot below) by hovering your mouse over it with the Browserbear Helper extension…

    … and add necessary steps to your automation task from the Browserbear dashboard.

    There are two ways to run a task: directly from the account dashboard, or via the API (useful when you want to integrate it into coding projects). We'll be using the API in this tutorial.

    What Is the ChatGPT API? How Is It Different from ChatGPT?

    The ChatGPT API enables developers to access the functionalities of ChatGPT, the conversational AI model, and integrate them into their applications and websites. Unlike the ChatGPT chat interface, the API offers greater flexibility and integration options.

    While ChatGPT Plus offers a chat interface with built-in features like integrated browsing, code execution, plugins, etc., the API offers developers the flexibility to create custom interfaces and functionalities tailored to their specific needs. It can be integrated into applications to build chatbots, assistants, and other tools.

    Here are some real-life examples:

    • Snapchat uses the ChatGPT API to build a customizable chatbot that offers recommendations and answers all types of questions.
    • Instacart uses the ChatGPT API alongside its own AI and product data from its 75,000+ retail partner store locations to help customers discover shopping ideas and get personalized suggestions.
    • Shopify uses the ChatGPT API to power Shop’s (Shopify’s consumer app) shopping assistant, which provides personalized recommendations based on shoppers' requests.

    (Reference: https://openai.com/blog/introducing-chatgpt-and-whisper-apis)

    Setting up a Web Scraping Task in Browserbear

    We will scrape information like job title, company name, salary, and location from this job board for this tutorial:

    [Screenshot: the job board used for this web scraping demonstration]

    Browserbear has an AI web scraper that scrapes data based on your input. All you need to do to scrape information from a website is enter the website URL, the type of data (jobs/articles/products/people/places), and the fields that you want to scrape (e.g., job title, company name, salary, etc.), then click the “Start Scraping” button:

    [Screenshot: the Browserbear AI Web Scraper]

    If you’re not logged in to a Browserbear account, you’ll be prompted to create one. Simply sign up for an account (it’s free!) to continue:

    [Screenshot: the Browserbear AI Web Scraper sign-up prompt]

    If you’re logged in to your account, you’ll be able to see the task running in your account:

    [Screenshot: the task running in your Browserbear account]

    You can also run the task manually by clicking the “Run Task” button:

    [Screenshot: the “Run Task” button]

    Scroll down to the bottom of the page to check the progress. After the task has finished running, the result can be found in the log:

    [Screenshots: the task log and the result shown in the log]

    As we want to feed the scraped data to the ChatGPT API in our Python project, we need the Browserbear project’s API key, task ID, and result ID to access the task and the result. You can find them in your account, as shown in the screenshots below:

    [Screenshots: where to find the API key, task ID, and result ID]

    Copy and save them somewhere safe. We will use them in our code later to scrape data.

    🐻 Bear Tips: If you want more control over which HTML element to scrape the data from, you can create a web scraping task in Browserbear manually (doesn’t use AI). Read How to Scrape Data from a Website Using Browserbear to learn how.

    Getting the ChatGPT API Key

    To use the ChatGPT API in our Python project, sign up for an OpenAI account and log in to the API dashboard.

    [Screenshot: OpenAI post-login screen, choose “API”]

    Then, go to the “API Keys” tab and click “Create new secret key” to create a new secret/API key.

    [Screenshot: creating a new secret key in the API dashboard]

    Like before, copy the key and save it somewhere safe.

    Python Code

    After setting up a web scraping task in Browserbear and retrieving the required API keys and IDs, it’s time for coding! If you don’t have an existing project/script, create a file for writing the code (e.g., script.py). Then, follow the steps below:

    Step 1. Install the OpenAI Python Library

    Run the command below in your terminal/command line to install the OpenAI Python library:

    pip install --upgrade openai
    

    Note: Depending on your Python installation, you may need to replace pip in the command above with pip3.

    Step 2. Import the OpenAI Python Library and Declare Required Constants

    In your code, import requests, OpenAI, json, and time:

    import requests
    from openai import OpenAI
    import json
    import time  # used later to pause between polling requests
    

    … and declare the following constants using your API keys and relevant IDs:

    SCRAPING_TASK_UID = "your_browserbear_task_id"
    BROWSERBEAR_API_KEY = "your_browserbear_api_key"
    BROWSERBEAR_RESULT_UID = "your_browserbear_task_result_id"
    OPENAI_API_KEY = "your_openai_api_key"
    

    🐻 Bear Tips: To protect the private keys and IDs, it’s best to save them as environment variables in a .env file and retrieve them using os.environ.get("VARIABLE_NAME").
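
    For example, here’s a minimal sketch, assuming you’ve installed the python-dotenv package (pip install python-dotenv) and saved the values in a .env file under these variable names:

    import os
    from dotenv import load_dotenv

    load_dotenv()  # read the key-value pairs from the .env file into the environment

    BROWSERBEAR_API_KEY = os.environ.get("BROWSERBEAR_API_KEY")
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")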

    Then, create a new instance of OpenAI:

    client = OpenAI(api_key=OPENAI_API_KEY)
    

    Step 3. Scrape Data Using Browserbear API

    Define a function named scrape_data() that sends a POST request to Browserbear to trigger the task. As the Browserbear API allows us to pass data (URL, type of data, fields to extract) to the task, we can reuse it for different websites and scrape different fields. Let's define the function to take the parameters target_url, job_type, and fields_array so that we can use it for other websites too:

    def scrape_data(target_url, job_type, fields_array):
        url = f"https://api.browserbear.com/v1/tasks/{SCRAPING_TASK_UID}/runs"
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {BROWSERBEAR_API_KEY}",
        }
        
        search_fields = ""
        for i, field in enumerate(fields_array):
            search_fields += f"{field}\n" if i != len(fields_array)-1 else field
        
        data = {
            "steps": [
                {
                    "uid": "1RNV0rb0dQwbOxmQ6X",
                    "action": "go",
                    "config": {
                        "url": target_url,
                        "waitUntil": "networkidle"
                    }
                },
                {
                    "uid": "qL25wVyl6gdzE4mJ1x",
                    "action": "ai_save_links",
                    "config": {
                        "type": job_type
                    }
                },
                {
                    "uid": "lYXA8Kz19QDzLV94N7",
                    "action": "ai_save_data",
                    "config": {
                    "labels": search_fields
                    }
                }
            ]
        }
    
        response = requests.post(url, headers=headers, json=data)
        
        print(f"Scraping {fields_array} from {target_url}...")
        
        return response.json()
    
    scraping_task = scrape_data("https://playground.browserbear.com/jobs/", "jobs", ["job_title", "company_name", "salary", "location"]) # scraping jobs
    

    Note: Replace the IDs in the data object with your task’s steps’ actual IDs.

    You can use the same task to scrape products and articles too:

    # scraping_task = scrape_data("https://playground.browserbear.com/products/", "produtcs", ["title", "price", "rating"]) # scraping products
    # scraping_task = scrape_data("https://www.browserbear.com/blog/", "articles", ["title", "author", "description"]) # scraping articles
    

    Step 4. Retrieve the Web Scraping Result

    Next, create a function named get_scraped_data() that sends a GET request to Browserbear to retrieve the web scraping result. As the task might take a while to complete, we will poll for the result by sending repeated API requests to Browserbear, with a short pause between each, until the result is returned:

    def get_scraped_data(result_uid):
    
        job_status = "running"
        result = ""
    
        get_url = f"https://api.browserbear.com/v1/tasks/{SCRAPING_TASK_UID}/runs/{result_uid}"
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {BROWSERBEAR_API_KEY}",
        }
        
        while job_status == "running":
            time.sleep(3)  # pause between polls instead of hammering the API
            response = requests.get(get_url, headers=headers)
            result = response.json()
            job_status = result["status"]
    
        return result["outputs"][BROWERBEAR_RESULT_UID]
    
    data = get_scraped_data(scraping_task["uid"])
    
    print(data)
    

    Here’s an example of the data returned:

    [
        {
            "link": "https://playground.browserbear.com/jobs/7dJRGX9wgRQ-farming-executive/",
            "job_title": "Farming Executive",
            "company": "Job Inc",
            "salary": "$113,000 / year",
            "location": "Fiji"
        },
        {
            "link": "https://playground.browserbear.com/jobs/9XxqTZH25WQ-accounting-orchestrator/",
            "job_title": "Accounting Orchestrator",
            "company": "Mat Lam Tam Group",
            "salary": "$15,000 / year",
            "location": "Israel"
        },
        {
            "link": "https://playground.browserbear.com/jobs/0fIJzIQwf0c-central-accounting-developer/",
            "job_title": "Central Accounting Developer",
            "company": "Home Ing and Sons",
            "salary": "$10,000 / year",
            "location": "Afghanistan"
        },
        ...
    ]
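
    Before sending the data to ChatGPT, you can also sanity-check it locally. Here’s a rough sketch that computes the average salary in plain Python, assuming every salary string follows the “$113,000 / year” format shown above:

    import re

    def average_salary(jobs):
        # Extract the number from strings like "$113,000 / year"
        salaries = []
        for job in jobs:
            match = re.search(r"[\d,]+", job["salary"])
            if match:
                salaries.append(int(match.group().replace(",", "")))
        return sum(salaries) / len(salaries) if salaries else 0

    print(f"Average salary (computed locally): ${average_salary(data):,.0f}")

    Comparing this number against ChatGPT’s answer in the next step is a quick way to catch parsing mistakes.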
    

    Step 5. Use the ChatGPT API to Analyze the Data

    Now, we can pass the scraped data to the ChatGPT API in an array of message objects. Each object has a role (system/user/assistant) and content, and the array typically starts with a system message, followed by alternating user and assistant messages.

    Let’s prepare an array of message objects that contain the scraped data and some questions for ChatGPT and add it to our code:

    # Set up the initial system message
    system_message = {"role": "system",
                      "content": "You are a helpful assistant that analyzes information. Analyze this data and answer my questions based on the data:" + json.dumps(data)}
    
    # Start the conversation with the system message
    conversation = [system_message]
    
    # Add user message to the conversation
    user_message = {"role": "user", "content": "How many jobs are there? What are the jobs with the top 5 highest salary? What is the average salary?"}
    conversation.append(user_message)
    

    Then, make a request to the Chat Completions API:

    # Pass data to ChatGPT and get the response
    response = client.chat.completions.create(model="gpt-4-1106-preview", messages=conversation)
    
    # Extract and print the assistant's reply
    assistant_reply = response.choices[0].message.content
    print(f"Assistant: {assistant_reply}")
    

    You can choose which ChatGPT model to use, and the cost differs for each model (see ChatGPT API cost). For this tutorial, we're using GPT-4 Turbo (gpt-4-1106-preview).
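
    Since the conversation is just a list of alternating user and assistant messages, you can also ask follow-up questions by appending the assistant’s reply and a new user message before calling the API again. A minimal sketch (the follow-up question is only an example):

    # Keep the context by adding the assistant's reply to the conversation
    conversation.append({"role": "assistant", "content": assistant_reply})

    # Ask a follow-up question about the same data
    conversation.append({"role": "user", "content": "Which location has the most job listings?"})

    response = client.chat.completions.create(model="gpt-4-1106-preview", messages=conversation)
    print(f"Assistant: {response.choices[0].message.content}")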

    The code is complete! When you run python script.py or python3 script.py to execute it, you will see ChatGPT’s answer in the terminal/command prompt:

    [Screenshot: ChatGPT’s answer printed in the terminal]

    Note: As the answer is given by ChatGPT, the result might not be the same every time you run the script.

    🐻 View the full code on GitHub.

    What’s Next

    In this article, we have learned how to scrape data from a website using Browserbear and how to use the ChatGPT API to analyze the scraped data and extract insights from it. But there's so much more you can do with this integration…

    For example, you can set up a daily automation that uses Browserbear to scrape the latest articles and instructs ChatGPT to summarize them in 100 words. This helps you stay updated without having to read every article in full, saving you valuable time.
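
    As a rough sketch, you could reuse the scrape_data() and get_scraped_data() functions from this tutorial together with a summarization prompt (the blog URL and fields reuse the earlier example; note that this summarizes the scraped fields, so you’d scrape the article body instead if you want full-text summaries):

    # Scrape the latest articles using the same Browserbear task as before
    task = scrape_data("https://www.browserbear.com/blog/", "articles", ["title", "author", "description"])
    articles = get_scraped_data(task["uid"])

    # Ask ChatGPT for a short summary of each article
    conversation = [
        {"role": "system", "content": "You are a helpful assistant that summarizes articles."},
        {"role": "user", "content": "Summarize each of these articles in 100 words or less: " + json.dumps(articles)},
    ]
    response = client.chat.completions.create(model="gpt-4-1106-preview", messages=conversation)
    print(response.choices[0].message.content)

    Running this script on a schedule (e.g., with cron) gives you a fresh digest every day.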

    When it comes to optimizing time, automation is often the key, and Browserbear can help you do this easily. You can automate tasks like filling in forms with data from a CSV file, monitoring competitors' prices, etc. If you haven't already, sign up for an account now and start automating mundane tasks to free up your time!

    About the author: Josephine Loo
    Josephine is an automation enthusiast. She loves automating stuff and helping people to increase productivity with automation.

    Automate & Scale
    Your Web Scraping

    Browserbear helps you get the data you need to run your business, with our nocode task builder and integrations
