Web Scraping with Selenium in Python

Selenium is often used for automating web applications for testing purposes, but that is not all it can do. In this article, we’ll show you how to use Selenium for web scraping.
by Josephine Loo · June 2023

    Selenium is a versatile automation framework with a wide range of uses for developers, the most common being automated testing of web applications. Because it offers rich functionality for simulating user interactions and automating browser actions, you can also use it for tasks beyond testing to increase your efficiency and productivity.

    One example is using Selenium to scrape websites. In today’s digital age where data is the key to making informed decisions, we often find ourselves in need of data from various websites. Therefore, web scraping has become an essential skill for developers looking to leverage the vast amount of information available online. In this article, let’s explore how to use Selenium to scrape websites and extract valuable data from them.

    What is Selenium

    Selenium logo

    Selenium is a widely-used browser automation tool that has been around since 2004. It consists of Selenium WebDriver, Selenium IDE, and Selenium Grid. These components complement one another to carry out automated tests on different browsers and environments, with Selenium WebDriver being the core of the automated testing ecosystem.

    Being a mature automation tool, it supports a wide range of web browsers and programming languages, including non-mainstream ones like Haskell, Perl, Dart, and Pharo Smalltalk. It is also highly extensible: a large ecosystem of third-party plugins and frameworks is available to enhance its functionality, which makes it useful in a wide variety of use cases.

    As Selenium is an open-source project, its growth is driven by community support. Besides the efforts of individual contributors such as programmers, designers, and QA engineers, it is also sponsored by established companies like BrowserStack, LambdaTest, Sauce Labs, and more.

    How to Use Selenium for Web Scraping

    Selenium supports several browsers, including Chrome, Firefox, Safari, Microsoft Edge, and Opera. Although each browser has its own custom capabilities and unique features, the main functionality is available in all of them.

    In this article, we will show how to use it on Chrome to scrape a job board. That said, you can follow the same steps to use it on other browsers too.

    Step 1. Install the Selenium Package

    In your project directory, run the command below to install the Selenium package so that it can be used in your Python code.

    pip install selenium
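
    To confirm that the package is available, you can print its version from Python (the exact number will vary):

    import selenium

    # Prints the installed Selenium version, e.g. 4.10.0
    print(selenium.__version__)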
    

    Step 2. Install Browser Drivers

    To automate a browser, you need its driver. There are a few ways to install the drivers; the easiest is to download them from the official website and configure Selenium to use them via one of the options below:

    Option 1: Save the Driver Location in the PATH Environment Variable

    You can place the drivers in a directory that is already listed in PATH, or add the drivers’ location to PATH. To add their location to the PATH environment variable, run the appropriate command below in the Terminal or Command Prompt (the paths are placeholders; substitute your driver’s actual location):
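
    # macOS / Linux: append the driver's directory to PATH ("/path/to/driver" is a placeholder)
    export PATH=$PATH:/path/to/driver

    # Windows (Command Prompt): persist the driver's directory in PATH
    setx PATH "%PATH%;C:\path\to\driver"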

    In your code, import the Selenium package and create a new instance of the driver:

    from selenium import webdriver
    driver = webdriver.Chrome()
    

    Option 2: Specify the Driver Location in Your Code

    You can also hardcode the driver’s location to avoid the hassle of configuring environment variables on your system. However, this makes the code less flexible, as you will need to change it to use another browser or run it on another machine.

    In your code, import the Selenium package. Then, specify the location of the driver and create a new instance of the driver:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # In Selenium 4, the driver path is passed via a Service object
    # (the executable_path argument of webdriver.Chrome has been removed)
    driver_path = '/path/to/chromedriver'
    service = Service(executable_path=driver_path)
    driver = webdriver.Chrome(service=service)
    

    Step 3. Navigate to the Target Website

    After completing the setup, we can start writing code to interact with the browser! To scrape data from a website, visit the URL using the get() method:

    driver.get("https://playground.browserbear.com/jobs/")
    

    Step 4. Locate Elements

    We must tell Selenium which HTML elements it should get the data from. There are several ways to locate an HTML element on a web page: inspect the elements you want to extract data from to determine which locator strategy suits them.

    You can use the following (all of the snippets assume the By import shown after the list):

    • Attributes like name, ID, class name, and tag name
      Example: button = driver.find_element(By.CLASS_NAME, "button")
    • Link text
      Example: link = driver.find_element(By.LINK_TEXT, 'link')
    • CSS selectors
      Example: content = driver.find_element(By.CSS_SELECTOR, 'p.content')
    • XPath expression
      Example: form = driver.find_element(By.XPATH, "/html/body/form[1]")
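
    Note that these snippets assume the By class has been imported:

    from selenium.webdriver.common.by import By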

    Let’s locate the job cards and the elements within them as highlighted in the screenshot:

    Screenshot of the job cards with the target elements highlighted

    As the elements are contained within a card, we need to locate the parent card (job_card) first. Then, loop through each card to find the child elements.

    job_card = driver.find_elements(by=By.TAG_NAME, value="article")
    
    for card in job_card:
            
        job_element = card.find_element(by=By.TAG_NAME, value="a")
        company = card.find_element(by=By.CLASS_NAME, value="company")
        location = card.find_element(by=By.CLASS_NAME, value="location")
        salary = card.find_element(by=By.CLASS_NAME, value="salary")
    

    🐻 Bear Tips: You can use more than one type of locator strategy in the same code. Choose them based on what's available in the HTML code.
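
    If the page renders its content with JavaScript, the job cards may not exist yet when find_elements runs. Below is a minimal sketch using Selenium's explicit waits (the 10-second timeout is an arbitrary choice):

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Block for up to 10 seconds until at least one job card is present
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "article"))
    )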

    Step 5. Extract Data

    Once you have located the desired elements, you can extract data from them using the text property. For links, use the get_attribute() method to get the href attribute.

    job_card = driver.find_elements(by=By.TAG_NAME, value="article")
    
    for card in job_card:
            
        job_element = card.find_element(by=By.TAG_NAME, value="a")
        job_title = job_element.text
        link = job_element.get_attribute("href")
        
        company = card.find_element(by=By.CLASS_NAME, value="company").text
        location = card.find_element(by=By.CLASS_NAME, value="location").text
        salary = card.find_element(by=By.CLASS_NAME, value="salary").text
        
        # Print the extracted text and link
        print("Job Title:", job_title, " - ", link)
        print("Company:", company)
        print("Salary: ", salary)
        print("Location: ", location)
        print()
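
    Note that find_element raises a NoSuchElementException when no matching element exists. If some cards might lack a field, you can wrap the lookup in a try/except block (the salary is used here purely as an illustration):

    from selenium.common.exceptions import NoSuchElementException

    try:
        salary = card.find_element(by=By.CLASS_NAME, value="salary").text
    except NoSuchElementException:
        # Fall back to a placeholder when the card has no salary element
        salary = "N/A"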
    

    Step 6. Quit the Web Driver

    After scraping the data, quit the web driver.

    driver.quit()
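
    To make sure the browser closes even when the scraping code raises an exception, you can wrap the work in a try/finally block:

    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get("https://playground.browserbear.com/jobs/")
        # ... locate elements and extract data here ...
    finally:
        driver.quit()  # always release the browser and the driver process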
    

    Here’s the full code:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    driver = webdriver.Chrome()
    
    driver.get("https://playground.browserbear.com/jobs/")
    
    job_card = driver.find_elements(by=By.TAG_NAME, value="article")
    
    for card in job_card:
            
        job_element = card.find_element(by=By.TAG_NAME, value="a")
        job_title = job_element.text
        link = job_element.get_attribute("href")
        
        company = card.find_element(by=By.CLASS_NAME, value="company").text
        location = card.find_element(by=By.CLASS_NAME, value="location").text
        salary = card.find_element(by=By.CLASS_NAME, value="salary").text
        
        # Print the extracted text and link
        print("Job Title:", job_title, " - ", link)
        print("Company:", company)
        print("Salary: ", salary)
        print("Location: ", location)
        print()
    
    driver.quit()
    

    Advanced: Interacting with the Elements to Scrape More Data

    Each job from the job board has a link that shows more information about the job. This is what you see when you click on it:

    Screenshot of the job detail page

    To scrape these details, you can click on the links using click() and find the target elements using similar methods. Then, use driver.back() to go back to the main page and continue with the next job in the list.

    ...
    for card in job_card:
            
        job_element = card.find_element(by=By.TAG_NAME, value="a")
        ...
        job_element.click() # go to the link
        applicant = driver.find_element(by=By.CLASS_NAME, value="applicants").text
        driver.back()
        
        # Print the extracted text and link
        print("Job Title:", job_title, " - ", link)
        print("Company:", company)
        print("Salary: ", salary)
        print("Location: ", location)
        print("Applicants: ", applicant)
        print()
    
    ...
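
    One caveat: navigating away and back usually invalidates the elements located on the previous page load, so the next loop iteration can raise a StaleElementReferenceException. A common workaround, sketched below, is to collect every job link first and then visit each one directly:

    # Collect the href of every job card before leaving the listing page
    links = [
        card.find_element(by=By.TAG_NAME, value="a").get_attribute("href")
        for card in driver.find_elements(by=By.TAG_NAME, value="article")
    ]

    for link in links:
        driver.get(link)  # no back-navigation needed, so nothing goes stale
        applicant = driver.find_element(by=By.CLASS_NAME, value="applicants").text
        print("Applicants:", applicant)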
    

    Advantages vs. Disadvantages

    While Selenium does a good job at web scraping and other automated web tasks, it comes with trade-offs. Let's take a closer look at its advantages and disadvantages:

    Advantages

    • Open-Source - As mentioned previously, Selenium is an open-source project whose growth is driven by the community. This means that it's free to use while having great support from the community.
    • Rich Element Selection Options - Selenium provides a wide range of methods to locate elements on a web page, including by ID, class name, CSS selector, and XPath. This allows you to target the desired elements precisely, regardless of how the page is structured.
    • Flexibility and Extensibility - Selenium offers APIs for multiple programming languages, including Python, Java, C#, and more. This allows you to work in the language of your choice and use Selenium's extensive functionality across different environments.

    Disadvantages

    • Setup and Configuration - Setting up Selenium and its associated web drivers can be complex, especially for beginners. If it’s not configured correctly, you may run into compatibility issues and waste time fixing them.
    • Learning Curve - Selenium has a steeper learning curve than simpler scraping tools, so it might take more time to become familiar with it.
    • Limited Scalability - Selenium requires a web browser and its driver, which consume system resources. Therefore, running multiple instances of Selenium concurrently can be resource-intensive and may not be feasible for large-scale scraping operations (a partial mitigation is sketched below).
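
    One partial mitigation is to run the browser headless, i.e. without a visible UI, which reduces (but does not eliminate) the resource overhead:

    from selenium import webdriver

    # Run Chrome without a visible window to save system resources
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # use "--headless" on older Chrome versions
    driver = webdriver.Chrome(options=options)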

    Selenium Alternatives

    If the disadvantages above make you hesitate to use Selenium for web scraping, you can consider using Browserbear instead. It has a user-friendly, intuitive interface that simplifies creating and executing an automated web scraping task. Moreover, as a cloud-based solution, it scales effortlessly: you can expand your web scraping tasks as needed without having to manage the infrastructure.

    Browserbear provides an example task that does the same scraping job. You can duplicate it to start scraping immediately, or follow the instructions in How to Scrape Data from a Website Using Browserbear to learn how to do it step by step.

    Conclusion

    Utilizing automation is a good way to free up our time for more important tasks and increase productivity. Whether you use Selenium, which requires coding, or Browserbear, which offers an intuitive interface, web scraping is just one example of how automation can significantly benefit developers. If you haven't incorporated automation into your workflow, you may be doing extra work without realizing it. So, start automating now!

    About the author
    Josephine Loo is an automation enthusiast. She loves automating stuff and helping people increase productivity with automation.

    Automate & Scale
    Your Web Scraping

    Browserbear helps you get the data you need to run your business, with our nocode task builder and integrations

    Web Scraping with Selenium in Python
    Web Scraping with Selenium in Python