Web Scraping Jobs From Indeed

Would you like to build a web scraper to parse job postings from Indeed? We will scrape the data using Python and Selenium. Let’s start.

Many websites outright restrict scraping data from their pages, and users may be subject to legal ramifications depending on where and how they attempt to scrape information. Be especially careful with sites like Facebook or LinkedIn; they do not take kindly to having data scraped from their pages. Scrape carefully.

For this project, we will explore jobs for any keyword, posted in a variety of cities on Indeed.com. I conducted my scraping using the Selenium and BeautifulSoup libraries in Python to parse information from the Indeed pages. Then, I used the pandas library to assemble the data into a DataFrame for further cleaning and analysis.

This walkthrough assumes you are on Windows.

Before we start, let’s install the dependencies we need.

Install Python from https://www.python.org/downloads/

Install pip (recent Python installers include it by default)

Download and install Visual Studio Code or any other code editor

Download ChromeDriver (make sure it matches your installed Chrome version)

Finally, open Command Prompt and install the libraries we need:

pip install pandas
pip install beautifulsoup4
pip install -U selenium

Let’s Examine the URL and Page

Let’s look at a sample page from Indeed.

Notice a few things about the way the URL is structured:

  • note that “q=” begins the string for the “what” field on the page, with search terms separated by “+” (e.g. searching for “food+nutritionist” jobs)

The URL structure will come in handy as we build a scraper that visits and gathers data from a series of pages. Keep this in mind for later.
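To make that concrete, here’s a minimal sketch (the keyword is just an example) of how such a search URL can be assembled in Python:

from urllib.parse import quote_plus

keyword = "food nutritionist"
# quote_plus turns spaces into '+', matching Indeed's URL format
url = "https://www.indeed.com/jobs?q=" + quote_plus(keyword)
print(url)  # https://www.indeed.com/jobs?q=food+nutritionist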

[Screenshot: an Indeed search results page]

All of the information on this page is coded with HTML tags. HTML (HyperText Markup Language) is the markup that tells your internet browser how to display a given page’s contents, including its basic structure and order. HTML tags also have attributes, which are a helpful way of keeping track of what information can be found where within the structure of the page.

Chrome users can examine the HTML structure of a page by right-clicking on a page and choosing “Inspect” from the menu that appears. A panel will open on the right-hand side of your page, with a long list of nested HTML tags housing the information currently displayed in your browser window. In the upper-left of this panel, there’s a small box with an arrow icon in it. Once clicked, the box will illuminate in blue. This lets you hover over the elements in the page to display the tag associated with each item, and to bring your inspection window directly to that item’s place in the page’s HTML.

Now, let’s turn to Python to extract the HTML from the page and start building our scraper.

Building the Scraper Components

Now that we’ve looked at the page and know a little about its basic HTML structure, we can set about writing code to pull out the information we’re interested in. We’ll import our libraries first. Note that I’m also importing time, which is a helpful way of staggering page requests so we don’t overwhelm the site’s servers while scraping.

from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup
import time
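As a courtesy to the site, you can also stagger requests with a randomized pause rather than a fixed one; a minimal sketch:

import random

# Pause for a random 1-3 seconds between page loads
time.sleep(random.uniform(1, 3))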

Let’s start by telling the code where our ChromeDriver executable is:

driver = webdriver.Chrome(executable_path=r'/chromedriver/chromedriver.exe')
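Newer Selenium 4 releases no longer accept executable_path; if that line errors for you, this equivalent sketch (using the same driver path) passes the path through a Service object instead:

from selenium.webdriver.chrome.service import Service

# Selenium 4 style: wrap the driver path in a Service object
service = Service(r'/chromedriver/chromedriver.exe')
driver = webdriver.Chrome(service=service)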

Now let’s define a function that will open the URL:

def Get_jobs_ID(keywords, num_jobs):
    url = 'https://ng.indeed.com/jobs?q=' + keywords + '&l='
    driver.get(url)
    no_of_jobs = []
    print('Opened Page')

Since we do not want to be limited to a single job search, the ‘ + keywords + ’ variable inserts whatever job title we pass in each time we run the scraper. We also create a list, no_of_jobs, to collect the scraped postings until we reach the exact number we need, be it 10, 20, or even 100.

Next, let’s add a while loop that keeps running until we’ve collected the number of jobs we need:

while len(no_of_jobs) < num_jobs:
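Since the rest of this post builds the loop body piece by piece, here’s a rough skeleton (my own summary, with the inner steps as comments) showing where the loop sits inside the function:

def Get_jobs_ID(keywords, num_jobs):
    url = 'https://ng.indeed.com/jobs?q=' + keywords + '&l='
    driver.get(url)
    no_of_jobs = []
    print('Opened Page')

    while len(no_of_jobs) < num_jobs:
        # 1. find all job cards on the current page
        # 2. parse each card and append a dict to no_of_jobs
        # 3. click through to the next page, or break if there isn't one
        pass  # filled in by the snippets below

    return pd.DataFrame(no_of_jobs)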

Extracting the Basic Elements of Data

Approaching this task, I wanted to find and extract six key pieces of information from each job posting: link, job title, company name, location, salary, and job summary, along with the date it was posted.

Link

As noted, the job listings on the page are all nested under <div> tags with a “class” of “result”.

So, for Selenium to iterate through all the job listings, we use find_elements_by_class_name. Each listing’s link sits in an <a> tag, so we’ll tell Selenium to find it using that tag:

all_jobs = driver.find_elements_by_class_name('result')

for job in all_jobs:
    # Grab the card's HTML and hand it to BeautifulSoup for parsing
    result_html = job.get_attribute('innerHTML')
    soup = BeautifulSoup(result_html, 'html.parser')

    # The first <a> tag in the card holds the link to the posting
    href = job.find_element_by_tag_name('a')
    link = href.get_attribute('href')

Job Title

From Inspect Element, we find that the job title on the Indeed site is under the class “jobtitle”. Instead of using Selenium to parse the data, we’ll use BeautifulSoup, as it parses much faster. We’ll also use it to parse the salary, location, company name, and upload date:

print("Progress: {}".format("" + str(len(no_of_jobs)) + "/" + str(num_jobs)))
if len(no_of_jobs) >= num_jobs:
                break
            
            try:
                title = soup.find("a",class_="jobtitle").text.replace('\n','')
            except :
                title = 'None'
            
            try:
                location = soup.find(class_="location").text
            except :
                location = 'None'
            
            try:
                company = soup.find(class_="company").text.replace("\n","").strip()
            except:
                company='None'
            
            try:
                salary = soup.find(class_="salary").text.replace("\n","").strip()
            except:
                salary = 'None'
            
            try:
                dates = soup.find(class_="date").text.replace("\n","").strip()
            except :
                dates = 'None'
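Those repeated try/except blocks all follow the same pattern, so if you want to tidy them up, a small helper of my own (not in the original post) does the same job for the class-based lookups:

def text_or_none(soup, cls):
    # Return the cleaned text of the first element with the given class, or 'None'
    tag = soup.find(class_=cls)
    return tag.text.replace("\n", "").strip() if tag else 'None'

location = text_or_none(soup, "location")
company = text_or_none(soup, "company")
salary = text_or_none(soup, "salary")
dates = text_or_none(soup, "date")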

Next, we’ll have to find the job summary or description. From the website, we can see it shows up after clicking on the job box, which has the class name “summary”.

sum_div = job.find_elements_by_class_name("summary")[0]
sum_div.click()
time.sleep(2)  # give the description pane time to load

The job summary is all under the id “vjs-desc”. To get its text, we use Selenium’s text attribute:

job_desc = driver.find_element_by_id('vjs-desc').text
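One caveat: time.sleep(2) is a fixed pause, so on a slow connection the pane may not have loaded yet. If you run into that, an explicit wait is a more reliable alternative; a sketch:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the description pane to appear, then read its text
desc_elem = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'vjs-desc')))
job_desc = desc_elem.text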

With all the data found, it’s time to append it to our list so that we can save it into a CSV file later.

no_of_jobs.append({"Link": link,
                   "Title": title,
                   "Location": location,
                   "Company": company,
                   "Salary": salary,
                   "Dates": dates,
                   "Description": job_desc})

If the number of jobs you need surpasses the number found on the first page, the code needs to be able to go to the next page. We can do this with the following:

try:
    driver.find_element_by_class_name('pn').click()
except:
    print("Scraping terminated before reaching target number of jobs. Needed {}, got {}.".format(num_jobs, len(no_of_jobs)))
    break

We finally close out the Get_jobs_ID function by returning the results as a DataFrame:

    return pd.DataFrame(no_of_jobs)

Now all we need to do is call the function with our required parameters, and you can call it as many times as you like. pandas’ to_csv will create the CSV file for you, so there’s no need to set one up in advance.

Indeed_Developer = Get_jobs_ID("Developer", 20)
Indeed_Developer.to_csv('8.csv',float_format="%1.2f")
Indeed_Developer

SEOAnalyst = Get_jobs_ID("SEO+Analyst", 20)
SEOAnalyst.to_csv('12.csv',float_format="%1.2f")
SEOAnalyst
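To confirm the export worked, you can read one of the files back with pandas, and remember to close the browser when you’re done:

df = pd.read_csv('8.csv')
print(df.head())  # first five scraped postings

driver.quit()  # closes the Chrome window the scraper opened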

Each new call reuses the same browser window, loading a fresh search once the previous scrape has finished.

[Screenshot: the scraped results saved as a CSV file]

Let’s see the code run.

If you have any questions about the code to scrape Indeed, or you would like me to help you with it, feel free to contact me here.
