Web Scraping H1B Salary DB (I): Exploring Business Analytics Job Market for Non-American Junior Job Seekers

8 min readOct 19, 2018

Introduction

On the journey of pursuing career in product analytics, I tend to browse on LinkedIn for job description and required skill sets. However, I’ve never done some field search about how big the product analyst market is. Only when I encounter with H1B Salary Online Database do I find it is really a big treasure.

There are two reasons for that: one ,it is designed exclusively for international student without USA identity, which really meet my needs, and second, the website structure is relatively simple and easy to scrape. Therefore, I decided to do some web scraping and conduct several exploration on the scraped data in order to provide an overview of the business analytics job market.

** Part 2 about companies H1B issues for specific role is here. (Click the links)

“white and black One Way-printed road signages” by Brendan Church on Unsplash

Target Readers

Current international students (without green cards) who are now pursuing business analytics, data science, analytics master degree in USA.
Aspiring international applicants who are now preparing to apply for related program in analytics fields in US.
International workers who are junior and just start working in US job market.
Those interested in web scraping and function creation in python.

Tools and Skill Used

Tools: python / spyder

Skills: Web scraping with BeautifulSoup, data cleaning, munging, grouping with Pandas, visualization with matplotlib, and function in python

Recommendation and Findings

Position listing for job market so far: Business analyst > Data scientist > product analyst
Data scientist positions are increasing over years; business analyst and product analyst positions are facing some decline in recent years.
Median salary: Data scientist > product analyst > business analyst
Median base salary for data scientist is decreasing over years.
Product analyst H1B application has relatively high denied ratio.

How to Use this Article

Read the articles and get some of my exploration and findings in business analytics job market in US.
Play around with my code. You can simply scrape the data you want to observe and make it into clear visualization with my function. You are able to find my code in my Github.

Methodology

Decide targeted positions for observation and scraping. I locate three positions: ‘Business Analyst, Data Scientist, and Product Analyst.’
Scrap the data from destination web pages and store them as dataframe.
Start cleaning and munging.
Visualize the result. (a. Comparison of three position. b. individual position analysis)
Turn them all in function for efficiency.

In this article, I will present my results of overall comparison of these three position. For the individual position analysis, I will put it into next articles

Step 0: Load in needed packages

The preparation step is to load in needed packages. Pandas and Numpy is our good friend. I used BeautifulSoup for scraping this time. And seaborn and matplotlib are used for visualization.

import urllib
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import matplotlib.pyplot as plt
import string as st
import seaborn as sns

Step 1: Scrap the web page

The web page structure is easy, just like the picture shown below:

load_data(words) is the function I created here. It is used to scrap the page with two words, like product analyst, data scientist or business analyst. All you have to do is type in the string you want to scrape for. The function scraped the page automatically and do some variable creations and return a data frame to the users.

def load_data(word):
    a = word.lower().split()[0]
    b = word.lower().split()[1]
    
    r = urllib.request.urlopen('https://h1bdata.info/index.php?em=&job='+ a +'+'+ b +'&city=&year=All+Years') 
    soup = BeautifulSoup(r)data2 = soup.find_all('tr')    
    labels = []
    for h in data2[0].find_all('th'):
        labels.append(h.get_text().strip().lower())
    
    final = []
    for data in data2[1:]:
        data_list = []
        for d in data.find_all('td'):
            d_str = d.get_text().replace(',','')
            
            if d_str.isnumeric():
                data_list.append(int(d_str))
            else:
                data_list.append(d_str)                      
        final.append(data_list)
    
    df = pd.DataFrame(final, columns = labels)    
    df['submit date'] = pd.to_datetime(df['submit date'])
    df['start date'] = pd.to_datetime(df['start date'])
    df['state'] = df['location'].str.split().str[-1] 
    df['year'] = df['submit date'].dt.year
    df['month'] = df['submit date'].dt.month
    return df

I also create other two helper functions: load_data_three_words(word) and load_data_three_words_comma(word). The reason for that is I found some data scientist position are written in three words (i.e. Data Scientist 1 / Data Scientist i) or in three words with comma (i.e. Data Scientist, analytics). And I found the data there is valuable for a overall picture, so I decided to include them all in the category of data scientist.

With that in mind, I create another function called replace_jobtitle(list_of_dataframe, word). All you have to do is to pass several dataframe into a list, and key in the replace word. The job title will turn in to identical.

def replace_jobtitle(list_of_dataframe, word):
    all_df = pd.concat(list_of_dataframe,ignore_index=True)
    all_df['job title'] = word
    return all_df

Execution: I finally get three data frames. BS for business analyst, pa for product analyst and ds for data scientist.

bs = load_data('business analyst')
pa = load_data('product analyst')ds1 = load_data('data scientist')
ds2 = load_data_three_words('Data Scientist I')
ds3 = load_data_three_words('Data Scientist 1')
ds4 = load_data_three_words_comma('Data Scientist, Analytics')ds_list = [ds1, ds2, ds3, ds4]
ds = replace_jobtitle(ds_list, 'DATA SCIENTIST')
ds4.head()
df.head()

Step 2: Overall H1B Issue number for different positions

Then we can start analyzing and visualizing the data. Firstly, I’d like to look into the overall market size of the three different position.

I created a position_comparison_ttl(list_of_dataframe) for creating such graph. All you have to do is put the list of data frame as argument. The graph will be shown. So you can just pass any positions you want to observe into data frames with the load_data function above, and bracket them into list. You are able to get the the number of jobs that has submit the H1B application from the whole H1B database.

list_of_dataframe = [bs, pa, ds]
position_comparison_ttl(list_of_dataframe)

Findings:

From the data scraped from H1B website, the most issue number of H1B is in business analyst position, which means positions of business analyst are the most compared to data scientist and product analyst.

Surprisingly for me, product analyst has least number of H1B issues. There are many possible explanation there. Business analyst position are across different industries, while product analyst position might only happen in tech industry, which lead to such a big difference between these two positions.

Step 3: Issue number across years for different positions

Next, I’d like to observe the trend of issuing H1B across year between different positions. position_no_yearly(list_of_dataframe) function helps us achieve that. It can return the same number of graphs aligned with the number of data frames in the list.

list_of_dataframe = [bs, pa, ds]
position_no_yearly(list_of_dataframe)

Findings:

Data scientist and product analyst position began to appear later than business analyst (2013 vs 2011). The former two are relatively new positions to the market.
Openings for business analyst and product analyst are declining, while the job market for data scientist is still booming.

Step 4: Base Salary Distribution

Salary is always a big incentives for job seekers. H1B Database includes the salary data. By using position_salary_compare(list_of_dataframe) function I created, we may know the distribution of salary for selected position. I think it is quite useful for job seeker to benchmark with market when negotiating the salary.

P.S. I remove the outlier whose salary is over $300,000, which I think for a junior graduate BA professions, the likelihood of earning over $300,000 is pretty low.

list_of_dataframe = [bs, pa, ds]
position_salary_compare(list_of_dataframe)

Findings:

Median base salary: data scientist > product analyst > business analyst
The salary distribution of business analyst is widely varied, same as data scientist. I will say that based on different industry and the actual tasks, business analyst and data scientist job description will be widely-defined, which leads to such variance.

Step 5: Base Median Salary across Year

Next, I want to check how the median salary changes across year for different position. Therefore I created position_median_salary_yearly(list_of_dataframe) for creating yearly trends.

list_of_dataframe = [bs, pa, ds]
position_median_salary_yearly(list_of_dataframe)

Findings:

Interestingly, the decline of salary median for data scientist position is observed, while the salary median for business analyst and product analyst is still increasing.

Step 6: Denied ratio for different position

In this part, I am quite interested in the case status column in data frames. Some of the H1B application will be denied. Though I can’t tell the reason behind denied cases. But I create position_denied_ratio(list_of_dataframe) function to observe the denied ratio.

list_of_dataframe = [bs, pa, ds]
position_denied_ratio(list_of_dataframe)

Findings:

It is a little bit frustrating that the H1B submission for product analyst results in relatively higher portion of denial. For applicants to product analytics position, this should be taken more notes.

Step 7: Waiting Days for different position

From the database, there are two columns related to date: ‘submit date’ & ‘start date.’ For submit date, it is the date where company file the H1B application for the employer, and start date is the date when employees work using H1B. So the time lapse between these two date, the employee still work with other type of visa, like OPT.

Therefore, I want to tell the difference between diverse positions that the average days the employees need to stay longer with their original visa instead of H1B. I created position_wait_days(list_of_dataframe) function for this analysis.

list_of_dataframe = [bs, pa, ds]
position_wait_days(list_of_dataframe)

Findings:

It takes longer wait time for data scientist and product analyst for their H1B visa.

Brief Conclusion

With simple scraping and creation of functions. I look into the overall picture of business analytics job market for international job seekers in U.S. In the next article, I will go deeper for each of the position kind, providing more insightful findings.

If you like the article, feel free to give me 5+ claps
If you want to read more articles like this, give me 10+ claps
If you want to read articles with different topics, give me 15+ claps and leave the comment hereThank for the reading

Web Scraping H1B Salary DB (I): Exploring Business Analytics Job Market for Non-American Junior Job Seekers

Introduction

Target Readers

Tools and Skill Used

Recommendation and Findings

How to Use this Article

Methodology

Step 0: Load in needed packages

Step 1: Scrap the web page

Step 2: Overall H1B Issue number for different positions

Step 3: Issue number across years for different positions

Step 4: Base Salary Distribution

Step 5: Base Median Salary across Year

Step 6: Denied ratio for different position

Step 7: Waiting Days for different position

Brief Conclusion

Written by Henry Feng

Responses (1)