Datasets:

The Labor Condition Application (LCA) data was crawled from the h1bdata.info website using the Scrapy package. Filing an LCA is a required step before filing an H-1B petition. The dataset includes ~2.1 million entries with information on the submission date, case status, salary, employer, job title, work location, and employment start date. Cleaning the dataset took quite some time (correcting names, units, HTML symbols, ...); I did most of the cleaning with pattern matching in Vim. The dataset is 255 MB.
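To illustrate the kind of pattern-based cleanup involved, here is a minimal sketch in pandas; the rows, patterns, and the hourly-to-yearly heuristic below are made-up examples for illustration, not the actual rules applied:

import pandas as pd

# Hypothetical raw rows showing typical defects: HTML entities,
# inconsistent employer spellings, and mixed salary units.
raw = pd.DataFrame({
    'EMPLOYER': ['GOOGLE INC.', 'Google, Inc', 'AMAZON &amp; SUBSIDIARIES'],
    'BASE_SALARY': ['95,000', '105000', '55.00'],   # yearly vs. hourly figures
})

# Decode leftover HTML entities and normalize employer names.
raw['EMPLOYER'] = (raw['EMPLOYER']
                   .str.replace('&amp;', '&', regex=False)
                   .str.upper()
                   .str.replace(r'[.,]', '', regex=True)
                   .str.strip())

# Strip thousands separators; treat values under 1000 as hourly rates and
# convert them to yearly figures assuming 2080 working hours.
sal = raw['BASE_SALARY'].str.replace(',', '', regex=False).astype(float)
raw['BASE_SALARY'] = sal.where(sal >= 1000, sal * 2080)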

The positions of each state's capital were collected from the internet, mostly from Wikipedia.

The Regional Price Parity (RPP) data are from the Bureau of Economic Analysis.

Analysis and Prediction:

I analyze the dataset to find the companies filing the most applications and the salaries they offer. The most popular jobs, both at the top companies and across the whole job market, are also identified. The geographic job market is analyzed with regional price parity taken into account, and there is a detailed analysis and prediction for the hot Data Scientist job. There are also analyses of the submission dates and the employment start dates. These analyses should be beneficial to international students who plan to work in the US after graduation.

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from matplotlib import gridspec
import ipywidgets as widgets
from IPython.display import display

h1bdat = pd.read_csv('h1b.4.csv')
In [16]:
h1bdat.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2092004 entries, 0 to 2092003
Data columns (total 14 columns):
BASE_SALARY    float64
CASE_STATUS    object
EMPLOYER       object
JOB_TITLE      object
LOCATION       object
START_DATE     object
SUBMIT_DATE    object
YEAR           int64
MONTH          int64
DOM            int64
DOW            int64
STATE          object
S_DOW          int64
S_MONTH        int64
dtypes: float64(1), int64(6), object(7)
memory usage: 239.4+ MB

Which companies file the most petitions?

I use the accumulated number of applications over the last two years to select the top petition-filing companies; using the all-time accumulated numbers could miss newcomers. But first, let's look at the overall number of applications over time across the entire job market.

In [18]:
nby = h1bdat[['YEAR','CASE_STATUS']].groupby('YEAR').count()
ax = plt.subplot(111)
ax.plot(nby.index,nby.values/1000,marker='o', markersize=10)
ax.xaxis.set_tick_params(labelsize=14)
ax.yaxis.set_tick_params(labelsize=14)
ax.xaxis.set_ticks(np.arange(2013,2018,1))
ax.set_title('Number of Applications with Time', fontsize=16)
ax.set_xlabel("Year", fontsize=14)
ax.set_ylabel("Number of Applications (x1000)", fontsize=14)
plt.show()
In [140]:
ntop=20
topEmp = list(h1bdat['EMPLOYER'][h1bdat['YEAR'] >= 2016].groupby(h1bdat['EMPLOYER']).count().sort_values(ascending=False).head(ntop).index)
byEmpYear = h1bdat[['EMPLOYER', 'YEAR', 'BASE_SALARY']][h1bdat['EMPLOYER'].isin(topEmp)]
byEmpYear = byEmpYear.groupby(['EMPLOYER', 'YEAR'])

#Top companies now ordering by mean salary of 2017
topEmp2 = h1bdat[['EMPLOYER','BASE_SALARY']][h1bdat['YEAR'] == 2017]
topEmp2 = topEmp2[topEmp2['EMPLOYER'].isin(topEmp)].groupby('EMPLOYER').mean().sort_values(by='BASE_SALARY',ascending=False)
topEmp2 = list(topEmp2.index)

markers=['o','v','^','<','>','d','s','p','*','h','x','o','v','^','<','>','d','s','p','*','h','x','D']
colors=['C0','C1','C2','C3','C4','C5','C6','C7','C8','C9','C0','C1','C2','C3','C4','C5','C6','C7','C8','C9']
fig = plt.figure(figsize=(10,12))
ax1 = plt.subplot(2, 1, 1)
for company in topEmp:
    tmp = byEmpYear.count().loc[company]
    ax1.plot(tmp.index.values, tmp["BASE_SALARY"].values, label=company, linewidth=2,
             marker=markers[topEmp.index(company)], markersize=8, color=colors[topEmp.index(company)])

ax1.set_title('Top '+str(ntop)+' Companies Sponsoring H-1B', fontsize=16)
ax1.set_xlabel("Year", fontsize=14)
ax1.set_ylabel("Number of Jobs", fontsize=14)
ax1.xaxis.set_tick_params(labelsize=14)
ax1.yaxis.set_tick_params(labelsize=14)
box = ax1.get_position()
ax1.set_position([box.x0, box.y0, box.width * 0.8, box.height])
ax1.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax1.xaxis.set_ticks(np.arange(2013,2018,1))

ax2 = plt.subplot(2, 1, 2)
for company in topEmp2:
    tmp = byEmpYear.mean().loc[company]
    ax2.plot(tmp.index.values, tmp["BASE_SALARY"].values/1000, label=company, linewidth=2,
             marker=markers[topEmp.index(company)], markersize=8, color=colors[topEmp.index(company)])

ax2.set_title('')
ax2.set_xlabel("Year", fontsize=14)
ax2.set_ylabel("Average Salary (x1000 USD/year)", fontsize=14)
ax2.xaxis.set_tick_params(labelsize=14)
ax2.yaxis.set_tick_params(labelsize=14)
box = ax2.get_position()
ax2.set_position([box.x0, box.y0, box.width * 0.8, box.height])
ax2.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax2.xaxis.set_ticks(np.arange(2013,2018,1))

plt.show()

The figure clearly shows two big new players: CAPGEMINI AMERICA and TECH MAHINDRA. The plot also reveals some interesting points:

  • Most of the top companies are IT and tech companies.
  • INFOSYS grew very rapidly, especially from 2013 to 2015, when it went from about 4k to 30k applications! After 2015, however, its number of applications fell quickly.
  • TATA also shows significant growth.
  • CAPGEMINI AMERICA INC jumped from almost zero to 15k applications in 2016, then fell by about half.

For salary, GOOGLE, MICROSOFT, and AMAZON form the first-tier salary group; there is a gap between their salaries and the rest, and also a gap between the first two and AMAZON. The remaining companies can be divided into second and third tiers. The second tier has about 3-5 companies, with a gap below it as well. There are transitions between the second and third tiers: DELOITTE CONSULTING and CAPGEMINI AMERICA both showed recent salary decreases of roughly 10-12%.

In [21]:
PopJobs = h1bdat[['JOB_TITLE', 'EMPLOYER', 'BASE_SALARY']][h1bdat['EMPLOYER'].isin(topEmp)].groupby(['JOB_TITLE'])
# List of top jobs
topJobs = list(PopJobs.count().sort_values(by='EMPLOYER', ascending=False).head(30).index)
df = PopJobs.count().loc[topJobs].assign(mean_wage=PopJobs.mean().loc[topJobs])
fig = plt.figure(figsize=(8,10))
ax1 = fig.add_subplot(111)
ax2 = ax1.twiny()
width = 0.35
df.EMPLOYER.plot(kind='barh', ax=ax1, color='C0', width=0.4, position=0, label='# of Jobs')
df.mean_wage.plot(kind='barh', ax=ax2, color='C5', width=0.4, position=1, label='Mean Salary')
ax1.set_xlabel('Number of Applications', fontsize=14)
ax1.set_ylabel('')
ax1.legend(loc=(0.65,0.60), fontsize=12)
ax2.set_xlabel('Mean Salary (USD/year)', fontsize=14)
ax2.set_ylabel('Job Title', fontsize=16)
ax2.legend(loc=(0.65,0.55), fontsize=12)
plt.show()

Top jobs from top companies are mostly IT jobs. The most popular jobs are "Technology Lead" and "Technology Analyst".

In [22]:
fig = plt.figure(figsize=(8,6))
salary = h1bdat[['JOB_TITLE', 'EMPLOYER', 'BASE_SALARY']][h1bdat['EMPLOYER'].isin(topEmp)]['BASE_SALARY'].values
plt.hist(salary, bins=200, align='mid')
plt.xlabel('Offering Salary (USD/year)', fontsize=14)
plt.ylabel('Number of Jobs', fontsize=14)
plt.xlim(25000,200000)
plt.tick_params(labelsize=12)
plt.title('Distribution of Salaries From Top Companies', fontsize=16)
plt.show()

The distribution of salaries at the top companies is quite right-skewed, which is understandable since there is a lower bound on H-1B salaries. The distribution has mean = 82.5k, median = 75.0k, mode = 60.0k, and standard deviation = 24.1k.
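These statistics can be recomputed directly from the salary array defined above; a quick sketch (the mode is approximated by the tallest histogram bin, since salaries are effectively continuous):

vals = salary[~np.isnan(salary)]
print('mean = %.1fk, median = %.1fk, std = %.1fk'
      % (np.mean(vals)/1000, np.median(vals)/1000, np.std(vals)/1000))

# Approximate mode: center of the tallest of 200 histogram bins.
counts, edges = np.histogram(vals, bins=200)
imax = np.argmax(counts)
print('mode ~ %.1fk' % (0.5*(edges[imax] + edges[imax+1])/1000))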

Popular Jobs in the Entire Job Market

In [23]:
PopJobsAll = h1bdat[['JOB_TITLE', 'EMPLOYER', 'BASE_SALARY']].groupby(['JOB_TITLE'])
# List of top jobs
topJobsAll = list(PopJobsAll.count().sort_values(by='EMPLOYER', ascending=False).head(30).index)
dfAll = PopJobsAll.count().loc[topJobsAll].assign(mean_wage=PopJobsAll.mean().loc[topJobsAll])
fig = plt.figure(figsize=(10,12))
ax1 = fig.add_subplot(111)
ax2 = ax1.twiny()
width = 0.35
dfAll.EMPLOYER.plot(kind='barh', ax=ax1, color='C0', width=0.4, position=0, label='# of Jobs')
dfAll.mean_wage.plot(kind='barh', ax=ax2, color='C5', width=0.4, position=1, label='Mean Salary')
ax1.set_xlabel('Number of Applications', fontsize=14)
ax1.set_ylabel('')
ax1.legend(loc=(0.70,0.60), fontsize=12)
ax2.set_xlabel('Mean Salary (USD)', fontsize=14)
ax2.set_ylabel('Job Title', fontsize=16)
ax2.legend(loc=(0.70,0.65), fontsize=12)
plt.show()

We can see that IT jobs are indeed the most popular; there are only a few non-IT jobs in the top 20, and all of the top 5 are IT. Note that #6 is "Business Analyst", which also requires many computing and analysis skills. PROGRAMMER ANALYST is clearly dominant.

It is very surprising to see "Assistant Professor" on the list of the most popular jobs (#16). Note that Assistant Professor positions are not counted against the annual H-1B quota of 85k, so the title should rank even higher among jobs that actually receive H-1B visas. It also has the second-highest salary!

Another interesting observation is that "Data Scientist" is not on the list, even though people talk about it everywhere and Data Scientist postings are plentiful on any job search site such as Glassdoor. The reason is that, although the number of Data Scientist jobs has been increasing rapidly, the title has been hot for only a short time, so it has not yet accumulated enough filings to make the top list above. After a brief look at the salary distribution for the whole job market, we will analyze the Data Scientist job in more detail.

In [24]:
fig = plt.figure(figsize=(8,6))
salary = h1bdat['BASE_SALARY'].values
plt.hist(salary, bins=200, align='mid')
plt.xlabel('Offering Salary (USD/year)', fontsize=14)
plt.ylabel('Number of Jobs', fontsize=14)
plt.xlim(25000,200000)
plt.tick_params(labelsize=12)
plt.title('Distribution of Salaries Across the Whole Job Market', fontsize=16)
plt.show()

As mentioned above, the distribution of H-1B salaries is right-skewed because there is a lower bound on H-1B salaries. The distribution has mean = 83.2k, median = 74.8k, and standard deviation = 34.0k.

Data Scientist job

In [25]:
fig = plt.figure(figsize=(12,5))
ax1 = plt.subplot(1, 2, 1)
dsj = h1bdat[['JOB_TITLE','YEAR']][h1bdat['JOB_TITLE'] == "DATA SCIENTIST"].groupby('YEAR').count()['JOB_TITLE']
X = np.array(dsj.index)
Y = dsj.values                                                          
slope, intercept, r_value, p_value, std_err = stats.linregress(X, Y)
def func(x, a, b):
    return a*x + b
X1 = np.linspace(2013,2019,7)
X2 = np.linspace(2017,2019,3)
X3 = np.linspace(2018,2019,2)
ax1.scatter(list(dsj.index), dsj.values, c='C0', marker='o', s=140, label='Data')
ax1.plot(X1, func(X1,slope,intercept), color='C3', label='')
ax1.plot(X2, func(X2,slope,intercept), color='C4', linewidth=2, marker='s', markersize=1, label='')
ax1.plot(X3, func(X3,slope,intercept), color='C4', marker='s', markersize=14, label='Prediction')
ax1.legend(fontsize=14)
ax1.set_title('Number of Data Scientist Jobs vs. Year', fontsize=14)
ax1.set_xlabel('Year', fontsize=14)
ax1.set_xticks(np.arange(2013,2020,1))
ax1.set_ylabel('Number of Jobs', fontsize=14)
ax1.text(2016,600,r'slope $\alpha$ = 303.8',fontsize=15, color='red')
plt.tick_params(labelsize=14)

ntop=15
ax2 = plt.subplot(1, 2, 2)
dstopcomp = h1bdat[h1bdat['JOB_TITLE'] == "DATA SCIENTIST"]['EMPLOYER'].groupby(h1bdat['EMPLOYER']).count().sort_values(ascending=False).head(ntop)
lcomp = list(dstopcomp.index)
ax2.barh(ntop-np.arange(ntop),dstopcomp.values,align='center')
ax2.set_yticks(ntop-np.arange(ntop))
ax2.set_yticklabels(lcomp)
ax2.xaxis.set_tick_params(labelsize=14)
ax2.yaxis.set_tick_params(labelsize=12)
ax2.set_xlabel('Number of Jobs', fontsize=14)
ax2.set_title('Top '+str(ntop)+' Data Scientist Hiring Companies, 2013-2017', fontsize=14)
plt.tight_layout()
plt.show()
#plt.savefig('Data_Scientist_Job.pdf')

Since the Data Scientist job is hot, let's do some analysis and prediction on the number of Data Scientist jobs. First, the number of jobs increases steadily with time. Linear regression gives a good model, as can be seen in the figure, with a standard error of 11.7. The predicted numbers of Data Scientist jobs for 2018 and 2019 are shown by squares in the figure.
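For reference, the point predictions marked by the squares come straight from the fitted line, using the slope and intercept computed above:

# Point predictions for the next two years from the linear fit.
for year in (2018, 2019):
    print('%d: ~%.0f Data Scientist applications' % (year, slope*year + intercept))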

The right panel shows the top companies hiring Data Scientists. Interestingly, MICROSOFT hired about one fifth of all Data Scientists, and FACEBOOK and UBER together hired another fifth. It is reasonable to see these companies hiring many Data Scientists, but it is surprising to see WAL-MART in the top list and not AMAZON. How did Amazon develop Alexa? How does AMAZON run its online shopping so efficiently?

When were the petitions filed, and when is the preferred employment start date?

For these analyses, I preprocessed the data to extract the day of week (DOW) and month (MONTH) from the submission and employment start dates using the datetime library.
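A sketch of that preprocessing step, shown here with pandas rather than the raw datetime calls and assuming the date strings parse with default format inference:

submit = pd.to_datetime(h1bdat['SUBMIT_DATE'], errors='coerce')
start = pd.to_datetime(h1bdat['START_DATE'], errors='coerce')
h1bdat['DOW'] = submit.dt.dayofweek        # 0 = Monday ... 6 = Sunday
h1bdat['MONTH'] = submit.dt.month
h1bdat['S_DOW'] = start.dt.dayofweek
h1bdat['S_MONTH'] = start.dt.month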

In [26]:
fig = plt.figure(figsize=(12,4))
gs = gridspec.GridSpec(1, 2, width_ratios=[1,1.618])
plt.subplot(gs[0])
h1bsubdate = h1bdat.groupby('DOW').count()['CASE_STATUS']
plt.plot(h1bsubdate.index,h1bsubdate.values/1000, marker='o', markersize=10, color='C0')
plt.xticks(np.arange(0,7,1),['Mon','Tue','Wed','Thu','Fri','Sat','Sun'])
plt.title('Number of Submissions by Day',fontsize=15)
plt.ylabel('Number of Submissions (x1000)', fontsize=14)
plt.tick_params(labelsize=14)

plt.subplot(gs[1])
h1bsubdate = h1bdat.groupby('MONTH').count()['CASE_STATUS']
plt.plot(h1bsubdate.index,h1bsubdate.values/1000, marker='s', markersize=10, color='C3')
plt.xticks(np.arange(1,13,1),['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
plt.title('Number of Submissions by Month',fontsize=15)
plt.ylabel('Number of Submissions (x1000)', fontsize=14)
plt.tick_params(labelsize=14)
plt.show()
#plt.savefig('Submission_date.pdf')

Petitions are filed most often on Tuesdays; among weekdays, Monday has the fewest filings. People also file petitions on weekends. Since the H-1B filing window opens in early April, it is understandable that a significant number of petitions were filed in February and March.

In [27]:
fig = plt.figure(figsize=(12,4))
gs = gridspec.GridSpec(1, 2, width_ratios=[1,1.618])
plt.subplot(gs[0])
h1bsubdate = h1bdat.groupby('S_DOW').count()['CASE_STATUS']
plt.plot(h1bsubdate.index,h1bsubdate.values/1000, marker='o', markersize=12, color='C0')
plt.xticks(np.arange(0,7,1),['Mon','Tue','Wed','Thu','Fri','Sat','Sun'])
plt.yticks(np.arange(100,600,100))
plt.title('Beginning Employment Day',fontsize=15)
plt.ylabel('Number of Submissions (x1000)', fontsize=14)
plt.tick_params(labelsize=14)

plt.subplot(gs[1])
h1bsubdate = h1bdat.groupby('S_MONTH').count()['CASE_STATUS']
plt.plot(h1bsubdate.index,h1bsubdate.values/1000, marker='s', markersize=12, color='C3')
plt.xticks(np.arange(1,13,1),['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
plt.yticks(np.arange(100,700,100))
plt.title('Beginning Employment Month',fontsize=15)
plt.ylabel('Number of Submissions (x1000)', fontsize=14)
plt.tick_params(labelsize=14)
plt.show()
#plt.savefig('Begin_employment.pdf')

It is normal that Monday is the most preferred day to start a new job, but it is very interesting that many petitions list an employment start date on the weekend! Note that the number of jobs starting on Saturday is about two-thirds of that for any weekday other than Monday.

It is also very interesting to see a large number of jobs starting in September, given that an H-1B visa allows the petitioner to work in the next fiscal year, which starts on October 1st! The reason is that the LCA must be submitted before the H-1B petition, and the employment start date cannot be more than 183 days after the submission date. So most LCA petitions list a start date a few days before October 1st. This also explains the weekend start dates.

The number of jobs starting in August is partially due to the reason above. The other reason could be that those LCAs are for job transfers rather than first-time H-1B petitions. The start-date counts in all other months are also large (~100k) due to job transfers.
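The 183-day explanation can be spot-checked by counting how many start dates fall in the last week of September (a sketch, assuming START_DATE parses as above):

start = pd.to_datetime(h1bdat['START_DATE'], errors='coerce')
in_sep = start.dt.month == 9
late_sep = (in_sep & (start.dt.day >= 24)).sum()
print('Late-September starts: %d of %d September starts' % (late_sep, in_sep.sum()))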

Geographic dependence of the job market

Below we analyze the geographic dependence of the job market in terms of the number of jobs and salary. In each plot, the sizes of the circles represent the numbers of jobs and the colors of the circles represent the salaries.

Average salary over all jobs

In [28]:
# Dictionary of state capital positions in US
capitolpos={}
with open('CapitolPos.txt') as f:
    for line in f:
        state, lan, lon = line.strip().split(',')
        capitolpos[state.upper().strip()] = (float(lon), float(lan))
stlist = list(capitolpos.keys())
h1bdat = h1bdat[h1bdat.STATE.isin(capitolpos.keys())]

sbystate = h1bdat[['STATE','BASE_SALARY']].groupby(['STATE']).mean()
sbystate2 = h1bdat[['STATE','BASE_SALARY']].groupby(['STATE']).count()
X = []
Y = []
for state in list(sbystate.index):
    (lon, lan) = capitolpos[state]
    X.append(lon)
    Y.append(lan)

fig = plt.figure(figsize=(16,8))
m = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49, projection='lcc',lat_1=32,lat_2=45,lon_0=-95)
m.drawcoastlines()
m.drawcountries(linewidth=3, color='C0')
m.drawstates(linewidth=1.5)
xmap, ymap = m(X, Y)

ax = m.scatter(xmap, ymap, s=8000*sbystate2.values/np.max(sbystate2.values), c=sbystate.BASE_SALARY.values/1000, cmap='cool', vmin=sbystate.values.min()/1000, vmax=sbystate.values.max()/1000, alpha=0.80)
cb = plt.colorbar(ax, format='%2.0f')
cb.ax.tick_params(labelsize=14)
plt.title('Average Salary by State (x1000 USD/year), 2013-2017', fontsize=16)
plt.show()

California filed the most H-1B applications. The second tier includes Texas, New York, New Jersey, Illinois, and Washington. These states host big economic and tech centers, such as San Francisco, Chicago, New York, and Seattle, where many tech and IT companies reside. For salary, West Virginia, South Dakota, North Dakota, California, and Washington lead, followed by some northeastern states (New York, New Jersey, Massachusetts, ...).

We all know that North Dakota and South Dakota are not among the strongest state economies, so how can they be at the top for salary? It turns out that the comparison of average salary over all jobs done here is not a meaningful one. The mix of jobs depends on the region (the West Coast may have more IT and tech jobs, the East Coast more finance jobs, and the Midwest more agriculture jobs), and different jobs can have very different salaries. Further analysis shows that North Dakota and South Dakota file petitions for many very highly paid jobs such as medical doctors, surgeons, and physicians. West Virginia also files petitions for many medical doctors, physicians, surgeons, and professors, so it too has a very high average salary.
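This can be spot-checked by listing the best-paid common titles filed within a state; a sketch (the exact STATE spellings depend on how the locations were cleaned):

def top_paid_titles(state, min_count=5, n=10):
    # Mean salary of titles filed in `state`, sorted by pay,
    # ignoring titles with very few filings.
    sub = h1bdat[h1bdat['STATE'] == state]
    g = sub.groupby('JOB_TITLE')['BASE_SALARY'].agg(['count', 'mean'])
    return g[g['count'] >= min_count].sort_values('mean', ascending=False).head(n)

print(top_paid_titles('NORTH DAKOTA'))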

Therefore, comparing the salary of the same job across states is more meaningful. If you have a set of skills suited to certain jobs and want to find the place with the highest possible income, you need to plot the region-dependent data for those jobs. Let's analyze the regional dependence of the most popular jobs.

Average salary for specific jobs

In [15]:
jlist=['PROGRAMMER ANALYST']
sbystate = h1bdat[h1bdat['JOB_TITLE'].isin(jlist)][['STATE','BASE_SALARY']].groupby(['STATE']).mean()
sbystate2 = h1bdat[h1bdat['JOB_TITLE'].isin(jlist)][['STATE','BASE_SALARY']].groupby(['STATE']).count()
X = []
Y = []
for state in list(sbystate.index):
    (lon, lan) = capitolpos[state]
    X.append(lon)
    Y.append(lan)

fig = plt.figure(figsize=(16,8))
m = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49,projection='lcc',lat_1=32,lat_2=45,lon_0=-95)
m.drawcoastlines()
m.drawcountries(linewidth=3, color='C0')
m.drawstates(linewidth=1.5)
xmap, ymap = m(X, Y)

ax = m.scatter(xmap, ymap, s=8000*sbystate2.values/np.max(sbystate2.values), c=sbystate.BASE_SALARY.values/1000, cmap='cool', vmin=sbystate.values.min()/1000, vmax=sbystate.values.max()/1000, alpha=0.80)
cb = plt.colorbar(ax, format='%2.0f')
cb.ax.tick_params(labelsize=14)
plt.title('Average Salary (x1000 USD/year) and Geographic Distribution of "Programmer Analyst" Job, 2013-2017', fontsize=13)
plt.show()
In [16]:
jlist=['SOFTWARE ENGINEER']
sbystate = h1bdat[h1bdat['JOB_TITLE'].isin(jlist)][['STATE','BASE_SALARY']].groupby(['STATE']).mean()
sbystate2 = h1bdat[h1bdat['JOB_TITLE'].isin(jlist)][['STATE','BASE_SALARY']].groupby(['STATE']).count()
X = []
Y = []
for state in list(sbystate.index):
    (lon, lan) = capitolpos[state]
    X.append(lon)
    Y.append(lan)

fig = plt.figure(figsize=(16,8))
m = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49, projection='lcc',lat_1=32,lat_2=45,lon_0=-95)
m.drawcoastlines()
m.drawcountries(linewidth=3, color='C0')
m.drawstates(linewidth=1.5)
xmap, ymap = m(X, Y)
ax = m.scatter(xmap, ymap, s =8000*sbystate2.values/np.max(sbystate2.values), c=sbystate.BASE_SALARY.values/1000, cmap='cool', vmin=sbystate.values.min()/1000, vmax=sbystate.values.max()/1000, alpha=0.80)
cb = plt.colorbar(ax, format='%2.0f')
cb.ax.tick_params(labelsize=14)
plt.title('Average Salary (x1000 USD/year) and Geographic Distribution of "Software Engineer" Job, 2013-2017', fontsize=13)
plt.show()

The number of jobs looks as expected: there are more IT jobs on the West and East Coasts. In terms of salary for the two most popular jobs, California and Washington pay distinctly more than the rest; Massachusetts, North Carolina, and perhaps Colorado form the second tier. But we know that in some areas, such as California, New York, and New Jersey, living expenses are significantly higher than in other areas such as the Midwest, so comparing real purchasing power could be a better approach.

Salary vs. Purchasing Power: Salary Adjusted for Living Expenses

Real purchasing power is income adjusted by Regional Price Parity (RPP), which measures the differences in price levels across states. Since the 2016 and 2017 data had not been released yet, let's make the comparison for the most popular job in 2015.
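The adjustment itself is just a division of the nominal salary by the state's RPP. A worked example (the RPP values below are rough approximations of the 2015 BEA figures, for illustration only):

# Purchasing-power adjustment: nominal salary divided by the state RPP.
for state, nominal, rpp in [('CALIFORNIA', 100000, 1.13), ('OHIO', 85000, 0.89)]:
    print('%-10s nominal %5.1fk -> adjusted %5.1fk'
          % (state, nominal/1000, nominal/rpp/1000))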

In [15]:
rpp2015 = {}
with open('RPP_AllItems_2015.txt') as f:
    for line in f:
        state,rpp = line.strip().split(',')
        rpp2015[state.upper().strip()] = float(rpp)/100
In [29]:
jlist=['PROGRAMMER ANALYST']
sbystate = h1bdat[h1bdat['YEAR'] == 2015]
sbystate = sbystate[sbystate['JOB_TITLE'].isin(jlist)][['STATE','BASE_SALARY']].groupby(['STATE']).mean()
sbystate2 = h1bdat[h1bdat['YEAR'] == 2015]
sbystate2 = sbystate2[sbystate2['JOB_TITLE'].isin(jlist)][['STATE','BASE_SALARY']].groupby(['STATE']).count()
X = []
Y = []
for state in list(sbystate.index):
    (lon, lan) = capitolpos[state]
    X.append(lon)
    Y.append(lan)

fig = plt.figure(figsize=(12,12))
plt.subplot(2, 1, 1)
m = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49, projection='lcc',lat_1=32,lat_2=45,lon_0=-95)
m.drawcoastlines()
m.drawcountries(linewidth=3, color='C0')
m.drawstates(linewidth=1.5)
xmap, ymap = m(X, Y)

ax = m.scatter(xmap, ymap, s=8000*sbystate2.values/np.max(sbystate2.values), c=sbystate.BASE_SALARY.values/1000, cmap='cool', vmin=sbystate.values.min()/1000, vmax=sbystate.values.max()/1000, alpha=0.80)
cb = plt.colorbar(ax, format='%2.0f')
cb.ax.tick_params(labelsize=14)
plt.title('Salary (x1000 USD/year) and Geographic Distribution of The Most Popular Job', fontsize=14)

plt.subplot(2, 1, 2)
sbystate3 = h1bdat[h1bdat['YEAR'] == 2015]
sbystate3 = sbystate3[sbystate3['JOB_TITLE'].isin(jlist)][['STATE','BASE_SALARY']]
sbystate3 = sbystate3.assign(adj_salary=sbystate3.apply(lambda x: x.BASE_SALARY/rpp2015[x['STATE']], axis=1))
sbystate3 = sbystate3[['STATE','adj_salary']].groupby('STATE').mean()
X = []
Y = []
for state in list(sbystate.index):
    (lon, lan) = capitolpos[state]
    X.append(lon)
    Y.append(lan)

m = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49, projection='lcc',lat_1=32,lat_2=45,lon_0=-95)
m.drawcoastlines()
m.drawcountries(linewidth=3, color='C0')
m.drawstates(linewidth=1.5)
xmap, ymap = m(X, Y)

ax = m.scatter(xmap, ymap, s=8000*sbystate2.values/np.max(sbystate2.values), c=sbystate3.adj_salary.values/1000, cmap='cool', vmin=sbystate.values.min()/1000, vmax=sbystate.values.max()/1000, alpha=0.80)
cb = plt.colorbar(ax, format='%2.0f')
cb.ax.tick_params(labelsize=14)
plt.title('Adjusted Salary (x1000 USD/year) and Geographic Distribution of The Most Popular Job', fontsize=14)
plt.tight_layout()
plt.show()
#plt.savefig("Salary_vs_Purchasing_Power.pdf")
In [7]:
joblist = list(h1bdat[['JOB_TITLE','BASE_SALARY']].groupby(['JOB_TITLE']).count().index)
with open('joblist.dat','w') as fout:
    for i in range(len(joblist)):
        fout.write(joblist[i]+'\n')
joblist.append('NONE')
wjob1 = widgets.Text(
    value='software engineer',
    description='Job-1',
    disabled=False
)
wjob2 = widgets.Text(
    value='None',
    description='Job-2',
    disabled=False
)
wjob3 = widgets.Text(
    value='None',
    description='Job-3',
    disabled=False
)
In [8]:
display(wjob1, wjob2, wjob3)
In [142]:
seljobs = [wjob1.value.upper(),wjob2.value.upper(),wjob3.value.upper()]
for i in range(len(seljobs)):
    if seljobs[i] not in joblist:
        print('Job Title '+str(i+1)+' is not valid. Enter it again.')
In [143]:
seljobs = list(set(seljobs))
if 'NONE' in seljobs:
    seljobs.remove('NONE')
print("Your selected jobs:", seljobs)
Your selected jobs: ['BUSINESS ANALYST', 'DATA SCIENTIST']
In [144]:
fig = plt.figure(figsize=(12,4))
ax1 = plt.subplot(1, 2, 1)
nby = h1bdat[['YEAR','BASE_SALARY']][h1bdat['JOB_TITLE'].isin(seljobs)].groupby('YEAR').count()
ax1.plot(nby.index,nby.values/1000,marker='o', markersize=10)
ax1.xaxis.set_tick_params(labelsize=14)
ax1.yaxis.set_tick_params(labelsize=14)
ax1.xaxis.set_ticks(np.arange(2013,2018,1))
ax1.set_title('Number of Jobs with Time', fontsize=16)
ax1.set_xlabel("Year", fontsize=14)
ax1.set_ylabel("Number of Jobs (x1000)", fontsize=14)

ax2 = plt.subplot(1, 2, 2)
nby = h1bdat[['YEAR','BASE_SALARY']][h1bdat['JOB_TITLE'].isin(seljobs)].groupby('YEAR').mean()
ax2.plot(nby.index,nby.values/1000,marker='s', markersize=10,c='C1')
ax2.xaxis.set_tick_params(labelsize=14)
ax2.yaxis.set_tick_params(labelsize=14)
ax2.xaxis.set_ticks(np.arange(2013,2018,1))
ax2.set_title('Average Salary with Time', fontsize=16)
ax2.set_xlabel("Year", fontsize=14)
ax2.set_ylabel("Salary (x1000 USD/year)", fontsize=14)
plt.show()
In [145]:
ntop=10
topEmpsj = h1bdat[['EMPLOYER','BASE_SALARY','JOB_TITLE']][h1bdat['YEAR'] >= 2016]
topEmpsj = topEmpsj[topEmpsj['JOB_TITLE'].isin(seljobs)].groupby('EMPLOYER').count().sort_values(by='JOB_TITLE',ascending=False)
topEmpsj = list(topEmpsj.head(ntop).index)
byEmpYearsj = h1bdat[['EMPLOYER', 'YEAR', 'BASE_SALARY', 'JOB_TITLE']][h1bdat['EMPLOYER'].isin(topEmpsj)]
byEmpYearsj = byEmpYearsj[byEmpYearsj['JOB_TITLE'].isin(seljobs)]
byEmpYearsj = byEmpYearsj.groupby([byEmpYearsj['EMPLOYER'],byEmpYearsj['YEAR']])

topEmpsj2 = h1bdat[['EMPLOYER','BASE_SALARY','JOB_TITLE']][h1bdat['YEAR'] >= 2016]
topEmpsj2 = topEmpsj2[topEmpsj2['JOB_TITLE'].isin(seljobs)]
topEmpsj2 = topEmpsj2[topEmpsj2['EMPLOYER'].isin(topEmpsj)][['EMPLOYER','BASE_SALARY']].groupby('EMPLOYER').mean().sort_values(by='BASE_SALARY',ascending=False)
topEmpsj2 = list(topEmpsj2.head(ntop).index)

markers=['o','v','^','<','>','d','s','p','*','h','x','o','v','^','<','>','d','s','p','*','h','x','D']
colors=['C0','C1','C2','C3','C4','C5','C6','C7','C8','C9','C0','C1','C2','C3','C4','C5','C6','C7','C8','C9']
fig = plt.figure(figsize=(10,12))
ax1 = plt.subplot(2, 1, 1)
for company in topEmpsj:
    tmp = byEmpYearsj.count().loc[company]
    ax1.plot(tmp.index.values, tmp["BASE_SALARY"].values, label=company, linewidth=2,
             marker=markers[topEmpsj.index(company)], markersize=8, color=colors[topEmpsj.index(company)])

ax1.set_title('Top '+str(ntop)+' Companies for the Jobs', fontsize=16)
ax1.set_xlabel("Year", fontsize=14)
ax1.set_ylabel("Number of Jobs", fontsize=14)
ax1.xaxis.set_tick_params(labelsize=14)
ax1.yaxis.set_tick_params(labelsize=14)
box = ax1.get_position()
ax1.set_position([box.x0, box.y0, box.width * 0.8, box.height])
ax1.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax1.xaxis.set_ticks(np.arange(2013,2018,1))

ax2 = plt.subplot(2, 1, 2)
for company in topEmpsj2:
    tmp = byEmpYearsj.mean().loc[company]
    ax2.plot(tmp.index.values, tmp["BASE_SALARY"].values/1000, label=company, linewidth=2,
             marker=markers[topEmpsj.index(company)], markersize=8, color=colors[topEmpsj.index(company)])

ax2.set_title('')
ax2.set_xlabel("Year", fontsize=14)
ax2.set_ylabel("Average Salary (x1000 USD/year)", fontsize=14)
ax2.xaxis.set_tick_params(labelsize=14)
ax2.yaxis.set_tick_params(labelsize=14)
box = ax2.get_position()
ax2.set_position([box.x0, box.y0, box.width * 0.8, box.height])
ax2.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax2.xaxis.set_ticks(np.arange(2013,2018,1))

plt.show()
In [146]:
jlist=seljobs
sbystate = h1bdat[h1bdat['YEAR'] == 2015]
sbystate = sbystate[sbystate['JOB_TITLE'].isin(jlist)][['STATE','BASE_SALARY']].groupby(['STATE']).mean()
sbystate2 = h1bdat[h1bdat['YEAR'] == 2015]
sbystate2 = sbystate2[sbystate2['JOB_TITLE'].isin(jlist)][['STATE','BASE_SALARY']].groupby(['STATE']).count()
X = []
Y = []
for state in list(sbystate.index):
    (lon, lan) = capitolpos[state]
    X.append(lon)
    Y.append(lan)

fig = plt.figure(figsize=(12,12))
plt.subplot(2, 1, 1)
m = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49, projection='lcc',lat_1=32,lat_2=45,lon_0=-95)
m.drawcoastlines()
m.drawcountries(linewidth=3, color='C0')
m.drawstates(linewidth=1.5)
xmap, ymap = m(X, Y)

ax = m.scatter(xmap, ymap, s=8000*sbystate2.values/np.max(sbystate2.values), c=sbystate.BASE_SALARY.values/1000, cmap='cool', vmin=sbystate.values.min()/1000, vmax=sbystate.values.max()/1000, alpha=0.80)
cb = plt.colorbar(ax, format='%2.0f')
cb.ax.tick_params(labelsize=14)
plt.title('Salary (x1000 USD/year) and Geographic Distribution of The Selected Job', fontsize=14)

plt.subplot(2, 1, 2)
sbystate3 = h1bdat[h1bdat['YEAR'] == 2015]
sbystate3 = sbystate3[sbystate3['JOB_TITLE'].isin(jlist)][['STATE','BASE_SALARY']]
sbystate3 = sbystate3.assign(adj_salary=sbystate3.apply(lambda x: x.BASE_SALARY/rpp2015[x['STATE']], axis=1))
sbystate3 = sbystate3[['STATE','adj_salary']].groupby('STATE').mean()
X = []
Y = []
for state in list(sbystate.index):
    (lon, lan) = capitolpos[state]
    X.append(lon)
    Y.append(lan)

m = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49, projection='lcc',lat_1=32,lat_2=45,lon_0=-95)
m.drawcoastlines()
m.drawcountries(linewidth=3, color='C0')
m.drawstates(linewidth=1.5)
xmap, ymap = m(X, Y)

ax = m.scatter(xmap, ymap, s=8000*sbystate2.values/np.max(sbystate2.values), c=sbystate3.adj_salary.values/1000, cmap='cool', vmin=sbystate.values.min()/1000, vmax=sbystate.values.max()/1000, alpha=0.80)
cb = plt.colorbar(ax, format='%2.0f')
cb.ax.tick_params(labelsize=14)
plt.title('Adjusted Salary (x1000 USD/year) and Geographic Distribution of The Selected Popular Job', fontsize=14)
plt.tight_layout()
plt.show()
#plt.savefig("Salary_vs_Purchasing_Power.pdf")

The picture changes significantly! California, New York, and New Jersey become the lowest adjusted-salary states, at least for this Programmer Analyst job. (If you would like to see the change for other jobs, you can contact me.) Some noticeable observations from this figure:

  1. Washington has the highest salary both before and after adjustment.
  2. California, New York, and New Jersey have the lowest salaries after adjustment, as mentioned above.
  3. The Midwest, some southern states, and North Carolina offer very good salaries after adjustment.

These results are quite surprising: we know that some areas on the West and East Coasts pay considerably more than other areas to offset high living expenses, but we might not expect such significant changes in the picture. So, for the same job, you really need to take both the salary and the living-cost differences into account when deciding where to work in order to save the most. The changes are job dependent, since salary differences between regions differ from job to job.

Future work

(If I have some free time)

  • Developing an app that, when the user selects a job title from the job list, shows:

    a) the list of top companies hiring for that job,

    b) the geographic distribution of that job,

    c) the geographically dependent salary with/without RPP adjustment.

  • Collecting job postings to build another database and app that suggests suitable jobs for a user-given set of skills and, inversely, lists the necessary skills for a given job.

These two apps would help international students better choose careers and classes in preparation for their future careers.