This document includes a series of appendices that discuss data sources, data wrangling, and data visualization. Navigate through the document by using the panel at the left.
The H1B visa is a nonimmigrant, temporary visa that allows U.S. employers to hire highly educated foreign professionals, particularly in the fields of Science, Technology, Engineering, and Math. Under the H1B program, U.S. employers can hire 65,000 foreign workers annually for specialty occupations (and fashion models of exceptional merit and ability). An additional 20,000 petitions can be filed on behalf of beneficiaries with a master's degree or higher from a U.S. institution of higher education. Before filing an H1B visa petition, an employer must obtain a certified Labor Condition Application (LCA) from the U.S. Department of Labor (DOL). The LCA ensures that workers in H1B status are compensated fairly and that their employment will not adversely affect the wages and working conditions (such as hours, vacations, and benefits) of similarly employed U.S. workers. To that end, DOL regulations require that the wage offered to a foreign worker be at least the prevailing wage rate for the occupational classification in the area of employment. The prevailing wage rate is the average wage paid to similarly employed workers in a specific occupation in a given geographic area.
This project explores the typical earnings of an H1B applicant for a business analyst position in different regions of the USA, and the value of those earnings after adjusting for the cost of living across the country.
The Office of Foreign Labor Certification (OFLC) discloses LCA data for various stakeholders and the general public. The data have been downloaded from the Performance Data page of the U.S. Department of Labor's website.
The median salary for a business analyst is $69,000. Only a few reported salaries range from $100,000 to $200,000 for a business analyst position, which skews the histogram to the right. These high salaries are offered primarily in New York, California, and New Jersey, so location might play a significant role in driving higher wages. At the same time, other factors related to a candidate's work experience, skill set, and education can also affect salaries; the data set has no information about a candidate's competence level or whether a position is entry level or senior level. Overall, the most frequently offered salaries fall in the $50,000 to $70,000 range.
How does a business analyst’s salary vary geographically?
The data have been divided into four regions: Midwest, Northeast, South, and West. In all four regions, the most frequently offered yearly wage for the business analyst position ranges from $65,000 to $85,000. Annual earnings are most variable in the West, ranging from $33,280 to $200,000. The higher-salary ratio (the count of salaries greater than $80,000 divided by the total count of wages in each region) is highest in the West, at 34%, compared with roughly 20% in the other three regions. Most of the higher wages are reported in the big cities of California. In the Northeast, yearly earnings are bi-modal: H1B-sponsored business analysts are most frequently paid around $65,000 and $85,000. The lower-salary ratio in the South (the count of salaries less than $60,000 divided by the total count of wages in each region) is similar to that of the West and Northeast. The Midwest has a much more condensed distribution and the highest lower-salary ratio of all regions, at 12%.
Let's compare the central tendency of these regions. The median salary in the West is $74,554, the highest of the four regions. In contrast, the median salary of a business analyst in the South is $65,000, the lowest of the four. The Northeast and Midwest offer median wages of $70,574 and $66,224, respectively.
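These regional medians and salary ratios can be reproduced with a short dplyr summary; a minimal sketch, assuming the `salary_range` data frame built in the appendix (one row per certified application, with `yearly_wage_offered` and `region` columns):
library(dplyr)
salary_range %>%
  group_by(region) %>%
  summarise(median_wage         = median(yearly_wage_offered),
            higher_salary_ratio = mean(yearly_wage_offered > 80000), # share of offers above $80,000
            lower_salary_ratio  = mean(yearly_wage_offered < 60000)) # share of offers below $60,000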
At the bottom of the range, median earnings are $60,000 in Oklahoma and South Carolina; at the top, $80,621 in Montana. An interactive choropleth map gives a detailed view of median salaries in each state.
Who is paying more than the prevailing wage?
When determining salaries for H1B employees, employers must pay at least the prevailing wage determined by the National Prevailing Wage Center (NPWC) or another legitimate source, such as the Online Wage Library. However, an employer can offer more than the determined prevailing wage. Business analysts in New Jersey, California, Texas, and New York are most frequently offered wages above the prevailing wage. Only two yearly wage offers, one in Florida and one in Texas, were found to be below the prevailing wage.
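The below-prevailing-wage cases can be isolated directly; a minimal sketch, again assuming the appendix's `salary_range` data frame with its `yearly_wage_offered` and `prevailing_yearly_wage` columns:
salary_range %>%
  filter(yearly_wage_offered < prevailing_yearly_wage) %>% # offers below the prevailing wage
  select(state, yearly_wage_offered, prevailing_yearly_wage)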
How much did a business analyst get paid in 2018 compared to the U.S. median household income?
The U.S. median household income was $63,179, according to U.S. Census Bureau statistics. In most states, the median earnings of a business analyst were higher than the U.S. median household income. Washington, DC, Hawaii, New Jersey, New York, and Washington are the top five in paying above the U.S. median household income, while Mississippi, Oklahoma, Puerto Rico, and West Virginia sit at the bottom, paying below it.
Is there any trend in business analysts' salaries from 2016 to 2018?
From 2016 to 2018, business analysts' annual salaries remained almost the same in each region; no trend is visible in yearly earnings over this period. Random effects might explain the high variability of annual salaries.
What does it mean to live in two different states with the same yearly earnings?
Although a business analyst’s annual salary does not vary significantly across states, the cost of living can differ enormously, allowing a divergence between incomes in different places. A lower cost of living, which includes lower prices for housing, food, education, transportation, and other goods and services, allows the same dollar of wages to stretch further. We can see how much higher or lower the cost of the same bundle of goods and services is in each location relative to the national average cost by using the Bureau of Economic Analysis’ (BEA’s) regional price parities (RPPs). The RPP is calculated using Consumer Price Index prices and American Community Survey housing rents. RPPs are expressed as a percentage of each year’s overall national price level, so RPPs higher than 100 represent state prices higher than the national average and vice versa.
| year | state | median_annual_wage | RPP | adjusted_annual_median_wage | cost_of_living |
|---|---|---|---|---|---|
| 2018 | IA | 65000 | 91.596 | 70963.80 | -8.404 |
| 2018 | OH | 65000 | 92.886 | 69978.25 | -7.114 |
| 2018 | TX | 65000 | 98.356 | 66086.46 | -1.644 |
The table shows the same annual median earnings ($65,000) but different living costs, giving a sense of how the cost of living can vary across the country. Iowa's cost-of-living index (-8.4) means prices there are about 8.4% below the national average; Ohio's is about 7.1% below; and Texas's is only about 1.6% below. When adjusted for cost of living, $65,000 in wages is worth $70,963 in Iowa, while the same earnings are worth only $66,086 in Texas.
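The adjustment itself is simple arithmetic; a minimal sketch using Iowa's 2018 values from the table above:
median_annual_wage <- 65000
rpp <- 91.596                                     # Iowa's regional price parity, 2018
(adjusted_wage <- median_annual_wage / rpp * 100) # 70963.80: what $65,000 is "worth" in Iowa
(cost_of_living_index <- rpp - 100)               # -8.404: prices 8.4% below the national average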
The figure above shows the median annual wage versus the cost-of-living index by state. A state's median annual wage is not strongly correlated with its cost-of-living index. Thirty-five states have living costs below the national average; Montana offers the highest median salary, $80,621, with a cost of living 7.8% below the national average. Sixteen states have living costs above the national average, offering median salaries ranging from $60,000 to $78,000. California, Washington, DC, New Jersey, and Hawaii have the highest living costs, roughly 10% above the national average. Among these top four, Hawaii offers the lowest median salary, $63,835.
Overall, an H1B-sponsored business analyst's salary mostly ranges from $65,000 to $85,000, and from 2016 to 2018 the annual median wage remained almost the same. Geographically, the Western region has a higher higher-salary ratio than the others. In most states, a nonimmigrant business analyst's median earnings exceeded the 2018 U.S. median household income. Median yearly earnings have a weak positive correlation with the cost-of-living index. States such as California, Washington, DC, New Jersey, and Hawaii have the highest living costs and offer median salaries ranging from $60,000 to $78,000. In contrast, Montana and Vermont have the highest median wages with living costs below the national average, by 7.8% and 0.3%, respectively.
The Office of Foreign Labor Certification (OFLC) discloses LCA data for various stakeholders and the general public. The data have been downloaded from the Performance Data page of the U.S. Department of Labor's website. The data sets are available in Microsoft Excel (.xlsx) format and are organized by federal fiscal year (October 1 through September 30). This exploratory data analysis covers the 2016 to 2018 data. The three data frames have been joined together; the combined data frame has 1,926,862 rows and 52 columns.
The data frames outline details of the H1B employment, such as the occupational classification being sought, the number of nonimmigrants covered, the wage rate to be paid to the beneficiary, the prevailing wage for the occupation in the area of intended employment, the work location/s, and the dates of employment.
The LCA data from 2016 to 2018 have been used for the analysis. The three data frames have been combined in R; overall, the combined data frame has 1,926,862 rows and 52 columns.
State abbreviations and regions data: U.S. state abbreviations and regional information have been downloaded from here
Cost of Living Data: Bureau of Economic Analysis’ Regional Price Parities (RPP)
To begin, load the necessary libraries and read in the data.
# load libraries
load_packages <- pacman::p_load(
  readxl,     # read Excel files
  skimr,
  tidyverse,  # dplyr, tidyr, and friends for wrangling
  janitor,    # clean_names() and duplicate checks
  stringr,    # string searches
  ggplot2,
  lubridate,
  naniar,     # missing-data summaries and plots
  forcats,
  rmarkdown,
  here,       # project-relative file paths
  tinytex,
  knitr,
  kableExtra) # formatted tables
# Import data
H1B_2016 <- read_excel(here("Data","Raw-data","H-1B_Disclosure_Data_FY16.xlsx"))
H1B_2017 <- read_excel(here("Data","Raw-data","H-1B_Disclosure_Data_FY17.xlsx"))
H1B_2018 <- read_excel(here("Data","Raw-data","H-1B_Disclosure_Data_FY2018_EOY.xlsx"))
Let's do some primary cleaning with the janitor package.
H1B_2016<- H1B_2016 %>%
clean_names()
H1B_2017<- H1B_2017 %>%
clean_names()
H1B_2018<- H1B_2018 %>%
clean_names()
Check the common variables of the three data sets.
setequal(H1B_2017, H1B_2018) # FALSE is returned, so the observations are not identical
# intersect(H1B_2017, H1B_2018) # the `new_concurrent_employment` variable is named differently in the 2018 data set
# rename the 2018 data set's `new_concurrent_emp` variable to `new_concurrent_employment`
H1B_2018 <- H1B_2018 %>%
  rename(`new_concurrent_employment` = `new_concurrent_emp`) # new name = old name
The 2017 and 2018 data frames now have identical column names. Three column names differ in the 2016 data frame and have been renamed to match the other two data frames: "h_1b_dependent" to "h1b_dependent", "naic_code" to "naics_code", and "pw_wage_source" to "pw_source".
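Before renaming, a quick side check (a sketch, not part of the original pipeline) can confirm which names differ:
setdiff(names(H1B_2016), names(H1B_2017)) # names present only in the 2016 data
setdiff(names(H1B_2017), names(H1B_2016)) # names present only in 2017/2018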
library(janitor)
H1B_2016<- H1B_2016 %>%
clean_names()%>%
rename(naics_code = naic_code,
pw_source = pw_wage_source,
h1b_dependent= h_1b_dependent)
# combine 2017 and 2018 data frames together
H1B_17_nd_18 <- rbind(H1B_2017,H1B_2018)
# Joining three data frames
H1B_16_to_18 <- bind_rows(H1B_2016,H1B_17_nd_18) # 2016 data set have fewer columns than 2017 and 2018. So, we use bind_rows()
# remove intermediate data frames to free memory
rm(H1B_2016, H1B_2017, H1B_2018, H1B_17_nd_18)
How many missings?
# how many missings?
n_var_miss(H1B_16_to_18)
## [1] 45
Out of 52 variables, 45 have missing values.
# percent of ALL data frame values that are missing
pct_miss(H1B_16_to_18)
## [1] 18.35342
# Percent of rows with any value missing
pct_miss_case(H1B_16_to_18)
## [1] 100
# Percent of rows that are complete (no values missing)
pct_complete_case(H1B_16_to_18)
## [1] 0
# count the missing values for every column
Missing_data <- sapply(H1B_16_to_18, function(x) sum(is.na(x)))
kable(Missing_data)
| variable | missing values |
|---|---|
| case_number | 0 |
| case_status | 0 |
| case_submitted | 1 |
| decision_date | 0 |
| visa_class | 0 |
| employment_start_date | 53 |
| employment_end_date | 66 |
| employer_name | 16 |
| employer_address | 16 |
| employer_city | 15 |
| employer_state | 125 |
| employer_postal_code | 16 |
| employer_country | 96517 |
| employer_province | 1413325 |
| employer_phone | 96518 |
| employer_phone_ext | 1802385 |
| agent_attorney_name | 0 |
| agent_attorney_city | 710076 |
| agent_attorney_state | 761609 |
| job_title | 13 |
| soc_code | 19 |
| soc_name | 20 |
| naics_code | 18 |
| total_workers | 0 |
| full_time_position | 647863 |
| prevailing_wage | 5 |
| pw_unit_of_pay | 152 |
| pw_source | 154 |
| pw_source_year | 172 |
| pw_source_other | 10939 |
| wage_rate_of_pay_from | 0 |
| wage_rate_of_pay_to | 5 |
| wage_unit_of_pay | 31 |
| h1b_dependent | 41170 |
| willful_violator | 41179 |
| worksite_city | 23 |
| worksite_county | 32 |
| worksite_state | 34 |
| worksite_postal_code | 25 |
| original_cert_date | 1785088 |
| employer_business_dba | 1495366 |
| agent_representing_employer | 744376 |
| new_employment | 647852 |
| continued_employment | 647852 |
| change_previous_employment | 647852 |
| new_concurrent_employment | 647852 |
| change_employer | 647852 |
| amended_petition | 647852 |
| pw_wage_level | 744355 |
| support_h1b | 737830 |
| labor_con_agree | 1445958 |
| public_disclosure_location | 1926862 |
Proportion of missing data
# what is the proportion of missing data for each variable?
pctmiss <- colSums(is.na(H1B_16_to_18))/nrow(H1B_16_to_18)
round(pctmiss, 2)
## case_number case_status
## 0.00 0.00
## case_submitted decision_date
## 0.00 0.00
## visa_class employment_start_date
## 0.00 0.00
## employment_end_date employer_name
## 0.00 0.00
## employer_address employer_city
## 0.00 0.00
## employer_state employer_postal_code
## 0.00 0.00
## employer_country employer_province
## 0.05 0.73
## employer_phone employer_phone_ext
## 0.05 0.94
## agent_attorney_name agent_attorney_city
## 0.00 0.37
## agent_attorney_state job_title
## 0.40 0.00
## soc_code soc_name
## 0.00 0.00
## naics_code total_workers
## 0.00 0.00
## full_time_position prevailing_wage
## 0.34 0.00
## pw_unit_of_pay pw_source
## 0.00 0.00
## pw_source_year pw_source_other
## 0.00 0.01
## wage_rate_of_pay_from wage_rate_of_pay_to
## 0.00 0.00
## wage_unit_of_pay h1b_dependent
## 0.00 0.02
## willful_violator worksite_city
## 0.02 0.00
## worksite_county worksite_state
## 0.00 0.00
## worksite_postal_code original_cert_date
## 0.00 0.93
## employer_business_dba agent_representing_employer
## 0.78 0.39
## new_employment continued_employment
## 0.34 0.34
## change_previous_employment new_concurrent_employment
## 0.34 0.34
## change_employer amended_petition
## 0.34 0.34
## pw_wage_level support_h1b
## 0.39 0.38
## labor_con_agree public_disclosure_location
## 0.75 1.00
# kable(pctmiss)
The output above shows the proportion of missing data for each variable.
Now, look at the entire data frame to get the average share of missing values in each column.
# We can look at an entire data frame by using summarise(across())
missing <- H1B_16_to_18 %>% # one-row data frame: the share of missing values in every column
  summarise(across(everything(), ~ mean(is.na(.x))))
glimpse(missing[1:10])
## Rows: 1
## Columns: 10
## $ case_number <dbl> 0
## $ case_status <dbl> 0
## $ case_submitted <dbl> 5.189785e-07
## $ decision_date <dbl> 0
## $ visa_class <dbl> 0
## $ employment_start_date <dbl> 2.750586e-05
## $ employment_end_date <dbl> 3.425258e-05
## $ employer_name <dbl> 8.303656e-06
## $ employer_address <dbl> 8.303656e-06
## $ employer_city <dbl> 7.784678e-06
Visualizing Missing Data
missing %>%
pivot_longer(case_number:support_h1b, names_to = "variable",
values_to = "pct") %>%
filter(pct > 0.1) %>%
ggplot(aes(x=reorder(variable, pct), y=pct)) +
coord_flip() +
geom_col()
Showing missingness with gg_miss_var
# showing missingness with gg_miss_var
gg_miss_var(H1B_16_to_18, show_pct = TRUE) # counts are shown by default; show_pct = TRUE switches to percentages
## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.
## ℹ The deprecated feature was likely used in the naniar package.
## Please report the issue at <https://github.com/njtierney/naniar/issues>.
Dropping Missing Data: dropping the four columns that have more than 50 percent missing values.
missing_dropped <- H1B_16_to_18 %>%
select(-c(employer_business_dba, employer_phone_ext, employer_province, original_cert_date))
rm(H1B_16_to_18)
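As a programmatic alternative (a sketch, to be run in place of the select() above, before H1B_16_to_18 is removed), the same rule can be applied automatically: keep only the columns with at most 50 percent missing values.
missing_dropped <- H1B_16_to_18 %>%
  select(where(~ mean(is.na(.x)) <= 0.5)) # requires dplyr >= 1.0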
Now check for 100% duplicates across all columns.
missing_dropped %>%
  janitor::get_dupes()
## No variable names specified - using all columns.
## No duplicate combinations found of: case_number, case_status, case_submitted, decision_date, visa_class, employment_start_date, employment_end_date, employer_name, employer_address, ... and 39 other variables
## # A tibble: 0 × 49
## # … with 49 variables: case_number <chr>, case_status <chr>,
## # case_submitted <dttm>, decision_date <dttm>, visa_class <chr>,
## # employment_start_date <dttm>, employment_end_date <dttm>,
## # employer_name <chr>, employer_address <chr>, employer_city <chr>,
## # employer_state <chr>, employer_postal_code <chr>, employer_country <chr>,
## # employer_phone <chr>, agent_attorney_name <chr>, agent_attorney_city <chr>,
## # agent_attorney_state <chr>, job_title <chr>, soc_code <chr>, …
There are no duplicated observations in the data set.
The analysis has been done on certified cases, in simpler terms, the cases whose LCAs were certified by the Department of Labor.
# Extracting certified cases
certified <- missing_dropped %>%
filter(case_status == "CERTIFIED") %>%
glimpse()
## Rows: 1,694,789
## Columns: 48
## $ case_number <chr> "I-200-15198-889809", "I-200-15209-285824"…
## $ case_status <chr> "CERTIFIED", "CERTIFIED", "CERTIFIED", "CE…
## $ case_submitted <dttm> 2015-11-10, 2015-10-14, 2015-09-30, 2015-…
## $ decision_date <dttm> 2015-11-18, 2015-10-20, 2015-10-06, 2015-…
## $ visa_class <chr> "H-1B", "H-1B", "H-1B", "H-1B", "H-1B", "H…
## $ employment_start_date <dttm> 2015-11-10, 2016-01-14, 2015-10-10, 2016-…
## $ employment_end_date <dttm> 2018-11-09, 2018-01-13, 2016-11-12, 2018-…
## $ employer_name <chr> "QUICKLOGIX LLC", "MCCHRYSTAL GROUP, LLC",…
## $ employer_address <chr> "265 FRANKLIN STREET", "333 NORTH FAIRFAX …
## $ employer_city <chr> "BOSTON", "ALEXANDRIA", "SAN DIEGO", "CHUL…
## $ employer_state <chr> "MA", "VA", "CA", "CA", "MA", "CA", "TX", …
## $ employer_postal_code <chr> "02110", "22314", "92117", "91910", "02110…
## $ employer_country <chr> "UNITED STATES OF AMERICA", "UNITED STATES…
## $ employer_phone <chr> "6179877989", "5713128637", "8584059128", …
## $ agent_attorney_name <chr> "FURO, RUKAYYA", "ALEXANDER, JAMES", "FELD…
## $ agent_attorney_city <chr> "OWINGS MILLS", "WASHINGTON", "SAN DIEGO",…
## $ agent_attorney_state <chr> "MD", "DC", "CA", "CA", "MD", "CA", "TX", …
## $ job_title <chr> "CEO", "PRESIDENT, NORTHEAST REGION", "CEO…
## $ soc_code <chr> "11-1011", "11-1011", "11-1011", "11-1011"…
## $ soc_name <chr> "CHIEF EXECUTIVES", "CHIEF EXECUTIVES", "C…
## $ naics_code <chr> "541519", "541611", "541511", "611110", "5…
## $ total_workers <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ full_time_position <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ prevailing_wage <dbl> 90.00, 241842.00, 99986.00, 99986.00, 90.0…
## $ pw_unit_of_pay <chr> "Hour", "Year", "Year", "Year", "Hour", "Y…
## $ pw_source <chr> "OES", "OES", "OES", "OES", "OES", "OES", …
## $ pw_source_year <chr> "2015", "2015", "2015", "2015", "2015", "2…
## $ pw_source_other <chr> "OFLC ONLINE DATA CENTER", "OFLC ONLINE DA…
## $ wage_rate_of_pay_from <dbl> 90.00, 245000.00, 100000.00, 100000.00, 90…
## $ wage_rate_of_pay_to <dbl> 0, 0, 0, 120000, 0, 0, 0, 0, 0, 0, 202533,…
## $ wage_unit_of_pay <chr> "Hour", "Year", "Year", "Year", "Hour", "Y…
## $ h1b_dependent <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N…
## $ willful_violator <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N…
## $ worksite_city <chr> "SANTA CLARA", "ALEXANDRIA", "SAN DIEGO", …
## $ worksite_county <chr> "SANTA CLARA", "ALEXANDRIA CITY", "SAN DIE…
## $ worksite_state <chr> "CA", "VA", "CA", "CA", "CA", "CA", "TX", …
## $ worksite_postal_code <chr> "95054", "22134", "92117", "91910", "95054…
## $ agent_representing_employer <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ new_employment <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ continued_employment <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ change_previous_employment <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ new_concurrent_employment <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ change_employer <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ amended_petition <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ pw_wage_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ support_h1b <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ labor_con_agree <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ public_disclosure_location <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
There is a single observation in the certified data frame where the prevailing wage is 0. The job title for this position is “DATABASE ARCHITECT”. It seems like an error, since the information in most other columns is NA. This observation has been deleted before further analysis.
certified %>%
filter(prevailing_wage == 0 ) %>%
select(job_title,worksite_state, employer_city) %>%
glimpse()
## Rows: 1
## Columns: 3
## $ job_title <chr> "DATABASE ARCHITECT"
## $ worksite_state <chr> NA
## $ employer_city <chr> NA
There are 12 observations reporting more than $100 million in annual earnings; such figures are extremely unlikely for a business analyst position, so they have been excluded from the analysis. Three observations report salaries above $500k, and one observation has a yearly salary of $37.5, which has also been excluded as a high-leverage point.
median_salary <- salary_range %>%
filter(yearly_wage_offered <= 200000) %>% # 15 outliers change the graph's scale. so exclude them from the analysis
select(yearly_wage_offered, state, worksite_city, region, prevailing_yearly_wage, employer_name, case_submitted) %>%
group_by(state) %>%
mutate(salary_median = median(yearly_wage_offered)) %>% # mutate a column with median salary grouped by state
glimpse()
# summary(median_salary $ yearly_wage_offered)
median_salary %>%
filter(yearly_wage_offered <= 10000)%>%
glimpse()
# rm(median_salary)
Also, some observations in the prevailing wage column are recorded as hourly wages.
# let's find out the lower wage range
salary_range %>%
filter(prevailing_wage < 100 & pw_unit_of_pay == "Hour" ) %>%
select(state, prevailing_wage, pw_unit_of_pay)%>%
glimpse()
## Rows: 2,160
## Columns: 3
## $ state <chr> "PA", "FL", "NH", "NY", "FL", "CA", "NY", "NJ", "IL", …
## $ prevailing_wage <dbl> 23.29, 20.30, 30.70, 26.91, 20.30, 29.86, 26.91, 30.75…
## $ pw_unit_of_pay <chr> "Hour", "Hour", "Hour", "Hour", "Hour", "Hour", "Hour"…
We can see that the observations with prevailing-wage records lower than $100 are recorded as hourly wages. For the analysis, those observations have been converted to yearly wages by multiplying by 2,080, the standard number of office hours in a year (40 hours/week × 52 weeks).
Next, build the data frames used for the analysis.
# one row has prevailing wage 0 and mostly NA values in its other columns, so we ignore that observation
certified <- certified %>%
filter(wage_rate_of_pay_from > 0 ) %>%
glimpse()
# create a data set for business analyst position.
salary_range <- certified %>%
rename(state = worksite_state)%>%
filter(job_title=="BUSINESS ANALYST") %>%
glimpse()
rm(missing_dropped)
# sum(is.na(salary_range))
# mutate two columns, converting hourly-wage rows to yearly wages
salary_range <- salary_range %>%
  mutate(prevailing_yearly_wage = ifelse(pw_unit_of_pay == "Hour", prevailing_wage * 2080, prevailing_wage),
         yearly_wage_offered = ifelse(pw_unit_of_pay == "Hour", wage_rate_of_pay_from * 2080, wage_rate_of_pay_from)) %>% # if the pay unit is hourly, convert to yearly; otherwise keep the value as is
glimpse()
# check if the calculation is valid
salary_range %>%
filter(yearly_wage_offered <= 1000) %>%
select(yearly_wage_offered,wage_rate_of_pay_from, pw_unit_of_pay) %>%
glimpse()
# finding salary range in different regions of USA
#install.packages("readr")
library(readr)
sta <-read_csv(here("Data","us_census_bureau_regions.csv"))
colnames(sta)[2] <- "state" # Rename second column
colnames(sta)[3] <- "region" # Rename third column
# merging salary_range and sta data sets to visualize the salary across different regions in the USA
salary_range <- salary_range %>%
left_join(sta, by = c("state" = "state"), copy = TRUE)
# There are 16 NA observations in the region column. Let's filter them and check the corresponding states
sapply(salary_range, function(x) sum(is.na(x)))
# The 16 NA observations in region belong to "PR" (Puerto Rico) and "VI" (Virgin Islands), which are U.S. territories. We will not include those observations in the analysis.
# Let's look at those NA observations for region.
region_na <- salary_range %>%
filter(is.na(region)) %>%
select(state, region,yearly_wage_offered) %>%
glimpse()
# remove NA observations for region
salary_range <- salary_range %>%
drop_na(region) # drops rows missing values for any of these columns
rm(region_na)
Now, import the regional price parity data to find the cost-of-living index.
# import regional price parity data
rpp <-read_excel(here("Data","RPP.xlsx"))
#rm(rpp)
## choose rpp for all items in every state
rpp <- rpp %>%
filter(Description == "RPPs: All items") %>%
glimpse()
# convert wide to long format
rpp <- gather(rpp, year, amount, c(5:7))
# joining with sta data frame to find the state abbreviation
rpp <- rpp %>%
left_join(sta, by = c("GeoName" = "State"), copy = TRUE) %>%
glimpse()
# remove row and columns
rpp <- rpp %>%
slice(-1) %>% # remove first row
select(- c(LineCode, Division)) %>% # remove two columns
glimpse()
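`median_salary_state` (one row per state with its median wage) is used below and in the choropleth, but its construction is not shown; a plausible sketch (an assumption, including the `hover` text used by the interactive map):
median_salary_state <- median_salary %>%
  group_by(state) %>%
  summarise(median = median(yearly_wage_offered)) %>%
  mutate(hover = paste0(state, " median wage: $", round(median))) # hover text for the map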
# join each state's median salary with the regional price parity data
cost_of_living <- median_salary_state %>%
left_join(rpp, by = c("state" = "state"), copy = TRUE) %>%
glimpse()
# add a column for adjusted salary
cost_of_living <- cost_of_living %>%
mutate(adjusted_salary = median/amount *100,
cost_of_living = amount - 100) %>%
glimpse()
# rm(cost_of_living)
Finding the cost-of-living index in different states.
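`cost_of_living_2018` is used below but its construction is not shown; a plausible sketch (an assumption; `year` holds the original column names from gather(), so it is compared as a string):
cost_of_living_2018 <- cost_of_living %>%
  filter(year == "2018")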
# correlation between wage and cost of living
cor(cost_of_living_2018$median, cost_of_living_2018$cost_of_living)
summary(cost_of_living_2018$median)
high_living <- cost_of_living_2018 %>%
filter(cost_of_living > 0) %>%
select(state, median, cost_of_living) %>%
glimpse()
summary(high_living$median)
max(high_living$cost_of_living)
lower_living <- cost_of_living_2018 %>%
filter(cost_of_living < 0) %>%
select(state, median, cost_of_living) %>%
glimpse()
summary(lower_living$median)
lower_living %>%
filter(median >78000 & cost_of_living < 0) %>%
select(state, median, cost_of_living) %>%
glimpse()
cost_of_living_2018 %>%
filter(median >78000 & cost_of_living < 0) %>%
select(state, median, cost_of_living, GeoName) %>%
glimpse()
# state with highest living cost
cost_of_living_2018 %>%
filter(cost_of_living >= 10 ) %>%
select(state, median, cost_of_living, GeoName) %>%
glimpse()
cost_of_living_2018 %>%
filter(median >= 70000) %>%
select(state, median, cost_of_living, GeoName) %>%
glimpse()
cost_of_living_2018 %>%
filter(median <= 65000 & cost_of_living > 0) %>%
select(state, median, cost_of_living) %>%
glimpse()
The distribution of proposed wages for the business analyst position.
# creating breaks for histogram
x <- salary_range$ yearly_wage_offered # prevailing wage has converted to yearly wage
breaks <- seq(floor(min(x)), ceiling(max(x)), by = 50000)
# Histogram with kernel density
#install.packages("scales")
# update.packages(ggplot2)
#install.packages("ggplot2", dependencies=TRUE)
library("scales")
library(ggplot2)
library(dplyr)
(hist_wage <- salary_range %>%
filter(yearly_wage_offered < 200000) %>%
ggplot(aes(x= yearly_wage_offered)) +
geom_histogram(binwidth = 10000, fill = "#B06E6A") +
geom_density(aes(y = after_stat(density)),
lwd = 1, colour = "#B06E6A",
fill = "#B06E6A", alpha = 0.25)+
labs(title = "Annual Wage Density Plot",
x = "\n Yearly Wage", y = "Count \n") +
scale_x_continuous(labels = comma)+ # comma comes from scales library
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.1)))+ # removes the empty space below the bars
theme(panel.grid = element_blank(),
panel.background = element_blank(),
axis.text = element_text(size = 12),
axis.title = element_text(size = 12),
plot.title = element_text(size = 14, hjust = 0.5, face = "bold")))
The distribution of proposed wages in different regions.
# install.packages("ggridges")
library(ggridges)
(salary_ggridge <- ggplot(median_salary, aes(x = yearly_wage_offered, y = region, fill = region)) +
geom_density_ridges(rel_min_height = 0.05) +
facet_wrap(~region) +
theme_ridges() +
scale_fill_manual(values = c("#0F1F38", "#8E7970", "#F55449", "#1B4B5A")) +
theme(legend.position = "none",
panel.grid = element_blank(),
panel.background = element_blank(),# No show legend
axis.title.x = element_text(hjust = 0.5), # x axis title in the center
axis.title.y = element_text(hjust = 0.5)) +
scale_x_continuous(limits=c(10000, 200000)) +# y axis title in the center
labs(x = "\n Annual Wage", y = "Region\n"))Interactive Choropleth map for a detailed view of median salaries in each state.
# install.packages("usmap")
# install.packages("plotly")
library(plotly)
library(usmap) #import the package
color_salary <- c("#1B4B5A", "#8ED6EC", "#46C2E8", "#3F5F69", "#1C4D5C")
fontstyle <- list( # fonts of hover over information
size = 15,
color = "black"
)
label <- list( # control the color of hover
bgcolor = "#EEEEEE", # background color of hover label
bordercolor = "transparent",
font = fontstyle
)
(median_salary_statewise <- plot_geo(median_salary_state,
locationmode = 'USA-states') %>%
add_trace(locations = ~state,
z = ~median,
zmin = 0, # setting the min value for the legend
zmax = max(median_salary_state$median), # set the max value for the legend
color = ~median, # we want the map to be colored by median yearly_wage
colors = color_salary, # use the custom palette defined above
text = ~hover,
hoverinfo = 'text') %>%
layout(geo = list(scope = 'usa' ),
title = "Median Yearly Wage for Business Analyst Position across USA \n 2016 -2018") %>% # only show the USA map
style(hoverlabel = label) %>%
config(displayModeBar = FALSE) %>% # for not showing toolbar on the top
colorbar(tickprefix = "$"))
ggsave(filename = here("Figures/median_salary_statewise.png"),
height = 6, width = 8)
# str(salary_range)
#rm(salary_ggridge)
Now examine which states are paying higher than the prevailing wage.
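`high_paying_state` is used below without being constructed earlier; a plausible sketch (an assumption), keeping the offers above the prevailing wage:
high_paying_state <- salary_range %>%
  filter(yearly_wage_offered > prevailing_yearly_wage) %>% # offers above the prevailing wage
  select(state, yearly_wage_offered, prevailing_yearly_wage)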
library(wordcloud2)
state_name_freq <- count(high_paying_state, state) # state wise diff_prevailing_proposed_wage frequency data for wordcloud
(word_cloud <- wordcloud2(state_name_freq, size = .7, shape = 'star', color = rep_len(c("#0F1F38", "#8E7970", "#F55449", "#1B4B5A"), nrow(state_name_freq))))
How much did a business analyst get paid compared to the U.S. median household income in 2018?
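`median_salary_diff` is likewise used below without being constructed; a plausible sketch (an assumption), measuring each state's median against the $63,179 U.S. median household income:
median_salary_diff <- median_salary_state %>%
  mutate(diff_midean_to_state = median - 63179) # keeps the column name used in the plot below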
(lollipop <- ggplot(median_salary_diff, aes(x = state, y = diff_midean_to_state)) +
geom_hline(yintercept= 0, color="#0F1F38", size=.5) + # adding line to see the difference
#geom_point(aes(fill = region)) +
geom_segment(aes(x = state, xend = state, y = 0, yend = diff_midean_to_state),
             color = ifelse(between(median_salary_diff$diff_midean_to_state, -40000, -20000), "#F55449", "grey"),
             size = ifelse(between(median_salary_diff$diff_midean_to_state, -40000, -20000), 1.3, 0.7)
) +
geom_point(
  color = ifelse(between(median_salary_diff$diff_midean_to_state, -40000, -20000), "#F55449", "grey"),
  size = ifelse(between(median_salary_diff$diff_midean_to_state, -40000, -20000), 5, 2)
) +
theme(panel.grid = element_blank(),
legend.position="none",
panel.background = element_blank(),
panel.border = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5)
) +
xlab("") +
ylab("State's Median Wage vs USA's Median Wage") +
#ggtitle("How much did a business analyst get paid in 2018 compared \n to the U.S. median household income?") +
annotate("text", x=grep("PR", median_salary_diff$state), y=median_salary_diff$diff_midean_to_state[which(median_salary_diff$state=="PR")],
label="Puerto Rico \n paid the lowest",
color="#0F1F38",
size=2.5 ,
angle=0,
fontface="bold",
hjust=1, # text align left adjust
nudge_y = 10,
direction = "y",
)
) what was the trend of a business analyst’s annual earnings from 2016 to 2018?
library(ggplot2)
(scatter_line <- ggplot(median_salary, aes(x = case_submitted, y = yearly_wage_offered, color = region)) +
  geom_point() +
  #geom_smooth(method = "lm") +
  stat_smooth(method = "lm", color = "black") +
facet_wrap(~region) +
theme_bw() +
scale_color_manual(values = c("#66A5AD", "#8E7970", "#F55449", "#1B4B5A")) +
theme(axis.text.x = element_text(size = 12, angle = 45, vjust = 1, hjust = 1),
panel.grid = element_blank(),
panel.background = element_blank()
) +
# panel.border = element_blank())
xlab("") +
ylab("Annual Wage Offered")Cost of living index vs median annual wage in each state
# install.packages("ggrepel")
library(ggrepel)
library(tidyverse)
(living_cost_scatterplot <- ggplot(cost_of_living_2018, aes(median, cost_of_living, label = state)) +
  geom_hline(yintercept = 0, color = "black", size = .5) + # add a line at the national average
  geom_point(aes(color = living_cost_highlight)) +
  stat_smooth(method = "lm", color = "#003B46") +
theme_bw() +
scale_color_manual(values = c("#F55449", "#1B4B5A", "#66A5AD")) +
geom_text_repel(data = cost_of_living_2018 %>% filter( cost_of_living > 10| cost_of_living < - 10),
nudge_y = 1,
hjust = 0.5,
segment.color = "gray60") +
#direction = "y")
theme(panel.grid = element_blank(),
panel.background = element_blank(),
legend.position = "none",
panel.border = element_blank()) +
labs(title = "Median Annual Wage vs Cost-of-Living Index for 2018",
x = "Median Annual Wage", y = "Cost-of-Living Index (national average = 0)"
)