This document includes a series of appendices that discuss data sources, data wrangling, and data visualization. Navigate through the document by using the panel at the left.
The H1B visa is a nonimmigrant, temporary visa that allows U.S. employers to hire highly educated foreign professionals, particularly in the fields of Science, Technology, Engineering, and Math. Under the H1B program, U.S. employers can hire 65,000 foreign workers annually for specialty occupations (and fashion models of exceptional merit and ability). An additional 20,000 petitions can be filed on behalf of beneficiaries with a master's degree or higher from a U.S. institution of higher education. Before filing an H1B visa petition, an employer must obtain a certified Labor Condition Application (LCA) from the U.S. Department of Labor (DOL). The LCA ensures that workers in H1B status are compensated fairly and that their employment will not adversely affect the wages and working conditions (such as hours, vacations, and benefits) of similarly employed U.S. workers. To that end, DOL regulations require that the wage offered to a foreign worker be at least the prevailing wage rate for the occupational classification in the area of employment. The prevailing wage rate is the average wage paid to similarly employed workers in a specific occupation in a given geographic area.
This project explores the typical earnings of an H1B applicant for a business analyst position in different regions of the USA, and the value of those earnings after adjusting for the cost of living across the country.
The Office of Foreign Labor Certification (OFLC) discloses LCA data for various stakeholders and the general public. The data have been downloaded from the Performance Data page of the U.S. Department of Labor's website.
The median salary for a business analyst is $69,000. Only a few reported salaries range from $100,000 to $200,000 for a business analyst position, which skews the histogram to the right. These high salaries are offered primarily in New York, California, and New Jersey, so location might play a significant role in driving higher wages. At the same time, other factors related to a candidate's work experience, skill set, and education can also affect salaries; the data set has no information about a candidate's competence level or whether a position is entry level or senior level. Overall, the most frequently offered salaries fall in the $50,000 to $70,000 range.
How does a business analyst’s salary vary geographically?
The data have been divided into four regions: Midwest, Northeast, South, and West. In all four regions, the most frequently offered yearly wage for the business analyst position ranges from $65,000 to $85,000. Annual earnings are most variable in the West, ranging from $33,280 to $200,000. The higher-salary ratio (the count of salaries greater than $80,000 divided by the total count of wages in each region) is highest in the West, at 34%, compared with roughly 20% in the other three regions. Most of the higher wages are reported in the big cities of California. In the Northeast, yearly earnings are bi-modal: H1B-sponsored business analysts are most frequently paid around $65,000 and $85,000. The lower-salary ratio in the South (the count of salaries less than $60,000 divided by the total count of wages in each region) is similar to that of the West and Northeast. The Midwest has a much more condensed distribution and the highest lower-salary ratio of all regions, at 12%.
Let's compare the central tendency of these regions. The median salary in the West is $74,554, the highest of the four regions. In contrast, the median salary of a business analyst in the South is $65,000, the lowest of the four. The Northeast and Midwest offer median wages of $70,574 and $66,224, respectively.
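These regional medians and salary ratios can be reproduced with a short dplyr summary; a minimal sketch, assuming the `salary_range` data frame built in the appendix (one row per certified application, with `yearly_wage_offered` and `region` columns):
library(dplyr)
salary_range %>%
  group_by(region) %>%
  summarise(median_wage         = median(yearly_wage_offered),
            higher_salary_ratio = mean(yearly_wage_offered > 80000), # share of offers above $80,000
            lower_salary_ratio  = mean(yearly_wage_offered < 60000)) # share of offers below $60,000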
At the bottom of the range, median earnings are $60,000 in Oklahoma and South Carolina; at the top, $80,621 in Montana. An interactive choropleth map gives a detailed view of median salaries in each state.
Who is paying more than the prevailing wage?
When determining salaries for H1B employees, employers must pay at least the prevailing wage determined by the National Prevailing Wage Center (NPWC) or another legitimate source, such as the Online Wage Library. However, an employer can offer more than the determined prevailing wage. Business analysts in New Jersey, California, Texas, and New York are most frequently offered wages above the prevailing wage. Only two yearly wage offers, one in Florida and one in Texas, were found to be below the prevailing wage.
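The below-prevailing-wage cases can be isolated directly; a minimal sketch, again assuming the appendix's `salary_range` data frame with its `yearly_wage_offered` and `prevailing_yearly_wage` columns:
salary_range %>%
  filter(yearly_wage_offered < prevailing_yearly_wage) %>% # offers below the prevailing wage
  select(state, yearly_wage_offered, prevailing_yearly_wage)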
How much did a business analyst get paid in 2018 compared to the U.S. median household income?
The U.S. median household income was $63,179, according to U.S. Census Bureau statistics. In most states, the median earnings of a business analyst were higher than the U.S. median household income. Washington, DC, Hawaii, New Jersey, New York, and Washington are the top five in paying above the U.S. median household income, while Mississippi, Oklahoma, Puerto Rico, and West Virginia sit at the bottom, paying below it.
Is there any trend in business analysts' salaries from 2016 to 2018?
From 2016 to 2018, business analysts' annual salaries remained almost the same in each region; no trend is visible in yearly earnings over this period. Random effects might explain the high variability of annual salaries.
What does it mean to live in two different states with the same yearly earnings?
Although a business analyst’s annual salary does not vary significantly across states, the cost of living can differ enormously, allowing a divergence between incomes in different places. A lower cost of living, which includes lower prices for housing, food, education, transportation, and other goods and services, allows the same dollar of wages to stretch further. We can see how much higher or lower the cost of the same bundle of goods and services is in each location relative to the national average cost by using the Bureau of Economic Analysis’ (BEA’s) regional price parities (RPPs). The RPP is calculated using Consumer Price Index prices and American Community Survey housing rents. RPPs are expressed as a percentage of each year’s overall national price level, so RPPs higher than 100 represent state prices higher than the national average and vice versa.
| year | state | median_annual_wage | RPP | adjusted_annual_median_wage | cost_of_living |
|---|---|---|---|---|---|
| 2018 | IA | 65000 | 91.596 | 70963.80 | -8.404 |
| 2018 | OH | 65000 | 92.886 | 69978.25 | -7.114 |
| 2018 | TX | 65000 | 98.356 | 66086.46 | -1.644 |
The table shows the same annual median earnings ($65,000) but different living costs, giving a sense of how the cost of living can vary across the country. Iowa's cost-of-living index (-8.4) means prices there are about 8.4% below the national average; Ohio's is about 7.1% below; and Texas's is only about 1.6% below. When adjusted for cost of living, $65,000 in wages is worth $70,963 in Iowa, while the same earnings are worth only $66,086 in Texas.
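The adjustment itself is simple arithmetic; a minimal sketch using Iowa's 2018 values from the table above:
median_annual_wage <- 65000
rpp <- 91.596                                     # Iowa's regional price parity, 2018
(adjusted_wage <- median_annual_wage / rpp * 100) # 70963.80: what $65,000 is "worth" in Iowa
(cost_of_living_index <- rpp - 100)               # -8.404: prices 8.4% below the national average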
The figure above shows the median annual wage versus the cost-of-living index by state. A state's median annual wage is not strongly correlated with its cost-of-living index. Thirty-five states have living costs below the national average; Montana offers the highest median salary, $80,621, with a cost of living 7.8% below the national average. Sixteen states have living costs above the national average, offering median salaries ranging from $60,000 to $78,000. California, Washington, DC, New Jersey, and Hawaii have the highest living costs, roughly 10% above the national average. Among these top four, Hawaii offers the lowest median salary, $63,835.
Overall, an H1B-sponsored business analyst's salary mostly ranges from $65,000 to $85,000, and from 2016 to 2018 the annual median wage remained almost the same. Geographically, the Western region has a higher higher-salary ratio than the others. In most states, a nonimmigrant business analyst's median earnings exceeded the 2018 U.S. median household income. Median yearly earnings have a weak positive correlation with the cost-of-living index. States such as California, Washington, DC, New Jersey, and Hawaii have the highest living costs and offer median salaries ranging from $60,000 to $78,000. In contrast, Montana and Vermont have the highest median wages with living costs below the national average, by 7.8% and 0.3%, respectively.
The Office of Foreign Labor Certification (OFLC) discloses LCA data for various stakeholders and the general public. The data have been downloaded from the Performance Data page of the U.S. Department of Labor's website. The data sets are available in Microsoft Excel (.xlsx) format and are organized by federal fiscal year (October 1 through September 30). This exploratory data analysis covers the 2016 to 2018 data. The three data frames have been joined together; the combined data frame has 1,926,862 rows and 52 columns.
The data frames outline details of the H1B employment, such as the occupational classification being sought, the number of nonimmigrants covered, the wage rate to be paid to the beneficiary, the prevailing wage for the occupation in the area of intended employment, the work location/s, and the dates of employment.
The LCA data from 2016 to 2018 have been used for the analysis. The three data frames have been combined in R; overall, the combined data frame has 1,926,862 rows and 52 columns.
State abbreviations and regions data: U.S. state abbreviations and regional information have been downloaded from here
Cost of Living Data: Bureau of Economic Analysis’ Regional Price Parities (RPP)
To begin, load the necessary libraries and read in the data.
# load libraries
load_packages <- pacman::p_load(
  readxl,     # read Excel files
  skimr,
  tidyverse,  # dplyr, tidyr, and friends for wrangling
  janitor,    # clean_names() and duplicate checks
  stringr,    # string searches
  ggplot2,
  lubridate,
  naniar,     # missing-data summaries and plots
  forcats,
  rmarkdown,
  here,       # project-relative file paths
  tinytex,
  knitr,
  kableExtra) # formatted tables
# Import data
H1B_2016 <- read_excel(here("Data","Raw-data","H-1B_Disclosure_Data_FY16.xlsx"))
H1B_2017 <- read_excel(here("Data","Raw-data","H-1B_Disclosure_Data_FY17.xlsx"))
H1B_2018 <- read_excel(here("Data","Raw-data","H-1B_Disclosure_Data_FY2018_EOY.xlsx"))
Let's do some primary cleaning with the janitor package.
H1B_2016<- H1B_2016 %>%
clean_names()
H1B_2017<- H1B_2017 %>%
clean_names()
H1B_2018<- H1B_2018 %>%
clean_names()
Check the common variables of the three data sets.
setequal(H1B_2017, H1B_2018) # FALSE is returned, so the observations are not identical
# intersect(H1B_2017, H1B_2018) # the `new_concurrent_employment` variable is named differently in the 2018 data set
# rename the 2018 data set's `new_concurrent_emp` variable to `new_concurrent_employment`
H1B_2018 <- H1B_2018 %>%
  rename(`new_concurrent_employment` = `new_concurrent_emp`) # new name = old name
The 2017 and 2018 data frames now have identical column names. Three column names differ in the 2016 data frame and have been renamed to match the other two data frames: "h_1b_dependent" to "h1b_dependent", "naic_code" to "naics_code", and "pw_wage_source" to "pw_source".
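Before renaming, a quick side check (a sketch, not part of the original pipeline) can confirm which names differ:
setdiff(names(H1B_2016), names(H1B_2017)) # names present only in the 2016 data
setdiff(names(H1B_2017), names(H1B_2016)) # names present only in 2017/2018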
library(janitor)
H1B_2016<- H1B_2016 %>%
clean_names()%>%
rename(naics_code = naic_code,
pw_source = pw_wage_source,
h1b_dependent= h_1b_dependent)
# combine 2017 and 2018 data frames together
H1B_17_nd_18 <- rbind(H1B_2017,H1B_2018)
# Joining three data frames
H1B_16_to_18 <- bind_rows(H1B_2016,H1B_17_nd_18) # 2016 data set have fewer columns than 2017 and 2018. So, we use bind_rows()
# remove intermediate data frames to free memory
rm(H1B_2016, H1B_2017, H1B_2018, H1B_17_nd_18)
How many missings?
# how many missings?
n_var_miss(H1B_16_to_18)
## [1] 45
Out of 52 variables, 45 have missing values.
# percent of ALL data frame values that are missing
pct_miss(H1B_16_to_18)
## [1] 18.35342
# Percent of rows with any value missing
pct_miss_case(H1B_16_to_18)
## [1] 100
# Percent of rows that are complete (no values missing)
pct_complete_case(H1B_16_to_18)
## [1] 0
# count the missing values for every column
Missing_data <- sapply(H1B_16_to_18, function(x) sum(is.na(x)))
kable(Missing_data)
| variable | missing values |
|---|---|
| case_number | 0 |
| case_status | 0 |
| case_submitted | 1 |
| decision_date | 0 |
| visa_class | 0 |
| employment_start_date | 53 |
| employment_end_date | 66 |
| employer_name | 16 |
| employer_address | 16 |
| employer_city | 15 |
| employer_state | 125 |
| employer_postal_code | 16 |
| employer_country | 96517 |
| employer_province | 1413325 |
| employer_phone | 96518 |
| employer_phone_ext | 1802385 |
| agent_attorney_name | 0 |
| agent_attorney_city | 710076 |
| agent_attorney_state | 761609 |
| job_title | 13 |
| soc_code | 19 |
| soc_name | 20 |
| naics_code | 18 |
| total_workers | 0 |
| full_time_position | 647863 |
| prevailing_wage | 5 |
| pw_unit_of_pay | 152 |
| pw_source | 154 |
| pw_source_year | 172 |
| pw_source_other | 10939 |
| wage_rate_of_pay_from | 0 |
| wage_rate_of_pay_to | 5 |
| wage_unit_of_pay | 31 |
| h1b_dependent | 41170 |
| willful_violator | 41179 |
| worksite_city | 23 |
| worksite_county | 32 |
| worksite_state | 34 |
| worksite_postal_code | 25 |
| original_cert_date | 1785088 |
| employer_business_dba | 1495366 |
| agent_representing_employer | 744376 |
| new_employment | 647852 |
| continued_employment | 647852 |
| change_previous_employment | 647852 |
| new_concurrent_employment | 647852 |
| change_employer | 647852 |
| amended_petition | 647852 |
| pw_wage_level | 744355 |
| support_h1b | 737830 |
| labor_con_agree | 1445958 |
| public_disclosure_location | 1926862 |
Proportion of missing data
# what is the proportion of missing data for each variable?
pctmiss <- colSums(is.na(H1B_16_to_18))/nrow(H1B_16_to_18)
round(pctmiss, 2)
## case_number case_status
## 0.00 0.00
## case_submitted decision_date
## 0.00 0.00
## visa_class employment_start_date
## 0.00 0.00
## employment_end_date employer_name
## 0.00 0.00
## employer_address employer_city
## 0.00 0.00
## employer_state employer_postal_code
## 0.00 0.00
## employer_country employer_province
## 0.05 0.73
## employer_phone employer_phone_ext
## 0.05 0.94
## agent_attorney_name agent_attorney_city
## 0.00 0.37
## agent_attorney_state job_title
## 0.40 0.00
## soc_code soc_name
## 0.00 0.00
## naics_code total_workers
## 0.00 0.00
## full_time_position prevailing_wage
## 0.34 0.00
## pw_unit_of_pay pw_source
## 0.00 0.00
## pw_source_year pw_source_other
## 0.00 0.01
## wage_rate_of_pay_from wage_rate_of_pay_to
## 0.00 0.00
## wage_unit_of_pay h1b_dependent
## 0.00 0.02
## willful_violator worksite_city
## 0.02 0.00
## worksite_county worksite_state
## 0.00 0.00
## worksite_postal_code original_cert_date
## 0.00 0.93
## employer_business_dba agent_representing_employer
## 0.78 0.39
## new_employment continued_employment
## 0.34 0.34
## change_previous_employment new_concurrent_employment
## 0.34 0.34
## change_employer amended_petition
## 0.34 0.34
## pw_wage_level support_h1b
## 0.39 0.38
## labor_con_agree public_disclosure_location
## 0.75 1.00
# kable(pctmiss)
The output above shows the proportion of missing data for each variable.
Now, look at the entire data frame to get the average share of missing values in each column.
# We can look at an entire data frame by using summarise(across())
missing <- H1B_16_to_18 %>% # one-row data frame: the share of missing values in every column
  summarise(across(everything(), ~ mean(is.na(.x))))
glimpse(missing[1:10])
## Rows: 1
## Columns: 10
## $ case_number <dbl> 0
## $ case_status <dbl> 0
## $ case_submitted <dbl> 5.189785e-07
## $ decision_date <dbl> 0
## $ visa_class <dbl> 0
## $ employment_start_date <dbl> 2.750586e-05
## $ employment_end_date <dbl> 3.425258e-05
## $ employer_name <dbl> 8.303656e-06
## $ employer_address <dbl> 8.303656e-06
## $ employer_city <dbl> 7.784678e-06
Visualizing Missing Data
missing %>%
pivot_longer(case_number:support_h1b, names_to = "variable",
values_to = "pct") %>%
filter(pct > 0.1) %>%
ggplot(aes(x=reorder(variable, pct), y=pct)) +
coord_flip() +
geom_col()
Showing missingness with gg_miss_var
# showing missingness with gg_miss_var
gg_miss_var(H1B_16_to_18, show_pct = TRUE) # counts are shown by default; show_pct = TRUE switches to percentages
## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.
## ℹ The deprecated feature was likely used in the naniar package.
## Please report the issue at <https://github.com/njtierney/naniar/issues>.
Dropping Missing Data: dropping the four columns that have more than 50 percent missing values.
missing_dropped <- H1B_16_to_18 %>%
select(-c(employer_business_dba, employer_phone_ext, employer_province, original_cert_date))
rm(H1B_16_to_18)
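As a programmatic alternative (a sketch, to be run in place of the select() above, before H1B_16_to_18 is removed), the same rule can be applied automatically: keep only the columns with at most 50 percent missing values.
missing_dropped <- H1B_16_to_18 %>%
  select(where(~ mean(is.na(.x)) <= 0.5)) # requires dplyr >= 1.0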
Now check for 100% duplicates across all columns.
missing_dropped %>%
  janitor::get_dupes()
## No variable names specified - using all columns.
## No duplicate combinations found of: case_number, case_status, case_submitted, decision_date, visa_class, employment_start_date, employment_end_date, employer_name, employer_address, ... and 39 other variables
## # A tibble: 0 × 49
## # … with 49 variables: case_number <chr>, case_status <chr>,
## # case_submitted <dttm>, decision_date <dttm>, visa_class <chr>,
## # employment_start_date <dttm>, employment_end_date <dttm>,
## # employer_name <chr>, employer_address <chr>, employer_city <chr>,
## # employer_state <chr>, employer_postal_code <chr>, employer_country <chr>,
## # employer_phone <chr>, agent_attorney_name <chr>, agent_attorney_city <chr>,
## # agent_attorney_state <chr>, job_title <chr>, soc_code <chr>, …
There are no duplicated observations in the data set.
The analysis has been done on certified cases, in simpler terms, the cases whose LCAs were certified by the Department of Labor.
# Extracting certified cases
certified <- missing_dropped %>%
filter(case_status == "CERTIFIED") %>%
glimpse()
## Rows: 1,694,789
## Columns: 48
## $ case_number <chr> "I-200-15198-889809", "I-200-15209-285824"…
## $ case_status <chr> "CERTIFIED", "CERTIFIED", "CERTIFIED", "CE…
## $ case_submitted <dttm> 2015-11-10, 2015-10-14, 2015-09-30, 2015-…
## $ decision_date <dttm> 2015-11-18, 2015-10-20, 2015-10-06, 2015-…
## $ visa_class <chr> "H-1B", "H-1B", "H-1B", "H-1B", "H-1B", "H…
## $ employment_start_date <dttm> 2015-11-10, 2016-01-14, 2015-10-10, 2016-…
## $ employment_end_date <dttm> 2018-11-09, 2018-01-13, 2016-11-12, 2018-…
## $ employer_name <chr> "QUICKLOGIX LLC", "MCCHRYSTAL GROUP, LLC",…
## $ employer_address <chr> "265 FRANKLIN STREET", "333 NORTH FAIRFAX …
## $ employer_city <chr> "BOSTON", "ALEXANDRIA", "SAN DIEGO", "CHUL…
## $ employer_state <chr> "MA", "VA", "CA", "CA", "MA", "CA", "TX", …
## $ employer_postal_code <chr> "02110", "22314", "92117", "91910", "02110…
## $ employer_country <chr> "UNITED STATES OF AMERICA", "UNITED STATES…
## $ employer_phone <chr> "6179877989", "5713128637", "8584059128", …
## $ agent_attorney_name <chr> "FURO, RUKAYYA", "ALEXANDER, JAMES", "FELD…
## $ agent_attorney_city <chr> "OWINGS MILLS", "WASHINGTON", "SAN DIEGO",…
## $ agent_attorney_state <chr> "MD", "DC", "CA", "CA", "MD", "CA", "TX", …
## $ job_title <chr> "CEO", "PRESIDENT, NORTHEAST REGION", "CEO…
## $ soc_code <chr> "11-1011", "11-1011", "11-1011", "11-1011"…
## $ soc_name <chr> "CHIEF EXECUTIVES", "CHIEF EXECUTIVES", "C…
## $ naics_code <chr> "541519", "541611", "541511", "611110", "5…
## $ total_workers <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ full_time_position <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ prevailing_wage <dbl> 90.00, 241842.00, 99986.00, 99986.00, 90.0…
## $ pw_unit_of_pay <chr> "Hour", "Year", "Year", "Year", "Hour", "Y…
## $ pw_source <chr> "OES", "OES", "OES", "OES", "OES", "OES", …
## $ pw_source_year <chr> "2015", "2015", "2015", "2015", "2015", "2…
## $ pw_source_other <chr> "OFLC ONLINE DATA CENTER", "OFLC ONLINE DA…
## $ wage_rate_of_pay_from <dbl> 90.00, 245000.00, 100000.00, 100000.00, 90…
## $ wage_rate_of_pay_to <dbl> 0, 0, 0, 120000, 0, 0, 0, 0, 0, 0, 202533,…
## $ wage_unit_of_pay <chr> "Hour", "Year", "Year", "Year", "Hour", "Y…
## $ h1b_dependent <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N…
## $ willful_violator <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N…
## $ worksite_city <chr> "SANTA CLARA", "ALEXANDRIA", "SAN DIEGO", …
## $ worksite_county <chr> "SANTA CLARA", "ALEXANDRIA CITY", "SAN DIE…
## $ worksite_state <chr> "CA", "VA", "CA", "CA", "CA", "CA", "TX", …
## $ worksite_postal_code <chr> "95054", "22134", "92117", "91910", "95054…
## $ agent_representing_employer <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ new_employment <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ continued_employment <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ change_previous_employment <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ new_concurrent_employment <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ change_employer <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ amended_petition <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ pw_wage_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ support_h1b <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ labor_con_agree <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ public_disclosure_location <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
There is a single observation in the certified data frame where the prevailing wage is 0. The job title for this position is “DATABASE ARCHITECT”. It seems like an error, since the information in most other columns is NA. This observation has been deleted before further analysis.
certified %>%
filter(prevailing_wage == 0 ) %>%
select(job_title,worksite_state, employer_city) %>%
glimpse()
## Rows: 1
## Columns: 3
## $ job_title <chr> "DATABASE ARCHITECT"
## $ worksite_state <chr> NA
## $ employer_city <chr> NA
There are 12 observations reporting more than $100 million in annual earnings; such figures are extremely unlikely for a business analyst position, so they have been excluded from the analysis. Three observations report salaries above $500k, and one observation has a yearly salary of $37.5, which has also been excluded as a high-leverage point.
median_salary <- salary_range %>%
filter(yearly_wage_offered <= 200000) %>% # 15 outliers change the graph's scale. so exclude them from the analysis
select(yearly_wage_offered, state, worksite_city, region, prevailing_yearly_wage, employer_name, case_submitted) %>%
group_by(state) %>%
mutate(salary_median = median(yearly_wage_offered)) %>% # mutate a column with median salary grouped by state
glimpse()
# summary(median_salary $ yearly_wage_offered)
median_salary %>%
filter(yearly_wage_offered <= 10000)%>%
glimpse()
# rm(median_salary)
Also, some observations in the prevailing wage column are recorded as hourly wages.
# let's find out the lower wage range
salary_range %>%
filter(prevailing_wage < 100 & pw_unit_of_pay == "Hour" ) %>%
select(state, prevailing_wage, pw_unit_of_pay)%>%
glimpse()
## Rows: 2,160
## Columns: 3
## $ state <chr> "PA", "FL", "NH", "NY", "FL", "CA", "NY", "NJ", "IL", …
## $ prevailing_wage <dbl> 23.29, 20.30, 30.70, 26.91, 20.30, 29.86, 26.91, 30.75…
## $ pw_unit_of_pay <chr> "Hour", "Hour", "Hour", "Hour", "Hour", "Hour", "Hour"…
We can see that the observations with prevailing-wage records lower than $100 are recorded as hourly wages. For the analysis, those observations have been converted to yearly wages by multiplying by 2,080, the standard number of office hours in a year (40 hours/week × 52 weeks).
Next, build the data frames used for the analysis.
# one row has prevailing wage 0 and mostly NA values in its other columns, so we ignore that observation
certified <- certified %>%
filter(wage_rate_of_pay_from > 0 ) %>%
glimpse()
# create a data set for business analyst position.
salary_range <- certified %>%
rename(state = worksite_state)%>%
filter(job_title=="BUSINESS ANALYST") %>%
glimpse()
rm(missing_dropped)
# sum(is.na(salary_range))
# mutate two columns, converting hourly-wage rows to yearly wages
salary_range <- salary_range %>%
  mutate(prevailing_yearly_wage = ifelse(pw_unit_of_pay == "Hour", prevailing_wage * 2080, prevailing_wage),
         yearly_wage_offered = ifelse(pw_unit_of_pay == "Hour", wage_rate_of_pay_from * 2080, wage_rate_of_pay_from)) %>% # if the pay unit is hourly, convert to yearly; otherwise keep the value as is
glimpse()
# check if the calculation is valid
salary_range %>%
filter(yearly_wage_offered <= 1000) %>%
select(yearly_wage_offered,wage_rate_of_pay_from, pw_unit_of_pay) %>%
glimpse()
# finding salary range in different regions of USA
#install.packages("readr")
library(readr)
sta <-read_csv(here("Data","us_census_bureau_regions.csv"))
colnames(sta)[2] <- "state" # Rename second column
colnames(sta)[3] <- "region" # Rename third column
# merging salary_range and sta data sets to visualize the salary across different regions in the USA
salary_range <- salary_range %>%
left_join(sta, by = c("state" = "state"), copy = TRUE)
# There are 16 NA observations in the region column. Let's filter them and check the corresponding states
sapply(salary_range, function(x) sum(is.na(x)))
# The 16 NA observations in region belong to "PR" (Puerto Rico) and "VI" (Virgin Islands), which are U.S. territories. We will not include those observations in the analysis.
# Let's look at those NA observations for region.
region_na <- salary_range %>%
filter(is.na(region)) %>%
select(state, region,yearly_wage_offered) %>%
glimpse()
# remove NA observations for region
salary_range <- salary_range %>%
drop_na(region) # drops rows missing values for any of these columns
rm(region_na)
Now, import the regional price parity data to find the cost-of-living index.
# import regional price parity data
rpp <-read_excel(here("Data","RPP.xlsx"))
#rm(rpp)
## choose rpp for all items in every state
rpp <- rpp %>%
filter(Description == "RPPs: All items") %>%
glimpse()
# convert wide to long format
rpp <- gather(rpp, year, amount, c(5:7))
# joining with sta data frame to find the state abbreviation
rpp <- rpp %>%
left_join(sta, by = c("GeoName" = "State"), copy = TRUE) %>%
glimpse()
# remove row and columns
rpp <- rpp %>%
slice(-1) %>% # remove first row
select(- c(LineCode, Division)) %>% # remove two columns
glimpse()
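`median_salary_state` (one row per state with its median wage) is used below and in the choropleth, but its construction is not shown; a plausible sketch (an assumption, including the `hover` text used by the interactive map):
median_salary_state <- median_salary %>%
  group_by(state) %>%
  summarise(median = median(yearly_wage_offered)) %>%
  mutate(hover = paste0(state, " median wage: $", round(median))) # hover text for the map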
# join each state's median salary with the regional price parity data
cost_of_living <- median_salary_state %>%
left_join(rpp, by = c("state" = "state"), copy = TRUE) %>%
glimpse()
# add a column for adjusted salary
cost_of_living <- cost_of_living %>%
mutate(adjusted_salary = median/amount *100,
cost_of_living = amount - 100) %>%
glimpse()
# rm(cost_of_living)
Finding the cost-of-living index in different states.
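`cost_of_living_2018` is used below but its construction is not shown; a plausible sketch (an assumption; `year` holds the original column names from gather(), so it is compared as a string):
cost_of_living_2018 <- cost_of_living %>%
  filter(year == "2018")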
# correlation between wage and cost of living
cor(cost_of_living_2018$median, cost_of_living_2018$cost_of_living)
summary(cost_of_living_2018$median)
high_living <- cost_of_living_2018 %>%
filter(cost_of_living > 0) %>%
select(state, median, cost_of_living) %>%
glimpse()
summary(high_living$median)
max(high_living$cost_of_living)
lower_living <- cost_of_living_2018 %>%
filter(cost_of_living < 0) %>%
select(state, median, cost_of_living) %>%
glimpse()
summary(lower_living$median)
lower_living %>%
filter(median >78000 & cost_of_living < 0) %>%
select(state, median, cost_of_living) %>%
glimpse()
cost_of_living_2018 %>%
filter(median >78000 & cost_of_living < 0) %>%
select(state, median, cost_of_living, GeoName) %>%
glimpse()
# state with highest living cost
cost_of_living_2018 %>%
filter(cost_of_living >= 10 ) %>%
select(state, median, cost_of_living, GeoName) %>%
glimpse()
cost_of_living_2018 %>%
filter(median >= 70000) %>%
select(state, median, cost_of_living, GeoName) %>%
glimpse()
cost_of_living_2018 %>%
filter(median <= 65000 & cost_of_living > 0) %>%
select(state, median, cost_of_living) %>%
glimpse()
The distribution of proposed wages for the business analyst position.
# creating breaks for histogram
x <- salary_range$ yearly_wage_offered # prevailing wage has converted to yearly wage
breaks <- seq(floor(min(x)), ceiling(max(x)), by = 50000)
# Histogram with kernel density
#install.packages("scales")
# update.packages(ggplot2)
#install.packages("ggplot2", dependencies=TRUE)
library("scales")
library(ggplot2)
library(dplyr)
(hist_wage <- salary_range %>%
filter(yearly_wage_offered < 200000) %>%
ggplot(aes(x= yearly_wage_offered)) +
geom_histogram(binwidth = 10000, fill = "#B06E6A") +
geom_density(aes(y = after_stat(density)),
lwd = 1, colour = "#B06E6A",
fill = "#B06E6A", alpha = 0.25)+
labs(title = "Annual Wage Density Plot",
x = "\n Yearly Wage", y = "Count \n") +
scale_x_continuous(labels = comma)+ # comma comes from scales library
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.1)))+ # removes the empty space below the bars
theme(panel.grid = element_blank(),
panel.background = element_blank(),
axis.text = element_text(size = 12),
axis.title = element_text(size = 12),
plot.title = element_text(size = 14, hjust = 0.5, face = "bold")))
The distribution of proposed wages in different regions.
# install.packages("ggridges")
library(ggridges)
(salary_ggridge <- ggplot(median_salary, aes(x = yearly_wage_offered, y = region, fill = region)) +
geom_density_ridges(rel_min_height = 0.05) +
facet_wrap(~region) +
theme_ridges() +
scale_fill_manual(values = c("#0F1F38", "#8E7970", "#F55449", "#1B4B5A")) +
theme(legend.position = "none",
panel.grid = element_blank(),
panel.background = element_blank(),# No show legend
axis.title.x = element_text(hjust = 0.5), # x axis title in the center
axis.title.y = element_text(hjust = 0.5)) +
scale_x_continuous(limits=c(10000, 200000)) +# y axis title in the center
labs(x = "\n Annual Wage", y = "Region\n"))Interactive Choropleth map for a detailed view of median salaries in each state.
# install.packages("usmap")
# install.packages("plotly")
library(plotly)
library(usmap) #import the package
color_salary <- c("#1B4B5A", "#8ED6EC", "#46C2E8", "#3F5F69", "#1C4D5C")
fontstyle <- list( # fonts of hover over information
size = 15,
color = "black"
)
label <- list( # control the color of hover
bgcolor = "#EEEEEE", # background color of hover label
bordercolor = "transparent",
font = fontstyle
)
(median_salary_statewise <- plot_geo(median_salary_state,
locationmode = 'USA-states') %>%
add_trace(locations = ~state,
z = ~median,
zmin = 0, # setting the min value for the legend
zmax = max(median_salary_state$median), # set the max value for the legend
color = ~median, # we want the map to be colored by median yearly_wage
colors = color_salary, # use the custom palette defined above
text = ~hover,
hoverinfo = 'text') %>%
layout(geo = list(scope = 'usa' ),
title = "Median Yearly Wage for Business Analyst Position across USA \n 2016 -2018") %>% # only show the USA map
style(hoverlabel = label) %>%
config(displayModeBar = FALSE) %>% # for not showing toolbar on the top
colorbar(tickprefix = "$"))
ggsave(filename = here("Figures/median_salary_statewise.png"),
height = 6, width = 8)
# str(salary_range)
#rm(salary_ggridge)
Now examine which states are paying higher than the prevailing wage.
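`high_paying_state` is used below without being constructed earlier; a plausible sketch (an assumption), keeping the offers above the prevailing wage:
high_paying_state <- salary_range %>%
  filter(yearly_wage_offered > prevailing_yearly_wage) %>% # offers above the prevailing wage
  select(state, yearly_wage_offered, prevailing_yearly_wage)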
library(wordcloud2)
state_name_freq <- count(high_paying_state, state) # state wise diff_prevailing_proposed_wage frequency data for wordcloud
(word_cloud <- wordcloud2(state_name_freq, size = .7, shape = 'star', color = rep_len(c("#0F1F38", "#8E7970", "#F55449", "#1B4B5A"), nrow(state_name_freq))))
How much did a business analyst get paid compared to the U.S. median household income in 2018?
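`median_salary_diff` is likewise used below without being constructed; a plausible sketch (an assumption), measuring each state's median against the $63,179 U.S. median household income:
median_salary_diff <- median_salary_state %>%
  mutate(diff_midean_to_state = median - 63179) # keeps the column name used in the plot below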
(lollipop <- ggplot(median_salary_diff, aes(x = state, y = diff_midean_to_state)) +
geom_hline(yintercept= 0, color="#0F1F38", size=.5) + # adding line to see the difference
#geom_point(aes(fill = region)) +
geom_segment(aes(x = state, xend = state, y = 0, yend = diff_midean_to_state),
             color = ifelse(between(median_salary_diff$diff_midean_to_state, -40000, -20000), "#F55449", "grey"),
             size = ifelse(between(median_salary_diff$diff_midean_to_state, -40000, -20000), 1.3, 0.7)
) +
geom_point(
  color = ifelse(between(median_salary_diff$diff_midean_to_state, -40000, -20000), "#F55449", "grey"),
  size = ifelse(between(median_salary_diff$diff_midean_to_state, -40000, -20000), 5, 2)
) +
theme(panel.grid = element_blank(),
legend.position="none",
panel.background = element_blank(),
panel.border = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5)
) +
xlab("") +
ylab("State's Median Wage vs USA's Median Wage") +
#ggtitle("How much did a business analyst get paid in 2018 compared \n to the U.S. median household income?") +
annotate("text", x=grep("PR", median_salary_diff$state), y=median_salary_diff$diff_midean_to_state[which(median_salary_diff$state=="PR")],
label="Puerto Rico \n paid the lowest",
color="#0F1F38",
size=2.5 ,
angle=0,
fontface="bold",
hjust=1, # text align left adjust
nudge_y = 10,
direction = "y",
)
) what was the trend of a business analyst’s annual earnings from 2016 to 2018?
library(ggplot2)
(scatter_line <- ggplot(median_salary, aes(x = case_submitted, y = yearly_wage_offered, color = region)) +
  geom_point() +
  #geom_smooth(method = "lm") +
  stat_smooth(method = "lm", color = "black") +
facet_wrap(~region) +
theme_bw() +
scale_color_manual(values = c("#66A5AD", "#8E7970", "#F55449", "#1B4B5A")) +
theme(axis.text.x = element_text(size = 12, angle = 45, vjust = 1, hjust = 1),
panel.grid = element_blank(),
panel.background = element_blank()
) +
# panel.border = element_blank())
xlab("") +
ylab("Annual Wage Offered")Cost of living index vs median annual wage in each state
# install.packages("ggrepel")
library(ggrepel)
library(tidyverse)
(living_cost_scatterplot <- ggplot(cost_of_living_2018, aes(median, cost_of_living, label = state)) +
  geom_hline(yintercept = 0, color = "black", size = .5) + # add a line at the national average
  geom_point(aes(color = living_cost_highlight)) +
  stat_smooth(method = "lm", color = "#003B46") +
theme_bw() +
scale_color_manual(values = c("#F55449", "#1B4B5A", "#66A5AD")) +
geom_text_repel(data = cost_of_living_2018 %>% filter( cost_of_living > 10| cost_of_living < - 10),
nudge_y = 1,
hjust = 0.5,
segment.color = "gray60") +
#direction = "y")
theme(panel.grid = element_blank(),
panel.background = element_blank(),
legend.position = "none",
panel.border = element_blank()) +
labs(title = "Median Annual Wage vs Cost-of-Living Index for 2018",
x = "Median Annual Wage", y = "Cost-of-Living Index (national average = 0)"
)