Direct marketing campaigns have become an increasingly important strategy for businesses looking to engage customers and drive revenue growth. However, achieving success with these campaigns requires a deep understanding of customer behavior, preferences, and demographics. Data analysis plays a critical role in this process, providing businesses with the insights needed to develop targeted and tailored marketing strategies. In the banking industry, direct marketing campaigns are especially critical, as they can have a significant impact on customer acquisition and retention. By analyzing customer data, businesses can gain valuable insights into which marketing tactics are most effective, which customer segments are most likely to respond positively to marketing efforts, and which factors influence the decision to subscribe to a term deposit.
This project aims to build a predictive model that can be used to optimize marketing campaigns in the banking industry. The direct bank marketing data from the UCI Machine Learning Repository was used to build classification algorithms that predict whether or not a customer is likely to subscribe to a term deposit based on their characteristics and the details of the marketing campaign.
The direct bank marketing data in the UCI machine learning repository contains information about a direct marketing campaign of a Portuguese banking institution. The goal of the campaign was to promote a term deposit among the bank’s customers.
The data set was collected through phone calls made to customers between May 2008 and November 2010. The phone calls were made by the bank’s marketing team, and the customers were selected randomly from a database of the bank’s clients. The phone calls were made using a standard script that provided information about the term deposit and asked the customer if they would be interested in subscribing to it.
The data set includes 41,188 instances, each representing a contact made with a customer during the campaign. For each contact there are 20 input variables (such as age, job, education, marital status, etc.) and one binary output variable, which indicates whether or not the customer subscribed to the term deposit. Each phone call in the data set is a row, and the columns correspond to the variables whose names and definitions are the following:
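For readers reproducing the analysis, the UCI file is semicolon-separated with quoted strings; the file name below is the one distributed by the repository, while the inline snippet and its values are purely illustrative:

```r
# Reading the UCI file typically looks like:
# bank_marketing <- read.csv("bank-additional-full.csv", sep = ";")
# A self-contained demo of the same parsing on an inline snippet (toy values):
csv_text <- 'age;job;y\n56;"housemaid";"no"\n37;"services";"yes"'
demo <- read.csv(text = csv_text, sep = ";")
str(demo)  # 2 observations of 3 variables
```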
variable | description |
---|---|
age | age of the customer (numeric) |
job | type of job (categorical): “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown” |
marital | marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed) |
education | categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown” |
default | has credit in default? (categorical: “no”,“yes”,“unknown”) |
housing | has housing loan? (categorical: “no”,“yes”,“unknown”) |
loan | has personal loan? (categorical: “no”,“yes”,“unknown”) |
contact | contact communication type (categorical: “cellular”,“telephone”) |
month | last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”) |
day_of_week | last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”) |
duration | last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. |
campaign | number of contacts performed during this campaign and for this client (numeric, includes last contact) |
pdays | number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) |
previous | number of contacts performed before this campaign and for this client (numeric) |
poutcome | outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”) |
emp.var.rate | employment variation rate - quarterly indicator (numeric) |
cons.price.idx | consumer price index - monthly indicator (numeric) |
cons.conf.idx | consumer confidence index - monthly indicator (numeric) |
euribor3m | euribor 3 month rate - daily indicator (numeric) |
nr.employed | number of employees - quarterly indicator (numeric) |
y | has the client subscribed to a term deposit? (binary: “yes”,“no”) |
Of all the customers contacted, only 11.3% subscribed to the term deposit.
People in the 30-39 age group were contacted most often by the bank, followed by the 40-49 age group.
The marketing campaign primarily targeted individuals in the “admin” job group, with the “blue-collar” job group being the second most contacted. On the other hand, the “student” group received the least amount of contact from the marketing team. This indicates that the marketing strategy focused on reaching out to professionals in administrative roles and those working in manual or labor-intensive jobs, while giving less priority to students.
Customers with a middle school education or less made up the largest group in the direct-marketing campaign, followed by customers with a university degree. Among the targeted customers, those with a university degree demonstrated a higher likelihood of subscribing to the term deposit than other education groups.
Contact | Frequency |
---|---|
cellular | 26144 |
telephone | 15044 |
The marketing campaign was conducted through phone calls; more than 60% of the calls were made via cellular connection. Often, the same client was contacted multiple times to promote the product (a bank term deposit). The “duration” variable in the dataset records the last contact duration. After the end of the call, the customer’s final response is obviously known to the marketer. This attribute therefore strongly determines the output variable and should be excluded to obtain a realistic predictive model. From a behavioral perspective, however, the duration variable gives the marketing team insight into how long a call typically needs to last to persuade customers.
The scatterplot (left panel) shows that the majority of subscribed customers made their decision within 10 phone calls and had longer conversations during their last calls than the unsubscribed customers. On average, two phone calls were made to each targeted customer during this campaign. The boxplot on the right panel shows the distribution of last call duration by customer response. The unsubscribed customers’ call durations exhibit a narrower interquartile range, indicating a more concentrated distribution of values, while the subscribed customers’ last call durations show wider variability. On average, the last call duration for customers who subscribed (approximately 9 minutes) was longer than for customers who did not subscribe (approximately 4 minutes).
Looking at call duration across customer professions in the following graph, we notice that among the targeted customers who subscribed, the “blue-collar”, “entrepreneur”, and “self-employed” job groups show higher variability in last call duration, while “student”, “retired”, and “services” show lower variability.
The figure below reveals numerous significant customer demographic details. First, blue-collar workers make up the majority of clients, followed by those in managerial and technical positions. Furthermore, a substantial portion of the clientele’s employment status remains unknown. In terms of marital status, most clients are married, with lower percentages of clients being single or divorced. When it comes to customers’ educational backgrounds, they differ, but a considerable proportion have completed middle school and high school.
Regarding their financial situation, most clients have never had a credit default. Also, about equal numbers of consumers do not have house loans, and a sizable portion do. Interestingly, most clients do not own personal loans, suggesting a market for personal loan products.
In terms of communication channels, cellular communication accounts for most client engagements, with telephone contact making up a smaller share. Contacts were spread across different months at varying frequencies; May and August are among the months with the highest contact volume. Furthermore, daily client contacts are spread relatively evenly, with a minor uptick on some days.
Additionally, most past campaign results are either unknown or unsuccessful, with a much lower percentage having successful results.
The histograms below show the distributions of several key indicators:

- Previous Campaign Contacts (previous): Most customers in the dataset have been contacted only a few times before, indicating that the bank is primarily reaching out to a relatively new customer base.
- Employment Variation Rate (emp_var_rate): Despite some noticeable negative values, the employment variation rate is centered on positive values, particularly around 1. The negative values represent times of job losses, while the positive spike at 1 represents periods of employment growth.
- Consumer Price Index (cons_price_idx): The consumer price index values cluster around 92.5, 93.5, and 94.5, showing times when prices were relatively stable within particular ranges. The higher frequency at 93.5 suggests a benchmark for normal price levels during the period.
- Consumer Confidence Index (cons_conf_idx): Consumer confidence is spread between -50 and -25, with notable peaks around -47 and -37. The negative values show that, from May 2008 to November 2010, when the data was gathered, consumers held an overall negative view of the economic situation.
- EURIBOR 3-Month Rate (euribor3m): The EURIBOR 3-month rate shows pronounced spikes around 1 and 5. The lower rates (around 1) could indicate higher monetary liquidity, while the higher spike (around 5) might reflect tighter credit conditions or economic stress.
Several classification modeling techniques, such as decision trees, logistic regression, random forest, and support vector machines, were used to predict whether or not a customer is likely to subscribe to a term deposit based on their characteristics. The primary focus was on the predictive capabilities of these models, followed by a comparative analysis to identify the best-performing model.
Response | Frequency |
---|---|
no | 36548 |
yes | 4640 |
The number of customers who declined to subscribe to the term deposit is significantly higher than the number who responded positively. This imbalance can skew model performance, producing a misleadingly high accuracy score on the majority class while failing to identify the minority class. A random oversampling technique was applied to handle the imbalanced data: instances from the minority class are duplicated at random until the number of minority-class instances roughly matches that of the majority class.
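The oversampling step above can be sketched in base R; the helper name and toy data below are illustrative, and since the report's balanced counts differ slightly (36,513 vs. 36,548), its actual implementation was likely sampling-based rather than exact:

```r
set.seed(123)
# Duplicate random minority-class rows until both classes are the same size
oversample <- function(df, target) {
  counts <- table(df[[target]])
  minority <- names(which.min(counts))
  n_extra <- max(counts) - min(counts)
  idx <- which(df[[target]] == minority)
  extra <- df[sample(idx, n_extra, replace = TRUE), , drop = FALSE]
  rbind(df, extra)
}
toy <- data.frame(customer_response = c(rep("no", 10), rep("yes", 3)), x = 1:13)
balanced <- oversample(toy, "customer_response")
table(balanced$customer_response)  # no: 10, yes: 10
```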
Response | Frequency |
---|---|
no | 36548 |
yes | 36513 |
We can see that after oversampling, the minority class is represented almost equally with the majority class and the dataset becomes much more balanced.
One-hot encoding was employed to convert categorical variables into numerical inputs, making them suitable for various machine-learning algorithms. The standardization technique was applied for numerical variables to transform the data, ensuring a mean of zero and a standard deviation of one.
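These two transformations can be illustrated in base R on toy data (the full pipeline in the code appendix uses the recipes package instead):

```r
toy <- data.frame(job = factor(c("admin.", "student", "admin.", "retired")),
                  age = c(30, 22, 41, 60))
# One-hot encoding: one 0/1 column per factor level (no intercept term)
onehot <- model.matrix(~ job - 1, data = toy)
# Standardization: rescale to mean 0, standard deviation 1
age_std <- as.numeric(scale(toy$age))
round(colMeans(onehot), 2)          # level frequencies
c(mean = mean(age_std), sd = sd(age_std))
```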
Decision tree modeling was used for feature selection. The following features were selected by the decision tree model as important:
## important_features
## cons_conf_idx cons_price_idx
## 1 1
## emp_var_rate euribor3m
## 1 1
## nr_employed previously_contacted_Contacted
## 1 1
## previously_contacted_Not.contacted
## 1
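The idea of screening features by a tree's variable importance can be sketched on synthetic data (the variable names here are illustrative, not the report's):

```r
library(rpart)
set.seed(42)
# Synthetic data: x1 drives the class label, x2 is pure noise
toy <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
toy$y <- factor(ifelse(toy$x1 + rnorm(300, sd = 0.3) > 0, "yes", "no"))
fit <- rpart(y ~ ., data = toy)
# Features ranked by importance; keep those with non-zero importance
importance <- fit$variable.importance
names(importance)[importance > 0]
```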
With the important features selected by the decision tree algorithm, the whole data set was then split into training and testing sets.
Model | Accuracy | AUC | Precision | Recall | F1_Score |
---|---|---|---|---|---|
Decision Tree | 0.7394429 | 0.7493050 | 0.8663292 | 0.6911155 | 0.7688665 |
Random Forest | 0.7551845 | 0.8111904 | 0.8915036 | 0.7006452 | 0.7846348 |
SVM | 0.7271234 | 0.7711920 | 0.7273225 | 0.7272230 | 0.7272727 |
Logistic Regression | 0.7271234 | 0.7728307 | 0.7267752 | 0.7274719 | 0.7271234 |
Among the four models, the Random Forest demonstrates the best performance across most metrics, including accuracy, AUC, precision, and F1 score. We can interpret the result as follows:
## [1] "AUC Decision Tree: 0.749305006951617"
## [1] "AUC Random Forest: 0.811190426109134"
## [1] "AUC SVM: 0.771191966665816"
## [1] "AUC Logistic Regression: 0.772830723999059"
Although all the models performed similarly, the Random Forest (yellow curve) demonstrates the highest area under the curve (AUC), meaning it consistently maintains a high True Positive Rate (TPR) at a low False Positive Rate (FPR) across thresholds, indicating it is the most effective model at distinguishing between the classes.
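AUC itself has a simple probabilistic reading: it is the chance that a randomly chosen subscriber receives a higher predicted probability than a randomly chosen non-subscriber. A base-R sketch on toy values (illustrative, not the report's predictions):

```r
# AUC as the probability that a random positive outranks a random negative,
# counting ties as half (the Wilcoxon/Mann-Whitney formulation)
auc_manual <- function(labels, scores, positive = "yes") {
  pos <- scores[labels == positive]
  neg <- scores[labels != positive]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}
labels <- c("no", "no", "yes", "yes", "yes", "no")
probs  <- c(0.10, 0.40, 0.80, 0.35, 0.90, 0.20)
auc_manual(labels, probs)  # 8/9, about 0.889
```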
The identified important features, such as cons_conf_idx, cons_price_idx, emp_var_rate, and euribor3m, reflect economic conditions; emp_var_rate and nr_employed capture the employment situation; and the previously_contacted variable records past interactions. These variables are crucial for predicting customer behavior and making informed decisions. Among the evaluated models, the Random Forest is the most reliable and robust, offering the best accuracy, AUC, precision, and F1 score.
# Standardize the column names to lower snake case
library(janitor)
bank_marketing <- bank_marketing %>%
  clean_names()
# Look at the data
dim(bank_marketing)
names(bank_marketing)
str(bank_marketing)
# Check for missing values
library(naniar)
n_var_miss(bank_marketing)
# Check for incomplete rows
blank_rows <- !complete.cases(bank_marketing)
bank_marketing[blank_rows, ]
# Check for duplicate rows
dup <- duplicated(bank_marketing)
# Print the duplicated rows
bank_marketing[dup, ]
# Remove the duplicate rows
bank_marketing <- bank_marketing %>%
  distinct()
# Check the distinct values of age
unique(bank_marketing$age)
# Convert age into age groups from 10 to 100
age_to_categorical <- function(data, age) {
  # Create age groups using the cut() function
  age_group <- cut(age,
                   breaks = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
                   labels = c('10-19', '20-29', '30-39', '40-49', '50-59',
                              '60-69', '70-79', '80-89', '90-100'))
  # Insert the age group and drop the original age column
  data$age_group <- age_group
  data$age <- NULL
  return(data)
}
# Add age_group to the data frame
bank_marketing <- age_to_categorical(bank_marketing, bank_marketing$age)
# Rename the target column
bank_marketing <- bank_marketing %>%
  rename(customer_response = y)
# Group "basic.4y", "basic.6y", and "basic.9y" as "Less or equal to middle school"
bank_marketing_modified <- bank_marketing %>%
  mutate(education_group = ifelse(education %in% c("basic.4y", "basic.6y", "basic.9y"),
                                  "Less or equal to middle school",
                                  education))
# Remove the original 'education' column
bank_marketing_modified <- bank_marketing_modified[, -which(names(bank_marketing_modified) == "education")]
## Transform campaign into a categorical variable
# Convert the campaign counts into groups from 0 to 60
campaign_to_categorical <- function(data, campaign) {
  # Create campaign groups using the cut() function
  campaign_group <- cut(campaign,
                        breaks = c(0, 10, 20, 30, 40, 50, 60),
                        labels = c('0-10', '10-19', '20-29', '30-39', '40-49', '50-59'))
  # Insert the campaign group and drop the original campaign column
  data$campaign_group <- campaign_group
  data$campaign <- NULL
  return(data)
}
# Add campaign_group to the data frame
bank_marketing_modified <- campaign_to_categorical(bank_marketing_modified, bank_marketing_modified$campaign)
# Transform pdays into a categorical variable, since 999 means not contacted before
pdays_to_categorical <- function(data, pdays) {
  # Create previously-contacted groups using the cut() function
  previously_contacted <- cut(pdays,
                              breaks = c(0, 30, 999),
                              labels = c('Contacted within a month', 'Not contacted before'))
  # Insert the new variable and drop the original pdays column
  data$previously_contacted <- previously_contacted
  data$pdays <- NULL
  return(data)
}
# Add previously_contacted to the data frame
bank_marketing_modified <- pdays_to_categorical(bank_marketing_modified, bank_marketing_modified$pdays)
table(bank_marketing_modified$previously_contacted)
# Handling the imbalanced data
# Frequency table of responses after balancing
# (data_balanced holds the randomly oversampled dataset; the oversampling step is not shown)
response_frequency2 <- as.data.frame(table(data_balanced$customer_response))
# Display the table using kable and kableExtra
library(kableExtra)
kable(response_frequency2, col.names = c("Response", "Frequency")) %>%
  kable_styling(position = "center", full_width = FALSE) %>%
  column_spec(1, width = "8em") %>%  # Adjust width of the first column
  column_spec(2, width = "8em")      # Adjust width of the second column
# One-hot encoding for the categorical variables and standardization of numerical variables
# Load the required libraries
library(caret)
library(tidymodels)  # loads recipes and other packages
library(dplyr)
# Exclude the outcome variable from the vector of categorical column names
cat <- setdiff(cat, "customer_response")
# Verify column names in the dataset
print(names(data_balanced))
str(data_balanced)
library(recipes)
# Create a recipe for preprocessing
recipe_obj <- recipe(customer_response ~ ., data = data_balanced) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%  # One-hot encoding
  step_center(all_numeric(), -all_outcomes()) %>%  # Center numeric variables
  step_scale(all_numeric(), -all_outcomes())       # Scale numeric variables (standardization)
# Prepare the recipe with the data
prep_obj <- prep(recipe_obj, training = data_balanced)
# Bake the recipe to create the final dataset
bank_marketing_encoded <- bake(prep_obj, new_data = NULL)
## Splitting the data
library(caret)
set.seed(123)  # For reproducibility
# Create indices for the training set, with stratification
train_indices <- createDataPartition(bank_marketing_encoded$customer_response, p = 0.8, list = FALSE)
# Split the data using the indices
train_data <- bank_marketing_encoded[train_indices, ]
test_data <- bank_marketing_encoded[-train_indices, ]
#---------------------- Decision Tree for Feature Selection -----------------------------------
library(caret)
library(rpart)
library(randomForest)
library(e1071)      # For SVM
library(MLmetrics)
library(pROC)
# Define training control
train_ctrl <- trainControl(
  method = "cv",
  number = 10,
  savePredictions = "final",              # Save predictions for the final model
  classProbs = TRUE,                      # class probabilities are needed for ROC
  summaryFunction = twoClassSummary       # summary function for binary classification
)
model_tree <- train(
  customer_response ~ .,
  data = train_data,
  method = "rpart",
  trControl = train_ctrl,
  metric = "ROC"
)
# print(summary(model_tree))
# Extract feature importance
importance <- varImp(model_tree, scale = FALSE)
# Print feature importance
print(importance)
# Check the structure of the importance object
str(importance)
# Plot feature importance
plot(importance, top = 20)
# Extract features with non-zero importance
important_features <- rownames(importance$importance)[importance$importance$Overall > 0]
# Print the important features
print(important_features)
# Keep only the important features that are present in the test data
common_features <- intersect(important_features, names(test_data))
# Subset the train and test data
train_data_selected <- train_data[, c(important_features, "customer_response")]
test_data_selected <- test_data[, c(common_features, "customer_response")]
library(randomForest)  # For Random Forest
library(e1071)         # For SVM
library(naivebayes)    # For Naive Bayes
library(gbm)           # For GBM
library(caret)
library(rpart)
library(rpart.plot)
library(MLmetrics)
library(pROC)
library(ranger)
# Set the factor levels of the outcome explicitly
train_data$customer_response <- factor(train_data$customer_response, levels = c("yes", "no"))
test_data$customer_response <- factor(test_data$customer_response, levels = c("yes", "no"))
#------------------------- Decision Tree -------------------------------------
# Re-train the decision tree model with the selected features
model_tree_selected <- train(
  customer_response ~ .,
  data = train_data_selected,
  method = "rpart",
  trControl = train_ctrl,
  metric = "ROC"
)
#--------------- Train the remaining models ------------------------------
# Train Logistic Regression
model_logistic <- train(
  customer_response ~ .,
  data = train_data_selected,
  method = "glm",
  family = binomial,
  trControl = train_ctrl,
  metric = "ROC"
)
library(doParallel)
cl <- makePSOCKcluster(detectCores() - 1)  # Leave one core free
registerDoParallel(cl)
train_ctrl <- trainControl(
  method = "cv",
  number = 10,
  savePredictions = "final",
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  allowParallel = TRUE  # Enable parallel processing
)
# Train Random Forest
model_rf <- train(
  customer_response ~ .,
  data = train_data_selected,
  method = "ranger",
  trControl = train_ctrl,
  metric = "ROC"
)
# Train SVM
model_svm <- train(
  customer_response ~ .,
  data = train_data_selected,
  method = "svmLinear",
  trControl = train_ctrl,
  metric = "ROC"
)
# Stop the cluster after training
stopCluster(cl)
# Model predictions and evaluation
# Function to evaluate a model on held-out data
evaluate_model <- function(model, test_data) {
  predictions <- predict(model, newdata = test_data, type = "raw")
  prob_predictions <- predict(model, newdata = test_data, type = "prob")[, 2]

  conf_matrix <- confusionMatrix(predictions, test_data$customer_response)
  roc_result <- roc(test_data$customer_response, prob_predictions)
  auc_value <- auc(roc_result)

  precision <- Precision(predictions, test_data$customer_response)
  recall <- Recall(predictions, test_data$customer_response)
  f1_score <- F1_Score(predictions, test_data$customer_response)

  list(
    ConfusionMatrix = conf_matrix,
    AUC = auc_value,
    Precision = precision,
    Recall = recall,
    F1_Score = f1_score
  )
}
# Evaluate each model using test_data_selected
results_tree <- evaluate_model(model_tree_selected, test_data_selected)
results_rf <- evaluate_model(model_rf, test_data_selected)
results_svm <- evaluate_model(model_svm, test_data_selected)
# results_knn <- evaluate_model(model_knn, test_data_selected)
results_logistic <- evaluate_model(model_logistic, test_data_selected)
# Extract and compile results
model_names <- c("Decision Tree", "Random Forest", "SVM", "Logistic Regression")
metrics <- data.frame(
  Model = model_names,
  Accuracy = c(
    results_tree$ConfusionMatrix$overall["Accuracy"],
    results_rf$ConfusionMatrix$overall["Accuracy"],
    results_svm$ConfusionMatrix$overall["Accuracy"],
    # results_knn$ConfusionMatrix$overall["Accuracy"],
    results_logistic$ConfusionMatrix$overall["Accuracy"]
  ),
  AUC = c(
    results_tree$AUC,
    results_rf$AUC,
    results_svm$AUC,
    # results_knn$AUC,
    results_logistic$AUC
  ),
  Precision = c(
    results_tree$Precision,
    results_rf$Precision,
    results_svm$Precision,
    # results_knn$Precision,
    results_logistic$Precision
  ),
  Recall = c(
    results_tree$Recall,
    results_rf$Recall,
    results_svm$Recall,
    # results_knn$Recall,
    results_logistic$Recall
  ),
  F1_Score = c(
    results_tree$F1_Score,
    results_rf$F1_Score,
    results_svm$F1_Score,
    # results_knn$F1_Score,
    results_logistic$F1_Score
  )
)
# Print the metrics table in an organized way using kable
kable(metrics, caption = "Model Performance Metrics")
# Distinguish numerical and categorical columns
columns <- sapply(bank_marketing_modified, class)
cat <- names(columns[columns == "character"])
num <- names(columns[columns != "character"])

cat("Categorical:", cat, "\n")
cat("Numerical:", num, "\n")
# Explore the frequency of the categorical columns
c <- length(cat)
rows <- (c %/% 3) + (c %% 3 > 0)
print(rows)
par(mfrow = c(rows, 3))
par(mar = c(4, 4, 2, 2))
lapply(cat, function(col) {
  print(barplot(table(bank_marketing_modified[[col]]),
                main = paste("Countplot of", col),
                col = "#14213d", border = "black",
                las = 2, cex.names = 0.8))
})
# Reset the plotting layout to default
par(mfrow = c(1, 1))
# Explore the numerical variables
n <- length(num)
nrows <- (n %/% 3) + (n %% 3 > 0)
print(nrows)
par(mfrow = c(nrows, 3))
par(mar = c(4, 4, 2, 2))
lapply(num, function(col) {
  if (is.numeric(bank_marketing_modified[[col]])) {
    print(hist(bank_marketing_modified[[col]], breaks = 15,
               col = "#fca311", main = paste("Histogram of", col),
               xlab = "", ylab = ""))
  } else {
    print(paste(col, "is not numeric and cannot be plotted as a histogram."))
  }
})
table(bank_marketing_modified$previous)
library(scales)
# Barplot for customer_response
(barplot_customer_response <- bank_marketing %>%
  count(customer_response) %>%
  # mutate(customer_response = fct_reorder(customer_response, n)) %>%
  ggplot(aes(x = customer_response, y = n / sum(n), fill = customer_response)) +
  geom_bar(stat = "identity", alpha = .6, position = position_dodge(width = 0.1), width = .6) +
  scale_fill_manual(values = c("#084c61", "#fca311")) +
  geom_text(aes(label = paste0(sprintf("%.1f", (n / sum(n)) * 100), "%")),
            vjust = -0.2, size = 3.5, color = "black") +
  # coord_flip() +
  labs(title = "Frequency Percentage of Customer Responses",
       x = NULL, y = "", fill = "Customer Response") +
  scale_y_continuous(labels = percent_format()) +
  theme(panel.grid = element_blank(),
        panel.background = element_blank(),
        axis.line = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank()))
# Load libraries
library(forcats)
library(scales)
# Barplot for age_group
(barplot_age_group <- bank_marketing %>%
  count(age_group) %>%
  mutate(age_group = fct_reorder(age_group, n)) %>%
  ggplot(aes(x = age_group, y = n / sum(n))) +
  geom_bar(stat = "identity", fill = "#14213d", alpha = .6, width = .4) +
  geom_text(aes(label = paste0(sprintf("%.1f", (n / sum(n)) * 100), "%")),
            hjust = -0.1, size = 3.5, color = "black") +
  coord_flip() +
  labs(title = "Frequency Percentage of each Age Group",
       x = NULL, y = "") +
  scale_y_continuous(labels = percent_format()) +
  theme(panel.grid = element_blank(),
        panel.background = element_blank(),
        axis.line = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()))
# Which job category was contacted most by the bank's marketing team?
# install.packages("jmv")
# library("jmv")
# bank_marketing %>%
#   dplyr::select(marital) %>%
#   descriptives(freq = TRUE)
library(forcats)
# Stacked bar plot of the number of campaign contacts in each month
(stacked_barplot <- bank_marketing_modified %>%
  count(campaign_group, month) %>%
  mutate(campaign_group = fct_reorder(campaign_group, n, .desc = TRUE)) %>%
  ggplot(aes(x = month, y = n, fill = campaign_group)) +
  scale_fill_manual(values = c("#084c61", "#fca311", "#14213d", "#db3a34", "#177e89", "#ffc857")) +
  geom_bar(stat = "identity") +
  labs(title = "Stacked Bar Plot of Campaign by Month",
       x = "Month", y = "", fill = "Campaign") +
  theme(panel.grid = element_blank(),
        panel.background = element_blank(),
        legend.position = "right",
        plot.title = element_text(size = 14, face = "bold")))
library(tidyverse)
library(scales)
library(patchwork)
# Plot 1: Frequency percentage of total for each job group (sorted)
(barplot_job <- bank_marketing %>%
  count(job) %>%
  mutate(job = fct_reorder(job, n)) %>%
  ggplot(aes(x = n / sum(n), y = job)) +
  geom_col(fill = "#084c61", alpha = 0.8) +
  geom_text(aes(label = paste0(sprintf("%.1f", (n / sum(n)) * 100), "%")),
            hjust = -0.2, color = "black", size = 3.5) +
  labs(title = "Frequency Percentage of each Job Group",
       x = "", y = NULL) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()))
#----------------------------------------------------------------
# Plot 1: Frequency percentage of total for each education level
(barplot_education1 <- bank_marketing_modified %>%
  filter(education_group != "illiterate") %>%
  count(education_group) %>%
  mutate(education_group = fct_reorder(education_group, n)) %>%
  ggplot(aes(x = n / sum(n), y = education_group)) +
  geom_col(fill = "#14213d", alpha = 0.8) +
  geom_text(aes(label = paste0(sprintf("%.1f", (n / sum(n)) * 100), "%")),
            hjust = -0.0005, color = "black", size = 3.5) +
  labs(title = "Frequency Percentage of each Education Level",
       x = "", y = NULL) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()))
# Plot 2: Percentage of total customer responses (yes/no) by education group (side-by-side bars)
(barplot_education2 <- bank_marketing_modified %>%
  filter(education_group != "illiterate") %>%
  count(education_group, customer_response) %>%
  mutate(perc = n / sum(n)) %>%
  ggplot(aes(x = reorder(education_group, -perc), y = perc, fill = customer_response)) +
  geom_col(position = position_dodge(width = 0.9), alpha = 0.8) +
  geom_text(aes(label = paste0(sprintf("%.1f", perc * 100), "%")),
            position = position_dodge(width = 0.9),
            vjust = -.07, size = 3.5, color = "black") +
  labs(title = "Percentage of Total Customer Response by Education Group",
       x = "", y = "", fill = "Customer Response") +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#084c61", "#fca311")) +
  theme(panel.grid = element_blank(),
        panel.background = element_blank(),
        axis.line = element_blank(),
        legend.position = "right",
        plot.title = element_text(size = 14, face = "bold"),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)))
# Create the scatterplot of duration vs. campaign
(scatterplot <- ggplot(bank_marketing, aes(x = campaign, y = duration, color = customer_response)) +
  geom_point() +
  scale_color_manual(values = c("#fca311", "#084c61")) +
  labs(title = "Scatterplot of Duration vs Campaign",
       x = "Campaign",
       y = "Duration",
       color = "Customer Response") +
  theme(panel.grid = element_blank(),
        panel.background = element_blank(),
        legend.position = "none",
        plot.title = element_text(size = 14, face = "bold")) +
  scale_x_continuous(breaks = seq(0, 60, 10)))  # Specify the desired tick mark positions
# Boxplot of last call duration
boxplot_duration <- ggplot(bank_marketing, aes(x = customer_response, y = duration, fill = customer_response)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#084c61", "#fca311")) +
  labs(title = "Boxplot of Duration \n by Customer Response",
       x = "",
       y = "",
       fill = "Customer Response") +
  theme(panel.grid = element_blank(),
        panel.background = element_blank(),
        legend.position = "top",
        plot.title = element_text(size = 14, face = "bold"))
# Boxplot of call duration for each job group
(boxplot_job <- ggplot(bank_marketing, aes(x = job, y = duration, fill = customer_response)) +
  geom_boxplot() +
  labs(title = "Call Durations' Distribution for Customer Response by each Job Group",
       x = "",
       y = "") +
  scale_fill_manual(values = c("#fca311", "#084c61")) +
  theme(panel.grid = element_blank(),
        panel.background = element_blank(),
        # axis.line = element_blank(),
        legend.position = "bottom",
        plot.title = element_text(size = 14, face = "bold"),
        # axis.text.y = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)))