1 Introduction

Direct marketing campaigns have become an increasingly important strategy for businesses looking to engage customers and drive revenue growth. However, achieving success with these campaigns requires a deep understanding of customer behavior, preferences, and demographics. Data analysis plays a critical role in this process, providing businesses with the insights needed to develop targeted and tailored marketing strategies. In the banking industry, direct marketing campaigns are especially critical, as they can have a significant impact on customer acquisition and retention. By analyzing customer data, businesses can gain valuable insights into which marketing tactics are most effective, which customer segments are most likely to respond positively to marketing efforts, and which factors influence the decision to subscribe to a term deposit.

This project aims to build a predictive model that can be used to optimize marketing campaigns in the banking industry. The direct bank marketing data from the UCI Machine Learning Repository were used to build classification algorithms that predict whether a customer is likely to subscribe to a term deposit, based on the customer’s characteristics and the details of the marketing campaign.

2 Data

The direct bank marketing data in the UCI machine learning repository contains information about a direct marketing campaign of a Portuguese banking institution. The goal of the campaign was to promote a term deposit among the bank’s customers.

The data set was collected through phone calls made to customers between May 2008 and November 2010. The phone calls were made by the bank’s marketing team, and the customers were selected randomly from a database of the bank’s clients. The phone calls were made using a standard script that provided information about the term deposit and asked the customer if they would be interested in subscribing to it.

The data set includes 41,188 instances, each representing a contact made with a customer during the campaign. For each contact, there are 20 input variables (such as age, job, education, and marital status) and one binary output variable, which indicates whether or not the customer subscribed to the term deposit. Each phone call in the data set is a row, while the columns correspond to the following variables:

variable description
age age of the client (numeric)
job type of job (categorical: “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)
marital marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed)
education education level (categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)
default has credit in default? (categorical: “no”,“yes”,“unknown”)
housing has housing loan? (categorical: “no”,“yes”,“unknown”)
loan has personal loan? (categorical: “no”,“yes”,“unknown”)
contact contact communication type (categorical: “cellular”,“telephone”)
month last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
day_of_week last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”)
duration last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known.
campaign number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
previous number of contacts performed before this campaign and for this client (numeric)
poutcome outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)
emp.var.rate employment variation rate - quarterly indicator (numeric)
cons.price.idx consumer price index - monthly indicator (numeric)
cons.conf.idx consumer confidence index - monthly indicator (numeric)
euribor3m euribor 3 month rate - daily indicator (numeric)
nr.employed number of employees - quarterly indicator (numeric)
y has the client subscribed to a term deposit? (binary: “yes”,“no”)

3 Exploratory Data Analysis

Of all the customers who were contacted, only 11.3% subscribed to the term deposit.

3.1 Customers’ Demography

Customers in the 30-39 age group were contacted most often by the bank, followed by those in the 40-49 age group.
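The age bands referred to here can be derived from the numeric age column with `cut()`. The following is a minimal base-R sketch on hypothetical ages (not the campaign data); it assumes left-closed 10-year bins so that, for example, age 30 falls in "30-39".

```r
# Hypothetical ages chosen to sit on the bin boundaries
ages <- c(19, 20, 30, 39, 40, 100)

# right = FALSE gives bins [10, 20), [20, 30), ... that match the labels;
# include.lowest = TRUE closes the last bin so that 100 falls in "90-100"
age_group <- cut(
  ages,
  breaks = seq(10, 100, by = 10),
  labels = c("10-19", "20-29", "30-39", "40-49", "50-59",
             "60-69", "70-79", "80-89", "90-100"),
  right = FALSE,
  include.lowest = TRUE
)

data.frame(age = ages, age_group = age_group)
```

With the default `right = TRUE`, age 30 would instead be assigned to the bin labelled "20-29", so the bin direction matters when reading group frequencies.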

The marketing campaign primarily targeted individuals in the “admin” job group, with the “blue-collar” job group being the second most contacted. On the other hand, the “student” group received the least amount of contact from the marketing team. This indicates that the marketing strategy focused on reaching out to professionals in administrative roles and those working in manual or labor-intensive jobs, while giving less priority to students.

Customers with a middle school education or less were the largest group in the direct marketing campaign, followed by customers with a university degree. Among the targeted customers, those with a university degree were more likely to subscribe to the term deposit than those in other education groups.

3.2 Contact and Campaign Information

Contact Frequency
cellular 26144
telephone 15044

The marketing campaign was conducted through phone calls. About 63% of the calls were made via a cellular connection. Often, the same client was contacted multiple times to promote the product (a bank term deposit). The “duration” variable in the dataset indicates the duration of the last contact. Because the duration is known only once the call has ended, at which point the customer’s final response is also known, this attribute highly affects the output variable and should be excluded from a realistic predictive model. However, from a behavioral perspective, the duration variable gives the marketing team insight into how long, on average, it takes to persuade a customer.

The scatterplot (left panel) shows that the majority of subscribed customers made their decision within 10 phone calls and had longer conversations during their last calls than the unsubscribed customers. On average, two phone calls were made to each targeted customer during this campaign. The boxplot in the right panel shows the distribution of the last call duration by customer response. The call duration of unsubscribed customers exhibits a narrower interquartile range, indicating a more concentrated distribution, while the last call duration of subscribed customers varies more widely. On average, the last call was longer for customers who subscribed (approximately 9 min) than for customers who did not subscribe (approximately 4 min).

Looking at call duration across customers’ professions, the following graph shows that, among the targeted customers who subscribed, the “blue-collar”, “entrepreneur”, and “self-employed” professions have higher variability in last call duration, while “student”, “retired”, and “services” have lower variability. It might be the case that students, retired,

The figure below reveals several notable customer demographic details. First, clients in “admin.” and “blue-collar” jobs form the largest groups, followed by technicians. Furthermore, a substantial portion of the clientele’s employment status remains unknown. In terms of marital status, most clients are married, with lower percentages of clients being single or divorced. Customers’ educational backgrounds vary, but a considerable proportion have completed middle school or high school.

Regarding their financial situation, most clients have never had a credit default. Clients are split roughly evenly between those with and without housing loans. Most clients do not have personal loans, suggesting a potential market for personal loan products.

In terms of communication channels, cellular contact accounts for most client engagements, with telephone contact accounting for a smaller share. Contacts were spread unevenly across months, with May and August among the months with the highest frequency. Contacts were spread relatively evenly across the days of the week, with a minor uptick on some days.

Additionally, most previous campaign outcomes are either nonexistent or unsuccessful, with a much lower percentage being successful.

The histograms below show several key economic indicators’ distributions:

  • Previous Campaign Contacts (previous): Most customers in the dataset have been contacted only a few times before, indicating that the bank is primarily reaching out to a relatively new customer base.
  • Employment Variation Rate (emp_var_rate): Despite some noticeable negative values, the employment variation rate is centered on positive values, particularly around 1. While the negative values represent times of job losses, the positive surge at 1 represents periods of employment increase.
  • Consumer Price Index (cons_price_idx): The consumer price index values are centered around 92.5, 93.5, and 94.5. These clusters show times when prices were relatively stable within particular ranges. A higher frequency at 93.5 suggests a benchmark for normal price levels during the time period.
  • Consumer Confidence Index (cons_conf_idx): Consumer confidence is spread between -50 and -25, with notable peaks around -47 and -37. The negative values show that, from May 2008 to November 2010, when the data was gathered, consumers had an overall negative view of the economic condition.
  • EURIBOR 3-Month Rate (euribor3m): The EURIBOR 3-month rate shows significant spikes at around 1 and 5. The lower rates (around 1) could indicate higher monetary liquidity, while the higher spike (around 5) might reflect tighter credit conditions or economic stress.

4 Statistical Modeling

Several classification modeling techniques, such as decision trees, logistic regression, random forest, and support vector machines, were used to predict whether or not a customer is likely to subscribe to a term deposit based on their characteristics. The primary focus was on the predictive capabilities of these models, followed by a comparative analysis to identify the best-performing model.

4.1 Handling imbalanced data with random oversampling technique

Response Frequency
no 36548
yes 4640

The number of customers who declined to subscribe to the term deposit is significantly higher than the number who responded positively. This imbalance can skew model performance: a model can achieve a misleadingly high accuracy by predicting the majority class while failing to identify the minority class. A random oversampling technique was applied to handle the imbalanced data. In oversampling, instances from the minority class are duplicated to increase their representation in the dataset. This was done randomly until the number of instances in the minority class approximately matched that of the majority class.
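To make the accuracy problem concrete, the short base-R sketch below uses the class counts from the response frequency table before oversampling. The "always predict no" classifier is a hypothetical baseline introduced only for illustration.

```r
# Class counts from the response frequency table before oversampling
counts <- c(no = 36548, yes = 4640)

# Share of each class among all contacts
class_share <- counts / sum(counts)

# A trivial classifier that always predicts the majority class ("no")
# is right for every "no" customer and wrong for every "yes" customer
baseline_accuracy <- max(class_share)
baseline_recall_yes <- 0  # it never identifies a subscriber

round(100 * class_share, 1)
round(baseline_accuracy, 3)
```

Such a baseline reaches close to 89% accuracy without identifying a single subscriber, which is why accuracy alone is a weak guide on the unbalanced data.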

Response Frequency
no 36548
yes 36513

After oversampling, the minority class is represented almost equally with the majority class, and the dataset is far more balanced.
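The oversampling step itself can be sketched in base R as follows. This is a minimal illustration on a hypothetical toy data frame, not the code used for the report: the helper `random_oversample()` and the seed are assumptions, and unlike the table above it equalizes the two classes exactly.

```r
set.seed(123)  # assumed seed, used only to make the sketch reproducible

# Toy imbalanced data standing in for the real data set (hypothetical values)
toy <- data.frame(
  age = c(25, 31, 47, 52, 38, 44, 29, 60, 35, 41),
  customer_response = c("no", "no", "no", "no", "no", "no", "no", "no", "yes", "yes")
)

# Hypothetical helper: duplicate randomly chosen minority rows
# until the minority class is as large as the majority class
random_oversample <- function(data, target) {
  counts <- table(data[[target]])
  n_major <- max(counts)
  minority <- names(counts)[which.min(counts)]
  minority_rows <- which(data[[target]] == minority)
  n_extra <- n_major - length(minority_rows)
  # Draw minority rows with replacement
  extra <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
  data[c(seq_len(nrow(data)), extra), , drop = FALSE]
}

toy_balanced <- random_oversample(toy, "customer_response")
table(toy_balanced$customer_response)
```

One design point worth keeping in mind: if duplicates are created before the train/test split, copies of the same customer can land in both sets, so applying such a helper to the training rows only is the more conservative choice.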

4.2 Transforming Numerical and Categorical Variables

One-hot encoding was employed to convert categorical variables into numerical inputs, making them suitable for various machine-learning algorithms. The standardization technique was applied for numerical variables to transform the data, ensuring a mean of zero and a standard deviation of one.
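The two transformations can be illustrated with a small base-R sketch on hypothetical values. The report's appendix uses the recipes package for this step; `model.matrix()` and `scale()` are used here only to show the idea for one categorical and one numeric column.

```r
# Hypothetical toy data: one numeric and one categorical variable
toy <- data.frame(
  age = c(25, 40, 55, 30),
  contact = factor(c("cellular", "telephone", "cellular", "cellular"))
)

# One-hot encoding: one 0/1 column per level
# ("- 1" drops the intercept so that no level of this single factor is omitted)
contact_onehot <- model.matrix(~ contact - 1, data = toy)

# Standardization: subtract the mean and divide by the standard deviation
age_std <- as.numeric(scale(toy$age))

encoded <- data.frame(age = age_std, contact_onehot)
encoded
```

After the transformation the numeric column has mean zero and standard deviation one, and each row has exactly one indicator column equal to 1.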

4.3 Feature Selection

Decision tree modeling was used for feature selection. The following features were selected by the decision tree model as important.

## important_features
##                      cons_conf_idx                     cons_price_idx 
##                                  1                                  1 
##                       emp_var_rate                          euribor3m 
##                                  1                                  1 
##                        nr_employed     previously_contacted_Contacted 
##                                  1                                  1 
## previously_contacted_Not.contacted 
##                                  1

With the important features selected by the decision tree algorithm, the whole data set was then split into training and testing sets.
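A stratified split keeps the class proportions the same in the training and testing sets. The appendix performs the split with caret's `createDataPartition()` and `p = 0.8`; the base-R sketch below shows the same idea on hypothetical toy data, with `stratified_split()` being an illustrative helper rather than the report's code.

```r
set.seed(123)  # assumed seed, for reproducibility of the sketch

# Hypothetical toy data with two equally sized classes
toy <- data.frame(
  x = seq_len(20),
  customer_response = rep(c("no", "yes"), each = 10)
)

# Hypothetical helper: sample a fraction p of the rows within each class
stratified_split <- function(data, target, p = 0.8) {
  idx_by_class <- split(seq_len(nrow(data)), data[[target]])
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }), use.names = FALSE)
  list(
    train = data[train_idx, , drop = FALSE],
    test = data[-train_idx, , drop = FALSE]
  )
}

parts <- stratified_split(toy, "customer_response", p = 0.8)
table(parts$train$customer_response)
table(parts$test$customer_response)
```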

4.4 Evaluating Models

Model Performance Metrics
Model                Accuracy   AUC        Precision  Recall     F1_Score
Decision Tree        0.7394429  0.7493050  0.8663292  0.6911155  0.7688665
Random Forest        0.7551845  0.8111904  0.8915036  0.7006452  0.7846348
SVM                  0.7271234  0.7711920  0.7273225  0.7272230  0.7272727
Logistic Regression  0.7271234  0.7728307  0.7267752  0.7274719  0.7271234

Among the four models, the Random Forest demonstrates the best performance on accuracy, AUC, precision, and F1 score; its recall (0.70) is slightly lower than that of the SVM and logistic regression models (0.73). The Random Forest results can be interpreted as follows:

  • Accuracy (0.75): Accuracy is the percentage of correct predictions made by the model out of all predictions. If we pick 100 customers’ responses (yes or no), the model would correctly predict 75 of those responses.
  • AUC (0.81): An AUC of 0.81 means that the model is quite good at distinguishing between the two classes it is trying to predict. This number means that if you randomly pick one example from the positive class and one from the negative class, there’s an 81% chance that the model will correctly identify which one is which.
  • Precision (0.89): Precision is the percentage of positive predictions made by the model that are actually correct. When the Random Forest model predicts that a customer will respond ‘yes’, it is correct about 89% of the time.
  • Recall (0.70): Recall, also known as sensitivity or true positive rate, is a measure of a model’s ability to correctly identify positive instances from all actual positive instances in the dataset. In our case, of all the customers who actually responded, the Random Forest model successfully identified about 70% of them. This means it missed about 30% of the customers who did respond.
  • F1 Score (0.78): The F1 Score is a measure of a model’s accuracy that considers both precision and recall. It is especially useful when we need to balance the trade-off between precision (the accuracy of the positive predictions) and recall (the ability to find all positive instances). The F1 Score is the harmonic mean of precision and recall, and it ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0 indicates the worst performance. In our case, F1 Score of 0.78 indicates a good balance between the model correctly predicting customer responses and not missing too many actual responses. It combines the high precision (few false alarms) and reasonable recall (few missed responses).
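The definitions in the list above can be written out as a short base-R calculation. The confusion counts below are hypothetical and serve only to show the formulas; the last lines check that the Random Forest precision and recall from the metrics table reproduce the reported F1 score.

```r
# Hypothetical confusion counts for the positive class "yes"
# (illustrative values, not the report's results)
tp <- 70  # predicted yes, actually yes
fp <- 10  # predicted yes, actually no
fn <- 30  # predicted no, actually yes
tn <- 90  # predicted no, actually no

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 4)

# Consistency check with the Random Forest row of the metrics table
rf_precision <- 0.8915036
rf_recall    <- 0.7006452
rf_f1 <- 2 * rf_precision * rf_recall / (rf_precision + rf_recall)
round(rf_f1, 4)  # 0.7846, in agreement with the reported F1 score
```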

## [1] "AUC Decision Tree: 0.749305006951617"
## [1] "AUC Random Forest: 0.811190426109134"
## [1] "AUC SVM: 0.771191966665816"
## [1] "AUC Logistic Regression: 0.772830723999059"

Although the models performed similarly, the Random Forest (yellow curve) has the highest area under the curve (AUC). It maintains a higher True Positive Rate (TPR) at a given False Positive Rate (FPR) across most thresholds, which indicates that it is the most effective of the four models at distinguishing between the classes.
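The ranking interpretation of the AUC can be checked directly by comparing every pair made of one subscriber and one non-subscriber. The sketch below uses hypothetical predicted probabilities, not the models' actual outputs.

```r
# Hypothetical predicted probabilities of "yes" (illustrative values only)
scores_yes <- c(0.9, 0.8, 0.35, 0.6)      # customers who subscribed
scores_no  <- c(0.7, 0.3, 0.2, 0.1, 0.4)  # customers who did not subscribe

# Score difference for every (subscriber, non-subscriber) pair
pairs <- outer(scores_yes, scores_no, FUN = "-")

# Share of pairs in which the subscriber receives the higher score;
# a tie counts as half a correct ranking
auc_pairwise <- mean((pairs > 0) + 0.5 * (pairs == 0))
auc_pairwise  # 17 of the 20 pairs are ranked correctly, so 0.85
```

This pairwise proportion is the same quantity as the area under the ROC curve, which is why an AUC of 0.81 can be read as an 81% chance of ranking a random subscriber above a random non-subscriber.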

5 Conclusion

The important features identified by the decision tree fall into three groups: cons_conf_idx, cons_price_idx, and euribor3m reflect general economic conditions; emp_var_rate and nr_employed reflect the employment situation; and previously_contacted records past interactions with the customer. These variables are crucial for predicting customer behavior and making informed decisions. Among the evaluated models, the Random Forest is the most reliable, offering the best accuracy, AUC, precision, and F1 score.

6 Data Wrangling Appendix

6.1 Data Cleaning

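# Load the packages used for cleaning
library(dplyr)    # pipes, distinct(), rename(), mutate()
library(janitor)  # clean_names()
library(naniar)   # n_var_miss()

# Convert all column names to lower-case snake_case
bank_marketing <- bank_marketing %>% 
        clean_names()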

# look at the data

dim(bank_marketing)
names(bank_marketing)
str(bank_marketing)

# Count the variables that contain missing values
n_var_miss(bank_marketing)

# check for blank rows
blank_rows <- !complete.cases(bank_marketing)
bank_marketing[blank_rows,]

# Check for duplicate rows
dup <- duplicated(bank_marketing)

# Print the duplicated rows
bank_marketing[dup,]

# Remove the duplicate rows
bank_marketing <- bank_marketing %>%
        distinct()


# Check for distinct values in age categories
unique(bank_marketing$age)

# Create age groups from 10 to 100 in 10-year bands
age_to_categorical <- function(data, age) {
        # right = FALSE gives bins [10, 20), [20, 30), ... so that they match the labels;
        # include.lowest = TRUE closes the last bin so that age 100 falls in '90-100'
        age_group <- cut(age, breaks = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100), 
                         labels = c('10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90-100'),
                         right = FALSE, include.lowest = TRUE)
        # Add the age group to the data frame passed in and drop the original age column
        data$age_group <- age_group
        data$age <- NULL
        
        return(data)
}
# Adding age_group to the data frame
bank_marketing <- age_to_categorical(bank_marketing, bank_marketing$age)


# renaming target column
bank_marketing <- bank_marketing %>% 
        rename(customer_response = y)



# Update the education levels to group "basic.4y," "basic.6y," and "basic.9y" as "less than or equal to middle school"
bank_marketing_modified <- bank_marketing %>%
        mutate(education_group = ifelse(education %in% c("basic.4y", "basic.6y", "basic.9y"),
                                        "Less or equal to middle school",
                                        education))
# Remove the 'education' column
bank_marketing_modified <- bank_marketing_modified[, -which(names(bank_marketing_modified) == "education")]



## transform the campaign into categorical variable

# Create campaign groups covering 1 to 60 contacts
campaign_to_categorical <- function(bank_marketing_modified, campaign) {
        # cut() uses right-closed bins by default: (0, 10], (10, 20], ...,
        # so the labels below name the values each bin actually contains
        campaign_group <- cut(campaign, breaks = c(0,10, 20, 30, 40, 50, 60), 
                              labels = c('1-10','11-20', '21-30', '31-40', '41-50', '51-60'))
        # Add the campaign group and drop the original campaign column
        bank_marketing_modified$campaign_group <- campaign_group
        bank_marketing_modified$campaign <- NULL
        
        return(bank_marketing_modified)
}
# Adding campaign_group to the data frame
bank_marketing_modified <- campaign_to_categorical(bank_marketing_modified, bank_marketing_modified$campaign)



# Transform pdays into a categorical variable, since 999 means not contacted before

pdays_to_categorical <- function(bank_marketing_modified, pdays) {
        # Bins are [0, 30] and (30, 999]; include.lowest = TRUE keeps any pdays = 0
        # in the first group instead of turning it into NA
        previously_contacted <- cut(pdays, breaks = c(0, 30, 999), 
                                    labels = c('Contacted within a month', 'Not contacted before'),
                                    include.lowest = TRUE)
        
        # Inserting the previously contacted group and deleting the original pdays variable
        bank_marketing_modified$previously_contacted <- previously_contacted
        bank_marketing_modified$pdays <- NULL
        
        return(bank_marketing_modified)
}

# Adding previously_contacted to the data frame
bank_marketing_modified <- pdays_to_categorical(bank_marketing_modified, bank_marketing_modified$pdays)

table(bank_marketing_modified$previously_contacted)

6.2 Handling Imbalanced Data Using Oversampling Method

# Handling the imbalanced data

# Generate the frequency table of customer responses after oversampling
response_frequency2 <- as.data.frame(table(data_balanced$customer_response))

# Display the table using kable and kableExtra
library(kableExtra)
kable(response_frequency2, col.names = c("Response", "Frequency")) %>%
  kable_styling(position = "center", full_width = FALSE) %>%
  column_spec(1, width = "8em") %>%  # Adjust width of the first column
  column_spec(2, width = "8em")      # Adjust width of the second column
# One-hot encoding for the categorical variables and standardization of numerical variables

# Load the required libraries
library(caret)
library(tidymodels) # loads recipes and other packages
library(dplyr)



# Update the 'cat' vector to exclude the outcome variable 
cat <- setdiff(cat, "customer_response")



# Verify column names in our dataset
print(names(data_balanced))

str(data_balanced)




library(recipes)

# Create a recipe for preprocessing
recipe_obj <- recipe(customer_response~., data = data_balanced) %>%
        step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%  # One-hot encoding
        step_center(all_numeric(), -all_outcomes()) %>%  # Center numeric variables (first half of standardization)
        step_scale(all_numeric(), -all_outcomes())  # Scale numeric variables

# Prepare the recipe with your data
prep_obj <- prep(recipe_obj, training = data_balanced)

# Bake the recipe to create the final dataset
bank_marketing_encoded <- bake(prep_obj, new_data = NULL)




## Splitting the data 

library(caret)
set.seed(123) # For reproducibility

# Create indices for the training set, with stratification
train_indices <- createDataPartition(bank_marketing_encoded$customer_response, p = 0.8, list = FALSE)

# Split the data using the indices
train_data <- bank_marketing_encoded[train_indices, ]
test_data <- bank_marketing_encoded[-train_indices, ]
#---------------------- Decision Tree  for Feature Selection -----------------------------------

library(caret)
library(rpart)
library(randomForest)
library(e1071)  # For SVM
library(MLmetrics)
library(pROC)



# Define the cross-validation settings shared by the models
train_ctrl <- trainControl(
        method = "cv",
        number = 10,
        savePredictions = "final",  # save predictions for the final model
        classProbs = TRUE,  # class probabilities are needed to compute ROC
        summaryFunction = twoClassSummary  # summary function for binary classification
)



model_tree <- train(
        customer_response ~ .,
        data = train_data,
        method = "rpart",
        trControl = train_ctrl,
        metric = "ROC"
)



#print(summary(model_tree))

# Extract feature importance
importance <- varImp(model_tree, scale = FALSE)

# Print feature importance
print(importance)
# Check the structure of the importance object
str(importance)

# Plot feature importance
plot(importance, top = 20)

# Extract features with non-zero importance
important_features <- rownames(importance$importance)[importance$importance$Overall > 0]

# Print the important features
print(important_features)


# Ensure all important features are in test data with only the important features
common_features <- intersect(important_features, names(test_data))

# Subset the train and test data
train_data_selected <- train_data[, c(important_features, "customer_response")]
test_data_selected <- test_data[, c(common_features, "customer_response")]
library(randomForest)  # For Random Forest
library(e1071)         # For SVM
library(naivebayes)    # For Naive Bayes
library(gbm)           # For GBM
library(caret)
library(rpart)
library(rpart.plot)
library(MLmetrics)
library(pROC)
library(ranger)



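# Put "yes" first so that it is treated as the positive class;
# the models below are trained and evaluated on the *_selected data frames
train_data_selected$customer_response <- factor(train_data_selected$customer_response, levels = c("yes", "no"))
test_data_selected$customer_response <- factor(test_data_selected$customer_response, levels = c("yes", "no"))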



#------------------------- Decision Tree -------------------------------------

# Re-train the decision tree model with the selected features
model_tree_selected <- train(
        customer_response ~ .,
        data = train_data_selected,
        method = "rpart",
        trControl = train_ctrl,
        metric = "ROC"
)




#--------------- Train with models ------------------------------

# Train Logistic Regression
model_logistic <- train(
        customer_response ~ .,
        data = train_data_selected,
        method = "glm",
        family = binomial,
        trControl = train_ctrl,
        metric = "ROC"
)


library(doParallel)
cl <- makePSOCKcluster(detectCores() - 1)  # Leave one core free
registerDoParallel(cl)

train_ctrl <- trainControl(
        method = "cv",
        number = 10,
        savePredictions = "final",
        classProbs = TRUE,
        summaryFunction = twoClassSummary,
        allowParallel = TRUE  # Enable parallel processing
)



# Train Random Forest
model_rf <- train(
        customer_response ~ .,
        data = train_data_selected,
        method = "ranger",
        trControl = train_ctrl,
        metric = "ROC"
)


# Train SVM
model_svm <- train(
        customer_response ~ .,
        data = train_data_selected,
        method = "svmLinear",
        trControl = train_ctrl,
        metric = "ROC"
)



# Stop the cluster after training
stopCluster(cl)
# Model predictions and evaluation

# Evaluate a fitted caret model on a held-out data set
evaluate_model <- function(model, test_data) {
        truth <- test_data$customer_response
        predictions <- predict(model, newdata = test_data, type = "raw")
        # Select the probability of the "yes" class by name rather than by column position
        prob_predictions <- predict(model, newdata = test_data, type = "prob")[, "yes"]
        
        conf_matrix <- confusionMatrix(predictions, truth, positive = "yes")
        roc_result <- roc(truth, prob_predictions)
        auc_value <- auc(roc_result)
        
        # MLmetrics expects the true labels first and the predictions second
        precision <- Precision(y_true = truth, y_pred = predictions, positive = "yes")
        recall <- Recall(y_true = truth, y_pred = predictions, positive = "yes")
        f1_score <- F1_Score(y_true = truth, y_pred = predictions, positive = "yes")
        
        list(
                ConfusionMatrix = conf_matrix,
                AUC = auc_value,
                Precision = precision,
                Recall = recall,
                F1_Score = f1_score
        )
}

# Evaluate each model using test_data_selected
results_tree <- evaluate_model(model_tree_selected, test_data_selected)
results_rf <- evaluate_model(model_rf, test_data_selected)
results_svm <- evaluate_model(model_svm, test_data_selected)
#results_knn <- evaluate_model(model_knn, test_data_selected)
results_logistic <- evaluate_model(model_logistic, test_data_selected)



# Extract and compile results
model_names <- c("Decision Tree", "Random Forest", "SVM", "Logistic Regression")
metrics <- data.frame(
        Model = model_names,
        Accuracy = c(
                results_tree$ConfusionMatrix$overall["Accuracy"],
                results_rf$ConfusionMatrix$overall["Accuracy"],
                results_svm$ConfusionMatrix$overall["Accuracy"],
                #results_knn$ConfusionMatrix$overall["Accuracy"],
                results_logistic$ConfusionMatrix$overall["Accuracy"]
        ),
        AUC = c(
                results_tree$AUC,
                results_rf$AUC,
                results_svm$AUC,
                #results_knn$AUC,
                results_logistic$AUC
        ),
        Precision = c(
                results_tree$Precision,
                results_rf$Precision,
                results_svm$Precision,
                #results_knn$Precision,
                results_logistic$Precision
        ),
        Recall = c(
                results_tree$Recall,
                results_rf$Recall,
                results_svm$Recall,
                #results_knn$Recall,
                results_logistic$Recall
        ),
        F1_Score = c(
                results_tree$F1_Score,
                results_rf$F1_Score,
                results_svm$F1_Score,
                #results_knn$F1_Score,
                results_logistic$F1_Score
        )
)

# Print the metrics table in an organized way using kable
kable(metrics, caption = "Model Performance Metrics")

7 Data Visualization Appendix

# Distinguishing numerical and categorical columns
columns <- sapply(bank_marketing_modified, class)
cat <- names(columns[columns == "character"])
num <- names(columns[columns != "character"])

cat("Categorical:", cat, "\n")
cat("Numerical:", num, "\n")

# Explore frequency of categorical columns
c <- length(cat)
rows <- (c %/% 3) + (c %% 3 > 0)
print(rows)

par(mfrow = c(rows, 3))
par(mar = c(4, 4, 2, 2))

lapply(cat, function(col) {
        print(barplot(table(bank_marketing_modified[[col]]), 
                      main = paste("Countplot of", col), 
                      col = "#14213d", border = "black", 
                      las = 2, cex.names = 0.8))
})




# Reset the plotting layout to default
par(mfrow = c(1, 1))

# Explore the numerical variables
n <- length(num)
nrows <- (n %/% 3) + (n %% 3 > 0)
print(nrows)

par(mfrow = c(nrows, 3))
par(mar = c(4, 4, 2, 2))

lapply(num, function(col) {
        if (is.numeric(bank_marketing_modified[[col]])) {
                print(hist(bank_marketing_modified[[col]], breaks = 15, 
                           col = "#fca311", main = paste("Histogram of", col), 
                           xlab = "", ylab = ""))
        } else {
                print(paste(col, "is not numeric and cannot be plotted as a histogram."))
        }
})



table(bank_marketing_modified$previous)

library(ggplot2)  # ggplot(), geom_bar(), and the other plotting functions used below
library(scales)
# Barplot for customer_response
(barplot_customer_response <-  bank_marketing %>%
        count(customer_response) %>%
        #mutate(customer_response = fct_reorder(customer_response, n)) %>%
        ggplot(aes(x = customer_response, y = n/sum(n), fill = customer_response)) +
        geom_bar(stat="identity", alpha=.6, position = position_dodge(width = 0.1), width=.6) +
        scale_fill_manual(values = c( "#084c61", "#fca311")) +  
        geom_text(aes(label = paste0(sprintf("%.1f", (n/sum(n)) * 100), "%")), 
                  vjust = -0.2, size = 3.5, color = "black") +
        #coord_flip() +
        labs(title = "Frequency Percentage of Customer Responses",
             x = NULL, y = "", fill = "Customer Response") +
        scale_y_continuous(labels = percent_format()) +
        theme(panel.grid = element_blank(),
              panel.background = element_blank(),
              axis.line = element_blank(),
              axis.ticks.x = element_blank(),
              axis.text.x=element_blank(),
              axis.ticks.y = element_blank(),
              axis.text.y=element_blank())
)
          


# Load libraries (scales is already attached above)
library(forcats)

# Creating barplot for age_group
(barplot_age_group <- bank_marketing %>%
                count(age_group) %>%
                mutate(age_group = fct_reorder(age_group, n)) %>%
                ggplot(aes(x = age_group, y = n/sum(n))) +
                geom_bar(stat="identity", fill="#14213d", alpha=.6, width=.4) +
                geom_text(aes(label = paste0(sprintf("%.1f", (n/sum(n)) * 100), "%")), 
                          hjust = -0.1, size = 3.5, color = "black") +
                coord_flip() +
                labs(title = "Frequency Percentage of each Age Group",
                     x = NULL, y = "") +
                scale_y_continuous(labels = percent_format()) +
                theme(panel.grid = element_blank(),
                      panel.background = element_blank(),
                      axis.line = element_blank(),
                      axis.ticks.x = element_blank(),
                      axis.text.x=element_blank())
        )


# Which job category was contacted most often by the bank's marketing team?
# Optional frequency table via the jmv package (left disabled):
# install.packages("jmv")
# library("jmv")
# bank_marketing %>%
#         dplyr::select(marital) %>%
#         descriptives(freq = TRUE)





# Create a stacked bar plot of the campaign groups in each month
(stacked_barplot <- bank_marketing_modified %>%
                count(campaign_group, month) %>%
                mutate(campaign_group = fct_reorder(campaign_group, n, .desc = TRUE)) %>%
                ggplot(aes(x = month, y = n, fill = campaign_group)) +
                scale_fill_manual(values = c("#084c61", "#fca311", "#14213d", "#db3a34", "#177e89", "#ffc857")) +
                geom_bar(stat = "identity") +
                labs(title = "Stacked Bar Plot of Campaign by Month",
                     x = "Month", y = "", fill = "Campaign") +
                theme(panel.grid = element_blank(),
                      panel.background = element_blank(),
                      legend.position = "right",
                      plot.title = element_text(size = 14, face = "bold")))
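
If `month` is stored as text, ggplot2 arranges the x-axis alphabetically rather than chronologically. A minimal sketch of one way to get calendar order is shown here, assuming the months are coded as lowercase three-letter abbreviations ("jan", "feb", ...); the vector `month_chr` is illustrative and not taken from the data.

```r
# Illustrative month codes (assumed format: lowercase three-letter abbreviations)
month_chr <- c("may", "jul", "aug", "mar", "nov", "apr")

# month.abb is a base R constant: "Jan", "Feb", ..., "Dec"
month_levels <- tolower(month.abb)

# Convert to a factor whose levels follow the calendar
month_fct <- factor(month_chr, levels = month_levels)

# Sorting now follows calendar order instead of alphabetical order
as.character(sort(month_fct))
```

Applying the same `factor()` call to the `month` column before plotting, for example inside a `mutate()`, would be expected to give a chronological axis, provided the coding assumption holds.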




library(tidyverse)
library(scales)
library(patchwork)

# Frequency percentage of total for each job group (sorted)
(barplot_job <- bank_marketing %>%
        count(job) %>%
        mutate(job = fct_reorder(job, n)) %>%
        ggplot(aes(x = n/sum(n), y = job)) +
        geom_col(fill = "#084c61", alpha = 0.8) +
        geom_text(aes(label = paste0(sprintf("%.1f", (n/sum(n)) * 100), "%")),
                  hjust = -0.2, color = "black", size = 3.5) +
        labs(title = "Frequency Percentage of each Job Group",
             x = "", y = NULL) +
        scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
        theme_minimal() +
        theme(panel.grid = element_blank(),
              axis.ticks.x = element_blank(),
              axis.text.x=element_blank()))

#----------------------------------------------------------------
# Plot 1: Frequency percentage of total for each education level
(barplot_education1 <- bank_marketing_modified %>%
                filter(education_group != "illiterate") %>%
                count(education_group) %>%
                mutate( education_group = fct_reorder(education_group, n)) %>%
                ggplot(aes(x = n/sum(n), y = education_group)) +
                geom_col(fill = "#14213d", alpha = 0.8) +
                geom_text(aes(label = paste0(sprintf("%.1f", (n/sum(n)) * 100), "%")),
                          hjust = -0.0005, color = "black", size = 3.5) +
                labs(title = "Frequency Percentage of each Education Level",
                     x = "", y = NULL) +
                scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
                theme_minimal() +
                theme(panel.grid = element_blank(),
                      axis.ticks.x = element_blank(),
                      axis.text.x=element_blank()))
        

# Plot 2: Percentage of total customer response (yes/no) by education group (side by side bar graph)
(barplot_education2 <- bank_marketing_modified %>%
                filter(education_group != "illiterate") %>%
                count(education_group, customer_response) %>%
                mutate(perc = n/sum(n)) %>%
                ggplot(aes(x = reorder(education_group, -perc), y = perc, fill = customer_response)) +
                geom_col(position = position_dodge(width = 0.9), alpha = 0.8) +
                geom_text(aes(label = paste0(sprintf("%.1f", perc * 100), "%")),
                          position = position_dodge(width = 0.9),
                          vjust = -.07, size = 3.5, color = "black") +
                labs(title = "Percentage of Total Customer Response by Education Group",
                     x = "", y = "", fill = "Customer Response") +
                scale_y_continuous(labels = percent_format()) +
                scale_fill_manual(values = c("#084c61", "#fca311")) +
                theme(panel.grid = element_blank(),
                      panel.background = element_blank(),
                      axis.line = element_blank(),
                      legend.position = "right",
                      plot.title = element_text(size = 14, face = "bold"),
                      axis.ticks.y = element_blank(),
                      axis.text.y = element_blank(),
                      axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)))
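
The plot above expresses each bar as a share of all contacts, so large education groups dominate the picture. To compare response rates between groups, the share can instead be computed within each education group. A minimal base R sketch on made-up counts follows; the data frame `toy` and its values are hypothetical.

```r
# Hypothetical toy data: 10 contacts across two education groups
toy <- data.frame(
  education_group   = c(rep("basic", 6), rep("university", 4)),
  customer_response = c("no", "no", "no", "no", "no", "yes",
                        "no", "no", "yes", "yes")
)

counts <- table(toy$education_group, toy$customer_response)

# Share of all contacts (what n / sum(n) on ungrouped counts gives)
share_of_total <- prop.table(counts)

# Share within each education group (each row sums to 1)
share_within_group <- prop.table(counts, margin = 1)

round(share_within_group, 2)
```

In the dplyr pipeline, the equivalent change would be to add `group_by(education_group)` before the `mutate()` that computes `perc`.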



# Create the scatterplot
(scatterplot <- ggplot(bank_marketing, aes(x = campaign, y = duration, color = customer_response)) +
        geom_point() +
        scale_color_manual(values = c("#fca311", "#084c61")) +
        labs(title = "Scatterplot of Duration vs Campaign",
             x = "Campaign",
             y = "Duration",
             color = "Customer Response") +
        theme(panel.grid = element_blank(),
              panel.background = element_blank(),
              legend.position = "none",
              plot.title = element_text(size = 14, face = "bold")) +
        scale_x_continuous(breaks = seq(0, 60, 10)))  # Specify the desired tick mark positions




# Boxplot of last call duration by customer response
boxplot_duration <- ggplot(bank_marketing, aes(x = customer_response, y = duration, fill = customer_response)) +
        geom_boxplot() +
        scale_fill_manual(values = c("#084c61", "#fca311")) +
        labs(title = "Boxplot of Duration \n by Customer Response",
             x = "",
             y = "",
             fill = "Customer Response") +
        theme(panel.grid = element_blank(),
              panel.background = element_blank(),
              legend.position = "top",
              plot.title = element_text(size = 14, face = "bold"))

# Boxplot of call duration by customer response for each job group
(boxplot_job <- ggplot(bank_marketing, aes(x = job, y = duration, fill = customer_response)) +
                geom_boxplot() +
                labs(title = "Distribution of Call Duration by Customer Response for each Job Group",
                     x = "",
                     y = "") +
                scale_fill_manual(values = c( "#fca311", "#084c61")) +
                theme(panel.grid = element_blank(),
                      panel.background = element_blank(),
                      #axis.line = element_blank(),
                      legend.position = "bottom",
                      plot.title = element_text(size = 14, face = "bold"),
                      #axis.text.y =element_blank(),
                      axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)))