Workflow Sets in Tidymodels - Second Look into Pima Indians Diabetes Dataset


This post presents an example of using workflow sets in the tidymodels framework with cross-validation and varying recipes and models.

The dataset in use is Pima Indians Diabetes. The previous post established a baseline accuracy of 73%; the goal this time is to improve on that number.

Preliminaries

This section repeats the analysis from the previous post. In short, it loads the data, renames the outcome variable, and replaces zeros with missing values.

library(tidyverse)
library(tidymodels)
set.seed(0)

clean <- function(data) {
  data |> 
    mutate(
      bmi = na_if(bmi, 0),
      diastolic_blood_pressure = na_if(diastolic_blood_pressure, 0),
      plasma_concentration = na_if(plasma_concentration, 0),
      serum_insulin = na_if(serum_insulin, 0),
      triceps_skinfold_thickness = na_if(triceps_skinfold_thickness, 0),
    )
}

visualise <- function(data) {
  data |> 
    pivot_longer(
      cols = 1:8,
      names_to = "measurement",
      values_to = "value",
    ) |> 
    filter(!if_any(everything(), is.na)) |> 
    ggplot(aes(x = value, fill = is_diabetic)) + 
    geom_histogram(bins = 50, position = "dodge") +
    facet_wrap(~measurement, nrow = 2, ncol = 4, scales = "free") +
    labs(
      title = "Distribution of Eight Health Measurements by Diabetes Status",
      x = "Value",
      y = "Count",
      fill = "Diabetic?",
    )
}

data_raw <- read_csv('./data/pima-indians-diabetes-data.csv')
data <- data_raw |> 
  mutate(
    class = factor(class, levels = c("1", "0"), labels = c("Yes", "No"))
  ) |> 
  rename(
    is_diabetic = class
  ) |> 
  clean()

visualise(data)

Cross-validation

Cross-validation is a technique which enables experimenting with many different models and pre-processing steps while ensuring that the results generalise.

The basic modelling setup, fitting a model to the train partition and evaluating it on the test partition, breaks down under repeated experimentation. The risk of overfitting to the test partition increases as more models are tried, making performance estimates unreliable.

Cross-validation solves this by minimising the use of the test partition. In the simplest case (many alternative schemes exist), it divides the train partition into n folds (n = 10 in the code below). Each model is trained on n-1 folds and evaluated on the remaining fold, giving a total of n performance estimates. Once the best model is identified, it is trained on the whole of the train partition and evaluated one final time on the test partition.

data_split <- initial_split(
  data,
  prop = 0.75,
  strata = is_diabetic
)

data_train <- training(data_split)
data_folds <- vfold_cv(
  data_train,
  v = 10,
  strata = is_diabetic
)
data_test <- testing(data_split)

metrics <- metric_set(accuracy, precision, recall)
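
As a quick sanity check (a sketch, not part of the original analysis), the fold objects created above can be inspected to confirm that each resample fits on roughly nine tenths of the training rows and assesses on the remaining tenth:

# Count analysis/assessment rows per fold; with v = 10 each assessment
# set should hold roughly a tenth of the train partition.
data_folds |> 
  mutate(
    n_analysis = map_int(splits, ~ nrow(analysis(.x))),
    n_assessment = map_int(splits, ~ nrow(assessment(.x)))
  )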

Models

In tidymodels, the package that provides a standardised interface for specifying machine learning models is parsnip.

Many models are supported. They can be browsed by running the parsnip_addin() function, which opens a graphical interface for exploring the available models and generating specification code for them.
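
Outside of the addin, parsnip also provides show_engines(), which lists the engines available for a given model type. A quick sketch (the exact engines listed depend on the extension packages installed):

# List the computational engines that can back each model type.
show_engines("svm_linear")
show_engines("boost_tree")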

Here is a list of five selected models to try:

boost_tree_model <- boost_tree() |> 
  set_engine("xgboost") |> 
  set_mode("classification")

logistic_reg_model <- logistic_reg() |> 
  set_engine("glm") |> 
  set_mode("classification")

decision_tree_model <- decision_tree() |> 
  set_engine("rpart") |> 
  set_mode("classification")

rand_forest_model <- rand_forest() |> 
  set_engine("ranger") |> 
  set_mode("classification")

svm_linear_model <- svm_linear() |> 
  set_engine("kernlab") |> 
  set_mode("classification")

na_robust_specifications <- list(
  boost_tree_model,
  logistic_reg_model,
  decision_tree_model,
  rand_forest_model
)

na_sensitive_specifications <- list(
  svm_linear_model
)

Recipes

In tidymodels, recipes encapsulate data pre-processing steps. They also ensure that train and test partitions are treated correctly: pre-processing parameters (if any) are estimated from the train partition and applied unchanged to the test partition.

In the case of the Pima Indians Diabetes dataset, recipes help deal with missing values (NAs, short for “Not Available”).

Here are two recipes to try:

  • Null recipe: no pre-processing; usable only with models that tolerate NAs.
  • Imputer recipe: replace NAs with the median of the respective measurement and add an indicator column marking which observations were imputed.

null_recipe <- recipe(
  is_diabetic ~ .,
  data = data_train
)

imputer_recipe <- recipe(
  is_diabetic ~ .,
  data = data_train
) |>
  step_indicate_na(
    bmi,
    diastolic_blood_pressure,
    plasma_concentration,
    serum_insulin,
    triceps_skinfold_thickness,
  ) |>
  step_impute_median(
    bmi,
    diastolic_blood_pressure,
    plasma_concentration,
    serum_insulin,
    triceps_skinfold_thickness,
  )
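
As a sketch (not part of the original analysis), the imputer recipe can be prepped on the train partition and baked on the test partition to confirm that the medians are estimated from the training data only and that no NAs remain afterwards:

# prep() estimates the medians from the train partition;
# bake() applies them, unchanged, to the test partition.
imputer_recipe |> 
  prep(training = data_train) |> 
  bake(new_data = data_test) |> 
  summarise(across(everything(), ~ sum(is.na(.x))))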

Bringing It All Together

Now that the recipes and model specifications are defined, they can be combined into workflows. In tidymodels, a workflow bundles a pre-processor (here, a recipe) with a model specification, encapsulating the whole modelling process - from raw data to a fitted model. A workflow set crosses several recipes with several models to produce every combination in one object.

workflows <- rbind(
  workflow_set(
    preproc = list(impute = imputer_recipe),
    models = na_sensitive_specifications
  ),
  workflow_set(
    preproc = list(null = null_recipe, impute = imputer_recipe),
    models = na_robust_specifications
  )
)
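
The result is a tibble with one row per combination of recipe and model. The ids combine the preprocessor name (null or impute) with the model type, and any single workflow can be pulled out by its id; for example (a sketch, assuming the ids generated above):

# Show the generated workflow ids.
workflows$wflow_id

# Retrieve an individual workflow for closer inspection.
extract_workflow(workflows, id = "impute_svm_linear")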

fits <- workflows |>
  workflow_map(
    "fit_resamples",
    resamples = data_folds,
    metrics = metrics
  )
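
The workflowsets package also offers rank_results(), which orders the fitted workflows much like the manual collect_metrics() pipeline below; shown here only as an optional alternative:

# Rank all workflows by their mean cross-validated accuracy.
rank_results(fits, rank_metric = "accuracy") |> 
  filter(.metric == "accuracy") |> 
  select(wflow_id, mean, std_err, rank)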

fits |>
  collect_metrics() |>
  filter(.metric == 'accuracy') |>
  arrange(desc(mean)) |>
  select(wflow_id, mean, std_err) |> 
  knitr::kable(digits = 4) |> 
  kableExtra::kable_styling(full_width = FALSE)

wflow_id               mean     std_err
impute_svm_linear      0.7813   0.0096
null_logistic_reg      0.7783   0.0242
null_rand_forest       0.7777   0.0116
impute_logistic_reg    0.7727   0.0092
impute_rand_forest     0.7674   0.0128
impute_boost_tree      0.7623   0.0191
null_boost_tree        0.7501   0.0177
impute_decision_tree   0.7500   0.0062
null_decision_tree     0.7486   0.0113

fits |>
  collect_metrics() |>
  ggplot(aes(x = mean, y = wflow_id, colour = `.metric`)) +
  geom_point() +
  geom_errorbar(aes(xmin = mean - std_err, xmax = mean + std_err), width = 0.2) +
  scale_x_continuous(breaks = seq(0.46, 0.80, by = 0.02)) +
  labs(
    title = "Accuracy, Precision, and Recall Estimates from 10-Fold Cross-Validation for 5 Models and 2 Recipes",
    y = "Model",
    x = "Mean",
    colour = "Metric"
  ) +
  theme(plot.margin = margin(30, 30, 30, 30), legend.position = "bottom")

The best model, as judged by mean accuracy estimate, is the linear SVM combined with the imputer recipe (impute_svm_linear). Its accuracy on the test partition is 75.5%.

best_model <- svm_linear_model

best_model_workflow <- workflow() |>
  add_recipe(imputer_recipe) |>
  add_model(best_model)

final_fit <- best_model_workflow |>
  fit(data = data_train)

Setting default kernel parameters

augment(final_fit, new_data = data_test) |>
  conf_mat(truth = is_diabetic, estimate = .pred_class) |>
  autoplot(type = "heatmap") +
  labs(title = "Linear SVM Confusion Matrix") +
  theme(plot.title = element_text(hjust = 0.5))

augment(final_fit, data_test) |>
  metrics(truth = is_diabetic, estimate = .pred_class) |> 
  knitr::kable(digits = 4) |> 
  kableExtra::kable_styling(full_width = FALSE)

.metric     .estimator   .estimate
accuracy    binary       0.7552
precision   binary       0.6923
recall      binary       0.5373
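
As an aside (a sketch, not how the analysis above was run), tune's last_fit() wraps the two final steps, fitting on the full train partition and evaluating once on the test partition, into a single call driven by the initial split object:

# Fit on the train partition and evaluate on the test partition in one step.
final_resample <- best_model_workflow |> 
  last_fit(data_split, metrics = metrics)

# Test-partition metrics; these should match the table above.
collect_metrics(final_resample)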

Conclusion

This post has showcased workflow sets - a tidymodels feature for varying both recipes and models within a single experiment.

The best of the nine candidate workflows improved prediction accuracy, as estimated on the test partition, from 73% to 75.5%.

A few simplifications that may need to be revisited in the future include:

  • Not evaluating whether accuracy is a suitable metric to try to improve.
  • Not tuning model hyperparameters yet.
  • Not dealing with scaling, normalising, or outliers yet.
