Description
What goes wrong
I'm using tidymodels to build a basic ML model, and then using the vetiver package to serve this model as an API endpoint on GCP in a Docker container. I'm having issues with authentication: the error thrown when I run docker run is "No .httr-oauth file exists in current working directory. Do library authentication steps to provide credentials."
I'm confused about what is causing the issue, because when I run gcs_list_buckets(projectId = Sys.getenv("GCE_DEFAULT_PROJECT_ID")) locally I can see my bucket info, leading me to think I'm authenticated.
Are there any recommendations for authenticating when running inside Docker?
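For context, this is roughly what I'd expect the container-side authentication to look like; the /workspace/auth.json path here is only a placeholder for wherever the key file lives inside the image, and GCS_AUTH_FILE is the environment variable googleCloudStorageR checks:

```r
# Sketch of non-interactive service-account auth inside the container.
# "/workspace/auth.json" is a placeholder path; if the env var points at a
# file that doesn't exist inside the container, googleCloudStorageR falls
# back to interactive auth and looks for an .httr-oauth file.
library(googleCloudStorageR)
gcs_auth(json_file = Sys.getenv("GCS_AUTH_FILE", "/workspace/auth.json"))
gcs_list_buckets(projectId = Sys.getenv("GCE_DEFAULT_PROJECT_ID"))
```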
Steps to reproduce the problem
Please note that if a reproducible example that I can run is not available, then the likelihood of getting any bug fixed is low.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
tidyverse,
googleCloudRunner,
skimr,
tidymodels,
palmerpenguins,
gt,
ranger,
brulee,
pins,
vetiver,
plumber,
conflicted,
usethis,
themis,
googleCloudStorageR,
googleAuthR,
httr,
gargle,
tune,
finetune,
doMC
)
# AUTHENTICATE USING THE SERVICE ACCOUNT JSON FILE REFERENCED IN THE ENVIRON FILE
googleAuthR::gar_auth_service(json_file = Sys.getenv("GCE_AUTH_FILE"))
gcs_list_buckets(projectId = Sys.getenv("GCE_DEFAULT_PROJECT_ID"))
tidymodels_conflicts()
conflict_prefer("penguins", "palmerpenguins")
# PREPARE & SPLIT DATA ----------------------------------------------------
# REMOVE ROWS WITH MISSING SEX, EXCLUDE YEAR AND ISLAND
penguins_df <-
penguins %>%
drop_na(sex) %>%
select(-year, -island)
set.seed(123)
# SPLIT THE DATA INTO TRAIN AND TEST SETS STRATIFIED BY SEX
penguin_split <- initial_split(penguins_df, strata = sex, prop = 3 / 4)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
# CREATE FOLDS FOR CROSS VALIDATION
penguin_folds <- vfold_cv(penguin_train)
# CREATE PREPROCESSING RECIPE ---------------------------------------------
penguin_rec <-
recipe(sex ~ ., data = penguin_train) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
themis::step_upsample(species) %>%
step_dummy(species) %>%
step_normalize(all_numeric_predictors())
# MODEL SPECIFICATION -----------------------------------------------------
# LOGISTIC REGRESSION
glm_spec <-
# L1 REGULARISATION
logistic_reg(penalty = 1) %>%
set_engine("glm")
# RANDOM FOREST
tree_spec <-
rand_forest(min_n = tune()) %>%
set_engine("ranger") %>%
set_mode("classification")
# NEURAL NETWORK WITH TORCH
mlp_brulee_spec <-
mlp(
hidden_units = tune(),
epochs = tune(),
penalty = tune(),
learn_rate = tune()
) %>%
set_engine("brulee") %>%
set_mode("classification")
# MODEL FITTING AND HYPER PARAMETER TUNING --------------------------------
# REGISTER PARALLEL CORES
registerDoMC(cores = 2)
# BAYESIAN OPTIMIZATION FOR HYPER PARAMETER TUNING
bayes_control <- control_bayes(
no_improve = 10L,
time_limit = 20,
save_pred = TRUE,
verbose = TRUE
)
# FIT ALL THREE MODELS WITH HYPER PARAMETER TUNING
workflow_set <-
workflow_set(
preproc = list(penguin_rec),
models = list(
glm = glm_spec,
tree = tree_spec,
torch = mlp_brulee_spec
)
) %>%
workflow_map("tune_bayes",
iter = 50L,
resamples = penguin_folds,
control = bayes_control
)
# COMPARE MODEL RESULTS ---------------------------------------------------
rank_results(workflow_set,
rank_metric = "roc_auc",
select_best = TRUE
) %>%
gt()
# PLOT MODEL PERFORMANCE
workflow_set %>%
autoplot()
# FINALIZE MODEL FIT ------------------------------------------------------
# SELECT THE LOGISTIC MODEL GIVEN THAT IT'S A SIMPLER MODEL AND PERFORMANCE
# IS SIMILAR TO THE NEURAL NET MODEL
best_model_id <- "recipe_glm"
# SELECT BEST MODEL
best_fit <-
workflow_set %>%
extract_workflow_set_result(best_model_id) %>%
select_best(metric = "accuracy")
# CREATE WORKFLOW FOR BEST MODEL
final_workflow <-
workflow_set %>%
extract_workflow(best_model_id) %>%
finalize_workflow(best_fit)
final_fit <-
final_workflow %>%
last_fit(penguin_split)
# FINAL FIT METRICS
final_fit %>%
collect_metrics() %>%
gt()
final_fit %>%
collect_predictions() %>%
roc_curve(sex, .pred_female) %>%
autoplot()
final_fit_to_deploy <- final_fit %>%
extract_workflow()
# VERSION WITH VETIVER ----------------------------------------------------
# INITIALISE VETIVER MODEL OBJECT
v <- vetiver_model(final_fit_to_deploy,
model_name = "logistic_regression_model"
)
v
model_board <- board_gcs(bucket = "ml_ops_in_r_bucket")
model_board %>% vetiver_pin_write(vetiver_model = v)
My API is accessible and returns response status 200.
vetiver_write_plumber(model_board, "logistic_regression_model", rsconnect = FALSE)
vetiver_write_docker(v)
My Dockerfile contains environment variable references to my JSON key file as well as to the bucket and project.
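Concretely, the container is started along these lines (the image name, project ID, and all paths are placeholders; the point is that the key file has to be visible at a path inside the container, since an env var pointing at a host-only path will trigger the interactive .httr-oauth fallback):

```shell
# Sketch of the docker run invocation; my-vetiver-api, my-project-id,
# and the /host/path are placeholders for the real values.
docker run --rm -p 8000:8000 \
  -v /host/path/auth.json:/workspace/auth.json:ro \
  -e GCS_AUTH_FILE=/workspace/auth.json \
  -e GCE_AUTH_FILE=/workspace/auth.json \
  -e GCE_DEFAULT_PROJECT_ID=my-project-id \
  my-vetiver-api
```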
Expected output
Actual output
Before you run your code, please run:
options(googleAuthR.verbose = 2)
and copy-paste the console output here. Check that it doesn't include any sensitive info like auth tokens or accountIds - you can usually just edit those out manually and replace them with, say, XXX.
Session Info
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_Ireland.utf8 LC_CTYPE=English_Ireland.utf8
[3] LC_MONETARY=English_Ireland.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Ireland.utf8
time zone: Europe/Dublin
tzcode source: internal
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] rapidoc_8.4.3 doMC_1.3.5
[3] iterators_1.0.14 foreach_1.5.2
[5] finetune_1.1.0 gargle_1.5.2
[7] httr_1.4.7 googleAuthR_2.0.1
[9] googleCloudStorageR_0.7.0 themis_1.0.2
[11] usethis_2.2.2 conflicted_1.2.0
[13] plumber_1.2.1 vetiver_0.2.3
[15] pins_1.2.1 brulee_0.2.0
[17] ranger_0.15.1 gt_0.9.0
[19] palmerpenguins_0.1.1 yardstick_1.2.0
[21] workflowsets_1.0.1 workflows_1.1.3
[23] tune_1.1.2 rsample_1.2.0
[25] recipes_1.0.8 parsnip_1.1.1
[27] modeldata_1.2.0 infer_1.0.4
[29] dials_1.2.0 scales_1.2.1
[31] broom_1.0.5 tidymodels_1.1.1
[33] skimr_2.1.5 googleCloudRunner_0.5.0
[35] lubridate_1.9.2 forcats_1.0.0
[37] stringr_1.5.0 dplyr_1.1.2
[39] purrr_1.0.2 readr_2.1.4
[41] tidyr_1.3.0 tibble_3.2.1
[43] ggplot2_3.4.3 tidyverse_2.0.0
[45] pacman_0.5.1
loaded via a namespace (and not attached):
[1] torch_0.11.0 rstudioapi_0.15.0 jsonlite_1.8.7
[4] magrittr_2.0.3 farver_2.1.1 fs_1.6.3
[7] vctrs_0.6.3 memoise_2.0.1 askpass_1.2.0
[10] base64enc_0.1-3 butcher_0.3.3 htmltools_0.5.6
[13] curl_5.0.2 sass_0.4.7 parallelly_1.36.0
[16] googlePubsubR_0.0.4 cachem_1.0.8 mime_0.12
[19] lifecycle_1.0.3 pkgconfig_2.0.3 Matrix_1.5-4.1
[22] R6_2.5.1 fastmap_1.1.1 future_1.33.0
[25] digest_0.6.33 colorspace_2.1-0 furrr_0.3.1
[28] ps_1.7.5 labeling_0.4.3 fansi_1.0.4
[31] timechange_0.2.0 compiler_4.3.1 bit64_4.0.5
[34] withr_2.5.0 backports_1.4.1 webutils_1.1
[37] MASS_7.3-60 lava_1.7.2.1 openssl_2.1.1
[40] rappdirs_0.3.3 tools_4.3.1 httpuv_1.6.11
[43] zip_2.3.0 future.apply_1.11.0 nnet_7.3-19
[46] glue_1.6.2 callr_3.7.3 promises_1.2.1
[49] grid_4.3.1 generics_0.1.3 gtable_0.3.4
[52] tzdb_0.4.0 class_7.3-22 data.table_1.14.8
[55] hms_1.1.3 xml2_1.3.5 utf8_1.2.3
[58] pillar_1.9.0 later_1.3.1 splines_4.3.1
[61] lhs_1.1.6 lattice_0.21-8 swagger_3.33.1
[64] survival_3.5-5 bit_4.0.5 tidyselect_1.2.0
[67] coro_1.0.3 jose_1.2.0 knitr_1.43
[70] xfun_0.40 hardhat_1.3.0 timeDate_4022.108
[73] stringi_1.7.12 DiceDesign_1.9 yaml_2.3.7
[76] codetools_0.2-19 cli_3.6.1 rpart_4.1.19
[79] bundle_0.1.0 repr_1.1.6 munsell_0.5.0
[82] processx_3.8.2 Rcpp_1.0.11 ROSE_0.0-4
[85] globals_0.16.2 ellipsis_0.3.2 gower_1.0.1
[88] assertthat_0.2.1 GPfit_1.0-8 listenv_0.9.0
[91] ipred_0.9-14 prodlim_2023.08.28 rlang_1.1.1
Please run sessionInfo() so we can check what versions of packages you have installed