background-image: url(img/portada.png)
background-size: cover
class: animated slideInRight fadeOutLeft, middle

# Classification Models

### 1st Latin American Congress of Women in Bioinformatics and Data Science

---

## Let's model the species 🐧

### Load the data

```r
library(tidymodels)
library(palmerpenguins)

penguins <- palmerpenguins::penguins %>% 
  drop_na() %>%                 # drop rows with missing values
  select(-year, -sex, -island)  # drop the non-numeric columns
glimpse(penguins)               # inspect the remaining variables
```

```
## Rows: 333
## Columns: 5
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.…
## $ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 1…
## $ body_mass_g       <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 380…
```

---

### Step 1: Split the data

```r
library(rsample)
set.seed(123)                  # set the seed for reproducibility
p_split <- penguins %>% 
  initial_split(prop = 0.75)   # 75% for training
p_train <- training(p_split)
p_split
```

```
## <Analysis/Assess/Total>
## <250/83/333>
```

```r
# stratified cross-validation folds
p_folds <- vfold_cv(p_train, strata = species)
```

These are the training/test/total counts:

* __We will _train_ on 250 samples__
* __We will _test_ on 83 samples__
* __Total observations: 333__

---

### Step 2: Preprocessing (the recipe)

```r
# build the recipe
recipe_dt <- p_train %>%
  recipe(species ~ .) %>%
  step_corr(all_predictors()) %>%                     # drop highly correlated predictors
  step_center(all_predictors(), -all_outcomes()) %>%  # center
  step_scale(all_predictors(), -all_outcomes()) %>%   # scale
  prep()
recipe_dt  # print the recipe
```

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          4
## 
## Training data contained 250 data points and no missing data.
## 
## Operations:
## 
## Correlation filter removed no terms [trained]
## Centering for bill_length_mm, ... [trained]
## Scaling for bill_length_mm, ... [trained]
```

---
background-image: url(img/dt-fondo.png)
background-size: cover

### Step 3: Specify the model 🌳

__A decision tree model__

__We will use the default settings__

```r
# specify the model
set.seed(123)
vanilla_tree_spec <- decision_tree() %>%  # decision trees
  set_engine("rpart") %>%                 # rpart engine
  set_mode("classification")              # classification mode
vanilla_tree_spec
```

```
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart
```

---
background-image: url(img/dt-fondo.png)
background-size: cover

### Step 4: Build the workflow

```r
# build the workflow
tree_wf <- workflow() %>%
  add_recipe(recipe_dt) %>%        # add the recipe
  add_model(vanilla_tree_spec)     # add the model
tree_wf
```

```
## ── Workflow ────────────────────────────────────────────────────────────────
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_corr()
## • step_center()
## • step_scale()
## 
## ── Model ───────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart
```

---
background-image: url(img/dt-fondo.png)
background-size: cover

### Step 5: Fit the model

```r
# vanilla model, no tuning
set.seed(123)
vanilla_tree_spec %>%
  fit_resamples(species ~ ., resamples = p_folds) %>%
  collect_metrics()  # unnest the metrics
```

```
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy multiclass 0.953    10  0.0154
## 2 roc_auc  hand_till  0.946    10  0.0197
```

---
background-image: url(img/dt-fondo.png)
background-size: cover

#### Step 5.1: Specify two hyperparameters

```r
set.seed(123)
trees_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification") %>%
  set_args(min_n = 20, cost_complexity = 0.1)  # specify the hyperparameters

trees_spec %>%
  fit_resamples(species ~ ., resamples = p_folds) %>%
  collect_metrics()
```

```
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy multiclass 0.953    10  0.0154
## 2 roc_auc  hand_till  0.946    10  0.0197
```

---
background-image: url(img/dt-fondo.png)
background-size: cover

# Exercise

.bg-near-white.b--purple.ba.bw2.br3.shadow-5.ph4.mt5[
#### 1. Why do we get the same value in both cases?
#### 2. Keeping min_n = 20 fixed, try C = 1, C = 0.5 and C = 0.
#### 3. Keeping C = 0 fixed, try min_n = 1 and min_n = 5.
]
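---
background-image: url(img/dt-fondo.png)
background-size: cover

#### A starting point for the exercise

One way to compare several `cost_complexity` values in a single call (a sketch, not part of the original material; the `metrics_for()` helper is illustrative and assumes `p_folds` from Step 1 is available):

```r
# Illustrative helper: cross-validate a tree for a given cost_complexity
metrics_for <- function(cc) {
  decision_tree(min_n = 20, cost_complexity = cc) %>%
    set_engine("rpart") %>%
    set_mode("classification") %>%
    fit_resamples(species ~ ., resamples = p_folds) %>%
    collect_metrics() %>%
    mutate(cost_complexity = cc)
}

# compare the three values from point 2 of the exercise
purrr::map_dfr(c(1, 0.5, 0), metrics_for)
```

The same pattern works for point 3 by fixing `cost_complexity = 0` and mapping over `min_n` values instead.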
---
background-image: url(img/dt-fondo.png)
background-size: cover

### Step 6: Model prediction 🔮

```r
# last_fit() takes the workflow and the data split
final_fit_dt <- last_fit(tree_wf,
                         split = p_split)

final_fit_dt %>%
  collect_metrics()
```

```
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.916
## 2 roc_auc  hand_till      0.929
```

---
background-image: url(img/dt-fondo.png)
background-size: cover

### Step 6.1: Confusion matrix

```r
final_fit_dt %>%
  collect_predictions() %>%
  conf_mat(species, .pred_class)  # inspect the confusion matrix
```

```
##            Truth
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        33         2      0
##   Chinstrap      2        16      0
##   Gentoo         2         1     27
```

--

```r
final_fit_dt %>%
  collect_predictions() %>%
  sens(species, .pred_class)  # macro-averaged sensitivity of the model
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 sens    macro          0.911
```

---
background-image: url(img/dt-fondo.png)
background-size: cover

# Exercise

.bg-near-white.b--purple.ba.bw2.br3.shadow-5.ph4.mt5[
#### Repeat these steps for the model with C = 0 and min_n = 5
]
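---
background-image: url(img/dt-fondo.png)
background-size: cover

#### More metrics from the same predictions

Beyond the global sensitivity, the same predictions can feed several `yardstick` metrics at once (a sketch; this particular `metric_set()` combination is an illustrative choice, not part of the original slides):

```r
# illustrative: bundle accuracy, sensitivity and specificity together
class_metrics <- metric_set(accuracy, sens, spec)

final_fit_dt %>%
  collect_predictions() %>%
  class_metrics(truth = species, estimate = .pred_class)
```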
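---
background-image: url(img/dt-fondo.png)
background-size: cover

#### Sketch for the last exercise

The steps above can be chained end to end; for instance, refitting with the hyperparameters from the exercise (a sketch using the illustrative values `cost_complexity = 0` and `min_n = 5`; the object names `tuned_tree_spec` and `tuned_wf` are suggestions):

```r
# a new spec with the exercise hyperparameters
tuned_tree_spec <- decision_tree(min_n = 5, cost_complexity = 0) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# reuse the recipe from Step 2 in a fresh workflow
tuned_wf <- workflow() %>%
  add_recipe(recipe_dt) %>%
  add_model(tuned_tree_spec)

# fit on training data, evaluate on the test split
last_fit(tuned_wf, split = p_split) %>%
  collect_metrics()
```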
---
background-image: url(img/dt-fondo.png)
background-size: cover

## Recap

.left-column[
### Step 1:
__Split the data__
* initial_split()
]
.right-column[
<img src="img/rsample.png" width="40%" style="display: block; margin: auto;" />
]

---
background-image: url(img/dt-fondo.png)
background-size: cover

## Recap

.left-column[
### Step 2:
__Preprocess the data__
* step_*()
]
.right-column[
<img src="img/recipes.png" width="40%" style="display: block; margin: auto;" />
]

---
background-image: url(img/dt-fondo.png)
background-size: cover

## Recap

.left-column[
### Step 3:
__Specify the model and its arguments__
* set_engine()
* set_mode()
]
.right-column[
<img src="img/parsnip.png" width="40%" style="display: block; margin: auto;" />
]

---
background-image: url(img/dt-fondo.png)
background-size: cover

## Recap

.left-column[
### Step 4:
__Build the workflow from the recipe and the model__
* workflow()
* add_recipe()
* add_model()
]
.right-column[
<img src="img/workflow.png" width="40%" style="display: block; margin: auto;" />
]

---
background-image: url(img/dt-fondo.png)
background-size: cover

## Recap

.left-column[
### Step 5: Tune
__Tune the hyperparameters__
* fit_resamples()
* tune_grid()
]
.right-column[
<img src="img/tune.png" width="40%" style="display: block; margin: auto;" />
]

---
background-image: url(img/dt-fondo.png)
background-size: cover

## Recap

.left-column[
### Step 6:
__Predict and compare the metrics__
* collect_metrics()
* collect_predictions() + conf_mat()
]
.right-column[
<img src="img/yardstick.png" width="40%" style="display: block; margin: auto;" />
]

---
class: inverse

# Random Forest

---
background-image: url(img/rf-fondo.png)
background-size: cover

### Step 2: Preprocess the data

```r
p_recipe <- training(p_split) %>%
  recipe(species ~ .) %>%
  step_corr(all_predictors()) %>%
  step_center(all_predictors(), -all_outcomes()) %>%
  step_scale(all_predictors(), -all_outcomes()) %>%
  prep()
p_recipe
```

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          4
## 
## Training data contained 250 data points and no missing data.
## 
## Operations:
## 
## Correlation filter removed no terms [trained]
## Centering for bill_length_mm, ... [trained]
## Scaling for bill_length_mm, ... [trained]
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

### Step 3: Specify the model

```r
rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")
```

---

### Let's see how the model performs without tuning

```r
set.seed(123)
rf_spec %>%
  fit_resamples(species ~ ., resamples = p_folds) %>%
  collect_metrics()
```

```
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy multiclass 0.972    10 0.0108 
## 2 roc_auc  hand_till  0.996    10 0.00192
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

## Random forest with mtry = 2 🐧

```r
rf2_spec <- rf_spec %>%
  set_args(mtry = 2)

set.seed(123)
rf2_spec %>%
  fit_resamples(species ~ ., resamples = p_folds) %>%
  collect_metrics()
```

```
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy multiclass 0.972    10 0.0108 
## 2 roc_auc  hand_till  0.996    10 0.00192
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

## Random forest with mtry = 3 🐧

```r
rf3_spec <- rf_spec %>%
  set_args(mtry = 3)

set.seed(123)
rf3_spec %>%
  fit_resamples(species ~ ., resamples = p_folds) %>%
  collect_metrics()
```

```
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy multiclass 0.967    10 0.0104 
## 2 roc_auc  hand_till  0.997    10 0.00115
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

## Random forest with mtry = 4 🐧

```r
rf4_spec <- rf_spec %>%
  set_args(mtry = 4)
set.seed(123)
rf4_spec %>%
  fit_resamples(species ~ ., resamples = p_folds) %>%
  collect_metrics()
```

```
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy multiclass 0.964    10 0.00972
## 2 roc_auc  hand_till  0.996    10 0.00147
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

### Automatic hyperparameter tuning 🐧

```r
tune_spec <- rand_forest(
    mtry = tune(),
    trees = 1000,
    min_n = tune()
  ) %>%
  set_mode("classification") %>%
  set_engine("ranger")
tune_spec
```

```
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
## 
## Computational engine: ranger
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

## Workflows

```r
tune_wf <- workflow() %>%
  add_recipe(p_recipe) %>%
  add_model(tune_spec)

set.seed(123)
cv_folds <- vfold_cv(p_train, strata = species)
tune_wf
```

```
## ── Workflow ────────────────────────────────────────────────────────────────
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_corr()
## • step_center()
## • step_scale()
## 
## ── Model ───────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
## 
## Computational engine: ranger
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

### Parallelize the computation

```r
doParallel::registerDoParallel()

set.seed(123)
tune_res <- tune_grid(
  tune_wf,
  resamples = cv_folds,
  grid = 20
)
tune_res
```

```
## # Tuning results
## # 10-fold cross-validation using stratification 
## # A tibble: 10 x 4
##    splits           id     .metrics          .notes          
##    <list>           <chr>  <list>            <list>          
##  1 <split [224/26]> Fold01 <tibble [40 × 6]> <tibble [0 × 1]>
##  2 <split [224/26]> Fold02 <tibble [40 × 6]> <tibble [0 × 1]>
##  3 <split [224/26]> Fold03 <tibble [40 × 6]> <tibble [0 × 1]>
##  4 <split [224/26]> Fold04 <tibble [40 × 6]> <tibble [0 × 1]>
##  5 <split [224/26]> Fold05 <tibble [40 × 6]> <tibble [0 × 1]>
##  6 <split [225/25]> Fold06 <tibble [40 × 6]> <tibble [0 × 1]>
##  7 <split [225/25]> Fold07 <tibble [40 × 6]> <tibble [0 × 1]>
##  8 <split [226/24]> Fold08 <tibble [40 × 6]> <tibble [0 × 1]>
##  9 <split [227/23]> Fold09 <tibble [40 × 6]> <tibble [0 × 1]>
## 10 <split [227/23]> Fold10 <tibble [40 × 6]> <tibble [0 × 1]>
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

## Range of values for min_n and mtry

![](clasificacion_files/figure-html/unnamed-chunk-26-1.png)<!-- -->

---
background-image: url(img/rf-fondo.png)
background-size: cover

## Pick the best model

* using the select_best() function

```r
best_auc <- select_best(tune_res, "roc_auc")

final_rf <- finalize_model(
  tune_spec,
  best_auc
)
final_rf
```

```
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 4
##   trees = 1000
##   min_n = 4
## 
## Computational engine: ranger
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

## Final fit

```r
set.seed(123)
final_wf <- workflow() %>%
  add_recipe(p_recipe) %>%
  add_model(final_rf)

final_res <- final_wf %>%
  last_fit(p_split)

final_res %>%
  collect_metrics()
```

```
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.952
## 2 roc_auc  hand_till      0.999
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

## Confusion matrix

```r
final_res %>%
  collect_predictions() %>%
  conf_mat(species, .pred_class)
```

```
##            Truth
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        39         0      1
##   Chinstrap      2        11      1
##   Gentoo         0         0     29
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

## Exercise 3

.bg-near-white.b--purple.ba.bw2.br3.shadow-5.ph4.mt5[
#### Using the iris data set, fit a random forest classifier.
]
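---
background-image: url(img/rf-fondo.png)
background-size: cover

#### Outline for Exercise 3

A possible skeleton for the iris exercise, mirroring the penguin pipeline (a sketch; the object names `iris_split`, `iris_recipe` and `iris_wf` are suggestions, not part of the original slides):

```r
set.seed(123)
iris_split <- initial_split(iris, prop = 0.75)  # 75% for training

# same preprocessing steps as the penguin recipe
iris_recipe <- recipe(Species ~ ., data = training(iris_split)) %>%
  step_corr(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

iris_wf <- workflow() %>%
  add_recipe(iris_recipe) %>%
  add_model(rand_forest() %>%
              set_engine("ranger") %>%
              set_mode("classification"))

# fit on training data, evaluate on the test split
last_fit(iris_wf, split = iris_split) %>%
  collect_metrics()
```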
---

## Recap

.left-column[
### Step 2: Recipes
__Preprocess the data__
]
.right-column[
<img src="img/recipes.png" width="40%" style="display: block; margin: auto;" />
]

---

## Recap

.left-column[
### Step 3: Parsnip
__Specify the model and its arguments__
]
.right-column[
<img src="img/parsnip.png" width="40%" style="display: block; margin: auto;" />
]

---

## Recap

.left-column[
### Step 4: Workflows
__Build the workflow from the recipe and the model__
]
.right-column[
<img src="img/workflow.png" width="40%" style="display: block; margin: auto;" />
]

---

## Recap

.left-column[
### Step 5: Tune
__Tune the hyperparameters__
]
.right-column[
<img src="img/tune.png" width="40%" style="display: block; margin: auto;" />
]

---

## Recap

.left-column[
### Step 6:
__Predict and compare the metrics__
]
.right-column[
<img src="img/yardstick.png" width="40%" style="display: block; margin: auto;" />
]

---
background-image: url(img/rf-fondo.png)
background-size: cover

### Variable importance

* the vip package

```r
library(vip)

set.seed(123)
final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(species ~ ., data = juice(p_recipe)) %>%
  vip(geom = "point") +
  theme_xaringan()
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

### Plot

![](clasificacion_files/figure-html/unnamed-chunk-36-1.png)<!-- -->

---

### Some (slightly) more realistic examples

#### [Tuning random forest hyperparameters with #TidyTuesday trees data](https://juliasilge.com/blog/sf-trees-random-tuning/)

<br><br><br><br>

#### [Hyperparameter tuning and #TidyTuesday food consumption](https://juliasilge.com/blog/food-hyperparameter-tune/)

---
background-image: url(img/biblio.png)
background-size: cover

# Bibliography

#### Max Kuhn & Julia Silge - [Tidy Modeling (in progress)](https://www.tmwr.org/)
#### Julia Silge's [personal blog](https://juliasilge.com/blog/)
#### Max Kuhn & Kjell Johnson - [Feature Engineering and Selection: A Practical Approach for Predictive Models](http://www.feat.engineering/)
#### Max Kuhn & Kjell Johnson - Applied Predictive Modeling
#### The [`tidymodels`](https://www.tidymodels.org/) documentation
#### Alison Hill's [rstudio::conf](https://conf20-intro-ml.netlify.app/materials/) and [virtual course](https://alison.rbind.io/post/2020-06-02-tidymodels-virtually/) materials

---
background-image: url(img/final-fondo.png)
background-size: cover
class: middle

# Thank you!