renaming

2026-02-17 23:13:20 +03:00
parent 65218abfb1
commit e52dde575a
429 changed files with 875 additions and 14 deletions
--- a/science/r/9.Rmd
+++ b/science/r/9.Rmd
@ -0,0 +1,123 @@
+---
+title: "Lab9: Decision trees"
+author: "Vladislav Litvinov <vlad@sek1ro>"
+output:
+  pdf_document:
+    toc: true
+---
+# Data preparation
+```{r}
+setwd('/home/sek1ro/git/public/lab/ds/25-1/r')
+survey <- read.csv('survey.csv')
+
+train_df = survey[1:600,]
+test_df = survey[601:750,]
+```
+# Building classification tree
+The model formula is `MYDEPV ~ Price + Income + Age`.
+
+Use three-fold cross-validation and the information gain splitting index.
+Which features were actually used to construct the tree?
+Plot the tree using the `rpart.plot` package.
+
+Three-fold cross-validation splits the training data into three parts (A, B, C) and makes three runs:
+
+- Run 1: train on B + C, test on A
+- Run 2: train on A + C, test on B
+- Run 3: train on A + B, test on C
+
+This gives three values of the metric (accuracy, F1, MSE, etc.).
+Their mean is the final estimate of model quality.
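+
+The three runs can also be written out by hand. This is only an illustrative sketch, assuming `train_df` from the data preparation chunk; the names `k`, `folds` and `fold_error` and the seed are made up for the example. `rpart` runs its own internal cross-validation through `xval`, so this loop is not part of the model fit.
+```{r}
+library(rpart)
+
+set.seed(1)  # arbitrary seed so the random fold assignment is repeatable
+k <- 3
+# Randomly assign every training row to one of the k folds (A, B, C)
+folds <- sample(rep(1:k, length.out = nrow(train_df)))
+
+fold_error <- sapply(1:k, function(i) {
+  # Fit on the two folds that are not held out
+  fit <- rpart(
+    MYDEPV ~ Price + Income + Age,
+    data = train_df[folds != i, ],
+    method = "class",
+    parms = list(split = "information")
+  )
+  # Misclassification rate on the held-out fold
+  pred <- predict(fit, train_df[folds == i, ], type = "class")
+  mean(as.character(pred) != as.character(train_df$MYDEPV[folds == i]))
+})
+
+fold_error        # one error rate per run
+mean(fold_error)  # averaged cross-validation estimate
+```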
+
+`rpart` drops a feature on its own if it never improves a split under information gain; the output of `printcp` lists the variables actually used in tree construction.
+
+CP table: relates tree complexity to error.
+
+- Root node error: the error with no splits
+- `nsplit`: the number of splits
+- `rel error`: the training error relative to the root node error
+- `xerror`: the cross-validation error, on the same relative scale
+- `xstd`: the standard error of `xerror`
+
+Arguments of `rpart.plot`:
+
+- `type`: how the splits are placed and labelled
+- `extra`: extra information shown in the nodes
+- `fallen.leaves`: align the leaves at the bottom of the plot
+
+$$H = -x \log(x) - (1 - x) \log(1 - x)$$
+
+$$\mathrm{Gain}(A) = \mathrm{Info}(S) - \mathrm{Info}(S_A)$$
+
+The split on the attribute $A$ with the largest gain is chosen.
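+
+A small worked example of the entropy and gain formulas may help. The node counts below are hypothetical, the logarithm base is assumed to be 2 (the notes do not fix it), and `entropy`, `info_parent`, `info_split` and `gain` are names introduced only for this sketch.
+```{r}
+# Binary entropy H(x) in bits; taken to be 0 at x = 0 and x = 1
+entropy <- function(x) {
+  ifelse(x == 0 | x == 1, 0, -x * log2(x) - (1 - x) * log2(1 - x))
+}
+
+# Hypothetical node: 10 examples, 5 of them positive
+info_parent <- entropy(5 / 10)
+
+# Hypothetical split: left child has 4 examples (all positive),
+# right child has 6 examples (1 positive); weight each child by its size
+info_split <- (4 / 10) * entropy(4 / 4) + (6 / 10) * entropy(1 / 6)
+
+gain <- info_parent - info_split
+c(parent = info_parent, after_split = info_split, gain = gain)
+```
+The parent node is a 50/50 mix, so its entropy is exactly 1 bit; the gain is whatever part of that the split removes.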
+
+Early stopping: limit the depth of the tree, or require a minimum number of examples in a node.
+
+Pruning:
+
+1. Build the full tree, in which every leaf contains examples of a single class.
+2. Compute two measures: the relative accuracy of the model and the absolute error.
+3. Remove the leaves and nodes whose loss has the smallest effect on the model's accuracy and on the increase in error.
+
+
+```{r}
+library(rpart)
+set.seed(42)  # arbitrary seed: cross-validation folds are random, so fix it for a reproducible xerror
+tree = rpart(
+  MYDEPV ~ Price + Income + Age,
+  data = train_df,
+  method = "class",
+  parms = list(split = "information"),  # split on information gain
+  control = rpart.control(xval = 3)     # three-fold cross-validation
+)
+printcp(tree)
+
+library(rpart.plot)
+
+# extra = 106 is 100 + 6:
+#   6   for class models, show the probability of the second class only
+#       (useful for binary responses)
+#   100 also display the percentage of observations in the node
+rpart.plot(
+  tree,
+  type = 1,
+  extra = 106,
+  fallen.leaves = TRUE
+)
+```
+Score the model with the training data and create the model’s confusion matrix.  Which class of MYDEPV was the model better able to classify?
+```{r}
+pred_class = predict(tree, train_df, type="class")
+
+conf_mat = table(
+  Actual = train_df$MYDEPV,
+  Predicted = pred_class
+)
+
+conf_mat
+print(diag(conf_mat) / rowSums(conf_mat))
+```
+Define the resubstitution error rate, and then calculate it using the confusion matrix from the previous step.  Is it a good indicator of predictive performance?  Why or why not?
+
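+The resubstitution error rate is the share of incorrect predictions on the same data the model was trained on. Because the tree has already seen these examples, the rate is optimistically biased and is not a reliable indicator of predictive performance on new data.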
+```{r}
+print(1 - sum(diag(conf_mat)) / sum(conf_mat))
+```
+ROC curve (Receiver Operating Characteristic):
+
+- x axis: FPR = FP / (FP + TN)
+- y axis: TPR = TP / (TP + FN)
+```{r}
+pred_prob = predict(tree, train_df, type="prob")[,2]
+
+library(ROCR)
+pred = prediction(pred_prob, train_df$MYDEPV)
+perf = performance(pred, "tpr", "fpr")
+
+plot(perf)
+abline(a = 0, b = 1)
+
+auc_perf = performance(pred, measure = "auc")
+auc_value = auc_perf@y.values[[1]]
+auc_value
+```
+Score the model with the testing data.  How accurate are the tree’s predictions?
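+
+A minimal sketch for this step, assuming `tree` and `test_df` from the chunks above; `test_pred` and `test_conf_mat` are names introduced here. It mirrors the training-data scoring, but on rows the tree never saw.
+```{r}
+# Predict classes for the held-out rows (601:750)
+test_pred = predict(tree, test_df, type = "class")
+
+test_conf_mat = table(
+  Actual = test_df$MYDEPV,
+  Predicted = test_pred
+)
+
+test_conf_mat
+# Overall accuracy: share of test rows whose predicted class matches the actual one
+print(mean(as.character(test_pred) == as.character(test_df$MYDEPV)))
+```
+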
+Repeat part (a), but set the splitting index to the Gini coefficient splitting index.  How does the new tree compare to the previous one? 
+
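+The Gini index shows how often a randomly chosen example from the training set would be classified incorrectly if it were labelled at random according to the class distribution in the node.
+
+$$\mathrm{Gini}(Q) = 1 - \sum_i p_i^2$$
+
+Splits are chosen to minimize it.
+
+- 0: all examples belong to one class
+- $1 - 1/k$: all $k$ classes are equally likely (0.5 for two classes)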
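+
+The boundary values of the Gini index can be checked numerically. This is a small sketch; `gini` is a helper defined only for this example and takes the vector of class proportions in a node. For $k$ equally likely classes it returns $1 - 1/k$.
+```{r}
+# Gini impurity of a node given its vector of class proportions
+gini <- function(p) 1 - sum(p^2)
+
+gini(c(1, 0))        # pure node
+gini(c(0.5, 0.5))    # two equally likely classes
+gini(rep(1 / 4, 4))  # four equally likely classes
+```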
+```{r}
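+# Sketch only: refit the earlier tree with the Gini splitting index.
+# Assumes train_df exists and that rpart and rpart.plot were loaded in the
+# chunks above; tree_gini is a name introduced here for illustration.
+tree_gini = rpart(
+  MYDEPV ~ Price + Income + Age,
+  data = train_df,
+  method = "class",
+  parms = list(split = "gini"),
+  control = rpart.control(xval = 3)
+)
+printcp(tree_gini)
+
+# Same plot settings as for the information-gain tree, for a side-by-side comparison
+rpart.plot(tree_gini, type = 1, extra = 106, fallen.leaves = TRUE)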
+```