---
title: "Lab9: Decision trees"
author: "Vladislav Litvinov <vlad@sek1ro>"
output:
  pdf_document:
    toc: true
---
# Data preparation
```{r}
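# Working directory that holds survey.csv
setwd('/home/sek1ro/git/public/lab/ds/25-1/r')
survey <- read.csv('survey.csv')

# First 600 rows for training, the next 150 for testing
train_df <- survey[1:600, ]
test_df <- survey[601:750, ]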
```
# Building classification tree
The decision formula is `MYDEPV ~ Price + Income + Age`.

Use three-fold cross-validation and the information gain splitting index.
Which features were actually used to construct the tree?
Plot the tree using the `rpart.plot` package.

Three-fold cross-validation splits the training data into three parts (A, B, C) and makes three runs:

- Run 1: train on B + C, test on A.
- Run 2: train on A + C, test on B.
- Run 3: train on A + B, test on C.

This yields three values of the chosen metric (accuracy, F1, MSE, etc.).
Their mean is the final estimate of model quality.
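
The three runs can be written out by hand. The sketch below is only an illustration: it uses the `kyphosis` data set that ships with `rpart` instead of `survey.csv`, and the seed value and the names `fold` and `accuracy` are assumptions made for this example.

```{r}
library(rpart)

set.seed(1)                               # fold assignment is random
k <- 3
n <- nrow(kyphosis)
fold <- sample(rep(1:k, length.out = n))  # fold label (1..k) for every row

# For each fold: train on the other two folds, test on this one
accuracy <- sapply(1:k, function(i) {
  train <- kyphosis[fold != i, ]
  test  <- kyphosis[fold == i, ]
  fit   <- rpart(Kyphosis ~ Age + Number + Start, data = train, method = "class")
  mean(predict(fit, test, type = "class") == test$Kyphosis)
})

accuracy        # one accuracy value per run
mean(accuracy)  # cross-validated estimate
```

With `xval = 3`, `rpart` runs a comparable procedure internally and reports the result in the `xerror` column of the CP table.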

`rpart` discards a feature on its own when it does not improve a split according to information gain.
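
One way to see which features survived is to read the split variables out of the fitted object. This is a sketch on the `kyphosis` data set from `rpart`, not on the survey data; the names `fit` and `used` are made up for the example.

```{r}
library(rpart)

fit <- rpart(
  Kyphosis ~ Age + Number + Start,
  data = kyphosis,
  method = "class",
  parms = list(split = "information")
)

# fit$frame has one row per node; leaves are marked "<leaf>" in the var column
used <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
used
```

`printcp()` reports the same information under "Variables actually used in tree construction".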

The CP table relates tree complexity to error:

- Root node error: the error with no splits.
- `nsplit`: the number of splits.
- `rel error`: the training error relative to the root.
- `xerror`: the cross-validation error.
- `xstd`: the standard error of `xerror`.

Arguments of `rpart.plot`:

- `type`: where the split labels are placed.
- `extra`: additional information shown in the nodes.
- `fallen.leaves`: aligns the leaves at the bottom of the plot.

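Entropy of a two-class node, where $x$ is the share of one class:

$$H = -x \log(x) - (1 - x) \log(1 - x)$$

Information gain of a split on attribute $A$, which the tree maximizes:

$$\mathrm{Gain}(A) = \mathrm{Info}(S) - \mathrm{Info}(S_A)$$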
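
Both formulas can be tried on a toy example. The sketch below assumes base-2 logarithms (the notes do not fix the base) and takes Info(S_A) to be the size-weighted average entropy of the groups produced by the split; `entropy`, `info_gain`, and the toy vectors are hypothetical names and data.

```{r}
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]  # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by the grouping vector `groups`
info_gain <- function(labels, groups) {
  parts <- split(labels, groups)
  weights <- sapply(parts, length) / length(labels)
  child_entropy <- sapply(parts, entropy)
  entropy(labels) - sum(weights * child_entropy)
}

labels  <- c("yes", "yes", "yes", "no", "no", "no", "no", "yes")
perfect <- c("a", "a", "a", "b", "b", "b", "b", "a")  # separates the classes exactly
useless <- c("a", "b", "a", "b", "a", "b", "a", "b")  # each group is still half yes, half no

entropy(labels)             # 1 bit: the classes are balanced
info_gain(labels, perfect)  # 1: all uncertainty removed
info_gain(labels, useless)  # 0: nothing learned
```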

Early stopping: limit the tree depth or require a minimum number of examples in a node.
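
In `rpart` these limits are set through `rpart.control()`. The values below are arbitrary and chosen only for illustration, and the `kyphosis` data set stands in for the survey data.

```{r}
library(rpart)

ctrl <- rpart.control(
  maxdepth  = 2,   # no node deeper than 2 levels below the root
  minsplit  = 20,  # a node needs at least 20 examples before a split is attempted
  minbucket = 7    # every leaf must keep at least 7 examples
)

shallow <- rpart(
  Kyphosis ~ Age + Number + Start,
  data = kyphosis,
  method = "class",
  control = ctrl
)

# Number of examples in each leaf of the restricted tree
leaf_sizes <- shallow$frame$n[shallow$frame$var == "<leaf>"]
leaf_sizes
```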

Pruning:

1. Build the full tree, in which every leaf contains examples of a single class.
2. Compute two measures: the relative accuracy of the model and the absolute error.
3. Remove the leaves and nodes whose loss has the least effect on the model's accuracy and on the growth of the error.
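
A rough counterpart in `rpart` is to grow an unrestricted tree and cut it back with `prune()`. Note that `rpart` uses cost-complexity pruning driven by the CP table, which is not necessarily the exact procedure listed above. The data set (`kyphosis`), the seed, and the rule of picking the `CP` with the lowest `xerror` are assumptions made for this sketch.

```{r}
library(rpart)

set.seed(1)  # xerror comes from random cross-validation folds

# cp = 0 and minsplit = 2 let the tree grow until no further split is possible
full <- rpart(
  Kyphosis ~ Age + Number + Start,
  data = kyphosis,
  method = "class",
  control = rpart.control(cp = 0, minsplit = 2, xval = 3)
)

# Pick the complexity parameter with the lowest cross-validation error
cp_table <- full$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned   <- prune(full, cp = best_cp)

nrow(full$frame)    # number of nodes before pruning
nrow(pruned$frame)  # number of nodes after pruning
```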


```{r}
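library(rpart)
library(rpart.plot)

# The cross-validation folds are drawn at random; the seed value is arbitrary
set.seed(42)

tree <- rpart(
  MYDEPV ~ Price + Income + Age,
  data = train_df,
  method = "class",                     # classification tree
  parms = list(split = "information"),  # information gain splitting index
  control = rpart.control(xval = 3)     # three-fold cross-validation
)
# Prints the variables actually used and the CP table
printcp(tree)

rpart.plot(
  tree,
  type = 1,
  # extra = 106 combines two options:
  #   6: for class models, show the probability of the second class only
  #      (useful for binary responses)
  # 100: also show the percentage of observations in the node
  extra = 106,
  fallen.leaves = TRUE
)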
```
Score the model with the training data and create the model’s confusion matrix.  Which class of MYDEPV was the model better able to classify?
```{r}
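# Predicted class for every training row
pred_class <- predict(tree, train_df, type = "class")

# Rows are actual classes, columns are predicted classes
conf_mat <- table(
  Actual = train_df$MYDEPV,
  Predicted = pred_class
)

conf_mat
# Share of each actual class that was classified correctly
print(diag(conf_mat) / rowSums(conf_mat))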
```
Define the resubstitution error rate, and then calculate it using the confusion matrix from the previous step.  Is it a good indicator of predictive performance?  Why or why not?

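The resubstitution error rate is the share of wrong predictions on the same data the model was trained on. Because the model has already seen these examples, the estimate is optimistically biased and is not a reliable indicator of predictive performance on new data.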
```{r}
print(1 - sum(diag(conf_mat)) / sum(conf_mat))
```
ROC (Receiver Operating Characteristic) curve:

- x axis: FPR = FP / (FP + TN)
- y axis: TPR = TP / (TP + FN)
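
Each point of the curve comes from one classification threshold. The sketch below computes the points by hand in base R; the toy vectors `actual` and `score` and the helper `roc_point` are made up for illustration.

```{r}
# Toy data: 1 = positive class, 0 = negative class
actual <- c(1, 1, 0, 1, 0, 0, 1, 0)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2)

# One ROC point: predict positive when the score reaches the threshold
roc_point <- function(threshold) {
  predicted <- as.integer(score >= threshold)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  c(fpr = fp / (fp + tn), tpr = tp / (tp + fn))
}

# Sweep the threshold from "nothing is positive" to "everything is positive"
thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
roc <- t(sapply(thresholds, roc_point))
roc
```

The first row is the point (0, 0) and the last row is (1, 1); a better classifier pushes the points in between toward the top-left corner.
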
```{r}
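library(ROCR)

# Predicted probability of the second class for every training row
pred_prob <- predict(tree, train_df, type = "prob")[, 2]

pred <- prediction(pred_prob, train_df$MYDEPV)
perf <- performance(pred, "tpr", "fpr")

# ROC curve, with the diagonal of a random classifier for reference
plot(perf)
abline(a = 0, b = 1)

# Area under the ROC curve
auc_perf <- performance(pred, measure = "auc")
auc_value <- auc_perf@y.values[[1]]
auc_value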
```
Score the model with the testing data.  How accurate are the tree’s predictions?
Repeat part (a), but set the splitting index to the Gini coefficient splitting index.  How does the new tree compare to the previous one?

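The Gini index shows how often a randomly chosen example from the training set would be classified incorrectly.

$$\mathrm{Gini}(Q) = 1 - \sum_i p_i^2$$

A split is chosen to maximize the reduction of this index, that is, to minimize the index of the resulting nodes.

- 0: all examples belong to one class.
- $1 - 1/K$: all $K$ classes are equally likely (0.5 for two classes); the value approaches 1 as $K$ grows.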
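
The formula can be checked on a few small label vectors. The helper `gini` and the vectors are made up for this sketch.

```{r}
# Gini index of a vector of class labels: 1 minus the sum of squared class shares
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

gini(c("a", "a", "a", "a"))  # 0: a pure node
gini(c("a", "a", "b", "b"))  # 0.5: two equally likely classes
gini(c("a", "b", "c", "d"))  # 0.75: four equally likely classes
```
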
```{r}
```