---
title: "Lab10: Time Series"
author: "Vladislav Litvinov <vlad@sek1ro>"
output:
  pdf_document:
    toc: TRUE
---
Plotting the data set
```{r}
setwd('/home/sek1ro/git/public/lab/ds/25-1/r')
jj = scan("jj.dat")                               # quarterly earnings per share (EPS)
jj_ts = ts(jj, start = c(1960, 1), frequency = 4) # quarterly series starting 1960 Q1
jj_ts
plot(jj_ts, ylab = "EPS", xlab = "Year")
```
In order to fit an ARIMA model, the time series needs to be transformed to remove any trend. Plot the difference of $x_t$ and $x_{t-1}$ for all $t > 0$. Has this difference adequately detrended the series? Does the variability of the EPS appear constant over time? Why does constant variance matter?
```{r}
jj_diff = diff(jj_ts)
plot(jj_diff, xlab = "Year", ylab = "EPS diff")
```
Plot the $\log_{10}$ of the quarterly EPS vs. time, and plot the difference of $\log_{10}(x_t)$ and $\log_{10}(x_{t-1})$ for all $t > 0$. Has this adequately detrended the series? Has the variability of the differenced $\log_{10}(\text{EPS})$ become more constant?
```{r}
log_jj = log10(jj_ts)
log_jj_diff = diff(log_jj)
plot(log_jj, xlab = "Year", ylab = "log10(EPS)")
plot(log_jj_diff, xlab = "Year", ylab = "log10(EPS) diff")
```
Treating the differenced log10 of the EPS series as a stationary series, plot the ACF and PACF of this series. What possible ARIMA models would you consider and why?
$\mathrm{ACF}(k) = \mathrm{Corr}(x_t, x_{t-k})$ is the autocorrelation function: it measures how strongly the series correlates with itself at lag $k$.
The PACF (partial autocorrelation function) measures the direct relationship between $x_t$ and $x_{t-k}$ after the influence of all intermediate values between $t$ and $t-k$ has been removed; it is the last coefficient in the AR($k$) regression
$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_k x_{t-k} + \varepsilon_t,$$
so $\mathrm{PACF}(k) = \phi_k$.
For an ARMA(p, q) model:
p is the AR part (regression on previous values), identified from the PACF;
q is the MA part (regression on previous forecast errors), identified from the ACF.
For ARIMA(p, d, q),
d is the I (integration) part: the number of differences applied to make the series stationary.
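As a quick sanity check of the definitions above (a minimal sketch on a simulated AR(1) series; the seed and the coefficient 0.6 are arbitrary choices, not from the lab data), the lag-1 PACF coincides with the lag-1 ACF, since there are no intermediate values to remove at lag 1:

```{r}
set.seed(1)
x <- arima.sim(n = 5000, list(ar = 0.6))   # simulated AR(1), phi = 0.6 (arbitrary)
a1 <- acf(x, plot = FALSE)$acf[2]          # sample ACF at lag 1 (index 1 holds lag 0)
p1 <- pacf(x, plot = FALSE)$acf[1]         # sample PACF at lag 1
c(a1, p1)                                  # the two values coincide
```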
```{r}
acf(log_jj_diff, lag.max = 20)
ar(log_jj_diff)   # Yule-Walker fit; selects the AR order by AIC
pacf(log_jj_diff, lag.max = 20)
```
Run the proposed ARIMA models from part d and compare the results. Identify an appropriate model. Justify your choice.
The idea: AIC balances
goodness of fit (the better the model describes the data, the higher the likelihood $L$) against
model complexity (the more parameters $k$, the higher the risk of overfitting):
$$\mathrm{AIC} = 2k - 2\ln(L).$$
Larger $L$ and smaller $k$ both lower the AIC, so the model with the smallest AIC is preferred.
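To make the formula concrete (a minimal sketch; the simulated series and the AR(1) order are arbitrary, not part of the lab data), the value reported by `AIC()` for an `arima` fit can be reproduced directly from the log-likelihood and the parameter count, where $k$ counts the estimated coefficients plus the innovation variance:

```{r}
set.seed(42)
y <- arima.sim(n = 200, list(ar = 0.5))  # toy series (arbitrary)
fit <- arima(y, order = c(1, 0, 0))
k <- length(fit$coef) + 1                # ar1 and intercept, plus sigma^2
2 * k - 2 * fit$loglik                   # manual AIC
AIC(fit)                                 # matches the manual value
```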
Why is the choice of natural log or log base 10 in Problem 4.8 somewhat irrelevant to the transformation and the analysis?
Why is the value of the ACF for lag 0 equal to one?
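Both questions reduce to short identities (a sketch of the reasoning, not part of the original answers). For the choice of base, the change-of-base formula
$$\log_{10}(x_t) = \frac{\ln(x_t)}{\ln(10)}$$
means the two transforms, and hence their differences, differ only by the constant factor $1/\ln(10)$; a constant rescaling changes neither the shape of the series nor any correlation, so the analysis is unaffected. For lag 0,
$$\mathrm{ACF}(0) = \frac{\mathrm{Cov}(x_t, x_t)}{\mathrm{Var}(x_t)} = \frac{\mathrm{Var}(x_t)}{\mathrm{Var}(x_t)} = 1,$$
since any series is perfectly correlated with itself.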
```{r}
library(forecast)
fit_model = function(order) {
  Arima(log_jj_diff, order = order)
}
models <- list(
  "1, 0, 1" = fit_model(c(1, 0, 1)),
  "1, 1, 1" = fit_model(c(1, 1, 1)),  # d = 1 differences the already-differenced series again
  "1, 0, 5" = fit_model(c(1, 0, 5)),
  "1, 1, 5" = fit_model(c(1, 1, 5))
)
print(models[["1, 0, 5"]])            # [[ ]] extracts the fit itself, not a one-element list
aic_values <- sapply(models, AIC)
print(aic_values)
```
By AIC, ARIMA(1, 0, 5) is the preferred model.
```{r}
n = 10000
phi4 = c(-0.18)                            # AR(1) coefficient
AR <- arima.sim(n = n, list(ar = phi4[1]))
plot(AR, main = "AR series")
acf(AR, main = "ACF AR")                   # tails off gradually for an AR process
pacf(AR, main = "PACF AR")                 # cuts off after lag 1 for AR(1)
theta4 <- c(-0.65, -0.22, -0.28, 1, -0.4)  # MA(5) coefficients
MA <- arima.sim(n = n, list(ma = theta4))
plot(MA, main = "MA series")
acf(MA, main = "ACF MA")                   # cuts off after lag 5 for MA(5)
pacf(MA, main = "PACF MA")                 # tails off gradually for an MA process
```
```{r}
fit <- auto.arima(jj_ts)
summary(fit)
forecasted_values <- forecast(fit, h = 20)
plot(forecasted_values)
```