The function `read_csv2` from the `readr` package, part of the [tidyverse](https://www.tidyverse.org/) can hande this automatically. We also specify the column names, corresponding to the budgetary items, plus the corresponding year, while loading the data set into `R`:
```{r data_loadings, message = FALSE}
state_budget <- readr::read_csv2("budget.csv",
col_names = c(
"Year",
"Governments",
"Agriculture",
"Business and Industry",
"Employment",
"Housing",
"Education",
"Welfare",
"Veterans",
"Defence",
"Debt Interest",
"Others"
))
```

# Basic descriptive analysis
## Data table summary
**Have a look at the head of the data table $\mathbf{X}$**
`kable` is used adapt the formatting to the type of output: HTML, screen, PDF
```{r data header}
state_budget %>% head() %>% knitr::kable()
```
The following function is useful to have a quick look at a data frame a check types of each variables:
```{r data glimpse}
glimpse(state_budget)
```

Each variable (i.e. budget item) takes it value in $[0,100]$ (proportion of the current budget). **Propose a plot to summarize the distribution of these proportions**
```{r boxplot}
state_budget %>%
select(-Year) %>%
pivot_longer(everything()) %>%
ggplot() +
aes(x = name, y = value) + labs(x = "budget item", y = "value (proportion)") +
geom_boxplot() + geom_point(alpha = 0.5) +
theme(axis.text.x = element_text(angle = 90))
```

### Data transformation
**Create a categorical variable (factor) with a couple of level to regroup samples by time interval that you find relevant in history of 20th century. Check that these period are balanced.**
We identify 4 historical periods to regroup the 'Year' variable (the id of the sample) into four clusters that might be interesting for interpreting the the PCA. We amend our data frame by adding a column encoding this new descriptive variable (that will _not_ be used in the PCA, of course!).
```{r period}
state_budget <- state_budget %>%
mutate(Period = cut(Year,
breaks = c(-Inf, 1900, 1920, 1947, Inf), right = FALSE,
labels = c("<1900", "[1900, 1920)", "(1920, 1947]", ">= 1947"))
)
```

## Scatterplot/pairs plot
When only few variables are at play, plotting the scatterplot or (pairs-to-pairs plot) might help in finding obvious relations.
**Use the function `scatmat` from \{GGally\} to do that**
```{r scatterplot, fig.width = 8, fig.height = 8}
state_budget %>%
dplyr::select(-Year) %>% mutate_if(is_double, scale) %>%
GGally::scatmat(color = "Period")
```

When many more continuous variables are observed, a quick glance at the redondancy in information may be explored by representing the correlation matrix, optionally regrouped according to similarity (here, hierarchical clustering).
**Use the package \{corrplot\} to check correlations between variables. What do you conclude?**
```{r correlation matrix, fig.width = 8, fig.height = 8}
par(mfrow = c(1,2))
state_budget %>% dplyr::select(-Year, -Period) %>%
cor() %>% corrplot::corrplot()
state_budget %>% dplyr::select(-Year, -Period) %>%
cor() %>% corrplot::corrplot(method = "color", order = "hclust")
```
It seems that some variables carry the same information: dimension reduction might help. The data is probably living in a smaller space than $p=11$.

# Principal Component Anaysis
## Performing PCA
Before performing the PCA, we center (always!) and scale the continuous columns in the data table, in order to give the same weight to items with high or lower percentage.
**Form the matrix $\mathbf{X}$ of continuous scaled variables and perform the PCA with \{FactoMineR\}**
```{r PCA}
X <- state_budget %>%
mutate_if(is_double, scale) %>%
select(-Year, -Period)
rownames(X) <- state_budget$Year
myPCA <- FactoMineR::PCA(
X = X, # the data on which PCA is performed
scale.unit = FALSE, # scaling has been made "manually"
graph = FALSE, # to make plot right now
ncp = 11 # keep all component
)
```

### Quick Recap on PCA
Recall that, essentially, a PCA and all important quantities is obtained by computing
the eigen decomposition of the empirical covariance matrix:
\begin{equation*}
\hat{\boldsymbol \Sigma} = \mathbf{X}^\top \mathbf{X} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top.
\end{equation*}
Equivalently -- and this is how it is computed in most program -- PCA is obtained by performing the _Singular Value Decomposition_ of the data matrix $\mathbf{X}$, i.e.,
\begin{equation*}
\mathbf{X} = \mathbf{V} \boldsymbol{\Lambda}^{1/2} \mathbf{U}^\top.
\end{equation*}
We use the following vocabulary:
- the weights, or rotation matrix, or loadings, is the orthogonal matrix $\mathbf{W} = \mathbf{U}$
- the score matrix, or principal components, is the matrix $\mathbf{F} = \mathbf{X}\mathbf{U} = \mathbf{V}\mathbf{\Lambda}^{1/2}$.
- the singular values are stack in the diagonal matrix $\mathbf{\Lambda}^{1/2}$.
- the eigen values $\lambda_j$ of $\boldsymbol\Sigma$ are the square of the singular values of $\mathbf{X}$.
All in all, a PCA is a matrix factorisation such that
\begin{equation*}
\mathbf{F} \mathbf{U}^\top = \mathbf{X}.
\end{equation*}
## Scree plot
The first diagnostic plot to make is a scree plot, which display the proportion of explained variances by the successive component. Recall that is related to the eigen values of the empirical covariance:
\begin{equation*}
\mathrm{percent}(\mathrm{var}_{j}) = \frac{\lambda_j}{\sum_{j=1}^p \lambda_j}.
\end{equation*}
**Make a scree plot. Comment!**
```{r variances}
factoextra::fviz_eig(myPCA, addlabels = TRUE, ylim = c(0,50))
```
Remark that the last 6 axes (> 50% of the number of variables) explain less than 10% of the total variance: this is an important argument toward dimension reduction.

## Variables study
All information about variables can be reached as follow:
```{r variables summary}
var <- get_pca_var(myPCA)
var
```
### Eigen vector / Weights (a.k.a. loadings)
The eigen vectors of the empirical covariance matrix give weights for obtaining the new variables as a linear combinaison of the original ones.
**Check that the right eigen vectors of $\mathbf{X}$, and the eigen vectors or $\boldsymbol\Sigma$ match with the loadings. Also check that it is orthogonal**
```{r eigenvectors}
U <- data.frame(myPCA$svd$V) ## eigen(cov(X))$vector
dimnames(U) <- dimnames(var$coord)
kable(U, digits = 3)
```
This is a _rotation matrix_, and thus an orthogonal matrix:
```{r orthogonality}
sum((t(U) - solve(U))^2)
```

The coordinate of the variables projected on the correlation circle are equal to $\mathbf{U} \sqrt{\mathbf{\Lambda}}$, the correlation between the new variables and the original ones. Indeed,
```{r coord var}
U <- as.matrix(U)
D <- diag(myPCA$svd$vs)
sum((U %*% D - var$coord)^2)
```
**Check the contribution of the original variables to the new axis. Comment**
The contribution of the original variable to the (e.g first two) components are given in percentage by the field `contrib`:
```{r contrib var}
fviz_contrib(myPCA, choice = "var", axes = 1:2)
```

### Correlation circle
The correlation circle gives a quick representation of how new and original variables are related together.
**Make the plot, comment! Also check the quality of the representation if the variables**
```{r circle}
kable(var$cor, digits = 3)
fviz_pca_var(myPCA, col.var = "cos2", axes = 1:2)
```
These quantities measure the correlation between the variabes in the new and the original bases.
The quality of the representation of the original variables by the new ones is measure with the (squared) cosine between them: a high cos2 indicates a good representation of the variable on the principal component.
```{r, out.width = "50%", fig.align='center', eval=FALSE, echo=FALSE}
corrplot(var$cos2, is.corr = FALSE, method = 'color')
fviz_cos2(myPCA, choice = "var", axes = 1:2)
```

## Indiviual Factor Map
All information about variables can ge reached as follow:
```{r individual}
ind <- get_pca_ind(myPCA)
ind
```
### The principal components / Scores
The _principal components_ are the the coordinate of the points in the new basis, after rotating the original data by the matrix of eigen vectors $U$.
**Compute them by yourself and check that it matches the coordinate of the individuals**
```{r PC manual}
PC <- as.matrix(X) %*% U
kable(t(PC[, 1:2]), digits = 3)
```
This match the coordinate computed by `FactoMineR`:
```{r}
kable(t(ind$coord[, 1:2]), digits = 3)
```

The individual factor map is the representation of the principal components/scores in the specified axes.
**Use \{factoextra\} to represent the individuals in the new basis, on axes (1,2), (1,3), (2,3). Add some color related to the period. Comment**
Here, we change the size of the point according to the quality of their representation (measure with the cosine to the square, just like with variable factor map).
```{r ind map}
fviz_pca_ind (myPCA,
pointsize = "cos2",
pointshape = 21,
fill = "#E7B800",
repel = TRUE ,
axes = c(1, 2))
```
Let us enrich our plot by the Period, a qualitative variable that we wish related with the budget.
```{r Period}
par(mfrow = c(1,3))
fviz_pca_ind (myPCA,
geom.ind = "point",
col.ind = state_budget$Period,
palette = c("#00AFBB", "#E7B800", "#FC4E07","#2E9FDF", "#000000"),
legend.title = "Period",
axes = c(1, 2))
```

## Biplot
When there are only few variables (only tens), it is possible to give an unifying representation of individual and variable factor maps into a single plot, called _biplot_.
**Make the biplot and comment**
```{r biplot}
fviz_pca_biplot (myPCA,
geom.ind = "point",
col.ind = state_budget$Period,
palette = c("#00AFBB", "#E7B800", "#FC4E07","#2E9FDF", "#000000"),
legend.title = "Period",
axes = c(1, 2))
```
This plot is especially helpful to identifiy groups of individuals, how the new components are related to the original variables and finally which group individuals is carrying which part of the information/variables.

## References