If you want to enjoy and discover subtle Rmarkdown syntax, you can access the source code of this document on the github repository: https://github.com/jchiquet/MAP573/blob/master/homework_1_reporting_correction.Rmd

Let us load a couple of packages useful in our analyses:

```
library(tidyverse)
library(simputation)
```

The first part of this assignment is to use the swirl package to continue your basic R training.

*A bit long but essentially straightforward.*

The ‘fertility.csv’ data set contains the female fertility in Europe in 2012 for 39 European countries splitted by age groups 15-19, 20-24, 25-29, 30-34, 35-39, 40 and over. The fertility is given as the mean number of children born per 1000 women.

- Download and read the data set from the data directory of MAP573’s github repository (use, e.g.
`read.csv`

). Store it into a data frame called fertility.

I suggest two solutions, one in base R and another one using `readr`

.

```
## base R
<- read.csv("data/fertility.csv", row.names = 1, header = TRUE) %>% as_tibble()
fertility ## readr
<- read_csv("data/fertility.csv") %>% select(-id) fertility
```

- Display the first 10 rows in the R markdown report.

The `kable`

function is useful to format table-like printing The formating is adapted to the type of output (PDF, HTML, doc).

`%>% head() %>% knitr::kable() fertility `

Pays | Region | Age | Fertilite |
---|---|---|---|

Albanie | Europe sud | 15-19 | 20.8 |

Albanie | Europe sud | 20-24 | 107.1 |

Albanie | Europe sud | 25-29 | 126.7 |

Albanie | Europe sud | 30-34 | 72.3 |

Albanie | Europe sud | 35-39 | 24.4 |

Albanie | Europe sud | >=40 | 5.5 |

- Make a summary of the data and display it (command
`summary`

)

The **tibble**’s command `glimpse`

allows us to have a quick overview of the structure of your table:

`%>% tibble::glimpse() fertility `

```
## Rows: 234
## Columns: 4
## $ Pays <chr> "Albanie", "Albanie", "Albanie", "Albanie", "Albanie", "Alb…
## $ Region <chr> "Europe sud", "Europe sud", "Europe sud", "Europe sud", "Eu…
## $ Age <chr> "15-19", "20-24", "25-29", "30-34", "35-39", ">=40", "15-19…
## $ Fertilite <dbl> 20.8, 107.1, 126.7, 72.3, 24.4, 5.5, 8.0, 36.4, 78.9, 94.6,…
```

The command `summary`

- when applied on a table like object - gives a summary statistics of each column, depending on its type (factor, numeric).

`%>% summary() fertility `

```
## Pays Region Age Fertilite
## Length:234 Length:234 Length:234 Min. : 3.50
## Class :character Class :character Class :character 1st Qu.: 14.82
## Mode :character Mode :character Mode :character Median : 47.75
## Mean : 52.59
## 3rd Qu.: 84.97
## Max. :136.70
```

- Find the country and age group with the highest, smallest value in each Region

Let us try to answer this in a (more or less) tidy way:

```
%>%
fertility group_by(Region) %>%
summarise(
`fertility min ` = min(Fertilite),
`smallest (Country)` = Pays[which.min(Fertilite)],
`smallest (Age)` = Age[which.min(Fertilite)],
`fertility max ` = max(Fertilite),
`highest (Country)` = Pays[which.max(Fertilite)],
`highest (Age)` = Age[which.max(Fertilite)]) %>% knitr::kable()
```

`## `summarise()` ungrouping output (override with `.groups` argument)`

Region | fertility min | smallest (Country) | smallest (Age) | fertility max | highest (Country) | highest (Age) |
---|---|---|---|---|---|---|

Europe est | 3.8 | Moldavie | >=40 | 106.7 | Bielorussie | 25-29 |

Europe nord | 4.4 | Danemark | 15-19 | 134.1 | Suede | 30-34 |

Europe Ouest | 3.5 | Suisse | 15-19 | 136.7 | Pays-Bas | 30-34 |

Europe sud | 4.4 | Macedoine | >=40 | 126.7 | Albanie | 25-29 |

- Display boxplots of the fertility for each age group

I first recode the levels in Age for presenting them in the more interpretable and natural order.

```
%>%
fertility mutate(Age = fct_relevel(Age, c("15-19", "20-24", "25-29", "30-34", "35-39", ">=40"))) %>%
ggplot() + aes(x = Age, y = Fertilite) + geom_boxplot()
```

Bonus: split by region.

```
%>%
fertility mutate(Age = fct_relevel(Age, c("15-19", "20-24", "25-29", "30-34", "35-39", ">=40"))) %>%
ggplot() + aes(x = Age, y = Fertilite, fill = Region) +
geom_boxplot() + geom_jitter(alpha = 0.5, size = 1)
```

- Display boxplots of the fertility for each ‘Region’ for the youngest age group

Same idea: use `filter`

from **dplyr**:

```
%>%
fertility filter(Age == "15-19") %>%
group_by(Region) %>% #
ggplot() + aes(x = Region, y = Fertilite) + geom_boxplot()
```

- Extract all the data rows with Country starting by “A”

I had to make a research on the internet to find the appropriate “scoping” (function `stringr::str_detect()`

)

`filter(fertility, str_detect(Pays, "^A")) %>% knitr::kable()`

Pays | Region | Age | Fertilite |
---|---|---|---|

Albanie | Europe sud | 15-19 | 20.8 |

Albanie | Europe sud | 20-24 | 107.1 |

Albanie | Europe sud | 25-29 | 126.7 |

Albanie | Europe sud | 30-34 | 72.3 |

Albanie | Europe sud | 35-39 | 24.4 |

Albanie | Europe sud | >=40 | 5.5 |

Allemagne | Europe Ouest | 15-19 | 8.0 |

Allemagne | Europe Ouest | 20-24 | 36.4 |

Allemagne | Europe Ouest | 25-29 | 78.9 |

Allemagne | Europe Ouest | 30-34 | 94.6 |

Allemagne | Europe Ouest | 35-39 | 50.9 |

Allemagne | Europe Ouest | >=40 | 9.3 |

Autriche | Europe Ouest | 15-19 | 8.6 |

Autriche | Europe Ouest | 20-24 | 44.5 |

Autriche | Europe Ouest | 25-29 | 89.2 |

Autriche | Europe Ouest | 30-34 | 94.6 |

Autriche | Europe Ouest | 35-39 | 46.8 |

Autriche | Europe Ouest | >=40 | 9.6 |

- Extract the fertility values of the Region from North Europe and make a histogram

Some filtering again, and adjusted binwidth (the default choice send warning)

```
%>%
fertility filter(Region == 'Europe nord') %>%
ggplot() + aes(x = Fertilite) + geom_histogram(binwidth = 10)
```

- Extract the fertility values of the Region from East Europe and plot a kernel density estimate (function
`density`

)

The function `density`

is from the **stats** package, loaded at start-up. It produces an object for which a plot method is available:

```
%>%
fertility filter(Region == 'Europe est') %>% pull(Fertilite) %>%
density() %>% plot()
```

- Include in your report the mathematical expression of the kernel density estimator as given at https://en.wikipedia.org/wiki/Kernel_density_estimation

I just had to edit the wikipedia page, extract the formula and copy-paste. Here it is:

\[ \widehat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h (x - x_i) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big), \]

The ‘ozoneNA.csv’ data set contains 112 observations about air pollution collected during the summer of 2001 in Rennes. The aim of this study is to analyze the relationship between the maximum daily ozone level (maxO3) and the other meteorological variables. Hence, the variables recorded in this data set are the following:

- maxO3 (maximum daily ozone)
- maxO3v (maximum daily ozone the previous day)
- T12 (temperature at midday)
- T9 (Temp at 9am)
- T15 (Temp at 3pm)
- Vx12 (projection of the wind speed vector on the east-west axis at midday)
- Vx9 and Vx15 (same at 9am and 3pm)
- Ne9, Ne12, Ne15 Nebulosity (cloud)

This data is incomplete - there are missing values. Indeed, it occurs frequently to have machines that fail one day, leading to some information not recorded. In this exercice, we will perform extremely basic imputation of missing values for the variable `maxO3`

, measuring the maximum value of ozone of the day.

- Import the data as a data frame named
`ozone`

.

The first column, which contains no header in the `csv`

file, is an id from the date of sampling. We do not need it so we remove it just after reading the data:

`<- read_csv("data/ozoneNA.csv") %>% select(-X1) ozone `

`## Warning: Missing column names filled in: 'X1' [1]`

```
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## X1 = col_double(),
## maxO3 = col_double(),
## T9 = col_double(),
## T12 = col_double(),
## T15 = col_double(),
## Ne9 = col_double(),
## Ne12 = col_double(),
## Ne15 = col_double(),
## Vx9 = col_double(),
## Vx12 = col_double(),
## Vx15 = col_double(),
## maxO3v = col_double(),
## WindDirection = col_character()
## )
```

- Represent an histogram of the variable
`maxO3`

.

```
%>%
ozone ggplot() + aes(x = maxO3) + geom_histogram()
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

`## Warning: Removed 16 rows containing non-finite values (stat_bin).`

- Perform a
*very*basic imputation by replacing`NA`

values in`maxO3`

by the mean value. Create a vector`maxO3_NA_mean`

.

Let us do this in base `R`

. Notice the `na.rm`

option in the call to function `mean`

:

```
<- ozone$maxO3
maxO3_NA_mean is.na(maxO3_NA_mean)] <- mean(maxO3_NA_mean, na.rm = TRUE) maxO3_NA_mean[
```

- We suggest now to perform imputation via the k-nearest neighbor approach (see https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm ). Use the package
**simputation**to perform knn-imputation of the missing-value in`maxO3`

based on a distance computed with all remaining variables. To do so, use

`::impute_knn(ozone, maxO3 ~ .) simputation`

where `ozone`

is your data frame. The string `maxO3 ~ .`

is a formula explaining that `maxO3`

is the variable on which imputation must be performed, and while `~ .`

says that all remaining variables of the data frame are used to perform the imputation. Create a vector `maxO3_NA_knn`

with the vector with imputed values.

We apply the imputation method on the while data frame and extract the correspoding column to store the reslut in a simple numeric vector:

```
<- ozone %>% as.data.frame() %>%
maxO3_NA_knn ::impute_knn(maxO3 ~ .) %>% pull(maxO3) simputation
```

- Compare visually the distributions of the three vectors
`maxO3`

,`maxO3_NA_mean`

and`maxO3_NA_knn`

.

We put all results together in the same data frame (remark the name exlicitly given), reshape it with `pivot_longer`

a make an histogram for each. Remark the detrimental effect of mean imputation on the histogram, compare to KNN (far from ideal, though).

```
tibble(
no_imputation = ozone$maxO3,
mean_imputation = maxO3_NA_mean,
knn_imputation = maxO3_NA_knn
%>%
) pivot_longer(everything()) %>%
ggplot() + aes(x = value) + geom_histogram() + facet_grid(~ name)
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

`## Warning: Removed 16 rows containing non-finite values (stat_bin).`