This project aims to analyse some real-world data sets with regression models and regularization tools that generalize those studied during the first part of the class.

The first step is to apply the tools seen during the first part of the class (multiple linear regression, variable selection with stepwise and similar procedures, ridge and LASSO).
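As a minimal sketch of these baseline tools, the block below fits a full linear model, runs backward stepwise selection with BIC, and (only if `glmnet` is installed) cross-validated ridge and LASSO fits. The data are simulated for illustration; the dimensions, the coefficients and all object names are assumptions, not part of the project data.

```r
# Simulated data (illustrative assumption): 100 observations, 8 predictors,
# only the first three of which carry signal.
set.seed(1)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
y <- drop(X[, 1:3] %*% c(2, -1.5, 1) + rnorm(n))
dat <- data.frame(y = y, X)

# Multiple linear regression with all predictors
fit_full <- lm(y ~ ., data = dat)

# Backward stepwise selection; k = log(n) gives the BIC penalty
fit_step <- step(fit_full, direction = "backward", k = log(n), trace = 0)
selected <- names(coef(fit_step))[-1]
print(selected)

# Ridge and LASSO with a cross-validated penalty, if glmnet is available
if (requireNamespace("glmnet", quietly = TRUE)) {
  cv_ridge <- glmnet::cv.glmnet(X, y, alpha = 0)  # alpha = 0: ridge
  cv_lasso <- glmnet::cv.glmnet(X, y, alpha = 1)  # alpha = 1: LASSO
  print(as.matrix(coef(cv_lasso, s = "lambda.1se")))
}
```

On a real data set, `X` and `y` would be replaced by the loaded objects, and the choice between `lambda.min` and `lambda.1se` is worth discussing in the report.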

The second step is to study some extensions, depending on the data set considered, such as

- variants of the LASSO: group-Lasso, fused-Lasso, Elastic-Net
- additive models and splines
- logistic regression
- generalized linear models
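Two of these extensions can be combined in a few lines of base R. The sketch below, on simulated data, fits a logistic regression (a GLM with binomial family) that is linear in both predictors, then an additive-style variant where one predictor enters through a natural cubic spline basis from the `splines` package. The simulated truth and the choice of 4 degrees of freedom are assumptions made for illustration.

```r
library(splines)  # ships with R

set.seed(2)
n <- 300
x1 <- runif(n, -2, 2)
x2 <- rnorm(n)
# Assumed truth: nonlinear effect of x1 and linear effect of x2 on the log-odds
eta <- sin(2 * x1) + 0.8 * x2
y <- rbinom(n, 1, plogis(eta))

# Logistic regression, linear in both predictors
fit_lin <- glm(y ~ x1 + x2, family = binomial)

# Same model with a natural cubic spline basis (4 df) for x1
fit_spl <- glm(y ~ ns(x1, df = 4) + x2, family = binomial)

# The linear model is nested in the spline model, so the two fits
# can be compared with a deviance (likelihood-ratio) test
cmp <- anova(fit_lin, fit_spl, test = "Chisq")
print(cmp)
```

The `gam` package offers a more complete treatment of additive models; the spline-in-`glm` version above is only the simplest way to get a nonlinear effect.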

All these statistical models are implemented and distributed as R packages. One may use in particular the following resources:

- the `splines` package,
- the `glmnet` package,
- the `gam` and `gamsel` packages,
- the `genlasso` and `grpreg` packages,
- etc.

- You should do some research on the models that you are considering (research papers, textbooks, etc.).
- You do not have to do any heavy development in `R`. You basically have to make smart use of the existing packages.
- The final goal is to produce interpretable models with good predictive performance.
- Some discussion about the model and the data is expected; we want some data analysis, not the blind application of black-box procedures.
- Your report does not need to be long; it has to be precise, relevant and accurate.
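One common way to quantify predictive performance is K-fold cross-validation. The sketch below estimates the out-of-sample mean squared error of two linear models on simulated data; the data, the value `K = 5` and the helper `cv_mse` are all illustrative assumptions.

```r
set.seed(3)
n <- 120
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

# Random assignment of each observation to one of K folds of equal size
K <- 5
fold <- sample(rep(seq_len(K), length.out = n))

# Cross-validated mean squared prediction error for a given formula
cv_mse <- function(formula) {
  errs <- sapply(seq_len(K), function(k) {
    fit <- lm(formula, data = dat[fold != k, ])          # train on K - 1 folds
    pred <- predict(fit, newdata = dat[fold == k, ])     # predict held-out fold
    mean((dat$y[fold == k] - pred)^2)
  })
  mean(errs)
}

mse_null <- cv_mse(y ~ 1)             # intercept-only benchmark
mse_full <- cv_mse(y ~ x1 + x2 + x3)
print(c(null = mse_null, full = mse_full))
```

Comparing every candidate model on the same folds, as done here, keeps the comparison fair.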

You will present your work during a 5-minute talk on December 15th (+ 5 minutes for discussion). You will send us two reports:

- A first report about basic data analysis and the application of the methods seen during the first part of the class. We need this report by *November 24th*.
- A second report about the new regularization and/or regression techniques, going beyond those covered in class, used for analyzing your data set. We need this report by *December 11th*.

Penalties will apply if your work is handed in late.

You can look for your own data set on the web, for a regression or classification task, for instance on the UCI repository.

Here are some possible data sets, which you can download here.

Gene expression data (20 genes for 120 samples) from the microarray experiments of mammalian eye tissue samples of Scheetz et al. (2006).

This data set contains 120 samples with 100 predictors (expanded from 20 genes using 5 B-spline basis functions, as described in Yang, Y. and Zou, H. (2012)). It consists of a list with the following elements:

- x a [120 x 100] matrix (expanded from a [120 x 20] matrix) giving the expression levels of 20 filtered genes for the 120 samples. Each row corresponds to a subject, each 5 consecutive columns to a grouped gene.
- y a numeric vector of length 120 giving expression level of gene TRIM32, which causes Bardet-Biedl syndrome.

T. Scheetz, K. Kim, R. Swiderski, A. Philp, T. Braun, K. Knudtson, A. Dorrance, G. DiBona, J. Huang, T. Casavant, V. Sheffield, E. Stone. Regulation of gene expression in the mammalian eye and its relevance to eye disease. *Proceedings of the National Academy of Sciences of the United States of America*, 2006.

```
load("data/bardet.rda")
str(bardet)
```

```
## List of 2
## $ x: num [1:120, 1:100] 0.44705 0.46684 0.01498 0 0.00281 ...
## $ y: num [1:120] 8.42 8.36 8.41 8.29 8.27 ...
```
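Since each run of 5 consecutive columns of `x` corresponds to one gene, a group-Lasso fit needs a group index of length 100. The sketch below builds that index and, only if `grpreg` is installed, runs a cross-validated group-Lasso. The matrix used here is a simulated stand-in with the same dimensions as `bardet$x`, not the real data; replace it with `bardet$x` and `bardet$y` after `load()`.

```r
# Group index matching the layout described above:
# columns 1-5 belong to gene 1, columns 6-10 to gene 2, and so on.
group <- rep(1:20, each = 5)

# Simulated stand-in with the same dimensions as bardet$x / bardet$y
set.seed(4)
x <- matrix(rnorm(120 * 100), 120, 100)
y <- drop(x[, 1:5] %*% rep(0.5, 5)) + rnorm(120, sd = 0.5)

# Cross-validated group-Lasso, if grpreg is available
if (requireNamespace("grpreg", quietly = TRUE)) {
  cvfit <- grpreg::cv.grpreg(x, y, group = group, penalty = "grLasso")
  beta <- coef(cvfit)[-1]            # coefficients at the CV-selected lambda, intercept dropped
  print(unique(group[beta != 0]))    # indices of the genes kept in the model
}
```

With the group penalty, the 5 basis coefficients of a gene enter or leave the model together, which is what makes the selection interpretable at the gene level.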

`HIV` data set

Genotypes associated with 605 individuals with AIDS (HIV).

Two objects are created on load:

- X - a 605x300 matrix giving the genotypes of 605 individuals for 300 SNPs.
- y - a size 605 vector giving the concentration of virus in the blood for each individual.

```
load("data/HIVdata.rda")
ls(); str(X); str(y)
```

`## [1] "X" "y"`

```
## num [1:605, 1:300] 3 2 2 3 2 2 1 1 3 2 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:605] "030101" "030102" "060101" "060102" ...
## ..$ : chr [1:300] "rs1264550" "rs9261947" "rs1079541" "rs9295871" ...
```

`## num [1:605] 4.56 5.18 5.2 6.18 6.11 4.64 2.91 6 5.9 4.74 ...`
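The entries of `X` printed above take the values 1, 2 and 3. One modelling choice is to keep them as numeric scores; another is to treat each SNP as a three-level factor and expand it into dummy variables, which naturally leads to one group of columns per SNP. The sketch below illustrates the second option on a tiny simulated matrix; reading 1/2/3 as three genotype categories is an assumption, and the two column names are simply borrowed from the output above.

```r
# Tiny simulated stand-in: 10 individuals, 2 SNPs coded 1/2/3 (not the real data)
set.seed(6)
snp <- matrix(sample(1:3, 10 * 2, replace = TRUE), 10, 2,
              dimnames = list(NULL, c("rs1264550", "rs9261947")))

# Turn every SNP into a factor with the three possible levels
df <- as.data.frame(lapply(as.data.frame(snp), factor, levels = 1:3))

# Dummy coding: level 1 is the reference, so each SNP yields 2 columns
Z <- model.matrix(~ ., data = df)[, -1]   # drop the intercept column

# Group index (one group per SNP), usable by a group-Lasso routine
group <- rep(seq_len(ncol(snp)), each = 2)
print(colnames(Z))
```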

Dalmasso, C., Carpentier, W., Meyer, L., Rouzioux, C., Goujard, C., Chaix, M. L., … & Theodorou, I. (2008). Distinct genetic loci control plasma HIV-RNA and cellular HIV-DNA levels in HIV-1 infection: the ANRS Genome Wide Association 01 study. *PloS one*, 3(12), e3907-e3907.

`Colorectal` data set

Data set giving the gene expression levels in tumoral or healthy tissues for patients with colorectal cancer. 62 samples have been analyzed for 2000 genes or transcripts.

Three objects are created on load:

- X - a 62x2000 matrix giving the log-transformed expression levels sampled in 62 tissues.
- y - a size 62 vector giving the status of the patient (-1: tumoral, 1: healthy).
- genes.info - a list of length 2000 giving information about the studied genes.

```
load("data/colorectal.rda")
ls()
```

`## [1] "genes.info" "X" "y"`
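Because `y` is coded -1/1, it is convenient to recode it as a labelled factor before fitting a classifier. With 2000 predictors and only 62 samples, an unpenalized logistic regression is not identifiable, so a penalized fit is one possible route. The sketch below uses simulated stand-ins for `X` and `y` (same dimensions, random values, not the real data) and runs a LASSO-penalized logistic regression only if `glmnet` is installed.

```r
# Simulated stand-ins with the same dimensions as the loaded objects
set.seed(5)
X <- matrix(rnorm(62 * 2000), 62, 2000)
y <- sample(c(-1, 1), 62, replace = TRUE)

# Recode following the coding stated above (-1: tumoral, 1: healthy)
status <- factor(y, levels = c(-1, 1), labels = c("tumoral", "healthy"))
print(table(status))

# LASSO-penalized logistic regression with 5-fold cross-validation
if (requireNamespace("glmnet", quietly = TRUE)) {
  cvfit <- glmnet::cv.glmnet(X, status, family = "binomial",
                             alpha = 1, nfolds = 5)
  b <- as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1]  # drop intercept
  print(sum(b != 0))   # number of genes kept at the selected penalty
}
```

On the stand-in data the response is pure noise, so the number of selected genes carries no meaning; the block only shows the mechanics.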

U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, “Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays”, *PNAS*, vol. 96, 1999.

`Liver` data set

This data set contains the expression measure of 3116 genes and 10 clinical measurements for 64 subjects (rats) that were exposed to non-toxic or toxic doses of acetaminophen in a controlled experiment.

Two objects are created on load:

- x - a 64x3116 matrix giving the log-transformed expression levels of 3116 genes for the 64 rats.
- y - a size 64 vector giving the treatment (0: non-toxic, 1: toxic).

```
load("data/liver.RData")
ls()
```

`## [1] "x" "y"`

Bushel, P., Wolfinger, R. D. and Gibson, G. (2007). Simultaneous clustering of gene expression data with clinical chemistry

The nutrimouse data set contains the expression measure of 120 genes potentially involved in nutritional problems and the concentration of one hepatic fatty acid for forty mice.

A data frame with 121 columns and 40 rows. The first 120 numerical variables are gene expression levels. The last column contains the concentration (as a proportion) of the lipid.

```
load("data/mice.rda")
ls()
```

`## [1] "mice"`

Martin, P. G. P., Guillou, H., Lasserre, F., Dejean, S., Lan, A., Pascussi, J.-M., San Cristobal, M., Legrand, P., Besse, P. and Pineau, T. (2007). Novel aspects of PPARα-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study. *Hepatology* 54, 767-777.

This data set was collected at the Australian National Sport Institute and gives the ferritin concentration and various covariates for 102 men and 100 women.

A data frame with 13 columns and 202 rows.

- Sport: Sport
- Sex: male or female
- Ht: Height in cm
- Wt: Weight in kg
- LBM: Lean body mass
- RCC: Red cell count
- WCC: White cell count
- Hc: Hematocrit
- Hg: Hemoglobin
- Ferr: Plasma ferritin concentration
- BMI: Body mass index = weight/height^2
- SSF: Sum of skin folds
- X.Bfat: % body fat

```
load("data/ferritin.RData")
head(ferritin)
```

```
## Sex Sport RCC WCC Hc Hg Ferr BMI SSF X.Bfat LBM Ht Wt
## 1 female BBall 3.96 7.5 37.5 12.3 60 20.56 109.1 19.75 63.32 195.9 78.9
## 2 female BBall 4.41 8.3 38.2 12.7 68 20.67 102.8 21.30 58.55 189.7 74.4
## 3 female BBall 4.14 5.0 36.4 11.6 21 21.86 104.6 19.88 55.36 177.8 69.1
## 4 female BBall 4.11 5.3 37.3 12.6 69 21.88 126.4 23.66 57.18 185.0 74.9
## 5 female BBall 4.45 6.8 41.5 14.0 29 18.96 80.3 17.64 53.20 184.6 64.6
## 6 female BBall 4.10 4.4 37.4 12.5 42 21.04 75.2 15.58 53.77 174.0 63.7
```
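The definition BMI = weight/height^2 can be checked directly on the rows printed above, provided the height is converted from centimetres to metres. The sketch below types in the first three rows by hand (the data frame name `ath` is an illustrative choice) and recomputes the index.

```r
# First three rows of the output above, typed in by hand
ath <- data.frame(Ht  = c(195.9, 189.7, 177.8),   # height in cm
                  Wt  = c(78.9, 74.4, 69.1),      # weight in kg
                  BMI = c(20.56, 20.67, 21.86))   # value reported in the data

# BMI = weight (kg) / height (m)^2
bmi <- ath$Wt / (ath$Ht / 100)^2
print(round(bmi, 2))
```

Since `BMI` is a deterministic function of `Wt` and `Ht`, putting all three in the same regression is likely to create strongly correlated predictors, which is worth keeping in mind when interpreting the coefficients.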

Global score for 200 Universities as a function of various predictors.

A data frame with the following columns

- World_Rank: rank of the university.
- University_Name: name of the university.
- Country: location of the university.
- Teaching_Rating: rating of the university's teaching quality, between 0 and 100.
- Inter_Outlook_Rating: rating of the university's international composition, between 0 and 100.
- Research_Rating: rating of the university's research quality, between 0 and 100.
- Citations_Rating: rating of the citations of the university's papers by other universities, between 0 and 100.
- Industry_Income_Rating: rating of industry investment in the university's research, between 0 and 100.
- Total_Score: final score (response variable).
- Num_Students: total number of students.
- Student.Staff_Ratio: ratio of the number of students to the number of academic staff.
- X._Inter_Students: percentage of international students.
- X._Female_Students: percentage of female students.
- Year: academic year.

```
load("data/univ.RData")
head(univ[, 1:3])
```

```
## World_Rank University_Name
## 1 2 California Institute of Technology
## 2 75 University of Oxford
## 3 89 Stanford University
## 4 102 University of Cambridge
## 5 111 Massachusetts Institute of Technology
## 6 122 Harvard University
## Country
## 1 United States of America
## 2 United Kingdom
## 3 United States of America
## 4 United Kingdom
## 5 United States of America
## 6 United States of America
```

This data set is composed of a range of biomedical voice measurements from people with early-stage Parkinson’s disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. Columns in the table contain 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recordings from these individuals. The main aim is to predict the total UPDRS score, a clinical score for the disease, from the 16 voice measures.

A vector y with the Clinician’s total UPDRS score and a matrix x with the following columns

- 1-5 Jitterxx - Several measures of variation in fundamental frequency
- 6-11 Shimmerxx - Several measures of variation in amplitude
- 12 NHR - a measure of the ratio of noise to tonal components in the voice
- 13 HNR - a measure of the ratio of noise to tonal components in the voice
- 14 RPDE - A nonlinear dynamical complexity measure
- 15 DFA - Signal fractal scaling exponent
- 16 PPE - A nonlinear measure of fundamental frequency variation

```
load("data/parkinson.Rdata")
head(x)[, 1:3]
```

```
## Jitter... Jitter.Abs. Jitter.RAP
## [1,] 0.00662 3.380e-05 0.00401
## [2,] 0.00300 1.680e-05 0.00132
## [3,] 0.00481 2.462e-05 0.00205
## [4,] 0.00528 2.657e-05 0.00191
## [5,] 0.00335 2.014e-05 0.00093
## [6,] 0.00353 2.290e-05 0.00119
```

`head(y)`

`## [1] 34.398 34.894 35.389 35.810 36.375 36.870`
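The three columns printed above live on very different scales (about 1e-3 versus 1e-5), which matters for penalized regression because the penalty acts on the raw coefficient sizes. The sketch below types in those six rows by hand and standardizes them with `scale()`; the object names are illustrative. Note that `glmnet` standardizes predictors by default (`standardize = TRUE`), so explicit scaling is mainly useful with other routines or for reporting.

```r
# First six rows of the output above, typed in by hand (filled column by column)
x_head <- matrix(c(0.00662, 0.00300, 0.00481, 0.00528, 0.00335, 0.00353,
                   3.380e-05, 1.680e-05, 2.462e-05, 2.657e-05, 2.014e-05, 2.290e-05,
                   0.00401, 0.00132, 0.00205, 0.00191, 0.00093, 0.00119),
                 ncol = 3,
                 dimnames = list(NULL, c("Jitter...", "Jitter.Abs.", "Jitter.RAP")))

# Standard deviations differ by about two orders of magnitude
print(apply(x_head, 2, sd))

# Center each column and scale it to unit standard deviation
x_std <- scale(x_head)
print(round(x_std, 2))
```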

The goal is to predict whether epithelial cells are benign or malignant, based on 9 cytological features assessed on a scale of 1 to 10.

A vector y for the status of the cell (benign, malignant) and a matrix bc.raw with the following columns

- Clump Thickness

- Uniformity of Cell Size

- Uniformity of Cell Shape

- Marginal Adhesion

- Single Epithelial Cell Size

- Bare Nuclei

- Bland Chromatin

- Normal Nucleoli

- Mitoses

```
load("data/breast.RData")
head(bc.raw)[, 1:3]
```

```
## Clump.Thickness Uniformity.of.Cell.Size Uniformity.of.Cell.Shape
## 1 5 4 4
## 2 3 1 1
## 3 6 8 8
## 4 4 1 1
## 5 8 10 10
## 6 1 1 1
```

`head(y)`

`## [1] 0 0 0 0 1 0`
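For a binary response such as this one, a logistic regression followed by a confusion table is a natural first analysis. The sketch below uses a simulated stand-in with three of the feature names shown above, scored from 1 to 10; the values, the coefficients of the assumed truth and the 0.5 threshold are illustrative assumptions, not results on the real data.

```r
# Simulated stand-in: three cytological features scored 1 to 10
set.seed(8)
n <- 300
bc <- data.frame(Clump.Thickness          = sample(1:10, n, replace = TRUE),
                 Uniformity.of.Cell.Size  = sample(1:10, n, replace = TRUE),
                 Uniformity.of.Cell.Shape = sample(1:10, n, replace = TRUE))

# Assumed truth: higher scores increase the probability of the class coded 1
eta <- -6 + 0.5 * bc$Clump.Thickness + 0.6 * bc$Uniformity.of.Cell.Size
dat <- cbind(bc, y = rbinom(n, 1, plogis(eta)))

# Logistic regression on all features
fit <- glm(y ~ ., data = dat, family = binomial)

# Fitted probabilities, class predictions at threshold 0.5, confusion table
prob <- predict(fit, type = "response")
pred <- as.integer(prob > 0.5)
tab <- table(observed = dat$y, predicted = pred)
acc <- mean(pred == dat$y)
print(tab)
print(acc)
```

The accuracy computed here is in-sample and therefore optimistic; a held-out set or cross-validation gives a fairer estimate.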