A. Introduction

This project aims to analyse real-world data sets with regression models and regularization tools that generalize those studied during the first part of the class.

Goals

  1. Applying the tools seen during the first part (multiple linear regression, variable selection with stepwise and similar procedures, ridge and LASSO).

  2. Studying some extensions suited to the data set considered, such as

  • variants of the LASSO: group LASSO, fused LASSO, elastic net
  • additive models and splines
  • logistic regression
  • generalized linear models

All these statistical models are implemented and distributed as R packages. One may use in particular the following resources:

  • the splines package,
  • the glmnet package,
  • the gam and gamsel packages,
  • the genlasso, grpreg packages,
  • etc.
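
A minimal sketch of the first goal, multiple linear regression followed by stepwise selection, using only base R (lm and step from the stats package). The data are simulated and the variable names x1 to x4 are hypothetical, so this illustrates the workflow rather than an analysis of any of the data sets listed below.

```r
# Simulated data (hypothetical): only x1 and x2 truly influence y
set.seed(1)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Full multiple linear regression
full <- lm(y ~ ., data = dat)

# Stepwise selection driven by AIC (both directions), starting from the full model
sel <- step(full, direction = "both", trace = 0)

# Names of the predictors kept by the procedure
selected <- setdiff(names(coef(sel)), "(Intercept)")
print(selected)
```

Penalized alternatives such as ridge and LASSO follow the same fit-then-select pattern, with the amount of shrinkage usually chosen by cross-validation.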

Advice

  • You should do some research on the models that you are considering (research papers, textbooks, etc.).
  • You do not have to do any deep development in R. You basically have to make smart use of the existing packages.
  • The final goal is to produce interpretable models with good predictive performance.
  • Some discussion of the model and the data is expected: we want data analysis, not the blind application of black-box procedures.
  • Your report does not need to be long, but it has to be precise, relevant and accurate.

Schedule

You will present your work in a 5-minute talk on December 15th (plus 5 minutes for discussion). You will also send us two reports:

  1. A first report about basic data analysis and application of the methods seen during the first part of the class. This report is due on November 24th.

  2. A second report about the new regularization and/or regression techniques, at the edge of the course material, used to analyze your data set. This report is due on December 11th.

Penalties will apply if your work is submitted late.

References

For extending both regularization methods and the linear model, the following two books are strongly recommended:

  • The Elements of Statistical Learning, Hastie, Tibshirani, Friedman. Chapters 3, 4, 5.

  • Extending the Linear Model with R, Faraway.

B. Data sets

You can look for your own data set on the web, for instance on the UCI repository, for a regression or classification task.

Here are some possible data sets, all available for download.

B.2 ‘Bardet’ data set

Description

Gene expression data (20 genes for 120 samples) from the microarray experiments of mammalian eye tissue samples of Scheetz et al. (2006).

Format

This data set contains 120 samples with 100 predictors (expanded from 20 genes using a basis of 5 B-splines, as described in Yang, Y. and Zou, H. (2012)). It consists of a list with the following elements:

  • x a [120 x 100] matrix (expanded from a [120 x 20] matrix) giving the expression levels of 20 filtered genes for the 120 samples. Each row corresponds to a subject, each block of 5 consecutive columns to one gene (a group).
  • y a numeric vector of length 120 giving the expression level of gene TRIM32, which causes Bardet-Biedl syndrome.

References

T. Scheetz, K. Kim, R. Swiderski, A. Philp, T. Braun, K. Knudtson, A. Dorrance, G. DiBona, J. Huang, T. Casavant, V. Sheffield, E. Stone. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences of the United States of America, 2006.

Usage

load("data/bardet.rda")
str(bardet)
## List of 2
##  $ x: num [1:120, 1:100] 0.44705 0.46684 0.01498 0 0.00281 ...
##  $ y: num [1:120] 8.42 8.36 8.41 8.29 8.27 ...
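
Since each gene occupies 5 consecutive columns of x, group-wise penalties such as the group LASSO need a vector telling which gene each column belongs to. Below is a small sketch of how that index could be built in base R. The object names n_genes, n_basis and group are ours, and how the vector is passed to a fitting function (for instance through a group argument in grpreg) should be checked against the package documentation.

```r
n_genes <- 20   # genes before the B-spline expansion
n_basis <- 5    # basis functions per gene

# Column j of x belongs to gene group[j]: 1 1 1 1 1 2 2 2 2 2 ...
group <- rep(seq_len(n_genes), each = n_basis)

length(group)                    # 100, one entry per column of x
cols_gene3 <- which(group == 3)  # columns carrying the third gene
cols_gene3                       # 11 12 13 14 15
```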

B.3 HIV data set

Description

Genotypes of 605 individuals with AIDS (HIV).

Format

Two objects are created on load:

  1. X - a 605x300 matrix giving the genotypes of the 605 individuals for 300 SNPs.
  2. y - a vector of length 605 giving the concentration of virus in the blood of each individual.

Usage

load("data/HIVdata.rda")
ls(); str(X); str(y)
## [1] "X" "y"
##  num [1:605, 1:300] 3 2 2 3 2 2 1 1 3 2 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:605] "030101" "030102" "060101" "060102" ...
##   ..$ : chr [1:300] "rs1264550" "rs9261947" "rs1079541" "rs9295871" ...
##  num [1:605] 4.56 5.18 5.2 6.18 6.11 4.64 2.91 6 5.9 4.74 ...

Reference

Dalmasso, C., Carpentier, W., Meyer, L., Rouzioux, C., Goujard, C., Chaix, M. L., … & Theodorou, I. (2008). Distinct genetic loci control plasma HIV-RNA and cellular HIV-DNA levels in HIV-1 infection: the ANRS Genome Wide Association 01 study. PloS one, 3(12), e3907-e3907.

B.4 Colorectal data set

Description

Data set giving gene expression levels in tumoral or healthy tissues from patients with colorectal cancer. 62 samples have been analyzed for 2000 genes or transcripts.

Format

Three objects are created on load:

  1. X - a 62x2000 matrix giving the log-transformed expression levels measured in the 62 tissues.
  2. y - a vector of length 62 giving the status of the tissue (-1: tumoral, 1: healthy).
  3. genes.info - a list of length 2000 giving information about the studied genes.

Usage

load("data/colorectal.rda")
ls()
## [1] "genes.info" "X"          "y"
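
Because y is coded -1/1, it usually has to be recoded before fitting a logistic regression: glm with family = binomial expects a 0/1 vector or a factor. Here is a sketch on a short hypothetical label vector (y_example stands in for the real y). Treating tumoral as the "success" class is our assumption.

```r
# Hypothetical labels using the data set's coding (-1: tumoral, 1: healthy)
y_example <- c(-1, 1, 1, -1, -1)

# Option 1: 0/1 coding, with 1 = tumoral
y01 <- as.integer(y_example == -1)

# Option 2: a factor; the first level ("healthy") is the reference class
y_fac <- factor(y_example, levels = c(1, -1), labels = c("healthy", "tumoral"))

# Cross-tabulation to confirm that both codings agree
table(y_fac, y01)
```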

Reference

U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, “Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays”, PNAS, vol. 96, 1999.

B.5 Liver data set

Description

This data set contains the expression measurements of 3116 genes and 10 clinical measurements for 64 subjects (rats) that were exposed to non-toxic or toxic doses of acetaminophen in a controlled experiment.

Format

Two objects are created on load:

  1. x - a 64x3116 matrix giving the log-transformed expression levels of 3116 genes for the 64 rats.
  2. y - a vector of length 64 giving the treatment (0: non-toxic, 1: toxic).

Usage

load("data/liver.RData")
ls()
## [1] "x" "y"

Reference

Bushel, P., Wolfinger, R. D. and Gibson, G. (2007). Simultaneous clustering of gene expression data with clinical chemistry

B.6 Nutrimouse

Description

The nutrimouse data set contains the expression measurements of 120 genes potentially involved in nutritional problems and the concentration of one hepatic fatty acid for forty mice.

Format

A data frame with 121 columns and 40 rows. The first 120 numerical variables are gene expression levels. The last column contains the concentration (as a proportion) of the hepatic fatty acid.

Usage

load("data/mice.rda")
ls()
## [1] "mice"
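
Most penalized-regression functions expect a numeric predictor matrix and a separate response vector rather than a data frame. Here is a sketch of the split, run on a simulated stand-in (mice_demo, hypothetical) with the same 40 x 121 shape as mice.

```r
# Hypothetical stand-in for `mice`: 40 rows, 120 predictors + 1 response column
set.seed(1)
mice_demo <- as.data.frame(matrix(rnorm(40 * 121), nrow = 40))

p <- ncol(mice_demo)              # index of the last column (the response)
x <- as.matrix(mice_demo[, -p])   # 40 x 120 numeric predictor matrix
y <- mice_demo[[p]]               # numeric response vector of length 40

dim(x)
length(y)
```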

Reference

Martin, P. G. P., Guillou, H., Lasserre, F., Dejean, S., Lan, A., Pascussi, J.-M., San Cristobal, M., Legrand, P., Besse, P. and Pineau, T. (2007). Novel aspects of PPARα-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study. Hepatology 54, 767-777.

B.7 Ferritin data set

Description

This data set was collected at the Australian National Sport Institute. It gives the ferritin concentration and various covariates for 102 men and 100 women.

Format

A data frame with 13 columns and 202 rows.

  • Sport - Sport
  • Sex - male or female
  • Ht - Height in cm
  • Wt - Weight in kg
  • LBM - Lean body mass
  • RCC - Red cell count
  • WCC - White cell count
  • Hc - Hematocrit
  • Hg - Hemoglobin
  • Ferr - Plasma ferritin concentration
  • BMI - Body mass index = weight/height^2
  • SSF - Sum of skin folds
  • X.Bfat - % body fat

Usage

load("data/ferritin.RData")
head(ferritin)
##      Sex Sport  RCC WCC   Hc   Hg Ferr   BMI   SSF X.Bfat   LBM    Ht   Wt
## 1 female BBall 3.96 7.5 37.5 12.3   60 20.56 109.1  19.75 63.32 195.9 78.9
## 2 female BBall 4.41 8.3 38.2 12.7   68 20.67 102.8  21.30 58.55 189.7 74.4
## 3 female BBall 4.14 5.0 36.4 11.6   21 21.86 104.6  19.88 55.36 177.8 69.1
## 4 female BBall 4.11 5.3 37.3 12.6   69 21.88 126.4  23.66 57.18 185.0 74.9
## 5 female BBall 4.45 6.8 41.5 14.0   29 18.96  80.3  17.64 53.20 184.6 64.6
## 6 female BBall 4.10 4.4 37.4 12.5   42 21.04  75.2  15.58 53.77 174.0 63.7
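
The printed rows suggest that BMI is computed with height converted from cm to metres. Here is a quick check on the first three rows shown above, with the values copied by hand into a small data frame (ferr_head is our name):

```r
# First three rows of the output above (Ht in cm, Wt in kg)
ferr_head <- data.frame(
  Ht  = c(195.9, 189.7, 177.8),
  Wt  = c(78.9, 74.4, 69.1),
  BMI = c(20.56, 20.67, 21.86)
)

# BMI = weight (kg) / height (m)^2
bmi <- ferr_head$Wt / (ferr_head$Ht / 100)^2
round(bmi, 2)                            # 20.56 20.67 21.86
all.equal(round(bmi, 2), ferr_head$BMI)  # TRUE
```

BMI is therefore a deterministic function of Ht and Wt, which is worth keeping in mind when interpreting the coefficients of a model that includes all three.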

B.8 University Rank

Description

Global score for 200 universities as a function of various predictors.

Details

A data frame with the following columns

  • World_Rank: Rank of the university.
  • University_Name: Name of the university.
  • Country: Country where the university is located.
  • Teaching_Rating: Rating of the university's teaching quality, between 0 and 100.
  • Inter_Outlook_Rating: Rating of the university's international composition, between 0 and 100.
  • Research_Rating: Rating of the university's research quality, between 0 and 100.
  • Citations_Rating: Rating of the citations of the university's papers by other universities, between 0 and 100.
  • Industry_Income_Rating: Rating of industry investment in the university's research, between 0 and 100.
  • Total_Score: Final score (response variable).
  • Num_Students: Total number of students.
  • Student.Staff_Ratio: Ratio of the number of students to the number of academic staff.
  • X._Inter_Students: Percentage of international students.
  • X._Female_Students: Percentage of female students.
  • Year: Academic year.

Usage

load("data/univ.RData")
head(univ[, 1:3])
##   World_Rank                       University_Name
## 1          2    California Institute of Technology
## 2         75                  University of Oxford
## 3         89                   Stanford University
## 4        102               University of Cambridge
## 5        111 Massachusetts Institute of Technology
## 6        122                    Harvard University
##                    Country
## 1 United States of America
## 2           United Kingdom
## 3 United States of America
## 4           United Kingdom
## 5 United States of America
## 6 United States of America

B.9 Parkinson data set

Description

This data set is composed of a range of biomedical voice measurements from people with early-stage Parkinson’s disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. Columns in the table contain 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recordings from these individuals. The main aim is to predict the total UPDRS score, a clinical score for the disease, from the 16 voice measures.

Format

A vector y with the Clinician’s total UPDRS score and a matrix x with the following columns

  • 1-5 Jitterxx - Several measures of variation in fundamental frequency
  • 6-11 Shimmerxx - Several measures of variation in amplitude
  • 12 NHR - a measure of the ratio of noise to tonal components in the voice
  • 13 HNR - a measure of the ratio of noise to tonal components in the voice
  • 14 RPDE - A nonlinear dynamical complexity measure
  • 15 DFA - Signal fractal scaling exponent
  • 16 PPE - A nonlinear measure of fundamental frequency variation

Usage

load("data/parkinson.Rdata")
head(x)[, 1:3]
##      Jitter... Jitter.Abs. Jitter.RAP
## [1,]   0.00662   3.380e-05    0.00401
## [2,]   0.00300   1.680e-05    0.00132
## [3,]   0.00481   2.462e-05    0.00205
## [4,]   0.00528   2.657e-05    0.00191
## [5,]   0.00335   2.014e-05    0.00093
## [6,]   0.00353   2.290e-05    0.00119
head(y)
## [1] 34.398 34.894 35.389 35.810 36.375 36.870

B.10 Breast cancer data

Description

The goal is to predict whether epithelial cells are benign or malignant, based on 9 cytological features assessed on a scale of 1 to 10.

Format

A vector y for the status of the cell (benign, malignant) and a matrix bc.raw with the following columns

  • Clump Thickness
  • Uniformity of Cell Size
  • Uniformity of Cell Shape
  • Marginal Adhesion
  • Single Epithelial Cell Size
  • Bare Nuclei
  • Bland Chromatin
  • Normal Nucleoli
  • Mitoses

Usage

load("data/breast.RData")
head(bc.raw)[, 1:3]
##   Clump.Thickness Uniformity.of.Cell.Size Uniformity.of.Cell.Shape
## 1               5                       4                        4
## 2               3                       1                        1
## 3               6                       8                        8
## 4               4                       1                        1
## 5               8                      10                       10
## 6               1                       1                        1
head(y)
## [1] 0 0 0 0 1 0
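
Since y is binary, logistic regression is a natural baseline here. Below is a minimal sketch with glm on simulated data. The two features x1 and x2 (integer scores from 1 to 10) and the coefficients used to generate the response are hypothetical and only mimic the format of bc.raw.

```r
# Simulated scores on a 1-10 scale (hypothetical stand-in for bc.raw)
set.seed(42)
n <- 200
sim <- data.frame(x1 = sample(1:10, n, replace = TRUE),
                  x2 = sample(1:10, n, replace = TRUE))

# The probability of y = 1 increases with both scores
sim$y <- rbinom(n, 1, plogis(-4 + 0.5 * sim$x1 + 0.2 * sim$x2))

# Logistic regression
fit <- glm(y ~ x1 + x2, data = sim, family = binomial)
coef(fit)

# Predicted probabilities and classification at the 0.5 threshold
p_hat <- predict(fit, type = "response")
y_hat <- as.integer(p_hat > 0.5)
mean(y_hat == sim$y)   # in-sample accuracy, an optimistic estimate
```

For an honest assessment of predictive performance, the accuracy should be estimated on held-out data or by cross-validation rather than in-sample.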