## Preliminaries

### Package requirements

library(tidyverse)    # advanced data manipulation and vizualisation
library(knitr)        # R notebook export and formatting
library(ggsci)        # Fancy color palettes inspired by scientific journals
library(cowplot)      # To combine several plots on the same canvas

## Exercise 1

### Import data

This exercise relies on the native R dataset “UCBAdmissions” which stores student admissions at UC Berkeley for the six largest departments in 1973 in function of the student gender. Please note that the data is originally stored as a 3-dimensional array, which we convert into a data frame for the rest of the exercise.

#TO RUN
UCBAdmissions #3-dim array
## , , Dept = A
##
##           Gender
##   Rejected  313     19
##
## , , Dept = B
##
##           Gender
##   Rejected  207      8
##
## , , Dept = C
##
##           Gender
##   Rejected  205    391
##
## , , Dept = D
##
##           Gender
##   Rejected  279    244
##
## , , Dept = E
##
##           Gender
##   Rejected  138    299
##
## , , Dept = F
##
##           Gender
##   Rejected  351    317
UCBAdmissions <- as.data.frame(UCBAdmissions) #Convert into a data frame
head(UCBAdmissions)
##      Admit Gender Dept Freq
## 1 Admitted   Male    A  512
## 2 Rejected   Male    A  313
## 3 Admitted Female    A   89
## 4 Rejected Female    A   19
## 5 Admitted   Male    B  353
## 6 Rejected   Male    B  207

Question 1 : Have a closer look at the dataset : how many rows? How many variables? How many observations (i.e. number of applications)? How many different college departments are there? Briefly describe the variables.

UCBAdmissions %>% nrow() #24 rows
## [1] 24
UCBAdmissions %>% ncol() #4 variables
## [1] 4
UCBAdmissions %>% select(Dept) %>% unique() %>% nrow() #Number of departments
## [1] 6
sum(UCBAdmissions$Freq) #Total number of observations ## [1] 4526 Correction: The dataset comprises 24 rows and 4 variables : Admit (Admitted or Rejected), Gender (Male or Female), Dept (the university department) and Freq (the number of students). In total 4526 observations are stored (which is the sum of the column Freq). There are 6 different departments whose names go from A to F. ### Data description Question 2 : Using what you have learned about tidy data and the dplyr package, compute the total number of admissions and rejections per Gender as well as the admission rate (number of admissions divided by number of aplications) per Gender. Do you see a difference of admission rates per Gender at Berkeley? UCBAdmissions %>% group_by(Gender, Admit) %>% summarise(count = sum(Freq)) %>% group_by(Gender) %>% mutate(rate_per_gender = count / sum(count)) ## # A tibble: 4 x 4 ## # Groups: Gender [2] ## Gender Admit count rate_per_gender ## <fct> <fct> <dbl> <dbl> ## 1 Male Admitted 1198 0.445 ## 2 Male Rejected 1493 0.555 ## 3 Female Admitted 557 0.304 ## 4 Female Rejected 1278 0.696 Correction : It seem that there is a huge difference between male admissions (about 45% percent of applications) and female admissions (only 30% of applications). Question 3 : Compute now the number of : 1. Applications by departement 2. Admission/rejection rates per department (without distinction on the gender). Are there departments with more applications than others? Do the admission rates differ according to the department? UCBAdmissions %>% group_by(Dept, Admit) %>% summarise(count = sum(Freq)) %>% group_by(Dept) %>% mutate(total_number_applications = sum(count)) %>% mutate(rate_per_dpt = count / total_number_applications) %>% filter(Admit == "Admitted") #No need to keep the rows for rejection since they do not add more information.  ## # A tibble: 6 x 5 ## # Groups: Dept [6] ## Dept Admit count total_number_applications rate_per_dpt ## <fct> <fct> <dbl> <dbl> <dbl> ## 1 A Admitted 601 933 0.644 ## 2 B Admitted 370 585 0.632 ## 3 C Admitted 322 918 0.351 ## 4 D Admitted 269 792 0.340 ## 5 E Admitted 147 584 0.252 ## 6 F Admitted 46 714 0.0644 Correction The number of applications per departement does not vary much : it goes from 584 (department E) to 933 (department A). There is yet a huge difference in admission rates according to the departments, since it goes from about 6% (department F) to 64% (department A). Question 4 : Compute finally the rates of male/female applications per department (without taking into account the admission results). Can you draw an hypothesis to explain the difference of rate admissions between male and female (question 2)? UCBAdmissions %>% group_by(Dept, Gender) %>% summarise(count = sum(Freq)) %>% group_by(Dept) %>% mutate(total_number_applications = sum(count)) %>% mutate(gender_rate_per_dpt = count / sum(count)) ## # A tibble: 12 x 5 ## # Groups: Dept [6] ## Dept Gender count total_number_applications gender_rate_per_dpt ## <fct> <fct> <dbl> <dbl> <dbl> ## 1 A Male 825 933 0.884 ## 2 A Female 108 933 0.116 ## 3 B Male 560 585 0.957 ## 4 B Female 25 585 0.0427 ## 5 C Male 325 918 0.354 ## 6 C Female 593 918 0.646 ## 7 D Male 417 792 0.527 ## 8 D Female 375 792 0.473 ## 9 E Male 191 584 0.327 ## 10 E Female 393 584 0.673 ## 11 F Male 373 714 0.522 ## 12 F Female 341 714 0.478 Correction There is a huge difference in application rates per department according to the Gender : males tend to apply more to departments A (88%) and B (95%) , women tend to apply more to departments C (64%) and E (67%). There is about a 50% gender rate application for departments D and F. When crossing these results with the results from the previous question, it seems that women tend to apply more to departments with lower application rates (30 and 25% respectively for department C and E). This may explain the huge difference in admission rates between men and women. ### Fancy data representation The paradox you have illustrated in the previous part is called the Simpson’s paradox. You can read more about it here. The goal of this section is to illustrate this paradox by plotting the sex admission rate (in percentage) per department. Question 5 Please observe the plot thereafter and reproduce it using ggplot2. include_graphics("resources/sex_admission_rate.png") Hints : 1. The color palette is “JAMA”, which comes from the package ‘ggsci’ loaded at the beginning of the homework. You can use the function “scale_color_jama()” or “scale_fill_jama()”. 2. You can use the function round to round the percentage of admission per sex. 3. You can use the function paste0 to concatenate in a single string the number of applications (Freq) and the sex admission rate in percentage. The special character ‘\n’ is used for line break. 4. To use geom bar with pre-computed Frequencies (and thus with an “x” and a “y” aesthetics) you should use stat = “identity”. 5. To ask geom_text to use a dodge position, you can set the argument “position” to the value position_dodge(width =1). 6. The argument “size” from the function geom_text enables to change the font size. #Step 1 : compute the pourcentage of admission rate per gender and per departments UCBAdmissions_sexrate <- UCBAdmissions %>% group_by(Dept, Gender) %>% mutate(percent_admission_sex = round(100* Freq / sum(Freq))) %>% filter(Admit == "Admitted") %>% mutate(label = paste0("n = ", Freq, "\n(", percent_admission_sex,"%)")) head(UCBAdmissions_sexrate) ## # A tibble: 6 x 6 ## # Groups: Dept, Gender [6] ## Admit Gender Dept Freq percent_admission_sex label ## <fct> <fct> <fct> <dbl> <dbl> <chr> ## 1 Admitted Male A 512 62 "n = 512\n(62%)" ## 2 Admitted Female A 89 82 "n = 89\n(82%)" ## 3 Admitted Male B 353 63 "n = 353\n(63%)" ## 4 Admitted Female B 17 68 "n = 17\n(68%)" ## 5 Admitted Male C 120 37 "n = 120\n(37%)" ## 6 Admitted Female C 202 34 "n = 202\n(34%)" #Step 2 : plot ggplot(data = UCBAdmissions_sexrate, aes(x = Dept, y = percent_admission_sex, fill = Gender)) + geom_bar(stat = "identity",position = "dodge") + scale_fill_jama() + geom_text(aes(label = label, y = percent_admission_sex/2), position = position_dodge(width=1), color = "white", size = 2) + theme_minimal() + xlab("Department") + ylab("Sex admission rate (in percentage)") + theme(legend.position = "bottom") + labs(title = "Admission rate at UC Berkeley in 1993 per sex and per department", subtitle = "Illustration of the Simpson's paradox", caption = "Data coming from UCBAdmissions dataset") #+  #ggsave("resources/sex_admission_rate.png") ## Exercise 2 ### Import data Exercise 2 relies on the dataset “relig_income” coming from the tidyverse package. The dataset store the results of a religion vs income survey : number of survey respondents in each category religion/income. The incomes are cut into 10k dollar ranges. Note that there are special categories for “missing” data : “Don’t know/ refused”. relig_income ## # A tibble: 18 x 11 ## religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k
##    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
##  1 Agnostic      27        34        60        81        76       137        122
##  2 Atheist       12        27        37        52        35        70         73
##  3 Buddhist      27        21        30        34        33        58         62
##  4 Catholic     418       617       732       670       638      1116        949
##  5 Don’t k…      15        14        15        11        10        35         21
##  6 Evangel…     575       869      1064       982       881      1486        949
##  7 Hindu          1         9         7         9        11        34         47
##  8 Histori…     228       244       236       238       197       223        131
##  9 Jehovah…      20        27        24        24        21        30         15
## 10 Jewish        19        19        25        25        30        95         69
## 11 Mainlin…     289       495       619       655       651      1107        939
## 12 Mormon        29        40        48        51        56       112         85
## 13 Muslim         6         7         9        10         9        23         16
## 14 Orthodox      13        17        23        32        32        47         38
## 15 Other C…       9         7        11        13        13        14         18
## 16 Other F…      20        33        40        46        49        63         46
## 17 Other W…       5         2         3         4         2         7          3
## 18 Unaffil…     217       299       374       365       341       528        407
## # … with 3 more variables: 100-150k <dbl>, >150k <dbl>, Don't ## # know/refused <dbl> The goal of this exercise is to plot the distribution of religious beliefs faceted by income ranges. You see that the way the data is presented (with the income ranges in different columns) prevents us from directly using “facet_wrap” or “group_by” function. One commonly says that the dataset is in a “wide” format. We will present two methods to override this issue : 1. Draw separately one plot per income range and gather all the plots using the package cowplot. Though being a good introduction to cowplot, we will see why this method proves to be very inefficient here. 2. Transform the dataset from a wide format to a long format with tidyverse tools in order to use “facet_wrap”. This method is much more recommended. ### Method 1 Question 1 Let’s focus for this question on a given income range (let’s say less than 10k dollars). Note that you surround the colnames by “” since they contain sepcial character (which is something you should definitely avoid when creating datasets). 1. Compute the rounded percentage of religious affiliations for this income range. Arrange the dataset per decreasing percentage and keep only the 5th top rows (you can use the function head) 2. Plot the distibution of the top 5 religious affiliations as a bar plot. Make sure to rename the y axis label properly as well as to choose a color palette for religion. Use geom_text to write the percentage of each religion for this income range on top of the bars (you can use vjust = 0 to align the text to the top). Remove the x axis text and x axis title since it is redondant with the legend and position the legend at the bottom of the plot. Once done, rename your ggplot object as relig_income_less_10k. #Step 1 relig_income_one_range <- relig_income %>% select(religion, <10k) %>%
mutate(percent = round(100 *<$10k / sum(<$10k))) %>%
arrange(desc(percent)) %>%

#Step 2
relig_income_less_10k = ggplot(data = relig_income_one_range, aes(x = religion, y = <$10k, fill = religion)) + geom_text(aes(label= paste0(percent,"%")), color = "black", vjust = 0) + geom_bar(stat = "identity") + scale_fill_npg(name = "Religion") + xlab("") + ylab("Count") + ggtitle("Distibution of the religous beliefs for <$10k incomes")+
theme(legend.position = "bottom",
axis.text.x = element_blank())
relig_income_less_10k

Question 2

Reproduce the code above for data income range “more than 150k dollars”. Call your plot relig_income_more_150k. What issue can you see with the color palette?

#Step 1
relig_income_one_range <- relig_income %>%
select(religion, >150k) %>%
mutate(percent = round(100 *>150k / sum(>150k))) %>%
arrange(desc(percent)) %>%

#Step 2
relig_income_more_150k = ggplot(data = relig_income_one_range, aes(x = religion, y = >150k, fill = religion)) +
geom_text(aes(label= paste0(percent,"%")), color = "black", vjust = 0) +
geom_bar(stat = "identity") +
scale_fill_npg(name = "Religion") +
xlab("") +
ylab("Count") +
ggtitle("Distibution of the religous beliefs for >*150k incomes")+
theme(legend.position = "bottom",
axis.text.x = element_blank())
relig_income_more_150k

Correction The top five religions differ among the two plots. Since the two plots were generated independently with the same color palette, the color affiliation changes : there is no common legend.

Question 3

Use the function plot_grid from the cowplot package to gather the two plots into one figure (on the same row). Are you satsified with the results?

#?plot_grid
plot_grid(relig_income_less_10k,
relig_income_more_150k,
nrow = 1 )

Correction : Plot_grid function is very useful to gather plots (easy to use, automatic alginements). Yet, the plots to gather must be drawn independently. As a result, there is no common legend in the final plot. Though being vey useful to gather plots coming from different sources into one figure, cowplot is less efficient that facet_wrap to stratify studies according to one variable (the income range in this case).

You have seen how to gather two plots into a single one using plot_grid. Acutually, plot_grid can take as argument a list of plots, meaning that we could have used it to draw bar plots for each level of incomes (using a for loop or a function such as lapply to generate the list of plots).

In next section, we will foccus on another method, much more useful in practice, to plot hte distribution of top 5 religious beliefs per income range.

### Method 2

The relig_income dataset is stored into a wide format. In this section, we are going to convert it into a “long” format meaning that we are going to create a new variable storing income ranges. The dataset in a long format will have only 3 columns (religion, income and count) but more rows than the original dataset in wide format : one row per couple religion/income range.

Question 4

Transform relig_income into relig_income_long using the function pivot_longer function from the tidyverse package.

?pivot_longer

relig_income_long <- relig_income %>%
pivot_longer(!religion,
names_to = "income",
values_to = "count")

nrow(relig_income_long) #18 religions * 10 income ranges = 180 
## [1] 180

Question 5

Compute the rounded percentage of people belonging to each religion per income range. Keep only the 5 top religions per income range (you can use group_by and slice_max function).

relig_income_long_top5 <- relig_income_long %>%
group_by(income) %>%
dplyr::mutate(percent = round(100*count/sum(count))) %>%
slice_max(order_by = count, n = 5)
relig_income_long_top5
## # A tibble: 50 x 4
## # Groups:   income [10]
##    religion                income count percent
##    <chr>                   <chr>  <dbl>   <dbl>
##  1 Evangelical Prot        <$10k 575 30 ## 2 Catholic <$10k    418      22
##  3 Mainline Prot           <$10k 289 15 ## 4 Historically Black Prot <$10k    228      12
##  5 Unaffiliated            <10k 217 11 ## 6 Mainline Prot >150k 634 24 ## 7 Catholic >150k 633 24 ## 8 Evangelical Prot >150k 414 16 ## 9 Unaffiliated >150k 258 10 ## 10 Jewish >150k 151 6 ## # … with 40 more rows Question 6 Plot the distribution of top 5 religions facetted by income ranges (using facet_wrap). Make sure to rename the y axis label properly as well as to choose a color palette for religion. Use geom_text to write the percentage of each religion for this income range on top of the bars (you can use vjust = 0 to align the text to the top). Remove the x axis text and title since it is redondant with the legend and position the legend at the bottom of the plot. ggplot(data = relig_income_long_top5, aes(x = religion, y = percent, fill = religion)) + geom_text(aes(label= paste0(percent,"%")), color = "black", vjust = 0) + geom_bar(stat = "identity") + scale_fill_npg(name = "Religion") + xlab("") + ylab("Count") + facet_wrap( .~ income) + ggtitle("Distribution of the religious beliefs")+ theme(legend.position = "bottom", axis.text.x = element_blank()) You should see that facet_wrap creates a common legend (a religion has the same color in all the plots) and leaves blanks when a religion does stand in the top 5 religion for a given income range. Question 7 We still have an issue concerning the order of the facet wrap plots which seems to follow the alphabetical order, while we would have liked it to be arranged from lower to higher incomes. One way to cope with this is to set the levels of the variable income (seen as a factor variable) either by hand or by using regular expressions. In this case, we can use the order defined by the columns of the wide relig_income dataset, which were already in the good order. Re-do the plot after arranging the levels of the variable income. relig_income_long_top5income <- factor(relig_income_long_top5\$income, levels = setdiff(colnames(relig_income),"religion"))

ggplot(data = relig_income_long_top5, aes(x = religion, y = percent, fill = religion)) +
geom_text(aes(label= paste0(percent,"%")), color = "black", vjust = 0) +
geom_bar(stat = "identity") +
scale_fill_npg(name = "Religion") +
xlab("") +
ylab("Count") +
facet_wrap( .~ income) +
ggtitle("Distribution of the religious beliefs")+
theme(legend.position = "bottom",
axis.text.x = element_blank())`