We start by loading a couple of packages for data manipulation, dimension reduction clustering, fancy representations, etc.
library(tidyverse) # opinionated collection of packages for data manipulation library(aricode) # fast computation of clustering measures theme_set(theme_bw()) # plots themes
We start with the sample of scRNA data presented during the course on non-linear dimension reduction approaches. It consists in the normalized gene-level expression of 100 representative genes for a collection of 301 cells spread in 11 cell-lines.
The data are imported as follows:
load("data/scRNA.RData") <- pollen$data %>% t() %>% as_tibble() %>% scRNA add_column(cell_type = pollen$celltypes)
kmeans) for 11 clusters a hundred of time and keep track of the within sum of squares.
We replicate 100 runs of the kmeans with a single start:
<- scRNA %>% select(-cell_type) %>% as.matrix() scRNA_matrix <- replicate(100, kmeans(scRNA_matrix, centers = 11, nstart = 1), simplify = FALSE)kmeans_out
Overall, there can be a variation of up to 50% of the total sum of squares for different initialization of the kmeans:
data.frame(WSS = map_dbl(kmeans_out, "tot.withinss"), run = as.character(1:100)) %>% ggplot() + aes(y = WSS/max(WSS)) + geom_boxplot() + ggtitle("Variation of the rescaled Total sum of squares")
At the cluster level, one can see that some clusters are a bit more stable (unstable) than others, probably with individual at the boundary, more difficult to affect to a given group.
map(kmeans_out, "withinss") %>% # tranformation to a named data frame/tibble do.call("rbind", .) %>% as_tibble() %>% setNames(paste0("cluster", 1:11)) %>% pivot_longer(everything(), names_to = "cluster", values_to = "WSS") %>% ggplot() + aes(x = cluster, y = WSS) + geom_boxplot() + ggtitle("Within sum of squares per cluster")
nstartset to 20. Save the results.
map of any other
apply-like functional is your friend ! We go up to 30 clusters and make 50 restart for each run, in order to get rather smooth ARI/NID curves:
<- 1:30 nb_clusters <- map(nb_clusters, ~kmeans(scRNA_matrix, centers = ., nstart = 50))kmeans_var_cluster
A consensus appear around 10 clusters: note that the curves are not perfectly smooth, partly due to convergence of the k-means algorithm to local minima.
<- map(kmeans_var_cluster, "cluster") %>% # extract the clusterings ARI map_dbl(aricode::ARI, scRNA$cell_type) ## compute the ARI <- map(kmeans_var_cluster, "cluster") %>% # extract the clusterings NID map_dbl(aricode::NID, scRNA$cell_type) ## compute the ARI tibble(ARI = ARI, NID = NID, nb_clusters = nb_clusters) %>% pivot_longer(-nb_clusters, values_to = "value", names_to = "measure") %>% group_by(measure) %>% ggplot() + aes(x = nb_clusters, y = value, color = measure) + geom_line() + ggtitle("K-means Clustering Performance")
cutree with Ward criterion do a similar job to the k-means:
<- dist(scRNA_matrix, method = "euclidean") %>% hclust(method = "ward.D2") hclust_var_cluster <- cutree(hclust_var_cluster, nb_clusters) %>% apply(2, ARI, scRNA$cell_type) ARI <- cutree(hclust_var_cluster, nb_clusters) %>% apply(2, NID, scRNA$cell_type) NID tibble(ARI = ARI, NID = NID, nb_clusters = nb_clusters) %>% pivot_longer(-nb_clusters, values_to = "value", names_to = "measure") %>% group_by(measure) %>% ggplot() + aes(x = nb_clusters, y = value, color = measure) + geom_line() + ggtitle("Hierarchical Clustering Performance")
We use the SNP data studied during homework #3. We analyze the 5500 most variant SNP for 728 individuals with various origin, with the following descriptors:
The data are imported as follows:
load("data/SNP.RData") <- data$Geno %>% as_tibble() %>% replace(is.na(.), 0) %>% snp add_column(origin = data$origin, .before = 1)
Try the various algorithms seen during the course (kmeans, hierarchical clustering, spectral clustering, kernel versions), possibly with a step of dimension reduction first (e.g. PCA, t-SNE, UMAP). Try to get the best result in terms of ARI/NID compared to the natural classification (the origin of the population). If you have time, assess the robustness of your result in terms of ARI by performing resampling in the initial table snp (i.e., by running the analysis several times on a random subsample of your data, say 80%, and then averaging your results in terms of ARI).
A generally good baseline is hierarchical clustering (to chose visually the number of cluster) + k-means. In this case, they gave similar results:
<- dist(select(snp, -origin)) %>% hclust(, method = "ward.D2") hclust_snp <- cutree(hclust_snp, 1:20) %>% apply(2, ARI, snp$origin) ARI_hc <- kmeans(select(snp, -origin), centers = which.max(ARI_hc), nstart = 20) kmeans_snp <- ARI(kmeans_snp$cluster, snp$origin) ARI_km cat("\nHierarchical clustering: ", max(ARI_hc))
## ## Hierarchical clustering: 0.8194639
cat("\nK-means clustering: ", max(ARI_km))
## ## K-means clustering: 0.7898888
It’s going to be hard to beat!
I have tried umap with custom configuration, which is doing not so bad:
library(umap) <- umap.defaults custom.settings $n_components <- 4 custom.settings$n_neighbors <- 20 custom.settings <- select(snp, -origin) %>% as.matrix() %>% umap_out umap(config = custom.settings) $layout %>% as.data.frame() %>% add_column(origin = snp$origin) %>% umap_outggplot(aes(x = V1, y = V2, color = origin)) + geom_point(size=1.25) + guides(colour = guide_legend(override.aes = list(size=6)))
ARI(kmeans(umap_out$layout, centers = 6, nstart = 100)$cluster, snp$origin)
##  0.68618
Finally, dimension reduction via SVD and kmeans on the projected points do a little better than hierarchical clustering.
<- select(snp, -origin) %>% scale(TRUE, FALSE) %>% svd(nv = 10) SVD <- SVD$u[, 1:10] %*% diag(SVD$d[1:10]) %>% data.frame() snp_proj %>% ggplot(aes(x = X1, y = X2, color = snp$origin)) + snp_proj geom_point(size=1.25) + guides(colour = guide_legend(override.aes = list(size=6)))
ARI(kmeans(snp_proj, centers = 6, nstart = 100)$cluster, snp$origin)
##  0.8364261