-
Notifications
You must be signed in to change notification settings - Fork 583
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scanpy Clustering #670
Comments
I only can advice you on your second part of questions there is no rule of thumb for that. I also don't know what do you exactly mean by best suggestion resolution and how did you assess that. This is a general problem for many supervised clustering methods such as k-mean that user has to provide number of clusters or in this case the resolution which determines the number of clusters. Although there are some indirect ways to assess the clustering quality for example silhouette coefficient which gives you a score between -1 to 1 that tell you how similar your point in each clusters are. The other possibility is that you already expect the number of clusters so you can optimize the resolution based on your previous knowledge. |
Thanks @bioguy2018 for your kind reply. Actually I was confused as for Pbmc3k, Scanpy and previous version of Seurat says it has 8 clusters, but in new version of Seurat clusters are 9. I think it's totally based on biological knowledge rather than optimizing paramters, like resolution. It will be great to add something like silhouette coefficient in scanpy. |
You can easily access the silhouette coefficient via scikit-learn. I would be hesitant to base optimal numbers of clusters on the silhouette coefficient though. The number of clusters is typically dependent on the biological question of interest. There's not really a scale at which all biological questions can be answered. Therefore you have a resolution parameter to check multiple resolutions. For example, T cells could be taken as one cluster or sub-clustered into CD4+ and CD8+ (which is typically done). Here a problem with the silhouette coefficient also shows: often you have one big cluster of T-cells which reluctantly cluster into the CD4+ and CD8+ subtypes (early 10X datasets show this nicely). This will have a lower silhouette coefficient, but it is probably more informative for many people. |
@LuckyMD I am wondering have you tried it before? or did someone try it with the pbmc data with regards to the t-cells that you mentioned? I agree with you that it can not be an option for answering biological questions but in the case that there is no clear biological knowledge like looking for new sub-types or so it can be an option to get some idea (mathematically). p.s. it just crossed my mind so I am pushing a technical question also here 😄 Since we would need to calculate the distance matrix, would the input for the function be adata.X ? should it be the raw file? |
@LuckyMD I think we cann't use Silhouette co-efficient for the data like single cell. Where the are chances we have clusters with few points and silhouette won't be able to detect it separate cluster. E.g. in Pbmc3k dataset 'Megakaryocytes' and 'Dendritic' cell type will not be marked as a separate cluster by using your suggested co-efficient. |
@grst Thanks it seems logical, but, It is mentioned in Seurat Pbmc3k example that best resolution parameter is 0.6-1.2 , but you used less and get more clusters. May be because i didn't explore random seed in leiden. |
I suspect this is because I processed the dataset with |
@Khalid-Usman Of course you can use silhouette coefficient for any kind of clustering in principal. you just need to choose the "best option" depending on your dataset which is again depending on what you are interested in and then you can validate it by looking into your clusters markers. I am actually very curious to see the T-cells case that was mentioned here....you can also have a look here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set |
It would be nice if scanpy could provide the maximum modularity obtained in the Louvain and Leiden Algorithm. |
Is there anything like clustree in python that integrates nicely with scanpy? |
I have resorted to writing a small Rscript that takes a saved adata.h5ad file as input, loads it using Here is the script I use in case it helps: suppressPackageStartupMessages({
library(reticulate)
library(SingleCellExperiment)
library(glue)
library(clustree)
})
sc <- import("scanpy")
args <- commandArgs(trailingOnly = TRUE)
H5AD_PATH = args[1]
OUT_PATH = args[2]
print(glue("H5AD_PATH: {H5AD_PATH}"))
print(glue("OUT_PATH: {OUT_PATH}"))
load_adata = function(h5ad_path) {
adata <- sc$read_h5ad(h5ad_path)
return(adata)
}
count_clusterings = function(adata){
# Ryan suggests:
# length(grep("leiden",names(adata$obs)))
clusterings = c()
for (x in adata$obs_keys()){
if (startsWith(x, "leiden")){
clusterings = append(clusterings, x)
}
}
return(length(clusterings))
}
set_fig_dimensions = function(num_clusterings){
width = 10
height = (0.6 * num_clusterings)
if (height < 8){
height = 8
}
png(width = width, height = height)
options(repr.plot.width = width, repr.plot.height = height)
return(list(width=width,height=height))
}
adata = load_adata(h5ad_path=H5AD_PATH)
dims = set_fig_dimensions(num_clusterings = count_clusterings(adata))
# dims
# options(repr.plot.width = 10, repr.plot.height = 10)
g = clustree(
x=adata$obs,
prefix="leiden_",
# suffix = NULL,
# metadata = NULL,
# count_filter = 0,
# prop_filter = 0.1,
# layout = "sugiyama",
# layout = "tree",
# use_core_edges = FALSE,
# highlight_core = FALSE,
# node_colour = prefix,
# node_colour_aggr = NULL,
# node_size = "size",
# node_size_aggr = NULL,
# node_size_range = c(4, 15),
# node_alpha = 1,
# node_alpha_aggr = NULL,
# node_text_size = 3,
# scale_node_text = FALSE,
# node_text_colour = "black",
# node_label = NULL,
# node_label_aggr = NULL,
# node_label_size = 3,
# node_label_nudge = -0.2,
# edge_width = 1.5,
# edge_arrow = TRUE,
# edge_arrow_ends = c("last", "first", "both"),
# show_axis = FALSE,
# return = c("plot", "graph", "layout"),
) #+ scale_color_brewer(palette = "Set3")
g
ggsave(
OUT_PATH,
width = dims$width,
height = dims$height,
dpi = 100
)
|
@lazappi this comment was made for you to answer ;) |
I don't have a lot to add. As far as I know there's no native Python implementation of clustering trees. If you want to use the R {clustree} package you will need to transfer your data from R to Python in some way. Using straight {reticulate} to read a If you only want a basic clustering tree you could just transfer the clustering assignments (by saving to CSV for example). That would probably be easier/quicker than transferring the whole dataset but you would lose the opportunity to overlay other information such as marker gene expression (which is often really helpful). Unless you plan ahead and append that to whatever you save to disk. Sorry, that was longer than I thought 😸! Hope it helps. |
Thank you both for taking the time to answer. I was hoping there is a python port/re-implementation of clustree, but this is clearly not the case. Nonetheless, the workarounds you suggest are helpful |
Hopefully I am not too out of date to ask this question. Extending on this discussion, I was wondering how a few of you @bioguy2018 @Khalid-Usman @LuckyMD calculate the Silhouette Scores for your graphs? The simplest way I can think of to extract the vectors required for the calculation will be to use the adjacency matrices as vectors. However, I quickly run into memory issues on large datasets with >= 100K nodes? (Each vector will contain 100K elements) I couldn't even load the matrix into memory to perform any form of dimension reduction. |
Hi @lvxhnat, You can check how we use silhouette scores here: https://github.com/theislab/scib |
Hi,
I have few queries regarding scanpy.
As scanpy is using Louvain Leiden algorithms for clustering which optimize modularity 'Q', so how we can access and print modularity funciton?
Resolution parameter gave us different number of clusters when we iterated between the best suggested 0.6-1.2. So how we know best number of cluster and how we can choose the optimum value of parameter 'resolution'?
Thanks,
Khalid
The text was updated successfully, but these errors were encountered: