Scanpy Clustering #670

Open
Khalid-Usman opened this issue May 30, 2019 · 18 comments

@Khalid-Usman

Hi,

I have a few queries regarding scanpy.

  1. Since scanpy uses the Louvain/Leiden algorithms for clustering, which optimize the modularity 'Q', how can we access and print the modularity value?

  2. The resolution parameter gives a different number of clusters as we iterate over the suggested range of 0.6-1.2. How do we know the best number of clusters, and how can we choose the optimal value of the 'resolution' parameter?

Thanks,
Khalid

@bioguy2018
bioguy2018 commented Jun 2, 2019

I can only advise you on the second part of your question: there is no rule of thumb for that. I also don't know what exactly you mean by the best suggested resolution, or how you assessed that. This is a general problem for many clustering methods, such as k-means, where the user has to provide the number of clusters, or in this case the resolution, which determines the number of clusters. There are, however, some indirect ways to assess clustering quality, for example the silhouette coefficient, which gives you a score between -1 and 1 that tells you how similar the points within each cluster are. The other possibility is that you already expect a certain number of clusters, so you can tune the resolution based on your prior knowledge.
@falexwolf Out of curiosity, could we integrate methods like the silhouette coefficient into scanpy? That would be cool!

@Khalid-Usman
Author

Thanks @bioguy2018 for your kind reply. Actually, I was confused because for pbmc3k, scanpy and the previous version of Seurat report 8 clusters, but the new version of Seurat reports 9. I think it's ultimately based on biological knowledge rather than on optimizing parameters like the resolution.

It would be great to add something like the silhouette coefficient to scanpy.

@LuckyMD
Contributor
LuckyMD commented Jun 2, 2019

You can easily access the silhouette coefficient via scikit-learn.

I would be hesitant to base the optimal number of clusters on the silhouette coefficient though. The number of clusters typically depends on the biological question of interest; there isn't really a single scale at which all biological questions can be answered, which is why you have a resolution parameter to check multiple resolutions. For example, T cells could be taken as one cluster or sub-clustered into CD4+ and CD8+ (which is typically done). This also exposes a problem with the silhouette coefficient: often you have one big cluster of T cells that only reluctantly splits into the CD4+ and CD8+ subtypes (early 10X datasets show this nicely). That clustering will have a lower silhouette coefficient, but it is probably more informative for many people.
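For what it's worth, a minimal sketch of that with scikit-learn (my own illustration; it assumes PCA and a 'leiden' clustering are already stored in adata, and uses the embedding rather than raw adata.X):

from sklearn.metrics import silhouette_score

# silhouette on the low-dimensional embedding used for the neighbour graph,
# not on the raw count matrix
score = silhouette_score(
    adata.obsm["X_pca"],   # cells x principal components
    adata.obs["leiden"],   # cluster labels
    metric="euclidean",
)
print(f"mean silhouette coefficient: {score:.3f}")  # in [-1, 1]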

@bioguy2018
bioguy2018 commented Jun 2, 2019

@LuckyMD I am wondering, have you tried it before? Or has someone tried it on the pbmc data with regard to the T cells you mentioned? I agree with you that it cannot answer biological questions on its own, but when there is no clear biological knowledge, e.g. when looking for new sub-types, it can be an option for getting some idea (mathematically).

p.s. It just crossed my mind, so I am pushing a technical question here as well 😄 Since we would need to calculate the distance matrix, would the input for the function be adata.X? Should it be the raw data?

@grst
Contributor
grst commented Jun 3, 2019

Some notes/observations from my side towards choosing the proper resolution:

  • the Leiden algorithm depends on a random seed: with a different random seed, you might get a different number of clusters at the same resolution
  • a sensible resolution depends on the input data: when clustering on data processed with sc.tl.diffmap, a much lower resolution gives the same number of clusters as without it
  • I performed a hyperparameter search for the resolution (steps of 0.005) on a large dataset of CD8+ T cells and observed that over certain resolution ranges, the number of clusters is stable. In my case, I was looking for subtypes of CD8+ T cells and hypothesized that at ~0.1 and ~0.3 I would find something biologically meaningful. It would be interesting to redo this on the PBMC dataset; I would expect a plateau at a resolution that recovers the well-known cell types (CD8+, CD4+, etc.)

Fig: hyperparameter search for the resolution in steps of 0.005, plotting the resolution vs. the detected number of Leiden clusters.
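A rough sketch of such a resolution scan (my own illustration; it assumes adata has been preprocessed and sc.pp.neighbors(adata) has been run, and writes one leiden_<res> column per resolution):

import numpy as np
import scanpy as sc

n_clusters = {}
for res in np.round(np.arange(0.005, 1.5, 0.005), 3):
    key = f"leiden_{res}"
    sc.tl.leiden(adata, resolution=res, key_added=key)
    n_clusters[res] = adata.obs[key].nunique()

# plateaus of n_clusters over a range of resolutions hint at stable clusterings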

@Khalid-Usman
Author

@LuckyMD I think we can't use the silhouette coefficient for data like single-cell, where there is a chance of clusters with very few points that the silhouette won't detect as separate clusters. E.g., in the pbmc3k dataset, the 'Megakaryocyte' and 'Dendritic' cell types would not be marked as separate clusters using the suggested coefficient.

@Khalid-Usman
Author

@grst Thanks, that seems logical, but:

The Seurat pbmc3k example says the best resolution parameter is 0.6-1.2, yet you used less and got more clusters. Maybe that's because I didn't explore the random seed in Leiden.
In Louvain and Leiden we usually optimize the 'modularity' value. What if, instead of optimizing it at a given resolution, we calculated the modularity for a range of resolutions and then picked the resolution where it is maximal? Is that OK? But I also think that modularity increases when we have a small number of clusters. Any suggestions?

@grst
Contributor
grst commented Jun 3, 2019

The Seurat pbmc3k example says the best resolution parameter is 0.6-1.2, yet you used less and got more clusters. Maybe that's because I didn't explore the random seed in Leiden.

I suspect this is because I processed the dataset with sc.tl.diffmap. The random seed tends to make only minor differences (+/- 1 cluster). As far as I can tell, the resolution parameter really is dataset-dependent, but maybe someone with better knowledge of the actual algorithm can comment on this.

@bioguy2018
bioguy2018 commented Jun 3, 2019

@Khalid-Usman In principle you can use the silhouette coefficient for any kind of clustering. You just need to choose the "best option" for your dataset, which again depends on what you are interested in, and you can then validate it by looking at your cluster markers. I am actually very curious to see the T-cell case that was mentioned here. You can also have a look here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
Again, I would like to mention that such control parameters are mathematical methods for assessing clustering quality; they may have nothing to do with the biological side, but they can be helpful when you have no clue about the number of clusters to look for, so that you have to rely on the mathematics alone. Your question is really topic-specific: what you are looking for, in which setting you want to assess your data, and whether you already have an estimate or not. There are also other ways to look at the quality of the data, as @grst mentioned, but again it really depends on what you are looking into.

@Siliegia
Siliegia commented Sep 6, 2019

It would be nice if scanpy could provide the maximum modularity obtained by the Louvain and Leiden algorithms.

@gokceneraslan
Collaborator
gokceneraslan commented Sep 7, 2019

@Siliegia I actually needed this many times. Here is a PR: #819
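In the meantime, one workaround is to run leidenalg directly on the neighbour graph and read off the quality of the partition it optimizes. A rough sketch (my own, untested here; it assumes sc.pp.neighbors has been run, and uses RBConfigurationVertexPartition, scanpy's default partition type for Leiden):

import igraph as ig
import leidenalg
from scipy.sparse import triu

# build an undirected igraph from the kNN connectivities
adj = triu(adata.obsp["connectivities"]).tocoo()  # upper triangle: one edge per pair
g = ig.Graph(n=adj.shape[0], edges=list(zip(adj.row, adj.col)), directed=False)
g.es["weight"] = adj.data

part = leidenalg.find_partition(
    g,
    leidenalg.RBConfigurationVertexPartition,
    weights="weight",
    resolution_parameter=1.0,
)
print(part.quality())  # the objective value the optimizer maximized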

@chris-rands
Contributor

Is there anything like clustree in python that integrates nicely with scanpy?

@xguse
xguse commented Feb 24, 2021

Is there anything like clustree in python that integrates nicely with scanpy?

I have resorted to writing a small R script that takes a saved adata.h5ad file as input, loads it using reticulate, runs clustree, and saves the figure. I then run the script from a notebook using invoke.run from the invoke package and load the output figure as an image in the notebook.

Here is the script I use in case it helps:

suppressPackageStartupMessages({
    library(reticulate)
    library(SingleCellExperiment)
    library(glue)
    library(clustree)
    library(ggplot2)  # used for ggsave() below
})
sc <- import("scanpy")


args <- commandArgs(trailingOnly = TRUE)
H5AD_PATH = args[1]
OUT_PATH = args[2]

print(glue("H5AD_PATH: {H5AD_PATH}"))
print(glue("OUT_PATH: {OUT_PATH}"))



load_adata = function(h5ad_path) {
    adata <- sc$read_h5ad(h5ad_path)
    return(adata)
}


count_clusterings = function(adata){
    # Ryan suggests:
    # length(grep("leiden",names(adata$obs)))

    clusterings = c()
    for (x in adata$obs_keys()){
        if (startsWith(x, "leiden")){
            clusterings = append(clusterings, x)
        }
    }
    
    return(length(clusterings))
}


set_fig_dimensions = function(num_clusterings){
    width = 10
    height = max(0.6 * num_clusterings, 8)

    # set the display size for notebook output; the file itself is
    # written by ggsave() below
    options(repr.plot.width = width, repr.plot.height = height)

    return(list(width = width, height = height))
}


adata = load_adata(h5ad_path=H5AD_PATH)

dims = set_fig_dimensions(num_clusterings = count_clusterings(adata))
# dims

# options(repr.plot.width = 10, repr.plot.height = 10)

g = clustree(
    x = adata$obs,
    prefix = "leiden_"
    # see ?clustree for the many other layout/appearance arguments,
    # e.g. node_colour, node_size, edge_width, layout
) #+ scale_color_brewer(palette = "Set3")

g

ggsave(
    OUT_PATH, 
    width = dims$width, 
    height = dims$height, 
    dpi = 100
)

I then run it like this from the notebook:
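Roughly like this (a sketch; the script and output file names are hypothetical placeholders, and run comes from the invoke package):

from invoke import run
from IPython.display import Image

# run the R script above, then display its output inline in the notebook
run("Rscript clustree_report.R results/adata.h5ad figures/clustree.png")
Image("figures/clustree.png")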

@LuckyMD
Contributor
LuckyMD commented Feb 24, 2021

@lazappi this comment was made for you to answer ;)

@lazappi
Contributor
lazappi commented Feb 25, 2021

I don't have a lot to add. As far as I know there's no native Python implementation of clustering trees. If you want to use the R {clustree} package you will need to transfer your data from Python to R in some way. Using straight {reticulate} to read a .h5ad file like you have here is one option, but there are packages that will do it for you, including {zellkonverter}, {anndata} and {SeuratDisk}. Once you have a SingleCellExperiment or Seurat object you can plug it directly into {clustree}. It should also be possible to call {clustree} from Python using anndata2ri, but I'm not sure of the details of how to do that.

If you only want a basic clustering tree, you could just transfer the clustering assignments (by saving them to CSV, for example). That would probably be easier/quicker than transferring the whole dataset, but you would lose the opportunity to overlay other information such as marker gene expression (which is often really helpful), unless you plan ahead and append that to whatever you save to disk.
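The CSV route might look like this on the Python side (a sketch; the column prefix and file name are placeholders):

# keep only the clustering columns, one per resolution
cols = [c for c in adata.obs.columns if c.startswith("leiden_")]
adata.obs[cols].to_csv("clusterings.csv")
# then, in R: clustree(read.csv("clusterings.csv", row.names = 1), prefix = "leiden_")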

Sorry, that was longer than I thought 😸! Hope it helps.

@chris-rands
Contributor

Thank you both for taking the time to answer. I was hoping there was a Python port/re-implementation of clustree, but that's clearly not the case. Nonetheless, the workarounds you suggest are helpful.

@lvxhnat
lvxhnat commented Mar 13, 2023

Hopefully I am not too out of date to ask this question. Extending this discussion, I was wondering how some of you @bioguy2018 @Khalid-Usman @LuckyMD calculate silhouette scores for your graphs? The simplest way I can think of to extract the vectors required for the calculation would be to use the rows of the adjacency matrix as vectors. However, I quickly run into memory issues on large datasets with >= 100K nodes (each vector would contain 100K elements); I couldn't even load the matrix into memory to perform any form of dimensionality reduction.

@LuckyMD
Contributor
LuckyMD commented Mar 13, 2023

Hi @lvxhnat,

You can check how we use silhouette scores here: https://github.com/theislab/scib
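As for the memory issue: a common trick is to compute the silhouette on a low-dimensional embedding (e.g. X_pca) rather than on the adjacency matrix, and to let scikit-learn subsample cells. A sketch (parameters are illustrative):

from sklearn.metrics import silhouette_score

score = silhouette_score(
    adata.obsm["X_pca"],   # ~50 dims per cell instead of 100K
    adata.obs["leiden"],
    sample_size=10_000,    # subsample cells to bound memory and runtime
    random_state=0,
)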
