

Runtime warning by rank_genes_groups #653

Closed
bioguy2018 opened this issue May 21, 2019 · 22 comments

@bioguy2018

Dear all,
I am receiving the following runtime warning when I search for markers within my clusters using

sc.tl.rank_genes_groups

RuntimeWarning: invalid value encountered in log2
rankings_gene_logfoldchanges.append(np.log2(foldchanges[global_indices]))
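[For context, this is exactly the warning numpy emits when `np.log2` receives a negative or NaN value; a minimal sketch with made-up fold changes:]

```python
import numpy as np

# Minimal reproduction of the warning: log2 of a negative (or NaN) value is
# undefined, emits "RuntimeWarning: invalid value encountered in log2",
# and yields NaN for that entry.
foldchanges = np.array([4.0, 1.0, -2.0])
logfc = np.log2(foldchanges)  # warns because of the -2.0 entry
print(logfc)  # [ 2.  0. nan]
```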

The warning only appears after I subset my initial clustering to keep a few clusters, then rerun PCA and HVG selection on them and redo the clustering. The command still runs and I get results. Does anyone know why this happens only when I analyze the subset, and is it something I should worry about?

Thanks

@LuckyMD
Contributor
LuckyMD commented May 21, 2019

Hey!

Just something that crossed my mind... could it be that after subsetting you have 0 variance in some genes? You may have to rerun sc.pp.filter_genes() to take out genes that are 0 everywhere after subsetting. This would give you a NaN in the testing. Maybe check whether the genes you filter out are the ones that gave you the issues in the initial run.
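[A plain-numpy illustration of that failure mode, with toy values; `sc.pp.filter_genes(adata, min_cells=1)` does the equivalent bookkeeping on an AnnData object:]

```python
import numpy as np

# Toy matrix: rows are cells, columns are genes.
# Gene 1 was expressed only in cells that the subsetting removed,
# so it is now zero in every remaining cell.
X = np.array([[1.0, 0.0, 3.0],
              [2.0, 0.0, 4.0],
              [1.5, 0.0, 0.0]])

n_cells_expressing = (X > 0).sum(axis=0)  # per-gene count of expressing cells
keep = n_cells_expressing >= 1            # what min_cells=1 would keep
print(keep)  # [ True False  True] -> the all-zero gene gets dropped
```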

@bioguy2018
Author
bioguy2018 commented May 21, 2019

@LuckyMD Hi, thanks a lot for your reply. The thing is, I had already filtered my cells, and after preprocessing I imputed my data using MAGIC; the first clustering ran with no warning. I wonder how my values could have changed, since I just subset some clusters. If something were wrong with some gene values, it should have popped up before too, because I am taking the same genes into consideration for the subset analysis. That is very confusing to me. As long as it doesn't really affect my results I wouldn't mind, but I want to be sure about it and also find the reason behind it, and so far I am clueless.

@LuckyMD
Contributor
LuckyMD commented May 21, 2019

So my idea was the following:
If some genes in the full dataset are 0 everywhere except in the cells of cluster A, and you then filter out cluster A in your new dataset and recompute everything... those genes now have 0 variance in the filtered dataset. That would give you an error that didn't appear before.

@bioguy2018
Author
bioguy2018 commented May 21, 2019

@LuckyMD Dear Malte, thanks, now I get your point. I just filtered my cells and genes again using

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
but I still get the same warning message

@LuckyMD
Contributor
LuckyMD commented May 21, 2019

In that case my idea was wrong and that was not the cause of the warning. The "invalid value encountered" does sound like a NaN or a negative value, though... maybe check your results and look at the code to see what it could have been. Do you have NaN log2FC values?

@bioguy2018
Author
bioguy2018 commented May 21, 2019

@LuckyMD The thing is, after imputation I do get some negative values for sure, and I have observed them, but this warning was not popping up before. NaN I doubt, because otherwise I would have seen a warning message when I ran the imputation over all of my genes.

p.s. This is how I made my subset

adata_magic_sub=adata_magic[(adata_magic.obs.louvain_04=="3")|(adata_magic.obs.louvain_04=="7")|(adata_magic.obs.louvain_04=="8")|(adata_magic.obs.louvain_04=="11")]

@LuckyMD
Contributor
LuckyMD commented May 21, 2019

I think with that method of subsetting you are using a view of the AnnData object rather than a copy. That shouldn't be related to the issue, but if you want to work with the subset, I would add .copy() at the end. Also, does this give you the number of cells and genes you intended? I typically put a : for the genes to get something like adata[cell_filter, :].copy(). Not sure if that's necessary, though.

So I think the issue is that foldchanges[global_indices] has values that you can't feed into np.log2, i.e., NaN or negative values. A fold change is something like (average expression in condition 1) / (average expression in condition 2). If expression values can be negative, then one of those averages can be negative, giving a negative fold change. I would guess that these testing frameworks don't play well with negative values.

It is likely that this only pops up now because the average expression for, e.g., the "not cluster 3" data is now negative, where before it wasn't, as there was different data to average over.

If this is the issue, I'm not entirely sure what to do about it... fold changes aren't defined for such a case. I would either:

  1. Rescale the data to lie between two non-negative values.
  2. Set all negative values to 0.

You could take the code in the sc.tl.rank_genes_groups() function and calculate the fold changes for your genes step by step to see if this is the problem. I assume that it is.
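[A simplified sketch of that step-by-step check on toy values; this is not scanpy's exact implementation (which also exponentiates logged data before taking the ratio), but it shows how a negative mean in the "rest" group turns into a NaN log2 fold change:]

```python
import numpy as np

# Toy imputed expression: rows are cells, columns are genes.
# MAGIC-style imputation can leave small negative values.
X = np.array([[ 1.0,  2.0],
              [ 1.5, -0.5],
              [ 2.0, -0.7]])
in_group = np.array([True, False, False])  # cells in the cluster being tested

mean_group = X[in_group].mean(axis=0)  # per-gene mean inside the cluster
mean_rest = X[~in_group].mean(axis=0)  # per-gene mean over "the rest"
foldchanges = mean_group / mean_rest   # gene 2: positive / negative -> negative

with np.errstate(invalid="ignore"):
    logfc = np.log2(foldchanges)       # the negative ratio becomes NaN
print(foldchanges)  # second entry is negative
print(logfc)        # second entry is nan
```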

@bioguy2018
Author
bioguy2018 commented May 21, 2019

@LuckyMD Dear Malte, thanks a lot for your hint and reply.
Regarding the subset: I got the cells I needed with the same number of genes as before, so I assume it is fine.
I rescaled my data to 10 again, but unfortunately the same warning occurs!
I don't really know whether turning all negative values to zero makes sense, because, as I mentioned, I had negative values before and they were not a problem, and if I set all those negative values to zero I guess I would lose quite a few genes in my downstream analysis. What I can imagine is that those problematic values are not considered during the marker analysis anyway, so if they are NaN, setting them to zero wouldn't affect my result. My fear was mainly that those genes might really be something and that, due to a bug or a wrong command, I am not getting them, but I think that is unlikely. I guess I should leave it as it is. I will, though, try to take the sc.tl.rank_genes_groups function code and run it step by step.

@LuckyMD
Contributor
LuckyMD commented May 21, 2019

So the reason you didn't get this before would be that a sc.tl.rank_genes_groups() call with default parameters compares the expression in one cluster with the expression in the rest of the dataset. The "rest of the dataset" has changed, so you can now get -ve average expression for genes in the "rest of the dataset". This will likely create -ve fold changes, which cannot be logged and probably give you NaN. That hides a signal that is actually there; it is not that the gene isn't differentially expressed.

> I did a rescale of my data to 10 again but unfortunately the same warning is happening!

What do you mean by this?

Turning negative values to 0 doesn't mean you lose the data. You have some expression space, of which 0 is a valid value. The real question is what a negative expression value means after MAGIC. Is it just confidence that the gene is not expressed? Then setting it to zero makes sense.
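[If negative imputed values are read as "confidently not expressed", the clipping is one line; a sketch on a hypothetical toy matrix:]

```python
import numpy as np

# Toy imputed matrix with spurious negative values.
X = np.array([[ 1.2, -0.3],
              [-0.1,  2.0]])

# Treat negative imputed values as "confidently zero" and clip them;
# all genuinely positive expression values are left untouched.
X_clipped = np.clip(X, 0, None)
print(X_clipped)  # negatives become 0., positives unchanged
```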

Again... if you ignore this, you will just ignore particular genes which are likely differentially expressed, because MAGIC has rescaled their expression values in the "rest of the dataset" to a negative value.

@LuckyMD
Contributor
LuckyMD commented May 21, 2019

This is of course just another theory, but it makes sense to me at least. Going through the function step by step will let you find out whether this is actually the case.

@bioguy2018
Copy link
Author

@LuckyMD by rescaling to 10 I meant this

sc.pp.scale(adata_magic, max_value=10)

And regarding the negative values in MAGIC, this is what one of the creators has said about it:

> The negative values are an artifact of the imputation process, but the absolute values of expression are not really important, since normalized scRNA-seq data is only really a measure of relative expression anyway

TBH I am not sure whether their being an artifact means it's OK to set them to zero. Would that still be your suggestion?

@LuckyMD
Contributor
LuckyMD commented May 21, 2019

I guess if they regard the relative values as accurate, then scaling is the way forward. Does sc.pp.scale() leave any negative values?

@bioguy2018
Author

@LuckyMD Yes, you still get negative values after scaling.
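[That is expected: sc.pp.scale() z-scores each gene to zero mean and unit variance, so values below the gene's mean are negative by construction, and max_value=10 only caps large values. A plain-numpy sketch of the per-gene transform, on made-up values:]

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # expression of one gene across cells

# Per-gene z-scoring, roughly what sc.pp.scale() applies (before any clipping).
z = (x - x.mean()) / x.std()
print(z)         # values below the gene's mean come out negative
print(z.mean())  # ~0 by construction
```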

@LuckyMD
Contributor
LuckyMD commented May 21, 2019

Maybe rescale yourself then... just subtract the minimum expression value per gene from that gene (i.e., add its absolute value where the minimum is negative); then you at least get a rebasing of the expression values. Relative expression within genes will still be okay.
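[A sketch of that per-gene rebasing on toy values, shifting only genes whose minimum is negative so that all-positive genes keep their values:]

```python
import numpy as np

# Toy matrix: gene 0 is all positive, gene 1 has a negative minimum.
X = np.array([[ 1.0, -0.5],
              [ 2.0,  0.5]])

mins = X.min(axis=0)
# np.minimum(mins, 0) is 0 for genes that are already non-negative,
# so only genes with a negative minimum get shifted up to zero.
X_shifted = X - np.minimum(mins, 0)
print(X_shifted)  # gene 0 unchanged, gene 1 shifted by +0.5
```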

@bioguy2018
Author

I will try it out and see what comes up! Thanks a lot for all of your suggestions!

@LuckyMD
Contributor
LuckyMD commented May 21, 2019

good luck!

@bioguy2018
Author
bioguy2018 commented May 21, 2019

@LuckyMD OK, so I basically found the reason for it: it's simply the scaling! What I used to do was scale my data after imputation and HVG selection, then get the clusters, subset, and run HVG selection again. Apparently the scaled data causes the warning, but I am not sure why, because it doesn't turn any of my expression values into NaN when I check my adata.X, and negative values shouldn't be the source of the warning either, because I already had negative values after imputation and they didn't cause any warning. One thing I should also mention is that this warning also happens when I compute the HVGs, so it's not really cluster-specific.

@LuckyMD
Contributor
LuckyMD commented May 21, 2019

> So I think the issue is the foldchanges[global_indices] that has values that you can't input into np.log2. That can be NaN or negative values. A fold change is something like (average expression in condition 1)/(average expression in condition 2) if expression values can be negative, then one of those values can be negative, giving a negative fold change. I would guess that these testing frameworks don't play well with negative values.

I believe this is the reason why it happens. If one of those two averages is negative, then your fold change is negative, and you get the warning when feeding it into np.log2().

@bioguy2018
Author
bioguy2018 commented May 21, 2019

@LuckyMD Yeah, after some further investigation I agree with you! Do you already know a method I can use to scale my data to non-negative values? Then I could scale the data only once, right after imputation, and that should be fine for the rest of the downstream analysis, including the sub-clustering.

@LuckyMD
Contributor
LuckyMD commented May 22, 2019

I guess negative values can mean different things across imputation methods, so having one standard way to scale is maybe not the best approach. That being said, I would probably simply do this:

x'_g = x_g - min(x_g)   (per gene g)

Note that applied unconditionally, this would also shift genes down to a minimum of 0 whose expression values are all +ve and non-zero. To avoid that, only apply it to genes where the min() is -ve.

The above scaling solution keeps the relative scale between genes. If you prefer to scale to values between 0 and 1 (which I usually don't do, but others advocate; it ensures equal weighting between genes for PCA), you can also rescale by the expression range like this:

x'_g = (x_g - min(x_g)) / (max(x_g) - min(x_g))
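[A sketch of that per-gene min-max scaling on hypothetical values:]

```python
import numpy as np

# Toy matrix: rows are cells, columns are genes.
X = np.array([[ 1.0, -0.5],
              [ 3.0,  0.5]])

mins = X.min(axis=0)
ranges = X.max(axis=0) - mins
# Per-gene min-max scaling: x' = (x - min) / (max - min), mapping each
# gene onto [0, 1]. (A constant gene would give range 0; real data needs
# a guard against division by zero there.)
X01 = (X - mins) / ranges
print(X01)  # each column now spans exactly [0, 1]
```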

Overall though, I'm not a big fan of imputation... especially after this paper

@bioguy2018
Author

Interesting paper... the question is which is worse: false signals from the experiment or from the imputation.
Anyway, great. One last thing that came to mind: the whole thing may be happening because I am doing imputation on my already normalized and transformed data. Wouldn't it make more sense to do imputation on raw data after filtering, and then normalize and transform the data (or transform first and normalize afterwards)? That way I wouldn't face this problem and would also keep the original variability in my data.

@LuckyMD
Contributor
LuckyMD commented May 22, 2019

After imputation you no longer have the initial variability in your data. Ideally you have only the biologically relevant variability, but that's another question. Also, imputation methods take different inputs: our DCA method takes count data, while MAGIC takes pre-processed data, I think.
