[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adata.raw gets modified upon log normalization of adata #3073

Open
2 of 3 tasks
AlejandraRodelaRo opened this issue May 22, 2024 · 7 comments
Open
2 of 3 tasks

Adata.raw gets modified upon log normalization of adata #3073

AlejandraRodelaRo opened this issue May 22, 2024 · 7 comments
Assignees
Labels

Comments

@AlejandraRodelaRo
Copy link

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

I am working with a set of 2 10x scRNA samples. I read them, concatenated them and then I did basic filtering. I then used "adata.raw = adata" to freeze the counts on adata.raw before proceding. Then I ran:

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

To my surprise, when I check the adata.raw I see that the values have been also lognormized (and not only adata).
Is that how it is supposed to be? Is there any way to avoid this behavior ? I know I can store the raw counts in layers, I just want to understand how it works.

To check the data I used :
print(adata.raw.X[1:10,1:10])

Minimal code sample

#read the data
Data1_adata= sc.read_10x_mtx(
    '/Data_1/filtered_feature_bc_matrix',  
    var_names='gene_symbols', index)
    cache=True)   
#concatenate
adata = Data1_adata.concatenate(Data2_adata)
# save raw counts in raw slot.
adata.raw = adata  
# normalize to depth 10 000
sc.pp.normalize_total(adata, target_sum=1e4)

# logaritmize
sc.pp.log1p(adata)

#check adata.raw 
print(adata.raw.X[1:10,1:10])

Error output

No response

Versions

anndata     0.10.7
scanpy      1.10.0
-----
PIL                         8.4.0
anyio                       NA
arrow                       1.3.0
asttokens                   NA
attr                        23.2.0
attrs                       23.2.0
babel                       2.14.0
backcall                    0.2.0
bottleneck                  1.3.7
brotli                      NA
certifi                     2024.02.02
cffi                        1.16.0
chardet                     5.2.0
charset_normalizer          3.3.2
cloudpickle                 3.0.0
colorama                    0.4.6
comm                        0.2.1
cycler                      0.12.1
cython_runtime              NA
cytoolz                     0.12.3
dask                        2024.2.0
dateutil                    2.8.2
debugpy                     1.8.1
decorator                   5.1.1
defusedxml                  0.7.1
exceptiongroup              1.2.0
executing                   2.0.1
fastjsonschema              NA
fqdn                        NA
h5py                        3.7.0
idna                        3.6
igraph                      0.11.4
importlib_resources         NA
ipykernel                   6.29.2
ipywidgets                  8.1.2
isoduration                 NA
jedi                        0.19.1
jinja2                      3.1.3
joblib                      1.3.2
json5                       NA
jsonpointer                 2.4
jsonschema                  4.21.1
jsonschema_specifications   NA
jupyter_events              0.9.0
jupyter_server              2.12.5
jupyterlab_server           2.25.3
kiwisolver                  1.4.5
legacy_api_wrap             NA
leidenalg                   0.10.2
llvmlite                    0.40.0
louvain                     0.8.0
lz4                         4.3.3
markupsafe                  2.1.5
matplotlib                  3.8.0
matplotlib_inline           0.1.6
mpl_toolkits                NA
natsort                     8.4.0
nbformat                    5.9.2
nt                          NA
numba                       0.57.1
numexpr                     2.8.7
numpy                       1.23.0
overrides                   NA
packaging                   23.2
pandas                      1.5.3
parso                       0.8.3
patsy                       0.5.6
pickleshare                 0.7.5
pkg_resources               NA
platformdirs                4.2.0
plotly                      5.19.0
prometheus_client           NA
prompt_toolkit              3.0.42
psutil                      5.9.8
pure_eval                   0.2.2
pyarrow                     11.0.0
pycparser                   2.21
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.9.5
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.17.2
pynndescent                 0.5.11
pyparsing                   2.4.7
pythoncom                   NA
pythonjsonlogger            NA
pytz                        2024.1
pywin32_system32            NA
pywintypes                  NA
referencing                 NA
requests                    2.31.0
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rpds                        NA
scipy                       1.13.0
seaborn                     0.13.2
send2trash                  NA
session_info                1.0.0
six                         1.16.0
sklearn                     1.2.2
sniffio                     1.3.0
socks                       1.7.1
sphinxcontrib               NA
stack_data                  0.6.2
statsmodels                 0.14.0
tblib                       3.0.0
texttable                   1.7.0
threadpoolctl               3.3.0
tlz                         0.12.3
toolz                       0.12.1
tornado                     6.3.3
traitlets                   5.14.1
typing_extensions           NA
uri_template                NA
urllib3                     1.26.15
wcwidth                     0.2.13
webcolors                   1.13
websocket                   1.7.0
win32api                    NA
win32com                    NA
win32con                    NA
win32trace                  NA
winerror                    NA
yaml                        6.0.1
zipp                        NA
zmq                         25.1.2
zoneinfo                    NA
zope                        NA
zstandard                   0.22.0
-----
IPython             8.15.0
jupyter_client      8.6.0
jupyter_core        5.7.1
jupyterlab          4.1.1
notebook            7.1.0
-----
Python 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:29:04) [MSC v.1929 64 bit (AMD64)]
Windows-10-10.0.22631-SP0

@AlejandraRodelaRo AlejandraRodelaRo added Bug 🐛 Triage 🩺 This issue needs to be triaged by a maintainer labels May 22, 2024
@ilan-gold
Copy link
Contributor

@AlejandraRodelaRo I am not so familiar with raw as it somewhat predates my time on the project. I will say, though, that I don't think there is a copy made of the object when you assign it to raw, so the change makes sense (even if the raw name indicates otherwise). And looking at the code, when you assign to raw using an already made AnnData object, there is no copy made (from what I can tell). Whether or not this is a bug or feature is sort of out of my knowledge base.

cc @flying-sheep

@mssher07
Copy link
mssher07 commented Jun 4, 2024

Hi all, @flying-sheep @falexwolf

Wanted to echo Alejandro and highlight this is a critical bug, since nearly every function carries a use_raw flag, and the assumption that .raw contains counts is used explicitly or implicitly in numerous scanpy functions. We just realized that a massive dataset we've been processing for ~6 weeks also has no reads in the .raw despite saving it prior to log1p/normalize functions.

I am not sure it's helpful, but we see this bug in version scanpy 1.9.8, but in an old dataset/environment with scanpy 1.6.0, .raw correctly preserved counts.

@ivirshup
Copy link
Member
ivirshup commented Jun 5, 2024

My opinion would be that you need to write adata.raw = adata.copy() if you want a copy to be made, since almost all assignments do not create a copy of the assigned object in anndata. But we should look into whether this is a change that was made deliberately or not.

If we don't change it, we could maybe warn if we're mutating adata.X and adata.raw.X also refers to the same thing?

Overall, I would recommend that you use adata.layers["counts"] = adata.X.copy() instead of using .raw at all though.

@mssher07
Copy link
mssher07 commented Jun 6, 2024

My opinion would be that you need to write adata.raw = adata.copy() if you want a copy to be made, since almost all assignments do not create a copy of the assigned object in anndata. But we should look into whether this is a change that was made deliberately or not.

That makes python-sense. This is absolutely a change in convention though, see:

  1. The original scanpy tutorial
  2. The scVI tutorial (where they discuss needing to retain counts in raw)
  3. Most notably in the anndata API

In addition, both sc.pl.umap and sc.pl.paga_path() come to mind as functions that default to using the .raw layer

If we don't change it, we could maybe warn if we're mutating adata.X and adata.raw.X also refers to the same thing?

I think that's a good idea. In general, it would be very helpful to preserve in the anndata structure some record of the major transformations to .X (or any layer)

Overall, I would recommend that you use adata.layers["counts"] = adata.X.copy() instead of using .raw at all though.

This seems like good practice and the workaround we'll apply for now.

I do wonder if some change was made after this conversation which you were a part of. Thank you by the way, this package is an amazing tool.

@flying-sheep flying-sheep removed the Triage 🩺 This issue needs to be triaged by a maintainer label Jun 7, 2024
@AlejandraRodelaRo
Copy link
Author

Hello, thank you for your helpful answers.

Could someone please elaborate on when to use .copy() and when it is not needed?
In the original scanpy tutorial I see it in ocasions but not always when modifying adata.
For example:
image

@flying-sheep
Copy link
Member
flying-sheep commented Jun 20, 2024

It’s needed when you modify the AnnData object afterwards.

The above slices it twice, and only then copies it, because slicing isn’t a modification. So what’s happening is:

adata_orig = AnnData(...)

adata_sliced_view = adata_orig[..., :]
assert adata_sliced_view.is_view
adata_sliced_copy = adata_sliced_view[..., :].copy()
assert not adata_sliced_copy.is_view

do_modify(adata_sliced_copy)

The slicing could also have been done in one operation

adata = adata_orig[(adata.obs["n_genes_by_count"] < 2500) & (adata.obs["pct_counts_mt"] < 5), :].copy()

@AlejandraRodelaRo
Copy link
Author

So, does that mean that every time we apply some kind of filtration (adata = adata[ condition]) we should use .copy()?
For instance, when filtering the highly variable genes (see the image extracted from the scanpy legacy workflow)?
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants