Adata.raw gets modified upon log normalization of adata #3073

AlejandraRodelaRo · 2024-05-22T00:23:45Z

Please make sure these conditions are met

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of scanpy.
(optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

I am working with a set of 2 10x scRNA samples. I read them, concatenated them and then I did basic filtering. I then used "adata.raw = adata" to freeze the counts on adata.raw before proceding. Then I ran:

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

To my surprise, when I check the adata.raw I see that the values have been also lognormized (and not only adata).
Is that how it is supposed to be? Is there any way to avoid this behavior ? I know I can store the raw counts in layers, I just want to understand how it works.

To check the data I used :
print(adata.raw.X[1:10,1:10])

Minimal code sample

#read the data
Data1_adata= sc.read_10x_mtx(
    '/Data_1/filtered_feature_bc_matrix',  
    var_names='gene_symbols', index)
    cache=True)   
#concatenate
adata = Data1_adata.concatenate(Data2_adata)
# save raw counts in raw slot.
adata.raw = adata  
# normalize to depth 10 000
sc.pp.normalize_total(adata, target_sum=1e4)

# logaritmize
sc.pp.log1p(adata)

#check adata.raw 
print(adata.raw.X[1:10,1:10])

Error output

No response

Versions

anndata     0.10.7
scanpy      1.10.0
-----
PIL                         8.4.0
anyio                       NA
arrow                       1.3.0
asttokens                   NA
attr                        23.2.0
attrs                       23.2.0
babel                       2.14.0
backcall                    0.2.0
bottleneck                  1.3.7
brotli                      NA
certifi                     2024.02.02
cffi                        1.16.0
chardet                     5.2.0
charset_normalizer          3.3.2
cloudpickle                 3.0.0
colorama                    0.4.6
comm                        0.2.1
cycler                      0.12.1
cython_runtime              NA
cytoolz                     0.12.3
dask                        2024.2.0
dateutil                    2.8.2
debugpy                     1.8.1
decorator                   5.1.1
defusedxml                  0.7.1
exceptiongroup              1.2.0
executing                   2.0.1
fastjsonschema              NA
fqdn                        NA
h5py                        3.7.0
idna                        3.6
igraph                      0.11.4
importlib_resources         NA
ipykernel                   6.29.2
ipywidgets                  8.1.2
isoduration                 NA
jedi                        0.19.1
jinja2                      3.1.3
joblib                      1.3.2
json5                       NA
jsonpointer                 2.4
jsonschema                  4.21.1
jsonschema_specifications   NA
jupyter_events              0.9.0
jupyter_server              2.12.5
jupyterlab_server           2.25.3
kiwisolver                  1.4.5
legacy_api_wrap             NA
leidenalg                   0.10.2
llvmlite                    0.40.0
louvain                     0.8.0
lz4                         4.3.3
markupsafe                  2.1.5
matplotlib                  3.8.0
matplotlib_inline           0.1.6
mpl_toolkits                NA
natsort                     8.4.0
nbformat                    5.9.2
nt                          NA
numba                       0.57.1
numexpr                     2.8.7
numpy                       1.23.0
overrides                   NA
packaging                   23.2
pandas                      1.5.3
parso                       0.8.3
patsy                       0.5.6
pickleshare                 0.7.5
pkg_resources               NA
platformdirs                4.2.0
plotly                      5.19.0
prometheus_client           NA
prompt_toolkit              3.0.42
psutil                      5.9.8
pure_eval                   0.2.2
pyarrow                     11.0.0
pycparser                   2.21
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.9.5
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.17.2
pynndescent                 0.5.11
pyparsing                   2.4.7
pythoncom                   NA
pythonjsonlogger            NA
pytz                        2024.1
pywin32_system32            NA
pywintypes                  NA
referencing                 NA
requests                    2.31.0
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rpds                        NA
scipy                       1.13.0
seaborn                     0.13.2
send2trash                  NA
session_info                1.0.0
six                         1.16.0
sklearn                     1.2.2
sniffio                     1.3.0
socks                       1.7.1
sphinxcontrib               NA
stack_data                  0.6.2
statsmodels                 0.14.0
tblib                       3.0.0
texttable                   1.7.0
threadpoolctl               3.3.0
tlz                         0.12.3
toolz                       0.12.1
tornado                     6.3.3
traitlets                   5.14.1
typing_extensions           NA
uri_template                NA
urllib3                     1.26.15
wcwidth                     0.2.13
webcolors                   1.13
websocket                   1.7.0
win32api                    NA
win32com                    NA
win32con                    NA
win32trace                  NA
winerror                    NA
yaml                        6.0.1
zipp                        NA
zmq                         25.1.2
zoneinfo                    NA
zope                        NA
zstandard                   0.22.0
-----
IPython             8.15.0
jupyter_client      8.6.0
jupyter_core        5.7.1
jupyterlab          4.1.1
notebook            7.1.0
-----
Python 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:29:04) [MSC v.1929 64 bit (AMD64)]
Windows-10-10.0.22631-SP0

The text was updated successfully, but these errors were encountered:

ilan-gold · 2024-05-28T08:28:57Z

@AlejandraRodelaRo I am not so familiar with raw as it somewhat predates my time on the project. I will say, though, that I don't think there is a copy made of the object when you assign it to raw, so the change makes sense (even if the raw name indicates otherwise). And looking at the code, when you assign to raw using an already made AnnData object, there is no copy made (from what I can tell). Whether or not this is a bug or feature is sort of out of my knowledge base.

cc @flying-sheep

mssher07 · 2024-06-04T14:12:52Z

Hi all, @flying-sheep @falexwolf

Wanted to echo Alejandro and highlight this is a critical bug, since nearly every function carries a use_raw flag, and the assumption that .raw contains counts is used explicitly or implicitly in numerous scanpy functions. We just realized that a massive dataset we've been processing for ~6 weeks also has no reads in the .raw despite saving it prior to log1p/normalize functions.

I am not sure it's helpful, but we see this bug in version scanpy 1.9.8, but in an old dataset/environment with scanpy 1.6.0, .raw correctly preserved counts.

ivirshup · 2024-06-05T17:25:01Z

My opinion would be that you need to write adata.raw = adata.copy() if you want a copy to be made, since almost all assignments do not create a copy of the assigned object in anndata. But we should look into whether this is a change that was made deliberately or not.

If we don't change it, we could maybe warn if we're mutating adata.X and adata.raw.X also refers to the same thing?

Overall, I would recommend that you use adata.layers["counts"] = adata.X.copy() instead of using .raw at all though.

mssher07 · 2024-06-06T00:43:37Z

My opinion would be that you need to write adata.raw = adata.copy() if you want a copy to be made, since almost all assignments do not create a copy of the assigned object in anndata. But we should look into whether this is a change that was made deliberately or not.

That makes python-sense. This is absolutely a change in convention though, see:

The original scanpy tutorial
The scVI tutorial (where they discuss needing to retain counts in raw)
Most notably in the anndata API

In addition, both sc.pl.umap and sc.pl.paga_path() come to mind as functions that default to using the .raw layer

If we don't change it, we could maybe warn if we're mutating adata.X and adata.raw.X also refers to the same thing?

I think that's a good idea. In general, it would be very helpful to preserve in the anndata structure some record of the major transformations to .X (or any layer)

Overall, I would recommend that you use adata.layers["counts"] = adata.X.copy() instead of using .raw at all though.

This seems like good practice and the workaround we'll apply for now.

I do wonder if some change was made after this conversation which you were a part of. Thank you by the way, this package is an amazing tool.

AlejandraRodelaRo · 2024-06-19T15:39:21Z

Hello, thank you for your helpful answers.

Could someone please elaborate on when to use .copy() and when it is not needed?
In the original scanpy tutorial I see it in ocasions but not always when modifying adata.
For example:

flying-sheep · 2024-06-20T11:34:54Z

It’s needed when you modify the AnnData object afterwards.

The above slices it twice, and only then copies it, because slicing isn’t a modification. So what’s happening is:

adata_orig = AnnData(...)

adata_sliced_view = adata_orig[..., :]
assert adata_sliced_view.is_view
adata_sliced_copy = adata_sliced_view[..., :].copy()
assert not adata_sliced_copy.is_view

do_modify(adata_sliced_copy)

The slicing could also have been done in one operation

adata = adata_orig[(adata.obs["n_genes_by_count"] < 2500) & (adata.obs["pct_counts_mt"] < 5), :].copy()

AlejandraRodelaRo · 2024-06-26T19:21:28Z

So, does that mean that every time we apply some kind of filtration (adata = adata[ condition]) we should use .copy()?
For instance, when filtering the highly variable genes (see the image extracted from the scanpy legacy workflow)?

AlejandraRodelaRo added Bug 🐛 Triage 🩺 This issue needs to be triaged by a maintainer labels May 22, 2024

ilan-gold assigned flying-sheep May 28, 2024

flying-sheep removed the Triage 🩺 This issue needs to be triaged by a maintainer label Jun 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adata.raw gets modified upon log normalization of adata #3073

Adata.raw gets modified upon log normalization of adata #3073

Adata.raw gets modified upon log normalization of adata #3073

Adata.raw gets modified upon log normalization of adata #3073

Comments

Please make sure these conditions are met

What happened?

Minimal code sample

Error output

Versions