
TypeError: ufunc 'log1p' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule ''same_kind'' #435

Closed
ajitjohnson opened this issue Jan 18, 2019 · 24 comments · Fixed by #1400

Comments

@ajitjohnson

Hi there,

I seem to have trouble analyzing a dataset.

https://www.dropbox.com/s/og5lw42chh2qujm/Trial_data1.csv?dl=0

If I run sc.pp.log1p(adata), I get the following error.

TypeError: ufunc 'log1p' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule ''same_kind''

If I normalize it on my own and then, say, run a PCA and try to plot the results, I get the following error.

sc.pl.pca(adata, color = 'DAPI')
TypeError: object of type 'numpy.int64' has no len()

If I plot it without the color, it does work though.

Below is how I go from CSV to AnnData

Read File

x = pd.read_csv('Trial_data1.csv', delimiter=',', index_col=0)

Convert to AnnData

file_url = 'https://raw.githubusercontent.com/ajitjohnson/Jupyter-Notebooks/master/py_scripts/mi_pp_anndata.py'
exec(open(wget.download(file_url)).read())
adata = mi_pp_anndata(x)

Any help is appreciated. Thank you.

@falexwolf
Member

This is really strange, have you had problems with other datasets? Is your dataset corrupted in some way?

@ajitjohnson
Author
ajitjohnson commented Jan 22, 2019

Hi Alex,
I managed to get the log working by using your function to convert to AnnData rather than mine. (adata = sc.AnnData(x))

However, coloring the plots still does not work. I get the following error.
TypeError: object of type 'numpy.int64' has no len()

You can reproduce the error by the following

Load Data

x = pd.read_csv('Trial_data.csv', delimiter=',', index_col=0)

Drop DAPI

x = x.drop(list(x.filter(regex='DAPI.', axis=1)), axis=1)

Convert to AnnData

adata = sc.AnnData(x)

Filter cells

sc.pp.filter_cells(adata, min_genes=1)
sc.pp.filter_genes(adata, min_cells=1)
adata.obs['n_counts'] = adata.X.sum(axis=1)

Normalize data

sc.pp.log1p(adata)

PCA

sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca(adata)
sc.pl.pca(adata, color='CD3D')

I also tried it on a different dataset.

@falexwolf
Member

Could you post the full error traceback so that I see where the error is raised?

@ajitjohnson
Author
ajitjohnson commented Jan 23, 2019

Hi Alex,

Below is the error I get. Thank you for looking at this.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-4fad8adf5d00> in <module>
----> 1 sc.pl.pca(adata, color='CD3D')

~\AppData\Local\Continuum\anaconda3\lib\site-packages\scanpy\plotting\tools\scatterplots.py in pca(adata, **kwargs)
    148     If `show==False` a `matplotlib.Axis` or a list of it.
    149     """
--> 150     return plot_scatter(adata, basis='pca', **kwargs)
    151 
    152 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\scanpy\plotting\tools\scatterplots.py in plot_scatter(adata, color, use_raw, sort_order, edges, edges_width, edges_color, arrows, arrows_kwds, basis, groups, components, projection, color_map, palette, size, frameon, legend_fontsize, legend_fontweight, legend_loc, ncols, hspace, wspace, title, show, save, ax, return_fig, **kwargs)
    275         color_vector, categorical = _get_color_values(adata, value_to_plot,
    276                                                       groups=groups, palette=palette,
--> 277                                                       use_raw=use_raw)
    278 
    279         # check if higher value points should be plot on top

~\AppData\Local\Continuum\anaconda3\lib\site-packages\scanpy\plotting\tools\scatterplots.py in _get_color_values(adata, value_to_plot, groups, palette, use_raw)
    658     # check if value to plot is in var
    659     elif use_raw is False and value_to_plot in adata.var_names:
--> 660         color_vector = adata[:, value_to_plot].X
    661 
    662     elif use_raw is True and value_to_plot in adata.raw.var_names:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\anndata\base.py in __getitem__(self, index)
   1307     def __getitem__(self, index):
   1308         """Returns a sliced view of the object."""
-> 1309         return self._getitem_view(index)
   1310 
   1311     def _getitem_view(self, index):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\anndata\base.py in _getitem_view(self, index)
   1311     def _getitem_view(self, index):
   1312         oidx, vidx = self._normalize_indices(index)
-> 1313         return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
   1314 
   1315     def _remove_unused_categories(self, df_full, df_sub, uns):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\anndata\base.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
    662             if not isinstance(X, AnnData):
    663                 raise ValueError('`X` has to be an AnnData object.')
--> 664             self._init_as_view(X, oidx, vidx)
    665         else:
    666             self._init_as_actual(

~\AppData\Local\Continuum\anaconda3\lib\site-packages\anndata\base.py in _init_as_view(self, adata_ref, oidx, vidx)
    723             self._X = None
    724         else:
--> 725             self._init_X_as_view()
    726 
    727         self._layers = AnnDataLayers(self, adata_ref=adata_ref, oidx=oidx, vidx=vidx)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\anndata\base.py in _init_X_as_view(self)
    750             shape = (
    751                 get_n_items_idx(self._oidx, self._adata_ref.n_obs),
--> 752                 get_n_items_idx(self._vidx, self._adata_ref.n_vars)
    753             )
    754             if np.isscalar(X):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\anndata\utils.py in get_n_items_idx(idx, l)
    148         return 1
    149     else:
--> 150         return len(idx)

TypeError: object of type 'numpy.int64' has no len()
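The last frame of the traceback is the whole story: `len()` is being called on a bare NumPy integer scalar (the index produced by selecting a single var name), and a scalar has no length. A minimal numpy-only illustration, no scanpy or anndata required:

```python
import numpy as np

# Selecting a single column can resolve to a scalar index rather than an
# array, and len() on a NumPy integer scalar fails exactly like the
# traceback above.
idx = np.int64(5)
try:
    len(idx)
except TypeError as e:
    print(e)  # object of type 'numpy.int64' has no len()
```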

@falexwolf
Member

Sorry for all the trouble. I just wanted to download from your dropbox link but the file wasn't there anymore...

@falexwolf
Member

Are you running on Windows? Then the solution could be this fix in anndata: scverse/anndata#102

@ajitjohnson
Author

Ah sorry, I happened to have deleted it today. Here it is.

https://www.dropbox.com/s/26a5rhrjj99czeq/Trial_data.csv?dl=0

Meanwhile, I will also take a look at the solution you sent. I usually run it in a Jupyter notebook, and I work between Windows and Mac.

@LuckyMD
Contributor
LuckyMD commented Mar 22, 2019

So I just reproduced this error for sc.pp.log1p() using my own data after using the sc.pp.downsample_counts() function. It might have to do with that?

I noticed that sc.pp.downsample_counts() returns np.int64 rather than np.float64. I reckon that's what the log transformation is complaining about.

If I add the line:

adata.X = adata.X.astype(np.float64)

after the downsampling call, it works again. Maybe add that to sc.pp.log1p()? Or change sc.pp.downsample_counts() to return np.float64?
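Both the failure and the astype workaround described above can be reproduced with plain NumPy; the counts array below is synthetic, standing in for downsample_counts output:

```python
import numpy as np

# Integer counts, as downsampling would produce.
counts = np.array([[0, 1, 4, 9]], dtype=np.int64)

# In-place log1p fails: the ufunc produces float64 ('d'), which cannot be
# written back into the int64 ('l') buffer under the 'same_kind' rule.
try:
    np.log1p(counts, out=counts)
except TypeError as e:
    print("in-place on int64 fails:", e)

# Casting to float first makes the in-place operation legal.
x = counts.astype(np.float64)
np.log1p(x, out=x)
print(x)  # log1p(0)=0, log1p(1)=ln 2, ...
```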

@ajitjohnson
Author

It was related to the AnnData conversion that @falexwolf alluded to, which specifically affects Windows machines (because of changes in numpy). I got the latest version of AnnData and it works now.

@LuckyMD LuckyMD reopened this Mar 22, 2019
@LuckyMD
Contributor
LuckyMD commented Mar 22, 2019

Exactly the same error message pops up when inputting np.int64 data into sc.pp.log1p(). This is with the latest scanpy, and using data that has otherwise worked well when not using sc.pp.downsample_counts(). I thus wouldn't consider this resolved, although I can open another issue as well.

@ajitjohnson
Author

Aha okay. My problem was resolved when I updated the AnnData package and converted the pandas DataFrame into an AnnData object using

adata = sc.AnnData(x)

@falexwolf
Member

We just merged an update on the downsample_counts function by @ivirshup; evidently, the data type shouldn't be changed by downsampling, should it?

@LuckyMD
Contributor
LuckyMD commented Mar 22, 2019

I can confirm that it currently is:

[Screenshot, Mar 22 2019: the downsampled matrix has an integer dtype]

@ivirshup
Member

No matter what it returns, it definitely shouldn't make stuff fail. I think that downsample_counts was returning integers before the most recent PR as well.

iirc, I made downsample_counts use integers because a) numba was failing inference unless I was explicit about integers and b) downsampling counts only makes sense for integer valued numbers. At the time I couldn't see a reason to convert the output to a different type.

I figure that log1p should be able to take an integer valued expression matrix. However, I tried to implement that and ended up adding a lot of flow control to an already flow control heavy function, which got ugly:

🍝
def log1p(data, copy=False, chunked=False, chunk_size=None):
    """Logarithmize the data matrix.

    Computes `X = log(X + 1)`, where `log` denotes the natural logarithm.

    Parameters
    ----------
    data : :class:`~anndata.AnnData`, `np.ndarray`, `sp.sparse`
        The (annotated) data matrix of shape `n_obs` × `n_vars`. Rows correspond
        to cells and columns to genes.
    copy : `bool`, optional (default: `False`)
        If an :class:`~anndata.AnnData` is passed, determines whether a copy
        is returned.

    Returns
    -------
    Returns or updates `data`, depending on `copy`.
    """
    if copy:
        if not isinstance(data, AnnData):
            data = data.astype(np.floating)
        data = data.copy()
    elif not isinstance(data, AnnData) and np.issubdtype(data.dtype, np.integer):
        raise TypeError("Cannot perform inplace log1p on integer array")

    def _log1p(X):
        if issparse(X):
            np.log1p(X.data, out=X.data)
        else:
            np.log1p(X, out=X)

        return X

    if isinstance(data, AnnData):
        if not np.issubdtype(data.X.dtype, np.floating):
            data.X = data.X.astype(np.floating, copy=False)
        if chunked:
            for chunk, start, end in data.chunked_X(chunk_size):
                 data.X[start:end] = _log1p(chunk)
        else:
            _log1p(data.X)
    else:
        _log1p(data)

    return data if copy else None

I'll give that another shot, and open a PR. On the return type of downsample_counts, I've noticed many functions in scanpy return float32 matrices regardless of what was given to them. Is this a design that's meant to be propagated? Even if not, what should the return type of downsample_counts be? At the time I figured it didn't matter, since anything downstream should be able to deal with it.

@LuckyMD
Contributor
LuckyMD commented Mar 24, 2019

Originally everything was np.float32 in scanpy, but as of a recent anndata commit (somewhere between 0.6.11 and 0.6.12), that was changed and now the input data precision is left up to the user.

Which functions hard code np.float32?

@falexwolf
Member

Nothing should be hardcoded np.float32, but it might be that some functions still do that from an early time, where, for instance, scikit-learn's PCA was silently transforming to float64 (and Scanpy silently transformed back etc.).

Nothing should change the dtype that the user wants, except, for instance, when we logarithmize an integer matrix etc. Here, there should be a default dtype='float32' parameter.

[PS: In algorithms that inherently are unstable and would profit more from higher precision, one could think about increasing precision.]

@ivirshup
Member

Should downsample_counts also get a dtype argument? Internally I've used np.integer, which I think just uses the system word size.

@davidhbrann
Contributor
davidhbrann commented Oct 6, 2019

Hi,

I've been getting the same error when trying to use sc.pp.normalize_total after sc.pp.downsample_counts. Downsampling returns a CSR sparse matrix of type <class 'numpy.int64'>, which then causes sc.pp.normalize_total to error. Not sure where the dtype conversion should take place.

pbmc = sc.datasets.pbmc68k_reduced()
pbmc.X = pbmc.raw.X
sc.pp.downsample_counts(pbmc, counts_per_cell=500)
sc.pp.normalize_total(pbmc, target_sum=1e4)

Here's the traceback:

Normalizing counts per cell.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-136-3305b6c650f4> in <module>
      2 pbmc.X = pbmc.raw.X
      3 sc.pp.downsample_counts(pbmc, counts_per_cell=500)
----> 4 sc.pp.normalize_total(pbmc, target_sum=1e4)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/scanpy/preprocessing/_normalization.py in normalize_total(adata, target_sum, exclude_highly_expressed, max_fraction, key_added, layers, layer_norm, inplace)
    166             adata.obs[key_added] = counts_per_cell
    167         if hasattr(adata.X, '__itruediv__'):
--> 168             _normalize_data(adata.X, counts_per_cell, target_sum)
    169         else:
    170             adata.X = _normalize_data(adata.X, counts_per_cell, target_sum, copy=True)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/scanpy/preprocessing/_normalization.py in _normalize_data(X, counts, after, copy)
     14     after = np.median(counts[counts>0]) if after is None else after
     15     counts += (counts == 0)
---> 16     counts /= after
     17     if issparse(X):
     18         sparsefuncs.inplace_row_scale(X, 1/counts)

TypeError: ufunc 'true_divide' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule ''same_kind''
>>> pbmc.X
<700x765 sparse matrix of type '<class 'numpy.int64'>'
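The failing line at the bottom of that traceback (`counts /= after`) can be reproduced with plain NumPy; the per-cell totals below are made-up values for illustration:

```python
import numpy as np

# Per-cell totals come out as int64 when X is an integer matrix.
counts_per_cell = np.array([500, 500, 250], dtype=np.int64)
target_sum = 1e4

# _normalize_data divides in place; true division yields float64 ('d'),
# which cannot be stored back into the int64 ('l') array.
try:
    counts_per_cell /= target_sum
except TypeError as e:
    print(e)

# A float copy of the totals divides in place without complaint.
scaled = counts_per_cell.astype(np.float64)
scaled /= target_sum
print(scaled)  # [0.05 0.05 0.025]
```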

@ivirshup
Member
ivirshup commented Oct 8, 2019

@dhb2128, as a workaround, this should work:

pbmc = sc.datasets.pbmc68k_reduced()
pbmc.X = pbmc.raw.X
sc.pp.downsample_counts(pbmc, counts_per_cell=500)
pbmc.X = pbmc.X.astype(float)
sc.pp.normalize_total(pbmc, target_sum=1e4)

I've just opened a PR to fix normalize_total not working with integer input values.

@WeilerP
Contributor
WeilerP commented Aug 26, 2020

Hi there,
stumbled on this by chance when debugging a similar problem; thought I'd share the insight I gained:

As @LuckyMD already pointed out here, the root of the problem is feeding np.int64 into sc.preprocessing.log1p. More specifically, the problem occurs in log1p_array here. When specifying out in np.log1p, the result type must be castable to the dtype of the output array. However, np.log1p returns double-precision floats (type code 'd'), which cannot be cast to np.int64 (type code long integer 'l'). The error is reproducible with this small snippet of code:

import numpy as np

a = np.zeros(shape=(1, 1), dtype='int64')
np.log1p(a, out=a)

The error can be prevented like this:

import numpy as np

a = np.zeros(shape=(1, 1), dtype='int64')
a = np.log1p(a)

In the case of scanpy, this would mean replacing this line of code with X = np.log1p(X), the drawback being that the operation is no longer in place.

@ivirshup
Member

@WeilerP I'm pretty sure that should work right now. I actually think this issue has been solved, but just wasn't closed.

Here's an example for this specific case:

import scanpy as sc, numpy as np
from scipy import sparse

adata = sc.AnnData(
    np.abs(sparse.random(100, 100, density=0.1, dtype=int, format="csr")),
    dtype=int,
)
display(adata.X)
# <100x100 sparse matrix of type '<class 'numpy.int64'>'
# 	with 1000 stored elements in Compressed Sparse Row format>
sc.pp.log1p(adata)
display(adata.X)
# <100x100 sparse matrix of type '<class 'numpy.float64'>'
# 	with 1000 stored elements in Compressed Sparse Row format>

Basically, we just try to make inplace refer to the anndata object, and be truly inplace on the array when possible.
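That dtype-aware strategy can be sketched in plain NumPy; `log1p_maybe_inplace` below is a hypothetical helper for illustration, not scanpy's actual implementation:

```python
import numpy as np

def log1p_maybe_inplace(X):
    """Sketch: log-transform in place when the dtype allows,
    otherwise allocate a float result instead of raising."""
    if np.issubdtype(X.dtype, np.floating):
        np.log1p(X, out=X)  # truly in place, no copy
        return X
    # Integer (or other) input: let the ufunc allocate a float64 output.
    return np.log1p(X)

ints = np.arange(4, dtype=np.int64)
floats = np.arange(4, dtype=np.float64)

out_i = log1p_maybe_inplace(ints)    # new float array allocated
out_f = log1p_maybe_inplace(floats)  # same array, transformed in place
assert out_i.dtype == np.float64
assert out_f is floats
```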

@WeilerP
Contributor
WeilerP commented Aug 31, 2020

@ivirshup, yes, your example works. However, I would not consider the issue resolved, as it still exists IMO.
Your example only works because you are using a sparse matrix. If X is a np.ndarray, the method still fails:

>>> adata = sc.AnnData(
    np.ceil(np.abs(np.random.randn(10, 10))).astype('int64'),
    dtype=int,
)
>>> sc.pp.log1p(adata)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/functools.py", line 840, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/opt/anaconda3/lib/python3.7/site-packages/scanpy/preprocessing/_simple.py", line 350, in log1p_anndata
    X = log1p(X, copy=False, base=base)
  File "/opt/anaconda3/lib/python3.7/functools.py", line 840, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/opt/anaconda3/lib/python3.7/site-packages/scanpy/preprocessing/_simple.py", line 318, in log1p_array
    np.log1p(X, out=X)
TypeError: ufunc 'log1p' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule ''same_kind''

@ivirshup
Member

Ah, I'd missed that. Should be fixed with #1400.

@brianpenghe
pbmc.X = pbmc.X.astype(float)

Thanks for this work-around. This solved my problem!
