[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utils.validation.check_array throws bad TypeError pandas series is passed in #12699

Closed
greghoop opened this issue Nov 30, 2018 · 16 comments
Closed
Labels
Milestone

Comments

@greghoop
Copy link
greghoop commented Nov 30, 2018

Description

validation.check_array throws bad TypeError pandas series is passed in. It cropped up when using the RandomizedSearchCV class. Caused when line 480 is executed

480 - if hasattr(array, "dtypes") and len(array.dtypes):

Steps/Code to Reproduce

validation.check_array(y, ensure_2d=False, dtype=None) where y is a pandas series

Expected Results

No error (I'm not familiar with this code so not sure on the details)

Actual Results

TypeError: object of type 'DTYPE NAME OF THE SERIES' has no len()

Versions

0.20.1

@rth
Copy link
Member
rth commented Nov 30, 2018

This should have been fixed in #12625, and I can confirm that the corresponding commit is part of the 0.20.1 tag.

Could you please double-check that you are using 0.20.1. Also please post the full traceback, which will help us investigate.

@qinhanmin2014
Copy link
Member

I think @greghoop is using 0.20.1, seems that it's not appropriate to use len(array.dtypes)

@greghoop Please provide self-contained example code, including imports and data (if possible), so that other contributors can just run it and reproduce your issue. Ideally your example code should be minimal.

And please provide the output of sklearn.show_versions(). Thanks.

@qinhanmin2014 qinhanmin2014 added this to the 0.20.2 milestone Nov 30, 2018
@qinhanmin2014
Copy link
Member

I cannot reproduce locally. And I think we have a test in Travis CI.

@greghoop
Copy link
Author
greghoop commented Nov 30, 2018

@qinhanmin2014

Please provide the output of sklearn.show_versions()

I've already downgraded to 0.20.0 so that won't help.

@greghoop Please provide self-contained example code

If I get a second I'll create an example, but I can't upload the current code.

The problem occurs for me on both Mac and Ubuntu in Python 3.7 with Pandas 0.23.4

@jorisvandenbossche
Copy link
Member

Given the initial error message, I think this is indeed 0.20.1. The Series.dtypes attrribute returns a single dtype, and numpy dtypes have a length of 0:

In [21]: len(np.dtype('int64'))                                                                                                                                                 
Out[21]: 0

that was the characteristics we were using in

# check if the object contains several dtypes (typically a pandas
# DataFrame), and store them. If not, store None.
dtypes_orig = None
if hasattr(array, "dtypes") and len(array.dtypes):
dtypes_orig = np.array(array.dtypes)

to deal with Series objects.

But I am now thinking of a case where the error can come up: with the custom pandas dtypes ...

In [25]: s = pd.Series(['a', 'b', 'c']).astype('category')                                                                                                                      

In [26]: s.dtypes                                                                                                                                                               
Out[26]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)

In [27]: len(s.dtypes)                                                                                                                                                          
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-3d39f9ba3f74> in <module>
----> 1 len(s.dtypes)

TypeError: object of type 'CategoricalDtype' has no len()

@greghoop can it be the case that you were also using categorical data?

@qinhanmin2014
Copy link
Member

@greghoop

I've already downgraded to 0.20.0 so that won't help.

Downgrading to 0.20.0 won't help (downgrading to 0.19.X will)

If I get a second I'll create an example, but I can't upload the current code.

Thanks, I think you only need to provide the data type of y (you passed into check_array).

@jorisvandenbossche
Copy link
Member

Apart from fixing it here in sklearn, we should consider on the pandas side if we want to fix this (pandas-dev/pandas#24033).
But that said, relying on the length of the dtype was maybe not a good idea. Because eg structured dtypes actually do have a length > 0. Although, at the moment you cannot store a structured array in a Series, so we should not be able to encounter that case.

@qinhanmin2014
Copy link
Member

I agree that we should avoid using len
@jorisvandenbossche Do you know when we'll get object of type 'DTYPE NAME OF THE SERIES' has no len()? I can't find relevant information in pandas.

@jorisvandenbossche
Copy link
Member

@qinhanmin2014 Yes, see my example above: #12699 (comment) (at least, I assume that is what happened, @greghoop needs to confirm that, as you will never actually get DTYPE NAME OF THE SERIES in practice)

@greghoop
Copy link
Author
greghoop commented Dec 1, 2018

Yes, it was specifically for a categorical dtype series. Apologies for the confusion.

FYI, this isn’t a problem in 0.20.0 (I don’t need to downgrade to 0.19.x to get this working).

I don’t not know if it affects other dtypes

@qinhanmin2014
Copy link
Member

@jorisvandenbossche Thanks, I've seems your example above. I'm wondering what's type 'DTYPE NAME OF THE SERIES' since we need to reproduce the reporter's error. Any insights?

@jorisvandenbossche
Copy link
Member

I'm wondering what's type 'DTYPE NAME OF THE SERIES' since we need to reproduce the reporter's error. Any insights?

That's just something that @greghoop put there, he apologized for the confusion :-)

@qinhanmin2014
Copy link
Member

Ahh, I see the comment from the contributor, let's replace len with other options.

@jorisvandenbossche
Copy link
Member

@greghoop And thanks for the report!

@jorisvandenbossche
Copy link
Member

@qinhanmin2014 I would maybe replace the and len(array.dtypes) with for example and hasattr(array.dtypes; 'iloc') (the duck type check we use in several places for pandas objects).
Ideally, I think we should start doing actual instance checks instead of those duck types, but that's for another issue.

Shall I do a PR, or are you already on it?

@qinhanmin2014
Copy link
Member

@jorisvandenbossche I'm not available now so please go ahead :)
FYI I proposed a solution in #12622 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants