ENH Adds encoded_missing_value to OrdinalEncoder #21988

thomasjpfan · 2021-12-15T15:21:26Z

Reference Issues/PRs

Related to #21967

What does this implement/fix? Explain your changes.

This PR adds encoded_missing_value to OrdinalEncoder. As with unknown_value, encoded_missing_value is not allowed to be one of the encodings already used by a known category.

CC @glemaitre

ogrisel

Just a quick suggestion for a usability improvement in the error message, but other than that, LGTM!

sklearn/preprocessing/_encoders.py

glemaitre · 2021-12-23T11:10:47Z

Maybe a short example in the docstring:

    By default, :class:`OrdinalEncoder` is lenient towards missing values by
    propagating them.

    >>> import numpy as np
    >>> X = [['Male', 1], ['Female', 3], ['Female', np.nan]]
    >>> enc.fit_transform(X)
    array([[ 1.,  0.],
           [ 0.,  1.],
           [ 0., nan]])

    You can use the parameter `encoded_missing_value` to encode missing values.

    >>> enc.set_params(encoded_missing_value=-1).fit_transform(X)
    array([[ 1.,  0.],
           [ 0.,  1.],
           [ 0., -1.]])

glemaitre · 2021-12-23T11:19:44Z

In the User Guide (l.540 of preprocessing.rst), I think we can add or modify the text as follows:

By default, :class:`OrdinalEncoder` will also pass through missing values that are
indicated by `np.nan`.

    >>> enc = preprocessing.OrdinalEncoder()
    >>> X = [['male'], ['female'], [np.nan], ['female']]
    >>> enc.fit_transform(X)
    array([[ 1.],
           [ 0.],
           [nan],
           [ 0.]])

:class:`OrdinalEncoder` provides a parameter `encoded_missing_value` to encode
the missing values without the need to create a pipeline that uses
:class:`~sklearn.impute.SimpleImputer`.

    >>> enc = preprocessing.OrdinalEncoder(encoded_missing_value=-1)
    >>> X = [['male'], ['female'], [np.nan], ['female']]
    >>> enc.fit_transform(X)
    array([[ 1.],
           [ 0.],
           [-1.],
           [ 0.]])

The above processing is equivalent to the following pipeline::

    >>> from sklearn.pipeline import Pipeline
    >>> from sklearn.impute import SimpleImputer
    >>> enc = Pipeline(steps=[
    ...     ("encoder", preprocessing.OrdinalEncoder()),
    ...     ("imputer", SimpleImputer(strategy="constant", fill_value=-1)),
    ... ])
    >>> enc.fit_transform(X)
    array([[ 1.],
           [ 0.],
           [-1.],
           [ 0.]])

jeremiedbb

LGTM

thomasjpfan added 2 commits December 15, 2021 10:11

ENH Adds encoded_missing_value to OrdinalEncoder

1663443

ENH Adds error if encoded_missing_value is used

42f98b4

github-actions bot added the module:preprocessing label Dec 15, 2021

DOC Adds whats new PR number

5ba293f

ogrisel approved these changes Dec 16, 2021

ogrisel mentioned this pull request Dec 16, 2021

Documenting missing-values practices #21967

Open

7 tasks

thomasjpfan added 2 commits December 17, 2021 11:09

Merge remote-tracking branch 'upstream/main' into ordinal_encoder_nan_rb

7500146

ENH Error message outputs which feature was invalid

a1e8598

glemaitre self-requested a review December 23, 2021 10:55

thomasjpfan added 2 commits December 23, 2021 09:21

Merge remote-tracking branch 'upstream/main' into ordinal_encoder_nan_rb

a5b479d

DOC More documentation

ed4334f

thomasjpfan added this to the 1.1 milestone Mar 3, 2022

jeremiedbb added 2 commits March 23, 2022 11:39

Merge branch 'main' into ordinal_encoder_nan_rb

4d4e5dc

cln merge mistake

1d01d81

jeremiedbb approved these changes Mar 23, 2022

jeremiedbb merged commit 51fddb7 into scikit-learn:main Mar 23, 2022

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022

ENH Adds encoded_missing_value to OrdinalEncoder (scikit-learn#21988)

cd95bf9

This was referenced Nov 15, 2022

Update scikit learn 1.2 automl/auto-sklearn#1611

Closed

[Maint] Specify encoded_missing_value to OrdinalEncoder automl/auto-sklearn#1615

Open

thomasjpfan mentioned this pull request Feb 23, 2023

Handle missing values in OrdinalEncoder #11997

Closed
