Type annotation considerations #373

Closed · flying-sheep opened this issue Nov 21, 2018 · 38 comments

@flying-sheep
Member

@falexwolf wrote:

Wow, this looks great! One remark for future PRs: We’re migrating to a new documentation style using type annotations

I'm still not convinced that we should use type annotations for Scanpy top-level functions. People use Scanpy in Jupyter Lab and notebooks, not in PyCharm. Hence, there is no gain from the annotations; by contrast, the annotated signatures simply look super complicated, and it's no longer feasible to grasp at first sight what's going on. This also applies to the output of the help in Jupyter Lab and notebooks.

So, while I think that for AnnData and everything in the background, type annotations may make sense for a few developers (not for me, as I'm doing everything on remote servers using emacs), it doesn't make sense for the Scanpy user.

Also, the other big packages I work with all the time simply don't have them (numpy, seaborn, pandas, tensorflow), and it makes contributing harder and lengthier if contributors need to go through them.

Finally, I'm still not happy about how the automatically generated docs from the type annotation look:
[screenshot: the rendered parameter documentation]
which is from
[screenshot: the corresponding docstring source]
Clearly, the automatically generated line with Union[...] is just way too complicated for a human to make sense of. The mix of auto-generated types in the docs and the manual annotations also looks inhomogeneous.

So, please let's stay away from having more type annotations and corresponding docstrings at this stage and let's simply continue imitating what all the major packages are doing.

Also: regarding your comment about the use of double backticks (``) vs. single backticks (`) in the docs: again, I think it leads to an inhomogeneous appearance to have *two* types of markup for code-related things. I agree that the read-the-docs italicized default style for single backticks might be suboptimal, and I'll work on that if there is some time. But in general, I think there should be essentially one markup for code, as is done in the tensorflow docs and a couple of other examples.

Happy to also discuss offline, @flying-sheep ;).

@flying-sheep
Member Author
flying-sheep commented Nov 21, 2018

Hi! I think we have a different focus here, and not all of what you stated as fact is correct, so I’ll do my best to clear this up:

  1. There is an advantage to type hints in common Scanpy usage: IPython was supposed to use Jedi for autocompletion since this summer, but they forgot to re-enable it. I sent them an issue to do so (Reenable jedi by default, ipython/ipython#11503) and a fix (Reenable jedi if it’s installed, ipython/ipython#11506). Jedi supports type hints, so with c.Completer.use_jedi = True now, or by default in a month, people will profit from them.

    Furthermore, people are using Scanpy in applications and scripts, not just in notebooks. When you use an IDE (or install the Jedi extension in Emacs), you should profit from them as well.

  2. The Jupyter shift-tab help being hard to read in the presence of type hints is what I consider a bug. I reported it (Inspected functions’ signatures undecipherable when type hints are in play, ipython/ipython#11504) and fixed it (Render signatures ourselves if they’re too long, ipython/ipython#11505).

  3. numpy is on it (see here) and will probably integrate type hints once it no longer needs Python 2 compatibility. scikit-learn, e.g., is waiting for numpy: Roadmap for Static Type Hints in Low-level Algorithms, scikit-learn/scikit-learn#11170.

    I see your concern about entry hurdles, but I don’t agree: it’s super easy. Union is “or”, Optional is “or None”. If there are questions, they can be answered (or people click on the links in the docs and read like one sentence of explanation). A short sketch of how such a signature reads follows at the end of this comment.

  4. If you want, we can change how all that is rendered. Union[a, b] could be done as :class:`a` or :class:`b`. But it’s really not hard…

    Honestly, I think the Callable[…] is much better than the textual description that was there before: before it was there, people (including me when I was writing that annotation) had to dive into the code to figure out what function signature is really expected there. Now they have to be able to parse what Callable[[a, b], c] means. If they have never encountered it before, they can click on it, read one sentence of explanation, and know that a and b are parameters and c the return type. Done in 5 seconds.

    It’s a big improvement to no longer have fuzzy explanations. E.g., what does “int, optional” mean in function x? Can you pass None there, or does it just mean there’s a default value? No idea until you look into the code. Optional[int] always means “you can pass None or an int here”.

    The mix of auto-generated types in the docs and the manual annotations also looks inhomogeneous.

    I like the fact that we can do that. It allows us to specify things we can’t express in type hints (like “only this set of strings is accepted”). In places where it looks inhomogeneous, we should think about how to make it look more homogeneous. Any ideas?

I think that should cover it. I’m awaiting your comment about Union[a, b] → “a or b” / “a | b”!
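
As promised above, a minimal sketch of how such an annotated signature reads (a toy function for illustration, not anything from Scanpy):

from typing import Callable, Optional, Sequence, Union

import numpy as np

def clip(
    values: Union[np.ndarray, Sequence[float]],              # Union[a, b] reads “a or b”
    upper: Optional[float] = None,                            # Optional[float] reads “a float or None”
    transform: Callable[[np.ndarray], np.ndarray] = np.abs,   # Callable[[a], b]: takes an a, returns a b
) -> np.ndarray:
    arr = np.asarray(values, dtype=float)
    if upper is None:                                         # “or None” is explicit, not guesswork
        return transform(arr)
    return np.minimum(transform(arr), upper)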

@ivirshup
Member

Hey, this has been something that's been confusing me a bit when annotating my arguments. Since Python is pretty polymorphic (until it's not), I find it hard to capture the traits an object should have using types I'm familiar with.

Some examples:

  • If you need to provide a list of genes, this could be a finite (ordered?) iterable whose elements are coercible to the same type as obs_names.
  • An integer. Could be a numpy integer, could be a python integer.

What are the correct typings for these? Do I do a Union of everything I can think of that matches this? Is there a way to say "should behave right if I call np.array on it" (limiting possible argument types to pd.Series, list, tuple, np.array, dask array, and probably some others)?

I guess I'd like to see some information on best practices and common idioms in the contribution guide. I haven't seen too many scientific Python packages use type annotations, so I'm not sure how settled the conventions are. If anyone has seen some good writing on type annotations for the scientific Python stack, I'd love to take a look.

@flying-sheep
Member Author
flying-sheep commented Nov 23, 2018

Hi!

There’s a series of abstract base classes (other languages call them interfaces) that can be used if you know specifically what you want (e.g. typing.Sequence[T] if it only needs to be iterable and indexable, but not necessarily a list)

About your examples, generally I always have to dig into the code to figure such things out. Annoying, but it means that people after me can just use the type annotations instead of wasting their time doing the same.

  • Currently we only accept str for obs_names, so depending on the actual usage in the function, Iterable[str] or Sequence[str] would be it I guess.
  • Usually single integers being passed around should be python ints, because they can be any size. But if you need to accept int32 and int64, you can do Union[int, np.integer].

Is there a way to say: "should behave right if I call np.array on it"

In truth, array eats just about anything (with weird results), and for us this is probably a good idea:

from typing import Sequence, Union

import numpy as np

Number = Union[float, int, np.integer, np.floating]
Num1DArrayLike = Sequence[Number]
Num2DArrayLike = Sequence[Num1DArrayLike]
Num3DArrayLike = Sequence[Num2DArrayLike]
NumNDArrayLike = Union[Num1DArrayLike, Num2DArrayLike, Num3DArrayLike]

But if we want to be exact about array_likes, we’d need this ABC:

from abc import ABC
from collections.abc import Sequence

import numpy as np

class ArrayLike(ABC):
    """An array,
    any object exposing the array interface,
    an object whose __array__ method returns an array,
    or any (nested) sequence.
    """
    @classmethod
    def __subclasshook__(cls, C):
        if issubclass(C, np.ndarray):
            return True
        if any('__array_interface__' in B.__dict__ for B in C.__mro__):
            return True
        if any('__array__' in B.__dict__ for B in C.__mro__):
            return True
        if issubclass(C, Sequence):  # “any (nested) sequence”
            return True
        return NotImplemented
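
For illustration, roughly how that hook would behave (a quick sketch, not exhaustively tested):

>>> issubclass(np.ndarray, ArrayLike)                # the explicit np.ndarray branch
True
>>> issubclass(list, ArrayLike)                      # list is a Sequence
True
>>> isinstance([[1.0, 2.0], [3.0, 4.0]], ArrayLike)  # so nested sequences pass too
True
>>> issubclass(dict, ArrayLike)                      # mappings are not sequences
False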

Two thoughts here:

  1. It’s fine if you don’t know exactly. Just use your best guess. The worst case is that someone wastes a second of runtime converting their argument to the type you thought was needed, while the original type would have been OK.
  2. It’s good if someone thinks about all that because that means things don’t break unexpectedly!

@LuckyMD
Contributor
LuckyMD commented Nov 23, 2018

@flying-sheep Regarding your first thought... it may cause issues when interfacing with other functions that do not have type annotations on the arguments. And users may then find it difficult to interpret the errors.

@flying-sheep
Member Author
flying-sheep commented Nov 23, 2018

…no? Why would it? Type annotations are only used by people and IDEs (unless you use mypy in your tests to check that everything is sound).

@LuckyMD
Contributor
LuckyMD commented Nov 23, 2018

Assuming I understand typing correctly... I'm imagining this in the same way that anndata's copy function always casts to np.float32 regardless of the adata.X type. If your input expects adata.X in np.float32, and a previous function (e.g., ComBat) has changed the data to np.float64, then you may not know why running ComBat and then copying your dataset no longer works.

@flying-sheep
Member Author
flying-sheep commented Nov 23, 2018

How? As said, they’re just for people and IDEs. Scanpy doesn’t use them. It doesn’t throw errors in case something doesn’t fit.

We could use https://pypi.org/project/typecheck-decorator/ to throw errors when something is passed that doesn’t fit the annotations. However, doing so has a performance hit and requires flawless annotations (because if the annotations were wrong, that would suddenly start throwing errors).

I’m just adding type annotations to improve user friendliness by making it clearer what functions accept, and because it makes writing documentation easier.

@LuckyMD
Contributor
LuckyMD commented Nov 23, 2018

In that case, I don't fully understand this typing and will just continue reading quietly ;). I assumed it would throw errors as for example in C++.

@flying-sheep
Member Author
flying-sheep commented Nov 23, 2018

There are a few uses:

  1. Humans. Once you understand the syntax (very easy, I just get Generator wrong all the time), it improves your understanding of what a function really accepts and returns.
  2. IDEs. They’ll get better at inferring the types of variables and will show you more actual problems in the code and fewer false positives.
  3. Testing. Some projects use mypy to check that all code in the repo typechecks properly, which can be integrated into a test suite (a small sketch of what that catches follows below).
  4. Runtime type checking. Has a performance hit (as said), but given proper type hints, it makes your code safer and the error messages better (“Function blah expected a parameter foo of type Bar, but you passed a foo of type Baz”).

I’m not planning to do 3 and 4 (yet, and probably never).
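
For illustration, a hedged sketch of points 2 and 3 (a toy function, nothing from Scanpy): given an annotated signature, an IDE or a checker like mypy can flag a bad call without running anything.

from typing import Optional

def subsample(n_obs: int, fraction: Optional[float] = None) -> int:
    """Toy example: how many observations to keep."""
    if fraction is None:
        return n_obs
    return int(n_obs * fraction)

subsample(100, 0.1)      # fine
# subsample(100, "10%")  # flagged statically: incompatible type "str"; expected "Optional[float]"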

@ivirshup
Member

@flying-sheep Thanks for the thorough response! This is a topic I have a lot of thoughts on, though I'm not so sure how coherently I can communicate all of them.

On your first thought:

The worst case scenario I see here is me typing something so poorly that a newbie trying to follow the documentation gets horrible numba errors they can't figure out. This could happen if I hadn't thought about Set being a subtype of Collection. It can be difficult to know what the ABCs in typing are supposed to mean without spending a while using them. For example, this is not what I was expecting:

>>> issubclass(np.ndarray, typing.Collection)
True
>>> issubclass(np.ndarray, typing.Sequence)
False

On defining ABCs for this package, I think it would have to be done in a way where it is easy to discover exactly what is meant by the annotated types. Personally, I read most of my documentation through the IPython REPL, where it can be difficult to figure out where a type referenced in a docstring is defined. Otherwise, I'd be interested in seeing how other people are doing it, but like @falexwolf, most of the packages I use don't have type annotations.

Again, I think these would be less of an issue if quality writing on type annotation usage, particularly for scientific Python, were available. As a Julia user, I found this blog post very helpful, not just for understanding how to implement trait types in Julia, but also for understanding when they're useful.

@falexwolf
Member

@flying-sheep As always, thank you for your thorough thoughts on the topic! And as always, my "hacking-numerics" perspective likely is not a path that is long-term sustainable. With what I wrote at the very beginning of this thread, I simply wanted to express that I thought we shouldn't transition quickly and immediately; for the cosmetic reasons and for the reason of staying away from creating entry hurdles. I still don't think that Scanpy needs to precede major packages like numpy and many others in adopting type annotations. But, in essence, I trust you, and if you want to push this further, I'm fine if Scanpy becomes somewhat of a field of experimentation for how to deal with type annotations in scientific and numerics-centered software.

@ivirshup Thank you very much for your remarks, too! I agree with your concerns and examples, but wouldn't have been able to summarize them as neatly.

Conclusion: @flying-sheep if you feel you have bandwidth for improving the cosmetics (thanks for what you did already, also the PR to ipython) that lead to more homogeneous docstrings (I'd say: Union[a, b] → a, b), of course, please go ahead. If people make PRs with old-school docstrings and without type annotations, I'd still not trouble them, for now. When we have converged on new docstrings and canonical type annotations so that at least people who really know what they're doing (@ivirshup) don't feel things are ambiguous anymore (say in a year), we can start to rigorously ask for them.

PS: Thanks for the hints about Jedi etc. @flying-sheep. But likely, I'll keep playing around and reading documentation of packages using shift-tab in jupyter and develop using emacs relatively plain (there were times when I worked with quite some extensions, but these days, I'm back to almost plain for performance reasons - I know that's probably not smart, but anyways)...

@flying-sheep
Member Author
flying-sheep commented Nov 26, 2018

@ivirshup

The worst case scenario I see here is me typing something so poorly that a newbie trying to follow the documentation gets horrible numba errors they can't figure out.

Well, that’s an improvement over the current situation of “the freeform text type annotations make me guess what I can pass and I get horrible numba errors”, right?

issubclass(np.ndarray, typing.Sequence) == False

That looks like a bug. The docs to Sequence say: “Concrete subclasses must override __new__ or __init__, __getitem__, and __len__”, and

>>> np.ndarray.__new__
<function ndarray.__new__(*args, **kwargs)>
>>> np.ndarray.__getitem__
<slot wrapper '__getitem__' of 'numpy.ndarray' objects>
>>> np.ndarray.__len__
<slot wrapper '__len__' of 'numpy.ndarray' objects>
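
(For context, a minimal sketch of the mechanism, assuming numpy is importable: collections.abc.Collection checks structurally for the required methods, but Sequence defines no __subclasshook__, so having the methods alone isn’t enough; a class has to inherit from Sequence or be registered explicitly.)

import collections.abc
import numpy as np

print(issubclass(np.ndarray, collections.abc.Collection))  # True: structural check for __len__/__iter__/__contains__
print(issubclass(np.ndarray, collections.abc.Sequence))    # False: ndarray neither inherits from Sequence nor is registered

collections.abc.Sequence.register(np.ndarray)              # explicit registration changes that
print(issubclass(np.ndarray, collections.abc.Sequence))    # now True (for this interpreter session)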

@flying-sheep
Member Author

@falexwolf

we shouldn't transition quickly and immediately; for the cosmetic reasons and for the reason of staying away from creating entry hurdles.

Got it! So no “move fast and break things”, but instead identify problems and fix them before they occur. I think the most painful issues here are:

  1. the signature rendering in ipython

    Fixed in “Render signatures ourselves if they’re too long” (ipython/ipython#11505). We might incorporate a fix right now ourselves by monkey-patching inspect.Signature.__str__ if we want.

  2. losing contributions because of an entry hurdle

    Hard to measure if this happens. If we lose someone, they won’t announce it. So maybe friendly PR/issue templates or contributing guidelines might help prevent that!


if you feel you have bandwidth for improving the cosmetics (thanks for what you did already, also the PR to ipython) that lead to more homogeneous docstrings (I'd say: Union[a, b] → a, b), of course, please go ahead.

Will do, but a comma is ambiguous, as it could mean union, intersection, or (in Python) tuple. I think Union[a, b, c] → a, b, or c would be clearer. I think we should leave everything else as is: Optional[...] is clear enough, and Callable is better than introducing our own syntax (some other languages know things like (a, b) -> c as a type for functions, but Python doesn’t).

When we have converged on new docstrings and canonical type annotations so that at least people who really know what they're doing (@ivirshup) don't feel things are ambiguous anymore (say in a year), we can start to rigorously ask for them.

good call! I might just edit them in-PR as I did to fix the colormaps in @fidelram’s last PR.

@falexwolf
Member

Great!

One last thing: in docstrings, why would you interpret a comma-separated list as an intersection or a tuple? This is not code but for humans. I'm even having a hard time imagining a case that gives rise to an intersection. Also, a tuple in a docstring should always be verbose with (,), so that no confusion is possible; we'll enforce that in the docs.

Right now, the convention across all the major packages is to simply print out a comma-separated list of types if you are allowed to pass different types to a parameter. This produces the least amount of visual distraction and maintains consistency with how it's done in Scanpy in the manual docstrings and everywhere else.

If there is a case where an intersection is relevant, I'd treat that separately.

Finally: a, b, or c is pretty elegant, too... but if there are no good answers to my two remarks I'd go for a, b, c

@falexwolf
Member

Ah, we already have a contributing sheet, but we should probably directly link to it from the readme. Do you want to edit it, before doing so, @flying-sheep?

And yes, ipython/ipython#11505 is awesome!

@flying-sheep
Member Author

I created legacy-api-wrap.

Only caveat: Scanpy is still 3.5-compatible, and I'm using f-string syntax in it and its dependency get_version, so it's Python 3.6 only (which could be circumvented via future-fstrings or so).

@flying-sheep
Member Author
flying-sheep commented Nov 26, 2018

In docstrings, why would you interpret a comma separated list as intersection or a tuple?

In natural language “a, b, c” usually means “a, b, and c” (i.e. a composite or a logical intersection). And an intersection type is one that has all the attributes of all the types, like in class x(a, b, c): ... (where commas are also used). In Python plain a, b, c constructs a tuple (a composite type): tup = a, b, c.

It took me a long time to find a numpy function that uses commas for anything other than the “, optional”, but of course you’re right. They do it like that. Why don’t people think before establishing conventions…

That function's docs are also a good example of how braindead the "optional" is: for tol, it means "or None"; for hermitian, it means "has a default" (probably, no way to know for sure). Goddamn.

Ah, we already have a contributing sheet

Oh, is this visible? Or does it need to be uppercase for that (CONTRIBUTING.md)? I don't see it when creating an issue, but maybe that's because I'm an organization member?

@falexwolf
Member

In natural language...

... I know all that. 😉 But numpy, pandas, scikit-learn, tensorflow, and seaborn all have the comma-separated list as a convention, and I'd really like to stick to that convention.

A good example...

No, the optional keyword always means that a parameter has a default. Very often, people forget to append "or None" (, None) to the list of possible types. Btw: that's maybe a nice way of thinking about it for you: you use a "tuple of possible types" to denote that any of these types can be passed to the function. As mentioned before, there is no point in using set-theoretic/logical notions like union or intersection, as the topic is so simple that it doesn't need it (no need for an intersection, it's not even clear what that would mean; if you're stringent about it, it's also not clear for union).

So, let's simply take the comma-separated list.

@falexwolf
Member

CONTRIBUTING.md

Hm, good point. I never wondered. But yes, you might be right.

@flying-sheep
Member Author
flying-sheep commented Nov 27, 2018

I know all that. 😉

And I know that you know! I just like to be comprehensive when presenting my arguments!

But numpy, pandas, scikit learn, tensorflow, seaborn all have the comma-separated list as a convention and I'd really like to stick to that convention.

OK. I’d prefer “a, b, or c”, but I’ll concede. It would also be no problem to change it later, since it will all be automated 👍

No, the optional keyword always means that a parameter has a default. Very often, people forget to append "or None" (, None) to the list of possible types.

Well, when I open scanpy in PyCharm and someone forgot that in a type annotation, it highlights that fact to me. Pretty nice.

As mentioned before, there is no point in using set-theoretic/logical notions like union or intersection as the topic is so simple that it doesn't need it (no need for an intersection, it's not even clear what that would mean; if you're stringent about it, it's also not clear for union).

Oh, then you haven’t heard of type theory. It’s a branch of logic: type systems are formal systems, and in most of them the terms I used are well defined. The kinds of composite types I mentioned are (a small Python sketch follows the list):

  • Union of types / Sum Type / Tagged Union: Variables of such a type hold a value of one of several fixed types.
  • Subtype / Intersection Type: Variables have all the properties of the supertypes.
  • Tuple / Product Type: Variables contain multiple entries that each have one corresponding type.
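
A rough Python rendering of the three, with toy names just for illustration:

from typing import Tuple, Union

# Union / sum type: a value of exactly one of the listed types
IntOrStr = Union[int, str]

# Tuple / product type: one entry per component type, each with its own type
Pair = Tuple[int, str]

# Subtype / intersection-like: instances have all the properties of all bases
class A:
    def foo(self):
        return 'foo'

class B:
    def bar(self):
        return 'bar'

class AB(A, B):  # an AB has both .foo() and .bar()
    pass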

@falexwolf
Member

Oh, then you didn’t hear of type theory. It’s a branch of logic:

Indeed, I have never heard of that. But I doubt that it would be considered a branch of logic.

  1. Union type is a pretty bad descriptor for a variable that can take one of a set of fixed types. A union usually denotes a composition of multiple sets giving rise to a new set that contains all elements from these sets.
  2. Subtype is a great descriptor for a type that has properties of supertypes.
  3. Intersection type is an insanely bad descriptor for a variable that denotes the intersection of properties of supertypes; the concept of such a subtype might be something useful in some languages and some cases and it might deserve a special name as it's the converse behavior of subclassing. But I have no idea how such a type would be useful in Python and in all cases that I've encountered. The example on Wikipedia already constructs a highly artificial case, whose relevance is opaque to me even though Scanpy features it in many instances: functions that overload parameters and have different overloading-dependent return types (standard example is passing an array instead of an AnnData, which triggers the automatic return of the computed annotation).

What do you think about 3, @flying-sheep?

@flying-sheep
Member Author
flying-sheep commented Nov 30, 2018

I doubt that it would be considered a branch of logic

What do you define as logic here? I was talking about the logic theory that encompasses formal systems and so on.

[Union and intersection are bad names]

I agree; Wikipedia enumerates more names and explains where “union” comes from:

tagged union, variant, variant record, choice type, discriminated union, disjoint union, or sum type

Mathematically, tagged unions correspond to disjoint or discriminated unions, usually written using +. Given an element of a disjoint union A + B, it is possible to determine whether it came from A or B. If an element lies in both, there will be two effectively distinct copies of the value in A + B, one from A and one from B.

I think “discriminated union / intersection of types” would make sense here. Leaving out the “discriminated/tagged/disjoint” is the problem. In C there are actual untagged unions, which simply means that C reserves the memory for the largest of the member types and you need to keep track yourself of which type the value has. In Python you can always do isinstance, so a more correct name for Union[A, B] would be TaggedUnion[A, B]. I’d also like OneOf[A, B], but that ship has sailed.

And intersections are basically duck types or structural types (when anonymous) or traits/interfaces (when named): hasattr(obj, 'foo') defines an intersection type of all types having that attribute. So it makes sense for Python, it’s just expressed more explicitly than by literally intersecting types.
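
A minimal sketch of what such a structural type could look like (hypothetical HasFoo, nothing we actually use):

from abc import ABC

class HasFoo(ABC):
    """Structural type: anything with a `foo` attribute counts as a subclass."""
    @classmethod
    def __subclasshook__(cls, C):
        if cls is HasFoo:
            return hasattr(C, 'foo')
        return NotImplemented

class Duck:
    def foo(self):
        return 'quack'

print(isinstance(Duck(), HasFoo))    # True, although Duck never mentions HasFoo
print(isinstance(object(), HasFoo))  # False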

@ryan-williams

just passing by to say I love seeing this discussion, and particularly the type-algebra perspective @flying-sheep 😀

I have to admit I am missing what the original context was, in case anyone wants to attempt a small summary. Something about describing types in docstrings (e.g. a, b, or c)? I am not the most important stakeholder by far though, so no worries either way. Thanks!

@flying-sheep
Member Author
flying-sheep commented Nov 30, 2018

Sure! In short: Alex said he didn’t like the switch to type annotations at all, citing a few gripes.

I went on to fix them in various places (fixes are now in) and argued against a few others. I convinced Alex that we should (slowly and carefully) adopt type annotations.

The only thing that was missing was a consensus on how best to pretty-print typing.Union, because Alex was not a fan of the name and the clumsiness. I preferred “a, b, or c”, he just “a, b, c”. I explained why “a, b, c” is a bad convention, but Alex insisted on going with it because (sadly) everyone is doing it.

From there on we went deeper into algebraic types and so on, without need really, as we had already decided what to do.

@maximilianh
Contributor
maximilianh commented Nov 30, 2018 via email

@falexwolf
Member

What do you define as logic here?

Logic as the mother of all formal reasoning and its close relative set theory in mathematics. When you say type theory is a branch of logic then 90% of computer science is a branch of logic. In many contexts this might be a valid but not a very useful statement.

I’d also like OneOf[A, B]

I love OneOf[A, B]. This also doesn't pretend to be logic or set stuff.

hasattr(obj, 'foo') defines an intersection type of all types having that attribute

This is what I meant when I said intersection of properties of supertypes. But I still don't know when you'd need such a type in a practical context, given that we just keep overloading functions like crazy and simply treat passed arguments depending on their type. Any example of when intersection types are actually useful? In a function we might see in Scanpy (this was the whole beginning of this discussion; I cannot imagine a case in which we need to label something an intersection type in the docs)?

@ivirshup
Member
ivirshup commented Dec 1, 2018

Just catching up on this conversation after a conference, glad to see the interest! @falexwolf, I think I can add some context here.

Within type theory, these operations with types are actually set operations on a type lattice. For example, a supertype is the union of its subtypes. This is a core feature of type systems. If you're curious about the details, I really like this paper defining abstract interpretation, of which type algebra is a special case.

Here's an example I think is pretty cool. Julia's Distributions package uses intersection types in its model of statistical distributions. Distributions defines types for a number of distributions as well as methods to work with them (e.g. pdf, rand). The following is roughly how intersections are used (more info here):

# Types are capitalized
# `<:` is the subtype operator

# Distributions are parametric on their variate form and value support
Distribution{F<:VariateForm, S<:ValueSupport}

# Defining some abstract subtypes
DiscreteDistribution{F<:VariateForm} = Distribution{F, Discrete}          # Discrete is a subtype of ValueSupport
UnivariateDistribution{S<:ValueSupport} = Distribution{Univariate, S}     # Univariate is a subtype of VariateForm
MultivariateDistribution{S<:ValueSupport} = Distribution{Multivariate, S} # Multivariate is a subtype of VariateForm

# Defining an intersection type
DiscreteUnivariateDistribution = Distribution{Univariate, Discrete}
# `===` is the absolute identity operator
DiscreteUnivariateDistribution === typeintersect(DiscreteDistribution, UnivariateDistribution) # returns `true`

# Examples of `Distribution`s which are subtypes of `DiscreteUnivariateDistribution`, these return `true`
Hypergeometric <: DiscreteUnivariateDistribution
Bernoulli <: DiscreteUnivariateDistribution

Drawing out (crudely) part of the type lattice (edges denote subtype relationship, with supertypes towards the top):

                    Distribution
                   /            \
DiscreteDistribution          UnivariateDistribution
                 \                /
          DiscreteUnivariateDistribution # The intersection
          /               |             \
Bernoulli             Binomial            Hypergeometric 

That said: Julia was designed to make reasoning about types straightforward, and this has limited application to Python. An example I could think of: needing key-value lookup which is also ordered could be thought of as the intersection of the Mapping and Sequence types.

@falexwolf
Member

Thank you for the additional explanations, @ivirshup!

... set operations on a type lattice.

OK, I can imagine that; one just needs to define what a type lattice is exactly. But let's not get into that, I roughly picture what it is...

DiscreteUnivariateDistribution = Distribution{Univariate, Discrete}

The concept of an intersection type as such makes a lot of sense to me, but as mentioned above, I'd see this as a special form of subclassing using the intersection of properties instead of the union. And yes, your example is nice.

An example I could think of is needing key value lookup which is also ordered could be thought of as the intersection of Mapping and Sequence types.

I get that. I can easily imagine more examples. But I cannot imagine why you would ever need to explicitly express that in the Scanpy docs: listing via OneOf will always do the job for us. Even in this very generic case that Phil mentioned (via hasattr('foo')), which we have several times, we just use the OneOf way in the docs and I cannot think of any problems that this might generate.

@ivirshup
Member
ivirshup commented Dec 3, 2018

I agree that there's no need for it, I just like this stuff :).

Just to be sure I get what you're saying:

For the ordered map example, is your position that since we can say something like Union[OrderedDict, pd.Series], this is a fine substitute for Intersect[Mapping, Sequence]? If not, what should the type description be? To me, Union[Mapping, Sequence] would imply either a list or a dict could be used, when neither fits the bill (ignoring 3.7's dict ordering).
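
For concreteness, a hedged sketch of the Union fallback I mean (hypothetical alias, pandas assumed to be installed):

from collections import OrderedDict
from typing import Union

import pandas as pd

# No Intersection[...] exists in typing, so in practice one enumerates
# concrete types that happen to satisfy “ordered + key lookup”:
OrderedMapping = Union[OrderedDict, pd.Series]

def first_value(m: OrderedMapping):
    """Toy example: the value stored under the first key."""
    if isinstance(m, pd.Series):
        return m.iloc[0]
    return next(iter(m.values()))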

@flying-sheep
Member Author
flying-sheep commented Dec 3, 2018

People are discussing intersections here.

And I agree with you, @ivirshup: that’s a great example where an intersection would be needed. Unions are only useful if you accept several things and somewhere switch behavior based on what you got, like if isinstance(...):.

I think that Alex just means that something like that isn’t needed anywhere in scanpy.

An aside about switching behavior based on types: too bad Python hasn’t been designed with a destructuring match/switch. Rust is beautiful because of it. However, Python would probably need to use something like Scala’s unapply, which I could never wrap my brain around.

@falexwolf
Member
falexwolf commented Dec 4, 2018

When I read

like Union[OrderedDict, pd.Series], this is a fine substitute for Intersect[Mapping, Sequence]

in the context of

An example I could think of is needing key value lookup which is also ordered could be thought of as the intersection of Mapping and Sequence types.

then Intersect[Mapping, Sequence] expects a new "intersection object" (here just an OrderedDict), which is to me an "intersection way" of subclassing - and the first is OneOf. So these are different things.

My point is (repeating what Philipp said): in practice (in all the numerical stuff that I've done so far, including Scanpy), I have never encountered the need for defining such an intersection object on the typing level. I just overload functions using OneOf and account for differences in the passed objects' attributes via if isinstance(...):... If I need a function that only eats a "weird intersection type object", I'll go and define the corresponding class and throw an error if the function gets fed something different. Fortunately, that happens quite rarely; but yeah, I had cases where I only wanted an OrderedDict but neither a dict nor a list. But I'd never call this an "intersection type".

@ivirshup You didn't explain the "type lattice": but according to what I learned about Union and Intersection in this thread, the sets involved in the mentioned "set operations on the type lattice" should have elements that are "properties" of types (as these are not restricted to actual class attributes, this, unfortunately, doesn't tell you right away which "property" you are intersecting: "being ordered", "having a key accessor", "having a certain numerical range"). Right? Union and Intersection then refer to the maximal set of properties of the objects you pass. As each passed object can be characterized by a set of properties, all that naming makes sense. But for someone reading the docs, who isn't expected to know about all the properties of each object that comes along the way, it's a really sophisticated concept. As a user, I want to characterize things with a simple name for a class, like AnnData or OrderedDict. And these are the things that I want to see in the docs. Don't you agree? By contrast, I'm fine with having the sophisticated stuff in the code; even though I still think that a plain old-school untyped function signature looks more beautiful and its content can be immediately grasped by a human.

@ivirshup
Member
ivirshup commented Dec 6, 2018

@flying-sheep, I'm pretty sure the logical conclusion of any long discussion about types is that everything should be done in Haskell.

I don't like the use of branches with isinstance because it breaks polymorphism, which is a key part of pythonic code to me.

@falexwolf, I completely agree with you on "what makes a good docstring". The knowledge overhead for numeric Python doesn't include type theory, so the docs should be interpretable without it. Ideally, interfaces are simple and the documentation makes the expected behavior clear. I'm still not sure I totally understand what the intent of the "type" vs. "class" system is in Python, so I'm often a little unsure what to do with heavily typed code.

That said, if expected behaviors could be encapsulated (both formally and intuitively) with some abstract types (representing interfaces or traits) that would be a nicer solution. I don't think we're near that point in python.

Lattices

Sorry about not giving some info on lattices, I'd thought you didn't want to get into it. It's the partially ordered set kind of lattice, where each type is an element or subset.

I'll give a short python based example (ignoring that Union[] can't be instantiated).

The code:
from typing import Any, Union

class A():
    pass

class B(A):
    pass

class C(A):
    pass

class D():
    pass

class E(D):
    pass

that defines a lattice, which can be represented as a DAG like this:

    Any
    / \
   A   D
  / \  |
  B C  E
  \ | /
  Union[]

It's partially ordered in that you can't say A contains E or vice versa, but you can say things like A contains B, and Any is a supertype of (contains) everything else.

I think that how you're viewing it is pretty close, except the elements are types instead of their properties. My mental model has types being a collection of properties, and being a subtype means an object inherits its supertype's properties and can have more.

@flying-sheep
Member Author

I don't like the use of branches with isinstance because it breaks polymorphism, which is a key part of pythonic code to me.

That’s what ABCs are for. isinstance(thing, Mapping) works beautifully for everything that has the Mapping protocol, no matter if it’s a dict or not.

@flying-sheep
Member Author
flying-sheep commented Dec 6, 2018

Btw: IPython 7.2 is live and contains my signature rendering prettification… except that I made a mistake and it renders wrong. Anyway, I fixed that in ipython/ipython#11525, and things should be pretty in IPython 7.2.1:

[screenshot of the improved signature rendering]

@ivirshup
Member
ivirshup commented Dec 7, 2018

Are you talking about the collections.abc.Mapping in this case? From the 3.7 docs:

In general, isinstance() and issubclass() should not be used with types.

Additionally, those functions just throw an error for subscripted generics, so you definitely can't do isinstance(m, Mapping[str, int]). I don't think I'm totally clear on the differences in intended use cases for ABCs vs Types. Are types only for annotation? Should ABCs not be used for annotations?

@flying-sheep
Member Author
flying-sheep commented Dec 7, 2018

I only just now got the distinction between types and classes in Python. So when they talk about "types", they mean stuff in the typing module, got it. So:

  • Types (typing.*), ABCs, and regular classes can be used for type annotation
  • ABCs and regular classes can be used for isinstance and issubclass checking
  • ABCs and mixins can be mixed in to enhance a class you defined

where a mixin is simply a regular class that happens to rely on some properties of the class it can be mixed with, and a regular class is any class that's not a type or an ABC.

  • collections.abc.Mapping is an ABC and can be mixed in to enhance your basic mapping class with some convenience methods, or used to check if something has the basic mapping protocol (no matter if it was mixed in or not). What the basic protocol is and what gets mixed in is nicely documented.
  • typing.Mapping is a generic type, to be used in annotations only. There’s a few projects implementing type checking using them, e.g. mypy or typecheck-decorator.

Check out the docs for abstract base classes, they explain how ABCs work. (namely by registering virtual subclasses and/or implementing __subclasshook__)

Mixin example:

class EnumerableMixin:
    """Silly mixin class for iterables."""
    def enumerate(self, start=0):
        yield from enumerate(self, start)

class EnumerableList(list, EnumerableMixin):
    pass

for i, e in EnumerableList(['a', 'b', 'c']).enumerate():
    print(i, e)

ABC example:

import collections.abc
import itertools
import sys

class PositiveNumbers(collections.abc.Set):
    def __contains__(self, i):
        return isinstance(i, int) and i >= 0
    def __iter__(self):
        return itertools.count()
    def __len__(self):
        return sys.maxsize  # len() must return an int, so this stands in for “infinite”

# __lt__ is mixed in!
print({0, 1, 10_000} < PositiveNumbers())  # True

# set doesn’t inherit from collections.abc.Set; registration as a virtual subclass makes this work
print(isinstance({0, 1}, collections.abc.Set))  # True

@ivirshup
Member

I've played around with ABCs for defining interfaces, but had thought the typing module was more closely tied with them. Do you understand the rationale for typing classes being separate when ABCs and regular classes can be used for annotation?

As far as I can tell, it's for subscripted (parametric?) annotations. For many Python classes those would have to be a runtime check – e.g. List[int] – which ABCs don't do. Not that typing does these checks, but it allows a way to express those constraints.

@flying-sheep
Member Author
flying-sheep commented Dec 10, 2018

The runtime checks would be too costly or impossible. To test for List[int], you’d have to traverse the whole thing and check every element. And e.g. Iterator[int] you just can’t test at all – if you iterated the thing to check the objects it yields, you’d exhaust it and it would no longer be usable.
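
A minimal sketch of why (toy checks, not how any real checker works):

from typing import Iterator, List

def check_list_of_int(xs: List[int]) -> bool:
    # O(n): the only way to know is to look at every element
    return isinstance(xs, list) and all(isinstance(x, int) for x in xs)

def check_iterator_of_int(it: Iterator[int]) -> bool:
    # to inspect the elements we would have to consume them,
    # after which the caller’s iterator is (partially) exhausted
    return all(isinstance(x, int) for x in it)

nums = iter([1, 2, 3])
check_iterator_of_int(nums)
print(list(nums))  # [] – the “check” used up the iterator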
