Radical suggestion for python API: integers versus objects #2755

hyanwong · 2023-05-16T10:57:42Z

hyanwong
May 16, 2023
Maintainer

Here's a rather radical idea that I came across when looking at our teaching material while wearing my "newbie to tskit" hat. Perhaps there's something in it, or perhaps it would cause too much code churn or maintenance burden: either way it could be worth a preliminary discussion before shooting it down in flames.

The trigger for the idea is that I find it rather confusing that, to get e.g. the position of a mutation or the time of an edge's parent, you need to repeat the name of the tree sequence. That is easy to get wrong if you have a set of similarly named tree sequences lying around, and there's also the risk of taking the integer from the wrong source, e.g. passing a node ID where an individual ID is needed:

mutation_pos = ts.site(ts.mutation(m).site).position
parent_time = ts.node(ts.edge(e).parent).time
wrong_location = ts.individual(ts.mutation(m).node).location  # This is a hidden bug
right_location = ts.individual(ts.node(ts.mutation(m).node).individual).location

This also forces beginners to know all about tskit IDs from the start, and to be comfortable bandying them around. It would be less error-prone, and make teaching easier, if we could do:

mutation_pos = ts.mutation(m).site.position
parent_time = ts.edge(e).parent.time
wrong_location = ts.mutation(m).node.location  # this should fail: nodes don't have a location ...
right_location = ts.mutation(m).node.individual.location  # ... but individuals do

In other words, it would be most helpful if the python API returned objects instead of integers for Mutation.site, Edge.parent, etc. There are also some other places where it's unclear to a beginner whether an ID or an object is being returned: e.g. TreeSequence.nodes() returns an iterator over node objects, whereas Tree.nodes() returns an iterator over node IDs. My potential suggestion here for the high-level API is that we should always return objects rather than simply IDs where possible.

It seems at first glance that if we wish to retain backwards compatibility, the boat sailed a long time ago on this possibility. But I think there may be a solution, although I don't know how much performance loss there would be. That is to treat a Node object as a special type of integer, with additional properties that return the correct items. The id for a node object could then be obtained either simply by treating the object as an integer, or (more clearly) by accessing (say) a .id attribute. A sketch for something that would work might look a bit like this:

class Node(int):  # Todo: make this work with a dataclass with slots and the metadata_decoder functionality
    def __new__(cls, id_, flags, time, population_id, individual_id, metadata, metadata_decoder, tree_sequence):
        # we could probably enforce IDs to be nonnegative here...
        obj = int.__new__(cls, id_)
        obj.id = id_
        obj.flags = flags
        obj.time = time
        obj.population_id = population_id  # Can now access the plain ID using my_node.population_id
        obj.individual_id = individual_id
        obj.metadata = metadata
        obj.metadata_decoder = metadata_decoder  # probably need something else here
        obj.tree_sequence = tree_sequence
        return obj  # __new__ must return the new instance

    @property
    def population(self):
        # This now returns a population object (which can be treated as an int), rather than a plain int.
        return self.tree_sequence.population(self.population_id)

    @property
    def individual(self):
        return self.tree_sequence.individual(self.individual_id)


class Population(int):
    def __new__(cls, id, metadata, metadata_decoder, tree_sequence):
        obj = int.__new__(cls, id)
        obj.id = id
        obj.metadata = metadata
        obj.metadata_decoder = metadata_decoder  # probably need something else here
        obj.tree_sequence = tree_sequence
        return obj  # __new__ must return the new instance

...

class TreeSequence:
    def population(self, id_):
        id_ = self.check_index(id_, self.num_populations)
        (metadata,) = self._ll_tree_sequence.get_population(id_)
        return Population(
            id=id_,
            metadata=metadata,
            metadata_decoder=self.table_metadata_schemas.population.decode_row,
            tree_sequence=self,  # back-reference to the containing tree sequence
        )

As you can see, it's necessary to keep a reference to the containing tree sequence in each object, which I guess could hit performance badly, especially in loops such as for node in ts.nodes(). However, I think we are moving to a world in which the Python API using ts.node(u) etc. is for simple stuff, basic algorithm development, and so on, and for really performant code you should go with ts.nodes_flags, ts.nodes_time, etc. So maybe a performance hit is acceptable?
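For concreteness, here is a minimal, self-contained toy of the int-subclass idea. It does not use tskit: ToyTables, its two small tables and the _tables attribute are hypothetical names made up for illustration. It is only meant to show that such an object can still be used wherever a plain integer ID is expected, while attribute access follows the links.

```python
class Individual(int):
    """Toy individual: behaves as its integer ID but also carries a location."""

    def __new__(cls, id_, location):
        obj = super().__new__(cls, id_)
        obj.id = id_
        obj.location = location
        return obj


class Node(int):
    """Toy node: behaves as its integer ID and remembers its container."""

    def __new__(cls, id_, time, individual_id, tables):
        obj = super().__new__(cls, id_)
        obj.id = id_
        obj.time = time
        obj.individual_id = individual_id
        obj._tables = tables  # hypothetical back-reference to the container
        return obj

    @property
    def individual(self):
        return self._tables.individual(self.individual_id)


class ToyTables:
    """Hypothetical stand-in for a tree sequence (not the tskit API)."""

    def __init__(self, node_rows, individual_locations):
        self._node_rows = node_rows  # list of (time, individual_id)
        self._individual_locations = individual_locations

    def node(self, id_):
        time, individual_id = self._node_rows[id_]
        return Node(id_, time, individual_id, self)

    def individual(self, id_):
        if id_ < 0:  # -1 is used here as the "no individual" marker
            raise ValueError("node is not associated with an individual")
        return Individual(id_, self._individual_locations[id_])


tables = ToyTables(
    node_rows=[(0.0, 1), (0.0, 1), (0.0, 0), (2.5, -1)],
    individual_locations=[(10.0, 20.0), (30.0, 40.0)],
)
node = tables.node(2)
labels = ["n0", "n1", "n2", "n3"]
print(labels[node])              # the node still works as a list index
print(node.individual.location)  # attribute access follows the link
```

In this toy, node.location raises AttributeError, which is the behaviour asked for in the wrong_location example earlier in the post.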

It's a large (huge?) change to standard tskit usage, but could, I think, both reduce errors when using tskit and help newcomers. Thoughts?

jeromekelleher · 2023-05-16T11:35:09Z

jeromekelleher
May 16, 2023
Maintainer

I think this would just create more confusion, and it would be better to teach people to think about the data model in terms of arrays in the first place (like ts.mutations_node etc.).

mutation_pos = ts.sites_position[ts.mutations_site[m]]

The advantage being that it's a straightforward change to writing things efficiently using numpy vector operations (or numba), vs a full rewrite if using things like ts.node(u). I almost never use the ts.node(u) form any more (other than getting at metadata).
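To illustrate the column-oriented style without needing a tree sequence to hand, here is a small sketch in which plain Python lists stand in for the arrays ts.sites_position, ts.mutations_site, ts.edges_parent and ts.nodes_time. The values are made up for illustration.

```python
# Toy columns standing in for the tskit arrays of the same names.
sites_position = [5.0, 12.0, 40.0]   # one entry per site
mutations_site = [0, 0, 2, 1]        # one entry per mutation: its site ID
nodes_time = [0.0, 0.0, 1.5, 3.0]    # one entry per node
edges_parent = [2, 2, 3]             # one entry per edge: its parent node ID

# Single lookups mirror ts.sites_position[ts.mutations_site[m]].
m, e = 2, 1
mutation_pos = sites_position[mutations_site[m]]
parent_time = nodes_time[edges_parent[e]]

# The same lookup for every mutation at once. With numpy arrays this
# loop would collapse into a single indexing expression.
all_mutation_pos = [sites_position[s] for s in mutations_site]

print(mutation_pos, parent_time, all_mutation_pos)
```

With the real numpy-backed arrays, the last step would be the fancy-indexing expression ts.sites_position[ts.mutations_site], which is the kind of straightforward change to vector operations described above.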

My vote would be to spend (a small fraction) of the effort refocusing teaching materials to use the array-oriented API instead.

5 replies

hyanwong May 16, 2023
Maintainer Author

I can certainly see the benefit to teaching people to use the array access stuff, for efficiency. I suspect that's quite a steep learning curve, however?

I see the ts.node() API as providing a gentle slope in, for people who want to get information out of a tree sequence, but aren't bothered about efficiency or developing algorithms, and haven't much time to learn heavily about tree sequences. There's still the problem about using an integer in the wrong place (e.g. providing node IDs rather than individual IDs, which ISTR was a source of newcomer bugs in the past, and which we never properly solved).

hyanwong May 16, 2023
Maintainer Author

Also there's still the annoyance about repeating the ts variable, which has caught me out a number of times, when I am keeping e.g. ts1, ts2, and ts3 all around at the same time, to compare with each other. But that's a minor point, I guess.

molpopgen May 16, 2023
Maintainer

IIRC, this came up rather early in the discussion of the C API. There, it is tricky to do and didn't seem worth the tradeoff of the extra complexity.

The rust interface, however, does do this. Using the wrong "id type" in the wrong place is a compiler error. The only way to send the wrong "id type" in to a function is to use unsafe code, which means your errors will be pretty easy to localize.

But for the Python side of things, I agree that the array interface should be the primary one that we talk about at this point.
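For comparison, the closest thing to typed IDs in Python that I am aware of is typing.NewType. This is only a rough analogue and is not how the Rust interface is implemented: the distinction is visible to a static type checker such as mypy but is not enforced at runtime. NodeId, IndividualId and individual_location are hypothetical names for illustration, not tskit API.

```python
from typing import NewType

# Distinct ID types for a static checker; at runtime both are plain ints.
NodeId = NewType("NodeId", int)
IndividualId = NewType("IndividualId", int)


def individual_location(
    locations: list[tuple[float, float]], ind: IndividualId
) -> tuple[float, float]:
    """Look up a location by individual ID (toy table, not tskit)."""
    return locations[ind]


locations = [(10.0, 20.0), (30.0, 40.0)]
node_id = NodeId(1)
ind_id = IndividualId(0)

right = individual_location(locations, ind_id)

# A static checker should reject the next call because a NodeId is passed
# where an IndividualId is expected. Python itself runs it without
# complaint and returns a location for the wrong individual.
wrong = individual_location(locations, node_id)
print(right, wrong)
```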

hyanwong May 16, 2023
Maintainer Author

The non-array Python interface provides syntactic sugar for sourcing per-object information that isn't directly in that object array, e.g. site.mutations (an array of the mutations), nodes(order="timeasc"), individual.nodes (so you don't have to do a reverse lookup), potentially mutation.inherited_state (see #2631). What's the newbie alternative for getting those?

molpopgen May 16, 2023
Maintainer

nodes is qualitatively different, returning an iterator over nodes in a given order. For the rest, go straight to the array interface and the reverse lookup. (Not sure about the inherited state thing and no time to read up.)
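As a sketch of what the reverse lookup might look like, the following builds an individual-to-nodes mapping from a single column. A plain list stands in for ts.nodes_individual, with -1 playing the role of tskit.NULL; nodes_by_individual is a hypothetical helper, not a tskit function.

```python
from collections import defaultdict

NULL = -1  # same value as tskit.NULL

# Toy column standing in for ts.nodes_individual: entry u is the ID of the
# individual that node u belongs to, or NULL if it has none.
nodes_individual = [1, 1, 0, NULL, 0]


def nodes_by_individual(nodes_individual):
    """Return {individual ID: [node IDs]} in a single pass over the column."""
    lookup = defaultdict(list)
    for node_id, ind in enumerate(nodes_individual):
        if ind != NULL:
            lookup[ind].append(node_id)
    return dict(lookup)


lookup = nodes_by_individual(nodes_individual)
print(lookup[0])  # the nodes of individual 0
```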

benjeffery · 2023-05-17T09:12:48Z

benjeffery
May 17, 2023
Maintainer

I admire the creativity. I think this would be quite a large undertaking. It would give code that looks nice, but it isn't obvious to the user that each "." hides at least two or three function calls. I'm with JK here that we should be focusing on training on the array-based methods, as you need to use these to do anything serious, so you might as well learn them from the start.

1 reply

hyanwong May 17, 2023
Maintainer Author

Fair points all. If we are wholesale moving to the array methods, though, what's the replacement for the syntactic sugar such as mutation.inherited_state that can turn out to be very handy? It would be good not to have to switch between API methods just to access these extra functions. E.g. do we provide array-based functions that calculate these for you? Is this inefficient when all you want is a single value for a given object (and are we therefore assuming that people will be wanting to do calculations on the whole tree sequence, rather than on small subsets of some of the objects)?
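To make the question concrete, here is a rough sketch of what an array-based calculation of inherited states could look like. It assumes that the inherited state of a mutation is the derived state of its parent mutation, or the ancestral state of its site if it has no parent. Plain lists with made-up values stand in for the tskit columns, and inherited_states is a hypothetical helper.

```python
NULL = -1  # same value as tskit.NULL

# Toy columns, one entry per site or per mutation.
sites_ancestral_state = ["A", "C"]
mutations_site = [0, 0, 1]
mutations_parent = [NULL, 0, NULL]  # mutation 1 sits below mutation 0
mutations_derived_state = ["T", "G", "T"]


def inherited_states(sites_ancestral_state, mutations_site,
                     mutations_parent, mutations_derived_state):
    """Inherited state for every mutation, under the assumption above."""
    return [
        sites_ancestral_state[site] if parent == NULL
        else mutations_derived_state[parent]
        for site, parent in zip(mutations_site, mutations_parent)
    ]


states = inherited_states(
    sites_ancestral_state, mutations_site,
    mutations_parent, mutations_derived_state,
)
print(states)
```

Note that this computes the value for every mutation even if only one is wanted, which is the inefficiency asked about above.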

We tend to focus on whole-genome analysis, but I suspect that many users are only bothered about investigating certain portions of a larger tree sequence (e.g. a few sites, or a subset of samples). If we push people down the array route, do we encourage them to simplify or delete_intervals before doing targeted investigations?

petrelharp · 2023-05-17T16:12:13Z

petrelharp
May 17, 2023
Maintainer

I also admire the creativity but am with JK. But, a tangential note: we could make some of this nicer, e.g., we could make position an attribute of the Mutation object, then

mutation_pos = ts.site(ts.mutation(m).site).position

would be

mutation_pos = ts.mutation(m).position

In another direction, we might allow the argument to ts.individual() to be a Mutation object, so that

right_location = ts.individual(ts.node(ts.mutation(m).node).individual).location

could be

right_location = ts.individual(ts.mutation(m)).location

I think the mixing up of ID types, particularly between nodes and individuals, is a potentially major source of bugs; so if there are good ways to avoid that, it'd be nice.
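A minimal sketch of that second idea, assuming a lookup that accepts either a plain individual ID or a mutation object and resolves the latter through its node. The Mutation dataclass and individual_id function here are toy stand-ins, not the tskit classes, and a list stands in for ts.nodes_individual.

```python
from dataclasses import dataclass

NULL = -1  # same value as tskit.NULL


@dataclass(frozen=True)
class Mutation:
    """Toy mutation record: just an ID and the node it sits above."""
    id: int
    node: int


def individual_id(arg, nodes_individual):
    """Resolve arg to an individual ID.

    A Mutation is followed through its node; anything else is taken to be
    an individual ID already.
    """
    if isinstance(arg, Mutation):
        ind = nodes_individual[arg.node]
        if ind == NULL:
            raise ValueError("the mutation's node has no individual")
        return ind
    return int(arg)


nodes_individual = [1, 1, 0, NULL]
from_mutation = individual_id(Mutation(id=0, node=2), nodes_individual)
from_plain_id = individual_id(1, nodes_individual)
print(from_mutation, from_plain_id)
```

A plain integer is still ambiguous here (it could be a node ID passed by mistake), so this only removes the confusion when the caller hands over an object.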

0 replies
