Replies: 3 comments 6 replies
-
I think this would just create more confusion, and it would be better to teach people to think about the data model in terms of arrays in the first place (like mutation_pos = ts.sites_position[ts.mutations_site[m]] The advantage being that it's a straightforward change to writing things efficiently using numpy vector operations (or numba), vs a full rewrite if using things like My vote would be to spend (a small fraction) of the effort refocusing teaching materials to use the array-oriented API instead. |
Beta Was this translation helpful? Give feedback.
-
I admire the creativity. I think this would be quite a large undertaking. It would give code that looks nice, but it isn't obvious to the user that each |
Beta Was this translation helpful? Give feedback.
-
I also admire the creativity but am with JK. But, a tangential note: we could make some of this nicer, e.g.,
would be
In another direction, we might allow the argument to
could be
I think the mixing up of ID types, particularly between nodes and individuals, is a potentially major source of bugs; so if there's good ways to avoid that, it'd be nice. |
Beta Was this translation helpful? Give feedback.
-
Here's a rather radical idea that I came across when looking at our teaching material while wearing my "newbie to tskit" hat. Perhaps there's something in it, or perhaps it would cause too much code churn or maintenance burden: either way it could be worth a preliminary discussion before shooting it down in flames.
The trigger for the idea is that I find it rather confusing that e.g. to get the position for a mutation, or the time for an edge parent, you need to repeat the name of the tree sequence (with the potential to get that wrong if you have a set of similarly named tree sequences lying around, and the issue of using the wrong integer source, e.g. the node id rather than the individual ID, etc):
This also forces beginners to know all about tskit ids from the start, and be comfortable bandying them around. It would be less error prone, and make teaching easier if we could do
In other words, it would be most helpful if the python API returned objects instead of integers for Mutation.site, Edge.parent, etc. There are also some other places where it's unclear to a beginner whether an ID or an object is being returned: e.g.
TreeSequence.nodes()
returns an iterator over node objects, whereasTree.nodes()
returns an iterator over node IDs. My potential suggestion here for the high-level API is that we should always return objects rather than simply IDs where possible.It seems at first glance that if we wish to retain backwards compatibility, the boat sailed a long time ago on this possibility. But I think there may be a solution, although I don't know how much performance loss there would be. That is to treat a
Node
object as a special type of integer, with additional properties that return the correct items. The id for a node object could then be obtained either simply by treating the object as an integer, or (more clearly) by accessing (say) a.id
attribute. A sketch for something that would work might look a bit like this:As you can see, it's necessary to keep a reference to the containing tree sequence in each object, which could turn out to hit performance badly, I guess, especially when doing e.g.
for node in ts.nodes():
? However, I think we are moving to a world in which the python API usingts.node(u)
etc is for simple stuff, basic algorithm development, etc, and for really performant code you should go withts.nodes_flags
,ts.nodes_time
, etc. So maybe a performance hit is acceptable?It's a large (huge?) change to standard tskit usage, but could, I think, both reduce errors when using
tskit
and help newcomers. Thoughts?Beta Was this translation helpful? Give feedback.
All reactions