Skip to content
This repository has been archived by the owner on May 27, 2024. It is now read-only.

WIP - Fix #154 start time of freq based trip should be immutable. #336

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

scrudden
Copy link

Summary:
This is to add the rule detailed in issue #154

Expected behavior:
It should record an E053 error if the start time for a trip does not stay the same between updates.

It looks for the latest stop sequence in the respective (previous/current) trip updates for a vehicle and if the current one is the same or after the previous one it then checks that the start time is the same. If they are not the same it raises an E053 error.

@CLAassistant
Copy link

CLAassistant commented Sep 15, 2018

CLA assistant check
All committers have signed the CLA.

@scrudden
Copy link
Author

Build was too slow and timed out on travis.

@barbeau
Copy link
Member

barbeau commented Sep 15, 2018

Strange - I just restarted it. Thanks for working on this!

@scrudden
Copy link
Author

Travis build timed out again.

@barbeau
Copy link
Member

barbeau commented Sep 17, 2018

Ok, Travis builds seem to be working again. I think downloading some of the artifacts was timing out.

Copy link
Member

@barbeau barbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scrudden Thanks for working on this! Generally it looks good, you've gotten most of the rule implementation correct. I've added some requests for formatting change (just use IntelliJ format) and fairly straightforward text changes in-line.

The one major change I'd request is that instead of doing the nested FOR loop, let's set up a HashMap of vehicle_id->entity within the validate() method for the previous message before looping through the current message. This will reduce the run time complexity from O(n^2) to O(n) for a relatively minor increase in memory, as we'll be able to fetch entities with matching vehicle_ids between the previous and current feeds from the HashMap in O(1) time.

Please let me know if you have any questions!

RULES.md Outdated Show resolved Hide resolved
RULES.md Outdated Show resolved Hide resolved
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because a VehiclePositions feed can contain a TripDescriptor too, we should also check for this same rule within the below if (entity.hasVehicle) { IF block.

See TripDescriptorValidator.java for examples of how I've factored out the code to methods for rules that are applied to TripUpdates and VehiclePositions entities.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more, I think you'd need the current_stop_sequence field to be populated in order for this to be checked against a VehiclePosition entity...

@@ -75,9 +82,34 @@
// W005 - Missing vehicle_id in trip_update for frequency-based exact_times = 0
RuleUtils.addOccurrence(W005, "trip_id " + tripUpdate.getTrip().getTripId(), errorListW005, _log);
}
if(previousFeedMessage!=null)
{
for (GtfsRealtime.FeedEntity previousEntity : previousFeedMessage.getEntityList())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to avoid the O(n^2) run time complexity that results from the FOR loop here (for each current entity, loop through each previous entity). This could get very expensive when running on a large number of large feeds in parallel, or when running batch validation on archived data.

Instead I think sacrificing some memory to loop through the previous entities and store references to them in a HashMap (vehicle_id -> GtfsRealtime.FeedEntity) before starting the for (GtfsRealtime.FeedEntity entity : feedMessage.getEntityList()) { loop would be my preferred approach. Then, for each current message, check the HashMap to see if a reference is stored for the current vehicle ID, and if so compare the start_times and other data to see if there is an error. This is how the duplicate vehicle_id rule E052 in VehicleValidator is implemented.

Apologies if your previous implementation was like this before we talked on Slack - I may have misunderstood your message.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted to the original way I had done this.

Copy link
Author

@scrudden scrudden Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to move Hashmap into validate method and populate with previousFeedMessage at the start of the method.

{
if(tripUpdate.getStopTimeUpdateCount()>0 && previousEntity.getTripUpdate().getStopTimeUpdateCount()>0)
{
if(tripUpdate.getStopTimeUpdate(tripUpdate.getStopTimeUpdateCount()-1).getStopSequence()>=previousEntity.getTripUpdate().getStopTimeUpdate(previousEntity.getTripUpdate().getStopTimeUpdateCount()-1).getStopSequence())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic seems to work well when you have a set of stop_time_updates in the feed that's less than the total number of records in stop_times.txt for that trip. In other words, when you have predictions for only a subset of the stops for a trip in a TripUpdate. Most of feeds I've seen in the wild fall into this category. The only place I see this falling apart is if the producer provides stop_time_updates for all stops in the trip, which has been advocated as a "strict" feed by Andrew Byrd and Stefan de Konink in their work with OTP. I say we go with this implementation for now and then handle strict feeds later when we have real examples of strict feeds with frequency exact_times=0 trips.

@@ -75,9 +83,25 @@
// W005 - Missing vehicle_id in trip_update for frequency-based exact_times = 0
RuleUtils.addOccurrence(W005, "trip_id " + tripUpdate.getTrip().getTripId(), errorListW005, _log);
}

if (tripUpdate.hasVehicle() && tripUpdate.getVehicle().hasId() && previousTripUpdates.get(tripUpdate.getVehicle().getId()) != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this in context of TripUpdates, I think that we need to use a composite key for the HashMap - vehicle_id + trip_id.

In a TripUpdates feed, you can have multiple TripUpdate records that have the same vehicle_id. For example, if a block is composed of two trips:

  • trip_A
  • trip_B

You can have a TripUpdate for trip_A and a TripUpdate for trip_B in the feed at the same time for the same vehicle if the same vehicle is serving both trips in the block.

All exact_times=0 trips that I've seen in real GTFS datasets have been loops, so they don't have the above issue (they just repeat the same trip_id over and over). But, I believe it's possible to have two trips defined in trips.txt and frequencies.txt that would have two different frequencies for exact_times=0 trips that would be served by the same vehicle back-to-back.

@scrudden Let me know what you think of the above.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also have two trip updates in a single feed with the same trip id/vehicle id and different start times and be correct.

Previous updates vehicle 1001

Trip_A_08:00
Trip_A_09:00

Possible current updates vehicle 1001

1 Updates for same trips as last update

Trip_A_08:00
Trip_A_09:00

2 Trip_A_08:00 completed so has updated for next two trips

Trip_A_09:00
Trip_A_10:00

So we need logic for these two scenarios. 1 is simple as if all match then OK. 2 Trip_A_10:00 is OK but how do we know?

Do we check that the start time (10:00) is after the latest matching start_time that is in both feeds for this vehicle?

Copy link
Member

@barbeau barbeau Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are 25 stops in a trip, and we assign a trip_instance value in this example just to keep track of them in our discussion, the TripUpdate data would probably look something like:

Previous updates vehicle 1001

Trip_A_08:00 - last stop_sequence = 20 (trip_instance=0)
Trip_A_09:00 - last stop_sequence = 10 (trip_instance=1)

Possible current updates vehicle 1001

1 Updates for same trips as last update

Trip_A_08:00 - last stop_sequence = 25 (trip_instance=0)
Trip_A_09:00 - last stop_sequence = 15 (trip_instance=1)

2 Trip_A_08:00 completed so has updated for next two trips

Trip_A_09:00 - last stop_sequence = 20 (trip_instance=1)
Trip_A_10:00 - last stop_sequence = 5 (trip_instance=2)


So for scenario 2, you'd be comparing:

  • Trip_A_09:00 - last stop_sequence = 20 (trip_instance=1) against
    • Trip_A_08:00 - last stop_sequence = 20 (trip_instance=0) // This would incorrectly log an error
    • Trip_A_09:00 - last stop_sequence = 10 (trip_instance=1) // This correctly does not log an error
  • Trip_A_10:00 - last stop_sequence = 5 (trip_instance=2) against
    • Trip_A_08:00 - last stop_sequence = 20 (trip_instance=0) // This correctly does not log an error
    • Trip_A_09:00 - last stop_sequence = 10 (trip_instance=1) // This correctly does not log an error

I think I got that right?

One issue with the above, though is that currently my suggestion of using vehicle_id + trip_id -> TripUpdate as a composite key would create collisions in a HashMap when executing scenario 2 (or any case where there are multiple trip_id + vehicle_ids in the previous message). So we'd have to create a HashMap of vehicle_id + trip_id -> List<TripUpdate> or something else to even be able to track this in code.

It would definitely be helpful to define these scenarios in the unit tests to see which ones we're covering and which we aren't. I'm fine with covering the cases we can, and then in the rule description acknowledging which cases we may have false positive or false negatives on.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See items changed in bold. The last stop sequence will be the last stop of the trip if there is an update for a subsequent trip.

Previous updates vehicle 1001

Trip_A_08:00 - last stop_sequence = 25 (trip_instance=0)
Trip_A_09:00 - last stop_sequence = 10 (trip_instance=1)

Possible current updates vehicle 1001

1 Updates for same trips as last update

Trip_A_08:00 - last stop_sequence = 25 (trip_instance=0)
Trip_A_09:00 - last stop_sequence = 15 (trip_instance=1)

2 Trip_A_08:00 completed so has updated for next two trips

Trip_A_09:00 - last stop_sequence = 25 (trip_instance=1)
Trip_A_10:00 - last stop_sequence = 5 (trip_instance=2)

I have already locally added a sorted list of trip updates to the HashMap for the current and previous messages (currently with a vehicle_id key but I will create a composite key with trip_id). I now need to consider the logic of the checks again. I think adding tests to match the above scenarios will be a start to this.

I'll check in what I have changed but it is very much a WIP.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last stop sequence will be the last stop of the trip if there is an update for a subsequent trip.

That's not guaranteed, but I agree that will be typical behavior.

I have already locally added a sorted list of trip updates to the HashMap for the current and previous messages

For the strategy I was thinking of, I don't think we need a HashMap for the current message - we should just be able loop through the current message once. I'm not sure I understand why you're sorting by start_time, either - if possible I'd prefer to avoid sorts as we pay another runtime penalty of at least log(n).

I think adding tests to match the above scenarios will be a start to this.

I agree, that's the only way I'll be able to convince myself it works :). I had the same issue with some other tests that had a number of different permutations of field values.

I'll check in what I have changed but it is very much a WIP.

Ok, sounds good. Let me know when you want me to go through it again.

@scrudden scrudden changed the title Fix #154 start time of freq based trip should be immutable. WIP - Fix #154 start time of freq based trip should be immutable. Sep 20, 2018
tripUpdatesMap.put(new TripUpdateKey(tripUpdate.getVehicle().getId(), tripUpdate.getTrip().getTripId()), tripUpdates);
}
}
private class TripUpdateKey
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could just concatenate the vehicle_id and trip_id strings and that by itself would be fine for a composite key? So "1234_400" or something similar. Having an entire class for this seems overblown to me.


private boolean isNewTrip(HashMap<TripUpdateKey, List<TripUpdate>> previousTripUpdates, TripUpdate tripUpdate) {
TripUpdate latest = null;
List<TripUpdate> tripUpdates = previousTripUpdates.get(new TripUpdateKey(tripUpdate.getVehicle().getId(), tripUpdate.getTrip().getTripId()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some comments here to make it a bit easier to follow the logic?

for (GtfsRealtime.FeedEntity entity : previousFeedMessage.getEntityList()) {
if (entity.hasTripUpdate()) {
GtfsRealtime.TripUpdate tripUpdate = entity.getTripUpdate();
addTripUpdate(previousTripUpdates, tripUpdate);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to have addTripUpdate() return the HashMap, rather than passing it as a parameter.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants