
Modeling and functions #829

Merged (45 commits, Aug 19, 2021)

Conversation

corinne-hcr (Contributor)

Move code from the private data repo to here.
These are what I have for now.
When I am done with refactoring, I will update the code here.

@shankari (Contributor)

@corinne-hcr can you confirm that this is the most recent version of the code? Note that this is different from #826

@corinne-hcr (Contributor, Author)

Yes, the files in this PR are copied from the private data repo. They are what I have and am working on now.

@shankari (Contributor)

> When I am done with refactoring, I will update the code here.

Note that I plan to significantly refactor this code as part of my cleanup. I would suggest importing directly from this repo instead of making a copy, because you will not be able to simply copy your changes over.

corinne-hcr and others added 3 commits July 19, 2021 14:55
…on_pipeline, modify a parameter in the plot code
- Replace the participant filter and the direct call to `edb` with
  the standard call to the abstract timeseries (see the sketch below)

- Change the imports to the standard format, with standard library imports
  first and our imports after
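
For illustration, a minimal sketch of the first change, assuming the standard abstract-timeseries accessors used elsewhere in e-mission-server (`esta.TimeSeries.get_time_series` / `find_entries`); the exact keys and call sites in this PR may differ:

```
import emission.storage.timeseries.abstract_timeseries as esta

def read_confirmed_trips(user_id):
    # Go through the abstract timeseries interface instead of querying
    # the database collections via `edb` directly.
    ts = esta.TimeSeries.get_time_series(user_id)
    # time_query=None reads the full history for this user
    return list(ts.find_entries(["analysis/confirmed_trip"], time_query=None))
```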
@shankari (Contributor)

You are reading all the user trips twice while determining the parameters.
First, in get_user_ls (emission/analysis/modelling/tour_model/get_users.py) and second, in emission/analysis/modelling/tour_model/evaluation_pipeline.py as part of the main function. Not surprising that everything is super slow....

@corinne-hcr (Contributor, Author)

I use get_user_ls in the notebook code, and it runs fast there.
The clustering code takes around 8 minutes for 1 user and around 50 minutes for all users. The tuning part is what is slow.

@shankari (Contributor)

@corinne-hcr please do not make any additional changes.

I am refactoring the code much more substantially, and if you push changes as well, I am going to get a bunch of merge conflicts.

I should have taken over the codebase a lot earlier, but I waited for you to push all your changes so that we would not be working on the same codebase at the same time.

At this point, I'm going to have to pull an all-nighter to clean up this code, so please don't make my life harder.

@shankari (Contributor)

@corinne-hcr Please work on the unit tests instead so I can know that my refactoring is correct.

shankari added 7 commits July 19, 2021 21:48
The previous implementation dates from when we didn't know for sure whether
there would be UUID entries for each user. But that has been true by default
for years now.

If we ever got into a situation in which we didn't have the uuid_db, we would
just create the missing entries.

Switching to the UUID DB avoids weird UUIDs like `None` and `null`, and also
is significantly faster.

```
%timeit edb.get_uuid_db().distinct("uuid")
1.29 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

```
%timeit edb.get_timeseries_db().distinct("user_id")
3.68 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
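
For reference, a minimal sketch of the faster read, using the `edb` helpers shown in the timings above:

```
import emission.core.get_database as edb

def get_all_uuids():
    # Read registered user ids from the UUID DB (milliseconds) instead of
    # computing distinct user_ids over the whole timeseries (seconds).
    return list(edb.get_uuid_db().distinct("uuid"))
```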
- Enable logging
- Add logging for each stage of the pipeline
- Change all `for i in range(len(x))` to `for y in x` or `for i, y in
  enumerate(x)` as needed (a small illustration follows this list)
- Change the logging in `map_labels_mode` to log all the rows instead of one
  row at a time, to be consistent with the log message
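
The loop rewrite in the list above is the standard Python idiom; a small illustration:

```
labels = ["walk", "bike", "drove_alone"]

# Before: index-based iteration
for i in range(len(labels)):
    print(i, labels[i])

# After: iterate directly, or use enumerate when the index is still needed
for label in labels:
    print(label)

for i, label in enumerate(labels):
    print(i, label)
```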
+ enable logging
+ replace `count` with `uniqcount` so it doesn't conflict with the count() function
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html

Without this fix, we run into:
#2 (comment)
Instead of running on only the first one
This required some changes to the intake pipeline code:
- add a new algorithm type
- add a new non-placeholder function that calls the actual load prediction code

This also required changes to the load_predict code to match the intake pipeline:
- accept a single trip instead of an array as input (see the sketch after the
  testing notes)
    - change the trip access functions accordingly
- enable logging + add additional log statements to know what we are doing
- convert the `print` statements in the ad-hoc unit testing to assert
  statements so we could actually see if they failed
- improve exception handling by catching and printing the exception

Testing done:
- Ran the intake pipeline; it passed
- Connected a UI client from master, checked that there were no errors
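
A sketch of the interface change to the load-prediction code, with hypothetical names standing in for the real functions (the actual predictor matches the trip against the saved model):

```
def _lookup_predictions(trip):
    # Stand-in for the real bin-matching logic in the load_predict code
    return [{"labels": {"mode_confirm": "walk", "purpose_confirm": "exercise"}, "p": 1.0}]

# Before (sketch): the predictor took an array of trips
def predict_labels_for_trips(trips):
    return [_lookup_predictions(t) for t in trips]

# After (sketch): the intake pipeline calls the predictor once per trip
def predict_labels(trip):
    return _lookup_predictions(trip)

# Ad-hoc checks converted from prints to asserts so failures are actually visible
test_trip = {"start_loc": None, "end_loc": None}
assert len(predict_labels(test_trip)) > 0, "expected at least one prediction"
```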
@shankari (Contributor)

At this point, everything seems to work end-to-end for the minipilot dataset.
@corinne-hcr congratulations! It looks like your code mostly worked; I only had to spend ~5 hours and make some changes to integrate it. It still needs a lot of cleanup, both in terms of code structure and in terms of performance, but it does work in terms of functionality, so we can get the end-to-end demo working before the meeting.

shankari and others added 5 commits July 20, 2021 11:49
We have a cold start problem with demos - we want to be able to showcase the
labeling, but it won't work until we have sufficient labels.

This commit introduces a new placeholder that will help with demos. It pulls
the unique set of existing labels and randomly returns one of them 9/10th of
the time. The remaining 1/10th of the time, it returns no inference.

This lets us prototype our goal, where most of the trips are labeled, and the
user only occasionally has to input labels.

However, if the user does have sufficiently good labels, we do want to use
them. So we use our built-in ensemble method to generate two labels, pick the
"real", non-placeholder one if it exists, and fall back to the placeholder if
it doesn't.

Note that if we enhanced the placeholder to use the actual distribution of the
existing labels, this could arguably be a very simple randomized algorithm.
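
A minimal sketch of the placeholder logic described above; the real implementation pulls the unique labels from the user's confirmed trips, here they are simply passed in:

```
import random

def placeholder_predictor_demo(unique_user_inputs):
    """Return one of the user's existing label combinations roughly 9/10ths of
    the time, and no inference the remaining 1/10th of the time."""
    if len(unique_user_inputs) == 0:
        # A user with no labels at all cannot get a placeholder prediction
        # (this is the corner case later handled in f2060cc).
        return []
    if random.randrange(0, 10) > 0:
        chosen = random.choice(unique_user_inputs)
        return [{"labels": chosen, "p": random.random()}]
    return []

# Example usage
inputs = [{"mode_confirm": "walk", "purpose_confirm": "exercise", "replaced_mode": "bus"}]
print(placeholder_predictor_demo(inputs))
```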

+ added a super simple script to reset only the label inference stage and
only the confirmed trip stage, so that we can experiment with this without
having to re-run the complete pipeline

+ added a super simple script to print out label stats - how many of each kind of trip, to debug pipeline operation

Testing done:
- Reset the inference pipeline
- Ran the intake pipeline

Placeholder was invoked

```
2021-07-20 10:54:54,758:DEBUG:4457897408:In placeholder_predictor_demo: ound 85 for user 1d292b85-c549-409a-a10d-746e957582a0, returning value {'mode_confirm': 'shared_ride', 'purpose_confirm': 'meal', 'replaced_mode': 'same_mode'}
2021-07-20 10:54:54,866:DEBUG:4457897408:In placeholder_predictor_demo: ound 85 for user 1d292b85-c549-409a-a10d-746e957582a0, returning value {'mode_confirm': 'walk', 'purpose_confirm': 'exercise', 'replaced_mode': 'bus'}
2021-07-20 10:54:54,972:DEBUG:4457897408:In placeholder_predictor_demo: ound 85 for user 1d292b85-c549-409a-a10d-746e957582a0, returning value {'mode_confirm': 'skateboard', 'purpose_confirm': 'exercise', 'replaced_mode': 'same_mode'}
```

Final result was a combination of real and placeholder

```
2021-07-20 10:54:55,182:DEBUG:4457897408:Found real prediction Labelprediction({'trip_id': ObjectId('60f70def49a6d6de8e8ef803'), 'algorithm_id': 4, 'prediction': [{'labels': {'mode_confirm': 'not_a_trip', 'purpose_confirm': 'home', 'replaced_mode': 'not_a_trip'}, 'p': 1.0}], 'start_ts': 1612036947.149, 'end_ts': 1612037455.46}), using that preferentially
2021-07-20 10:54:55,286:DEBUG:4457897408:No real prediction found, using placeholder prediction Labelprediction({'trip_id': ObjectId('60f70def49a6d6de8e8ef807'), 'algorithm_id': 5, 'prediction': [{'labels': {'mode_confirm': 'walk', 'replaced_mode': 'same_mode', 'purpose_confirm': 'meal'}, 'p': 0.9517281073260278}], 'start_ts': 1612132424.3783095, 'end_ts': 1612133524.871})
```

- Reset confirmed trips

```
all inferred trips 215
all confirmed trips 0
confirmed trips with inferred labels 0
confirmed trips without inferred labels 0
confirmed trips with expectation 0
```

- Reran the pipeline

```
all confirmed trips 215
confirmed trips with inferred labels 215
confirmed trips without inferred labels 0
confirmed trips with expectation 215
```
Before fix:
#2 (comment)

After fix:

- Before reset
```
Inferred trips with inferences 215
All expected trips 430
Expected trips with inferences 430
all inferred trips 215
all confirmed trips 215
confirmed trips with inferred labels 215
confirmed trips without inferred labels 0
confirmed trips with expectation 215
```

- After reset

```
$ ./e-mission-py.bash bin/debug/reset_partial_label_testing.py -c
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
{'n': 430, 'ok': 1.0}
{'n': 215, 'ok': 1.0}
{'n': 1, 'ok': 1.0}

storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
All inferred trips 215
Inferred trips with inferences 215
All expected trips 0
Expected trips with inferences 0
...
confirmed trips with inferred labels 0
confirmed trips without inferred labels 0
confirmed trips with expectation 0
```

- After re-run

```
All expected trips 215
Expected trips with inferences 215
all inferred trips 215
all confirmed trips 215
confirmed trips with inferred labels 215
confirmed trips without inferred labels 0
confirmed trips with expectation 215
```
+ Add the ability to select an ensemble function like primary_algorithms does for normal inference algorithms (see the sketch after this list)
+ Prediction algorithms now go in inferrers.py
+ Ensemble algorithms now go in ensembles.py
+ Just needed to update the fully qualified names of the predictors
- Note that the ensemble test is weak enough that it did not need to be
  changed -- this may or may not be something to work on in the future
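
A sketch of the ensemble behaviour described in the placeholder commit above (the function name and result format are illustrative, not the exact contents of ensembles.py):

```
def prefer_real_over_placeholder(real_prediction, placeholder_prediction):
    """Pick the real, non-placeholder prediction when it exists; otherwise
    fall back to the placeholder prediction."""
    if real_prediction:  # non-empty list of {"labels": ..., "p": ...} dicts
        return real_prediction
    return placeholder_prediction

# No real prediction available, so the placeholder is used
print(prefer_real_over_placeholder([], [{"labels": {"mode_confirm": "walk"}, "p": 0.9}]))
```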
shankari and others added 8 commits July 27, 2021 01:13
We copy them instead of modifying the existing models so that we can run them
side by side for a comparison.

TODO: Unify with parameters (at least for the auxiliary files) instead of
making copies.

This is currently unchanged so that we can have a clean diff when we do apply
the changes.
Configuration changes include:
- `data_preprocessing.py`: don't drop user input where any field is None; only
  filter a trip out if all of its fields are undefined (see the sketch after this list)
- `evaluation_pipeline.py`: turn off filtering and cutoffs for the similarity code
- `get_users.py`: don't require that 50% of the trips are labeled
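
A sketch of the relaxed filter, assuming the user input is a dict with the three confirm fields (illustrative; the real code operates on the preprocessed trip dataframe):

```
CONFIRM_FIELDS = ["mode_confirm", "purpose_confirm", "replaced_mode"]

def keep_user_input(user_input):
    # Old behaviour: drop the trip if *any* field was missing.
    # New behaviour: keep the trip as long as *at least one* field is defined.
    return any(user_input.get(f) is not None for f in CONFIRM_FIELDS)

print(keep_user_input({"mode_confirm": "walk", "purpose_confirm": None, "replaced_mode": None}))  # True
print(keep_user_input({"mode_confirm": None, "purpose_confirm": None, "replaced_mode": None}))    # False
```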

Then make significant changes to `build_save_model`. Concretely:
- save the models after the first round, which significantly simplifies the
  prediction code
- pull out the location map generation and user input generation code into
  separate functions to simplify the code invocation.

Testing done:
- Ran it against the staging database
- Several users did not have any labeled trips

```
$ grep "enough valid trips" /tmp/staging_run.log
2021-07-27 01:31:02,224:DEBUG:Total: 0, labeled: 0, user 9cfc3133-3505-4923-b6fb-feb0e50a5aba doesn't have enough valid trips for further analysis.
2021-07-27 01:31:02,246:DEBUG:Total: 0, labeled: 0, user cc114670-ef99-4247-a3a2-47529247d8ac doesn't have enough valid trips for further analysis.
2021-07-27 01:31:02,267:DEBUG:Total: 0, labeled: 0, user 2bc8ca71-7d0f-4930-ba2c-cf97f7dceaea doesn't have enough valid trips for further analysis.
2021-07-27 01:31:05,340:DEBUG:Total: 171, labeled: 6, user 6373dfb8-cb9b-47e8-8e8f-76adcfadde20 doesn't have enough valid trips for further analysis.
```

They were filtered correctly

```
$ grep "enough valid trips" /tmp/staging_run.log | wc -l
      21
```

Models for 20 users were created

```
$ ls locations_first_round_* | wc -l
      20
```

No errors
If the user has never submitted an input, we can't infer anything.
So we return early.
This fixes e-mission#829 (comment)
Concretely:
- pull out the matching code into `find_bin`
- load only locations and user inputs
- the prediction flow is now trip -> bin -> user_labels (sketched after the testing notes below)

Change filenames for consistency

Added additional log statements for clarity

Testing done:
- Ran the intake pipeline
- Ran `label_stats` for a single user
- Ran the intake pipeline again
- Confirmed that the confirmed trips had inferences
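
A rough sketch of the trip -> bin -> user_labels flow; all names, the planar distance, and the radius are illustrative stand-ins for the real `find_bin` and model files:

```
import math

def _distance(loc1, loc2):
    # Stand-in for the real great-circle distance; treats coordinates as planar
    return math.dist(loc1, loc2)

def find_bin(trip, bins, radius):
    """Return the index of the first bin whose representative start/end
    locations are both within `radius` of this trip's, or -1 if none match."""
    for i, bin_locs in enumerate(bins):
        if (_distance(trip["start_loc"], bin_locs["start"]) <= radius and
                _distance(trip["end_loc"], bin_locs["end"]) <= radius):
            return i
    return -1

def predict_labels(trip, bins, bin_user_labels, radius=500):
    if not bin_user_labels:
        # The user has never submitted an input, so there is nothing to infer
        return []
    matched = find_bin(trip, bins, radius)
    if matched == -1:
        return []
    return bin_user_labels[matched]  # e.g. [{"labels": {...}, "p": 0.8}, ...]
```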
Previous commit: 71e299e
Fixes: model filenames, the load process, and a variable name

Testing done: it really works, even with a reset
- Remove it from the list of algorithms
- Switch the ensemble to the one that returns the first value only
Since this is a normal use case and not an error.
Simplified and reconfigured model building and application
@corinne-hcr (Contributor, Author)

@shankari the CI run is not successful. Is it affected by the previous PR?

@shankari (Contributor)

If you look at the logs, all the

```
    random_user_input = random.choice(unique_user_inputs) if random.randrange(0,10) > 0 else []
  File "/home/runner/miniconda-4.8.3/envs/emissiontest/lib/python3.7/random.py", line 261, in choice
    raise IndexError('Cannot choose from an empty sequence') from None
IndexError: Cannot choose from an empty sequence
```

errors are gone, both because of f2060cc, which handled the corner case, and because of e666d22, which removed the random placeholder.

There is still a failure due to

```
======================================================================
FAIL: testGetUUIDList (storageTests.TestTimeSeries.TestTimeSeries)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/e-mission-server/e-mission-server/emission/tests/storageTests/TestTimeSeries.py", line 39, in testGetUUIDList
    self.assertIn(self.testUUID, uuid_list)
AssertionError: UUID('25c227ba-fefe-4c5f-9b01-f5bbcf1617e7') not found in []

----------------------------------------------------------------------
```

Since that is in the storage tests, it must be because of 7fac452.

Let me take a look.

@shankari (Contributor) commented Aug 3, 2021

We currently have hard-coded radius and other parameters for the first round.

So the instructions for generating labels are:

- generate the model files: `./e-mission-py.bash emission/analysis/modelling/tour_model_first_only/build_save_model.py`
  - run one time
  - only re-run if labels have changed
- reset the inference pipeline: `./e-mission-py.bash bin/debug/reset_partial_label_testing.py -i`
- reset the confirmed trips pipeline: `./e-mission-py.bash bin/debug/reset_partial_label_testing.py -c`
- re-run the intake pipeline: `./e-mission-py.bash bin/intake_multiprocess.py 3`

So that we can collapse the different categories and increase the percentage
e-mission/e-mission-docs#656 (comment)

+ also check whether the replacement mode actually exists before performing
the mapping, since we are checking this into master and it may not always
exist (see the sketch after this commit message).

And even on CanBikeCO, we may have users who have never filled in a replaced mode.
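
A sketch of the guarded mapping (illustrative; the real change maps `same_mode` to the actual confirmed mode inside the label inference code):

```
def map_replaced_mode(labels):
    """If the user said the trip replaced the "same mode", substitute the
    actual confirmed mode, but only if a replaced mode exists at all."""
    if "replaced_mode" not in labels:
        # e.g. deployments where users never fill in a replaced mode
        return labels
    if labels["replaced_mode"] == "same_mode":
        labels = dict(labels, replaced_mode=labels["mode_confirm"])
    return labels

print(map_replaced_mode({"mode_confirm": "walk", "replaced_mode": "same_mode"}))
print(map_replaced_mode({"mode_confirm": "walk"}))
```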
And a launch script that takes multiple possible arguments.
This allows us to build individual models instead of building them all at the
same time.
@shankari (Contributor) commented Aug 4, 2021

@GabrielKS I have now enabled the same_mode to actual mode mapping to fix e-mission/e-mission-docs#656 (comment)

Can you see if this makes a difference?

New instruction to build the model: `./e-mission-py.bash bin/build_label_model.py -a` to rebuild all models. There are other options to rebuild models for only subsets of users; see `--help` for those options.

This ensures that we can match trips properly
(for fix, see e-mission/e-mission-docs#656 (comment))
@shankari (Contributor) commented Aug 5, 2021

Deployed this to staging.

- Rebuilt the models (`./e-mission-py.bash bin/build_label_model.py -a > /var/log/better_model.log 2>&1`)
- Confirmed that all the model files had been updated (timestamp = Aug 5 05:52)
- Reset the inferred trips: `./e-mission-py.bash bin/debug/reset_partial_label_testing.py -i`

  Note that because of the large number of expected trips, the deletion takes 2 minutes and involves multiple thousand expected trips.

  ```
  {'n': 5922, 'ok': 1.0}
  {'n': 1094421, 'ok': 1.0}
  {'n': 5922, 'ok': 1.0}
  {'n': 5922, 'ok': 1.0}
  {'n': 78, 'ok': 1.0}
  ```

- By this time, the regular pipeline run had started, so I waited until it was complete and re-ran the reset:

  ```
  $ ./e-mission-py.bash bin/debug/reset_partial_label_testing.py -i
  {'n': 3872, 'ok': 1.0}
  {'n': 3872, 'ok': 1.0}
  {'n': 3872, 'ok': 1.0}
  {'n': 3872, 'ok': 1.0}
  {'n': 38, 'ok': 1.0}
  $ ./e-mission-py.bash bin/debug/reset_partial_label_testing.py -c
  {'n': 0, 'ok': 1.0}
  {'n': 5922, 'ok': 1.0}
  {'n': 39, 'ok': 1.0}
  ```

- Re-ran the pipeline
- Checked one user; everything looks good

  ```
  all inferred trips 264
  all confirmed trips 264
  confirmed trips with inferred labels 150
  confirmed trips without inferred labels 114
  confirmed trips with expectation 264
  ```

shankari and others added 6 commits August 10, 2021 11:56
As a performance improvement, in 7fac452 we
switched to reading the uuids from the UUID database instead of looking at the
unique entries in the timeseries.

This resulted in an improvement of roughly three orders of magnitude

```
%timeit edb.get_uuid_db().distinct("uuid")
1.29 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

```
%timeit edb.get_timeseries_db().distinct("user_id")
3.68 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

But it required that all users have an entry in the UUID DB to show up in the
`get_all_uuids()` call.

So we change this test setup to ensure that the test user shows up:
- change the existing test loading code to also register users if a
  `registerEmail` is filled in; this required some refactoring (see the sketch
  after the test output below)
- for this test, pass in a `registerEmail` in the test object

Testing done:

```
$ ./e-mission-py.bash emission/tests/storageTests/TestTimeSeries.py TestTimeSeries.testGetUUIDList
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost:27017
2021-08-10 12:38:33,424:INFO:4669529536:Before loading from emission/tests/data/real_examples/shankari_2015-aug-21, timeseries db size = 4527962
2021-08-10 12:38:33,471:INFO:4669529536:registering email = user1
Setting up real example for 4405c6ce-f8c8-4e37-b5b7-c69c0072135e
2021-08-10 12:38:37,001:INFO:4669529536:After loading, timeseries db size = 4530087
2021-08-10 12:38:37,019:DEBUG:4669529536:First few entries = ['2015-08-21T07:55:13.408000-07:00', '2015-08-21T07:55:13.408000-07:00', '2015-08-21T07:55:01.638000-07:00', '2015-08-21T07:55:07.414000-07:00', '2015-08-21T07:55:01.638000-07:00', '2015-08-21T07:54:51.095000-07:00', '2015-08-21T07:54:51.095000-07:00', '2015-08-21T07:54:51.095000-07:00', '2015-08-21T07:54:52.153000-07:00', '2015-08-21T07:55:07.414000-07:00']
2021-08-10 12:38:37,030:INFO:4669529536:Before loading from emission/tests/data/real_examples/shankari_2015-aug-27, timeseries db size = 4530087
2021-08-10 12:38:37,063:INFO:4669529536:registering email = user1
Setting up real example for 4405c6ce-f8c8-4e37-b5b7-c69c0072135e
2021-08-10 12:38:39,418:INFO:4669529536:After loading, timeseries db size = 4531569
2021-08-10 12:38:39,433:DEBUG:4669529536:First few entries = ['2015-08-21T07:55:13.408000-07:00', '2015-08-21T07:55:13.408000-07:00', '2015-08-21T07:55:01.638000-07:00', '2015-08-21T07:55:07.414000-07:00', '2015-08-21T07:55:01.638000-07:00', '2015-08-21T07:54:51.095000-07:00', '2015-08-21T07:54:51.095000-07:00', '2015-08-21T07:54:51.095000-07:00', '2015-08-21T07:54:52.153000-07:00', '2015-08-21T07:55:07.414000-07:00']
.
----------------------------------------------------------------------
Ran 1 test in 6.193s

OK
```
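
A sketch of the test-loading change; the registration helper here is a hypothetical stand-in, since the exact refactor in the test utilities is not shown in this thread:

```
import logging

def _register_user(email):
    # Hypothetical stand-in for the server's real user-registration helper,
    # which creates the UUID DB entry for this email.
    logging.info("registering email = %s" % email)

def setup_real_example(test_obj, load_dump_fn):
    load_dump_fn()  # load the example dump into the timeseries as before
    email = getattr(test_obj, "registerEmail", None)
    if email is not None:
        # New behaviour: register the user so they show up in get_all_uuids()
        _register_user(email)
```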
 + Implements the algorithm described in e-mission/e-mission-docs#663 (comment); an illustrative sketch follows the testing notes below
 + Uses constant values `A=0.01`, `B=0.75`, `C=0.25`
 + Changes eacilp.primary_algorithms to use the new algorithm

No unit tests yet (working on a tight timeline), but tested as follows:
 1. Run `[eacili.n_to_confidence_coeff(n) for n in [1,2,3,4,5,7,10,15,20,30,1000]]`, check for reasonableness, compare to results from plugging the formula into a calculator
 2. Run the modeling and intake pipeline with `eacilp.primary_algorithms` set to the old algorithm
 3. Run the first few cells of the "Explore label inference confidence" notebook for a user with many inferrable trips to get a list of unique probabilities and counts for that user
 4. Set `eacilp.primary_algorithms` to the new algorithm, rerun the intake pipeline (modeling pipeline hasn't changed)
 5. Rerun the notebook as above, examine how the list of probabilities and counts has changed
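
The exact formula is in the linked issue and is not reproduced here. As a purely illustrative sketch of the intended shape (a coefficient that is low for a handful of labels and saturates as the label count grows), something like the following could be checked for reasonableness in the same way; the role of `A`, `B`, `C` below is an assumption, not necessarily the PR's semantics:

```
import math

def illustrative_confidence_coeff(n, A=0.01, B=0.75, C=0.25):
    # Rises from roughly C at small n and saturates toward B as n grows.
    return B - (B - C) * math.exp(-A * n)

print([round(illustrative_confidence_coeff(n), 3)
       for n in [1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 1000]])
```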
 + Useful for playing around with other constants in notebooks
With input:

```
A, B, C
D, E, F
G, N/A, N/A
```

Before (doesn't add up to 1!!):

```
A, B, C      0.3
D, E, F      0.3
```

After (adds up to 1!!):

```
A, B, C      0.5
D, E, F      0.5
```

Hopefully this doesn't happen very often
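
A sketch of the fix (illustrative; the real code works on the user-input rows for a bin): compute the probabilities only over rows whose labels are all defined, so they sum to 1.

```
from collections import Counter

def label_probabilities(rows):
    # Exclude rows with any missing ("N/A") label from the denominator, so the
    # probabilities of the remaining label tuples add up to 1.
    valid = [r for r in rows if "N/A" not in r]
    if not valid:
        return {}
    counts = Counter(valid)
    return {labels: count / len(valid) for labels, count in counts.items()}

rows = [("A", "B", "C"), ("D", "E", "F"), ("G", "N/A", "N/A")]
print(label_probabilities(rows))  # {('A', 'B', 'C'): 0.5, ('D', 'E', 'F'): 0.5}
```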