-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up the TRB_label_assist code #35
Comments
The current implementation of the mode building part of label assist code is at: There are previous implementations in https://github.com/e-mission/e-mission-server/tree/master/emission/analysis/modelling/tour_model and https://github.com/e-mission/e-mission-server/tree/master/emission/analysis/modelling/tour_model_first_only
The custom branch uses some code from So we should:
Eventually we should take the new and improved model(s) from TRB label assist and put the into For a high level summary of what the label assist model does, please see this poster: |
Two files, Clustering.py and models.py needs to be updated for us to port the entire code to just our main repo. Q1 Among the vast number of models in the models.py file, we are currently using ClusterExtrapolationClassifier, ForestClassifier and NaiveBinningClassifier in our performance_eval.py file. Out of these three , only NaiveBinningClaasifier has dependence on the code from tour_model. But there's one another model ( which we are currently not using) RefactoredNaiveCluster which is dependent on tour_model. Should we, for now, let model be dependent on tour_model or port this to trip model ? Q2 Is there any way I can check the the working of trip_model by running it ? Is there any module that's using it that I can/should run ?? NOTE : regenerate_classification_performance_results.py. is currently standalone and is not being used by any of the notebooks. |
@humbleOldSage I just want to clarify that, while we will eventually port the code around the new models in the
|
I apologize for not reading your comment more carefully. You are correct that if there is a mode (e.g. RefactoredNaiveCluster) which is dependent on the tour model, we could choose to remove it instead of migrating it. So I would suggest:
Let's not keep unused, bitrotted code around |
Q2 Is there any way I can check the the working of trip_model by running it ? Is there any module that's using it that I can/should run ?? Trip model has unit tests, and a script to launch it against a database.
NOTE : That is by design - if you check the history:
|
I missed an indirect dependence of
|
I have been working with the sample data that the script generates for testing for now. It should be enough for my purpose. If it doesn't I'll run it on full data set over the weekend. Also, won't need to run all of them, just need 1 test file, the |
Curious - what are you running the existing tests for? Is it to test the new functionality that you are moving into |
While I move the functionalities, I am just using them to verify my understanding of the code-flow and I/O to |
The dependence of the 2 files as mentioned previously(
|
For i. , I probed the I can say this because to understand the functionalities and flow of data on the custom branch side, I ran the dependent notebooks on the CEO dataset - i.e. I then looked at the list of diffs in the extended branch This notebook only uses To match this understanding on the main branch side, I ran the existing unit test specifically for the modeling modules using the below command.
The However, both these functions are working on different data-forms, i.e., custom branch uses data in data-frame format while processing whereas the main branch uses them in So we just need to change the way data is being passed to functions, without breaking any other functionalities. We have not yet evaluated what are the other changes between
|
To begin with the changes for (i) at #35 (comment) , we start with the
where
These group labels are generated using the These generated labels are stored in two places:
The additional column (columns if we pass multiple r values), that I referred to in the beginning of this comment, copies labels from (i) (above) and appends into the data frame. IF we are able to generate these labels from the existing |
And what would it take to generate these labels from the existing trip_model in the main branch? |
In the main branch , However, rather than storing bin labels of each trip next to it,
For e.g., below is one such groups saved by
Here 0,1,2 and 3 are bin labels which have trips ( actually they are features of trips ) belonging to that bin, grouped as a list. We need |
One possible way is ( to be included in PR ) the line that gets the matching bin for the point is :
Beyond this line, grouping of trips a.t. labels take place. I can introduce a |
Changes that need to be done in the
Currently,
|
After discussions on this PR, particularly this suggestion focuses on reducing load on production side More precisely, our aim, now, would be on
like mappings from
like structure which For e.g. : From grouped bins which look like this
we retrieve tripFeature with theirs bin ids as below.
Once we have tripFeature-bin mappings, we'll have to search for each trip (using tripFeature that we have) in the datagram This would result in no changes in |
In the On the other hand , the Looks like we can connect these two by additionally passing around |
In the process to link them , I find that the Once back from these calls in the |
After searching for a while, I can conclude that there is no functionality implemented to convert df to This is still a hypothesis and needs to be tested. |
To test this hypothesis, I can check the column names in trips generated by |
@humbleOldSage one of the entries in If we are going to use dataframes extensively, we can also consider changing |
If we do decide to move to df in Using object id, post filtering, seems like a good alternate to conversions. I'll pick this route for now and move forward. |
Just want to clarify that:
again fine with taking the object id post filtering is fine for now; just realize that you might need to polish it again downstream |
Then I think it makes more sense to switch to df based processing from now itself. I'll rollback on the |
@shankari I might be wrong here , but does the following make sense ? From my understanding, I feel if we take the df way, then from However, It is ESSENTIAL to point out that my understanding above is from It might be possible that there are dependencies on
The task, I believe, then MIGHT translate to upgrading 'tour_model_extended' to be an exact df version of |
thank you for the visual representation. I agree with the first two images, but I don't understand/don't agree with the third. The interface of trip_model will take a dataframe, so I guess they are upgraded in that sense. Again, having the label assist use trip_model instead of tour model was intended as an easy-to-implement first step to get you familiar with the codebase. From the original project definition:
Trip model is structured so that it can support multiple algorithms in parallel in a modular fashion. The bin-based algorithm that is currently implemented in expanded tour model is just one of them. The goal is to just change that implementation to the expanded version, not replace trip_model entirely |
The figures generated after 97406c4 , are different from the paper. Below are the screenshots . To the left are screenshots from the commit I mentioned. To the right are the ones from the paper SUBURBAN AREA. : 50m 100m 150m |
COLLEGE CAMPUS : In case of college campus, I am unable to determine what area of the map is the screenshot (in the paper) from. However, even if I compare the old custom branch and the recent commit ones, they differ. |
It is from near LA - between LA and Irvine |
I think one of the reason, why the labels are different is due to the format of the input to the |
The other thing that was causing the issues is the On looking through these two things collectively, its just the second reason that was causing the problem. The first one is correct in itself. Once, I do the just the second changes, the suburban results match for r= 50,100,150 match with the paper. Let me check the college ones as well. |
It works for college campus as well. |
In general, for this, it is fine to make the change in the master branch because we don't have to use it in production, but we can if we want to. We should just make sure that it is configurable so we can continue using O-D on production but use the other options for evaluation only for now. In the timeseries, we did not want to make the change in master because there isn't only one set of entries we want to read so the change was not modular enough. In this case, |
With the changes in
has dependence on I've shift this to The other and better way is to check if the model_ is of instance |
While running
On investigating, it seems model was fit and saved correctly. |
To tackle this, we can use |
I am not sure that this is better. Your instinct is to have |
So have you actually done this yet? |
Yeah. This is already done in commit : a34836f. |
This will:
The code is in: https://github.com/e-mission/e-mission-eval-private-data/tree/master/TRB_label_assist
It works, but it requires us to use a custom fork of the e-mission-server repo to work
We should see the changes between the custom fork and the current e-mission-server master/gis branch
See if the changes are still necessary or whether there are implementations in e-mission-server that have superceded them
If they have not been superceded, we need to incorporate them in e-mission-server
So at the end, this should be reproducible against a core e-mission-server repo
Stretch goal: change this repo to work with docker containers that are built with the base e-mission-server container so that people don't have to install e-mission-server and set the EMISSION_SERVER_PATH and all the funky setup steps.
After we are done with this, we should integrate the random forest model + prediction into the e-mission server code for label assist.
The text was updated successfully, but these errors were encountered: