Label list stats and cleanup #287
Comments
I think that should be doable. Would an API endpoint be okay for that?
That's a great idea! Would you prefer the cleanup job to be a one-shot mechanism (i.e. rename the misspelled labels once), or should it be a recurring job (e.g. "every time someone adds the label 'carr', rename it automatically in the background to 'car'")? The former shouldn't be that hard to implement; the latter most probably requires more work to get right. It's probably a pretty tedious job to go through all the labels and fix them, but if you would like to compile a list I would be very grateful for that! Regarding the format: I don't have any particular format in mind. Of course, JSON would be great, as it's easy to parse. But writing JSON by hand is probably a nightmare. ;) So I don't mind if it's something else (YAML, CSV, or maybe a custom text protocol with your own separator, etc.), as long as it can be parsed without ambiguity.
Writing JSON by hand is OK. An endpoint is fine for the stats (you could just add a note in the developer section to document it). What I’m hoping is that the cleanup will increase the number of “hits” you’ll get for label suggestions. I have to think a bit about how the suggestions will fit with the original plan (properties). Maybe you could map them with aliases, or clean them up by remapping. Things like “luxury car” etc. often overlap, e.g. “luxury convertible car”, which does make sense with the properties idea. What I’m hoping is that some combinations like that could be exposed directly as single labels so they can be annotated in one step, while the properties system would still allow saying more later.
Perfect!
What we could also consider is doing that in a two step process. i.e: first fix all the misspelled labels, remove the placeholder labels (I think there are some placeholder labels like |
There's now a new API endpoint: https://api.imagemonkey.io/v1/label/suggestions/usage which returns a list of all label suggestions + the number of labels/annotations (I am really bad at naming stuff, so it's possible that I'll rename the |
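A rough sketch of how the usage endpoint could be consumed from Python. Only the URL comes from this thread; the response schema is not documented here, so the field names `name` and `annotations` in the sample data are purely hypothetical, and `fetch_usage` is defined but deliberately not called.

```python
import json
from urllib.request import urlopen

# URL taken from the comment above; everything about the payload is an assumption.
USAGE_URL = "https://api.imagemonkey.io/v1/label/suggestions/usage"

def fetch_usage(url=USAGE_URL):
    """Download and decode the JSON payload (assumed to be a list of objects)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)

def most_used(entries, count_key, limit=10):
    """Return the entries with the highest value under count_key, best first."""
    return sorted(entries, key=lambda e: e.get(count_key, 0), reverse=True)[:limit]

# Hypothetical sample standing in for the real response.
sample = [
    {"name": "car", "annotations": 40},
    {"name": "carr", "annotations": 2},
    {"name": "luxury car", "annotations": 12},
]
top_two = most_used(sample, "annotations", limit=2)
print([entry["name"] for entry in top_two])
```

If the real payload uses different keys, only the `count_key` argument and the sample need to change.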
Nice, that’ll help a lot with guiding cleanup and looking for tasks.
Just found my previous experiments using the label list. I had something reading in “GloVe” word vectors, and something to find the unique words from compound labels. EDIT: and yes, it finds the unmatched words, i.e. spelling mistakes: approx. 1300, although there may be a lot more label suggestions using them. It found a list of 4000 unique words used across all the suggestions (could just try to train using the images and that word list from all the split labels, i.e. a 4000-d output). Looks like the spelling mistakes are all quite low frequency (e.g. less than 10 each). It’s definitely nice being able to sort the suggestions by frequency now; I can also look for “the most popular unannotated labels” etc.
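The word-splitting experiment described above could look roughly like the sketch below. It is not the original script: the vocabulary is a tiny hand-written set standing in for the GloVe word list, and the sample counts are made up for illustration.

```python
import re
from collections import Counter

def split_words(label):
    """Split a compound label on whitespace, underscores and slashes."""
    return [w for w in re.split(r"[\s_/]+", label.lower()) if w]

def word_frequencies(suggestion_counts):
    """Map each unique word to the total usage of the suggestions containing it."""
    freq = Counter()
    for label, count in suggestion_counts.items():
        for word in set(split_words(label)):
            freq[word] += count
    return freq

def unmatched_words(freq, vocabulary):
    """Words missing from the vocabulary (likely typos), most frequent first."""
    missing = (w for w in freq if w not in vocabulary)
    return sorted(missing, key=lambda w: (-freq[w], w))

# Made-up usage counts and a stand-in vocabulary (assumption: real runs would
# load the word list from the GloVe file instead).
counts = {"luxury car": 12, "sports_car/luxury_car": 5, "carr": 2, "carbboard box": 1}
vocabulary = {"luxury", "car", "sports", "cardboard", "box"}

freq = word_frequencies(counts)
suspects = unmatched_words(freq, vocabulary)
print(suspects)
```

Sorting the suspects by frequency makes it easy to check the claim that the misspellings are mostly low-frequency.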
Few more stats .. I’ve tried filtering out the suggestions that come with graph nodes (eg there are unfortunately some spelling mistakes in the graph, but far fewer than in the full suggestion list some of these are combinations that would be mappable to properties -potential properties have nodes, eg “luxury sports car” is documented as |
Awesome, thanks for sharing these stats!
I think removing materials from label names is probably a good first candidate for the properties system. e.g we could map the label name |
this might take some thought.. In favour of properties:
in favour of the graph nodes:
Is there a way to get the best of both? I’ve used “/” for general label blending a lot. I’m hoping a parts list (head, wheel, hand, foot, handle, door, etc.) will allow filtering those out if we need it; otherwise I hope treating parts as yet another blendable word works fine (“head/cat” pixels are valid for a general “head” detector or a general “cat” detector). “/” blending can combine individual property combinations, e.g. “luxury_car/parked_car”, “sportscar/derelict_car”, etc. I’ve tried to do this in some places. What are people most likely to use? Personally I find the label graph more appealing overall, e.g. being able to place arbitrary-depth organisation over the existing labels (e.g. the taxonomy of life, “life form->animal->vertebrate->mammal->feline->domestic cat”, lets you group cat, lion, tiger to make a trainable output “feline”, or one for “all vertebrates” combining what’s in common between lizards, mammals, etc.), and the arbitrary blend offers a way to express multi-property blends. But ultimately both the UI and the data should be convertible both ways.
A few spelling mistakes in graph nodes, and one important semantic mistake: “toy->toy->toy->vehicle->toy_bus” should be “toy_vehicle->toy_bus”. Also “box->carbboard_box” should be “cardboard_box”. There are a few more spelling mistakes here and there, but glancing through, that’s the only important “semantic mistake” I found so far.
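To make the grouping idea concrete, here is a small sketch that collects every label below a graph node, so that “feline” or “vertebrate” can become one trainable output. The edge list is illustrative only (placing “lizard” directly under “vertebrate” is a simplification), and the code assumes the graph is a tree without cycles.

```python
from collections import defaultdict

def build_children(edges):
    """Turn (parent, child) pairs into a parent -> [children] lookup."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    return children

def descendants(children, node):
    """List every node below `node`, depth-first, in edge order."""
    found = []
    stack = list(reversed(children.get(node, [])))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(reversed(children.get(current, [])))
    return found

# Illustrative taxonomy based on the chain quoted above.
edges = [
    ("life form", "animal"),
    ("animal", "vertebrate"),
    ("vertebrate", "mammal"),
    ("mammal", "feline"),
    ("feline", "domestic cat"),
    ("feline", "lion"),
    ("feline", "tiger"),
    ("vertebrate", "lizard"),
]
children = build_children(edges)
felines = descendants(children, "feline")
vertebrates = descendants(children, "vertebrate")
print(felines)
print(vertebrates)
```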
Many thanks, I'll write a small script to fix those issues in the database! Regarding the graph nodes/properties discussion: that's a really tough call. My main argument for the properties system is that it allows existing annotations to be improved incrementally without drawing polygons over and over again. So e.g. someone starts by annotating a … If you want to refine an annotation done in the graph nodes style, I think the only option is to copy the existing label name, add the missing information and draw the polygon again. But I can also see that the graph nodes style has some advantages too. As you don't need to jump between the labels and the properties list, it's easier to spot where information is missing. It's basically just a flat list which can be scanned pretty easily by eye. For me personally that's one of the biggest weak points of the properties approach: it's not easy to see how well the image is covered. Another weakness of the properties system is probably that it's not that comfortable to use at the moment. I think a bunch of hotkeys wouldn't hurt to make it more accessible and more convenient to use. The properties system was basically just a small experiment to see whether something like that could work... I guess there are probably a few things we could tweak. For me personally both styles (graph node and properties) are fine. I think in the end it's not only important how the data is stored, but also how easy it is for people to contribute. As you are still by FAR the most active user, I don't want to implement something that kills your workflow. I've read so many articles over the years where developers implemented some cool-sounding features which completely killed the service, simply because they didn't listen to the needs of their users.
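A minimal sketch of the difference being discussed, assuming a made-up `Annotation` type (this is not ImageMonkey's actual schema): with properties, a refinement only touches the property set and the polygon stays as drawn, and a flat graph-node style name can still be derived from it.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """Hypothetical annotation record: one polygon, one label, optional properties."""
    label: str
    polygon: list  # list of (x, y) points, drawn once
    properties: set = field(default_factory=set)

    def refine(self, *new_properties):
        """Add information later without redrawing the polygon."""
        self.properties.update(new_properties)

    def flat_label(self):
        """Derive a flat label name: sorted properties followed by the label."""
        return "_".join(sorted(self.properties) + [self.label])

# "car" and the property names are example values chosen for illustration.
annotation = Annotation(label="car", polygon=[(0, 0), (10, 0), (10, 5), (0, 5)])
before = annotation.flat_label()
annotation.refine("luxury")
annotation.refine("parked")
after = annotation.flat_label()
print(before, "->", after)
```

Going the other way (flat label to properties) needs a list of known property words, which is where the cleanup and graph discussion above comes in.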
Not sure if this is doable in spare moments.
Would it be possible to (i) get stats, i.e. the number of labels and number of annotations per label suggestion
(I can see the list in api.imagemonkey.io/v1/label/suggestions)
.. maybe this calculation already exists to figure out trending labels
(ii) submit replacements to clean up the database (maybe in a JSON file, {"mistake": "replacement", ...})?
(E.g. clean up typing errors and simplify alternative suggestions for conventions. I speculate the “/” hard separators for combining will make life easier; there’s a bunch of “Foo or bar” type workarounds that would make the labels harder to use.)
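As an illustration of the proposed replacement file, the sketch below applies a `{"mistake": "replacement"}` mapping to a label list and merges usage counts of labels that collapse into the same name. The function names and the sample counts are invented for the example; how the server would actually apply such a file is not specified in this thread.

```python
import json
from collections import Counter

def apply_replacements(labels, replacements):
    """Replace every known mistake; labels without an entry pass through unchanged."""
    return [replacements.get(label, label) for label in labels]

def merge_counts(counts, replacements):
    """Sum usage counts of labels that map to the same corrected name."""
    merged = Counter()
    for label, count in counts.items():
        merged[replacements.get(label, label)] += count
    return dict(merged)

# A hand-written replacement file in the proposed format.
replacements = json.loads('{"carr": "car", "carbboard_box": "cardboard_box"}')

cleaned = apply_replacements(["carr", "car", "carbboard_box", "dog"], replacements)
merged = merge_counts({"car": 40, "carr": 2, "dog": 7}, replacements)
print(cleaned)
print(merged)
```

A one-shot cleanup would run this once over the stored suggestions; the recurring variant would apply the same lookup whenever a new suggestion is submitted.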
I see the stats page lists “20561 suggestions”. I’d guess that between mistakes and duplicates (spaces vs. “_”, etc.) that could be halved, and when extracting “/” combinations maybe halved again.
When making suggestions lately I’ve tried to stick to underscores and slashes, e.g. “sports_car/luxury_car” makes the blend of 2 labels clearer than “sports car/luxury car”, where a parser has to consider “(sports)(car/luxury)(car)”. But there are older suggestions with spaces.
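One way to handle the older suggestions with spaces is to treat “/” as the only hard separator and normalize each component afterwards. This is a sketch under that assumption; lowercasing is an extra choice made here, not something agreed in the thread.

```python
def normalize_label(label):
    """Trim, lowercase, and join the words of one label with underscores."""
    return "_".join(label.strip().lower().split())

def split_blend(suggestion):
    """Split a suggestion on '/' first, then normalize each component."""
    parts = (normalize_label(part) for part in suggestion.split("/"))
    return [part for part in parts if part]

old_style = split_blend("sports car/luxury car")
new_style = split_blend("sports_car/luxury_car")
print(old_style)
print(new_style)
```

Splitting on “/” before touching whitespace avoids the “(sports)(car/luxury)(car)” reading, and both spellings end up as the same pair of labels, which also removes the space vs. underscore duplicates.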