ms1vm3 dataset problem #79

nadongjin · 2022-01-30T10:33:35Z

nadongjin
Jan 30, 2022

Hello,
There are many problems with the ms1mv3 dataset.
More than a thousand identities are split into different classes even though they are the same person, as shown below.
Organizing these takes too much time.
Of course, glint360k and web600k have the same problem.
The bigger problem is that in many cases different people are included in the same class.
I'm currently cleaning this up.

In particular, the Asian data is seriously affected.
I think these problems are one of the reasons for the poor accuracy on Asian faces.

[sample]

leondgarse · 2022-01-30T15:42:31Z

leondgarse
Jan 30, 2022
Maintainer

Are you checking this by hand? That could be a huge task. I must admit, I cannot tell whether the two in your example are the same person or not, lol. One detail: judging by your folder name faces_emore, I think that's the MS1MV1 dataset, right? MS1MV3 should be ms1m-retinaface. As the label names are not randomly generated, I checked in my ms1m-retinaface-t1_112x112_folders:

Insightface has a strategy called sub-center-arcface that acts like a dataset cleaner. It is mostly like training 3 sub-classes for the same label, but it needs more resources. It performed better in my basic CASIA tests, but I never tested it further.
Also, GitHub cavalleria/cavaface once provided a so-called cleaned ms1mv3 dataset, but it worked worse than the original MS1MV3 in my multiple tests.
Anyway, cleaning the dataset more deeply could be an improvement.
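
For readers unfamiliar with the sub-center idea mentioned above, here is a minimal NumPy sketch of the core step as I understand it: each label gets K sub-centers, and the class score is the maximum cosine similarity over them, so noisy samples can drift to a non-dominant sub-center. The function name, shapes, and K=3 are illustrative assumptions, not code from Insightface or this repo, and the ArcFace margin itself is left out.

```python
import numpy as np


def sub_center_cosine(embeddings, sub_centers):
    """Max-pooled cosine similarity over K sub-centers per class.

    embeddings:  (N, D) array of face embeddings.
    sub_centers: (C, K, D) array, K sub-centers for each of C classes
                 (hypothetical layout chosen for this sketch).
    Returns an (N, C) array of class scores.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centers = sub_centers / np.linalg.norm(sub_centers, axis=2, keepdims=True)
    # Cosine similarity of every sample to every sub-center: (N, C, K).
    cos = np.einsum("nd,ckd->nck", emb, centers)
    # Keep only the best-matching sub-center for each class.
    return cos.max(axis=2)
```

In training, an angular margin would then be applied to these scores just as with a single center per class.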

5 replies

nadongjin Jan 30, 2022
Author

I think I'm going crazy : )
As you said, it was the MS1MV1 dataset.
Thank you once again.

nadongjin Jan 30, 2022
Author

I downloaded ms1m-retinaface-t1 and created the training data.
Do I have to download Insightface's MS1M-ArcFace dataset to create the evaluation dataset?

I am puzzled because there are lfw.bin / cfp_fp.bin / agedb_30.bin files in the downloaded ms1m-retinaface-t1 folder.

Thanks.

leondgarse Jan 31, 2022
Maintainer

You don't have to. Those bin files in ms1m-retinaface-t1 are the same as the ArcFace ones. Just process them with prepare_data.py -T xxx.bin.
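
If you want to peek inside one of those bin files yourself, the sketch below may help. It assumes the usual Insightface layout, a pickled tuple of (list of encoded image bytes, list of same-person flags) with two images per pair. `load_verification_bin` is a hypothetical helper written for this example, not part of prepare_data.py. Only unpickle files you trust.

```python
import pickle


def load_verification_bin(path):
    """Load an Insightface-style verification .bin file.

    Assumed layout: pickle of (bins, issame_list), where bins holds the
    encoded bytes of each image and every consecutive pair of images
    corresponds to one entry in issame_list.
    """
    with open(path, "rb") as f:
        bins, issame_list = pickle.load(f, encoding="bytes")
    if len(bins) != 2 * len(issame_list):
        raise ValueError("expected two images per same/different flag")
    return bins, issame_list
```

Each entry of `bins` would still need to be decoded with an image library before it can be fed to a model.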

nadongjin Feb 3, 2022
Author

The ms1mv3 dataset was more disastrous than ms1mv1.
In fact, I think this is why companies with SOTA accuracy have their own datasets.
As you say, it is really difficult to verify all the data by hand. ^^
[sample]

leondgarse Feb 7, 2022
Maintainer

Our new year vacation just ended. :)
Yes, you are right about these samples. I once ran a cleaning test using a trained model: deleting samples that have a small similarity within the same class, and merging classes that have a large similarity. As the trained model has accuracy > 99% on the training dataset, this may ease the issue somewhat. What do you think of this idea?
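
Roughly, the cleaning idea above could look like the following NumPy sketch. It is an illustration under assumptions, not the actual cleaning test: embeddings are taken as already extracted by the trained model, "similarity" is cosine similarity to the class mean, and the function name and both thresholds are made up for the example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def clean_by_similarity(embeddings, labels, drop_below=0.3, merge_above=0.7):
    """Return (keep_mask, merged_labels) for a labelled embedding set.

    drop_below / merge_above are hypothetical positive cosine thresholds.
    """
    embeddings = l2_normalize(np.asarray(embeddings, dtype=np.float64))
    labels = np.asarray(labels)
    classes = np.unique(labels)  # sorted unique labels
    class_index = np.searchsorted(classes, labels)
    centers = l2_normalize(
        np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    )

    # Step 1: flag samples that are far from their own class center.
    sim_to_center = np.sum(embeddings * centers[class_index], axis=1)
    keep_mask = sim_to_center >= drop_below

    # Step 2: merge classes whose centers are very similar (union-find).
    # The full C x C matrix is only feasible for small C; a real dataset
    # would need blockwise or approximate nearest-neighbour search.
    parent = np.arange(len(classes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    center_sim = np.triu(centers @ centers.T, k=1)
    for i, j in zip(*np.where(center_sim >= merge_above)):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    merged_labels = np.array([classes[find(i)] for i in class_index])
    return keep_mask, merged_labels
```

Merged classes take the smallest original label in their group. Since the centers are computed from the noisy labels themselves, running the two steps for a few rounds, or reviewing borderline cases by hand, would probably be safer than trusting a single pass.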

nadongjin · 2022-02-07T01:23:24Z

nadongjin
Feb 7, 2022
Author

Hello. According to a similarity evaluation of the ms1mv3 dataset, more than roughly 1000 classes overlap. I will have my staff work on this, and I will send it to you when it is completed. I've received a lot of help from you, and I think it is well worth it. I hope you achieve an engine with SOTA performance in the TensorFlow framework. Thanks.

1 reply

leondgarse Feb 7, 2022
Maintainer

That would be great. I may finish my cleaning test by then, for comparison.
