This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

roll our own automatic fraud prevention #360

Closed
chadwhitacre opened this issue Nov 5, 2012 · 9 comments

Comments

@chadwhitacre
Contributor

ht @exratione on their blog via hn

Note the suggestion on that HN thread that hashing the IP address could forestall privacy concerns (#345).

@chadwhitacre
Contributor Author

Counter-point re: IP hashing:

The only problem here is, IP addresses are such a small space (4 billion addresses) that it's so easy to brute-force the entire database that I don't see it offering any protection. If the data is stolen it will be cracked in no time, and if the data is subpoenaed that cost will likely be ruled as insufficiently "onerous". Even IPv6 doesn't save you, since the space is sparsely populated.

No, with IP logging it's all-or-nothing. You might as well store them as uint32/uint128.
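To make the counter-point concrete, here is a minimal sketch of the brute-force attack it describes. It assumes the stored value is an unsalted SHA-256 of the dotted-quad string, which is only a guess at what "hashing the IP address" would mean here; `hash_ip` and `crack` are hypothetical names, and the example searches a /24 rather than the whole space so it finishes instantly.

```python
import hashlib
import ipaddress


def hash_ip(ip: str) -> str:
    """Unsalted SHA-256 of the dotted-quad string (assumed storage scheme)."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()


def crack(target_hash: str, network: str):
    """Hash every address in `network` until one matches `target_hash`."""
    for addr in ipaddress.ip_network(network):
        candidate = str(addr)
        if hash_ip(candidate) == target_hash:
            return candidate
    return None  # the address is not in this network


# 203.0.113.0/24 is a documentation-only range, used here as a stand-in.
stored = hash_ip("203.0.113.77")
print(crack(stored, "203.0.113.0/24"))  # recovers the original address
print(2 ** 32)                          # size of the full IPv4 search space
```

The same loop over `0.0.0.0/0` covers every IPv4 address, which is the "4 billion addresses" figure quoted above; a fast hash makes that search cheap, so the hash adds little beyond storing the raw integer.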

@chadwhitacre
Contributor Author

From @sigmavirus24 on #354:

What about creating some kind of metric as to how suspicious someone is? e.g.,

delpan has 0 repositories, has 0 events, has 0 forks, has 0 followers, follows 0 people and created his account on Gittip almost immediately after creating the GitHub account, therefore he is 99% (or 100%, whatever) suspicious.

However, @kennethreitz has a crazy amount of repositories, events, forks, followers and followees, and has had his GitHub account for probably 2 years before making a Gittip account. He's 1% suspicious. Then set a threshold, however high or low, for is_suspicious to be True or False. This can float back and forth and update (ideally) long before payouts, say the night before. Most of this information can be found via the user endpoint on the GitHub API. Then again, instead of crafting these requests yourself, you could use one of the awesome Python wrappers for the GitHub API. @copitux and I both have wrappers; it just depends on which style you prefer. (py-github3 is @copitux's iirc and mine is github3.py)
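A rough sketch of what such a metric might look like, using the signals listed in the quote. Everything here is an assumption for illustration: the `GitHubStats` container, the equal weighting of signals, the one-day cutoff for "almost immediately", and the 0.5 default threshold are not from the thread.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class GitHubStats:
    """Hypothetical container for the signals named in the quote."""
    public_repos: int
    events: int
    forks: int
    followers: int
    following: int
    github_created: datetime
    gittip_created: datetime


def suspicion_score(s: GitHubStats) -> float:
    """Return a value in [0, 1]; higher means more suspicious.

    Each signal counts equally, which is an arbitrary choice made
    only to keep the sketch short.
    """
    signals = [
        s.public_repos == 0,
        s.events == 0,
        s.forks == 0,
        s.followers == 0,
        s.following == 0,
        # Gittip account created less than a day after the GitHub account.
        (s.gittip_created - s.github_created).days < 1,
    ]
    return sum(signals) / len(signals)


def is_suspicious(s: GitHubStats, threshold: float = 0.5) -> bool:
    """Apply the (assumed) threshold to the score."""
    return suspicion_score(s) >= threshold
```

An empty account that signs up on the same day trips every signal and scores 1.0; a long-standing account with activity scores 0.0. Real weights would need tuning against known-good and known-bad accounts.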

@sigmavirus24
Contributor

I assume you meant here instead of #60 ;).

But yeah based off of @dangerdave's blog post that you posted elsewhere, I would say just store the IP addresses if you're so inclined. I'm not a fan in general, but I use other services that store data I object to them keeping.

And like I said, it's really a simple API call. You might want to upgrade the version of requests you're using so that you can just use the .json attribute on the Response object though (also github3.py requires 0.14.1, not sure about py-github3). It would be a fairly simple call to the API.

Like I said, accessing the data is pretty simple: it would be a simple GET /users/username. The information I list above isn't all one request though, which could be a problem/bottleneck. So from /users/username we could pull the number of public repositories (although they're giving you OAuth access, so maybe you could also find out the number of private repos they have; not sure off the top of my head if this is plausible). You would also have the number of people they follow and the number of people following them, e.g., /users/kennethreitz would have the values 60, 200, 2286 (public repos, following, followers) respectively.
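For reference, a minimal sketch of that single profile request using only the standard library, assuming the unauthenticated `GET /users/<username>` endpoint on `api.github.com`. The helper names `fetch_user` and `profile_stats` and the `User-Agent` string are made up for this example; `public_repos`, `following`, `followers` and `created_at` are fields of the user object the API returns.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_user(username: str) -> dict:
    """GET /users/<username> and decode the JSON body (unauthenticated)."""
    req = urllib.request.Request(
        f"{API_ROOT}/users/{username}",
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "fraud-check-sketch",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def profile_stats(user: dict) -> dict:
    """Pick out the counters discussed above from a user payload."""
    return {
        "public_repos": user.get("public_repos", 0),
        "following": user.get("following", 0),
        "followers": user.get("followers", 0),
        "created_at": user.get("created_at"),
    }
```

Calling `profile_stats(fetch_user("kennethreitz"))` would cost one request. Unauthenticated calls are rate limited, so a nightly batch over all users would probably want the OAuth token mentioned above.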

For their number of forks, you would have to unfortunately iterate over /users/username/repos which can be several pages long (exactly 2 in Kenneth Reitz's case). This would tell you if it's a fork or not. If you wanted to use the ratio of forks to non-forks, you might also want to look at the created_at information. Forks will tell you the date they were forked and non-forks the day the person made the repo on the site. So this request (depending on the user) would use up quite a few requests alone (# of public repos / 30).
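The request count and the fork tally can be sketched as below, assuming the repo list has already been downloaded page by page and concatenated. `pages_needed` and `fork_summary` are hypothetical helpers; the 30-per-page figure comes from the paragraph above, and `fork` and `created_at` are fields on each repository object.

```python
import math

PER_PAGE = 30  # page size quoted in the thread


def pages_needed(public_repos: int, per_page: int = PER_PAGE) -> int:
    """Number of /users/<username>/repos requests needed to see every repo."""
    return math.ceil(public_repos / per_page)


def fork_summary(repos: list) -> dict:
    """Summarise forks vs. originals from a list of repository payloads."""
    forks = sum(1 for r in repos if r.get("fork"))
    originals = len(repos) - forks
    return {
        "forks": forks,
        "originals": originals,
        "fork_ratio": forks / len(repos) if repos else 0.0,
        # Distinct calendar days on which repos were created or forked;
        # a single day for many repos may be worth a closer look.
        "creation_days": sorted({r["created_at"][:10] for r in repos}),
    }


print(pages_needed(60))  # 2, matching the two pages mentioned above
```

A user with 60 public repos needs two requests, which lines up with the Kenneth Reitz example. The `creation_days` list is one cheap way to spot a batch of repos that all appeared on the same day.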

And finally, events would only need one request (or as many as you'd like). You could gauge dates and whatnot. Then again, if you see everything is from one day, you can't draw any conclusions from that. Why? Because you have some uber-active users whose entire first 30+ events were done that day. Events include issue comments, code comments (review comments), repo creation, following a user, starring a repository, etc. You already have the repo creation dates from iterating over the repositories they have, so that wouldn't be of interest. The others don't imply anything other than that they know how to use an API to create random comments on each other's repositories. I guess just checking for the existence of an event would be ok. This is easy to game though and should have a very low weight.

This on its own, though, is (as I explain more of how it would work) insufficient.

If they had a large enough number of accounts here, they could easily follow each other, which would artificially boost their followers/following numbers. They could create their own repos and fork each other's, but then you'd see that it was all done on the same day, so avoiding that would take more work/time/planning on their part.

So this could certainly be used in conjunction with something else, but I'm not sure what. Another thing that would be easy to game (but that only vain humans do) is setting their avatar_url and their real name. Again, this would not work, especially since I only recently added my real name to my GitHub account and there may exist users who were just too lazy to do so.

@sigmavirus24
Contributor

@shawndavenport for any accounts that are detected/suspected to be fraudulent, how would @whit537 go about reporting them to you guys? Perhaps automate an email to [email protected]?

@lyndsysimon
Contributor

We could talk to the guys at Work for Pie on how to traverse and score the Github API, or perhaps even integrate their score (or a portion thereof).

From a bigger picture though, Gittip isn't limited to developers. No matter how fancy we get in processing info from Github, we're not going to be able to use it to score someone who is, for instance, primarily a political activist/journalist.

@sigmavirus24
Contributor

Their scoring takes a short while and mainly goes over the repos to use stargazers and forks as a means of scoring your code skills.

And yeah, I never claimed this would be good for all GitTippers, only those with GitHub accounts. Considering you can register via Twitter, this would fail for those users. And since it seems like there is going to be a modicum of human review, activists and journalists would be caught. My scoring is just to give @whit537 a way to tackle the seemingly most/least suspicious users first, instead of just going user to user.

@shawndavenport

@sigmavirus24, hopefully the volume will be low, so for the time being feel free to email me directly.

@sigmavirus24
Contributor

Awesome. Thanks @shawndavenport

@chadwhitacre
Contributor Author

Closing as stale.
