Reputation based anti-spam
- By: Qwaider
- On: Tuesday, November 25, 2008 8:23:22 PM
- In: Science & Technology
- Viewed: (5195) times
Rated 4.6/5 stars (84 votes cast)
As a new addition to the anti-spam system that I use exclusively on this blog, I decided to incorporate a reputation-based component. It looks deeper into the reputation of the person submitting a comment before deciding whether that comment is more "likely" to be good or bad, then assigns it a score.
The idea is really simple: fend off attacks by relying on social engineering techniques. Here's how it works:
The user submits a comment.
The system evaluates the comment and the IP address and decides whether it's spam or not. (This part is old; the only change is that each stage now gets a "score".)
The system looks at the identity of the person who commented. The identity is a union of the user's name, email, website, and IP address, plus the history of names and IP addresses. THIS is where it gets interesting.
By looking back at the history of an IP and weighing the good against the bad, we can tell whether that IP is more likely to be generating spam. That's part one.
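One way to picture "weighing the good vs the bad" for an IP is a smoothed ratio of bad comments to all comments. The sketch below is only an illustration: the function name and the add-one smoothing are assumptions, not the formula this blog actually uses.

```python
def ip_spam_probability(good: int, bad: int) -> float:
    """Estimate how likely the next comment from an IP is spam.

    Assumption: add-one (Laplace) smoothing, so an IP with no history
    scores a neutral 0.5 instead of causing a division by zero.
    """
    if good < 0 or bad < 0:
        raise ValueError("counts must be non-negative")
    return (bad + 1) / (good + bad + 2)
```

With this rule, an IP with eight good comments and no bad ones comes out low, an IP with eight bad comments and no good ones comes out high, and an unseen IP sits in the middle.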
Update:
For a limited time, you can see your comment's SpamScore when you comment here. Red=Bad, Green=Good. Unfortunately, I don't have these values for all the comments around here, so apologies if you don't see it.
Now, by looking at the name, email, website, and IP historically, we get to know this "identity" and can deal with it as such. More specifically, by looking at the number of valid comments with those parameters, we can add a probability score for a specific comment being spam or not.
Of course, the greater the number of valid comments, the better the overall results for that specific identity. However, this requires the user to have commented before. But I guess that goes without saying, since we started this whole article with the word "reputation". In other words, history.
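The identity lookup described above could be sketched roughly like this. Everything here is hypothetical: the `Comment` record, the normalization in `identity_key`, and the `half_point` parameter are assumptions made for illustration, not the blog's real data model or scoring.

```python
from collections import Counter
from typing import NamedTuple


class Comment(NamedTuple):
    name: str
    email: str
    website: str
    ip: str
    valid: bool  # True if the comment was accepted as not spam


def identity_key(c: Comment) -> tuple:
    # Assumption: normalize case and whitespace so small variations still match.
    return (c.name.strip().lower(), c.email.strip().lower(), c.website.strip().lower())


def valid_counts(history):
    """Count valid comments per identity, and per identity plus IP."""
    by_identity = Counter()
    by_identity_ip = Counter()
    for c in history:
        if c.valid:
            key = identity_key(c)
            by_identity[key] += 1
            by_identity_ip[key + (c.ip,)] += 1
    return by_identity, by_identity_ip


def identity_trust(valid_comments: int, half_point: int = 5) -> float:
    """Map a valid-comment count to a 0..1 trust value that grows with history.

    half_point (hypothetical) is the count at which trust reaches 0.5.
    """
    return valid_comments / (valid_comments + half_point)
```

A brand-new identity gets a trust of 0, and the value climbs toward 1 as valid comments accumulate, which matches the point that reputation needs history.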
The question that arises here is: what if spammers fake specific, previously known identities? Wouldn't that sidetrack this whole mechanism or, even worse, let them get a better spam score than they would have without it?
Very good and valid question. In fact, this is one side that I considered. The element I chose to make an identity distinct is the email. The email is never communicated or disclosed. It's not a big secret, but it's STILL too much work for a spammer to figure out the name, email AND website a specific user comments with. It is possible, though, and therefore EVEN with well-known identities, other measures will contribute additional spam values that outweigh such an attack.
So the attack surface is: new users with little or no history, and an attacker who knows the combination of name, email and website of a user with good scores (like, say, Hani Obaid on this blog).
The hard part to figure out is how to translate the spam score into an actionable item.
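One common way to turn a score into an action is a pair of thresholds with a moderation band in between. This is only a sketch of that general idea: it assumes the score is normalized to the range 0 to 1, and the threshold values are made up, not tuned against this blog's data.

```python
def action_for(spam_score: float,
               approve_below: float = 0.3,
               reject_above: float = 0.8) -> str:
    """Map a spam score in [0, 1] to an action.

    approve_below and reject_above are hypothetical thresholds.
    Anything between them is held for a human to review.
    """
    if not 0.0 <= spam_score <= 1.0:
        raise ValueError("spam_score must be between 0 and 1")
    if spam_score < approve_below:
        return "publish"
    if spam_score > reject_above:
        return "reject"
    return "moderate"
```

The middle band keeps the uncertain cases away from automatic decisions, which suits a system whose scoring is still being worked out.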
Thoughts? Issues? Concerns? Criticism? Let me know what you think. Oh, and ask me about your scores if you're interested :)
Memories....
(thanks for testing :))
Besides, the system is still in early beta :)
This is a very interesting field. You might wanna check out "collaborative filtering" or "recommender algorithms",
or the forums of www.netflixprize.com. If you have time, you could win a million dollars!
let's see ......
I think using the IP address along with the name/email/website info is a great idea because it will protect bloggers from each other. While it is simple for bloggers to impersonate other bloggers (we know each other's info from our blogs' comments), it is considerably harder to spoof the *correct* IP address without easy detection.
I left 5 comments (Uno -> Cinco). Two of them got caught for moderation. However, clearing the site cookies between comments fixed the problem easily. I also noticed that all the scores I got were pretty similar, which I think is a good sign that your spam checker is working correctly.
Additionally, I got a good score when I commented as Summer (Cinco), and I think that is attributable to the change of IP address.
So all in all, it seems that the IP address info trumps everything else in your system. I think this is a great thing. However, I wonder what would happen to legitimate people who use a compromised IP address.
Great job on this system!
I already made sure that my registered users don't face such an issue (no one can comment with the name Maioush, for example, on this blog), even if they use the right user/email/website combination.
However, Summer (in particular) opted out of this (she used to be protected by this system, but didn't like logging in).
Anyway, I have the system in "observation" mode right now. It gives me recommendations ONLY, without taking any action, because frankly, I don't know how to score things yet. I mean, what does it mean? How can I give positive marks for certain things and negative marks for others, then have the stuff cancel out for good users in bad networks? :)
What you didn't know is that behind the single score you're seeing there are 4 others, plus additional details that I'm capturing: one for the combination IP, name, email, website (valid comments); one for the combination name, email, website; another for name, email, website (valid comments); and total comments. These all participate in the spam score, but there are almost 28 other factors at play in the spam score :(
Thanks for testing Za3tar, I really appreciate it.
I have to admit, I am getting sick of the Captchas and will be counting down to the day I can remove them with my mind at ease, and you are demonstrating that that day is drawing ever closer.
What people don't seem to realize is that some of the world's most brilliant minds are working to break these rudimentary methods of false security, and I have no doubt in my mind that they will succeed. For god's sake, we landed a man on the moon with processing power that wouldn't power a modern entry-level cellphone.
Thanks again Za3tar, you always help me keep my brain cells busy. Like today, I saw a greater need to implement my new volume-based anti-spam mechanism, which I'm only thinking about so far. It goes something like this: if a commenter sends a few comments in a row, this might mean that he's the type of commenter who is probing the system. Therefore, there needs to be an upper limit on the number of comments from a [user, ip] tuple, from an IP, and from a user, throttled within a specific period of time.
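Since this mechanism is still only an idea, here is one rough sketch of how such a throttle might look, using a sliding time window per key. The class name, the window length, and the three limits are all hypothetical choices, not anything implemented on this blog.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class CommentThrottle:
    """Sliding-window limit per [user, ip] tuple, per user, and per IP."""

    def __init__(self, window_seconds: float = 60.0,
                 pair_limit: int = 3, user_limit: int = 5, ip_limit: int = 10):
        self.window_seconds = window_seconds
        self.pair_limit = pair_limit
        self.user_limit = user_limit
        self.ip_limit = ip_limit
        self._seen = defaultdict(deque)  # key -> timestamps of accepted comments

    def allow(self, user: str, ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        checks = [
            (("pair", user, ip), self.pair_limit),
            (("user", user), self.user_limit),
            (("ip", ip), self.ip_limit),
        ]
        # Forget timestamps that have fallen out of the window.
        for key, _ in checks:
            q = self._seen[key]
            while q and now - q[0] >= self.window_seconds:
                q.popleft()
        # Refuse if any one of the three limits is already reached.
        if any(len(self._seen[key]) >= limit for key, limit in checks):
            return False
        # Record the accepted comment under all three keys.
        for key, _ in checks:
            self._seen[key].append(now)
        return True
```

Giving the [user, ip] tuple the tightest limit means one person hammering from one address is slowed first, while the looser per-IP limit leaves room for several legitimate users behind a shared network.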