Anti-spam
Fri, 8th September 2006
As seen on WPPlugins.org.
Ahhh… the “war on spam”… will it never end?
I think not, but here’s my contribution to the battle. I have a theory that what makes a spam message is repetitiveness. My plugin calculates how repetitive a message is and marks it as ‘spam’ if the repetitiveness is too high. It cannot and will not catch all spam, but it covers those messages that contain long lists of links.
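To make the theory concrete, here is a minimal sketch of a compressibility check. It is not the plugin's code: it is written in Python rather than PHP, it assumes zlib as the compressor, and the names `compressibility` and `looks_like_spam` are made up for illustration. The default threshold of 5 is the one mentioned for α3 in the comments below.

```python
import zlib


def compressibility(text: str) -> float:
    """Raw size divided by compressed size; higher means more repetitive."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    # Assumption: zlib at maximum compression stands in for whatever
    # compressor the real plugin uses.
    return len(raw) / len(zlib.compress(raw, 9))


def looks_like_spam(text: str, threshold: float = 5.0) -> bool:
    """Flag a message whose compression factor exceeds the threshold."""
    return compressibility(text) > threshold


if __name__ == "__main__":
    link_list = "buy cheap pills http://example.com/pills " * 50
    print(round(compressibility(link_list), 1), looks_like_spam(link_list))
```

Short messages barely compress at all because of the compressor's fixed overhead, so a check like this only bites on long, repetitive text such as lists of links.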
There are other spam-combating plugins, for instance Bad Behavior, Spam Karma 2 and Akismet. I don’t know if they work as I have never used them, nor is my plugin intended to replace any of them, but on examination of their code I found that none of them (perhaps excepting Akismet, which is a web service) attempt a compressibility analysis of the text. So, I started with the Akismet source code, denatured it, added my own analysis code and created the “Anti-spam” plugin. It should play nice with the others, but I have not tested that, nor do I intend to. I do hope the other, better anti-spam plugin authors can adapt, adopt or improve upon my compression theory.
With thanks to Matt Mullenweg, I present for download Libertus’ Anti-spam (Compression) Plugin for WordPress. Enjoy, and please let me know how you get on. It works exactly like Akismet, but doesn’t sound like Akismet, and it doesn’t use the Akismet web service.
Libertus said: December 11th, 2006 at 21:26
Evolution
Several months, several spam attacks and three plugin revisions have passed.
In α3 I made the compression threshold for spam configurable, defaulting at 5. My personal setting started at 5 and slowly dropped to 2 until lately even that was being evaded by 209.160.65.92. I also improved the statistical display.
In α4 the compression analysis was improved to push my American friend over the 2 threshold. Meanwhile, a rare sincere comment was received and passed through the moderation queue nicely. I also added a means to analyse individual comments on demand.
As of now, on this blog, the comment spammer tripper counter is 6,112 and the spam filter has caught 220 comments but let through 75.
Libertus said: December 16th, 2006 at 09:42
Survival Pressures
Slight mistake in α4 - I missed out underscore from the list of spammy punctuation. Not that it matters much, as the latest spam came in at 1.7 and 1.8 anyway. It is tempting to pursue perfection but what I now have is good enough. I don’t mind block-moderating once in a while.
Latest Activity: The comment spammer tripper count is 6,763 and the spam filter has caught 368 spam comments but let through 82.
I don’t read the wp-hackers mailing list these days, so I missed this thread.
Libertus said: December 22nd, 2006 at 13:44
Playing
Latest Activity: The comment spammer tripper count is 7,291 and the spam filter has caught 633 spam comments.
I played around making a compression plugin for Spam Karma 2, which was easy to get working, but is ultimately not something I will use, because I agree with Matt Mullenweg, who said:
Writing the SK2 plugin forced me to decide on a weighting for my compression algorithm result, which is illogical, so I had to make a guess, which is bad. I’m a rotten guesser.
Libertus said: December 22nd, 2006 at 16:29
Analysing
I discovered that MySQL has a COMPRESS function, which made me think, and all I can say is “Hmmm…”! What would you say if you took:
To avoid permission issues, I ran the query against an old local snapshot of the blog database and I got these results.
Hmmm…
Libertus said: December 22nd, 2006 at 17:28
Intelligent Adaptation
I’m always happy to take advantage of changes in the environment and hitherto unknown environmental features. Using MySQL 5 affords me views, and the COMPRESS function was sitting there all along. Combining the two enables an irresistible global statistical analysis of the compressibility of the comment database, to replace the broken statistics I have in the plugin. It is α code, after all.
An irresistible evolution must still be treated with care and professionalism. Will the result be an improvement? If so, will the benefit outweigh all the costs?
My primary concern is the heaviness of the analytical query. One immediate optimisation is that the results of any global analysis can only change with the comments database, so I shall cache the results against the DEFAULT(comment_ID) and present an “Update available” button on a cache miss. The user may then choose to update the statistics, which would probably be a good time to prune old spam and OPTIMIZE TABLE wp_comments, rather than at random.
My secondary concern is that of accuracy. In my world, registered users cannot post spam and may repeat themselves as much as they wish, so the results for non-spam may not include comments by logged-in users, and a logged-in user’s comment marked as spam is spam.
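The caching plan for the heavy query might look roughly like this. It is a sketch in Python, not plugin code, and it assumes the cache key is the highest comment ID at the time the statistics were computed, which is one reading of the plan rather than something stated outright. `StatsCache`, `lookup` and `update` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CachedStats:
    comment_id: int  # assumed key: highest comment ID when stats were computed
    stats: dict


class StatsCache:
    """Holds the result of an expensive analysis until the comments change."""

    def __init__(self, compute: Callable[[], dict]) -> None:
        self._compute = compute  # the heavy analytical query
        self._cached: Optional[CachedStats] = None

    def lookup(self, latest_comment_id: int) -> Optional[dict]:
        """Return cached stats, or None on a miss ("Update available")."""
        cached = self._cached
        if cached is not None and cached.comment_id == latest_comment_id:
            return cached.stats
        return None

    def update(self, latest_comment_id: int) -> dict:
        """Run the heavy query on demand and remember its result."""
        stats = self._compute()
        self._cached = CachedStats(latest_comment_id, stats)
        return stats
```

One caveat of keying on the latest comment ID alone: pruning old spam changes the data without changing the key, so a prune would need to be followed by an explicit `update`.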
Finally, should I link the operation of the spam filter to the results of the statistical analysis, or leave the plugin as-is and responsive solely to the threshold set by the user?
Hmmm…
Libertus said: December 23rd, 2006 at 12:59
Spam I Am. What Am I?
I don’t want to store spam. Why should I? The only thing useful about spam is the knowledge that it is spam. It is of statistical value only. Therefore, I want to store some spam in the database for statistical purposes, but not all, especially not those messages which are copies of previously held spam.
Spam about particular topics seems to come in waves. There are bursts of similar messages, not always from the same source, lasting usually short periods, sometimes hours, sometimes days. This makes perfect sense. Spam is a broadcast advertising method, and quite expensive, so it should naturally follow the pattern of marketing campaigns in other expensive media: maximal impact in minimal time.
I need some statistical evidence for this proposition, as any plan that involves refusing to accept a comment based on a priori knowledge should be designed to accommodate the possibility of false positives. My “comment spammer tripper” already does so, even though it is astonishingly unlikely that any sincere commenter would ever trip over it accidentally. Shoot the message, not the messenger.
Spam logically uses every particle of space available to promote a message, which in practice means the author name, URL and content fields. The author name is usually the product being promoted, the URL has to be valid to be of any impact, and both are usually made public. E-mail addresses are provided with spam as a necessity for comment acceptance, but since they are rarely if ever publicised they have limited promotional impact and are usually valid but bogus.
Some queries on my old snapshot, for sanity’s sake.
How many spam comments have e-mail addresses, and how many have not?
How many spam comments from each distinct e-mail address?
Hmmm… not enough variety…
Any relationship between the e-mail addresses and comment authors?
Hmmm… I suspect my dataset is perfectly suited to proving my point, rather than my point being proved. I could do with more spam from the archives.
How many would have evaded the compression filter?
Being conservative on the compression factor reveals that at least one of the messages in the long burst from the same e-mail address might have got through. An analysis of that message against the other spam might have caught it, so long as that message was not the vanguard.
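One possible way to analyse a message against the other spam is to ask how many extra bytes it costs to compress once the known spam has already been seen. The sketch below is an assumption about how that could be done, not anything the plugin does; it is in Python with zlib, and `novelty` is a hypothetical name.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def novelty(known_spam: str, message: str) -> float:
    """Extra bytes the message adds to the spam corpus, relative to
    its own compressed size. Values near 0 mean the message is mostly
    a repeat of spam already held; values near 1 mean it is new."""
    alone = compressed_size(message)
    # zlib only looks back 32 KB, so the corpus should stay small
    # (for example, the most recent burst of spam).
    extra = compressed_size(known_spam + "\n" + message) - compressed_size(known_spam)
    return extra / alone
```

As noted above, this cannot help with the first message of a burst, because there is nothing yet for it to resemble.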
Hmm… still not convinced there’s anything I should do. I’ll let more spam accumulate, analyse and think some more, and await inspiration.
Libertus said: December 23rd, 2006 at 13:02
Unintentional Humour
My own spam filter caught my previous comment, which is x3.1 compressible. ROFL!
Didn’t I say somewhere that logged-in users cannot post spam? It seems that my plugin does not uphold my own principles. That’s what I deservedly get for being lazy and using other people’s code.
Libertus said: January 3rd, 2007 at 19:39
Advanced Statistics
Continuing with the idea of the database being able to perform the compression analysis, I have been keeping some advanced spam statistics, just for fun.
The comment spammer tripper count is 8,476 and the spam filter has caught 850 comments, but 84 got through. Spam compressibility is 1.5≥4.9(±2.3)≥8.1 over 207 samples.
Libertus said: January 8th, 2007 at 16:59
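For readers wondering how a summary line like this could be produced: the sketch below assumes the three figures are the minimum, the mean (with standard deviation in brackets) and the maximum compression factor across the sampled spam. That reading is a guess, as is the use of the sample standard deviation, and `summarise` is a hypothetical name. It is Python, not the plugin's code.

```python
import statistics
from typing import Sequence


def summarise(ratios: Sequence[float]) -> str:
    """Format compression factors as min≥mean(±stdev)≥max over n samples.

    Assumes at least two samples, since the sample standard deviation
    is undefined for fewer.
    """
    low = min(ratios)
    high = max(ratios)
    mean = statistics.mean(ratios)
    spread = statistics.stdev(ratios)  # assumption: sample standard deviation
    return f"{low:.1f}≥{mean:.1f}(±{spread:.1f})≥{high:.1f} over {len(ratios)} samples"


if __name__ == "__main__":
    print(summarise([1.0, 2.0, 3.0]))  # prints 1.0≥2.0(±1.0)≥3.0 over 3 samples
```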
1,000 Spams And Counting
Congratulations to “mobile home insurance | noonehere@yahoo.com | http://www.planetnana.co.il/in99/67.html | IP: 62.150.40.142” for being the 1,000th spam to be caught by my devilishly simple plugin. So far, 84 messages have evaded the filter and there has been one false positive - my own.
The comment spammer tripper count is 9,082.
Libertus said: January 16th, 2007 at 14:44
Now What Exactly Is The Point Of This?
The comment spammer tripper count is 9,887 and the spam filter has caught 1,224 comments but 85 got through. Spam compressibility is 1.5≥5.5(±2.2)≥9.5 over 404 samples.
Libertus said: May 24th, 2007 at 00:20
Despite My Ignoring It, The Spam Still Comes
The comment spammer tripper count is 36,832 and the spam filter has caught 2,974 comments but 140 got through. Spam compressibility is 1.1≥3.3(±0.7)≥4.0 over 426 samples.
Libertus said: July 16th, 2007 at 18:12
This Is What It Comes Down To. So Sad.
A compression filter makes spammers less expressive.
The comment spammer tripper count is 48,610 and the spam filter has caught 4,806 spam comments but 146 got through (including the above, unsurprisingly). Spam compressibility is 1.0≥3.7(±0.5)≥10.6 over 514 samples.