Anti-spam

Fri, 8th September 2006

As seen on WPPlugins.org.

Ahhh… the “war on spam”… will it never end?

I think not, but here’s my contribution to the battle. I have a theory that what makes a spam message is repetitiveness. My plugin calculates how repetitive a message is and marks it as ‘spam’ if the repetitiveness is too high. It cannot and will not catch all spam, but it covers those messages that contain long lists of links.
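
As a minimal sketch of the repetitiveness idea (not the plugin's actual code, which is not shown here), the Python below uses zlib as a stand-in compressor and treats the ratio of raw size to compressed size as the repetitiveness score. The function names, the sample messages and the threshold of 3.0 are assumptions made purely for illustration.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return raw size divided by compressed size; higher means more repetitive."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0  # nothing to compress, so nothing repetitive
    return len(raw) / len(zlib.compress(raw))


def looks_like_spam(text: str, threshold: float = 3.0) -> bool:
    """Flag a message whose ratio exceeds the (hypothetical) threshold."""
    return compression_ratio(text) > threshold


# A long list of near-identical links compresses very well...
link_list = " ".join(
    f'<a href="http://example.com/pills/{i}">cheap pills</a>' for i in range(40)
)
# ...whereas a short, ordinary sentence barely compresses at all.
sincere = "Thanks for the plugin, it has been working nicely on my blog so far."

print(round(compression_ratio(link_list), 1), looks_like_spam(link_list))
print(round(compression_ratio(sincere), 1), looks_like_spam(sincere))
```

Very short messages can score below 1.0, because the compressor's fixed overhead outweighs any saving.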

There are other spam-combating plugins, for instance Bad Behavior, Spam Karma 2 and Akismet. I don’t know if they work as I have never used them, nor is my plugin intended to replace any of them, but on examination of their code I found that none of them (perhaps excepting Akismet, which is a web service) attempt a compressibility analysis of the text. So, I started with the Akismet source code, denatured it, added my own analysis code and created the “Anti-spam” plugin. It should play nice with the others, but I have not tested that, nor do I intend to. I do hope the other, better anti-spam plugin authors can adapt, adopt or improve upon my compression theory.

With thanks to Matt Mullenweg, I present for download Libertus’ Anti-spam (Compression) Plugin for WordPress. Enjoy, and please let me know how you get on. It works exactly like Akismet, but doesn’t sound like Akismet, and it doesn’t use the Akismet web service.

12 Responses

  1. Evolution

    Several months, several spam attacks and three plugin revisions have passed.

    In α3 I made the compression threshold for spam configurable, defaulting to 5. My personal setting started at 5 and slowly dropped to 2, until lately even that was being evaded by 209.160.65.92. I also improved the statistical display.

    In α4 the compression analysis was improved to push my American friend over the 2 threshold. Meanwhile, a rare sincere comment was received and passed through the moderation queue nicely. I also added a means to analyse individual comments on demand.

    As of now, on this blog, the comment spammer tripper counter is 6,112 and the spam filter has caught 220 comments but let through 75.

  2. Survival Pressures

    Slight mistake in α4 - I missed out underscore from the list of spammy punctuation. Not that it matters much, as the latest spam came in at 1.7 and 1.8 anyway. It is tempting to pursue perfection but what I now have is good enough. I don’t mind block-moderating once in a while.

    It cannot and will not catch all spam…

    Latest Activity: The comment spammer tripper count is 6,763 and the spam filter has caught 368 spam comments but let through 82.

    I don’t read the wp-hackers mailing list these days, so I missed this thread.

  3. Playing

    Latest Activity: The comment spammer tripper count is 7,291 and the spam filter has caught 633 spam comments.

    I played around making a compression plugin for Spam Karma 2, which was easy to get working, but is ultimately not something I will use, because I agree with Matt Mullenweg, who said:

    I don’t like tests that sit on the fence. Either call it or don’t.

    Writing the SK2 plugin forced me to decide on a weighting for my compression algorithm result, which is illogical, so I had to make a guess, which is bad. I’m a rotten guesser.

  4. Analysing

    I discovered that MySQL has a COMPRESS function, which made me think, and all I can say is “Hmmm…”! What would you say if you took:

    1. this view of your WordPress comment database
      CREATE VIEW compressed AS
      SELECT
        comment_ID AS ID,
        comment_post_ID AS post_ID,
        comment_author AS author,
        user_id,
        comment_approved AS `status`,
        -- ratio = raw length / compressed length; the 2 counts the two joining spaces
        (
        (2 + LENGTH(comment_content) + LENGTH(comment_author) + LENGTH(comment_author_url))
         /
        LENGTH(COMPRESS(CONCAT(comment_content,' ',comment_author,' ',comment_author_url)))
        ) AS ratio
      FROM wp_comments;
      
    2. this query
      SELECT
       `status`,
       COUNT(*) AS ct,
       ROUND(AVG(ratio),1) AS av,
       ROUND(MIN(ratio),1) AS mn,
       ROUND(MAX(ratio),1) AS mx,
       ROUND(STD(ratio),1) AS sd
      FROM compressed
      GROUP BY `status`
      
    3. and permission from the database administrator?

    To avoid permission issues, I ran the query against an old local snapshot of the blog database and I got these results.

    Results
    status    ct   av   mn   mx   sd
    1       1389  1.5  0.4  8.6  0.5
    spam     183  3.3  1.2  7.9  1
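
    For readers without MySQL 5 to hand, the view and the summary query can be approximated in Python. This is a sketch under assumptions: the sample rows are invented, the function names are hypothetical, and Python's zlib is used to mimic COMPRESS, which in MySQL stores a four-byte length header in front of the zlib data. `statistics.pstdev` plays the role of STD, the population standard deviation.

```python
import statistics
import zlib
from collections import defaultdict


def mysql_style_ratio(content: str, author: str, url: str) -> float:
    """Mimic the view's ratio column for one comment row."""
    joined = " ".join((content, author, url)).encode("utf-8")
    # MySQL's COMPRESS() prefixes the zlib stream with a 4-byte length header.
    compressed_len = 4 + len(zlib.compress(joined))
    return len(joined) / compressed_len


def summarise(rows):
    """Group ratios by status and return (ct, av, mn, mx, sd) per status."""
    by_status = defaultdict(list)
    for status, content, author, url in rows:
        by_status[status].append(mysql_style_ratio(content, author, url))
    return {
        status: (
            len(r),
            round(statistics.fmean(r), 1),
            round(min(r), 1),
            round(max(r), 1),
            round(statistics.pstdev(r), 1),
        )
        for status, r in by_status.items()
    }


# Invented sample rows: (comment_approved, content, author, author_url).
rows = [
    ("1", "Nice write-up, thanks for sharing the code.", "Alice", ""),
    ("1", "Does this work alongside other plugins?", "Bob", "http://example.org"),
    ("spam", "buy pills http://example.com/a " * 30, "buy pills", "http://example.com"),
]

for status, stats in summarise(rows).items():
    print(status, *stats)
```

    Exact figures may differ slightly from MySQL's, but the ordering of the groups should be comparable.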

    Hmmm…

  5. Intelligent Adaptation

    I’m always happy to take advantage of changes in the environment and hitherto unknown environmental features. Using MySQL 5 affords me views, and the COMPRESS function was sitting there all along. Combining the two enables an irresistible global statistical analysis of the compressibility of the comment database, to replace the broken statistics I have in the plugin. It is α code, after all.

    An irresistible evolution must still be treated with care and professionalism. Will the result be an improvement? If so, will the benefit outweigh all the costs?

    My primary concern is the heaviness of the analytical query. One immediate optimisation is that the results of any global analysis can only change with the comments database, so I shall cache the results against the DEFAULT(comment_ID) and present an “Update available” button on a cache miss. The user may then choose to update the statistics, which would probably be a good time to prune old spam and OPTIMIZE TABLE wp_comments, rather than at random.
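
    The caching idea can be sketched as follows. This is a hypothetical design, not the plugin's code: it assumes the cache key is some marker of the comment table's state, here the highest comment ID, and `heavy_analysis` stands in for whatever expensive query produces the statistics. The key values and the placeholder figure are invented.

```python
from typing import Callable, Optional


class StatsCache:
    """Cache analysis results against a marker of the comment table's state."""

    def __init__(self) -> None:
        self._key: Optional[int] = None
        self._stats: Optional[dict] = None

    def lookup(self, current_key: int) -> Optional[dict]:
        """Return cached stats on a hit, or None on a miss (stale or empty)."""
        if self._key == current_key:
            return self._stats
        return None

    def update(self, current_key: int, compute: Callable[[], dict]) -> dict:
        """Run the heavy analysis once and remember it for this key."""
        self._stats = compute()
        self._key = current_key
        return self._stats


calls = []


def heavy_analysis() -> dict:
    calls.append(1)  # count how often the expensive query would really run
    return {"spam_avg_ratio": 3.3}  # placeholder figure for the sketch


cache = StatsCache()
if cache.lookup(1572) is None:
    print("Update available")  # cache miss: offer the button
    cache.update(1572, heavy_analysis)
print(cache.lookup(1572))  # hit: no recomputation
print(cache.lookup(1573))  # a new comment arrived: miss again
```

    A key based only on the highest ID would not notice deleted or re-moderated comments, so a real implementation might prefer a different marker.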

    My secondary concern is that of accuracy. In my world, registered users cannot post spam and may repeat themselves as much as they wish, so the results for non-spam may not include comments by logged-in users, and a logged-in user’s comment marked as spam is spam.

    Finally, should I link the operation of the spam filter to the results of the statistical analysis, or leave the plugin as-is and responsive solely to the threshold set by the user?

    Hmmm…

  6. Spam I Am. What Am I?

    I don’t want to store spam. Why should I? The only thing useful about spam is the knowledge that it is spam. It is of statistical value only. Therefore, I want to store some spam in the database for statistical purposes, but not all, especially not those messages which are copies of previously held spam.

    Spam about particular topics seems to come in waves. There are bursts of similar messages, not always from the same source, lasting usually short periods, sometimes hours, sometimes days. This makes perfect sense. Spam is a broadcast advertising method, and quite expensive, so it should naturally follow the pattern of marketing campaigns in other expensive media: maximal impact in minimal time.
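
    One way to look for such waves is to group spam arrival times into bursts separated by quiet gaps. The sketch below is an assumption about how that evidence might be gathered, not something the plugin does; the six-hour gap, the function name and the sample timestamps are all invented.

```python
from datetime import datetime, timedelta


def group_into_bursts(timestamps, max_gap=timedelta(hours=6)):
    """Split timestamps into runs where consecutive items are within max_gap."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)  # still inside the current wave
        else:
            bursts.append([ts])  # a quiet gap: start a new wave
    return bursts


# Invented arrival times for spam comments.
arrivals = [
    datetime(2006, 11, 1, 9, 0),
    datetime(2006, 11, 1, 9, 20),
    datetime(2006, 11, 1, 11, 5),
    datetime(2006, 11, 9, 22, 0),
    datetime(2006, 11, 9, 23, 30),
]

for burst in group_into_bursts(arrivals):
    print(len(burst), "comments from", burst[0], "to", burst[-1])
```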

    I need some statistical evidence for this proposition, as any plan that involves refusing to accept a comment based on a priori knowledge should be designed to accommodate the possibility of false positives. My “comment spammer tripper” already does so, even though it is astonishingly unlikely that any sincere commenter would ever trip over it accidentally. Shoot the message, not the messenger.

    Spam logically uses every particle of space available to promote a message, which in practice means the author name, URL and content fields. The author name is usually the product being promoted, the URL has to be valid to have any impact, and both are usually made public. E-mail addresses are supplied with spam because comment acceptance requires them but, since they are rarely if ever publicised, they have limited promotional impact and are usually valid but bogus.

    Some queries on my old snapshot, for sanity’s sake.

    How many spam comments have e-mail addresses, and how many have not?

    SELECT
     IF(LENGTH(comment_author_email),'has','not') AS grp,
     COUNT(*) AS ct
    FROM wp_comments
    WHERE comment_approved='spam'
    GROUP BY grp

    Results
    grp  ct
    has  86

    How many spam comments from each distinct e-mail address?

    SELECT
     comment_author_email AS adr,
     COUNT(*) AS ct
    FROM wp_comments
    WHERE comment_approved = 'spam'
    GROUP BY adr
    ORDER BY ct DESC

    Results
    adr                        ct
    viagra@funchain.com        79
    spamming@spamming.spam      2
    fghjfhg@djhgkd.net          1
    gfhjfgh@fkjhldfghkldf.com   1
    jsmith@hotmail.com          1
    libertus@libertini.net      1
    tyhurty@dfkljghn.com        1

    Hmmm… not enough variety…

    Any relationship between the e-mail addresses and comment authors?

    SELECT
     comment_author_email AS adr,
     comment_author AS auth,
     COUNT(*) AS ct
    FROM wp_comments
    WHERE comment_approved = 'spam'
    GROUP BY adr, auth
    ORDER BY ct DESC

    Results
    adr                     auth              ct
    viagra@funchain.com     cheap levitra     5
    viagra@funchain.com     buy viagra        4
    viagra@funchain.com     generic viagra    4
    viagra@funchain.com     discount viagra   4
    viagra@funchain.com     purchase viagra   4
    viagra@funchain.com     viagra uk         4
    viagra@funchain.com     generic cialis    4
    viagra@funchain.com     order cialis      4
    viagra@funchain.com     cheap cialis      4
    viagra@funchain.com     purchase cialis   4
    spamming@spamming.spam  Spamming Spammer  2
    etc. etc.

    Hmmm… I suspect my dataset is perfectly suited to proving my point, rather than my point being proved. I could do with more spam from the archives.

    How many would have evaded the compression filter?

    SELECT
     comment_author_email AS adr,
     comment_author AS auth,
     COUNT(*) AS ct
    FROM wp_comments INNER JOIN compressed ON ID = comment_ID
    WHERE ratio < 2.3 AND comment_approved = 'spam'
    GROUP BY adr, auth
    ORDER BY ct DESC

    Results
    adr                     auth                  ct
    spamming@spamming.spam  Spamming Spammer      2
    jsmith@hotmail.com      girls sex             1
    viagra@funchain.com     order generic cialis  1

    Being conservative on the compression factor reveals that at least one of the messages in the long burst from the same e-mail address might have got through. An analysis of that message against the other spam might have caught it, so long as that message was not the vanguard.
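
    Comparing a new message against spam already held could also be done with compression: if appending the message to the known spam adds very few compressed bytes, it is largely a repeat. The sketch below illustrates that swapped-in technique with zlib. It is not the plugin's method, and the corpus, the messages and the function name are invented for the example.

```python
import zlib


def extra_bytes(known_spam: str, message: str) -> int:
    """Compressed bytes the message adds on top of the known spam."""
    base = len(zlib.compress(known_spam.encode("utf-8")))
    both = len(zlib.compress((known_spam + "\n" + message).encode("utf-8")))
    return both - base


# Invented examples of previously held spam.
known_spam = "\n".join([
    "buy pills online http://example.com/buy-pills cheap pills",
    "discount pills http://example.com/discount-pills order pills now",
    "generic pills http://example.com/generic-pills purchase pills today",
])

repeat = "discount pills http://example.com/discount-pills order pills now"
sincere = "I ran into the same problem after upgrading; your fix worked."

print("repeat adds", extra_bytes(known_spam, repeat), "bytes")
print("sincere adds", extra_bytes(known_spam, sincere), "bytes")
```

    As noted above, the first message of a burst has nothing to be compared against, and zlib only looks back over a 32 KB window, so the held corpus would need to stay small.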

    Hmm… still not convinced there’s anything I should do. I’ll let more spam accumulate, analyse and think some more, and await inspiration.

  7. Unintentional Humour

    My own spam filter caught my previous comment, which is x3.1 compressible. ROFL!

    Didn’t I say somewhere that logged-in users cannot post spam? It seems that my plugin does not uphold my own principles. That’s what I deservedly get for being lazy and using other people’s code.

  8. Advanced Statistics

    Continuing with the idea of the database being able to perform the compression analysis, I have been keeping some advanced spam statistics, just for fun.

    The comment spammer tripper count is 8,476 and the spam filter has caught 850 comments, but 84 got through. Spam compressibility is 1.5≥4.9(±2.3)≥8.1 over 207 samples.

  9. 1,000 Spams And Counting

    Congratulations to “mobile home insurance | noonehere@yahoo.com | http://www.planetnana.co.il/in99/67.html | IP: 62.150.40.142” for being the 1,000th spam to be caught by my devilishly simple plugin. So far, 84 messages have evaded the filter and there has been one false positive - my own.

    The comment spammer tripper count is 9,082.

  10. Now What Exactly Is The Point Of This?

    Name: Anonymous | E-mail: jsmith@hotmail.com | IP: 85.21.96.239 | Date: Tue, 16th January 2007

    <a href=""></a>
    [URL=][/URL]

    The comment spammer tripper count is 9,887 and the spam filter has caught 1,224 comments but 85 got through. Spam compressibility is 1.5≥5.5(±2.2)≥9.5 over 404 samples.

  11. Despite My Ignoring It, The Spam Still Comes

    The comment spammer tripper count is 36,832 and the spam filter has caught 2,974 comments but 140 got through. Spam compressibility is 1.1≥3.3(±0.7)≥4.0 over 426 samples.

  12. This Is What It Comes Down To. So Sad.

    A compression filter makes spammers less expressive.

    Name: author | E-mail: 76c243cd@yahoo.com | URI: http://77ec52bc.com | IP: 222.151.211.93 | Date: Mon, 16th July 2007

    9aa51c3c http://2b4940a5.com <a href='http://cf754c43.com'>bb7a9a7c</a> [url]http://23c316f2.com[/url] [url=http://9981a361.com]76144369[/url]

    The comment spammer tripper count is 48,610 and the spam filter has caught 4,806 spam comments but 146 got through (including the above, unsurprisingly). Spam compressibility is 1.0≥3.7(±0.5)≥10.6 over 514 samples.
