/activityspam

Bayesian spam filter for activitystrea.ms data

Primary LanguageJavaScriptApache License 2.0Apache-2.0

This is an experimental server for filtering Activity Streams (http://activitystrea.ms/) data for spam.

Apache 2.0 license.

More or less copying "a plan for spam" filtering, but make pseudo-tokens for activity streams fields. So something like:

     { id: "urn:uuid:7e4ed55a-2b99-48c8-a274-42819b2ddd39",
       url: "http://example.net/status/35",
       published: "2011-09-23T10:49:00Z",
       actor: { displayName: "John Smith",
       	      	id: "urn:uuid:bff0ecdd-a944-4d92-aed3-d6af8f13d610",
		url: "http://example.net/status/johnsmith" },
       verb: "post",
       object: { id: "urn:uuid:81e43564-c66f-40c5-878b-733275229521",
       	       	 type: "note",
		 content: "<a href='http://example.com/viagra-spam'>Buy Viagra Now!</a>" } }

Would tokenize as:

      id=urn:uuid:7e4ed55a-2b99-48c8-a274-42819b2ddd39
      url=http://example.net/status/35
      published=2011-09-23T10:49:00Z
      actor.displayName=John-Smith
      actor.id=urn:uuid:bff0ecdd-a944-4d92-aed3-d6af8f13d610
      actor.url=http://example.net/status/johnsmith
      verb=post
      object.id=urn:uuid:81e43564-c66f-40c5-878b-733275229521
      object.type=note
      a
      href
      http://example.com/viagra-spam
      Buy
      Viagra
      Now
      a

There may be some value in grabbing the domains of URLs (example.com and example.net here).