SpamBayesIntegration

Fighting Spam with SpamBayes

On roundup instances where anyone can create an account, spam easily becomes a problem. This customization example shows one way to deal with this by integrating with "SpamBayes":http://spambayes.sf.net, a statistical anti-spam filter.

Requirements

You need access to a SpamBayes XMLRPC server, version 1.1a4 or later. Install the SpamBayes server according to the documentation on http://spambayes.sf.net, and then run it, loading the XMLRPC module. http://mail.python.org/pipermail/tracker-discuss/2007-June/000930.html has some details, although in its core_server.py command-line example you need to replace "-m" with "-P", so that the command line looks like this::

   BAYESCUSTOMIZE=$SBDIR/bayescustomize.ini core_server.py -P XMLRPCPlugin

Theory of Operation

An auditor is added that fires on 'set' and 'create' actions for the 'file' and 'msg' classes. The auditor contacts the SpamBayes server via XMLRPC, submits the content of the new file or msg instance together with some extra tokens created from msg/file metadata, and gets a score back. This score is stored as a property ('spambayes_score') on the msg/file instance. Another property, 'spambayes_misclassified', is set to False if the msg/file was successfully scored (i.e., if there was no communication error or similar). Otherwise it is set to True, so that an administrator can search for msg/file instances that have not been classified.
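
The scoring step can be sketched roughly as follows. This is a minimal illustration, not the code shipped in detectors/spambayes.py: the XMLRPC method name `score`, its argument layout, the helper names and the `-1` placeholder score are all assumptions made for the example.

```python
import xmlrpc.client


def score_content(server, content, tokens):
    """Ask a SpamBayes XMLRPC proxy for a spam probability.

    Returns (True, score) on success or (False, error_text) on failure.
    The 'score' method and its arguments are assumed, not confirmed.
    """
    try:
        return (True, server.score({'content': content}, tokens, {}))
    except (OSError, xmlrpc.client.Error) as exc:
        # Network trouble or an XMLRPC fault: report it instead of raising.
        return (False, str(exc))


def spam_properties(server, content, tokens):
    """Build the two property values the auditor would store."""
    ok, result = score_content(server, content, tokens)
    if ok:
        return {'spambayes_score': result,
                'spambayes_misclassified': False}
    # Scoring failed: flag the item so an administrator can find it later.
    # The -1 placeholder is a hypothetical choice for this sketch.
    return {'spambayes_score': -1,
            'spambayes_misclassified': True}
```

In a real auditor, `server` would be something like `xmlrpc.client.ServerProxy(uri)` with the uri taken from detectors/config.ini, and the returned dictionary would be merged into the new values of the msg or file being created.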

Roundup's security system is configured to disallow viewing of the 'content' and 'summary' properties of file and msg class instances for anonymous users (this is configurable, of course), to make sure that the Roundup instance can't be used to boost search results for whatever uninteresting content the spammer tries to add. It is also configured to allow users with a special role (Coordinator, in my schema) to classify messages as spam or non-spam (ham) by pressing a button in Roundup. This way, SpamBayes can be trained on your type of data.

Get the Code

Begin by checking out https://github.com/psf/bpo-tracker-cpython.git:

    git clone https://github.com/psf/bpo-tracker-cpython.git

This gives you two python files: detectors/spambayes.py and extensions/spambayes.py (they are attached as detectors_spambayes.py and extensions_spambayes.py). The former is the auditor which scores msg and file instances when they are created. The latter is an extension for doing the classification from the web interface.

Symlink these two files into your instance's detectors and extensions directories:

    cd /home/of/my/tracker
    ln -s /path/to/spambayes_integration/detectors/spambayes.py detectors/spambayes.py
    ln -s /path/to/spambayes_integration/extensions/spambayes.py extensions/spambayes.py

Copy /path/to/spambayes_integration/detectors/config.ini.template (attached as config.ini.template) to detectors/config.ini, and adjust the uri to point to your SpamBayes server. Adjust the spam_cutoff value as well, if needed.

Modify Schema

The schema is modified by adding two properties to each of the 'file' and 'msg' classes. If your schema is based on the classic template, here are your new 'file' and 'msg' definitions:

     msg = FileClass(db, "msg",
                     author=Link("user", do_journal='no'),
                     recipients=Multilink("user", do_journal='no'),
                     date=Date(),
                     summary=String(),
                     files=Multilink("file"),
                     messageid=String(),
                     inreplyto=String(),
                     spambayes_score=Number(),
                     spambayes_misclassified=Boolean(),)
     
     file = FileClass(db, "file",
                     name=String(),
                     spambayes_score=Number(),
                     spambayes_misclassified=Boolean(),)

Modify Templates

Now modify your html templates. You need to modify 'html/msg.item.html', 'html/file.item.html' and 'html/issue.item.html'. Diff for 'msg.item.html' from classic template:

    Index: msg.item.html
    ===================================================================
    --- msg.item.html   (revision 56578)
    +++ msg.item.html   (working copy)
    @@ -48,12 +48,45 @@
      <th i18n:translate="">Date</th>
      <td tal:content="context/date"></td>
     </tr>
    +
    + <tr>
    +  <th i18n:translate="">SpamBayes Score</th>
    +  <td tal:content="structure context/spambayes_score/plain"></td>
    + </tr>
    +
    + <tr>
    +  <th i18n:translate="">Marked as misclassified</th>
    +  <td tal:content="structure context/spambayes_misclassified/plain"></td>
    + </tr>
    +
     </table>
     
    +<p tal:condition="python:utils.sb_is_spam(context)" class="error-message">
    +   Message has been classified as spam</p>
    +
     <table class="messages">
      <tr><th colspan=2 class="header" i18n:translate="">Content</th></tr>
    +   <th class="header" tal:condition="python:request.user.hasPermission('SB: May Classify')">
    +     <form method="POST" onSubmit="return submit_once()"
    +       enctype="multipart/form-data"
    +       tal:attributes="action context/designator">
    + 
    +      <input type="hidden" name="@action" value="spambayes_classify">
    +      <input type="submit" name="trainspam" value="Mark as SPAM" i18n:attributes="value">
    +      <input type="submit" name="trainham" value="Mark as HAM (not SPAM)" i18n:attributes="value">
    +     </form>
    +   </th>
      <tr>
    -  <td class="content" colspan=2><pre tal:content="structure context/content/hyperlinked"></pre></td>
    +  <td class="content" colspan=2
    +   tal:condition="python:context.content.is_view_ok()"><pre
    +   tal:content="structure context/content/hyperlinked"></pre></td>
    +  <td class="content" colspan=2
    +      tal:condition="python:not context.content.is_view_ok()">
    +            Message has been classified as spam and is therefore not
    +      available to unauthorized users. If you think this is
    +      incorrect, please login and report the message as being
    +      misclassified. 
    +  </td> 
      </tr>
     </table>

Diff for 'file.item.html' from classic template::

     Index: file.item.html
     ===================================================================
     --- file.item.html (revision 56578)
     +++ file.item.html (working copy)
     @@ -29,6 +29,16 @@
       </tr>
      
       <tr>
     +  <th i18n:translate="">SpamBayes Score</th>
     +  <td tal:content="structure context/spambayes_score/plain"></td>
     + </tr>
     +
     + <tr>
     +  <th i18n:translate="">Marked as misclassified</th>
     +  <td tal:content="structure context/spambayes_misclassified/plain"></td>
     + </tr>
     +
     + <tr>
        <td>
         &nbsp;
         <input type="hidden" name="@template" value="item">
     @@ -42,10 +52,30 @@
      </table>
      </form>
      
     -<a tal:condition="python:context.id and context.is_view_ok()"
     +<p tal:condition="python:utils.sb_is_spam(context)" class="error-message">
     +   File has been classified as spam.</p>
     +
     +<a tal:condition="python:context.id and context.content.is_view_ok()"
       tal:attributes="href string:file${context/id}/${context/name}"
       i18n:translate="">download</a>
      
     +<p tal:condition="python:context.id and not context.content.is_view_ok()">
     +   Files classified as spam are not available for download by
     +   unauthorized users. If you think the file has been misclassified,
     +   please login and click on the button for reclassification.
     +</p>
     +
     +
     +     <form method="POST" onSubmit="return submit_once()"
     +       enctype="multipart/form-data"
     +       tal:attributes="action context/designator"
     +       tal:condition="python:request.user.hasPermission('SB: May Classify')">
     + 
     +      <input type="hidden" name="@action" value="spambayes_classify">
     +      <input type="submit" name="trainspam" value="Mark as SPAM" i18n:attributes="value">
     +      <input type="submit" name="trainham" value="Mark as HAM (not SPAM)" i18n:attributes="value">
     +     </form>
     +
      <tal:block tal:condition="context/id" tal:replace="structure context/history" />
      
      </td>

Diff for 'issue.item.html' from classic template::

     Index: issue.item.html
     ===================================================================
     --- issue.item.html        (revision 56578)
     +++ issue.item.html        (revision 56595)
     @@ -182,7 +182,12 @@
        </tr>
        <tr>
         <td colspan="4" class="content">
     -    <pre tal:content="structure msg/content/hyperlinked">content</pre>
     +    <p class="error-message"
     +       tal:condition="python:utils.sb_is_spam(msg)">
     +       Message has been classified as spam.
     +    </p>
     +    <pre tal:condition="python:msg.content.is_view_ok()"
     +         tal:content="structure msg/content/hyperlinked">content</pre>
         </td>
        </tr>
       </tal:block>

In summary, the 'item' pages for 'file' and 'msg' are modified so that they do not display the content if this is not allowed, instead displaying a message that the content has been classified as spam. There are also buttons for reclassification, shown only if the current user is permitted to reclassify.

The 'item' page for 'issue' is modified the same way: content from 'msg' instances marked as spam is not displayed to users without permission to see it.
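
The templates above call `utils.sb_is_spam`, a templating utility registered by extensions/spambayes.py. A sketch of what such a utility might look like is shown here. The config key name `SPAMBAYES_SPAM_CUTOFF`, the use of the item's `_db` attribute and the "at or above the cutoff" comparison are assumptions, so check the shipped extension before relying on them.

```python
def is_spam(obj):
    """Return True if the item's stored score is at or above the cutoff.

    'obj' is assumed to behave like a mapping of property values and to
    carry a '_db' attribute giving access to the tracker configuration.
    """
    cutoff = float(obj._db.config.detectors['SPAMBAYES_SPAM_CUTOFF'])
    try:
        score = obj['spambayes_score']
    except KeyError:
        # The class has no such property: treat the item as not spam.
        return False
    if score is None:
        # Not scored yet.
        return False
    return score >= cutoff


def init(instance):
    # Make the helper available to templates as utils.sb_is_spam.
    instance.registerUtil('sb_is_spam', is_spam)
```

Keeping the cutoff in detectors/config.ini means the detector and the templates read the same threshold, so it can be tuned in one place.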

Setup Permissions

Last but not least, we need to configure security. This is done in 'schema.py' as usual.

First, we add a new role, 'Coordinator'. Users with this role are allowed to reclassify messages, training SpamBayes. Then we create two new permissions, and assign one of them to the 'Coordinator' role:

     db.security.addRole(name='Coordinator', description='A coordinator')
     
     db.security.addPermission(name="SB: May Classify")
     db.security.addPermission(name="SB: May Report Misclassified")
     
     db.security.addPermissionToRole('Coordinator', 'SB: May Classify')

Then the security settings for the 'Anonymous' role are configured as follows:

     for cl in 'file', 'msg':
         p = db.security.addPermission(name='View', klass=cl,
                                       description="allowed to see metadata of file object regardless of spam status",
                                       properties=('creation', 'activity',
                                                   'creator', 'actor',
                                                   'name', 'spambayes_score',
                                                   'spambayes_misclassified',
                                                   'author', 'recipients',
                                                   'date', 'files', 'messageid',
                                                   'inreplyto', 'type',
                                                   ))
     
         db.security.addPermissionToRole('Anonymous', p)
         
         spamcheck = db.security.addPermission(name='View', klass=cl,
                                               description="allowed to see content and summary of object unless classified as spam",
                                               properties=('content', 'summary'),
                                               check=may_view_spam(cl))
         
         db.security.addPermissionToRole('Anonymous', spamcheck)
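
The `may_view_spam` check used above is not defined in this document. A sketch of such a check factory follows. It assumes that the cutoff is available under a hypothetical `SPAMBAYES_SPAM_CUTOFF` key read from detectors/config.ini, and that items without a score should remain viewable. Roundup calls permission check functions with `(db, userid, itemid)`.

```python
def may_view_spam(classname):
    """Build a permission check for items of the given class."""

    def check(db, userid, itemid):
        # Assumed config key; adjust to match your detectors/config.ini.
        cutoff = float(db.config.detectors['SPAMBAYES_SPAM_CUTOFF'])
        klass = db.getclass(classname)
        try:
            score = klass.get(itemid, 'spambayes_score')
        except KeyError:
            # No such property on this class: nothing to hide.
            return True
        if score is None:
            # Item has not been scored yet.
            return True
        # Only content scoring below the cutoff is visible.
        return score < cutoff

    return check
```

Because the check is attached only to the 'content' and 'summary' properties, anonymous users can still see the metadata of a spam item, just not the text a spammer wants indexed.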

An Example Instance

The python-dev meta tracker schema is based on the classic template and has the integration described in this document already built in. Check it out as follows:

   svn co http://svn.python.org/projects/tracker/instances/meta

Credits

Erik Forsberg (http://efod.se) wrote the original version of the SpamBayes integration as well as this document.

Thanks to Skip Montanaro for answering SpamBayes questions.
