Oct 20, 2006

Bayesian multi-category classification in Gmail?

What this would do is something like what POPFile does for email clients like Thunderbird or Outlook or whatever: automatically categorise incoming emails based on keywords they contain. POPFile is trainable and it's supposed to reach a pretty high accuracy after a couple of weeks. POPFile doesn't actually put your emails into different folders in your email program; it just marks them as belonging to one category or other (say `Work', `Family', `Junk', `Stamp collecting', etc.). Then you set up your email program so that it puts these emails into different folders, or deletes them, or forwards them, or whatever.

What POPFile does

OK, first of all, you're probably asking: why would you want software like POPFile to classify emails for you when you can just set up filters in your email program to do that based on who they're from, what the subject is, and so on? The reason is that your email client's filters are static: they do not learn about new family members who are sending you email, nor about new correspondents from your workplace, nor about the new junk mailers you have to deal with all the time. In fact, programs like Thunderbird already have Bayesian filtering to deal with this problem of continually-changing junk mail -- it's just that POPFile goes one step further and tries to identify your mails as belonging to arbitrary categories that you set up.

The idea is, once you've set up POPFile to recognise email from your family, from your work, and from your stamp collecting buddies, it will correctly identify these different types of emails, say, about 99.99% of the time. The rest of the time, which is presumably a piffling amount, you'll be telling POPFile something like `no, this isn't junk, it's just my little brother, mark it as ``Family'' '. And POPFile will continue to learn from each correction, using Bayesian statistical analysis.

So the end result is, you can set up your email program to put email marked `Work' in the right folder, and so on, without having to worry about updating your filters all the time.
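To make the train-and-correct idea above a bit more concrete, here is a minimal sketch of a multi-category naive Bayes classifier in JavaScript. This is not POPFile's actual code (I haven't ported anything); the function names, the tokeniser and the add-one smoothing are all my own assumptions, chosen just to show the general technique.

```javascript
// Minimal naive Bayes sketch. Every name here is hypothetical;
// this is NOT how POPFile itself is implemented.

// Split a message into lowercase word tokens (a deliberately crude tokeniser).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// An empty classifier: per-bucket word counts plus the totals needed for scoring.
function createClassifier() {
  return {
    wordCounts: new Map(), // bucket -> Map(word -> count)
    totalWords: new Map(), // bucket -> number of tokens seen
    docCounts: new Map(),  // bucket -> number of messages seen
    totalDocs: 0,
    vocab: new Set()
  };
}

// Teach the classifier that `text` belongs in `bucket`.
// Correcting a mistake is just another call to train() with the right bucket.
function train(c, text, bucket) {
  if (!c.wordCounts.has(bucket)) {
    c.wordCounts.set(bucket, new Map());
    c.totalWords.set(bucket, 0);
    c.docCounts.set(bucket, 0);
  }
  c.docCounts.set(bucket, c.docCounts.get(bucket) + 1);
  c.totalDocs += 1;
  const counts = c.wordCounts.get(bucket);
  for (const word of tokenize(text)) {
    counts.set(word, (counts.get(word) || 0) + 1);
    c.totalWords.set(bucket, c.totalWords.get(bucket) + 1);
    c.vocab.add(word);
  }
}

// Return the bucket with the highest log-probability for `text`,
// or null if nothing has been trained yet.
function classify(c, text) {
  const words = tokenize(text);
  let best = null;
  let bestScore = -Infinity;
  for (const [bucket, docs] of c.docCounts) {
    // Prior: how often this bucket has been seen so far.
    let score = Math.log(docs / c.totalDocs);
    const counts = c.wordCounts.get(bucket);
    const denom = c.totalWords.get(bucket) + c.vocab.size;
    for (const word of words) {
      // Add-one smoothing so unseen words don't zero out a bucket.
      score += Math.log(((counts.get(word) || 0) + 1) / denom);
    }
    if (score > bestScore) {
      bestScore = score;
      best = bucket;
    }
  }
  return best;
}
```

You would use it roughly like this: call train(c, 'meeting agenda project deadline', 'Work') and so on for each bucket, then classify(c, someNewMessage) for incoming mail. When it gets one wrong, you call train() again with the right bucket, which is the `no, this isn't junk' step.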

Now here's my question: why should email program users get these benefits exclusively? Why can't we have something like this for webmail users? Specifically, Gmail users (like me, and it seems half the world nowadays)? Maybe we can. It boils down to three things: Gmail's JavaScript functions, POPFile's statistical categorisation methods, and Firefox's Greasemonkey extension.

What Greasemonkey does

Basically, Greasemonkey allows you to customise Web pages in Firefox in almost unlimited ways with a little JavaScript programming, using that page's Document Object Model and any JavaScript functions defined in it. Check out http://persistent.info/archives/2005/03/01/gmail-searches for an idea about just how powerful Greasemonkey is, and what it can do to Gmail.
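For a rough idea of what such a script looks like, here is a minimal sketch. The metadata block at the top is the standard Greasemonkey header; everything else is hypothetical. In particular, the selectors 'tr.message-row' and '.subject' are placeholders I made up, not Gmail's real markup, and labelRows and pickBucket are just illustrative names.

```javascript
// ==UserScript==
// @name        Bucket labels sketch (hypothetical)
// @include     https://mail.google.com/*
// ==/UserScript==

// Walk over message rows and put a small text label in front of each subject.
// ASSUMPTION: 'tr.message-row' and '.subject' are placeholder selectors;
// the real page structure would have to be inspected first.
function labelRows(doc, pickBucket) {
  const rows = doc.querySelectorAll('tr.message-row');
  let labelled = 0;
  for (const row of rows) {
    const subject = row.querySelector('.subject');
    if (!subject) continue; // skip rows that don't look like messages
    const badge = doc.createElement('span');
    badge.textContent = '[' + pickBucket(subject.textContent) + '] ';
    subject.parentNode.insertBefore(badge, subject);
    labelled += 1;
  }
  return labelled;
}

// Only touch the page when running inside a browser.
if (typeof document !== 'undefined') {
  // Placeholder bucket chooser; a real script would call a classifier here.
  labelRows(document, function () { return 'Unsorted'; });
}
```

The point is just that the script gets handed the page's DOM and can add to it freely. Passing the document in as a parameter, rather than using the global directly, also makes the function easy to try out against a fake page.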

I've tried out the above hack, and it actually does work, with a few hiccups. Furthermore, I've tried programming Greasemonkey scripts myself and I can tell you it's a really powerful way of customising the websites you love to make them even more useful. There are actually a ton of scripts people have written out there, and the best place to get them is userscripts.org. Check it out.

POPFile for Gmail?

OK, now we know we can extend Gmail's functionality in amazing ways with Greasemonkey hacks like the one above. Essentially, Greasemonkey is giving us the means to program a user interface for the new Bayesian classification features we want in Gmail. Greasemonkey scripts are written in JavaScript. Now I'm pretty sure the `business logic' of POPFile, which is currently written in Perl, can be ported to JavaScript without too much trouble. The end result: an interface in Gmail that tags incoming messages and lets you quickly check for and correct mistakes, training it as you go, and bringing the convenience of automatic Bayesian classification to Gmail. Anybody up for it?
