Apr 2018

Blog Text Classifier

Hello again, faithful GeeryDev readers. If you frequent GeeryDev, then it's likely that you also frequent Throwing It Back Weekly (TIBW for the remainder of this post). Some might even say that there is a little friendly community rivalry here. But I will leave that for another time.
Today, I would like to set the record straight on who said what on these blogs. I have been working on a simple Naive Bayes text classifier that estimates whether a text query is more likely to be of GeeryDev or TIBW origin. Before I get into the specifics, or if you'd rather ignore the specifics, have some fun with it. You may come away with some questions that I will address later. What good is this classification? Why, when searching random words, does it seem to just default to GeeryDev? This isn't good enough; how can we make it better? Don't worry, I'll give you my thoughts at the end of this post.

Bernoulli Naive Bayes Text Classifier

So, what does this mean? Well, if we take it one term at a time, it's really quite straightforward. Let's start with Text Classifier. This one is simple enough: we are going to be parsing text to solve a classification problem. Our classification problem? Given a text query, is it more likely from GeeryDev or TIBW? These will be our two classes, and our blog post history will be the sample text we use to make a prediction. Now, let's get to the heart of the project. Bayes refers to Bayesian inference. We use Bayes' formula to classify the text, combining a prior probability (blog history) with a likelihood function (occurrences of words). Naive, in our case, means that we are going to assume each word in the text query is independent of the others, given the blog it came from. This isn't true: the words 'python' and 'programming' are much more likely to appear together than, say, 'python' and 'redbull', but for the most part this tends not to ruin our results. Lastly, Bernoulli, again in our case, means that we will treat words as Boolean occurrences when searching the blog history rather than as frequency counts. We only care whether a word occurs in a post, not how many times it occurs.
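To make the pieces above concrete, here is a minimal sketch of a Bernoulli Naive Bayes classifier in plain Python. It is not the code behind this post: the function names, the toy posts, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def tokenize(text):
    # Bernoulli model: keep only the set of distinct words, not their counts.
    return set(text.lower().split())

def train(posts):
    """posts is a list of (label, text) pairs, one pair per blog post."""
    doc_counts = defaultdict(int)                             # posts per blog
    word_doc_counts = defaultdict(lambda: defaultdict(int))   # posts containing each word
    vocab = set()
    for label, text in posts:
        doc_counts[label] += 1
        words = tokenize(text)
        vocab |= words
        for word in words:
            word_doc_counts[label][word] += 1
    return doc_counts, word_doc_counts, vocab

def classify(query, doc_counts, word_doc_counts, vocab):
    total_docs = sum(doc_counts.values())
    query_words = tokenize(query)
    scores = {}
    for label, n_docs in doc_counts.items():
        # Prior: the share of all posts written by this blog.
        score = math.log(n_docs / total_docs)
        # Likelihood: every vocabulary word contributes, present or absent.
        for word in vocab:
            # Add-one smoothing (an assumption) avoids log(0) for unseen words.
            p = (word_doc_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p if word in query_words else 1 - p)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy posts, not real blog content.
toy_posts = [
    ("GeeryDev", "python programming tutorial"),
    ("GeeryDev", "python classifier code"),
    ("TIBW", "throwback music weekly"),
    ("TIBW", "weekly throwback photos"),
]
model = train(toy_posts)
print(classify("python code", *model))
print(classify("weekly throwback", *model))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once the vocabulary grows beyond a toy example.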

So... About those questions

What good is this classification?

Although I strongly disagree, some might consider this a useless implementation of Naive Bayes. A more practical classification problem to apply it to would be email filtering (spam versus not spam), which is considered one of the great examples of where Naive Bayes text classifiers have worked very well.

On more obscure words, why does it default to GeeryDev?

I am glad you asked. As we discussed earlier, Bayesian inference uses a prior probability as the starting point for calculating a posterior probability. Considering the GeeryDev and TIBW blog histories, GeeryDev has written just slightly more than TIBW. This means that, before we look at a single word of a text query, GeeryDev is already slightly more likely to have written it. The likelihood function will of course have no problem overcoming this history in cases where TIBW is more likely, but there is a hill to climb.
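The effect of the prior can be sketched with a few lines of arithmetic. The post counts and likelihood values below are made up for illustration; they are not the real blog histories.

```python
def posterior(priors, likelihoods):
    # Bayes' formula: posterior is proportional to prior times likelihood.
    unnormalized = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(unnormalized.values())
    return {c: value / total for c, value in unnormalized.items()}

# Hypothetical post counts: GeeryDev has written slightly more.
post_counts = {"GeeryDev": 12, "TIBW": 10}
total_posts = sum(post_counts.values())
priors = {c: n / total_posts for c, n in post_counts.items()}

# An obscure query tells us nothing, so both likelihoods are equal
# and the prior alone decides the winner.
obscure = posterior(priors, {"GeeryDev": 0.5, "TIBW": 0.5})

# A query that favors TIBW strongly enough climbs over the prior.
tibw_flavored = posterior(priors, {"GeeryDev": 0.2, "TIBW": 0.6})

print(obscure)
print(tibw_flavored)
```

When the likelihoods are equal, the posterior is just the prior, which is why an uninformative query lands on the blog with more posts.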

How can we make this better?

You mean, how can we make this worse!?? Of course, the biggest opportunity to improve this classifier is simply getting more data. GeeryDev and TIBW would need to write considerably more for this thing to be impressive. I think a multinomial approach (as opposed to Bernoulli), where word counts matter, may work a little better. And my wildest dreams would include a word association system such as word2vec, to get more comprehension out of such a small vocabulary. This would further violate our Naive assumption, but I would be excited to see the results. Got any ideas? Help me out. You can see the source here and your knowledge is always greatly appreciated.
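For the curious, here is a rough sketch of what the multinomial variant might look like. Again, the names, the toy posts, and the add-one smoothing are assumptions for illustration, not the project's actual code. The difference from the Bernoulli model is that every occurrence of a word is counted, and only the words in the query contribute to the score.

```python
import math
from collections import Counter

def train_multinomial(posts):
    """posts is a list of (label, text) pairs, one pair per blog post."""
    doc_counts = Counter()   # posts per blog, used for the prior
    word_counts = {}         # total occurrences of each word, per blog
    for label, text in posts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return doc_counts, word_counts, vocab

def classify_multinomial(query, doc_counts, word_counts, vocab):
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, n_docs in doc_counts.items():
        counts = word_counts[label]
        # Add-one smoothing (an assumption) over the whole vocabulary.
        denominator = sum(counts.values()) + len(vocab)
        score = math.log(n_docs / total_docs)
        for word in query.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((counts[word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy posts: GeeryDev repeats 'python' three times in one post.
toy_counts_posts = [
    ("GeeryDev", "python python python redbull"),
    ("TIBW", "redbull python throwback"),
]
counts_model = train_multinomial(toy_counts_posts)
print(classify_multinomial("python", *counts_model))
print(classify_multinomial("throwback", *counts_model))
```

Here the repeated 'python' raises GeeryDev's estimate for that word, which is exactly the information a Boolean occurrence throws away.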