Performing naïve Bayes classification

Let’s get right into it! Let’s begin with naïve Bayes classification. This ML model relies heavily on results from previous chapters, specifically Bayes’ theorem:

P(H|D) = P(D|H) · P(H) / P(D)

Let’s look a little closer at the terms of this formula:

  • P(H) is the probability of the hypothesis before we observe the data, called the prior probability, or just prior
  • P(H|D) is what we want to compute: the probability of the hypothesis after we observe the data, called the posterior
  • P(D|H) is the probability of the data under the given hypothesis, called the likelihood
  • P(D) is the probability of the data under any hypothesis, called the normalizing constant
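To make these four terms concrete, here is a minimal sketch of Bayes’ theorem applied to a toy spam-filter question: what is the probability a message is spam, given that it contains a particular word? The numbers below are hypothetical and not taken from the book’s dataset:

```python
# Hypothetical probabilities for illustration only
p_spam = 0.2             # P(H): prior probability that a message is spam
p_word_given_spam = 0.6  # P(D|H): likelihood of seeing the word in spam
p_word_given_ham = 0.05  # likelihood of seeing the word in ham

# P(D): normalizing constant, via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(H|D): posterior probability of spam, given the word appeared
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.75
```

Note how the evidence shifts our belief: the prior was only 0.2, but observing a word that is far more common in spam raises the posterior to 0.75.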

Naïve Bayes classification is a classification model, and therefore a supervised model. Given this, what kind of data do we need – labeled or unlabeled data?

(Insert Jeopardy music here)

If you answered labeled data, then you’re well on your way to becoming a data scientist!

Suppose we have a dataset with n features, (x1, x2, …, xn), and a class label, C. For example, let’s take some data involving spam text classification. Our data would consist of rows of individual text samples and columns of both our features and our class labels. Our features would be words and phrases that are contained within the text samples, and our class labels are simply spam or not spam. In this scenario, I will replace the not-spam class with an easier-to-say word, ham. Let’s take a look at the following code snippet to better understand our spam and ham data:


import pandas as pd
import sklearn

df = pd.read_table('../data/sms.tsv',
                   sep='\t', header=None, names=['label', 'msg'])
df

Figure 11.1 is a sample of text data in a row-column format:

Figure 11.1 – A sample of our spam versus not spam (ham) messages

Let’s do some preliminary statistics to see what we are dealing with. Let’s see the difference in the number of ham and spam messages at our disposal:


df.label.value_counts().plot(kind="bar")

This gives us a bar chart, as shown in Figure 11.2:

Figure 11.2 – The distribution of ham versus spam
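Beyond the raw counts, the class *proportions* are worth a look, because they give us a baseline: a model that always predicts the majority class will be "right" that often by default. Here is a sketch using a small hypothetical stand-in for the labels (with the real data, you would call this on `df.label`):

```python
import pandas as pd

# Hypothetical stand-in labels; substitute df.label with the real dataset
labels = pd.Series(['ham', 'ham', 'ham', 'ham', 'spam'])

counts = labels.value_counts()                      # raw counts, as plotted above
proportions = labels.value_counts(normalize=True)   # fraction of each class

print(counts['ham'], counts['spam'])  # 4 1
print(proportions['ham'])             # 0.8
```

On imbalanced data like this, always guessing ham scores 80% accuracy without learning anything, which is exactly why we need more than accuracy to judge a classifier.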

Because we are dealing with classification, it would help to itemize some of the metrics we will be using to evaluate our model.
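As a preview of such metrics, scikit-learn exposes accuracy, precision, and recall as ready-made functions. The label vectors below are hypothetical, just to show the calls:

```python
# A minimal sketch of common classification metrics with scikit-learn;
# y_true and y_pred are hypothetical labels, not model output
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = ['spam', 'ham', 'ham', 'spam', 'ham']
y_pred = ['spam', 'ham', 'spam', 'spam', 'ham']

acc = accuracy_score(y_true, y_pred)                      # 4 of 5 correct: 0.8
prec = precision_score(y_true, y_pred, pos_label='spam')  # 2 of 3 predicted spam are spam
rec = recall_score(y_true, y_pred, pos_label='spam')      # both true spams were caught: 1.0
```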
