I am finding it hard to understand the process of Naive Bayes, and I was wondering if someone could explain it with a simple step-by-step process in English. I understand it compares counts of occurrences to form probabilities, but I have no idea how the training data is related to the actual dataset. Please explain what role the training set plays. I am giving a very simple fruit example here, like banana for example:
training set:
round-red
round-orange
oblong-yellow
round-red

dataset:
round-red
round-orange
round-red
round-orange
oblong-yellow
round-red
round-orange
oblong-yellow
oblong-yellow
round-red
asked Apr 8, 2012 at 0:56
It's quite easy if you understand Bayes' Theorem. If you haven't read up on Bayes' theorem, try this link: yudkowsky.net/rational/bayes (Commented Apr 8, 2012 at 3:43)

NOTE: The accepted answer below is not a traditional example for Naïve Bayes. It's mostly a k-Nearest Neighbor implementation. Read accordingly. (Commented Nov 26, 2013 at 4:01)

Here's a quick, visual description of Bayes' Theorem by Oscar Bonilla: oscarbonilla.com/2009/05/visualizing-bayes-theorem (Commented Mar 16, 2014 at 22:37)

Well, if one sees a graph with some dots, that doesn't really mean it is kNN :) How you calculate the probabilities is entirely up to you. Naive Bayes calculates them as the prior multiplied by the likelihood, and that is what Yavar has shown in his answer. How to arrive at those probabilities is really not important here. The answer is absolutely correct and I see no problems in it. (Commented Dec 23, 2014 at 13:32)

The accepted answer has many elements of k-NN (k-nearest neighbors), a different algorithm.
Both k-NN and Naive Bayes are classification algorithms. Conceptually, k-NN uses the idea of "nearness" to classify new entities; in k-NN, 'nearness' is modeled with ideas such as Euclidean distance or cosine distance. By contrast, in Naive Bayes, the concept of 'probability' is used to classify new entities.
Since the question is about Naive Bayes, here's how I'd describe the ideas and steps to someone. I'll try to do it with as few equations as possible, in plain English.
Before someone can understand and appreciate the nuances of Naive Bayes, they need to know a couple of related concepts first, namely the idea of Conditional Probability and Bayes' Rule. (If you are familiar with these concepts, skip to the section titled Getting to Naive Bayes.)
Conditional Probability in plain English: What is the probability that something will happen, given that something else has already happened.
Let's say that there is some Outcome O and some Evidence E. From the way these probabilities are defined: the probability of having both the Outcome O and the Evidence E is (the probability of O occurring) multiplied by (the probability of E given that O happened).
One Example to understand Conditional Probability:
Let say we have a collection of US Senators. Senators could be Democrats or Republicans. They are also either male or female.
If we select one senator completely randomly, what is the probability that this person is a female Democrat? Conditional Probability can help us answer that.
Probability of (Democrat and Female Senator) = Prob(Senator is Democrat) multiplied by Conditional Probability of Being Female given that they are a Democrat.
P(Democrat & Female) = P(Democrat) * P(Female | Democrat)
We could compute the exact same thing, the reverse way:
P(Democrat & Female) = P(Female) * P(Democrat | Female)
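To make the two factorizations concrete, here is a minimal Python sketch with made-up counts (the numbers are purely illustrative, not real Senate data); both routes give the same joint probability:

# Hypothetical counts, for illustration only (not real Senate data)
total_senators = 100
democrats = 48
female_democrats = 20
females = 35

p_democrat = democrats / total_senators                  # P(Democrat)
p_female_given_democrat = female_democrats / democrats   # P(Female | Democrat)
p_joint_1 = p_democrat * p_female_given_democrat         # P(Democrat & Female)

p_female = females / total_senators                      # P(Female)
p_democrat_given_female = female_democrats / females     # P(Democrat | Female)
p_joint_2 = p_female * p_democrat_given_female           # the same P(Democrat & Female)

print(round(p_joint_1, 3), round(p_joint_2, 3))          # 0.2 0.2: 20 female Democrats out of 100 senators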
Conceptually, this is a way to go from P(Evidence| Known Outcome) to P(Outcome|Known Evidence). Often, we know how frequently some particular evidence is observed, given a known outcome. We have to use this known fact to compute the reverse, to compute the chance of that outcome happening, given the evidence.
P(Outcome given that we know some Evidence) = P(Evidence given that we know the Outcome) times Prob(Outcome), scaled by the P(Evidence)
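In the compact notation used above, that is:

P(Outcome | Evidence) = P(Evidence | Outcome) * P(Outcome) / P(Evidence)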
The classic example to understand Bayes' Rule:
Probability of Disease D given Test-positive =

               P(Test is positive | Disease) * P(Disease)
     _____________________________________________________________
     (scaled by) P(Testing Positive, with or without the disease)
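As a quick numeric sketch of that formula (the prevalence and test accuracy below are made-up numbers, chosen only to show the mechanics):

# Hypothetical numbers, for illustration only
p_disease = 0.01              # P(Disease): 1% of people have the disease
p_pos_given_disease = 0.90    # P(Test positive | Disease)
p_pos_given_healthy = 0.08    # P(Test positive | No disease), the false-positive rate

# P(Testing positive, with or without the disease)
p_positive = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' Rule
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
print(round(p_disease_given_pos, 3))   # ~0.102: even after a positive test, the disease is still unlikely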
Now, all this was just preamble, to get to Naive Bayes.
So far, we have talked only about one piece of evidence. In reality, we have to predict an outcome given multiple pieces of evidence. In that case, the math gets very complicated. To get around that complication, one approach is to 'uncouple' the multiple pieces of evidence and treat each piece of evidence as independent. This approach is why this is called naive Bayes.
P(Outcome | Multiple Evidence) =
    P(Evidence1 | Outcome) * P(Evidence2 | Outcome) * ... * P(EvidenceN | Outcome) * P(Outcome)
    scaled by P(Multiple Evidence)
Many people choose to remember this as:
                          P(Likelihood of Evidence) * Prior prob of outcome
P(outcome | evidence) =   _________________________________________________
                                           P(Evidence)
Notice a few things about this equation:
Just run the formula above for each possible outcome. Since we are trying to classify, each outcome is called a class and it has a class label. Our job is to look at the evidence, to consider how likely it is to be this class or that class, and assign a label to each entity. Again, we take a very simple approach: The class that has the highest probability is declared the "winner" and that class label gets assigned to that combination of evidences.
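In code, that final step is just an argmax over the per-class scores. A minimal sketch (the numbers in the scores dict are placeholders, borrowed from the fruit example that follows):

# Per-class scores, i.e. P(class | evidence) up to the common scaling factor P(Evidence)
scores = {"Banana": 0.252, "Orange": 0.0, "Other Fruit": 0.01875}

# The class with the highest score is declared the "winner"
predicted_label = max(scores, key=scores.get)
print(predicted_label)   # Banana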
Let's try it out on an example to increase our understanding: The OP asked for a 'fruit' identification example.
Let's say that we have data on 1000 pieces of fruit. They happen to be Banana, Orange or some Other Fruit. We know 3 characteristics about each fruit: whether it is Long, whether it is Sweet, and whether it is Yellow.
This is our 'training set.' We will use this to predict the type of any new fruit we encounter.
Type         | Long | Not Long || Sweet | Not Sweet || Yellow | Not Yellow | Total
             ______________________________________________________________________
Banana       |  400 |     100  ||  350  |    150    ||   450  |      50    |   500
Orange       |    0 |     300  ||  150  |    150    ||   300  |       0    |   300
Other Fruit  |  100 |     100  ||  150  |     50    ||    50  |     150    |   200
             ______________________________________________________________________
Total        |  500 |     500  ||  650  |    350    ||   800  |     200    |  1000
We can pre-compute a lot of things about our fruit collection.
The so-called "Prior" probabilities. (If we didn't know any of the fruit attributes, this would be our guess.) These are our base rates.
P(Banana)      = 0.5 (500/1000)
P(Orange)      = 0.3
P(Other Fruit) = 0.2
Probability of "Evidence"
P(Long)   = 0.5
P(Sweet)  = 0.65
P(Yellow) = 0.8
Probability of "Likelihood"
P(Long|Banana) = 0.8
P(Long|Orange) = 0 [Oranges are never long in all the fruit we have seen.]
...
P(Yellow|Other Fruit)     = 50/200 = 0.25
P(Not Yellow|Other Fruit) = 0.75
Let's say that we are given the properties of an unknown fruit, and asked to classify it. We are told that the fruit is Long, Sweet and Yellow. Is it a Banana? Is it an Orange? Or Is it some Other Fruit?
We can simply run the numbers for each of the 3 outcomes, one by one. Then we choose the highest probability and 'classify' our unknown fruit as belonging to the class that had the highest probability based on our prior evidence (our 1000 fruit training set):
P(Banana | Long, Sweet and Yellow)
      P(Long|Banana) * P(Sweet|Banana) * P(Yellow|Banana) * P(Banana)
    = _______________________________________________________________
                    P(Long) * P(Sweet) * P(Yellow)

    = 0.8 * 0.7 * 0.9 * 0.5 / P(evidence)

    = 0.252 / P(evidence)


P(Orange | Long, Sweet and Yellow) = 0


P(Other Fruit | Long, Sweet and Yellow)
      P(Long|Other fruit) * P(Sweet|Other fruit) * P(Yellow|Other fruit) * P(Other Fruit)
    = ____________________________________________________________________________________
                                        P(evidence)

    = (100/200 * 150/200 * 50/200 * 200/1000) / P(evidence)

    = 0.01875 / P(evidence)
By an overwhelming margin ( 0.252 >> 0.01875 ), we classify this Sweet/Long/Yellow fruit as likely to be a Banana.
Look at what it eventually comes down to. Just some counting and multiplication. We can pre-compute all these terms, and so classifying becomes easy, quick and efficient.
Let z = 1 / P(evidence). Now we quickly compute the following three quantities.
P(Banana|evidence) = z * Prob(Banana) * Prob(Evidence1|Banana) * Prob(Evidence2|Banana) ...
P(Orange|Evidence) = z * Prob(Orange) * Prob(Evidence1|Orange) * Prob(Evidence2|Orange) ...
P(Other|Evidence)  = z * Prob(Other)  * Prob(Evidence1|Other)  * Prob(Evidence2|Other) ...
Assign the class label of whichever is the highest number, and you are done.
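Here is a minimal Python sketch of this whole procedure, using the counts from the 1000-fruit training table above (the table and the Long/Sweet/Yellow query come straight from the example; the code is just one possible way to write it down):

# Counts taken from the 1000-fruit training table above
counts = {
    "Banana":      {"total": 500, "Long": 400, "Sweet": 350, "Yellow": 450},
    "Orange":      {"total": 300, "Long":   0, "Sweet": 150, "Yellow": 300},
    "Other Fruit": {"total": 200, "Long": 100, "Sweet": 150, "Yellow":  50},
}
total_fruit = 1000

def score(fruit_class, evidence):
    # Prior * product of per-feature likelihoods.  The division by P(evidence)
    # is skipped: it is the same for every class and does not change the winner.
    c = counts[fruit_class]
    prior = c["total"] / total_fruit
    likelihood = 1.0
    for feature in evidence:             # e.g. ["Long", "Sweet", "Yellow"]
        likelihood *= c[feature] / c["total"]
    return prior * likelihood

evidence = ["Long", "Sweet", "Yellow"]
scores = {fruit: score(fruit, evidence) for fruit in counts}
for fruit, s in scores.items():
    print(fruit, "==>", round(s, 5))     # Banana ==> 0.252, Orange ==> 0.0, Other Fruit ==> 0.01875
print("Winner:", max(scores, key=scores.get))   # Winner: Banana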
Despite the name, Naive Bayes turns out to be excellent in certain applications. Text classification is one area where it really shines.
answered Dec 12, 2013 at 23:54 by Ram Narasimhan

Thanks for the very clear explanation! Easily one of the better ones floating around the web. Question: since each P(outcome|evidence) is multiplied by z = 1/P(evidence) (which in the fruit case means each is essentially the probability based solely on previous evidence), would it be correct to say that z doesn't matter at all for Naïve Bayes? Which would thus mean that if, say, one ran into a long/sweet/yellow fruit that wasn't a banana, it'd be classified incorrectly. (Commented Dec 21, 2013 at 2:30)

@E.Chow Yes, you are correct in that computing z doesn't matter for Naive Bayes. (It is a way to scale the probabilities to be between 0 and 1.) Note that z is the product of the probabilities of all the evidence at hand. (It is different from the priors, which are the base rates of the classes.) You are correct: if you did find a Long/Sweet/Yellow fruit that is not a banana, NB will classify it incorrectly as a banana, based on this training set. The algorithm is a 'best probabilistic guess based on evidence' and so it will misclassify on occasion. (Commented Dec 21, 2013 at 6:35)

Absolutely great explanation. I couldn't understand this algorithm from academic papers and books, perhaps because esoteric explanations are the generally accepted writing style there. That's all there is to it, and it's so easy. Thanks. (Commented Jan 19, 2015 at 19:42)

Thanks, I prefer this answer to the textbooks: they use tons of math notation that confuses rather than teaches. You, sir, are a hero! (Commented Apr 2, 2016 at 15:03)

Why don't the probabilities add up to 1? The evidence is 0.26 in the example (500/1000 * 650/1000 * 800/1000), and so the final P(banana|...) = 0.252 / 0.26 = 0.969, and P(other|...) = 0.01875 / 0.26 = 0.072. Together they add up to 1.04! (Commented Apr 8, 2016 at 0:41)

Your question, as I understand it, is divided into two parts: part one is that you need a better understanding of the Naive Bayes classifier, and part two is the confusion surrounding the training set.
In general, all machine learning algorithms need to be trained, whether for supervised learning tasks like classification and prediction, or for unsupervised learning tasks like clustering.

During the training step, an algorithm is taught with a particular input dataset (the training set) so that later we can test it on unknown inputs (which it has never seen before), which it should classify or predict (in the case of supervised learning) based on what it has learned. This is what most machine learning techniques like Neural Networks, SVMs, Bayesian methods, etc. are based upon.

So in a general machine learning project, you basically divide your input set into a Development Set (Training Set + Dev-Test Set) and a Test Set (or Evaluation Set). Remember, your basic objective is that your system learns to classify new inputs it has never seen before in either the Dev set or the Test set.
The test set typically has the same format as the training set. However, it is very important that the test set be distinct from the training corpus: if we simply reused the training set as the test set, then a model that simply memorized its input, without learning how to generalize to new examples, would receive misleadingly high scores.
In general, for example, 70% of our data can be used as training set cases. Also remember to partition the original set into training and test sets randomly.
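For instance, a common way to do such a split in Python is a sketch like the one below, using scikit-learn's train_test_split (the 70/30 ratio matches the rule of thumb above; the toy X and y data are made up):

from sklearn.model_selection import train_test_split

# X holds the feature rows, y the class labels (toy data, made up for illustration)
X = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
y = ["banana", "orange", "other", "orange", "banana", "other"]

# 70% training, 30% test, partitioned randomly
# (random_state fixes the shuffle so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))   # 4 2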
Now I come to your other question about Naive Bayes.
To demonstrate the concept of Naïve Bayes Classification, consider the example given below:
As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the currently existing objects.

Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen.
Thus, we can write:
Prior Probability of GREEN: number of GREEN objects / total number of objects

Prior Probability of RED: number of RED objects / total number of objects

Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:

Prior Probability for GREEN: 40 / 60

Prior Probability for RED: 20 / 60
Having formulated our prior probability, we are now ready to classify a new object (WHITE circle in the diagram below). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely that the new cases belong to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we calculate the number of points in the circle belonging to each class label. From this we calculate the likelihood:

From the illustration above, it is clear that the Likelihood of X given GREEN is smaller than the Likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:
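In numbers, using the counts just mentioned (1 GREEN and 3 RED inside the circle, out of 40 GREEN and 20 RED objects in total):

Likelihood of X given GREEN = number of GREEN in the vicinity of X / total number of GREEN cases = 1/40
Likelihood of X given RED   = number of RED in the vicinity of X / total number of RED cases     = 3/20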
Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN compared to RED), the likelihood indicates otherwise: that the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using the so-called Bayes' rule (named after Rev. Thomas Bayes 1702-1761).
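Combining the prior and the likelihood for each class (the common scaling factor P(X) is omitted, since it is the same for both classes and does not affect the comparison):

Posterior probability of X being GREEN ∝ Prior of GREEN * Likelihood of X given GREEN = (40/60) * (1/40) = 1/60
Posterior probability of X being RED   ∝ Prior of RED   * Likelihood of X given RED   = (20/60) * (3/20) = 3/60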
Finally, we classify X as RED since its class membership achieves the largest posterior probability.
answered Apr 8, 2012 at 12:13

isn't this algorithm above more like k-nearest neighbors? (Commented Jun 12, 2013 at 8:46)

This answer is confusing - it mixes KNN (k nearest neighbours) and naive bayes. (Commented Sep 1, 2013 at 16:39)

The answer was proceeding nicely till the likelihood came up. So @Yavar has used k-nearest neighbours for calculating the likelihood. How correct is that? If it is, what are some other methods to calculate the likelihood? (Commented Jan 31, 2014 at 6:24)

You used a circle as an example of likelihood. I read about Gaussian Naive Bayes where the likelihood is Gaussian. How can that be explained? (Commented May 7, 2015 at 3:30)

Actually, the answer with kNN is correct. If you don't know the distribution, and thus the probability density of that distribution, you have to find it somehow. This can be done via kNN or kernels. I think there are some things missing, though. You can check out this presentation. (Commented Feb 9, 2017 at 9:21)

Naive Bayes is a supervised machine learning algorithm used to classify data sets. It is used to predict things based on prior knowledge and independence assumptions.
They call it naive because its assumptions (it assumes that all of the features in the dataset are equally important and independent) are really optimistic and rarely true in most real-world applications.

It is a classification algorithm which makes decisions for unknown data sets. It is based on Bayes' Theorem, which describes the probability of an event based on prior knowledge.
The diagram below shows how naive Bayes works:
Formula to predict NB:
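In the notation used earlier in this thread, that formula is:

P(class | evidence) = P(evidence | class) * P(class) / P(evidence)

and, with several independent pieces of evidence,

P(class | e1, ..., eN) = P(e1 | class) * ... * P(eN | class) * P(class) / P(e1, ..., eN)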
How to use the Naive Bayes algorithm?

Let's take an example of how Naive Bayes works.

Step 1: First, find the likelihood table, which shows the probability of yes or no for each condition, as in the diagram below.
Step 2: Find the posterior probability of each class.
Problem: Find out the probability that the players play if the weather is Rainy.

P(Yes|Rainy) = P(Rainy|Yes) * P(Yes) / P(Rainy)

P(Rainy|Yes) = 2/9  = 0.222
P(Yes)       = 9/14 = 0.64
P(Rainy)     = 5/14 = 0.36

Now, P(Yes|Rainy) = 0.222 * 0.64 / 0.36 = 0.39, which is a fairly low probability, meaning the chances of the match being played are low.
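The same calculation as a tiny Python sketch (the 2/9, 9/14 and 5/14 counts come from the weather/play table this answer refers to):

# Counts from the weather / "play" example above
p_rainy_given_yes = 2 / 9     # P(Rainy | Yes)
p_yes = 9 / 14                # P(Yes)
p_rainy = 5 / 14              # P(Rainy)

# Bayes' rule: P(Yes | Rainy) = P(Rainy | Yes) * P(Yes) / P(Rainy)
p_yes_given_rainy = p_rainy_given_yes * p_yes / p_rainy
print(round(p_yes_given_rainy, 2))   # 0.4 (the answer above shows 0.39 because of intermediate rounding)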
For more information, refer to these blogs.
answered Feb 17, 2017 at 4:02 by Jitesh Mohite

Ram Narasimhan explained the concept very nicely. Below is an alternative explanation, through a code example of Naive Bayes in action.
It uses an example problem from this book on page 351
This is the data set that we will be using
In the above dataset, if we are given the hypothesis = {"Age": '<=30', "Income": 'medium', "Student": 'yes', "Creadit_Rating": 'fair'}, then what is the probability that he will or will not buy a computer?
The code below exactly answers that question.
Just create a file named new_dataset.csv and paste the following content.
Age,Income,Student,Creadit_Rating,Buys_Computer
<=30,high,no,fair,no
<=30,high,no,excellent,no
31-40,high,no,fair,yes
>40,medium,no,fair,yes
>40,low,yes,fair,yes
>40,low,yes,excellent,no
31-40,low,yes,excellent,yes
<=30,medium,no,fair,no
<=30,low,yes,fair,yes
>40,medium,yes,fair,yes
<=30,medium,yes,excellent,yes
31-40,medium,no,excellent,yes
31-40,high,yes,fair,yes
>40,medium,no,excellent,no
Here is the code; the comments explain everything we are doing here!
import pandas as pd
import pprint
from functools import reduce


class Classifier():

    def __init__(self, filename=None, class_attr=None):
        self.data = pd.read_csv(filename, sep=',', header=0)
        self.class_attr = class_attr
        self.priori = {}
        self.cp = {}
        self.hypothesis = None

    '''
                           How many times it appears in column
    probability(class) =   ___________________________________
                              count of all class attribute
    '''
    def calculate_priori(self):
        class_values = list(set(self.data[self.class_attr]))
        class_data = list(self.data[self.class_attr])
        for i in class_values:
            self.priori[i] = class_data.count(i) / float(len(class_data))
        print("Priori Values: ", self.priori)

    '''
    Here we calculate the individual probabilities:
                              P(Likelihood of Evidence) x Prior prob of outcome
    P(outcome|evidence) =     ___________________________________________
                                             P(Evidence)
    '''
    def get_cp(self, attr, attr_type, class_value):
        data_attr = list(self.data[attr])
        class_data = list(self.data[self.class_attr])
        total = 1  # start counting at 1 (rather than 0) to avoid zero probabilities
        for i in range(0, len(data_attr)):
            if class_data[i] == class_value and data_attr[i] == attr_type:
                total += 1
        return total / float(class_data.count(class_value))

    '''
    Here we calculate the Likelihood of Evidence and multiply all individual
    probabilities with the priori:
    P(Outcome|Multiple Evidence) = P(Evidence1|Outcome) x P(Evidence2|Outcome) x ... x P(EvidenceN|Outcome) x P(Outcome)
                                   scaled by P(Multiple Evidence)
    '''
    def calculate_conditional_probabilities(self, hypothesis):
        for i in self.priori:
            self.cp[i] = {}
            for j in hypothesis:
                self.cp[i].update({hypothesis[j]: self.get_cp(j, hypothesis[j], i)})
        print("\nCalculated Conditional Probabilities: \n")
        pprint.pprint(self.cp)

    def classify(self):
        print("Result: ")
        for i in self.cp:
            print(i, " ==> ", reduce(lambda x, y: x * y, self.cp[i].values()) * self.priori[i])


if __name__ == "__main__":
    c = Classifier(filename="new_dataset.csv", class_attr="Buys_Computer")
    c.calculate_priori()
    c.hypothesis = {"Age": '<=30', "Income": 'medium', "Student": 'yes', "Creadit_Rating": 'fair'}
    c.calculate_conditional_probabilities(c.hypothesis)
    c.classify()
Priori Values:  {'yes': 0.6428571428571429, 'no': 0.35714285714285715}

Calculated Conditional Probabilities:

{'no': {'<=30': 0.8, 'fair': 0.6, 'medium': 0.6, 'yes': 0.4},
 'yes': {'<=30': 0.3333333333333333,
         'fair': 0.7777777777777778,
         'medium': 0.5555555555555556,
         'yes': 0.7777777777777778}}

Result:
yes  ==>  0.0720164609053
no   ==>  0.0411428571429