A lot of common problems in machine learning involve classification of isolated data points that are independent of each other. For instance, given an image, predict whether it contains a cat or a dog, or given an image of a handwritten character, predict which digit out of 0 through 9 it is.
It turns out that a lot of problems do not fit into the above framework. For example, given a sentence, “I like machine learning,” tag each word with its part-of-speech (a noun, a pronoun, a verb, an adjective, etc.). As even this simple example demonstrates, this task cannot be solved by treating each word independently — “learning” could be a noun or a verb depending on its context. This task is important for a lot of more complex tasks on text, such as translating from one language to another, text to speech, etc.
It is not obvious how you would use a standard classification model to handle these problems. A powerful framework which can be used to learn such models with dependency is probabilistic graphical models (PGM). For this post, the Statsbot team asked a data scientist, Prasoon Goyal, to make a tutorial on this framework to us.
Before talking about how to apply a probabilistic graphical model to a machine learning problem, we need to understand the PGM framework. Formally, a probabilistic graphical model (or graphical model for short) consists of a graph structure. Each node of the graph is associated with a random variable, and the edges in the graph are used to encode relations between the random variables.
Depending on whether the graph is directed or undirected, we can classify graphical modes into two categories — Bayesian networks and Markov networks.
Bayesian networks: Directed graphical models
A canonical example of Bayesian networks is the so-called “student network,” which looks like this:
This graph describes a setting for a student enrolled in a university class. There are 5 random variables in the graph:
Difficulty (of the class): Takes values 0 (low difficulty) and 1 (high difficulty)
Intelligence (of the student): Takes values 0 (not intelligent) and 1 (intelligent)
Grade (the student gets in the class): Takes values 1 (good grade), 2 (average grade), and 3 (bad grade)
SAT (student’s score in the SAT exam): Takes values 0 (low score) and 1 (high score)
Letter (quality of recommendation letter the student gets from the professor after completing the course): Takes values 0 (not a good letter) and 1 (a good letter)
The edges in the graph encode dependencies in the graph.
- The “Grade” of the student depends on the “Difficulty” of the class and the “Intelligence” of the student.
- The “Grade,” in turn, determines whether the student gets a good letter of recommendation from the professor.
- Also, the “Intelligence” of the student influences their “SAT” score, in addition to influencing the “Grade.”
Note that the direction of arrows tells us about cause-effect relationships — “Intelligence” affects the “SAT” score, but the “SAT” score does not influence the “Intelligence.”
Finally, let’s look at the tables associated with each of the nodes. Formally, these are called conditional probability distributions (CPDs).
Сonditional probability distributions
The CPDs for “Difficulty” and “Intelligence” are fairly simple, because these variables do not depend on any of the other variables. The tables basically encode the probabilities of these variables, taking 0 or 1 as values. As you might have noticed, the values in each of the tables must sum to 1.
Next, let’s look at the CPD for “SAT.” Each row corresponds to the values that its parent (“Intelligence”) can take, and each column corresponds to the values that “SAT” can take. Each cell has the conditional probability p(SAT=s | Intelligence=i), that is, given that the value of “Intelligence” is i, what is the probability of the value of “SAT” being s.
For instance, we can see that p(SAT=s¹ | Intelligence = i¹) is 0.8, that is, if the intelligence of the student is high, then the probability of the SAT score being high as well is 0.8. On the other hand, p(SAT=s⁰ | Intelligence = i¹), which encodes the fact that if the intelligence of the student is high, then the probability of the SAT score being low is 0.2.
Note that the sum of values in each row is 1. That makes sense because given that Intelligence=i¹, the SAT score can be either s⁰ or s¹, so the two probabilities must add up to 1. Similarly, the CPD for “Letter” encodes the conditional probabilities p(Letter=l | Grade=g). Because “Grade” can take three values, we have three rows in this table.
The CPD for “Grade” is easy to understand with the above knowledge. Because it has two parents, the conditional probabilities will be of the form p(Grade=g | Difficulty=d, SAT=s), that is, what is the probability of “Grade” being g, given that the value of “Difficulty” is d and that of “SAT” is s. Each row now corresponds to a pair of values of “Difficulty” and “Intelligence.” Again, the row values add up to 1.
An essential requirement for Bayesian networks is that the graph must be a directed acyclic graph (DAG).
Markov networks: undirected graphical models
Here’s a simple example of a Markov network:
In the interest of brevity, we will look at this abstract graph, where nodes A, B, etc. do not directly correspond to real-world quantities, as above. Again, the edges represent the interaction between variables. We see that while A and B directly influence each other, A and C do not. Note that Markov networks do not need to be acyclic, as was the case with Bayesian networks.
Just as we had CPDs for Bayesian networks, we have tables to incorporate relations between nodes in Markov networks. However, there are two crucial differences between these tables and CPDs.
First, the values do not need to sum to one, that is, the table does not define a probability distribution. It only tells us that configurations with higher values are more likely. Second, there is no conditioning. It is proportional to the joint distribution of all the variables involved, as opposed to conditional distributions in CPDs.
The resulting tables are called “factors” or “potential functions,” and denoted using the Greek symbol 𝜑. So, for example, we can use the following potential function to capture the relation between three variables A, B, and C, where C is the “soft” XOR of A and B, that is, C is likely to be 1 if A and B are different, and C is likely to be 0 if A and B are identical:
In general, you will have a potential function defined for each maximal clique in the graph.
The graph structure and the tables succinctly capture the joint probability distribution over the random variables.
A question you may have at this point is, why do we need directed as well as undirected graphs? The reason is that for some problems, it is more natural to represent them as directed graphs, such as the student network above, where it is easy to describe the causal relationship between variables — the intelligence of the student affects the SAT score, but the SAT score does not affect the intelligence of the student (although it may tell us about the intelligence of the student).
For other kinds of problems, say, images, you may want to represent each pixel as a node, and we know that neighboring pixels affect each other, but there is no cause-effect relationship between the pixels; instead, the interaction is symmetric. So, we use undirected graphical models in such cases.
The problem setting
We’ve been discussing graphs and random variables and tables so far, and you may be thinking what the point of all this is? What are we actually trying to do? Where is machine learning in all this — data, training, prediction, etc.? This section will address these questions.
Let’s go back to the student network. Suppose we have the graph structure with us, which we can create from our knowledge of the world (referred to as “domain knowledge” in machine learning). But we don’t have CPD tables, only their sizes. We do have some data though — for ten different classes in a university, we have a measure of their difficulty.
Also, we have data about each student in each of the courses — their intelligence, what their SAT score was, what grade they got, and whether they got a good letter from the professor. From this data, we can estimate the parameters of the CPDs. For instance, the data might show that students with high intelligence often get good SAT scores, and we might be able to learn from it that p(SAT=s1 | Intelligence = i1)is high. This is the learning phase. We will soon see how we do this parameter estimation in Bayesian and Markov networks.
Now, for a new data point, you will observe some of the variables, but not all. For example, in the graph below, you know about the difficulty of a course and a student’s SAT score, and you want to estimate the probability of the student getting a good grade. (By now, you already have values in the tables from the learning phase.)
While we don’t have a CPD that gives us that information directly, we can see that a high SAT score from the student would suggest that the student is likely intelligent, and consequently, the probability of a good grade is high if the difficulty of the course is low, as shown using the red arrows in the above image. We may also want to estimate the probability of multiple variables simultaneously, like what is the probability of the student getting a good grade and a good letter?
The variables with known values are called “observed variables,” while those whose values are unobserved are called “hidden variables” or “latent variables.” Conventionally, observed variables are denoted using grey nodes, while latent variables are denoted using white nodes, as in the above image. We may be interested in finding the values of some or all of the latent variables.
Answering these questions is analogous to prediction in other areas of machine learning — in graphical models, this process is called “inference.”
While we used Bayesian networks to describe the terminology above, it applies as is to Markov networks as well. Before we get into the algorithms for learning and inference, let’s formalize the idea that we just looked at — given the value of some node, which other nodes can we get information about?
The graph structures that we’ve been talking about so far actually capture important information about the variables. Specifically, they define a set of conditional independences between the variables, that is, statements of the form — “If A is observed, then B is independent of C.” Let’s look at some examples.
In the student network, let’s say you know that a student had a high SAT score. What can you say about her grade? As we saw earlier, a high SAT score suggests that the student is intelligent, and therefore, you would expect a good grade. What if the student has a low SAT score? In this case, you would not expect a good grade.
Now, let’s say that you also know that the student is intelligent, in addition to her SAT score. If the SAT score was high, then you would expect a good grade. What if the SAT score was low? You would still expect a good grade because you know that the student is intelligent, and you would assume that she just didn’t perform well enough on the SAT. Therefore, knowing the SAT score doesn’t tell us anything if we see the intelligence of the student. To put this as a conditional independence statement, we would say — “If Intelligence is observed, then SAT and Grade are independent.”
We got this conditional independence information from the way these nodes were connected in the graph. If they were connected differently, we would get different conditional independence information.
Let’s see this with another example.
Let’s say you know that the student is intelligent. What can you say about the difficulty of the course? Nothing, right? Now, what if I tell you that the student got a bad grade on the course? This would suggest that the course was hard because we know that an intelligent student got a bad grade. Therefore we can write our conditional independence statement as follows — “If Grade is unobserved, then Intelligence and Difficulty are independent.”
Because these statements capture an independence between two nodes subject to a condition, they are called conditional independences. Note that the two examples have opposite semantics — in the first one, the independence holds if the connecting node is observed; in the second one, the independence holds if the connecting node is unobserved. This difference is because of the way the nodes are connected, that is, the directions of arrows.
We will not go into all possible cases in the interest of brevity, but it is straightforward to come up with these using intuition as above.
In Markov networks, we can use a similar intuition, but because there are no directional edges (arrows), the conditional independence statements are relatively simple — if there is no path between nodes A and B such that all nodes on the path are unobserved, then A and B are independent. Said differently, if there is at least one path from A to B where all intermediate nodes are unobserved, then A and B are not independent.
We will look at the details of how parameter estimation and inference are done in the second part of this blog post. For now, let’s look at an application of Bayesian networks, where we can use the idea of conditional independence we just learned.
Application: Monty Hall Problem
You must have seen some version of this in a TV game show:
The host shows you three closed doors, with a car behind one of the doors and something invaluable behind the others. You get to pick a door. Then, the host opens one of the remaining doors, and shows that it does not contain the car. Now, you have an option to switch the door, from the one you picked initially to the one that the host left unopened. Do you switch?
Intuitively, it appears that the host did not divulge any information. It turns out that this intuition is not entirely correct. Let’s use the new tool in our arsenal — graphical models — to understand this.
Let’s start by defining some variables:
- D : The door with the car.
- F : Your first choice.
- H : The door opened by the host.
- I : Is F = D?
D, F, and H take values 1, 2, or 3 and I takes values 0 or 1. D and I are unobserved, while F is observed. Until the host opens one of the doors, H is unobserved. Therefore, we get the following Bayesian network for our problem:
Note the directions of arrows — D and F are independent, I clearly depends on D and F, and the door picked by the host H also depends on D and F. So far, you don’t know anything about D. (This is similar to the structure in the student network, where knowing the intelligence of the student does not tell you anything about the difficulty of the course.)
Now, the host picks a door H and opens it. So, H is now observed.
Observing H does not tell us anything about I, that is, whether we have picked the right door. That is what our intuition tells us. However, it does tell us something about D! (Again, drawing analogy with the student network, if you know that the student is intelligent, and the grade is low, it tells you something about the difficulty of the course.)
Let’s see this using numbers. The CPD tables for the variables are as follows (This is when no variables have been observed.):
The tables for D and F are straightforward — the door with the car could be any door with equal probability, and we pick one of the doors with equal probability. The table for I simply says that I=1 when D and F are identical, and I=0 when D and F are different. The table for H says that if D and F are equal, then the host picks one door from the other two with equal probability, while if D and F are different, then the host picks the third door.
Now, let’s assume that we have picked a door, that is, F is now observed, say F=1. What are the conditional probabilities of I and D, given F?
Using these equations, we get the following probabilities:
These numbers make sense — so far, the probability that we have picked the correct door is ⅓ and the car could still be behind any door with equal probability.
Now, the host opens one of the doors other than F, so we observe H. Assume H=2. Let’s compute the new conditional probabilities of I and D given both F and H.
Using the above equations, we get the following probabilities:
Therefore, we don’t know anything additional about I — our first choice is correct still with probability ⅓, and this is what our intuition tells us. However, we now know that the car is behind door 3 with probability ⅔, instead of ⅓.
So, if we switch, we get the car with probability ⅔; if we don’t, we get the car with probability ⅓.
We could have gotten the same answer without using graphical models too, but graphical models give us a framework that scales well to larger problems.
In this PGM tutorial, we looked at some basic terminology in graphical models, including Bayesian networks, Markov networks, conditional probability distributions, potential functions, and conditional independences. We also looked at an application of graphical models on the Monty Hall problem.
In the second part of this blog post, we will look at some algorithms for parameter estimation and inference, and another application. So, stay tuned!