Wednesday, March 21, 2012

Google PageRank Equation Derivation


We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:


PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

This is taken from Google's original paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine. They later go on to give an intuitive explanation of the formula:

PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back" but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the "random surfer" will get bored and request another random page.


So how does one get from their intuition to the formula? From their description, a user gets to A by either randomly choosing A from anywhere, or clicking a link to A from pages T1...Tn. We can state that as:
Probability of being at page A = Probability of: being anywhere and randomly choosing A, or being at page T1 and clicking link to A, or being at T2 and clicking link to A, ..., or being at Tn and clicking link to A.

Probability theory states that if you have two situations where either one or the other occurs, then the probability of either occurring is the sum of the probability of each occurring:

Pr(A or B occurring) = Pr(A occurring) + Pr(B occurring)

Also, the probability of both situations occurring is the product of each individually occurring:

Pr(A and B occurring) = Pr(A occurring) * Pr(B occurring)

Now we are ready to build the equation:
  1. Pr(A) = Pr(randomly clicking and reaching A or not randomly clicking and reaching A)
  2. = Pr(randomly clicking and reaching A) + Pr(not randomly clicking and reaching A)
  3. = 1-d + d * Pr(reaching A from the current page without randomly clicking)
  4. = 1-d + d * Pr(being at page T1 and clicking link to A, or being at T2 and clicking link to A, ..., or being at Tn and clicking link to A)
  5. = 1-d + d ( Pr(being at page T1 and clicking link to A) + Pr(being at T2 and clicking link to A) + ... + Pr(being at Tn and clicking link to A) )
  6. = 1-d + d ( Pr(being at page T1) * Pr(clicking link to A from T1) + Pr(being at page T2) * Pr(clicking link to A from T2) + ... + Pr(being at page Tn) * Pr(clicking link to A from Tn) )
  7. = 1-d + d ( Pr(T1)/C(T1) + Pr(T2)/C(T2) + ... + Pr(Tn)/C(Tn))