Is Trust Rank a Markov chain?

Posted by Haoxuan (Horace) on 2020-03-06

So literally this is a note for answering a Zhihu question (the question itself is in Chinese), but hopefully, if it can help someone. The question asked why the “initial seeds” in Trust Rank could affect the final ranking results. Because we know that if it is a Markov chain, the final convergence value of a Markov process should only depending on the transition matrix, which normally considered as the link topology in between web pages (note this “which”, the question comes from here).

Probably this is a boring topic with no-body-care models, but still, we can have a quick discussion on this. Before start, I assume the reader has some basic knowledge in linear algebra, and know the basics about page rank and trust rank.

The short answer to this question: yes (if I do not make any mistakes), Trust Rank is a Markov chain, with the selected initial seed vector as an essential part to determine its transition matrix.

I would start with page rank to have uniformed math annotations and then move to trust rank very briefly. The Page Rank part comes from the link.

Credit to Prof. Ryan Tibshirani in Stanford

Suppose we have $n$ websites, then what a search engine does is to sort these $n$ websites. Assuming that $p_i$ is the score (weight) of the website, $p = \left(\begin{array}{cc} p_1 \\ p_2 \\ \vdots \\ p_n \\ \end{array} \right)$. Naturally, search engines can rank the websites according to this score $p$.

Broken Rank

A website with many incoming links is likely to be a good website (of course not absolute). So we define the link matrix as follows:

Let $L_{ij} = \begin{cases} 1, \text{there is link from website $j$ pointing to website $i$}\\ 0, \text{otherwise} \end{cases}$,
$ L = \left( \begin{array}{cc} L_{11} & L_{12} & \dots & L_{1n} \\ L_{21} & L_{22} & \dots & L_{2n} \\ \vdots \\ L_{n1} & L_{n2} & \dots & L_{nn} \\ \end{array} \right)$

Let $m_j = \sum_{k = 1}^j L_{kj}$, i.e. all outbound links for site $j$, $M = \left( \begin{array}{cc} m_1 & 0 & \dots & 0 \\ 0 & m_2 & \dots & 0 \\ \vdots \\ 0 & 0 & \dots & m_n \\ \end{array} \right)$

We can think of the weight $p_i$ of a website as the sum of the weights of the contributions of other websites to this website $i$, or the probability of jumping from other websites to this website. The weight of the contribution of other sites $j$ to this site $i$ can be simply deemed as $\frac{L_{ij}}{m_j} p_j$, that is, the ratio of this $L_{ij}$ link to all outbound links of site $j$ multiplied by the site $j$ ’s score. So we can recursively calculate $p_i = \sum_{j = 1}^{n} \frac{L_{ij}}{m_j} p_j$, and this can be written in matrix form $p = LM^{-1} p$

If we let $A = LM^{-1}$, the equation above can be written as $p = Ap$. We can also find in this equation that $p$ as the eigenvector of matrix $A$ when the eigenvalue is $1$. This equation can of course also be treated as a Markov chain. If we define $P(\text{go from } j \text{ to } i) = P_{ij}$, we have $A_ {ij} = L_{ij} / m_j$ i.e. $P_{ij} = \begin{cases} 1 / m_j, \text{if site } j \text{ has a link to } i \\ 0, \text{otherwise} \end{cases}$. Then $p^{(i + 1)}$ = $Ap^{(i)}$ and $A$ is the transition matrix. Note that here we are assuming that users click on links in a uniformly distributed way, or say, in a random way.

As we all know (or you may not know), the linking graph of the Internet is not strongly connected (octopus model), that is, some of the websites are self-contained, i.e. they have no links to external websites; and in some cases, some websites are incoming only or outgoing only. When the graph is not strongly connected, the solution for the eigenvector of the matrix $A$ given the eigenvalue of 1 is non-unique. (There is an example on Stanford’s slide… but I am not going to copy it here…)

Page Rank

What we can do to fix the issue in our “Broken Rank” is to assume that there is a small probability that the user may not jump to a new website through the outgoing link of the current website, but enter directly from the address bar. This allows that any websites can be reached directly and hence fixes the non-strong connectivity graph problem in the Broken Rank. The improved algorithm has a famous name called Page Rank. Broken Rank’s $p$ is calculated as follows: $p_i = \sum_{j = 1}^{n}\frac{L_{ij}}{m_j} p_j$, and the Page Rank correction for $p$ is $p_i = \frac{\left(1-d \right)}{n} + d \sum_{j = 1}^{n}\frac{L_{ij}}{m_j} p_j$, where $d$ is a constant less than $1$ and greater than $0$. Write it in matrix form: $p = (\frac{1-d}{n}E + dLM^{-1}) p$, where $E$ is an all $1$ matrix of size $n \times n$. The matrix form holds obviously. Because if we treat $p$ as a (discrete) probability distribution, then $Ep = \left(\begin{array}{cc} 1 \\ 1 \\ \vdots \\ 1 \end{array} \right) \stackrel{def}{=} e $. Here we define a $n \times 1$ all $1$ vector, $e$.
We can still think of the above equation as $p^{(i + 1)} = Ap^{(i)}$, but here the “new” $A_{\text{page rank}}$ is slightly changed compared with the $A_{\text{broken rank}}$ in the Broken Rank.

Trust Rank

Sorry for my long rambling, and the boring copying & pasting above. Finally, we could mention the Trust Rank. Still, I think those rambling is essential because it is important to unify the math symbols. Honestly, I have not read the Trust Rank paper carefully, so I would sincerely apologize for any mistakes or omissions below :)

The original papers of Trust Rank and almost all articles mentioning Trust Rank on Internet will use the following equation (in matrix form):$p = (1-d)t + dLM^{-1} p$, where $t$ is the seed vector, and other parts of the equation are same as the equation defined in Page Rank above. Here I am not going to explore how the $t$ vector comes. (well, I did not take a closer look… TBH.)
A fact is that we know $t^{t} E = e^{t}$. (Because the sum of the components of the seed vector $t$ is also $1$. Here the superscript $^{t}$ means transpose, for clear.)
Hence,
$$
p = (1 - d) t (e^t e / n) + dLM^{-1} p \\
p =\frac{(1 - d)}{n} t t^t E E p + dLM^{-1} p \\
p =(\frac{(1 - d)}{n} t t^t E E + dLM^{-1}) p \\
p =((1 - d) t t^t E + dLM^{-1}) p \\
p =((1 - d) t e^t + dLM^{-1}) p
$$
Obviously, for Trust Rank, we can also write the equation as $p^{(i + 1)}= A p^{(i)}$ ,and the value of $A$ needs to be determined by $t$.


留言: