Page Rank Algorithm
The PageRank algorithm is a central part of Google’s search engine. It was developed by Larry Page and Sergey Brin, who were then Stanford graduate students. The algorithm was designed to help Google Index web pages and provide quality search results. In a nutshell, PageRank is a “vote” regarding a page’s importance cast by all the other pages on the Internet. Support for a page is shown by links to that page. Without a link, there is no support (albeit it is more of a vote against the page than an abstention).
The PageRank algorithm assigns a numerical “weight” to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of “measuring” its relative importance within the set. A document’s inherent importance is determined by the sum total of weights of all the documents that link to it. This article will explore how the PageRank algorithm works and its implications for search engine optimization (SEO).
What it Is
Google’s PageRank algorithm is one of the methods the search engine uses to determine a page’s relevance and importance. Named after Google co-founder Larry Page, the PageRank algorithm uses a system of weights and measures to assign a numerical weighting to each element of a webpage. This weighting is then used to calculate a page’s overall importance.
The PageRank algorithm looks at the quality and quantity of links pointing to a webpage as well as the quality of the pages linking to it. A link from a high-quality, relevant website will carry more weight than a link from a low-quality or irrelevant website. The algorithm also takes into account the anchor text used in links pointing to a webpage — the text that appears between the opening and closing tags. The anchor text can give clues about the topic of the linked-to page, which can further help Google determine how relevant and important that page is.
Understanding The Algorithm
In order to calculate PageRank, Google uses a complex algorithm that takes into account a number of factors. Some of these factors include the number and quality of inbound links to a page, the outbound link structure of a page, and the overall structure of the website.
Google’s PageRank algorithm is constantly being updated and refined, so it’s difficult to say exactly how it works. However, we do know that Google looks at a variety of factors when determining PageRank.
Some of the things that Google takes into consideration include:
- The number of inbound links to a page
- The quality of inbound links to a page
- The outbound link structure of a page
- The overall structure of the website
PageRank examines the page that casts the vote as well as the total number of votes, though. Votes made by “important” pages carry more weight and influence the “importance” of other pages. This is exactly how social network rank and status work.
One thing of note is with each link to any location on Page B, Page A loses some of the PageRank that Page B originally gave it. This means that a page’s PageRank is effectively a measurement of its vote; even if that vote is distributed over one link, two links, or many more, the overall voting power of the page will remain constant.
Problems in the Algorithm
Dangling Nodes:
All pages’ PageRank values are initially set to 1. The calculation of PageRank is continued until PageRank scores converge to the final stable values because PageRank is an iterative process. However, there is an issue with this definition of PageRank. In reality, some websites will reject to share their PageRank scores with other websites as they continue to amass them after each iteration. As a result, PageRank will eventually drop to zero at the end of the iterative phase. The web sites (nodes) that don’t have any outbound links in the graph are the cause of this issue. These are referred to as dangling nodes i.e. nodes with zero outdegree (also known as hanging pages, zero-out-link and web frontier).
To address dangling nodes, Larry Page and Sergey Brin make two changes: first, dangling nodes are replaced with nodes linked to every other node; and second, a damping factor is added, which reflects random walks of the graph’s random surfer process.
Where to start?
Each page’s PageRank (PR) is influenced by the PR of the pages that point to it. But until the pages leading to them have their PR calculated and so on, we won’t know what PR those pages have. Additionally, it seems hard to perform this computation when you consider that page links can create circles!
So how exactly is it implemented?
If the algorithm an create an infinite circle where we are not sure how to begin- it may seem impossible to implement this is practical life.
However, things aren’t all that horrible. Observe this passage from the Google paper about the same:
A straightforward iterative approach can be used to calculate PageRank, also known as PR(A), which is the primary eigenvector of the web’s normalized link matrix.
That allows us to determine a page’s PR without having to wait to find out what the final PR of the other pages will be. That may seem unusual, but in essence, each time we perform the computation, we have a better idea of the eventual result. Therefore, all we need to do is keep track of each value we compute and keep repeating the calculations until the numbers stop fluctuating significantly.
Any number of documents in a collection can be used to calculate PageRank. Several research publications make the supposition that, at the start of the computational process, the distribution is distributed equally among all documents in the collection. In order for the approximation PageRank values to more accurately reflect the theoretical actual value, the PageRank computations must make numerous passes, referred to as “iterations,” across the collection.
The algorithm is actually quite simple: it uses a recursive process to analyze web pages and their links. The starting point is a set of web pages that we call the “corpus”. For each page in the corpus, the algorithm looks at all the links on that page, and calculates a “score” for each page that is linked to. The score for a given page is simply the number of links on that page divided by the total number of links on all pages.
So, if Page A has two links to Page B, and Page B has four links to Page C, then the score for Page C would be 2/4, or 0.5.
The algorithm then “damps” these scores, so that they don’t fluctuate too much. The damping factor is typically set to 0.85. This means that the score for any given page is equal to 0.85 times the sum of the scores of all the pages that link to it.
The formula for the same is:
PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) where,
- PR(A) — Page Rank of page A
- PR(Ti) — Page Rank of pages Ti which link to page A
- C(Ti) — number of outbound links on page Ti
- d — damping factor which can be set between 0 and 1
Consider the example below:
In the aforementioned example, web page A has a hyperlink pointing to web pages B and C. Web page C lacks any outward links, yet web page B contains a backlink pointing to it. We already know that C will have the highest PageRank and A will have the lowest PageRank based on this.
It’s critical to keep in mind that the PageRank algorithm is iterative. This is so because each page’s PageRank is influenced by the PageRank of the pages that point to it. You get a little bit closer to the solution with each iteration of the calculation.
Here would be the results from applying the page rank formula for the first iteration assuming damping factor d as 0.85:
- Page A: (1–0.85) = 0.15
- Page B: (1–0.85) + (0.85) * (0.15 / 2) = 0.213745
- Page C: (1–0.85) + (0.85) * (0.15 / 2) + (0.85) * (0.21375 / 1) = 0.3954375
The calculation has only undergone a single iteration. The process must be repeated until the average PageRank for all pages is 1.0 in order to determine the final PageRank of each page.
Improvements in the Algorithm
The Page Rank algorithm is an essential part of the search engine ranking process. However, it is not perfect. There are a number of ways that the algorithm can be improved.
One way to improve the Page Rank algorithm is to incorporate user feedback into the ranking process. This can be done by tracking the click-through rates of results and using this information to adjust the rankings.
Another way to improve the Page Rank algorithm is to better handle web pages that contain a lot of content that is not easily parsed by the algorithm. This can be done by increasing the amount of weight given to links from these types of pages.
Finally, the Page Rank algorithm can be improved by making use of social media signals. This can be done by incorporating likes, shares, and other social interactions into the ranking process.
Current Usage
Over the years, the Page Rank algorithm has undergone several improvements. These improvements are designed to make the algorithm more accurate and effective at ranking websites.
In general, websites with a higher PageRank will appear higher up in Google’s search results than those with a lower PageRank. However, there are other factors that also affect ranking, so a high PageRank does not guarantee a top spot on Google.
Some of the recent improvements to the Page Rank algorithm include:
- improved handling of “nofollow” links
- increased weight given to backlinks from high-quality websites
- penalization of sites that engage in link buying or selling
These improvements have made the Page Rank algorithm more effective at ranking websites in Google’s search engine results.