Abstract- PageRank is a numeric value that represents how

important a page is on the web. They are used for assigning numerical

weightings to hyperlinked documents (or web pages) indexed by a search engine.

A page rank results from a ballot among all the other pages on the World Wide

Web about how important a page is. Here, we examine all the hyperlinks in a web

page which is

counted as a vote of support. The PageRank algorithm page

is defined recursively and depends on the number and PageRank metric of all

pages that link to it called as incoming links. A page that is linked by many

pages with high rank receives a high rank itself. If there are no links to a

web page there is no support of this specific page. It uses some logarithmic

scale. Here, we program in such a way so that it crawls the entire website and

pulls a series of pages into the database, recording the links between the

pages. The spider chooses randomly amongst all the non-visited links across all

the webs. It matters because it is one

of the factors that

determines a page’s ranking in the search results.

Keywords- PageRank, World Wide Web, Links, Pages.

I. INTRODUCTION

With a sudden

change in growth of the world-wide web exceeding 800 million pages. Web pages

are increasing day by day. These web pages create many challenges for

information retrieval. It is very huge and heterogeneous. In Current scenario

there are over 150 million web pages with a doubling life of less than one

year. More importantly, the web pages are extremely diverse.

In

addition, Page Rank is extensively used for ranking web pages in order of

relevance 1. PageRank works by counting the number of links to a page to

determine a rough estimate of how important the website is. Page Rank (determines

the importance of webpages based on link structure). Solves a complex system of

score equations. The PageRank algorithm outputs a probability distribution used

to represent the likelihood that a person randomly clicking on links will

arrive at any particular page 1. PageRank can be calculated for collections

of documents of any size. The main objective of our project is to determine the

rank of the webpage and thus, determine the importance of a web page. PageRank

works by counting the number of links to a page to determine a rough estimate

of how important the website is.

However, unlike flat document

collections, the World Wide Web is hypertext and provides considerable

auxiliary information on top of the text of the web pages, such as link

structure and link text. In this mini project, we take advantage of the link

structure of the Web to produce a global importance ranking of every web page.

This ranking, called PageRank, helps search engines and users quickly make

sense of the vast heterogeneity of the World Wide Web. There are different

PageRank algorithms like PageRank (PR), WPR (Weighted PageRank), HITS

(Hyperlink- Induced Topic Search) algorithms these algorithms used for

Information Retrieval 2.

Page

ranking Algorithm:

This algorithm

is used by all the search engines. It is a method to rank web pages giving to

it a numeric value that represents their importance. Based on the link

structure of the web a page X has a

high rank if:

1)

It has many In-links

or few but highly ranked.

2)

Has few Out-links.

II. LITERATURE SURVEY

Although there is already

a large literature on academic citation analysis, there are a number of

significant differences between web pages and the academic publications.

Referred from “Jugo Cho, Hector Garcia-Molina, and

Lawrence Page. Efficient crawling through the URL ordering. In to Appear:

Proceedings of the Seventh International Web

Conference (WWW 98), 1998”. Unlike

academic papers which are scrupulously reviewed, web pages proliferate free of

quality control or publishing costs. With a simple program, huge numbers of

pages can be created easily, artificially inflating citation counts. Referred

from “Ron Weiss, Bienvenido Velez, Mark A. Sheldon, Chanathip Manprempre, Peter

Szilagyi, Andrzej Duda, and David K. Gifford. HyPursuit: A hierarchical network

search engine that exploits content-link hypertext clustering. In Proceedings

of the 7th ACM Conference on Hypertext, pages 180-193, New York, 16-20 March

1996. ACM Press.”. Because the Web environment contains competing profit

seeking ventures, attention getting strategies evolve in response to search

engine algorithms. For this reason, any evaluation strategy which counts

replicable features of web pages is prone to manipulation. Ellen Spertus.

Parasite: Mining structural information on the web. In Proceedings of the Sixth

International WWW Conference, Santa Clara USA, April, 1997, 1997. It is obvious to try to apply standard

citation analysis techniques to the web’s hyper textual citation structure. So,

a major page like http://www.yahoo.com/ will have tens of thousands of

backlinks (or citations) pointing to it. This fact that the Yahoo home page has

so many backlinks generally imply that it is quite important. Indeed, many of

the web search engines have used backlink count as a way to try to bias their

databases in favour of higher quality or more important pages. However, simple

backlink counts have a number of problems on the web. Some of these problems

have to do with characteristics of the web which are not present in normal

academic citation databases. International Journal of Advanced Research in

Computer Engineering & Technology (IJARCET) Volume 4 Issue 6, June 2015. Page

Rank (determines the importance of webpages based on link structure). Solves a

complex system of score equations. The PageRank algorithm outputs a probability

distribution used to represent the likelihood that a person randomly clicking

on links will arrive at any particular page. Further, academic papers are well

defined units of work, roughly similar in quality and number of citations, as

well as in their purpose – to extend the body of knowledge. Web pages vary on a

much wider scale than academic papers in quality, usage, citations, and length.

This is because the simplicity of creating and publishing web pages results in a

large fraction of low quality web pages that users are unlikely to read.

III. PROPOSED METHODOLOGY

The proposed

system is based on hyperlinks. Here, we examine all the hyperlinks in a page

which is counted as a vote of support. The Page Rank of a page is defined

recursively and depends on the number and Page Rank metric of all pages that

link to it called as incoming links. A page that is linked by many pages with

high rank receives a high rank itself. If there are no links to a web page

there is no support of this specific page. It uses some logarithmic scale. Page

Rank of a web page is a numerical number representing the importance of that

web page based on the number of inbound links. The basic concept of PageRank is

that the importance of a page is directly proportional to the number of web

pages linking to that page. So, Page Rank algorithm considers a page more

important if large number of other web pages are linking to that page or if

links are coming from some of most important and popular web pages. Page Rank

of whole website is not valid because page rank is associated with every web

page on the web.

1) We

have implemented PageRank using traditional PageRank algorithm.

2) Technology

used to develop PageRank is python.

3) Apart

from ranking web pages we have created a graph where we can visualize the

webpages which are interlinked with other pages.

4)

We can identify all the web pages which has highest and

lowest importance and we can also open them.

User

Query

Fig. 1.

System Architecture

We convert

each URL into a unique integer, and store each hyperlink in a database using

the integer IDs to identify pages 3. First, we sort the link structure by

Parent ID. Then dangling links are removed from the link database for reasons

discussed above (a few iterations removes the vast majority of the dangling

links). We need to make an initial assignment of the ranks. This assignment can

be made by one of several strategies. If it is going to iterate until

convergence, in general the initial values will not affect final values, just

the rate of convergence. But we can speed up convergence by choosing a good

initial assignment. We believe that careful choice of the initial assignment

and a small finite number of iterations may result in excellent or improved

performance.

Memory is allocated for the weights for every

page. If insufficient RAM is available to hold all the weights, multiple passes

can be made (our implementation uses half as much memory and two passes). The

weights from the current time step are kept in memory, and the previous weights

are accessed linearly on disk. Also, all the access to the link database, A, is

linear because it is sorted. Therefore, A can be kept on disk as well. Although

these data structures are very large, linear disk access allows each iteration

to be completed in about 6 minutes on a typical workstation. After the weights

have converged, we add the dangling links back in and precompute the rankings.

Note after adding the dangling links back in, we need to iterate as many times

as was required to remove the dangling links. Otherwise, some of the dangling

links will have a zero weight. With less strict convergence criteria, and more

optimization, the calculation could be much faster. Or, more efficient

techniques for estimating eigenvectors could be used to improve performance.

However, it should be noted that the cost required to compute the PageRank is

insignificant compared to the cost required to build a full text index. One of

the interesting ramifications of the fact that the PageRank calculation

converges rapidly is that the web is an expander-like graph.

Fig. 2.

Links structure of the webpage

Every page has some number of

forward links (out edges) and backlinks. We can never know whether we have

found all the backlinks of a particular page but if we have downloaded it, we

know all its forward links at that time. Web pages vary greatly in terms of the

number of backlinks they have. For example, the Netscape home page has 62,804

backlinks in our current database compared to most pages which have just a few

backlinks. Generally, highly linked pages are more important” than pages

with few links. The reason that PageRank is interesting is that there are many

cases where simple citation counting does not correspond to our common-sense

notion of importance. For example, if a webpage has a link other Yahoo home

page, it may be just one link but it is a very important one. This page should

be ranked higher than many pages with more links but from obscure places.

PageRank is an attempt to see how good an approximation to importance”

can be obtained just from the link structure.

A. Propagation of Ranking through Links

1) Inbound links:

Inbound links (links into the site from the outside) are one way to

increase a site’s total Page Rank. The other is to add more pages. The linking

page’s Page Rank is important, but so is the number of links going from that

page. Once the Page Rank is injected into your site, the calculations are done again

and each page’s Page Rank is changed. Depending on the internal link structure,

some pages’ Page Rank is increased, some are unchanged but no pages lose any

Page Rank. It is beneficial to have the inbound links coming to the pages to

which you are channelling your Page Rank. A Page Rank injection to any other

page will be spread around the site through the internal links. The important

pages will receive an increase, but not as much of an increase as when they are

linked to directly. The page that receives the inbound link makes the biggest

gain.

2) Outbound links:

Outbound

links are a drain on a site’s total Page Rank. They leak Page Rank. To counter

the drain, try to ensure that the links are reciprocated. Because of the Page

Rank of the pages at each end of an external link, and the number of links out

from those pages, reciprocal links can gain or lose Page Rank. We need to take

care when choosing where to exchange links. When Page Rank leaks from a site

via a link to another site, all the pages in the internal link structure are

affected. The page that you link out from makes a difference to which pages

suffer the most loss. Without a program to perform the calculations on specific

link structures, it is difficult to decide on the right page to link out from,

but the generalization is to link from the one with the lowest Page Rank. Many

websites need to contain some outbound links that are nothing to do with Page

Rank. Unfortunately, all ‘normal’ outbound links leak Page Rank. But there are

‘abnormal’ ways of linking to other sites that don’t result in leaks. Page Rank

is leaked when Google recognizes a link to another site. The answer is to use

links that Google doesn’t recognize or count. These include form actions and

links contained in JavaScript code.

Fig. 3. Propagation of Links

3) Dangling Links:

One issue with this model is dangling

links. Dangling links are simply links that point to any page with no outgoing

links. They affect the model because it is not clear where their weight should

be distributed, and there are a large number of them. Because dangling links do

not affect the ranking of any other page directly, we simply remove them from

the system until all the PageRank’s are calculated. After all the page Ranks

are calculated, they can be added back in, without affecting things

significantly. Notice the normalization of the other links on the same page as

a link which was removed will change slightly, but this should not have a large

effect.

4) Damping factor:

The Page Rank theory holds that

even an imaginary surfer who is randomly clicking on links will eventually stop

clicking. The probability, at any step, that the person will continue is a

damping factor d. various studies

have tested different damping factors, but it is generally assumed that the

damping factor will be set around 0.8. The damping factor is subtracted from 1

(and in some variations of the algorithm, the result is divided by the number

of documents in the collection) and this term is then added to the product of

the damping factor and the sum of the incoming Page Rank scores.

So,

any page’s Page Rank is derived in large part from the Page Ranks of other

pages. The damping factor adjusts the derived value downward. Google

recalculates Page Rank scores each time it crawls the Web and rebuilds its

index. The Page Rank value of a page reflects the chance that the random surfer

will land on that page by clicking on a link. If a page has no links to other

pages, it becomes a sink and therefore terminates the random surfing process.

However, the solution is quite simple. If the random surfer arrives at a sink

page, it picks another URL at

random and continues surfing again. When calculating Page Rank, pages with no

outbound links are assumed to link out to all other pages in the collection.

Their Page Rank scores are therefore divided evenly among all other pages. In

other words, to be fair with pages that are not sinks, these random transitions

are added to all nodes in the Web, with a residual probability of usually d = 0.85, estimated from the frequency

that an average surfer uses his or her browser’s bookmark feature. So, the

equation is as follows:

where p1,p2,…,pN are the pages under

consideration, M(pi) is the set of pages that link to pi, L(pj) is the number of outbound

links on page pj, and N is the total number of page. Based on

the discussion above, we give the following intuitive description of PageRank:

a page has high rank if the sum of the ranks of its backlinks is high. This

covers both the case when a page has many backlinks and when a page has a few

highly ranked backlinks.

B. Computing Page Rank

Assume a small universe of four web

pages: A, B, C and D. The initial approximation of Page

Rank would be evenly divided between these four documents. Hence, each document

would begin with an estimated Page Rank of 0. In the original form of Page Rank

initial values were simply 1. This meant that the sum of all pages was the

total number of pages on the web. Later versions of Page Rank would assume a

probability distribution between 0 and 1. Here we’re going to simply use a

probability distribution hence the initial value of 0. If pages B, C,

and D each only link to A, they would each confer 0 Page Rank

to A. All Page Rank i.e. PR () in this simplistic system would

thus gather to A because all links

would be pointing to A

PR (A) = PR (B) +PR(C) +PR (D).

Fig. 4. Webpages and

its links

The formula of calculating the points is as following

where

• PR

(pi) is the page rank of page pi.

• PR

(pj) is page rank of page pj which link to page pi.

• L

(pj) is number of outbound links on page pj.

• N

is the number of webpages.

• D

is a damping factor usually set to 0.85.

Fig. 5. Calculating rank using formula (Iteration-1)

If you apply formula or our example using the page rank

algorithm the following operations takes place PageRank of

A = 0.15 + 0.85 * (PageRank (B)/outgoing

links (B) + PageRank (…)/outgoing link (…)

• Calculation

of A with initial ranking 1.0 per page:

• If

we use the initial rank value 1.0 for A, B and C we would have the following

output

• We

have skipped page D in the result, because it is not an existing page.

A:1.425

B:0.15

C: 0.15

Fig. 6. Calculating rank using

PageRank formula (iteration-2)

• Calculation

of A with ranking from ITERATION-

1:

• If

we use these ranks as input and calculate it again:

A:0.34125

B:0.15

C: 0.15

We see that the page rank of page A

is reduced. The PageRank is based on previous calculations and will get more

accurate after more runs. You can add new pages, new links in the future and

calculate the new ranks. This is

one of the tools which search engines use to create their

index. We are going to do this with a set of Web pages.

IV. CONCLUSION

We implemented PageRank project using python

programming. In this paper, we have taken on the audacious task of condensing

every page on the World Wide Web into a single number, its PageRank. Using

PageRank, we are able to order search results so that more important and

central Web pages are given preference. In experiments, this turns out to

provide higher quality search results to users. The intuition behind PageRank

is that it uses information which is external to the Web pages themselves –

their backlinks, which provide a kind of peer review. Furthermore, backlinks

from “important” pages are more significant than backlinks from average

pages.

V. FUTURE SCOPE

Presently our PageRank is limited to calculate the rank of

different web pages in a single website, later if required we can extend it

further by calculating rank globally.

We can develop different ranking algorithm and we can also develop different

connectors. PageRank algorithm can be implemented in different languages like

Hadoop, Data mining, and Java. User can use different connectors and ranking

algorithms to rank different types of data present in different webpages.