1. Introduction In this paper, we discuss a graphical presentation of the similarity relationship between web pages. There have been several studies on the notion of similarity between web pages, but the similarity of web pages has not been widely utilized by users. We attribute this to the lack of an appropriate interface to navigate similar pages. We propose using a hyperbolic tree to represent the similarity relationship to allow users to see how the results of a search query are related to each other.
Currently Google provides the user with a simple search interface. When search results are presented to the user, each result also has a “similar pages” link that will display similar results. We build our interface on top of Google to take advantage of this similarity search feature. We present the results in a hyperbolic tree, using an interface that allows the user to navigate through the tree of similar pages.
2. GoogleAPI The googleapi provides a convenient way of accessing the Google search engine. It allows us to send queries, and retrieve the results through a well defined interface, saving us the task of parsing search results from a web page. The interface does not directly support searching for similar pages, it allows us to send queries formatted as “related:www.somewebsite.com” to fetch similar pages.
3. Page Similarity Assumptions In our study, we do not actually compute the similarity between 2 pages, but instead, make use of Google’s similarity functions to retrieve our results. Since we are more concerned with the representation of the similarity relationship than with the actual computation of the relationship, we do not concern ourselves with the actual methods that are used to compute the similarity. We do, however, make a few assumptions about the similarity relationships. In particular, we assume that:
i) A web page is maximally similar to itself.
Assumption (ii), which is a reflexivity assumption, is non-trivial, and not always true. For example, a query for similar pages of www.sausage.com yields Microsoft Frontpage’s website as the top result. However, querying Frontpage’s similar pages will not find sausage.com in the top 10 results.
As we will discuss later, it is not clear what the correct behavior of our application ought to be under these circumstances; we choose to leave this for future investigation.
4. Program Internals 4.1 Introduction Our application consists of an internal engine and a user interface. The engine uses Google’s api to build a graph of similar pages. It is parameterized by 2 values:
1. Fanout: Fanout limits the number of results we look at. While the absolute maximum limit on fanout is 10, there is little practical purpose to set fanout greater than 6 because that causes the graph to be too dense and hard to read.
2. Depth: The depth of the tree is kept to a certain value. When the depth of the tree falls below this value, the engine will automatically run queries to find pages similar to the leaves, to increase its depth. If the depth of the tree is greater than this value, the nodes that appear beyond this depth is not displayed.
The tree is created by an initial query that the user provides. This query forms the root node of the tree and is the only node in the tree that is not actually a search result. The engine then goes off and recursively builds the tree until its depth reaches the specified depth.
We built a simple java application on top of the engine that allows the user to perform simple searches and display a tree. The interface also allows the user to manipulate the tree by making a certain node become the root node, thus, rotating the tree.
4.2 Search/Rotation Algorithm The application supports the navigation of the hyperbolic tree via a MakeRoot method. MakeRoot makes one of the non-root nodes the root of the tree through a series of rotation/growth iterations. When the user invokes MakeRoot on a node, nodeA, MakeRoot is recursively called on the parent of nodeA, all the way up to the root node. This eventually makes the nodeA the root of the tree.
Consider the example illustrated in Figure 1. In Figure 1.1, we start off with a tree of depth 3 and fanout 3. Now suppose MakeRoot is called on node 1. The tree is rotated to make node 1 the parent of node 0, and node 0 the child of node 1, as shown in Figure 1.2. At this point, the rotation is complete. However, the depth of the tree is no longer uniform, so we need to grow the nodes 4, 5, and 6. Nodes 7, 9, 9, 10, 11 and 12 are then pruned from the tree because they exceed the required depth. Note that these nodes are only hidden, but not actually removed from the data structures, so they do not need to be rebuilt if the tree is rotated back to its original form. Note that the number of children of a node may be more than the specified fanout due to a rotation operation.
During a rotation, one or more child-parent relationship in the tree is reversed. This is only strictly correct when the reflexivity assumption of the similarity relationship mentioned earlier is true. We make this assumption because without it, the tree would become disjoint after a rotation. We feel it is important to keep the tree intact to allow the user to rotate the tree back into its original form later on.
4.3 Graphical Representation The similar pages are represented graphically using a simple tree structure. Nodes that are similar to each other are linked together using black lines. However, a complication arises due to the tendency of the similarity tree to have cycles.
When cycles occur, there are 2 choices, either actually drawing the cycle, or ignoring the existence of the cycle and drawing duplicate copies of nodes instead. We feel that since cycles are an innate properly of similarity relations, it is unacceptable to ignore them. However, the naïve method of just drawing all cycles uniformly makes the tree appear very cluttered.
Thus, we propose a representation of the tree where each node is linked by a solid line the first time it is drawn. Thereafter, each time another node points to it (thus creating a cycle), a lighter, dotted line is drawn based on the distance between the 2 nodes in the tree. Hence, long cycles will be represented by lighter lines, while short cycles will be represented by darker lines. We choose to define the distance between 2 nodes as the difference in the depth of their first appearance. This is one of many possible definitions and is a possible subject of future investigation.
5. Functionality 5.1 Installation The application can be downloaded from www.stanford.edu/~yeejiun/276project. All the contents of the file should be unzipped into a directory and the program is started with the following command:
5.2 User Interface Figure 2 shows the interface that is presented to the user. The user types a query into the query bos and clicks on the “Seach” button to initialize the tree. The application the draws the tree after performing the appropriate queries using the google api. The depth and fanout of the tree is parametized by the user options in the depth and fanout boxes.
Each node of the tree is shown with the URL of that node and a unique ID identifying the node. The user can navigate the tree by typing the node’s ID into the rightmost textbox and clicking “Make Root”. This causes the application to make the appropriate rotations and possibly run additionally queries to make that node the root of the tree.
6. Results Our project allows the user to see the similarity relationship between different search results. This provides a surprising insight about the kinds of results search engines provide. This is best illustrated through an example. We refer figure 2 again: this tree was created using a query on “dog”. We see from the figure that there is a large number of dotted lines between the nodes of the topmost branch of the tree (stemming from dogpile.com). This shows us that all the results in that branch are strongly “similar” to one another. By looking at the different nodes of that part of the tree, we quickly realize that they consist of search engines.
The other two top level search results also have some backward links, although less so. The tree thus tells the user some information about the search results that could not have been obtained through, or at least, would not have been as obvious in, a traditional search engine interface.
Suppose that the user finds dog.com to be the most interesting and wants to expand on that part of the tree. He types 3, the ID of dog.com, in the rightmost textbox and clicks “Make Root”, making dog.com the root of the tree, as shown in Fig 3. Now we observe that there are a lot more cross links throughout the tree, suggesting that the search results shown on the screen now are more similar to each other.
Now we consider another example. Searching for “Stanford University” yields the results as shown in Figure 4. Notice that there are a lot more dotted links in this example than in the previous one. This suggests that all the results shown are closely related to each other. This may allow the user to infer more information about the search results. One possible interpretation is that the search engine was able to find only one topic that maches the query. This may in turn suggests that the user can place greater confidence in the search results.
7. Related Work Hyperbolic trees have been employed commercially to visually display information in a tree form for easy navigation. An example of this is the hyperbolic tree product developed by Inxight Software which builds a hyperbolic tree on a website to allow users to navigate the website through the tree . However, there has not been any application of hyperbolic trees to display the similarity relationships between web pages.
Munzer and Buchard discusses using a 3 dimensional hyperbolic tree to visualize the structure of the world wide web in . However, they do not examine the specifics of visualizing similarity relationships. They solve the problem of visualizing cycles in trees through their 3D visualization. Also, the use of a 3D interface has significantly higher overhead than a 2D interface and might not be suitable in all applications.
There has also been other work done on visually displaying the results of web searches. Kartoo.com allows users to perform seaches and provides a graphical representation of the results Websites are represented by nodes and lines are drawn between nodes to show how they are semantically related to each other . However, it does not allow the user to examine or search for results similar to the ones that are already displayed on screen.
8. Conclusion We have proposed and implemented a new graphical representation of the similarity relationships between web search results. In doing so, we have found that there is additional information in search results which are not being used because of the lack of an appropriate interface. By providing the user with the tools to visualize the relationship between search results, we feel that we can enable the user to find a group of relevant search results more readily.
While both similarity relationships and hyperbolic trees have been studied in the past, to the best of our knowledge, our use of hyperbolic trees to represent similarity relationships is new. We have also proposed a novel method of reducing the clutter imposed by cycles in drawing hyperbolic trees by using lines of differing weights based the length of the cycles.
9. Future Work In our project, we made assumptions about the similarity relationships. The validity of these assumptions need to be studied. In particular, our tree representation of the similarity relationship is not strictly valid if the reflexivity assumption fails. Under such circumstances, other methods of modifying the tree may be more appropriate.
Our interface can also be improved. Due to time constraints, we were only able to build a simple interface to illustrate the concepts we have proposed. A more usable interface would allow the user to click directly on the tree to manipulate it. The user should also be able to open up the web results shown in the tree by directly clicking on those nodes.
References 1. Inxight Software. http://www.inxight.com/news/usability_study.html 2. Visualizing the Structure of the World Wide Web in 3D hyperbolic space. Tamara Munzner and Paul Burchard. ACM SIGGRAPH, New York, 1995.
3. Juan C. Dürsteler. Info@Vis. http://www.kartoo.fr/en/press/2002/press_aout/infovis.htm