Text Corpus Exploration

We used AbstractsViewer in the process of writing the present paper. These examples illustrate the approach across several tasks. To get an initial foothold into the corpus (T1), we begin by searching for the general term text, which returns many (289) results, so we need to focus. In the Corpus Map we see regions in the lower left that are dense with hits, and we would like to understand these regions (T5) to know if they are likely to be a good source of targets (T5→T3). Hovering over these regions in the Corpus Map shows salient terms as a term-based description of their contents. This quickly confirms that these regions contain documents with text-related terms. To better understand the meanings of the regions, we can use the Region Matrix View to provide a more detailed term-based description that connects the terms to documents, and the Region List to provide an item-based description of the region by listing the documents with the most salient first. Changing the item salience metric (Sect. 4.1.4) can tune the description: we might use a neighbor-list metric to see the documents most central to the region, or a region-words-occurrence metric to see the documents that most connect with the salient words.
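The idea of a term-based region description can be sketched as follows. This is a minimal illustration, not AbstractsViewer's actual salience metric: it scores each term by how over-represented it is in a region relative to the whole corpus, using a simple frequency ratio. The function name and the toy documents are hypothetical.

```python
from collections import Counter

def region_salient_terms(region_docs, corpus_docs, top_k=3):
    """Rank terms by how over-represented they are in a region
    relative to the whole corpus (a simple frequency-ratio score;
    the paper's actual salience metrics may differ)."""
    region_counts = Counter(w for d in region_docs for w in d.split())
    corpus_counts = Counter(w for d in corpus_docs for w in d.split())
    n_region = sum(region_counts.values())
    n_corpus = sum(corpus_counts.values())
    score = {
        w: (c / n_region) / (corpus_counts[w] / n_corpus)
        for w, c in region_counts.items()
    }
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_k]]

# Toy corpus: a text-analysis region inside a broader collection.
region = ["text corpus document analysis", "document text mining"]
corpus = region + ["robot arm control", "network routing protocol", "robot motion model"]
print(region_salient_terms(region, corpus))
```

Hovering a region would surface the top-scoring terms, confirming at a glance whether the region is about text-related work.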

For a different entry into the corpus, we enter this paper’s abstract and perform a neighbor search. Single-document salient word highlighting suggests terms (T2) including similarity, document, and collection. A neighbor search (shown in Fig. 1) provides more connections. We observe that almost all neighbors are in a region identified above; hovering over the region reveals its relevant stems as document, text, and collect. One of the neighbors is far from the others. Hovering over this outlier’s region reveals that it has salient terms related to networking. Selecting it (as the second selection) reveals that all of its neighbors are concentrated in the same network region. However, viewing it in the Document View shows its similarities to our abstract: the salient terms are recommendation and similarity, suggesting that this paper [22] applies similar techniques to a different problem and is relevant (T4, T5→T3). Note that we would have been unlikely to find this paper using terms, as recommendation and similarity are common terms in the corpus with many uses.
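A neighbor search of this kind can be sketched with TF-IDF vectors and cosine similarity. This is a hedged, self-contained illustration of the general technique; the tool's actual weighting and tokenization may differ, and the example documents are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors for tokenized documents
    (a minimal sketch of one common weighting scheme)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    return [{w: c * math.log(n / df[w]) for w, c in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity of two sparse (dict) vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = lambda x: math.sqrt(sum(t * t for t in x.values())) or 1.0
    return dot / (norm(u) * norm(v))

def neighbors(query_idx, vecs, k=2):
    """Indices of the k documents most similar to the query document."""
    sims = [(cosine(vecs[query_idx], v), i)
            for i, v in enumerate(vecs) if i != query_idx]
    return [i for _, i in sorted(sims, reverse=True)[:k]]

docs = [s.split() for s in [
    "document similarity search over a text collection",
    "measuring similarity between document abstracts",
    "inverse kinematics for robot arms",
    "network traffic recommendation systems",
]]
vecs = tfidf_vectors(docs)
print(neighbors(0, vecs))
```

The abstract about document similarity retrieves the other similarity paper first; unrelated abstracts score near zero.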

The Serendip paper [2] appeared in the searches. As it was clearly an influence on the present work, we used it as an anchor for further exploration (T1). To provide context (T5), we searched for the terms topic model to see where similar papers fall across the Corpus Map. Selecting the Serendip paper brings up a number of interesting-looking papers. However, an outlier emerges both in the Corpus Map and in the Neighbor List View: Munzner’s Nested Model paper [56]. While we were excited that a paper that inspired our thinking was suggested as relevant, the explanation provided by the Document View shows that common words, particularly level and model, caused the similarity (T4). This example is shown in the figure below.

Example image

RelaxedIK

These examples are modified from work with our collaborators. This example illustrates the value of using a local surrogate approach and multiple distance metrics. A robotics laboratory makes extensive use of RelaxedIK, an algorithm for solving a common class of robotics problems (inverse kinematics). A researcher is interested in finding papers (T3) that provide different methods for, or new applications of, the existing method. In AbstractsViewer, they search for and select the RelaxedIK paper. The neighbors for the TFIDF metric form an extremely tight cluster in a region of the Corpus Map, with one outlier. Examining the titles makes the connections obvious: the only paper that doesn’t have inverse kinematics in its title is a known variant of RelaxedIK from the same authors. The Neighborhood Matrix View shows that the most salient words all describe the problem: ik, inverse, and kinematics. Examining the papers confirms that most are methods for this problem.
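The Neighborhood Matrix View essentially tabulates which salient terms occur in which neighboring documents. A minimal sketch of that document-by-term occurrence matrix, with invented abstracts standing in for the real neighborhood:

```python
def neighborhood_matrix(docs, terms):
    """Occurrence matrix with one row per neighborhood document and
    one column per salient term (1 if the term appears, else 0) --
    a sketch of the table the matrix view visualizes."""
    return [[int(t in d.split()) for t in terms] for d in docs]

# Hypothetical neighborhood around an inverse-kinematics paper.
neighborhood = [
    "relaxed ik solver for inverse kinematics",
    "real-time inverse kinematics with ik constraints",
    "motion retargeting without explicit ik",
]
terms = ["ik", "inverse", "kinematics"]
for row in neighborhood_matrix(neighborhood, terms):
    print(row)
```

Scanning the columns shows at a glance that the salient terms all describe the shared problem, as in the RelaxedIK example.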

Example image

In contrast, the neighborhood formed by the SPECTER embedding has documents that are less obviously connected. AbstractsViewer can help interpret these results. The Corpus Map shows the neighborhood to be spread out; hovering over the regions it covers shows terms that describe a range of robotics problems. The Document View can be used to examine the neighbors for common terms, identifying common terms between each document and RelaxedIK using the TFIDF metric. This surrogate reveals an interesting pattern: words such as geometric, feasible, and end-effector that describe a range of robotics problems. These same words appear as the most salient terms in the Neighborhood Matrix View with the difference metric; similar words are highlighted by the contrastive metrics. Examination reveals that the neighborhood is identifying methods that achieve similar properties on different problems.
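The local surrogate step above can be sketched as follows: for each neighbor, find the terms it shares with the anchor paper and rank them by a per-term weight. Both the helper and the weights are hypothetical (in practice the weights would be TF-IDF scores); the abstracts are invented.

```python
def shared_terms(anchor, neighbor, weights):
    """Terms common to the anchor and a neighbor, ranked by a
    per-term weight (e.g., TF-IDF; hypothetical values here).
    This mirrors using a term-based surrogate to explain why an
    embedding placed two papers near each other."""
    common = set(anchor.split()) & set(neighbor.split())
    return sorted(common, key=lambda t: -weights.get(t, 0.0))

anchor = "relaxed inverse kinematics with feasible end-effector goals"
neighbor = "geometric planning for feasible end-effector trajectories"
weights = {"end-effector": 3.1, "feasible": 2.4, "for": 0.1}  # assumed scores
print(shared_terms(anchor, neighbor, weights))
```

Applied across the whole SPECTER neighborhood, the recurring shared terms (here, end-effector and feasible) surface the property-level connection that the opaque embedding distance alone does not explain.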

A second example illustrates the potential for term refinement (T2) and corpus context (T5). Here, a researcher was trying to apply a new computer vision device to a robotics application. They were trying to identify similar devices, the methods used with them, and their applications. One lead came when a review of a manuscript provided a related paper. Examining the neighbors of the paper in the Corpus Map showed them to be in regions defined by common robotics problems (e.g., slam, localization) and methods applied to them (e.g., trifocal). This suggests techniques for comparison and potential applications (T5). The Neighborhood Matrix View revealed laser and rangefinder as salient terms; looking in the Document View showed that the laser rangefinder was historically a commonly used device with similar properties to the sensors being considered (T2). This suggested a new search that provided a more focused foothold into the corpus (T2→T1).

Depth Sensor in Robotics

We applied AbstractsViewer to help develop work with a collaborator involving applying a new computer vision device to a robotics application. We were unable to identify similar papers (T3), in part because we were unsure of how the field referred to similar concepts (T1) and because we lacked context (T2) about what kinds of problems were solved using these related devices. Searching for the problem terms we were aware of led to results that were too diverse; for example, the term “calibration” leads to over 500 matches in the smaller recent robotics corpus. A reviewer of an early manuscript said our work was not novel, but provided only a paper from an obscure journal that was not in our corpus. However, its abstract gave us an initial foothold (T1).

Identifying similar papers to the abstract provided a set of similar papers that were tightly clustered in a region. Region analysis shows this part of the map to be about some common robotics problems (e.g., “slam”, “localization”) and methods applied to them (e.g., “trifocal”). This suggests techniques for comparison and potential applications for our work. The closest match to the abstract was a paper that presented a nearly identical approach to the same problem as the obscure paper we started with, except that it was published 5 years earlier in a standard venue. Examining the neighbors of this paper quickly revealed the common term “laser rangefinder”, which suggested an alternative to the search term we had been using, “depth sensor”. Combining terms provided focused search terms that led to identifying relevant papers.

Using the map to identify different groups of related work to focus a search

H2 used the generic term “teleoperation” to begin an exploration, but this led to too many matches. H2 used the map to see two clusters, whose meanings he identified quickly by hovering over the points to see the titles and confirming them with the region analysis word-list hover. After confirming the prevalence of relevant words in the region of the second cluster, H2 said “I feel like I have to read every single one of these.” The matrix allowed him to identify combinations of words, since “the word alone doesn’t mean anything.”

Identifying good search terms

H3 began with a paper central to his current work. He observed many relevant papers in the neighbor list. Examining the Neighborhood Matrix View, he was able to see a common term used in many of these, “camera-in-hand”. This provided a starting point for another search. H5 had a similar experience: beginning with a search query about one topic, she observed intriguing words shared by neighboring papers in the Neighborhood Matrix View, saying “maybe I should search for that.” Seeing how that term appeared in several different map regions enabled diversifying the set of papers she discovered by encouraging her to sample papers from different regions.

Negative recommendations

Often, the transparency of similarity exposed why a bad match appeared. For example, H3 was looking for papers related to manipulation in cluttered spaces, but many of the papers that best matched his initial anchors did so because of terms such as “continuum” that describe types of robots he was uninterested in. Unfortunately, the current AbstractsViewer has limited facilities to help with known irrelevant words: a user can look for documents without the word in the matrix view, or use negation in a keyword query (our experiment participants were not shown how to make complex queries). Better handling of negative words is an important feature to add.