This is a possible problem to work on for Design Challenge 2: A Visualization Project (Hard Vis Problems): it is choice 2, dimensionality reduction viewer (embeddings).
There are many tasks one might do with points in high dimensions: we might cluster them, or assess a clustering of them; we might look for groups; we might look for items similar to items of interest; we might look for gaps in the space; we might try to identify objects that are very close to others; etc.
There is a de facto standard visualization to do these with: perform dimensionality reduction to 2D, and show a scatterplot. This design relies on the power of the dimensionality reduction to preserve the interesting “structures”. Of course, there is often some degree of error (e.g., similar points end up far apart, or dissimilar points end up close together). And the meaning of apparent shapes in 2D is never really clear (does a line in 2D mean a line in high dimensions?).
The standard answer to this has been to use increasingly complex dimensionality reduction (DR) techniques, in the hope that they will better preserve the “structures” in high dimensions. Ironically, this approach might backfire: as the DR techniques become more complex, the meanings of their outputs become harder to interpret. Yet the common answer is still to show a scatterplot (and assume that the 2D patterns can be interpreted in high dimensions).
In this project we want you to consider a different approach: can we use visualization to help us understand the high-dimensional structures and/or guide us to interpret what the 2D patterns derived from them actually mean?
Your job is to create a tool that helps a user understand the “structure” (relationships between objects) in high dimensional data and/or understand how this structure is preserved by dimensionality reduction to 2D. Here are two (somewhat related) strategies (they might not be the only strategies):
Given a high dimensional point set, help the viewer correctly interpret what is really going on in high dimensions. This may involve using a dimensionality reduction to 2D, but it might involve some other way to show the high-dimensional relationships.
Given a high dimensional point set and a dimensionality reduction of it to 2D, help the viewer correctly interpret the 2D view and what it tells them about the actual high dimensional data. A variant of this strategy is to point out where the things visible in 2D may not be accurate representations of the original data.
For example, if we want the viewer to be able to identify the nearest neighbor for every item… (0) we could just show the DR scatterplot, even though, for many points, the nearest point in 2D won’t be the nearest point in high D; (1) we could give a list of items, and for every one, specify what its nearest neighbor is; (2) we could show the 2D scatterplot from #0, but connect every item to its nearest neighbor with a line. These 3 are intentionally crude, but meant to illustrate the kinds of solutions that may be possible. I hope you can come up with better ones.
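As a rough sketch of how option (0) can be checked against option (1), the Python below finds each item's nearest neighbor in the original space and in a 2D projection, then reports how often the two agree. Everything specific here is an assumption made for illustration: the random data, the PCA-via-SVD projection (any DR method could be swapped in), and the function name `nearest_neighbor`.

```python
import numpy as np

def nearest_neighbor(points):
    """Return, for each row, the index of its nearest other row (Euclidean)."""
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argmin(d2, axis=1)

rng = np.random.default_rng(0)
high = rng.normal(size=(200, 50))  # placeholder data: 200 items, 50 dimensions

# Stand-in projection (an assumption): PCA to 2D via SVD of the centered data.
centered = high - high.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
low = centered @ vt[:2].T

nn_high = nearest_neighbor(high)
nn_low = nearest_neighbor(low)
agreement = np.mean(nn_high == nn_low)
print(f"2D nearest neighbor matches the high-D one for {agreement:.0%} of items")

# Line segments for option (2): each item joined to its true nearest neighbor.
segments = np.stack([low, low[nn_high]], axis=1)  # shape (200, 2, 2)
```

The `segments` array holds one start and end point per item, which is the form most plotting libraries expect for drawing a collection of lines over a scatterplot.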
Note that this is open to what tasks, what types of dimensionality reduction, and what types of designs. You might focus on one task (or a small set of tasks); you might focus on a particular type of dimensionality reduction (e.g., UMAP or t-SNE); or you might explore a particular type of design (e.g., how to add interaction to scatterplots); or you might focus on a particular type of task (identifying clusters is probably the best explored in the literature).
We will provide some data sets at various scales, but we expect everyone to identify at least one more (and preferably make use of a combination of our sets and the new ones).
Finding other examples in the literature (describing either the problem or its solutions) is part of the exercise.
There are examples in the literature. Here are a few off the top of my head (yes, my work is over-represented here, because I’m most familiar with it):
Julian Stahnke, Marian Dörk, Boris Müller, and Andreas Thom. 2015. Probing Projections: Interaction Techniques for Interpreting Arrangements and Errors of Dimensionality Reductions. IEEE Transactions on Visualization and Computer Graphics 22, 1 (August 2015), 629–638. DOI: https://doi.org/10.1109/TVCG.2015.2467717. This paper gives a strategy for augmenting 2D scatterplots to show where they differ from the high dimensional data.
Florian Heimerl and Michael Gleicher. 2018. Interactive Analysis of Word Vector Embeddings. Computer Graphics Forum 37, 3 (June 2018), 253–265. DOI: https://doi.org/10.1111/cgf.13417 (online version). While this is specific to Word Vector Embeddings, I like it because it tries to get away from the “default” scatterplot designs.
Florian Heimerl, Christoph Kralj, Torsten Möller, and Michael Gleicher. 2020. embComp: Visual Interactive Comparison of Vector Embeddings. IEEE Transactions on Visualization and Computer Graphics preprint, (December 2020). DOI: https://doi.org/10.1109/TVCG.2020.3045918 (online version). This paper talks about comparing embeddings; comparing a 2D dimensionality reduction to the original high dimensional data is a special case. This is relevant because it provides a number of designs that are very different from scatterplots as ways to examine high dimensional data, and to assess the 2D reduction.
There are some “application” papers which get at the problematic nature of reading too much into dimensionality reductions.
Tara Chari, Joeyta Banerjee, and Lior Pachter. 2021. The Specious Art of Single-Cell Genomics. bioRxiv preprint.
There are papers that explore the task spaces of both the visualizations and high dimensional data in general. A few that come to mind…
Alper Sarikaya and Michael Gleicher. 2018. Scatterplots: Tasks, Data, and Designs. IEEE Transactions on Visualization and Computer Graphics 24, 1 (January 2018), 402–412. DOI: https://doi.org/10.1109/TVCG.2017.2744184
Stephen Ingram, Tamara Munzner, Veronika Irvine, Melanie Tory, Steven Bergner, and Torsten Möller. 2010. DimStiller: Workflows for dimensional analysis and reduction. In 2010 IEEE Symposium on Visual Analytics Science and Technology, IEEE, 3–10. DOI: https://doi.org/10.1109/VAST.2010.5652392
Michael Sedlmair, Tamara Munzner, and Melanie Tory. 2013. Empirical Guidance on Scatterplot and Dimension Reduction Technique Choices. IEEE Transactions on Visualization and Computer Graphics 19, 12 (December 2013), 2634–2643. DOI: https://doi.org/10.1109/TVCG.2013.153
I am intentionally not giving more, so that there are ones for you to find. In particular, there is a rich literature on evaluating clustering.
Data and Hints
We will provide some high dimensional data sets (generally document embeddings), and their corresponding 2D dimensionality reductions. However, these are easy to generate (for example, the Python scikit-learn package has good implementations of many of the popular DR methods).
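For instance, a 2D reduction might be generated along these lines. This is a minimal sketch, assuming scikit-learn and NumPy are installed; the random array is a placeholder for a real embedding matrix, and the parameter values are arbitrary choices rather than recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 64))  # placeholder for real document embeddings

# Linear reduction: project onto the two directions of greatest variance.
pca_2d = PCA(n_components=2).fit_transform(embeddings)

# Nonlinear reduction: t-SNE; perplexity must be smaller than the number of points.
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

print(pca_2d.shape, tsne_2d.shape)
```

Both results are arrays with one 2D coordinate per input row, so the same scatterplot code can show either, which makes it easy to compare how different methods arrange the same data.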
Specific Instructions for the Phases
In Phase 1, make sure that you convey that you understand the problems, and give a sense of the part of it you want to explore. There are ways to look at baseline solutions (scatterplots of high dimensional data) that expose differences between the original and reduced data. For example, my first experiments with the Vitality Demo (a very cool system for exploring the visualization literature) showed how different the nearest neighbors in high dimensions were from what was close in 2D.
In Phase 2, please be clear about your focus in terms of what tasks, what kinds of DR, what strategies, etc.
It will be good to show that you can actually view the data and show the issues with the reductions early in the project.
For assessment it will be important to show examples where the baseline (just give a scatterplot of the DR) is misleading, and how your approach could help the viewer avoid being misled.
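One way to find such examples is to score each point by how much of its high dimensional neighborhood survives in 2D. The sketch below uses one possible measure (the fraction of shared k nearest neighbors), not a prescribed one. The function names, the value of k, and the random data with a deliberately crude projection (keeping only the first two coordinates) are all assumptions for illustration.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest other rows for each row (Euclidean distance)."""
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # exclude each point from its own neighbor list
    return np.argsort(d2, axis=1)[:, :k]

def neighborhood_preservation(high, low, k=10):
    """Per-point fraction of high-D k nearest neighbors that are also 2D neighbors."""
    nn_high = knn_indices(high, k)
    nn_low = knn_indices(low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return np.array(shared) / k

rng = np.random.default_rng(1)
high = rng.normal(size=(150, 20))  # placeholder high-dimensional data
low = high[:, :2]                  # crude stand-in for a 2D reduction

scores = neighborhood_preservation(high, low, k=10)
worst = np.argsort(scores)[:5]     # points whose 2D neighborhoods are least faithful
print("mean preservation:", scores.mean())
print("least faithful points:", worst)
```

Points with low scores are candidates for the misleading cases: their apparent neighbors in the scatterplot are mostly not their neighbors in the original data.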