Readings 07: High-Dimensional Data
Last week, we focused on scaling in the number of items. This week, we’ll talk about what to do when we have too many dimensions.
Unfortunately, we can’t discuss the mathematics and algorithms of dimensionality reduction in class, which is too bad, since it’s useful and important and (in my mind) interesting. There are enough other classes that cover it.
(required) High-Dimensional Visualizations. Georges Grinstein, Marjan Trutschl, Urska Cvek. (semantic scholar) (link1)
This is an old (circa 2001) paper that I am not sure was actually published at KDD. However, it is a great gallery of old methods for doing “High-Dimensional” (mid-dimensional by modern standards) visualizations. Most of these ideas did not stand the test of time, but it’s amusing to look through the old gallery to get a sense of what people were trying.
(required) The Beginner’s Guide to Dimensionality Reduction, by Matthew Conlen and Fred Hohman. An Idyll interactive workbook.
This is a gentle demonstration of the basic concepts of dimensionality reduction. It doesn’t say much about the “real” algorithms, but it should give you a rough idea of what dimensionality reduction does if you don’t have one already.
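If you want to try the basic idea yourself, here is a minimal sketch (my example, not from the workbook) that uses PCA from scikit-learn to squeeze the 4-dimensional iris data down to 2 dimensions:

```python
# A minimal PCA example (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)     # 150 points in 4 dimensions
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)             # the same 150 points in 2 dimensions
print(pca.explained_variance_ratio_)  # how much variance each new axis keeps
```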
(required) How to Use t-SNE Effectively
I wanted to give you a good foundation on dimensionality reduction. This isn’t it. But… it will make you appreciate why you need to be careful with dimensionality reduction (especially fancy kinds of it).
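To see the problem concretely, here is a minimal sketch (my illustration, assuming scikit-learn and matplotlib) that embeds the same digits data with t-SNE at three different perplexity values; the resulting layouts can look very different even though the data never changes:

```python
# t-SNE at several perplexities (assumes scikit-learn and matplotlib).
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 images, 64 dimensions each
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perplexity in zip(axes, [5, 30, 100]):
    X2 = TSNE(n_components=2, perplexity=perplexity,
              init="pca", random_state=0).fit_transform(X)
    ax.scatter(X2[:, 0], X2[:, 1], c=y, s=4)
    ax.set_title(f"perplexity = {perplexity}")
plt.show()
```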
(left off in 2021, but required in the future) Understanding UMAP
I like this as a way to explain the UMAP algorithm. It mixes the details with the intuitions. Understanding UMAP itself is less important than getting a sense of what these kinds of algorithms do.
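If you want to experiment after reading, here is a minimal sketch, assuming the third-party umap-learn package; n_neighbors and min_dist are the two parameters the article spends the most time on:

```python
# UMAP on the digits data (assumes umap-learn: pip install umap-learn).
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
# Larger n_neighbors emphasizes global structure; larger min_dist
# spreads points apart in the 2D layout.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0)
X2 = reducer.fit_transform(X)         # a 2D layout of the 64-D images
```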
I was going to suggest some optional readings for those of you who want to learn more about dimensionality reduction. There is a lot of great stuff that is visualization-specific: techniques for using dimensionality reduction, approaches for user-controlled (supervised) dimensionality reduction, ways to visualize and interpret dimensionality reductions, … But there’s so much I don’t know where to start. If there is some topic that is interesting to you, make a posting on Piazza and I’ll give a recommendation on where to start.
If you’ve had an ML class, you might be wondering “what about X?” (where X is some more modern dimensionality reduction algorithm). Machine learning has made dimensionality reduction a hot topic recently, and there are a plethora of new methods to consider.
There is also a separate question of how to look at dimensionality reduced data. There are no required readings for this.
(optional) Julian Stahnke, Marian Dörk, Boris Müller, and Andreas Thom. 2015. Probing Projections: Interaction Techniques for Interpreting Arrangements and Errors of Dimensionality Reductions. IEEE Transactions on Visualization and Computer Graphics 22, 1 (August 2015), 629–638. DOI: https://doi.org/10.1109/TVCG.2015.2467717
This focuses on more basic dimensionality reductions (PCA), but it gets at many of the issues.
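One issue it gets at is that a projection distorts some points more than others. As a rough sketch of that idea (my illustration, not the authors’ code), PCA lets you measure a per-point reconstruction error, which tells you which points the 2D view represents poorly:

```python
# Per-point PCA reconstruction error (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
X_back = pca.inverse_transform(X2)           # map the 2D points back to 4D
errors = np.linalg.norm(X - X_back, axis=1)  # distortion for each point
print(errors.argsort()[-5:])                 # the five worst-represented points
```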
(optional) Florian Heimerl and Michael Gleicher. 2018. Interactive Analysis of Word Vector Embeddings. Computer Graphics Forum 37, 3 (June 2018), 253–265. DOI: https://doi.org/10.1111/cgf.13417 (online version)
While this is specific to Word Vector Embeddings, I like it because it tries to get away from the “default” scatterplot designs.
(optional) Florian Heimerl, Christoph Kralj, Torsten Möller, and Michael Gleicher. 2020. embComp: Visual Interactive Comparison of Vector Embeddings. IEEE Transactions on Visualization and Computer Graphics preprint (December 2020). DOI: https://doi.org/10.1109/TVCG.2020.3045918 (online version)
A recent paper I am quite proud of, dealing with the challenges of comparing embeddings. Again, the lesson here is in the choices of how to do things other than scatterplots.