Last year: https://pages.graphics.cs.wisc.edu/765-10/wp-content/uploads/2010/02/10-02-07-Evaluation.pdf
Religions of Visualization
Try to get at the philosophical question: “why is it good?”
Blindly following a “religion” is probably not the right thing – even though they all have their good elements.
- Tufte-ism
  - It’s good because I say so
  - reasoned as good through critique
  - It’s good because it follows a set of principles (commandments)
- Munzner
  - It’s good because you followed the process properly
  - It’s good because you correctly determined that it is
- Empiricism
  - It’s good because my experiment says so
  - perceptualism (perceptual study)
  - (even if the experiment is just about a detail)
  - not holistic
  - It’s good because I’ve found a way to really measure how good it is
- Pragmatism
  - It’s good because it gets the job done
  - It’s good because it actually helps get the job done
  - Assumes you can measure results
- Perceptualism
  - It’s good because it should be good given what I know about perception
- Designerism
  - It’s good since I’m a famous designer / it looks attractive
Tufte: Fundamental Principles
Emphasis
- Citations of Data
- Credibility of Author
- Title (inform viewers of intent)
- Legends
Note: Tufte does not separate data from presentation. Lots of his message is “are you showing me the right data” instead of “are you showing the data right”
Note: Tufte calls things “Evidence Presentations” – not visualizations, or comprehension tools, or …
- Principle 1: Comparisons
  - Show comparisons, contrasts, differences
- Principle 2: Causality, Mechanism, Systematic Structure, Explanation
  - Show causality, mechanism, explanation, systematic structure
  - Not really addressed: correlation vs. causality
  - in practice, requires bringing lots of data to bear
  - This is hard, and Tufte doesn’t give us any hints
- Principle 3: Multivariate Analysis
  - Show multivariate data; that is, show more than 1 or 2 variables.
  - bring lots of data to bear!
- Principle 4: Integration
  - Completely integrate words, numbers, images, diagrams
  - use words and pictures (a place where vis does badly! – pragmatics)
  - Whatever it takes to explain something
- Principle 5: Documentation
  - Provenance – where did the data come from?
- Principle 6: Content
- Principle 6: Content
Thoroughly describe the evidence. Provide a detailed title,
indicate the authors and sponsors, document the data sources,
show complete measurement scales, point out relevant issues.
Analytical presentations ultimately stand or fall depending
on the quality, relevance, and integrity of their content.
Munzner’s Nested Model
Not just a model for evaluation – a way to think about the design process.
- “Threats” – ways things can go wrong
- Useful to see “what should be right”
- her concern is paper writing: making sure you’re evaluating your contribution
- don’t do the wrong test!
- be aware of what your testing really shows
- she has a strong idea of how you should write papers (like she does!)
- what constitutes a design study vs. …
- our concern is design and evaluation
“Getting it right” at many levels
Nested: can solve the outer (upstream) problems without solving the downstream (inner) ones
- can evaluate each piece assuming previous one is right
- getting it right downstream requires “inner” pieces to be right
- wrong problem: the target users don’t actually do that;
- wrong abstraction: you’re showing them the wrong thing;
- wrong encoding/interaction: the way you show it doesn’t work;
- wrong algorithm: your code is too slow.
Can do evaluation upstream or downstream for each layer.
Need to match the evaluation technique/question to the appropriate layer (see the sketch below).
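To make the layer/evaluation matching concrete, here is a minimal Python sketch of the four levels, their threats, and evaluations that fit each one. The example validations are paraphrased from my reading of Munzner’s paper and are illustrative, not something spelled out in these notes:

```python
# Illustrative sketch of Munzner's nested model: each level, its threat, and
# example validations that fit upstream (before the inner levels are built)
# vs. downstream (after a working system exists).
# The example validations are paraphrases of Munzner's paper, not these notes.
NESTED_MODEL = [
    {
        "level": "domain problem characterization",
        "threat": "wrong problem: the users don't actually do that",
        "upstream": ["observe and interview target users"],
        "downstream": ["measure adoption / long-term use in the field"],
    },
    {
        "level": "data / operation abstraction",
        "threat": "wrong abstraction: you're showing them the wrong thing",
        "upstream": ["justify the abstraction against the domain tasks"],
        "downstream": ["field study: do real users get real work done with it?"],
    },
    {
        "level": "encoding / interaction design",
        "threat": "wrong encoding/interaction: the way you show it doesn't work",
        "upstream": ["justify choices with perceptual principles, expert review"],
        "downstream": ["lab study, quantitative metrics on result images"],
    },
    {
        "level": "algorithm design",
        "threat": "wrong algorithm: your code is too slow",
        "upstream": ["complexity analysis"],
        "downstream": ["benchmark running time and memory"],
    },
]

def evaluations_for(level_name: str) -> dict:
    """Look up which evaluations address a given level (raises if unknown)."""
    for entry in NESTED_MODEL:
        if entry["level"] == level_name:
            return {"upstream": entry["upstream"], "downstream": entry["downstream"]}
    raise KeyError(level_name)

# A lab study speaks to the encoding level, not to the domain problem:
print(evaluations_for("encoding / interaction design")["downstream"])
```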
Examples in the paper (you probably haven’t read the papers – I haven’t read them all either)
- Papers make claims about the domain – but don’t validate them
- Papers mainly do upstream validation for the “harder” (outer) levels
Important lessons:
- Different kinds of evaluation are necessary
- Some kinds of evaluation are hard/easy
- Different evaluations achieve different things
- Quantitative HCI studies – measure very specific things
- Unlikely that one type of evaluation shows everything
- Even a successful system might not do well at the inner levels (users put up with a bad implementation/encoding because it’s well suited to their domain)
But:
- some kinds of evaluation are attractive because they are easy to quantify!
What about those hard-to-get-at evaluations: how do we quantify what REALLY matters (the outer levels)?
- Qualitative Study
- Not something we CS folks are good at
- Hard to make comparisons
Beware of measuring the Wrong Thing!
Easy:
- Micro-tasks
- Quantitative (correctness, speed)
- Short-Term Recall
Are these good proxies for what we really care about?
- Learning / long-term recall
- Ability to gain more complicated insights (that you couldn’t get other ways)
- Efficiency in communicating (speed and clarity)
- Ability to discover
- Ability to connect to high-level task?
- Ability to work in context?
Useful Junk
- Blasphemy!
- Little argument that, as the “Tufte-ists” say, there are short-term efficiency (speed, correctness) issues
- But are they asking the right questions?
- What are the right questions?
- Can we find ways to measure these “harder to measure” things?
- long term recall: hard to measure
How can we reconcile Tufte and Holmes?
Look at the religious quality of the online debate: people are not rational about this
Methods for High-Level Study
How do we know we’re doing well at the outer levels?
Need to measure results!
But this is really hard:
- Count the number of Nobel Prizes won by your collaborators
- Count the number of Science/Nature/Cell covers
What if your tool is great, but there isn’t anything to see?
How do we control for the users / data?
If a scientist doesn’t make a discovery:
- Is there nothing to discover?
- Might they have discovered something with a different tool?
- Might a better scientist have found something?
Some strategies:
- Model problems / fake data
- Long term, in-depth (wait and see)
- Carefully designed studies
Case studies – anecdotes
- Rarely provide comparison.
- Scientist did “X” – would they have done just as well with another tool?
- Can’t know – since it’s already been discovered, re-discovery is not the same
How do you define a “result” so that you can measure it? (insight)
North: Insight Quantification
- Come up with standard datasets (that a lot of scientists could interpret)
- interesting, but not known
- Come up with a lot of scientists (at equivalent levels, but high enough)
- higher level better, but more costly
- Need to have the time to learn a tool
- Need to have the time to spend with the data (real discovery takes time)
- Need to have enough interest/incentive to really look
An amazing attempt to do this: it’s clearly really hard
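One way to make “insight” countable is to code each finding a participant reports during a session. A minimal, hypothetical sketch in Python – the record format and field names are mine, loosely following the insight-based evaluation idea, not something specified in these notes:

```python
from dataclasses import dataclass, field

@dataclass
class Insight:
    """One coded insight reported by a participant (hypothetical record format)."""
    description: str            # the finding, in the participant's own words
    minutes_into_session: float # real discovery takes time
    domain_value: int           # 1-5, rated afterwards by a domain expert
    unexpected: bool            # went beyond what they set out to check?
    generated_hypothesis: bool

@dataclass
class Session:
    """All insights from one participant using one tool on one dataset."""
    tool: str
    dataset: str
    insights: list[Insight] = field(default_factory=list)

    def total_value(self) -> int:
        return sum(i.domain_value for i in self.insights)

# Comparing tools then becomes comparing counts, values, and time-to-first-insight
# across sessions -- still expensive, because each session needs real scientists,
# real time with the data, and expert judging of the insights' value.
s = Session(tool="ToolA", dataset="gene-expression")
s.insights.append(Insight("cluster X co-varies with treatment", 42.0, 4, True, True))
print(len(s.insights), s.total_value())
```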