Lecture 06: Evaluation

by Mike Gleicher on February 12, 2012

Last year: https://pages.graphics.cs.wisc.edu/765-10/wp-content/uploads/2010/02/10-02-07-Evaluation.pdf

Religions of Visualization

Try to get at the philosophical question “why is it good”

Blindly following a “religion” is probably not the right thing – even though they all have their good elements.

  • Tufte-ism
    • It’s good because I say so
      • reasoned as good through critique
    • It’s good because it follows a set of principles (commandments)
  • Munzner
    • It’s good because you followed the process properly
    • It’s good because you correctly determined that it is
  • Empiricism
    • It’s good because my experiment says so
      • perceptualism (perceptual study)
      • (even if the experiment is just about a detail)
      • not holistic
    • It’s good because I’ve found a way to really measure how good it is
  • Pragmatism
    • It’s good because it gets the job done
    • It’s good because it actually helps get the job done
    • Assumes you can measure results
  • Perceptualism
    • It’s good because it should be good given what I know about perception
  • Designerism
    • It’s good since I’m a famous designer / it looks attractive

Tufte: Fundamental Principles

Emphasis

  • Citations of Data
  • Credibility of Author
  • Title (inform viewers of intent)
  • Legends

Note: Tufte does not separate data from presentation. Much of his message is “are you showing me the right data?” rather than “are you showing the data right?”

Note: Tufte calls things “Evidence Presentations” – not visualizations, or comprehension tools, or …

  • Principle 1: Comparisons
    • Show comparisons, contrasts, differences
  • Principle 2: Causality, Mechanism, Systematic Structure, Explanation
    • Show causality, mechanism, explanation, systematic structure
    • Not really: correlation vs. causality
    • in practice, requires bringing lots of data to bear
    • This is hard – and Tufte doesn’t give us any hints
  • Principle 3: Multi-variate Analysis
    • Show multivariate data; that is, show more than 1 or 2 variables.
    • bring lots of data to bear!
  • Principle 4: Integration
    • Completely integrate words, numbers, images, diagrams
    • use words and pictures (a place where vis does badly! – pragmatics)
    • Whatever it takes to explain something
  • Principle 5: Documentation
    • Thoroughly describe the evidence. Provide a detailed title,
      indicate the authors and sponsors, document the data sources,
      show complete measurement scales, point out relevant issues.

    • Provenance – where did the data come from?
  • Principle 6: Content
    • Analytical presentations ultimately stand or fall depending
      on the quality, relevance, and integrity of their content.


Munzner’s Nested Model

Not just a model for evaluation – a way to think about the design process.

  • “Threats” – ways things can go wrong
  • Useful to see “what should be right”
  • her concern is paper writing: making sure you’re evaluating your contribution
    • don’t do the wrong test!
    • be aware of what your testing really shows
    • she has a strong idea of how you should write papers (like her!)
    • what constitutes a design study vs. …
  • our concern is design and evaluation

“Getting it right” at many levels

Nested: can solve the outer (upstream) problems without solving the downstream (inner) ones

  • can evaluate each piece assuming the previous (outer) one is right
  • getting it right downstream requires the “inner” pieces to be in place

[image: Munzner’s nested model – domain problem, data/task abstraction, encoding/interaction, algorithm]

  • wrong problem: the target users don’t actually do that
  • wrong abstraction: you’re showing them the wrong thing
  • wrong encoding/interaction: the way you show it doesn’t work
  • wrong algorithm: your code is too slow

can do evaluation upstream or downstream for each layer

need to match evaluation technique / question to the appropriate layer
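
To make “match the evaluation to the layer” concrete, here is a small illustrative sketch (paraphrased examples in the spirit of Munzner’s figure, not a quote from her paper) pairing each nested level with typical upstream and downstream validation methods:

    # Illustrative sketch: pairing each level of Munzner's nested model with
    # example upstream and downstream validation methods. The method
    # descriptions are paraphrased examples, not an official list.

    NESTED_LEVELS = [
        # (level, upstream validation, downstream validation)
        ("domain problem",
         "interview / observe the target users",
         "field study of the deployed tool, adoption rates"),
        ("data / task abstraction",
         "argue that the abstraction fits the real tasks",
         "collect anecdotes of utility from target users"),
        ("encoding / interaction",
         "justify the design against perceptual principles",
         "lab study of time and error, analysis of result images"),
        ("algorithm",
         "analyze computational complexity",
         "benchmark running time and memory use"),
    ]

    for level, upstream, downstream in NESTED_LEVELS:
        print(f"{level}\n  upstream:   {upstream}\n  downstream: {downstream}")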

Examples in the paper (you probably haven’t read the papers – I haven’t read them all either)

  • Papers make claims about domain – but don’t validate that
  • Papers mainly do upstream for the “harder” or outer types

Important lessons:

  • Different kinds of evaluation are necessary
  • Some kinds of evaluation are hard/easy
  • Different evaluations achieve different things
  • Quantitative HCI studies – get at very specific things
  • Unlikely that one type of evaluation shows everything
  • Even a successful system might not do well internally (users put up with bad implementation/encodings because it’s well suited to their domain)

But:

  • some kinds of evaluation are attractive because they are easy to quantify!

What about those hard-to-get-at evaluations: how do we quantify what REALLY matters (the outer levels)?

  • Qualitative Study
  • Not something we CS folks are good at
  • Hard to make comparisons

Beware of measuring the Wrong Thing!

Easy (see the sketch after this list):

  • Micro-tasks
  • Quantitative (correctness, speed)
  • Short-Term Recall
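
For concreteness, a minimal sketch of what the “easy” measures look like: given per-trial records from a micro-task study, correctness and speed reduce to a couple of lines of code (the trial fields here are hypothetical):

    # Minimal sketch: computing the "easy" quantitative measures
    # (correctness and speed) from micro-task trial records.
    # The trial fields (answer, truth, response_time_ms) are hypothetical.
    from statistics import mean

    trials = [
        {"answer": "A", "truth": "A", "response_time_ms": 912},
        {"answer": "B", "truth": "A", "response_time_ms": 1488},
        {"answer": "C", "truth": "C", "response_time_ms": 730},
    ]

    accuracy = mean(1.0 if t["answer"] == t["truth"] else 0.0 for t in trials)
    mean_rt = mean(t["response_time_ms"] for t in trials)
    print(f"accuracy = {accuracy:.2f}, mean response time = {mean_rt:.0f} ms")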

Are these good proxies for what we really care about?

  • Learning / long-term recall
  • Ability to gain more complicated insights (that you couldn’t get other ways)
  • Efficiency in communicating (speed and clarity)
  • Ability to discover
  • Ability to connect to high-level task?
  • Ability to work in context?

Useful Junk

  • Blasphemy!
  • There is little argument with what the “Tufte-ists” say about short-term efficiency (speed, correctness) issues
  • But are they asking the right questions?
  • What are the right questions?
  • Can we find ways to measure these “harder to measure” things?
  • long term recall: hard to measure

How can we reconcile Tufte and Holmes?

Look at the religious quality of the online debate: people are not rational about this

Methods for High-Level Study

How do we know we’re doing well at the outer levels?

Need to measure results!

But this is really hard:

  • Count the number of Nobel Prizes won by your collaborators
  • Count the number of Science/Nature/Cell covers

What if your tool is great, but there isn’t anything to see?

How do we control for the users / data?

If a scientist doesn’t make a discovery:

  • Is there nothing to discover?
  • Might they have discovered something with a different tool?
  • Might a better scientist have found something?

Some strategies:

  • Model problems / fake data (see the sketch after this list)
  • Long term, in-depth (wait and see)
  • Carefully designed studies
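
One way to make the “model problems / fake data” strategy concrete (a hypothetical sketch, not something described in the lecture): plant a known pattern in synthetic data so that “did the analyst find it?” becomes something you can actually score.

    # Hypothetical sketch of the "fake data with a planted finding" strategy:
    # generate synthetic data containing a known relationship, give it to
    # study participants, and score whether their reports mention it.
    import random

    def make_fake_dataset(n=200, planted_slope=0.8, seed=1):
        """Noisy 2D points with a known positive trend planted in them."""
        rng = random.Random(seed)
        data = []
        for _ in range(n):
            x = rng.uniform(0, 10)
            y = planted_slope * x + rng.gauss(0, 1.5)  # the planted "discovery"
            data.append((x, y))
        return data

    def score_reports(reports, planted_effect="positive trend between x and y"):
        """Fraction of participant reports that mention the planted effect."""
        hits = sum(1 for r in reports if planted_effect in r.lower())
        return hits / len(reports) if reports else 0.0

    data = make_fake_dataset()  # the dataset participants would explore
    # usage: "reports" would be the participants' written findings
    print(score_reports(["I see a positive trend between x and y.", "No pattern."]))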

Case studies – anecdotes

  • Rarely provide comparison.
  • Scientist did “X” – would they have done just as well with another tool?
  • Can’t know – since it’s already been discovered, re-discovery is not the same

How do you define “result” so that you can measure it? (insight)

North: Insight Quantification

  • Come up with standard datasets (that a lot of scientists could interpret)
    • interesting, but not known
  • Come up with a lot of scientists (at equivalent levels, but high enough)
    • higher level better, but more costly
  • Need to have the time to learn a tool
  • Need to have the time to spend with the data (real discovery takes time)
  • Need to have enough interest/incentive to really look

An amazing attempt to do this: it’s clearly really hard
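
For concreteness, a minimal sketch of how the “insights” from such an open-ended session might be tallied; the coding scheme here (depth score, domain relevance, time to insight) is a simplified, hypothetical version in the spirit of North’s method, not his exact protocol:

    # Minimal sketch: tallying coded insights from an open-ended study session.
    # The coding fields (depth, domain_relevant, minutes) are a simplified,
    # hypothetical scheme, not North's exact protocol.
    insights = [
        {"what": "cluster of co-expressed genes",  "depth": 3, "domain_relevant": True,  "minutes": 12},
        {"what": "axis label is truncated",        "depth": 1, "domain_relevant": False, "minutes": 2},
        {"what": "outlier sample in condition B",  "depth": 2, "domain_relevant": True,  "minutes": 25},
    ]

    relevant = [i for i in insights if i["domain_relevant"]]
    total_depth = sum(i["depth"] for i in relevant)
    avg_time = sum(i["minutes"] for i in relevant) / len(relevant)
    print(f"{len(relevant)} domain-relevant insights, total depth {total_depth}, "
          f"avg time to insight {avg_time:.1f} min")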
