Lecture 06: Evaluation

by Mike Gleicher on February 12, 2012

Last year: https://pages.graphics.cs.wisc.edu/765-10/wp-content/uploads/2010/02/10-02-07-Evaluation.pdf

Religions of Visualization

Try to get at the philosophical question “why is it good”

Blindly following a “religion” is probably not the right thing – even though they all have their good elements.

  • Tufte-ism
    • It’s good because I say so
      • reasoned as good through critique
    • It’s good because it follows a set of principles (commandments)
  • Munzner
    • It’s good because you followed the process properly
    • It’s good because you correctly determined that it is
  • Empiricism
    • It’s good because my experiment says so
      • perceptualism (perceptual study)
      • (even if the experiment is just about a detail)
      • not holistic
    • It’s good because I’ve found a way to really measure how good it is
  • Pragmatism
    • It’s good because it gets the job done
    • It’s good because it actually helps get the job done
    • Assumes you can measure results
  • Perceptualism
    • It’s good because it should be good given what I know about perception
  • Designerism
    • It’s good since I’m a famous designer / it looks attractive

Tufte: Fundamental Principles

Emphasis

  • Citations of Data
  • Credibility of Author
  • Title (inform viewers of intent)
  • Legends

Note: Tufte does not separate data from presentation. Much of his message is “are you showing me the right data?” rather than “are you showing the data right?”

Note: Tufte calls things “Evidence Presentations” – not visualizations, or comprehension tools, or …

  • Principle 1: Comparisons
    • Show comparisons, contrasts, differences
  • Principle 2: Causality, Mechanism, Systematic Structure, Explanation
    • Show causality, mechanism, explanation, systematic structure
    • Not really: correlation vs. causality
    • in practice, requires bringing lots of data to bear
    • This is hard – and Tufte doesn’t give us any hints
  • Principle 3: Multi-variate Analysis
    • Show multivariate data; that is, show more than 1 or 2 variables.
    • bring lots of data to bear!
  • Principle 4: Integration
    • Completely integrate words, numbers, images, diagrams
    • use words and pictures (a place where vis does badly! – pragmatics)
    • Whatever it takes to explain something
  • Principle 5: Documentation
    • Thoroughly describe the evidence. Provide a detailed title,
      indicate the authors and sponsors, document the data sources,
      show complete measurement scales, point out relevant issues.

    • Provenance – where did the data come from?
  • Principle 6: Content
    • Analytical presentations ultimately stand or fall depending
      on the quality, relevance, and integrity of their content.


Munzner’s Nested Model

Not just a model for evaluation – a way to think about the design process.

  • “Threats” – ways things can go wrong
  • Useful to see “what should be right”
  • her concern is paper writing: making sure you’re evaluating your contribution
    • don’t do the wrong test!
    • be aware of what your testing really shows
    • she has a strong idea of how you should write papers (like her!)
    • what constitutes a design study vs. …
  • our concern is design and evaluation

“Getting it right” at many levels

Nested: can solve the outer (upstream) problems without solving the downstream (inner) ones

  • can evaluate each piece assuming the previous (outer) one is right
  • getting it right downstream requires the “inner” pieces to be in place

[image: Munzner’s nested model – domain problem, data/task abstraction, encoding/interaction, algorithm]

  • wrong problem: the target users don’t actually do that
  • wrong abstraction: you’re showing them the wrong thing
  • wrong encoding/interaction: the way you show it doesn’t work
  • wrong algorithm: your code is too slow

can do evaluation upstream or downstream for each layer

need to match evaluation technique / question to the appropriate layer
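
To make “match the evaluation to the layer” concrete, here is a small illustrative sketch (paraphrased examples in the spirit of Munzner’s figure, not a quote from her paper) pairing each nested level with typical upstream and downstream validation methods:

    # Illustrative sketch: pairing each level of Munzner's nested model with
    # example upstream and downstream validation methods. The method
    # descriptions are paraphrased examples, not an official list.

    NESTED_LEVELS = [
        # (level, upstream validation, downstream validation)
        ("domain problem",
         "interview / observe the target users",
         "field study of the deployed tool, adoption rates"),
        ("data / task abstraction",
         "argue that the abstraction fits the real tasks",
         "collect anecdotes of utility from target users"),
        ("encoding / interaction",
         "justify the design against perceptual principles",
         "lab study of time and error, analysis of result images"),
        ("algorithm",
         "analyze computational complexity",
         "benchmark running time and memory use"),
    ]

    for level, upstream, downstream in NESTED_LEVELS:
        print(f"{level}\n  upstream:   {upstream}\n  downstream: {downstream}")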

Examples in the paper (you probably haven’t read the papers – I haven’t read them all either)

  • Papers make claims about domain – but don’t validate that
  • Papers mainly do upstream for the “harder” or outer types

Important lessons:

  • Different kinds of evaluation are necessary
  • Some kinds of evaluation are hard/easy
  • Different evaluations achieve different things
  • Quantitative HCI studies – get at very specific things
  • Unlikely that one type of evaluation shows everything
  • Even a successful system might not do well internally (users put up with bad implementation/encodings because it’s well suited to their domain)

But:

  • some kinds of evaluation are attractive because they are easy to quantify!

What about those hard-to-get-at evaluations: how do we quantify what REALLY matters (the outer levels)?

  • Qualitative Study
  • Not something we CS folks are good at
  • Hard to make comparisons

Beware of measuring the Wrong Thing!

Easy (see the sketch after this list):

  • Micro-tasks
  • Quantitative (correctness, speed)
  • Short-Term Recall
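
For concreteness, a minimal sketch of what the “easy” measures look like: given per-trial records from a micro-task study, correctness and speed reduce to a couple of lines of code (the trial fields here are hypothetical):

    # Minimal sketch: computing the "easy" quantitative measures
    # (correctness and speed) from micro-task trial records.
    # The trial fields (answer, truth, response_time_ms) are hypothetical.
    from statistics import mean

    trials = [
        {"answer": "A", "truth": "A", "response_time_ms": 912},
        {"answer": "B", "truth": "A", "response_time_ms": 1488},
        {"answer": "C", "truth": "C", "response_time_ms": 730},
    ]

    accuracy = mean(1.0 if t["answer"] == t["truth"] else 0.0 for t in trials)
    mean_rt = mean(t["response_time_ms"] for t in trials)
    print(f"accuracy = {accuracy:.2f}, mean response time = {mean_rt:.0f} ms")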

Are these good proxies for what we really care about?

  • Learning / long-term recall
  • Ability to gain more complicated insights (that you couldn’t get other ways)
  • Efficiency in communicating (speed and clarity)
  • Ability to discover
  • Ability to connect to high-level task?
  • Ability to work in context?

Useful Junk

  • Blasphemy!
  • There is little argument with what the “Tufte-ists” say about short-term efficiency (speed, correctness) issues
  • But are they asking the right questions?
  • What are the right questions?
  • Can we find ways to measure these “harder to measure” things?
  • long term recall: hard to measure

How can we reconcile Tufte and Holmes?

Look at the religious quality of the online debate: people are not rational about this

Methods for High-Level Study

How do we know we’re doing well at the outer levels?

Need to measure results!

But this is really hard:

  • Count the number of Nobel Prizes won by your collaborators
  • Count the number of Science/Nature/Cell covers

What if your tool is great, but there isn’t anything to see?

How do we control for the users / data?

If a scientist doesn’t make a discovery:

  • Is there nothing to discover?
  • Might they have discovered something with a different tool?
  • Might a better scientist have found something?

Some strategies:

  • Model problems / fake data (see the sketch after this list)
  • Long term, in-depth (wait and see)
  • Carefully designed studies
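
One way to make the “model problems / fake data” strategy concrete (a hypothetical sketch, not something described in the lecture): plant a known pattern in synthetic data so that “did the analyst find it?” becomes something you can actually score.

    # Hypothetical sketch of the "fake data with a planted finding" strategy:
    # generate synthetic data containing a known relationship, give it to
    # study participants, and score whether their reports mention it.
    import random

    def make_fake_dataset(n=200, planted_slope=0.8, seed=1):
        """Noisy 2D points with a known positive trend planted in them."""
        rng = random.Random(seed)
        data = []
        for _ in range(n):
            x = rng.uniform(0, 10)
            y = planted_slope * x + rng.gauss(0, 1.5)  # the planted "discovery"
            data.append((x, y))
        return data

    def score_reports(reports, planted_effect="positive trend between x and y"):
        """Fraction of participant reports that mention the planted effect."""
        hits = sum(1 for r in reports if planted_effect in r.lower())
        return hits / len(reports) if reports else 0.0

    data = make_fake_dataset()  # the dataset participants would explore
    # usage: "reports" would be the participants' written findings
    print(score_reports(["I see a positive trend between x and y.", "No pattern."]))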

Case studies – anecdotes

  • Rarely provide comparison.
  • Scientist did “X” – would they have done just as well with another tool?
  • Can’t know – since it’s already been discovered, re-discovery is not the same

How do you define “result” so that you can measure it? (insight)

North: Insight Quantification

  • Come up with standard datasets (that a lot of scientists could interpret)
    • interesting, but not known
  • Come up with a lot of scientists (at equivalent levels, but high enough)
    • higher level better, but more costly
  • Need to have the time to learn a tool
  • Need to have the time to spend with the data (real discovery takes time)
  • Need to have enough interest/incentive to really look

An amazing attempt to do this: it’s clearly really hard
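
For concreteness, a minimal sketch of how the “insights” from such an open-ended session might be tallied; the coding scheme here (depth score, domain relevance, time to insight) is a simplified, hypothetical version in the spirit of North’s method, not his exact protocol:

    # Minimal sketch: tallying coded insights from an open-ended study session.
    # The coding fields (depth, domain_relevant, minutes) are a simplified,
    # hypothetical scheme, not North's exact protocol.
    insights = [
        {"what": "cluster of co-expressed genes",  "depth": 3, "domain_relevant": True,  "minutes": 12},
        {"what": "axis label is truncated",        "depth": 1, "domain_relevant": False, "minutes": 2},
        {"what": "outlier sample in condition B",  "depth": 2, "domain_relevant": True,  "minutes": 25},
    ]

    relevant = [i for i in insights if i["domain_relevant"]]
    total_depth = sum(i["depth"] for i in relevant)
    avg_time = sum(i["minutes"] for i in relevant) / len(relevant)
    print(f"{len(relevant)} domain-relevant insights, total depth {total_depth}, "
          f"avg time to insight {avg_time:.1f} min")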
