Design Challenge

This is a simple example of synthetic data, generated using the cocktail party simulator.

All of these data files come from the same network: a 12 person party with 1 host. All guests know the host and 2 other people (so D knows A (the host) as well as C and E, its two neighbors).

In the simulation, we add two factors:

sampling (how many observations we take to build the matrix). In many cases, we are undersampling (not getting enough samples to really capture the phenomenon), which will lead to noisy measurements.

measurement noise (random chance added to the numbers). Basically, this says that when we make an observation, there’s a chance it might be a random event (two people who do not know each other may still talk to each other, or two people are talking to each other but we missed it).
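The actual simulator isn’t shown here, but purely as an illustration, a process of this general shape could be sketched as follows. Note that the real simulator’s noise is an additive amount on each conversation selection (the +/- 3 described below); this sketch substitutes a simpler “random pair with some probability” stand-in, and all the details are assumptions:

```python
import random

def sample_party(n=12, samples=100, p_noise=0.0, seed=0):
    """Sketch of a single-conversation cocktail party (details assumed).

    Ground truth: person 0 is the host; every guest knows the host and
    their two ring neighbors.  Each sample observes one conversation
    between a random acquainted pair; with probability p_noise, a random
    (possibly unacquainted) pair is recorded instead.
    """
    rng = random.Random(seed)
    guests = list(range(1, n))
    pairs = [(0, g) for g in guests]                      # everyone knows the host
    pairs += [(guests[i], guests[(i + 1) % len(guests)])  # ring among the guests
              for i in range(len(guests))]
    counts = [[0] * n for _ in range(n)]
    for _ in range(samples):
        if rng.random() < p_noise:
            a, b = rng.sample(range(n), 2)                # spurious observation
        else:
            a, b = rng.choice(pairs)
        counts[a][b] += 1
        counts[b][a] += 1
    return counts
```

Undersampling then shows up directly: with `samples=100` the counts are lumpy, while with `samples=10000` they settle toward the true acquaintance structure.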

This example should allow you to see how well your techniques deal with these two factors. The underlying phenomenon is the same (so we would hope to have very similar representations), but the errors might make that harder to discover.

The datafiles have names formed as follows (a small parsing sketch appears after the list):

P 12 x 100 – 0 – 1

which means:

  • 12 person party (all these are the same)
  • x means that it’s the single-host party (we’ll see other networks in future data)
  • 100 means 100 samples
  • 0 means no noise (a value of 6 means +/- 3 noise added to each conversation selection)
  • 1 is the trial (there are two trials of each condition given)
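So a concrete file name might look like p12x100-0-1.csv – that exact spelling is a guess based on the zip name below – and could be unpacked like this:

```python
import re

# Assumed concrete form of the schematic "P 12 x 100 - 0 - 1" name;
# the exact separators are a guess.
NAME = re.compile(r"p(\d+)x(\d+)-(\d+)-(\d+)", re.IGNORECASE)

def parse_name(fname):
    people, samples, noise, trial = map(int, NAME.search(fname).groups())
    return {"people": people, "samples": samples,
            "noise": noise, "trial": trial}

print(parse_name("p12x100-6-2"))
# {'people': 12, 'samples': 100, 'noise': 6, 'trial': 2}
```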

Here is a ZIP of a bunch of these: p12x.zip (16 to be exact)

(right now, I can’t upload individual CSV files – but we’re working on fixing that)

We’ve posted a bunch of pages about the design challenge:

By now you should be familiar with the design task. We’ve compiled some simple “experimental” visualizations both to provide a starting place and to give you an idea of what works and what doesn’t when comparing adjacency matrices in this context.

All of the following examples (and the .csv files with the raw Epistemic Net data) can be found here.

Here are some examples of the visualization tools that we’ve come up with to help with the problem. If you want to examine these experiments more closely, download the linked file, download and install Processing, and open the associated files. If you are just interested in the raw data, look at the .csv files in the data folder of the above file. The matrices are stored in blocks of n×n cells (where n is the number of nodes) with 0s on the diagonal (representing that the strength of association from a node to itself is unknown/undefined). Each .csv file represents a different venue. The .xlsx file presents all of the venues together so you can gauge relative scales.
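Assuming the n×n blocks are simply stacked one after another within each file (an assumption about the layout – check the actual files), a minimal loader might look like this:

```python
import csv

def load_blocks(path, n):
    """Read a .csv of stacked n-by-n adjacency blocks (layout assumed)."""
    with open(path, newline="") as f:
        rows = [[float(x) for x in row] for row in csv.reader(f) if row]
    # slice the flat list of rows into consecutive n-row matrices
    return [rows[i:i + n] for i in range(0, len(rows), n)]
```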

Experiment one: The Asterisk

Similar to a radar plot, the asterisk shows the association strength along a different spoke for each member. The “fan” approach widens these bars to make them easier to see.
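The Processing sketches are the real reference; purely as an illustration, a single member’s asterisk could be drawn along these lines (a matplotlib stand-in with made-up data):

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up association strengths from one member to the 6 members A-F
strengths = np.array([0.0, 0.5, 0.25, 0.75, 0.1, 0.6])
angles = np.linspace(0, 2 * np.pi, len(strengths), endpoint=False)

ax = plt.subplot(projection="polar")
for theta, r in zip(angles, strengths):
    ax.plot([theta, theta], [0, r], lw=4)   # one spoke per member
ax.set_xticks(angles)
ax.set_xticklabels(list("ABCDEF"))
plt.show()
```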

Experiment two: The CompareMat

Overlays the adjacency matrices, representing strength by the radius of a circle. Smaller circles are always drawn on top of larger ones so that no information is hidden.
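Again as a rough stand-in for the Processing version (colors, scaling, and layout are all assumptions), the draw-larger-circles-first idea looks like this:

```python
import numpy as np
import matplotlib.pyplot as plt

def compare_mat(A, B):
    """Overlay two association matrices as circles (illustrative sketch)."""
    n = A.shape[0]
    top = max(A.max(), B.max())
    fig, ax = plt.subplots()
    for i in range(n):
        for j in range(n):
            # draw the larger circle first so the smaller stays visible on top
            for r, color in sorted([(A[i, j], "tab:blue"),
                                    (B[i, j], "tab:orange")], reverse=True):
                ax.add_patch(plt.Circle((j, n - 1 - i), 0.45 * r / top,
                                        color=color, alpha=0.8))
    ax.set_xlim(-1, n)
    ax.set_ylim(-1, n)
    ax.set_aspect("equal")
    plt.show()
```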

Experiment three: The Golfball

Creates a graph of all the nodes in the matrix, representing the connection between them by the width of the edges. Of course, fully connected graphs of any significant size can be hard to parse…
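A minimal sketch of the same idea (nodes placed on a circle, edge width proportional to association strength – the circular layout is an assumption, and the original experiment used a downloaded graph-drawing program):

```python
import numpy as np
import matplotlib.pyplot as plt

def golfball(M):
    """Fully connected graph: nodes on a circle, edge width = strength."""
    n = M.shape[0]
    theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
    xs, ys = np.cos(theta), np.sin(theta)
    for i in range(n):
        for j in range(i + 1, n):
            plt.plot([xs[i], xs[j]], [ys[i], ys[j]], "k-",
                     lw=5 * M[i, j] / M.max(), alpha=0.6)
    plt.scatter(xs, ys, zorder=3)
    plt.gca().set_aspect("equal")
    plt.axis("off")
    plt.show()
```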

Experiment four: The Spokes Graph

Represents each node separately as a line in a table, with the ability to highlight sections of the graph to see specific vertices.

As you can tell, none of these designs is perfect, and all of them could use some work, even if one of them does turn out to be the right way of looking at these data. For more discussion, consult this page and the assignment page.

(note – read the general intro first – this is probably more detailed / domain specific than what you want to start with)

(note 2 – I (Mike) have added the headings and formatting. Comments I’ve added are italicized.)

The paper referenced in the text below is available from IJLM0102_Shaffer.

The Context: What is the domain?

Epistemic games are based on a specific theory of learning: the epistemic frame hypothesis. The epistemic frame hypothesis suggests that any community of practice has a culture and that culture has a grammar, a structure composed of:

  1. Skills: the things that people within the community do
  2. Knowledge: the understandings that people in the community share
  3. Identity: the way that members of the community see themselves
  4. Values: the beliefs that members of the community hold
  5. Epistemology: the warrants that justify actions or claims as legitimate within the community

This collection of skills, knowledge, identity, values, and epistemology forms the epistemic frame of the community. The epistemic frame hypothesis claims that: (a) an epistemic frame binds together the skills, knowledge, values, identity, and epistemology that one takes on as a member of a community of practice; (b) such a frame is internalized through the training and induction processes by which an individual becomes a member of a community; and (c) once internalized, the epistemic frame of a community is used when an individual approaches a situation from the point of view (or in the role) of a member of a community.

Put in more concrete terms, engineers act like engineers, identify themselves as engineers, are interested in engineering, and know about physics, biomechanics, chemistry, and other technical fields. These skills, affiliations, habits, and understandings are made possible by looking at the world in a particular way: by thinking like an engineer. The same is true for biologists but for different ways of thinking—and for mathematicians, computer scientists, science journalists, and so on, each with a different epistemic frame.

Epistemic games are thus based on a theory of learning that looks not at isolated skills and knowledge, but at the way skills and knowledge are systematically linked to one another—and to the values, identity, and ways of making decisions and justifying actions of some community of practice.

The domain problem: assessment of Epistemic Games / Epistemic Frames

To assess epistemic games, then, we begin with the concept of an epistemic frame. The kinds of professional understanding that such games develop is not merely a collection of skills and knowledge—or even of skills, knowledge, identities, values, and epistemologies. The power of an epistemic frame is in the connections among its constituent parts. It is a network of relationships: conceptual, practical, moral, personal, and epistemological.

Epistemic games are designed based on ethnographic analysis of professional learning environments, the capstone courses and practica in which professionals-in-training take on versions of the kinds of tasks they’ll do as professionals. Interspersed in these activities are important opportunities for feedback from more experienced mentors. In earlier work, I explored a few ways of providing technical scaffolds to help young people meaningfully engage in the professional work of science journalists. I also conducted an ethnography of journalism training practices, studying a reporting practicum course on campus. This has led to my current effort: seeking to better understand how we might measure and articulate the similarities and differences between the writing feedback in different venues – in this case, copyediting feedback given in the journalism practicum, copyediting feedback given in a journalism epistemic game, and copyediting feedback given in a graduate level psychology course (i.e., a non-journalism contrast venue).

I’m particularly interested in differentiating the kinds of writing feedback that are more characteristic of journalism from more general writing feedback. In order to investigate these patterns quantitatively, the feedback from each venue has been segmented (each comment from each writing assignment for each participant in each venue was treated as a separate data segment) and coded for the presence/absence of a number of categories (for a graphic example of this using a different data set, see the attached paper, p.6). Using epistemic network analysis, the resulting data set can then be used to investigate such ideas as the relative centrality of particular frame elements, i.e., the extent to which particular aspects of journalistic expertise (categories of skills / knowledge / values / identity / epistemology) are linked together in the feedback provided.

The challenge: Comparing Epistemic Frame Networks

The design challenge arises when we try to compare this multidimensional data set across venues. It is unwieldy, to say the least, to try to compare multiple sets of 17 items. We can overcome that by first calculating the root mean square of the 17 relative centrality values, then scaling the resulting values, and finally comparing the resulting single similarity indices. However, this involves collapsing a number of dimensions that (a) might not properly be collapsed, and (b) might be useful for providing an overall profile for comparison.
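In sketch form, that collapse looks like this (the exact scaling step isn’t specified above, so the `scale` factor here is a placeholder, and the profile data is made up):

```python
import numpy as np

def similarity_index(centralities, scale=1.0):
    """Collapse a venue's 17 relative-centrality values into one number.

    `scale` stands in for whatever scaling step is applied after the
    root mean square; the exact convention is not specified here.
    """
    return scale * np.sqrt(np.mean(np.square(centralities)))

# hypothetical centrality profiles for two venues
practicum = np.array([0.9, 0.4, 0.7] + [0.2] * 14)
psychology = np.array([0.8, 0.6, 0.3] + [0.25] * 14)
print(similarity_index(practicum), similarity_index(psychology))
```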

As a way of retaining potentially important dimensional information, we’re also trying a multidimensional scaling technique, principal coordinates analysis (similar to principal component analysis), to identify a subset of coordinates we might then use to map the different venues’ data and produce 2- or 3-dimensional (i.e., graphable) representations of the data for comparison. The challenge of how to represent these multidimensional data sets remains.
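For reference, the textbook form of principal coordinates analysis (classical multidimensional scaling) fits in a few lines; whether this matches the exact procedure being used here is an assumption:

```python
import numpy as np

def pcoa(D, k=2):
    """Principal coordinates analysis (classical MDS).

    D: (m, m) symmetric matrix of pairwise distances.
    Returns an (m, k) array of coordinates suitable for plotting.
    """
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]    # keep the k largest eigenvalues
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
```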

There is another challenge inherent in our relative centrality metric: it calculates the centrality of a given element by summing the co-occurrences of that element with any other element, meaning it collapses the specific linkages taking place into a more general indication of the importance of the element. Comparing data from different venues, though, reveals that two elements from different venues with the same relative centrality values can actually be linked to quite different specific elements. In the terms of this data set, this would be something like data from both the practicum venue and the psychology venue showing Knowledge of Story as highly central, while a closer inspection of the links occurring reveals they are linked quite differently in each case.
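As a sketch (the normalization convention is an assumption), relative centrality collapses each row of the co-occurrence matrix to a single number:

```python
import numpy as np

def relative_centrality(cooc):
    """cooc: symmetric matrix of co-occurrence counts, zero diagonal.

    Each element's centrality is the sum of its co-occurrences with
    every other element; normalizing by the largest sum (an assumed
    convention) makes the values "relative".
    """
    totals = cooc.sum(axis=1)
    return totals / totals.max()
```

This makes the problem above concrete: two venues can produce identical row sums from very different patterns of individual co-occurrences.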

So, I’ve produced a new metric, relative link strength (RLS), which, like the relative centrality metric, is based on the co-occurrence of epistemic frame elements in the data segments. However, instead of collapsing these co-occurrence frequencies into a single value, RLS retains the specificity, producing a matrix of link frequencies between every pair of the codes (frame elements). This is particularly useful for drilling into apparently similar relative centrality values between different contexts, but it takes an already unwieldy representational set of 17 elements and makes it even more complex: a 17-by-17 matrix. Even focusing on a particularly interesting subset of 8 elements means figuring out the best way to show an 8×8 matrix. Working solutions so far include generating radar plots for each of the elements (the rows of the matrix, if you will), with each venue represented in semi-transparent solid fills, to get a sense of the similarity / difference between the venues on each dimension. This approach is better than some, but has drawbacks.
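Starting from the segmented, binary-coded data described above, RLS might be computed along these lines (again, the normalization is an assumed convention):

```python
import numpy as np

def relative_link_strength(X):
    """X: (segments, codes) 0/1 matrix of code presence per data segment.

    Counts how often each pair of codes co-occurs in a segment, keeping
    the full pairwise matrix instead of collapsing it to row sums.
    Normalizing by the largest count (an assumed convention) makes the
    values relative.
    """
    links = X.T @ X             # co-occurrence counts per code pair
    np.fill_diagonal(links, 0)  # self-links are undefined
    return links / links.max()

# toy example: 5 segments coded for 4 codes
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 0],
              [0, 0, 1, 1]])
print(relative_link_strength(X))
```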

Looking forward to thinking through this and the overall similarity representation with the group.

The Design Challenge

February 15, 2010


due dates: (see the rules)

  • initial solutions and class presentations – March 4th
  • final solutions and writeups – March 11th

The Design Challenge

The topic of this challenge is to create visualizations to help our colleagues in Educational Psychology interpret their Epistemic Frame Network data. Specifically, you need to address the problem of comparing two Frame Networks.

A detailed explanation of the data (and the problems the domain experts hope to solve) will be given in class on Thursday, February 18th.

This is a challenging problem for which we really don’t have a good solution yet. Our hope is that by having the class generate new ideas, we can find a bunch of new designs that may help them in both interpreting and presenting their data. Even though they have limited data right now, they are in the process of developing new tools that will generate a lot more data, so having good tools will be increasingly important. For your testing, we will also provide synthetic data.

The data is different from other data types seen in visualization. At first, it seems like lots of other network data. But these networks are small, dense, and weighted. It’s not clear that standard network visualization methods apply (and we haven’t discussed them in class yet).

The Data

(more details will be given in class on Thursday, February 18th)

An Epistemic Frame Network consists of a set of concepts. The size of the network (the number of concepts) we’ll denote as n. For small networks, n might be a handful (5 or 6); large networks are unlikely to be bigger than a few dozen (20-30). Most networks we’ll look at are in the 6-20 range. Each concept has a name which has meaning to the domain scientist. (See the information from the domain scientist to really understand what the data means.)

The data for the network is a set of association strengths. Between each pair of concepts, there is a strength that corresponds to how often the two concepts occur together. If the association strength is zero, the two concepts never occur together. If the number is bigger, the concepts appear together more often. The actual magnitude of the numbers has little meaning, but the proportions do. So if I say the association between A and B is .5, you don’t know if that’s a lot or a little. But if the association between A and B is .5 and between A and C is .25, you know that A is twice as strongly associated with B as with C. The associations are symmetric, but they don’t satisfy the triangle inequality (knowing AB and AC tells you nothing about BC).

The numbers for a network are often written in matrix form. The matrix is symmetric. The diagonal elements (the association between a concept and itself) are not well defined – some of the data just puts zeros along the diagonal. So the matrix:

0 .5 .25
.5 0 .75
.25 .75 0

is a 3-concept network, where the association between nodes A and B is .5, between A and C is .25, and between B and C is .75.

A more detailed explanation of what the data means may be provided by the domain experts. But you can think of association strength as “how closely related are the two concepts” (stronger is more closely related).

As an analogous problem, you can think of the network as a social network. The concepts are people, and the associations are how well they know each other, or how much time they spend talking to each other. A description of this problem (as well as the associated visualization problem) is provided on the SCCP page (single conversation cocktail party). (In the terminology of SCCP, what we get is the “interaction matrix”, not the “measurement matrix”.)

As a practical issue, the data will be provided as “csv” (comma-separated value) files containing symmetric matrices. The matrices are small enough that the redundancy isn’t a big deal. There will usually be an associated text file with the names of the concepts. If the names aren’t provided, you can just refer to the concepts by letter (A, B, C, …). In fact, you might want to refer to them that way no matter what.
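A minimal loader might look like this (the names file is assumed to hold one name per line, which is a guess about its format):

```python
import csv

def load_network(matrix_path, names_path=None):
    """Load a symmetric association matrix and optional concept names."""
    with open(matrix_path, newline="") as f:
        M = [[float(x) for x in row] for row in csv.reader(f) if row]
    if names_path:
        with open(names_path) as f:
            names = [line.strip() for line in f if line.strip()]
    else:
        # fall back to letters, as suggested above
        names = [chr(ord("A") + i) for i in range(len(M))]
    return M, names
```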

The Problem

The domain experts will explain what they want to do in interpreting the data. But the real problems are generally comparative: given 2 or 3 (or maybe more) networks, how do we understand the similarities and differences?

When comparing networks, you can assume they have the same concepts in the same order. In the event that one matrix is bigger than the other, you can simply pad the smaller ones with extra rows and columns of zeros.
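With numpy, that zero-padding is a one-liner wrapped in a helper (illustrative only):

```python
import numpy as np

def pad_to(M, n):
    """Pad a smaller matrix with zero rows/columns so it becomes n-by-n."""
    P = np.zeros((n, n))
    P[:M.shape[0], :M.shape[1]] = M
    return P
```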

Keep in mind that the data is noisy, has uncertainty, and carries some ambiguity (since the magnitudes don’t have meaning). What matters are the proportions between different observations. In fact, different matrices might be scaled differently. This matrix:

0 2 1
2 0 3
1 3 0

is equivalent to the one in the previous section.
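One way to check (and exploit) this equivalence is to normalize each matrix by its largest entry before comparing; for example:

```python
import numpy as np

A = np.array([[0, .5, .25], [.5, 0, .75], [.25, .75, 0]])
B = np.array([[0, 2, 1], [2, 0, 3], [1, 3, 0]])

# normalize each matrix by its largest entry; equal results mean the
# two matrices encode the same proportions
print(np.allclose(A / A.max(), B / B.max()))   # True
```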

It might be easier for you to think about the problem in terms of the cocktail party. In fact, we’ll provide you with a pile of example data from our cocktail party simulator. (we have limited real example data).

The Solution

First, I don’t think there is “THE” solution. There are probably lots of good ways to look at this data. Some good for some types of understanding, others good for other types.

How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth? (Sherlock Holmes)

I told David (the domain expert) that the way I was going to find one good visualization was to generate 50 bad ones first. You can see a number of my attempts on the SCCP page. We will provide you with the sample code for all of these (except for the graph visualization solutions, which use a program I downloaded called “Graphviz”). Our domain experts have also generated a few visualization ideas that they will show to you on February 18th.

Well, hopefully, we won’t need to generate 50 ideas. We’ll learn from the initial attempts and get to good answers quickly.

Your team will be expected to generate at least 1 (preferably several) possible solutions. Ideally, you will implement them as a tool that can read in matrices of various sizes so that we can try it out. However, if you prefer to prototype your visualization by drawing it by hand, that’s OK – please use one of the “real” example data sets though.

There is a need for a variety of solution types:

  • static pictures (for putting into print publications) as well as interactive things
  • tools for exploring data sets (to understand the differences between a set of networks), as well as tools for communicating these findings to others (where the user understands the differences)

It is difficult to evaluate a solution without really understanding the domain. That’s part of the challenge. You will have access to the domain experts to ask them questions. You can also think about things in terms of the SCCP domain (for which you are as expert as anyone).

The Challenge

The class will be divided into teams of 3 (approximately, since we have 16 people). We will try to assign teams to provide a diverse set of talents to each team. Hopefully, each team will have at least one person with good implementation skills for building interactive prototypes.

You will be able to ask questions of the domain experts in class on February 18th. If you want to ask them questions after that, send email to me (Mike Gleicher). I will pass the question along, and give the response back to the entire class (watch the comments on this posting). 

Please do not contact the domain experts directly. This is partly to limit their burden, but also for fairness (some groups may otherwise have more access to them than others).

On March 4th, we’ll use the class period for each group to present their solutions to the domain experts and to discuss our progress. Groups will then get another week to write up their solutions. We’ll provide more details as time gets closer.

What to Create

Each team should create at least one (preferably more) visualization technique for the ENF data.

You can devise tools for understanding a single network, but you must address the problem of comparing 2 networks. It’s even better if you can come up with solutions for handling 3 or more networks. (But showing that you have a solution for the 2-way comparison is a minimum requirement.)

Your approach should scale to networks with 20+ nodes.

It is best if you implement your proposed techniques so that they can load in data files. However, if you want to “prototype” manually (either drawing it by hand, or manually creating specific visualizations from some of the example data sets), that’s OK. You might want to do a simple prototype first, and then polish and generalize an implementation after.

For the demos (March 4th) you will be able to choose the data sets to show off your methods. For the final handins, we would prefer to be able to try out your techniques on “live” data. Ideally, we will give the tools you build to the domain experts and let them use them.

Designing tools that are interactive is great. For the demo, only you need to be able to use your tool (you will give the demo), but for the final handin, you will be expected to document what you’ve created.

I am aware that we haven’t discussed interaction (or network visualization) in class yet – this might be a good thing since I don’t want to cloud your judgment and have you just apply old ideas. Be creative!

Resources

Be sure to watch this page (and the comments on it) for updates and changes and more details.