DC3 Alternate Assignment: Machine Learning
This is an alternate option for DC3. I expect most people will want to do the regular DC3: The Tree of Stuff assignment. This assignment may be more work, and is more exploratory, but if it connects to other work that you would want to do anyway, then it may be attractive to you.
Visualization for machine learning is a hot topic that many people are interested in (from the Vis side and the ML side). Given the interest in it, I wanted to let students have the option of doing their final “project” in this domain. But, the problem is that in order to do visualization on machine learning things, you have to have done the machine learning work to have things to visualize.
If you are willing to do some machine learning work to generate your own data, read on…
If you are interested in this option, you must make arrangements with the Professor and get permission to do this assignment. Our intent is to allow a few students to do it. The assignment is not for everyone.
The Basic Idea…
The idea of this option is that if you are experienced with machine learning, generating some example data to visualize may not be such a big deal. So, this project option will involve you:
1. Building some Machine Learning models for a specific task (to generate data to visualize)
2. Building a visualization system that visualizes the data from step 1
Note: step 1 (doing the ML) is not really this Vis project. You will be required to do it, and to provide the data (so that future students can use it). But it isn’t the main part of the project. It is extra work (but, if you like to practice your ML skills, …). You won’t be graded on the quality of your ML work (except that you must document it and provide the results). I don’t care whether the models that you make are accurate or state-of-the-art. What I care about is that you make models and compare them.
Specifically, this project is about model results comparison. The visualization challenge is to help the user compare the results of different machine learning models.
Here are the three scenarios you can choose from:
1. You have to classify an image data set with many different categories (for example, CIFAR-100). You build 3-5 models, but you want to know how they differ in their performance. You can get the basic measures (e.g., accuracy), but you want to “drill into” the results. Do the models get the same ones right/wrong? Are their confusions similar? Etc.
2. You have a collection of the abstracts to 4-5000 visualization papers (or 40,000 robotics papers, if you are brave). You build 3-5 models that measure their similarity (to perform clustering, or “more papers like this” recommendations). Here, there is no ground truth correctness - how do we help the user see which one leads to more meaningful distances (and/or downstream task performance)?
3. You have a large collection of reviews (for example, the Amazon Data Set used in DC3: The Tree of Stuff). You have different models that predict the rating from the review text. How do you compare these and/or decide which ones are reliable? Again, just the ground truth scores (e.g., RMSE) may not be sufficient. You want to help the user understand what kinds of examples the models get right or wrong by helping to identify specific, interesting examples, and trends/groupings.
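To make the reviews scenario concrete, here is a minimal Python sketch of why a single RMSE number per model is not enough. Everything in it is an assumption for illustration: the model names, the toy ratings, and the helper functions `rmse` and `most_disputed` are hypothetical, not part of any provided code. It computes the aggregate score for each model, then picks out the reviews where the models disagree the most, which are often the interesting examples to show a user.

```python
import numpy as np

def rmse(pred, truth):
    """Root mean squared error between predicted and true ratings."""
    diff = np.asarray(pred, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def most_disputed(preds_by_model, k=3):
    """Indices of the k reviews with the largest spread (max - min) across models."""
    stacked = np.stack(list(preds_by_model.values()))  # shape: (n_models, n_reviews)
    spread = stacked.max(axis=0) - stacked.min(axis=0)
    # Stable sort so that ties keep their original order.
    return np.argsort(-spread, kind="stable")[:k]

# Toy data (hypothetical): true star ratings and two models' predictions.
truth = np.array([5, 1, 3, 4])
preds = {
    "model_a": np.array([5.0, 2.0, 3.0, 4.0]),
    "model_b": np.array([4.0, 1.0, 3.0, 1.0]),
}

for name, p in preds.items():
    print(name, "RMSE:", round(rmse(p, truth), 3))
print("most disputed reviews:", most_disputed(preds, k=2).tolist())
```

The aggregate scores say which model is better overall, but the disputed indices point at the specific reviews a visualization could surface for inspection.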
I am open to other similar tasks (and indeed have a few ideas). But they need to involve standard datasets, models that are easy to obtain, and need to be big enough and hard enough to create difficult visualization problems (but not so hard that we can’t get ML models to compare).
The catch here is that you need to:
- Get the data (not too hard: I have the abstracts data, and there are plenty of standard data sets for scenarios 1 and 3)
- Build some models. It is totally OK to get models from the internet (providing you give proper attribution) - the goal is not to invent better ML models, it’s to have models to compare. Or you can build basic models using standard techniques.
- Generate model results. Capture the results of running those models on the test and training data (so you can view resubstitution errors).
- Articulate the visualization task for model comparison. This isn’t too hard - we will help you refine the task.
- Design and Build a tool that looks at the data.
- Assess your tool and document it. And make sure to give all the data (the models, the results) so that others can try in the future.
Oh yeah - the model comparison problem can be hard.
For example, if you are doing Scenario 1 (comparing classifiers) for a 100-category data set… You might consider showing the confusion matrices for the different models. But confusion matrices don’t scale well (looking at a 100x100 matrix is hard!). Comparing confusion matrices (especially when they are big) is hard. And confusion matrices don’t capture near misses (how do you convey that the 2nd choice is almost a tie with the best, and would be correct?). And they don’t necessarily help identify classes that are hard, or even instances that are hard (all classifiers get this one wrong - and when you look at it, you might see why).
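The quantities mentioned above can be computed directly from per-item class scores. The following is a minimal sketch in Python with NumPy, assuming each model's output is an array of shape (items, classes). The function names, the `max_margin` threshold, and the toy scores are all hypothetical choices for illustration, not a prescribed method.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def near_misses(scores, y_true, max_margin=0.1):
    """Items the model gets wrong where the correct class is its 2nd choice
    and the top two scores differ by at most max_margin."""
    order = np.argsort(scores, axis=1)
    top1, top2 = order[:, -1], order[:, -2]
    ranked = np.sort(scores, axis=1)
    margin = ranked[:, -1] - ranked[:, -2]
    mask = (top1 != y_true) & (top2 == y_true) & (margin <= max_margin)
    return np.flatnonzero(mask)

def hard_instances(score_list, y_true):
    """Items that every model in score_list gets wrong."""
    wrong = np.stack([s.argmax(axis=1) != y_true for s in score_list])
    return np.flatnonzero(wrong.all(axis=0))

# Toy data (hypothetical): 4 items, 3 classes, 2 models.
y_true = np.array([0, 1, 2, 1])
scores_a = np.array([
    [0.70, 0.20, 0.10],  # correct
    [0.48, 0.47, 0.05],  # wrong, but the true class is a close 2nd
    [0.10, 0.60, 0.30],  # wrong, true class is a distant 2nd
    [0.20, 0.70, 0.10],  # correct
])
scores_b = np.array([
    [0.60, 0.30, 0.10],  # correct
    [0.10, 0.80, 0.10],  # correct
    [0.50, 0.30, 0.20],  # wrong
    [0.30, 0.60, 0.10],  # correct
])

print(confusion_matrix(y_true, scores_a.argmax(axis=1), 3))
print("near misses for model A:", near_misses(scores_a, y_true).tolist())
print("wrong for all models:", hard_instances([scores_a, scores_b], y_true).tolist())
```

Computing these lists is the easy part. The visualization problem is how to present them for 100 classes and thousands of items across several models.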
As a Class Project
There’s a big ask here… I am asking students to do the ML work to create their own data so they can do the visualization project. That is extra work, and it is orthogonal to the learning goals of the class. But, if people are willing to do it, it will (1) lead to very cool and challenging projects, (2) provide examples we could use for future projects, and (3) inspire research ideas (I am actively working in this area, but need examples!).
I’d like to get a few people to take on this challenge. There is a lot of risk: you need to be able to make a set of models (so you need to be pretty good with the ML part) - and we can’t help you much with that. And the Vis problems are hard. But here, we are happy to brainstorm with people to come up with ideas. So, there will be some leeway - we aren’t going to expect polished visualization solutions. Prototypes that work on your data will be fine.
But do not expect this to be replacing a Vis project with an ML project. In fact, if you can’t come up with the ML data in a reasonable amount of time, we’ll pull the plug and you can go back to doing the normal class project. We want this to be about doing the Vis for ML. Doing some ML to do Vis on is just the setup work.
About the Data
Note that there are two different sets of data:
1. The “input” data (the data that we use to train and test the Machine Learning model)
   For example, for the visual scenario (#1), this data is the test and training images and their labels.
2. The “output” data (the results of the ML models)
   For example, for the visual scenario (#1), this data is the classifier results for each image. So, for example, a table with a row per image (test and training), with a “score” per category (the likelihood that the image is in the class).
   If you’re writing out this data, you might also put the “correct answer” into the table (so the Vis system has easy access to it).
The goal of the Vis project is to look at #2. The vis system might want to have access to #1 (for example, you might want to be able to show the images that the classifier gets wrong). But the focus is on #2.
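As one possible shape for the “output” data table described above, here is a small Python sketch that writes a row per image with a score per category, plus the correct answer. The column names (`item_id`, `split`, `true_label`, `score_<class>`), the image ids, and the `write_results` helper are assumptions for illustration. The actual format is something you propose and we agree on.

```python
import csv
import io

def write_results(out, item_ids, splits, true_labels, scores, class_names):
    """Write one row per item: id, train/test split, correct label, one score per class."""
    writer = csv.writer(out)
    header = ["item_id", "split", "true_label"] + [f"score_{c}" for c in class_names]
    writer.writerow(header)
    for item_id, split, label, row in zip(item_ids, splits, true_labels, scores):
        writer.writerow([item_id, split, label] + [f"{s:.4f}" for s in row])

# Toy data (hypothetical): two images, two classes, results from one model.
buf = io.StringIO()  # stands in for a file such as one CSV per model
write_results(
    buf,
    item_ids=["img_0001", "img_0002"],
    splits=["train", "test"],
    true_labels=["cat", "dog"],
    scores=[[0.90, 0.10], [0.35, 0.65]],
    class_names=["cat", "dog"],
)
print(buf.getvalue())

# Reading it back, as the Vis system would.
rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
```

Keeping the split and the correct label in every row means the Vis system can filter to test or training items and check correctness without going back to the input data.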
Also, note that I do not mention the “internals” of the classifiers: for this assignment, you are to treat them as a black box. That’s not to say that visualizing the internals of models isn’t a fascinating and challenging problem - it’s just not the focus of this assignment.
If you want to do this…
If you want to do this, you must make an arrangement with the Professor. At each step, we can iterate and discuss - so don’t wait until the last minute!
Note: at each phase there will be some conversation with the Professor to keep things on track.
Phase 1 Proposal: (you will turn this in as DC3-1 (due Mon, Nov 16) - although you will need to do it by email, not by Canvas) You must propose the machine learning problem you want to work on. Be clear about what the machine learning problem is, where you will get the “input” data, and how you will come up with 3-5 models to compare. Be clear about what the “output data” will be. We should also have a sense of what kinds of Visualization tasks you’ll try to do (and maybe even some initial ideas for designs).
The crucial thing here is that we are sure that you have a problem that is hard enough (big enough, difficult enough for machine learning, rich enough results that there will be something to see), but not too hard (for example, you need to be able to build and run the models).
If we can’t agree on the proposal, we’ll cancel and have you do the regular DC3.
Note: you should discuss your proposal with the Professor before turning it in as part of Phase 1. We’ll have sessions to discuss this.
Phase 2 Modeling and Data Creation: (you will turn this in as DC3-2 (due Mon, Nov 23)) In this phase, you do the machine learning and save the output data. You’ll need to make models to compare. You’ll need to come up with the output data format, and make sure you can generate it. While you can tweak the models or add more models later, the idea is that at this point you could be “done” with the machine learning and focus on visualization.
The key here is that you have some output data so you can start on the visualization.
The crucial thing here is to demonstrate that you will have enough data to do the visualization assignment. If at the end of this phase you still have no data, we’ll cancel and have you do the regular DC3.
Note: you will turn this in by sending an email to the Prof. and TA. In the email, you must describe what you have done and what the data is. Expect a bit of dialog. We’ll probably have a “workshop” to discuss.
Phase 3 First Draft: (you will turn this in as DC3-3 (due Mon, Nov 30)) In this phase, you’ll actually start working on the Visualization assignment. We’d like you to identify the tasks you’d like to address, and give us some initial ideas/sketches of what you expect to be able to build for the final. We can help you brainstorm ideas to try.
The point here is that we will help you come up with “good enough” visualizations for class. We’re hoping that you’ll create creative solutions for hard problems. But we’re willing to help.
Phase 4 Initial Handin: (you will turn this in as DC3-4 (due Mon, Dec 7)) In this phase, you’ll turn in your work so we can evaluate it. The idea is that we’ll look at it (possibly together with you) to make sure everything is well set, and that everything we want/need is in place.
Phase 5 Final Handin: (you will turn this in as DC3-5 (due Mon, Dec 14)) In this phase, you’ll provide “final documents” - this is an interactive process, since we want to make sure that we get all the intermediate data (input and output data, ML models so we can recreate the models, documentation so we know what is what).