Theme 3: Predictors

In this theme, your objective is to predict people's time usage from their other variables, and then visualize the results of these predictors. Making the predictors is an ML project; the visualization challenge is to interpret the results.

Fail your ML Class

In this theme, your goal is to build predictors that predict the different uses of time based on other variables. I say “predictors” because your goal is to predict all of the different time usage categories (doing this for the top-level set is OK, but again, you can make it harder by doing more). You can also view this as a single predictor that makes a vector-valued prediction (a vector of the time usages).

Except that, since this is a Vis class not an ML class, we will let you perform machine learning malpractice: don’t worry about over-fitting. You can train and test on the entire data set if you like (if you’d like to do this exercise by making a holdout split, or using cross-validation, that is certainly acceptable).
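As a minimal sketch of what "train and test on the entire data set" can look like, the baseline below predicts a vector of time usages as the per-group mean and then evaluates on the same rows it was fit on. The field names (`age_group`, `sleep`, `work`), the helper functions, and the tiny data set are all made up for illustration; a real project would more likely hand the data to a regression library.

```python
from collections import defaultdict

# Hypothetical toy records: one categorical input and two time usages (hours/day).
records = [
    {"age_group": "child", "sleep": 10.0, "work": 0.0},
    {"age_group": "child", "sleep": 9.0, "work": 0.0},
    {"age_group": "adult", "sleep": 7.0, "work": 8.0},
    {"age_group": "adult", "sleep": 8.0, "work": 6.0},
]
TARGETS = ["sleep", "work"]


def fit_group_means(rows, key, targets):
    """Learn one mean vector of the targets per value of `key`."""
    sums = defaultdict(lambda: [0.0] * len(targets))
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
        for i, t in enumerate(targets):
            sums[row[key]][i] += row[t]
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


def predict(model, row, key):
    """Return the vector-valued prediction for a single row."""
    return model[row[key]]


# "Malpractice" version: fit and evaluate on the same rows.
model = fit_group_means(records, "age_group", TARGETS)
residuals = [
    [row[t] - p for t, p in zip(TARGETS, predict(model, row, "age_group"))]
    for row in records
]
```

The `residuals` table (one row per person, one column per time usage) is the raw material for the visualizations: everything else in this theme is a way of summarizing or slicing it.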

And, of course, since this isn’t an ML class, we don’t care about how good your predictor is based on the standard metrics… We care about how well you show/present/interpret the results! (In a sense, if your predictor were perfect, there would be nothing to show, so your project would be boring!)

There are simple summary metrics you could use (e.g., RMS error). But hopefully, digging deeper will show more interesting things. For example:

  1. Which time usages are you better/worse at predicting?
  2. What are the distributions of errors? (are they generally good with a few outliers, or …)
  3. Do the errors correlate/anti-correlate between predictions? (for people whose sleep you estimate well, do you also estimate their time on the phone well?)
  4. Are there subgroups that have higher or lower errors? (e.g., can you see that you are good at predicting that kids spend lots of time in school, or bad at predicting things about people in some region)
  5. What variables are most useful in making predictions? Are the same variables useful for many different time usage predictions?
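Several of these questions reduce to simple computations over a table of per-person errors. The sketch below is one possible starting point for questions 1, 3, and 4. It assumes you already have residuals (actual minus predicted) for each person and target; the target names, group labels, and numbers are invented for illustration.

```python
import math
from collections import defaultdict

# Hypothetical residuals (actual minus predicted, hours/day), one row per person.
TARGETS = ["sleep", "phone"]
groups = ["child", "child", "adult", "adult"]
residuals = [
    [0.5, 1.0],
    [-0.5, -1.0],
    [2.0, 3.0],
    [-2.0, -3.0],
]


def rmse(values):
    """Root-mean-square of a list of errors."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Question 1: which targets are predicted better or worse?
columns = {t: [r[i] for r in residuals] for i, t in enumerate(TARGETS)}
rmse_per_target = {t: rmse(col) for t, col in columns.items()}

# Question 3: do errors in one target go along with errors in another?
# Absolute errors are used so that "bad at sleep" and "bad at phone" line up
# regardless of the sign of the error.
abs_sleep = [abs(v) for v in columns["sleep"]]
abs_phone = [abs(v) for v in columns["phone"]]
error_corr = pearson(abs_sleep, abs_phone)

# Question 4: do some subgroups have higher errors?
by_group = defaultdict(list)
for g, r in zip(groups, residuals):
    by_group[g].append(r[0])  # sleep errors only, for brevity
rmse_by_group = {g: rmse(vals) for g, vals in by_group.items()}
```

Numbers like these are only summaries. With dozens of targets, `rmse_per_target` becomes a long list and the pairwise error correlations become a matrix, which is where the design of the visualization starts to matter.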

Given how rich and complex the predictors are (dozens of predictions, dozens of variables, thousands of samples), even if you are computing simple summary statistics, there are still challenges in how to present all the information.

In this theme, you will need to both build the predictors, and then visualize their results.

You can view this as either a project to build the tools to let you look at the predictors (so you can answer the questions), or a project to visualize what you found when you were looking at the results.

Warning: if you lack the ML skills to create the predictors, do not attempt this theme. In theory, creating such supervised predictors should be as simple as throwing the data at a supervised regression algorithm. Be careful to handle the non-interval values correctly, though! One-hot encodings are a naive solution, but they are better than ignoring that, in some cases, the numbers encode things that are not ordinal.
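To make the one-hot idea concrete, here is a small sketch in plain Python. The `one_hot` helper and the region codes are hypothetical; most ML libraries provide their own encoder, which you would normally use instead.

```python
def one_hot(values):
    """Map categorical codes to 0/1 indicator vectors.

    Returns the sorted list of categories and one indicator row per value.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical region codes: the numbers are labels, not quantities,
# so 3 is not "more" than 1.
region_codes = [1, 3, 2, 3]
categories, encoded = one_hot(region_codes)
```

Each code becomes its own 0/1 column, so a regression can no longer treat region 3 as "three times" region 1.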