Here are some pointers to example data sets for Design Challenge 2.
Warning: this is a work in progress; check back often to see if there is anything new…
Most files are in a canvas Files folder: https://canvas.wisc.edu/courses/273473/files/folder/DC2%20SampleData
- 11/16 some high dimensional text data has been posted
Problem 1: Sub Group Structure
This problem was motivated by the Time Usage data from the Design Challenge 1 Data Sets (4. Time Usage Survey), although any of the Design Challenge 1 data sets could be used.
A real motivation for me was creating a new data set. I pulled a new set from the server that includes pandemic-era data (2020), but I didn’t get a chance to check the data to make sure it was OK. A solution for DC2 might help with that check. An initial version of this data is ipums_v0.
While getting at that, I noticed the historical data - it goes back to 1930! But it’s hard to see what’s there: only certain years are included, some things aren’t sampled in all years, and certain kinds of time usage aren’t there at all. That makes it a perfect test case for finding “holes” in the data (a good DC2 challenge). I’ve uploaded an initial version of the data as atush_v0.
If someone asks (on Piazza), I can generate even better versions and try to capture more variables.
Problem 2: Dimensionality Reduction
Coming Soon: text corpus exploration data.
This data comes from our Text Corpus Exploration System. We’ll post instructions on Canvas for how you can try it out.
We have 3 different data sets:
- Vis Abstracts (about 5000 documents)
- Robotics Abstracts (about 40000 documents)
- NY Times Articles (up to 120000 documents)
For each, we have:
- Word count (really TFIDF) data (about 10K columns, as there is one per word - 16706 for Vis Abstracts)
- Universal Sentence Encoder data (a few hundred columns - 512 for Vis Abstracts)
- NMF topic model data (about 20 columns - 20 for the Vis Abstracts)
- UMAP dimensionality reductions of each of the above
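To give a feel for what the word-count/TFIDF representation looks like, here is a minimal sketch using scikit-learn on a tiny made-up corpus (the three example sentences are hypothetical stand-ins for abstracts, not the actual data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy "documents" standing in for paper abstracts (hypothetical text).
docs = [
    "visualization of high dimensional data",
    "dimensionality reduction for text corpora",
    "topic models summarize document collections",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document, one column per word

print(X.shape)  # (3, size of the vocabulary)
```

On the real corpora the same idea produces one column per word in the whole vocabulary, which is why the TFIDF matrices have ~10K columns while the sentence-encoder and topic-model representations are much narrower.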
We will provide some examples of both the whole sets (e.g., all 5K Vis papers) and “neighborhoods” (i.e., the 100 closest neighbors to a given document). For each, there will be a pair: the high-dimensional data (one of the first three representations above) and its UMAP reduction to 2D (although you can do the reduction yourself). These are not posted yet.
Not surprisingly, the 10 nearest neighbors in 10K dimensions are not the same as the 10 nearest neighbors in 2 dimensions…
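One way to quantify that mismatch is to compute each point’s k nearest neighbors in the high-dimensional space and again after projection, and measure the overlap. Here is a sketch on random data, using PCA as a simple stand-in for UMAP (the data, the projection choice, and the sizes are all illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_high = rng.normal(size=(500, 100))  # stand-in for the high-dimensional data

# Project to 2D (PCA here as a simple stand-in for UMAP).
X_2d = PCA(n_components=2).fit_transform(X_high)

def knn_indices(X, k=10):
    # k+1 neighbors because each point is its own nearest neighbor;
    # drop the first column (the point itself).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    return nn.kneighbors(X, return_distance=False)[:, 1:]

high_nn = knn_indices(X_high)
low_nn = knn_indices(X_2d)

# Average fraction of each point's 10 high-dim neighbors preserved in 2D.
overlap = np.mean([len(set(h) & set(l)) / 10 for h, l in zip(high_nn, low_nn)])
print(f"mean neighbor overlap: {overlap:.2f}")
```

On data like this the overlap is far below 1.0, which is exactly the kind of distortion the problem asks you to visualize.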
Update 11/16: A dataset from the Vis papers corpus (about 5K items) has been posted. There is the whole set (about 5K items), a “small” set (500 selected randomly), a “neighborhood” set (a selected document and its 100 nearest neighbors - 101 items), and a “non-neighbor” set (101 random items). The four representations listed above are available for each. The data is in NumPy “.npy” files - if you need something else, ask for help on Piazza. 765 text data.zip
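If you haven’t worked with “.npy” files before, they round-trip through NumPy’s `save`/`load`. The sketch below saves and reloads a small array; the file name and the (101, 512) shape are hypothetical examples (a neighborhood-sized set of sentence-encoder vectors), so substitute the actual file names from the zip:

```python
import numpy as np

# Hypothetical array: a 101-item "neighborhood" of 512-dim encoder vectors.
demo = np.random.default_rng(0).normal(size=(101, 512)).astype(np.float32)
np.save("neighborhood_demo.npy", demo)  # stand-in file name

# Loading the posted files works the same way: np.load("<file from the zip>.npy")
X = np.load("neighborhood_demo.npy")
print(X.shape, X.dtype)
```

Checking `X.shape` right after loading is a quick way to confirm which representation (TFIDF, encoder, topics, or UMAP) a given file holds.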
You can try this out using the Abstracts Viewer. We’re working on making the system easier to get started with. It’s not hard to see examples of dimensionality reduction distortions.
Problem 3: Tiny Charts
The data you need will vary depending on the chart type that you decide to work on.
Data shouldn’t be hard to come by, but if you need help, ask on Piazza.