Design Challenge 3: The Tree of Stuff
Please check for updates. If changes are made after 11/18, there will be a note here at the top.
Summary
In this assignment, we’ll examine the “Tree of Stuff” – the Amazon Product Categories.
Every product is in one or more categories. The product categories for Amazon are hierarchical: they form a big tree. Your job is to design (and probably implement) a tool that allows the viewer to explore the tree to understand the tree and how the products distribute across it.
We’ll give you the data of the product categories. You give us visualizations (preferably interactive visualizations) that help us look at it and see interesting things that are going on.
Some key points (details below):
- Things are due on Saturday, December 14th. We can take late assignments (with a caveat).
- We will provide you with the data in a convenient form.
- Programming is optional.
- You may work with a partner, subject to the rules below.
- You need to consider how to allow the viewer to explore the data. The data is too big to put into a single picture.
- Your assignment must excel in some way (to get a good grade) – it can be an interesting design, answer questions comprehensively, provide a nice implementation, explore a range of design options, or some combination.
- All assignments require a discussion of your design (detailed description), rationale, the tasks you can support (and not support), and self-critique.
Dates
- Wed, Nov 20 – Project kickoff. We’ll discuss the project in class, and give time for people to find partners. For the “deadline” (OK, if it’s a little late) you just need to tell us if you are working with a partner or not. If you want a partner and can’t find one, we’ll try to help.
- Wed, Nov 27 – Sketch – You will hand in a sketch/notes about your current thoughts on the project. This is mainly for us to check that you have actually started thinking about the project.
- Wed, Dec 4 – Check In – You will hand in a sketch/draft to show your progress.
- Wed, Dec 11 – Official due date (last day of class). However, assignments will be accepted without penalty until Saturday, Dec 14. (so that is the real due date).
- Sat Dec 14 – Last day to turn in projects without penalty. We need to have enough time to get grading done. You may turn projects in on Sunday, Dec 15 if you tell us ahead of time. Projects turned in after December 15th may not be fully evaluated. See the instructions below.
- Monday, Dec 16 (or possibly Tuesday Dec 17) – Optional in-person demos. Because it is exam week, we cannot schedule a required session. However, we may need to see your project in person to grade it effectively.
The Data
Every product on Amazon is in one or more “categories”.
Each category can have subcategories. Non-root categories can have products in them. So a product might be in the “Books” category, or the “Books”:“Children’s” category (i.e., the Children’s subcategory of the Books category), or even [‘Books’, “Children’s Books”, ‘Fairy Tales, Folk Tales & Myths’, ‘Dragons’]. The categories form a tree. Each category is a node in the tree.
Different products are categorized differently. Sometimes, products are put into general categories (Books), other times they are placed in very specific categories (Books:Children’sBooks:FairyTales:FolkTales&Myths:Dragons). In theory, a single product could be placed in both a generic category and a specific category.
The Amazon metadata contains 9430088 products. The tree of product categories has 29241 nodes.
When products have multiple categories, these are often in different parts of the tree. For each category (node), there is a set of other categories (nodes) that it shares products with. We call these “alsos”. If a product has categories “A”,”B” and “C”, then category “A” will have “B” and “C” in its alsos set.
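The definition of “alsos” can be illustrated with a small sketch. This is not the course’s extraction code: the `products` dictionary, the letter category names, and the `build_alsos` helper are all hypothetical, and the real data identifies categories by node id rather than by letter.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical toy input: each product lists the categories it belongs to.
products = {
    "p1": ["A", "B", "C"],
    "p2": ["A", "B"],
    "p3": ["C"],
}

def build_alsos(products):
    """Map each category to a Counter of the categories it shares products with."""
    alsos = defaultdict(Counter)
    for categories in products.values():
        # Visit every ordered pair of distinct categories on the same product.
        for cat, other in permutations(set(categories), 2):
            alsos[cat][other] += 1
    return alsos

alsos = build_alsos(products)
print(alsos["A"].most_common())  # [('B', 2), ('C', 1)]
```

Here category “A” shares two products with “B” and one with “C”, which corresponds to the (category, count) pairs described for the alsos column below.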
Also, because products can be in multiple categories, getting accurate counts without overlap can be tricky. It is actually impossible to compute them from the per-category counts alone, so we’ve precomputed them for you in the simple data (see subtreeProductCount below).
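To see why the overlap matters, here is a minimal sketch under assumed toy data (the `parent` map, the `product_categories` dictionary, and both helper functions are hypothetical, not the provided script). It counts each product once per subtree by keeping sets of products, and compares that with naively adding up per-category counts.

```python
from collections import defaultdict

# Hypothetical toy tree: child id -> parent id (0 is the root).
parent = {1: 0, 2: 1, 3: 1}

# Hypothetical products, each listing the category ids it is filed under.
product_categories = {
    "p1": [2, 3],  # filed under two siblings
    "p2": [1, 2],  # filed under a node and its child
    "p3": [3],
}

def ancestors_and_self(node, parent):
    """Yield the node, then its parent, and so on up to the root."""
    while True:
        yield node
        if node not in parent:
            return
        node = parent[node]

def subtree_product_sets(product_categories, parent):
    """For each node, collect the set of products in its subtree (each counted once)."""
    members = defaultdict(set)
    for product, categories in product_categories.items():
        for cat in categories:
            for node in ancestors_and_self(cat, parent):
                members[node].add(product)
    return members

members = subtree_product_sets(product_categories, parent)
subtree_count = {node: len(ps) for node, ps in members.items()}

# Naive alternative: add up per-category counts over node 1 and its children.
product_count = defaultdict(int)
for categories in product_categories.values():
    for cat in categories:
        product_count[cat] += 1
naive_sum_node1 = product_count[1] + product_count[2] + product_count[3]

print(subtree_count[1], naive_sum_node1)  # 3 5
```

The set-based count needs product-level information, which is why it cannot be recovered from the per-category numbers alone.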
Some Example Designs
Note: you might try to implement one of these ideas, but you probably won’t get credit for creativity :-). Working out the details, and making a nice implementation with interaction, and doing a thorough analysis can still make one of these designs something to try.
In some cases, these are purposefully bad ideas to give you something to do better than.
- You could present the list of nodes in a table, and use interaction (sorting, filtering) to help the viewer see interesting things. You might pre-compute quantities of interest that make more things findable by sorting and filtering. If you provide a table, you need to explain why what you did is better than just loading the data into a good spreadsheet program (loading the data into Excel and using its table tools is really interesting).
- You could draw a standard node-link tree. Each node will be a small dot (since there are roughly 29,000 of them). You’d probably want to provide interaction: pan, zoom, give node details on hover, etc. You’ll probably want some interaction to help reduce the data, such as interactively hiding/revealing things.
- You could show the product counts in a hierarchical treemap. (you can actually do this with Excel, if you are skilled with Excel). You probably want to create some interactions.
- You might try some more unusual tree representations (check treevis.net for more than 300 ideas).
For these (or any design) be sure to think about what you can and cannot see with the design.
We are not necessarily looking for answers to specific questions that could be answered with a simple query or chart. Your design should support some exploration to help the viewer either better target their question, or get more details about an answer. “What are the biggest categories?” could be answered with a simple query (Excel table above) – but it could also be a starting point for an interactive exploration (what are the kinds of categories within the big things/small things, etc.). A simple chart that tells a single story is unlikely to be satisfying.
Given the size of the data, some interaction is likely to be necessary since you can’t show everything, and you want the viewer to be able to “look around” to find what they are interested in.
Interaction
Interaction and summarization are likely to be your two best weapons against the scale of this data.
Implementing interaction well is hard. So, if you decide to do programming to build an interactive system, we will have realistic expectations.
However, it is possible for you to prototype an interactive design without actually programming an interactive system. You can simulate the system as a series of screen shots / storyboard / comic book, showing the sequence of steps and explaining the viewer’s actions. You can do this with a sequence of screen shots from your own program (it might not be really interactive), visualizations you made with some tool, sketches, …
Even if you do implement a system, you will need to describe it well. Unfortunately, we won’t be able to run everyone’s program. Even arranging for demos will be hard (since it will be after the end of the semester). Make sure your document lets us appreciate what you have done. Videos are a good way to do this (see “what to turn in” below).
The Simple Data
We have processed the huge product metadata file and produced a (relatively simple) CSV file that just contains information about the product categories.
Each row in the file represents a category. For each category, you have the following information:
- id: each node is assigned a unique integer id. The id 0 is the root of the tree (which isn’t actually a category)
- name: the string name of the node. The names are not unique: the same name can occur as a child of different nodes. (“Accessories” occurs dozens of times – it’s a sub-category of women’s clothing, men’s clothing, video games, automotive, …)
- productCount: how many products list this category as one of their categories
- subtreeProductCount: how many products are in this node’s subtree (this node, or any of its descendants). If any of a product’s categories is this node, or has this node as an ancestor, the product is counted. Care is taken to make sure things are not double counted. The root node of the tree should have the actual total product count (of all products that are in at least one valid category).
- parent: the id of the category that is the parent of this category
- numChildren: the number of children (sub-categories) this category has
- pathName: the list of the names of the entire “path” of nodes leading to this node
- children: a list of the IDs of the children of the node
- alsoCount: how long the whole list of alsos is (since we truncate after the 100 most common)
- alsos: a list of the nodes for which some product also has that node. So, if you see [(10,5), (6,4)] that means that there are 5 products with the current category and category 10, and 4 products that also have category 6. A single product can be in many categories, so there may be overlaps (e.g., items in 10, 6 and the current category). Because of the limits of CSV formatting, we limit this list to 100 items (we give the most common).
We will provide a CSV for the entire data set (29K rows), as well as some smaller CSVs for testing. And yes, there is a “blank” category (it has 194000 products!) – I am not sure what is up with that – some products just seem to have this “non-category” as a category.
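As a starting point for reading the file, here is a minimal sketch. It assumes the list-valued columns (pathName, children, alsos) are stored as Python-style literals, as the `[(10,5), (6,4)]` example suggests. The two sample rows and the `load_categories` helper are made up for illustration, so check the real file’s formatting (for example, how the root’s parent is written) before relying on it.

```python
import ast
import csv
import io

# Hypothetical two-row sample using the column names listed above.
SAMPLE = '''id,name,productCount,subtreeProductCount,parent,numChildren,pathName,children,alsoCount,alsos
1,Books,10,25,0,1,"['Books']","[2]",2,"[(10, 5), (6, 4)]"
2,Dragons,15,15,1,0,"['Books', 'Dragons']","[]",0,"[]"
'''

INT_COLUMNS = ("id", "productCount", "subtreeProductCount",
               "parent", "numChildren", "alsoCount")
LIST_COLUMNS = ("pathName", "children", "alsos")

def load_categories(csv_file):
    """Read category rows into a dict keyed by integer id."""
    nodes = {}
    for row in csv.DictReader(csv_file):
        for col in INT_COLUMNS:
            row[col] = int(row[col])
        for col in LIST_COLUMNS:
            # Assumes the cell holds a Python-style literal such as "[(10, 5), (6, 4)]".
            row[col] = ast.literal_eval(row[col])
        nodes[row["id"]] = row
    return nodes

nodes = load_categories(io.StringIO(SAMPLE))
print(nodes[1]["alsos"])     # [(10, 5), (6, 4)]
print(nodes[2]["pathName"])  # ['Books', 'Dragons']
```

To read a real file, pass an open file object instead, for example `open(path, newline="", encoding="utf-8")`.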
For those of you who use Python, I will provide a Python pickle file of the tree (as well as the code that processes the raw data into the CSV and tree). The tree has all of the “alsos”, an example product for each category, and generally puts the tree into a convenient Python data structure.
You can find the data and the example code on Canvas in a file directory.
More Data
There are many things that we did not give you. We did not give you product level information. You cannot tell anything about the products in categories. We only provide crude information about categories that co-occur – you can’t even tell how many products there are (since category counts include products in other categories). You don’t know if there are products that have 7 categories (there are), or if certain categories get higher ratings, or …
We purposefully limited the data. What we’ve given you in the “simple” CSV file is large and complex enough to do interesting things.
But, if you want to do things with data we didn’t give you, you are welcome to obtain the “raw” data and extract more from it.
The source data is derived from datasets provided by Prof. Julian McAuley at UCSD. The web page documents the data formats and data well. We are only using the product metadata. We will provide the Python script that processes the metadata JSON file and makes the CSV file.
Be warned: the entire metadata file is huge (3.2GB compressed). It takes over 15 minutes to extract the CSV file (on a fast PC). My extraction code is provided in case you want to extend it.
If you add data, be sure to explain that in your writeup.
Tasks
You need to pick tasks that visualization really helps with: it shouldn’t be something that can be done with a simple list (or even a simple bar chart). Any simple question (“What are the top 10 product categories?”) is probably not a good task for this assignment. If it isn’t obvious why your visualization adds value, explain it in your writeup.
Be sure to pick tasks that you can actually solve with the data that you have. It would be nice to show the correlation between how specific a product is categorized and how well it sells, but we don’t have sales data (unless you extract it – see More Data above).
You can pick the tasks that you create designs for. We’ll brainstorm a little in class, and provide some ideas. To get you started, consider:
- What “shape” is the tree? What kinds of categories are “deep” and have many subcategories vs. using broad categories? Which kinds of things have more finely divided categories?
- How are products distributed across the tree? Are there surprises?
- What are common pairings of categories? Are there things paired that are “far away”? Are there surprising combinations?
- Are some product types more finely categorized?
- Are some kinds of products more likely to be in multiple categories?
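Several of these questions (the “shape” of the tree, pairings that are “far away”) reduce to quantities you can precompute from the parent column. The following is a rough sketch on a hypothetical toy tree. The `parent` dictionary and the helper names are assumptions for illustration, and “distance” here simply means the number of edges on the path through the lowest common ancestor.

```python
from collections import Counter

# Hypothetical toy tree as id -> parent id, mirroring the `parent` column.
# The root (id 0) has no entry.
parent = {1: 0, 2: 0, 3: 1, 4: 3, 5: 2}

def path_to_root(node, parent):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def depth(node, parent):
    """Number of edges between the node and the root."""
    return len(path_to_root(node, parent)) - 1

def tree_distance(a, b, parent):
    """Number of edges between a and b via their lowest common ancestor."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for steps_from_b, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes are not in the same tree")

nodes_per_depth = Counter(depth(n, parent) for n in [0, *parent])
print(sorted(nodes_per_depth.items()))  # [(0, 1), (1, 2), (2, 2), (3, 1)]
print(tree_distance(4, 5, parent))      # 5
```

A histogram of nodes per depth gives a first impression of the tree’s shape, and the tree distance of each alsos pair is one possible way to flag pairings that span distant branches.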
Programming
You do not have to program to do this assignment. If you want to turn in a “sketched by hand” assignment, that is possible. However, you will need to find other ways to excel: your designs will need to be extra interesting, your descriptions clear, and your rationales compelling. See evaluation below. Also see the notes on interaction.
If you do program…
- We may not be able to run your program. Your documentation must be complete and show off what the program can do. Be sure to describe things well and give pictures. You can even provide a video.
- You can use any tools you like. We do not restrict you in terms of languages, libraries, etc. You do need to tell us what you’ve used.
- You do need to turn in everything we would need to run your program (in terms of the source code). However, we understand that we may not have the right environment to run it. Therefore, we may ask you to give a demo on your own computer (if your program requires a demonstration).
- We will consider how expert you are with the tools you use. We’ll ask you in the self-assessment. However, there is a limit to how much value we will place on you learning tools. You can get some credit for learning new tools, but ultimately, you need to create visualizations that are well documented.
Working with a partner
You may work with a partner for this project, subject to some rules.
- You must identify your partner in all hand-in phases (this means that you need to choose your partner and tell us as part of phase 1). Therefore, you must pick your partner by the phase 1 deadline (Nov 20).
- Both partners must agree to work together. Once you agree to work together, we will only allow groups to split if they make an explicit request by email to the professor and TA.
- Both partners will get the same grade. We will not try to determine who did what.
- Your partner must be from a different department/degree program. Less than half the class are CS grad students, so this is possible. We will grant exceptions to this rule if you can explain that you haven’t worked together before and provide different backgrounds. We will also provide help with finding partners.
- Only one partner should turn in the assignments (be sure to have both names). The other partner should turn in something saying that their work has been handed in by their partner.
What to turn in
Phase 1 – November 20 (Canvas) – all you need to do is tell us who your partner is, or that you aren’t going to work with a partner. It’s a Canvas type-in. It’s mainly to check that you’ve read the assignment.
Phase 2 – November 27 (Canvas) – all you need to do is tell us that you’ve been thinking about the assignment. We’re not going to look at what you’ve done at this point (but feel free to come to office hours to discuss your plans).
Phase 3 – December 4 (Canvas) – please upload a PDF or image to show us that you’ve been working on something and making progress. We’ll just check that there’s something there. You won’t believe how many students won’t start working on a project unless they have some deadline in place. We won’t provide feedback, but if you want to come to office hours or make an appointment we’re happy to talk to you about your ideas.
Phase 4 – Final Turnin – (Canvas) – December 11th is the last day of class, but you can turn things in “late without penalty” (until December 14th). This weirdness about the dates is because of the University exam policy. We really do need to have your assignments by the 15th because we need time to grade them.
We need to begin grading on December 15th, so we need to make the deadline Saturday, December 14th. If you want to turn things in late (on December 15th), you need to tell us (in the Canvas type-in box for the assignment) that you will be turning things in late. We may assess a penalty for late assignments. If you don’t turn things in on December 15th, we may not be able to grade it.
You can turn in up to 3 things (numbers 2 and 3 are optional if you don’t program; #2 is required if you program):
1. A single PDF with your writeup. This has to include a description of your design(s), a discussion of your implementation, a critique of how it applies to tasks, instructions to run the program, anything else we need to know, … (some more info below). Please submit one PDF on Canvas – do not put it inside of your ZIP file.
2. A single ZIP file with all of your code and resources. If the file is bigger than 25MB, we will give you an alternate mechanism for getting it to us. You may provide us with access to a GitHub repo. If you can make your system run on the web (for example, we’ve seen assignments using GitHub pages or Heroku, or using the CS public web servers), we will be able to try it out. We understand that such web deployments take extra effort, and will reward people who do it.
3. A video (no longer than 5 minutes). If the file is more than 20MB, we will give you an alternate mechanism for getting it to us.
Note: Everyone must turn in a PDF file on Canvas. If you turn in other things (ZIP, video), please put the PDF as a separate file on Canvas.
Notes on the final turnin
Your documentation is the main thing we will look at. You should assume that we will not be able to run your program. If we feel like we need to see a demo of your system, we may ask you – but this may be hard since it’s exam week.
You are welcome to turn in a video (no longer than 5 minutes) demonstrating your “system.” A few years ago, a student turned in a video made by stop-motion animating post-its on a whiteboard – so you can make a video without a system. Making a “good” video is hard (we’ll discuss this in class), but a brief screencast video with voiceover can be a quick way to show off interactivity in the system. Do not upload files more than 10MB to Canvas. We’ll provide instructions on how to turn things in.
Your PDF document should contain the following:
1. Descriptions of the Designs and Their Intents: Describe your visualizations, their intended tasks, and the rationale for why the designs address the tasks. Provide pictures (screenshots). Remember, we probably will not be able to run your code – so you need to show it off in the document. Provide examples of how your designs make it easy to see the things they are supposed to help the viewer see. It should be clear from your description what the intended user experience is.
2. Use Case Evaluation: Show examples (e.g., screenshots with descriptions) that show that your designs really address the tasks that they are meant to address. (This is part of 1, but is so important that I emphasize it.)
3. Discussion of Interaction: Be clear about what it is like to use the system. Be explicit (what can you click on or not). (This is part of 1, but is so important that I emphasize it.)
4. Discussion of Data and Findings: Explain how much you were able to use your tool with the real data, and give examples of what it helped you see. This might be as simple as “Everything is a sketch, I just looked at the CSV file in Excel to get a sense of what’s there before I tried to come up with a design.” If your tool really works, give us some examples (this is related to 2, but is more about what you did with the tool than the tool itself).
5. Self-Assessment: Give an honest assessment of your project. This is a place to say that you had bigger goals but had to scale back to fit the reality of a 4-week class project, or that you are really pleased with what you’ve done. Please give your honest assessment of your familiarity with the tools that you used, and how much of your energy for this project went into learning those tools. For example, you might say “I am an experienced Python programmer, and used to having to learn new APIs, so picking up Bokeh wasn’t a big deal” or “I had never done any JavaScript programming, so I spent a ton of time working through a lot of tutorials to learn D3”.
We ask that you turn in “all” source code for your program, so that we would have a reasonable chance of being able to build/run it ourselves. There is a fine line between including obscure libraries that we might not have, and bundling up the whole universe. Use your judgment. Definitely include anything that you have written. And remember that in the writeup, we need instructions.
Some hints…
- You can see some interesting things just by loading the data into Excel and sorting the tables in various different ways. I strongly recommend that you look at the data and get a sense of it using something like Excel. Your visualizations should show more complex and interesting things.
- Showing the entire big tree at once may not be a good idea. Just putting roughly 29,000 names in one image would be a bit much. Even 29,000 dots would be. You’ll need some scalability strategy.
- Think about interaction. If you can’t implement it, you can describe it and show before/after pictures.
- Think about summarization. How can you combine things (aggregate) so you don’t need to show as much, and provide details on demand?
- If you zoom in (to show a portion of the tree), consider focus + context to help the viewer remember where they are.
- There is no shortage of clever ways to show trees. Check TreeVis.Net.
- BarcodeTree is a brand new way to show trees compactly. I am not sure if I like it, but it did make me think.
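As one illustration of the summarization hint above, this sketch keeps the largest children of a node and folds the rest into a single “other” entry that could be expanded on demand. The `rows` dictionary and the `summarize_children` helper are hypothetical, and the choice of `top_k` is arbitrary.

```python
# Hypothetical toy rows: id -> (name, parent id, subtreeProductCount).
rows = {
    0: ("root", None, 100),
    1: ("Books", 0, 60),
    2: ("Toys", 0, 30),
    3: ("Garden", 0, 7),
    4: ("Music", 0, 3),
}

def summarize_children(node_id, rows, top_k=2):
    """Return the top_k largest children of a node plus one aggregated 'other' entry."""
    children = [(name, count)
                for name, parent, count in rows.values()
                if parent == node_id]
    children.sort(key=lambda item: item[1], reverse=True)
    shown, hidden = children[:top_k], children[top_k:]
    if hidden:
        # Caveat: summing subtree counts can overcount products filed under several siblings.
        shown.append((f"other ({len(hidden)} categories)",
                      sum(count for _, count in hidden)))
    return shown

print(summarize_children(0, rows))
# [('Books', 60), ('Toys', 30), ('other (2 categories)', 10)]
```

Applied level by level, this kind of aggregation keeps each view small while leaving the hidden categories one interaction away.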
Grading
We will give you a single A-F grade (on the 0-100 scale, 90=A). It will be based on what you turn in at the end. If you fail to turn in any of the preliminary phases, we may penalize you.
Your grade will be a combination of:
- Quality of the implementation (how impressive is what you’ve done)
- Quality of the design (how interesting, well thought out, adapted to the tasks, …)
- Quality of the analysis and rationale (how well do you convince us that you’ve come up with designs that really address tasks)
- Quality of the presentation (write-up, video, …)
You can score points in any category. If your implementation is just a sketch, you won’t get much for implementation, but you can make up for it in the other categories.