AbstractsViewer Owner Manual

Created November 2022 by Keaton Leppanen


This document is for project owners of Abstracts Viewer and is not written with general consumption in mind.


Table of Contents

System Overview
  System Architecture
Data Curation
  Formatting
  Scraping
  Storage
Data Processing
  Reformatting
  Heatmap Generation
  Statistics Generation
Web App Deployment
  Backend
  Frontend
GitHub Workflow

System Architecture

AbstractsViewer’s backend runs on CSL machines and is composed of a number of containers (one for each dataset). These containers are deployed and managed through Portainer, which creates them from a Docker image stored on DockerHub. AbstractsViewer’s frontend is deployed and hosted on GitHub Pages.

The project’s code and data are distributed across five locations:

AbstractsViewer’s backend runs on CSL machines and is deployed/managed via Portainer: graphics-experiments/portainer

A DockerHub repository stores the Docker images used to create the backend containers: abstractsviewerdockerid/backend

Data used by the application is stored on CSL machines at /p/graphics/MountPoints/AbstractsViewerData

The data is documented and archived in a GitHub repository (this copy is not actively used by the application): uwgraphics/AbstractsData

A backup of the data is stored on UW-Madison Box (this copy is not actively used by the application): Vis Data Set Storage/AbstractsData

Data Curation

Formatting

relevant repositories: uwgraphics/AbstractsData

AbstractsViewer’s data is stored in the form of RIS files. Each RIS file contains the information for the documents from a particular venue/source. A single document’s data is stored as a list of tags. For examples of what these files look like, view the uwgraphics/AbstractsData repository.

Example RIS data for a single conference paper: RIS Example

Note: each document’s list of tags must begin with the TY tag and end with the ER tag.

More details on the RIS format and tag meaning can be found here: https://en.wikipedia.org/wiki/RIS_(file_format).

The tags used by AbstractsViewer are:

  • TY* - type of reference (CONF or JOUR for the most part), must be the first tag
  • TI* - Title
  • T2* - Secondary Title (journal title, if applicable)
  • SP - Start Page
  • EP - End Page (optional)
  • AU* - Author (each author gets its own AU tag; you can have any number of AU tags)
  • PY* - Publication Year
  • KW - Keyword (each keyword gets its own KW tag; you can have any number of KW tags)
  • DO - DOI number
  • JO - Journal/Periodical Name
  • IS - Issue Number
  • SN - ISBN/ISSN
  • VO - Published Standard Number
  • VL - Volume Number
  • JA - Periodical Name: standard abbreviation
  • Y1 - Primary Date, of the form 25-30 Oct 2020
  • AB* - Abstract
  • ER* - End Record (must be the last tag)

Tags marked with * are required. For any other tag, if you do not have data for it, you can leave it blank or omit it entirely.
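For illustration, a minimal hypothetical record using mostly required tags might look like this (all values are made up):

    TY  - CONF
    TI  - An Example Paper Title
    T2  - IEEE Visualization Conference (VIS)
    AU  - Doe, Jane
    AU  - Smith, Alex
    PY  - 2022
    AB  - A short abstract describing the example paper.
    ER  - 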

Scraping

relevant repositories: uwgraphics/AbstractProcessor

For academic papers from conferences and journals, the most effective way of acquiring data is web scraping.

However, given the wide array of data sources in the AbstractsViewer dataset, there is no one-size-fits-all scraping solution. Some tools exist in the uwgraphics/AbstractProcessor repository; however, they were custom-tailored to previous data sources and may or may not be useful for future ones.

A good place to start is the semantic_scholar.py script, which scrapes the necessary data from Semantic Scholar. This script requires the DOIs of the papers you want to scrape; to obtain them, you will likely need to scrape a conference website or assemble them from published proceedings.
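For orientation, a minimal sketch of the kind of lookup semantic_scholar.py performs, assuming the public Semantic Scholar Graph API (this is not the repository’s script, and the DOI below is a placeholder):

    # Minimal sketch: fetch metadata for a list of DOIs from the Semantic
    # Scholar Graph API. semantic_scholar.py is the real tool; this only
    # illustrates the general approach.
    import time
    import requests

    FIELDS = "title,abstract,year,authors,venue,externalIds"

    def fetch_paper(doi):
        url = "https://api.semanticscholar.org/graph/v1/paper/DOI:" + doi
        resp = requests.get(url, params={"fields": FIELDS}, timeout=30)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        dois = ["10.0000/placeholder.doi"]  # placeholder DOI list
        for doi in dois:
            paper = fetch_paper(doi)
            print(paper.get("title"), paper.get("year"))
            time.sleep(1)  # the public API is rate limited; be polite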

Another approach is to locate the publication containing the papers in question on IEEE Xplore or the ACM Digital Library and either download or scrape the information from there. You may find ieee_scraper.py to be a useful starting point.

However, more likely than not, you will need to create a custom scraper to assemble the data - sometimes drawing information from multiple sites in the process.
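As a starting point, the skeleton of such a scraper might look like the following (the URL and CSS selectors are hypothetical; every venue’s site differs):

    # Hypothetical skeleton for a custom scraper using requests + BeautifulSoup.
    # The URL and the CSS selectors are placeholders; adapt them per site.
    import requests
    from bs4 import BeautifulSoup

    def scrape_listing(url):
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        papers = []
        for entry in soup.select(".paper-entry"):  # site-specific selector
            papers.append({
                "title": entry.select_one(".title").get_text(strip=True),
                "doi": entry.select_one("a.doi")["href"],  # site-specific
            })
        return papers

    papers = scrape_listing("https://example.org/proceedings/2022")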

Storage

relevant repositories/links: uwgraphics/AbstractsData, Vis Data Set Storage/AbstractsData

The raw data used for AbstractsViewer is archived in two locations: the uwgraphics/AbstractsData GitHub repository and UW-Madison Box (Vis Data Set Storage/AbstractsData).

Whenever you update a dataset or add a new one, you should update both archive locations accordingly.

Data is organized into high-level datasets, the two most notable being the VIS and Robotics datasets, each represented by a directory. Within those are directories for each conference/journal included in the dataset, and within those are individual RIS files for each year for which we have data.

There are README files at all levels of the dataset describing its contents, holes, and any discrepancies or things of note. Make sure to update these to reflect any new data added. There are also heatmaps that help summarize the data; these should also be updated (see Heatmap Generation under Data Processing).

Data Processing

relevant repositories: uwgraphics/AbstractProcessor, uwgraphics/AbstractsViewer

Reformatting

Raw AbstractsViewer data is stored in RIS form; however, later stages of the application require the data to be converted into CSV format.

To do this, run process.py in the uwgraphics/AbstractProcessor repository. You will need to modify the file path in the code to reflect where you are storing the RIS files locally. If you are adding a new dataset, you should add the corresponding publication venue to the shortBookTitle and translateBookTitles lists. The script produces two CSV files: abstracts.csv and heatMapInfo.csv.

We will use these files in the following sections.
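For orientation, the conversion process.py performs is conceptually similar to the sketch below (illustrative only; the actual script and its output columns may differ):

    # Sketch of an RIS -> CSV conversion. process.py is the real
    # implementation; the output column names here are assumptions.
    import csv

    def parse_ris(path):
        records, current = [], {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                if len(line) < 6 or line[2:5] != "  -":
                    continue  # skip blank/continuation lines in this sketch
                tag, value = line[:2], line[6:].strip()
                if tag == "ER":
                    records.append(current)
                    current = {}
                elif tag in ("AU", "KW"):
                    current.setdefault(tag, []).append(value)
                else:
                    current[tag] = value
        return records

    records = parse_ris("vis_2022.ris")  # hypothetical file name
    with open("abstracts.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "venue", "year", "authors", "abstract"])
        for r in records:
            writer.writerow([r.get("TI", ""), r.get("T2", ""), r.get("PY", ""),
                             "; ".join(r.get("AU", [])), r.get("AB", "")])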

Heatmap Generation

We generate heatmaps to provide an overview of the dataset.

To do this, run the gen_heatmap.py script in the uwgraphics/AbstractProcessor repository. It takes the heatMapInfo.csv file and produces three heatmaps (a sketch of the idea follows the list below):

  • The overall number of documents per year and conference
  • The percentage of documents that come from a particular conference year (e.g. # of VIS 2022 papers / total # of VIS conference papers)
  • The percentage a given year’s documents comprise relative to the largest conference year (# of VIS 2022 papers / # of VIS papers from the year with the most papers)
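As a sketch of the first heatmap (gen_heatmap.py is the authoritative implementation; the heatMapInfo.csv column names used here are assumptions about its layout):

    # Sketch: documents-per-year-and-conference heatmap from heatMapInfo.csv.
    # The column names "conference", "year", and "count" are assumptions;
    # gen_heatmap.py is the real implementation.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("heatMapInfo.csv")
    table = df.pivot_table(index="conference", columns="year",
                           values="count", aggfunc="sum", fill_value=0)

    fig, ax = plt.subplots(figsize=(12, 6))
    im = ax.imshow(table.values, aspect="auto", cmap="viridis")
    ax.set_xticks(range(len(table.columns)), table.columns, rotation=90)
    ax.set_yticks(range(len(table.index)), table.index)
    fig.colorbar(im, ax=ax, label="number of documents")
    fig.tight_layout()
    fig.savefig("documents_per_year_heatmap.png")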

Once generated, these heatmaps should be used to update both data archives: uwgraphics/AbstractsData and Vis Data Set Storage/AbstractsData.

You should generate new heatmaps for the entire dataset as well as more specific heatmaps for the datasets you added or modified. To do this, rerun process.py on the requisite subset of the data to generate a new heatMapInfo.csv containing data only from that specific dataset, then run gen_heatmap.py using that new CSV file.

Statistics Generation

TODO

Web App Deployment

relevant repositories: uwgraphics/AbstractsViewer22

Backend

Precomputation

In order to make AbstractsViewer more efficient, time-intensive operations, such as generating embeddings and computing distances, are performed prior to deployment in a precomputation stage.

To begin, you will need to configure a Python environment according to the requirements.txt file in the uwgraphics/AbstractsViewer22 repository.

Create a directory in the repository named data_[your dataset name], e.g. data_visUpdated. This is where we are going to store the results of our precomputation.

Next, navigate to the precomputation directory in the uwgraphics/AbstractsViewer22 repository.

Follow the directions in the directory’s README file. When asked for a CSV file, use the abstracts.csv generated in the Data Processing step. When asked for an output directory suffix and/or an output file suffix, use the name of your dataset (this should be the same as your directory name, i.e. visUpdated). The directory suffix tells the scripts where to place the precomputed data; the file suffix tells them how to label the data.

At the end you should have all of the following files (shown here with the example suffix vis2022; yours will use your dataset name). Double-check that they are all present, as they are necessary for the application to function:

  • abstracts_vis2022.json
  • seDist_vis2022.npy
  • tfidfTsneCoords_vis2022.json
  • data_vis2022.json
  • seUmapCosine_vis2022.json
  • tfidfTsne_vis2022.json
  • docs_vis2022.json
  • specterDist_vis2022.npy
  • tfidfUmapCosine_vis2022.json
  • index_vis2022/
  • specterUmapCosine_vis2022.json
  • tfidfUmapHiCosine_vis2022.json
  • knnpickle_specter_vis2022
  • specterUmap_vis2022.json
  • tfidfUmapHi_vis2022.json
  • knnpickle_tfidf_vis2022
  • stemToWord_vis2022.json
  • tfidfUmapMedCosine_vis2022.json
  • nmfDist_vis2022.npy
  • tfidf2dDist_vis2022.npy
  • tfidfUmapMed_vis2022.json
  • nmfUmapCosine_vis2022.json
  • tfidfDist_vis2022.npy
  • tfidfUmap_vis2022.json
  • nmfUmap_vis2022.json
  • tfidfLsaCoords_vis2022.json
  • useUmap_vis2022.json
  • se2dDist_vis2022.npy
  • tfidfLsa_vis2022.json

Once you have ensured you have all the necessary files, upload them to the backend machines.

Simply scp the directory containing all the precomputed files to the CSL machine directory: /p/graphics/MountPoints/AbstractsViewerData.
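For example (the hostname and username below are placeholders; use your own CSL login host):

    scp -r data_visUpdated <username>@<csl-hostname>:/p/graphics/MountPoints/AbstractsViewerData/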

Docker Image Update

This step is only required if you want to modify the backend code or update the packages used by the backend; it is not necessary for simply updating or adding new data.

The backend Docker image is the template used for all the containers running the backend. It contains the backend code as well as all the environment requirements for running it, and it is stored on DockerHub, where it is accessed by Portainer to create the backend containers. If we want to modify either the code or the environment of the backend, we need to create and push a new Docker image.

The code for the backend can be found in the backend directory of the uwgraphics/AbstractsViewer22 repository.

The environment requirements for the Docker image can be found in the requirements.txt document in the uwgraphics/AbstractsViewer22 repository.

Use the command docker login -u abstractsviewerdockerid to log into the project’s DockerHub account (you will be prompted for a password; obtain it from a colleague).

Make any necessary changes and then, while inside the repository, follow the ‘Build Docker Image’ instructions in the README file in the backend directory. This will have you create a new Docker image and push it to the abstractsviewerdockerid/backend DockerHub repo.
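The README is authoritative, but the commands typically look something like the following (the latest tag is an assumption):

    # Illustrative only; follow the backend README's 'Build Docker Image' steps.
    docker build -t abstractsviewerdockerid/backend:latest .
    docker push abstractsviewerdockerid/backend:latest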

The changes will take effect once you redeploy the backend from the Portainer console (covered in the next section).

Portainer

Portainer is a tool used to manage and deploy the containers that comprise AbstractsViewer’s backend. To access the Portainer console, navigate to: graphics-experiments/portainer.

Portainer uses the concept of stacks to group containers. There is a separate stack for each instance of AbstractsViewer (AbstractsViewer, AbstractsViewerHistorical, and AbstractsViewerExperimental). Each stack contains a group of containers deployed on the CSL machines, and each container corresponds to a dataset.

Updating backend/data: If all we want to do is update an existing dataset or the backend code, we only need to click the ‘Stop this stack’ button to shut down all containers in the current stack (never shut down individual containers, as this causes issues; always stop the entire stack), then navigate to the stack’s Editor tab, toggle on ‘Re-pull image and redeploy’, and click the ‘Update the stack’ button. This will redeploy all the containers using the most recent Docker image uploaded to DockerHub as their template. It will also result in them using the latest precomputation data.

Adding a new dataset: If we want to add a new dataset, we will need to modify the Docker Compose file to instruct Portainer to create a new container for the dataset and make it accessible to the frontend.

To do this, navigate to the Editor view, where you can edit the compose file, and create a new entry under the services section (you can copy and modify one of the existing ones). It should follow this format: Portainer Config

Replace [DATASET NAME] with the name of your dataset; this should correspond to the directory and file suffix you used in the precomputation step.

The first [Dataset Name] at the top controls the name of the container. The labels section specifies how the frontend can access the container. The command section actually deploys the backend on the container by running the code in the backend directory of the uwgraphics/AbstractsViewer22 repository, and the volumes section mounts the directory where we have stored the precomputed data on the CSL machine into the container’s file system. Finally, you should click the ‘Update the stack’ button to deploy the new container.
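As a purely hypothetical sketch of such an entry (copy an existing entry from the live compose file rather than this; everything below except the image name and the CSL data path is an assumption):

    services:
      [dataset name]:                  # controls the name of the container
        image: abstractsviewerdockerid/backend:latest
        labels:
          - ...                        # copied from an existing entry; tells
                                       # the frontend how to reach this container
        command: ...                   # runs the backend code for this dataset
        volumes:
          # mounts the precomputed data directory into the container
          - /p/graphics/MountPoints/AbstractsViewerData/data_[dataset name]:/data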

Once you are done configuring the Docker Compose file on Portainer, you should also update the Docker Compose file in the uwgraphics/AbstractsViewer22 repository. It does not affect the actual system, but it is a way of documenting the compose files, since Portainer does not have any version control.

Note that you still have to configure the frontend to talk to this new container (see the Frontend section).

Frontend

Updates

If you are updating frontend code, you should be able to directly modify the source code in the uwgraphics/AbstractsViewer22 repository, and GitHub will automatically redeploy the website with the new changes.

If you are adding a new dataset or have otherwise reconfigured the backend, you will need to modify the vars.json file in the src directory to include the new container’s hostname (found in the labels section of the Docker Compose file from the backend update process), and add the dataset to the datasetList in the App.vue file; the name you assign the dataset in that list is what appears on the website.
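For illustration only (the shapes below are hypothetical; mirror the existing entries in both files, and note the hostname is a placeholder):

    // src/vars.json -- add an entry mapping your dataset to its backend hostname
    "mydataset": "<hostname from the compose file's labels section>"

    // src/App.vue -- add the display name to datasetList
    datasetList: ["VIS", "Robotics", "mydataset"]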

GitHub Workflow

You hopefully should not have to worry about the GitHub workflow unless you are creating a new repository or making some other major change.