AbstractsViewer Owner Manual
Created November 2022 by Keaton Leppanen
This document is for project owners of Abstracts Viewer and is not written with general consumption in mind.
Table of Contents
- System Overview
- Data Curation
  - Formatting
  - Scraping
  - Storage
- Data Processing
  - Reformatting
  - Heatmap Generation
  - Statistics Generation
- Web App Deployment
  - Backend
    - Precomputation
    - Docker Image Update
    - Portainer
  - Frontend
    - Updates
- GitHub Workflow
System Overview
AbstractsViewer’s backend runs on CSL machines - it comprises a number of virtual machines (one for each dataset). These virtual machines are deployed and managed through Portainer, which creates them using a Docker Image stored on DockerHub. AbstractsViewer’s frontend is deployed and hosted on GitHub Pages.
The project’s codebase is distributed across five GitHub repositories:
- uwgraphics/AbstractProcessor: Tools for gathering and processing data
- uwgraphics/AbstractsViewer: The user facing web application (contains code necessary for running both the frontend and backend)
- uwgraphics/AbstractsViewer22: The historical, base copy of the web application
- uwgraphics/AbstractsViewerExperimental: The test environment
- uwgraphics/AbstractsViewerDocs: The user guide in the form of a Hugo site
AbstractsViewer’s backend is run on CSL machines and is deployed/managed via Portainer: graphics-experiments/portainer
A DockerHub repository is used to store the Docker Images used to create the backend virtual machines: abstractsviewerdockerid/backend
Data used by the application is stored on CSL machines at /p/graphics/MountPoints/AbstractsViewerData
The data is documented and archived on the Github repository (this data is not actively used by the application): uwgraphics/AbstractsData
A backup of the data is stored on UW-Madison Box (this data is not actively used by the application): Vis Data Set Storage/AbstractsData
Data Curation
Formatting
relevant repositories: uwgraphics/AbstractsData
AbstractsViewer’s data is stored in the form of RIS files. Each RIS file contains the information for the documents from a particular venue/source. A single document’s data is stored as a list of tags. For examples of what these files look like, view the uwgraphics/AbstractsData repository.
Example RIS data for a single conference paper:
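The record below is an illustrative sketch only: the title, authors, DOI, and abstract are placeholders, not a real paper. It uses the tags documented in the list that follows.

```
TY  - CONF
TI  - An Example Paper Title
T2  - IEEE Visualization Conference (VIS)
SP  - 1
EP  - 10
AU  - Doe, Jane
AU  - Smith, John
PY  - 2022
KW  - visualization
KW  - text analytics
DO  - 10.0000/placeholder.doi
AB  - A short placeholder abstract would go here.
ER  - 
```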
Note: each document’s list of tags must begin with the TY tag and end with the ER tag.
More details on the RIS format and tag meaning can be found here: https://en.wikipedia.org/wiki/RIS_(file_format).
The tags used by AbstractsViewer are:
- TY* – type of reference (CONF or JOUR for the most part), must be first tag
- TI* – Title
- T2* – Secondary Title (journal title, if applicable)
- SP – Start Page
- EP – End Page (optional)
- AU* – Author (each author gets their own AU tag - you can have any number of AU tags)
- PY* – Publication year
- KW – Keyword (each keyword gets its own KW tag - you can have any number of KW tags)
- DO – DOI number
- JO – Journal/Periodical Name
- IS – Issue Number
- SN – ISBN/ISSN
- VO – Published Standard number
- VL – Volume Number
- JA – Periodical Name: standard abbreviation
- Y1 – primary date of the form (25-30 Oct 2020)
- AB* – Abstract
- ER* – End Record (must be at end)
Tags marked with ‘*’ are required tags. If you do not have data for a given (optional) tag, you can leave it blank or not include it.
Scraping
relevant repositories: uwgraphics/AbstractProcessor
For academic papers from conferences and journals, the most effective way of acquiring data is through the use of web scraping.
However, with the wide array of data sources found in the AbstractsViewer dataset, there is no one-size-fits-all solution for scraping. Some tools to help with this process exist in the uwgraphics/AbstractProcessor repository; however, they are tailored to previously scraped data sources and may or may not be useful for future ones.
A good place to start is the semantic_scholar.py script, which scrapes the necessary data from Semantic Scholar. This script requires the DOIs of the papers you want to scrape; to obtain those, you will likely need to scrape them from a conference website or assemble them from published proceedings.
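For illustration, here is a minimal sketch of fetching one paper's metadata by DOI from the public Semantic Scholar Graph API; semantic_scholar.py is the canonical tool, and its internals may differ.

```python
import requests

# Fetch paper metadata by DOI from the Semantic Scholar Graph API.
# (Illustrative sketch; semantic_scholar.py in AbstractProcessor is the real tool.)
S2_URL = "https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}"
FIELDS = "title,abstract,authors,year,venue,externalIds"

def fetch_paper(doi: str) -> dict:
    resp = requests.get(S2_URL.format(doi=doi), params={"fields": FIELDS})
    resp.raise_for_status()
    return resp.json()

# paper = fetch_paper("10.xxxx/xxxxx")  # substitute a real DOI
# print(paper["title"], paper["year"])
```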
Another approach is to try to locate the publication containing the papers in question on IEEE Xplore or the ACM Digital Library and either download or scrape the information from there. You may find ieee_scraper.py to be a useful starting point.
However, more likely than not, you will need to create a custom scraper to assemble the data - sometimes drawing information from multiple sites in the process.
Storage
relevant repositories/links: uwgraphics/AbstractsData, Vis Data Set Storage/AbstractsData
The raw data used for AbstractsViewer is archived in two locations:
- GitHub: uwgraphics/AbstractsData
- UW Box: Vis Data Set Storage/AbstractsData
Whenever you update or add a dataset, you should update both archive locations accordingly.
Data is organized into high-level datasets, the two most notable being the VIS and Robotics datasets, each represented by a directory. Within those are directories for each conference/journal included in the dataset, and within those are individual RIS files for each year we have data for.
There are README files at all levels of the dataset describing its contents, holes, and any discrepancies/things of note. Make sure to update these to reflect any new data added. There are also heatmaps which help summarize the data; they should be updated as well (see Heatmap Generation under Data Processing).
Data Processing
relevant repositories: uwgraphics/AbstractProcessor, uwgraphics/AbstractsViewer
Reformatting
Raw AbstractsViewer data is stored in RIS form; however, later stages of the application require the data to be converted into CSV format.
To do this, run process.py in the uwgraphics/AbstractProcessor repository. You will need to modify the file path in the code to reflect where you are storing the RIS files locally. If you are adding a new dataset, you should add the corresponding publication venue to the shortBookTitle and translateBookTitles lists. The script produces two CSV files: abstracts.csv and heatMapInfo.csv. We will use these files in the following sections.
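For reference, the core of the conversion looks roughly like the sketch below; process.py is the canonical implementation, and the output column names here are illustrative rather than the actual abstracts.csv schema.

```python
import csv

# Parse RIS records (tag, two spaces, "- ", value) into dicts, then write CSV.
def parse_ris(path):
    records, record = [], {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if len(line) < 6 or line[2:5] != "  -":
                continue  # skip blank/continuation lines
            tag, value = line[:2], line[6:].strip()
            if tag == "ER":          # end of record
                records.append(record)
                record = {}
            elif tag == "AU":        # authors can repeat
                record.setdefault("AU", []).append(value)
            else:
                record[tag] = value
    return records

def write_csv(records, path):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "venue", "year", "authors", "abstract"])
        for r in records:
            writer.writerow([r.get("TI", ""), r.get("T2", ""), r.get("PY", ""),
                             "; ".join(r.get("AU", [])), r.get("AB", "")])
```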
Heatmap Generation
We generate heatmaps to provide an overview of the dataset.
To do this, run the gen_heatmap.py script in the uwgraphics/AbstractProcessor repository. It takes the heatMapInfo.csv file and produces three heatmaps:
- The Overall Number of Documents per Year and Conference
- The percentage of documents which come from a particular conference year (e.g., # of VIS 2022 papers / total # of VIS conference papers)
- The percentage a given year's documents comprise relative to the largest conference year (# of VIS 2022 papers / # of VIS papers from the year with the most papers)
Once generated, these heatmaps should be used to update both data archives: uwgraphics/AbstractsData and Vis Data Set Storage/AbstractsData.
You should generate new heatmaps for the entire dataset as well as more specific heatmaps for the particular datasets you added or modified. To do this, rerun process.py on the requisite subset of the data to generate a new heatMapInfo.csv with data only from that specific dataset, then run gen_heatmap.py using that new CSV file.
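As a rough sketch of what the first (raw count) heatmap involves; the heatMapInfo.csv column names assumed here (conference, year, count) are guesses, so check the actual file and gen_heatmap.py:

```python
import csv
from collections import defaultdict
import matplotlib.pyplot as plt

# Build a conference-by-year count matrix from heatMapInfo.csv and plot it.
counts = defaultdict(dict)
with open("heatMapInfo.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        counts[row["conference"]][int(row["year"])] = int(row["count"])

conferences = sorted(counts)
years = sorted({y for per_conf in counts.values() for y in per_conf})
matrix = [[counts[c].get(y, 0) for y in years] for c in conferences]

fig, ax = plt.subplots()
im = ax.imshow(matrix)
ax.set_xticks(range(len(years)), labels=years, rotation=90)
ax.set_yticks(range(len(conferences)), labels=conferences)
fig.colorbar(im, ax=ax)
fig.savefig("heatmap_counts.png", bbox_inches="tight")
```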
Statistics Generation
TODO
Web App Deployment
relevant repositories: uwgraphics/AbstractsViewer22
Backend
Precomputation
To make AbstractsViewer more efficient, time-intensive operations, such as generating embeddings and computing distances, are done prior to deployment in a precomputation stage.
To begin, you will need to configure a Python environment according to the requirements.txt file in the uwgraphics/AbstractsViewer22 repository.
Create a directory in the repository named data_[your dataset name], ex. data_visUpdated. This is where we are going to store the results of our precomputation.
Next, navigate to the precomputation directory in the uwgraphics/AbstractsViewer22 repository.
Follow the directions in the directory's README file. When asked for a csv file, use the abstracts.csv generated in the Data Processing step. When asked for an output directory suffix and/or an output file suffix, use the name of your dataset (this should be the same as your directory name, i.e. visUpdated). The directory suffix tells the scripts where to place the precomputed data; the file suffix tells them how to label the data.
At the end you should have all of the following files (shown here for a dataset suffix of vis2022). Double-check that you have every one, as they are all necessary for the application to function:
- abstracts_vis2022.json
- data_vis2022.json
- docs_vis2022.json
- index_vis2022/
- knnpickle_specter_vis2022
- knnpickle_tfidf_vis2022
- nmfDist_vis2022.npy
- nmfUmapCosine_vis2022.json
- nmfUmap_vis2022.json
- se2dDist_vis2022.npy
- seDist_vis2022.npy
- seUmapCosine_vis2022.json
- specterDist_vis2022.npy
- specterUmapCosine_vis2022.json
- specterUmap_vis2022.json
- stemToWord_vis2022.json
- tfidf2dDist_vis2022.npy
- tfidfDist_vis2022.npy
- tfidfLsaCoords_vis2022.json
- tfidfLsa_vis2022.json
- tfidfTsneCoords_vis2022.json
- tfidfTsne_vis2022.json
- tfidfUmapCosine_vis2022.json
- tfidfUmapHiCosine_vis2022.json
- tfidfUmapHi_vis2022.json
- tfidfUmapMedCosine_vis2022.json
- tfidfUmapMed_vis2022.json
- tfidfUmap_vis2022.json
- useUmap_vis2022.json
Once you have verified that you have all the necessary files, upload them to the backend machines: simply scp the directory containing all the precomputed files to the CSL machine directory /p/graphics/MountPoints/AbstractsViewerData.
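For example (the username and host below are placeholders for your own CSL credentials and a CSL machine):

```
scp -r data_visUpdated [csl-username]@[csl-host]:/p/graphics/MountPoints/AbstractsViewerData/
```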
Docker Image Update
This step is only required if you want to modify the backend code or update the packages used by the backend - it is not necessary for simply updating or adding new data.
The backend Docker Image is the template used for all the virtual machines running the backend. It contains the backend code as well as all the environment requirements for running it and is stored on DockerHub where it is accessed by Portainer to create the backend virtual machines. If we want to modify either the code or the environment of the backend, we need to create and push a new Docker Image.
The code for the backend can be found in the backend directory of the uwgraphics/AbstractsViewer22 repository. The environment requirements for the Docker Image can be found in the requirements.txt document in the same repository.
Use the command docker login -u abstractsviewerdockerid to log into the project's DockerHub account (you will be prompted for a passcode; obtain it from a colleague).
Make any necessary changes and then, while inside the repository, follow the ‘Build Docker Image’ instructions in the README file in the backend directory. This will have you create a new Docker Image and push it to the abstractsviewerdockerid/backend DockerHub repo.
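The README's instructions are authoritative; the core commands are typically of this shape (assuming the Dockerfile lives in the backend directory):

```
docker build -t abstractsviewerdockerid/backend:latest backend/
docker push abstractsviewerdockerid/backend:latest
```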
The changes will take effect once you redeploy the backend from the Portainer console (covered in the next section).
Portainer
Portainer is a tool used to manage and deploy the virtual machines which comprise AbstractsViewer’s backend. To access the Portainer console navigate to: graphics-experiments/portainer.
Portainer uses the concept of stacks to group virtual machines. There is a separate stack for each instance of AbstractsViewer (AbstractsViewer, AbstractsViewerHistorical, and AbstractsViewerExperimental). Each stack contains a group of virtual machines deployed on the CSL machines. Each VM corresponds to a dataset.
Updating backend/data: If all we want to do is update an existing dataset or the backend code, we only need to click the stop this stack button to shut down all VMs in the current stack (never shut down individual VMs, as this will cause issues; always shut down the entire stack), then navigate to the stack's Editor tab, click the update the stack button, and toggle on the Re-pull image and redeploy option. This will redeploy all the VMs using the most recent Docker Image uploaded to DockerHub as their template. It will also result in them using the latest precomputation data.
Adding a new dataset: If we want to add a new dataset, we will need to modify the Docker Compose file to instruct Portainer to create a new VM for the dataset and make it accessible to the front end.
To do this, navigate to the Editor view where you can edit the compose file, and create a new entry under the services section (you can copy and modify one of the existing ones). It should follow this format:
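The skeleton below is a hypothetical sketch: the image name and host-side mount path come from elsewhere in this document, but the exact labels, command, and container-side paths must be copied from an existing service entry.

```yaml
  [DATASET NAME]:
    image: abstractsviewerdockerid/backend:latest
    labels:
      # routing labels that expose this VM's hostname to the frontend;
      # copy them from an existing service and substitute the dataset name
    command: # copy from an existing service; runs the backend with the [DATASET NAME] suffix
    volumes:
      # mount the precomputed data on the CSL machines into the VM's file system
      # (the container-side path is an assumption; match the existing entries)
      - /p/graphics/MountPoints/AbstractsViewerData/data_[DATASET NAME]:/data_[DATASET NAME]
```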
Replace [DATASET NAME] with the name of your dataset - this should correspond to the directory and file suffix you used in the precomputation step.
The first [DATASET NAME] at the top controls the name of the VM. The labels section specifies how the front end can access the VM. The command section actually deploys the backend on the VM by running the code in the backend directory of the uwgraphics/AbstractsViewer22 repository, and the volumes section mounts the directory where we have stored the precomputed data on the CSL machine into our VM's file system. Finally, you should click the update the stack button to deploy the new VM.
Once you are done configuring the Docker Compose file on Portainer, you should also update the Docker Compose file in the uwgraphics/AbstractsViewer22 repository. It does not affect the actual system, but it is a way of documenting the compose file, since Portainer does not have any version control.
Note that you still have to configure the frontend to talk to this new VM (see the Frontend section).
Frontend
Updates
If you are updating frontend code, you should be able to directly modify the source code in the uwgraphics/AbstractsViewer22 repository, and GitHub will automatically redeploy the website with the new changes.
If you are adding a new dataset or have otherwise reconfigured the backend, you will need to modify the vars.json file in the src directory to include the new VM's hostname (found in the labels section of the Docker Compose file from the backend update process), and add the dataset to the datasetList in the App.vue file; the name you assign the dataset in the list will be what appears on the website.
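The exact shape of both files should be inferred from their existing entries; purely as a hypothetical sketch (every key and field name below is a guess):

```js
// vars.json - add the new VM's hostname alongside the existing entries, e.g.
//   "visUpdated": "https://[new-vm-hostname]"

// App.vue - add an entry to datasetList, e.g.
//   { name: "VIS (updated)", dataset: "visUpdated" }
```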
GitHub Workflow
You hopefully should not have to worry about the GitHub Workflow unless you are creating a new repository or making some other major change.