Design Process Figure#
This notebook walks through the design process of the flexible subset selection strategy: we express criteria and approaches as objectives, blend them together, and tune their parameters to create subsets for visualization. It generates Figure 1 of the paper in three parts: figures/express.pdf, figures/blend.pdf, and figures/tune.pdf. The random dataset generated for the example and the selected subsets can be found in data/1-designProcess.
Imports and Setup#
# Standard library
import logging
from pathlib import Path
# Third party
import matplotlib.pyplot as plt
import matplotlib_inline
import numpy as np
import seaborn as sns
# Local files
import flexibleSubsetSelection as fss
# Initialize notebook settings
sns.set_theme() # set seaborn theme
matplotlib_inline.backend_inline.set_matplotlib_formats('svg') # vector plots
%load_ext autoreload
%autoreload 2
Express#
Common criteria, custom criteria, and standard approaches can all be expressed as objectives and then used to select a subset for a visualization. Because the strategy is general, any criterion that can be written as a loss function can drive subset selection. Here, we express convex hull, outliers, and distinctness as objectives and apply each one to select a subset.
notebookName = "Fig1-designProcess"
dataDirectory = Path("..") / "data" / notebookName
figuresDirectory = Path("..") / "figures" / notebookName
seed = 123456789 # random seed for replicability
fss.logger.setup(level = logging.WARNING) # set logging level for the package
# Create a random blobs dataset to use as our example dataset
dataset = fss.Dataset(name="Random Blobs",
randTypes="blobs",
size=(200, 2),
seed=seed)
dataset.save(directory=dataDirectory)
Convex Hull Objective#
Applying this objective allows us to select the smallest subset that covers the convex hull of the original dataset.
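To make the idea concrete, here is a minimal sketch that is independent of the package: it finds the hull vertices of a random point cloud with Andrew's monotone chain and shows that this small subset has the same hull as the full data. The function `convex_hull_indices` and the sample data are illustrative assumptions, not part of `flexibleSubsetSelection`, and the greedy solver used in this notebook works differently.

```python
import numpy as np

def convex_hull_indices(points):
    """Return indices of the hull vertices in counter-clockwise order
    (Andrew's monotone chain). Hypothetical helper for illustration."""
    order = np.lexsort((points[:, 1], points[:, 0]))  # sort by x, then y

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn
        return ((points[a, 0] - points[o, 0]) * (points[b, 1] - points[o, 1])
                - (points[a, 1] - points[o, 1]) * (points[b, 0] - points[o, 0]))

    def half_hull(indices):
        chain = []
        for i in indices:
            # Drop points that would make a right turn or a straight line
            while len(chain) >= 2 and cross(chain[-2], chain[-1], i) <= 0:
                chain.pop()
            chain.append(i)
        return chain

    lower = half_hull(order)
    upper = half_hull(order[::-1])
    # The last point of each half is the first point of the other half
    return np.array(lower[:-1] + upper[:-1])

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))
hull = convex_hull_indices(points)
print(f"{len(hull)} of {len(points)} points are enough to preserve the hull")
```

Any subset that contains these vertices has the same convex hull as the full dataset, which is why a hull-preserving objective can be satisfied by a very small subset.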
# Precalculate the hull metric on the full dataset
dataset.compute(hull = fss.metric.hull)
# Create a unicriterion loss function with the hull metric and precomputation
lossFunction = fss.UniCriterion(objective = fss.objective.preserveMetric,
metric = fss.metric.hull,
datasetMetric = dataset.hull)
# Create a solver that uses a greedy minimum-subset algorithm
solver = fss.Solver(algorithm = fss.algorithm.greedyMinSubset,
lossFunction = lossFunction)
# Initialize color and plot settings
color = fss.Color()
fss.plot.initialize(color, font="DejaVu Sans")
lossPlotter = fss.plot.RealTimePlotter(color)
# Solve for a convex hull subset
subsetHull = solver.solve(dataset,
epsilon = 0,
initialSize = 3,
callback = lossPlotter.update)
subsetHull.save(directory=dataDirectory, name="hullSubset")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[3], line 19
16 lossPlotter = fss.plot.RealTimePlotter(color)
18 # Solve for a convex hull subset
---> 19 subsetHull = solver.solve(dataset,
20 epsilon = 0,
21 initialSize = 3,
22 callback = lossPlotter.update)
24 subsetHull.save(directory=dataDirectory, name="hullSubset")
File ~/Documents/GitHub/flexibleSubsetSelection/src/flexibleSubsetSelection/solver.py:93, in Solver.solve(self, dataset, **parameters)
91 # Run algorithm on dataset to select a subset that minimizes loss
92 with Timer() as timer:
---> 93 z, loss = self.algorithm(dataset, self.lossFunction, **parameters)
95 # Log information on completion of the solve
96 log.info(
97 (
98 "Selected subset from dataset '%s' with '%s' and '%s' "
(...) 105 loss,
106 )
File ~/Documents/GitHub/flexibleSubsetSelection/src/flexibleSubsetSelection/algorithm.py:292, in greedyMinSubset(dataset, lossFunction, epsilon, minError, maxIterations, seed, initialSize, callback)
289 z = np.zeros(datasetLength, dtype=int)
291 # Randomly select initial points
--> 292 selected_indices = np.random.choice(
293 datasetLength, initialSize, replace=False, seed=rng
294 )
295 z[selected_indices] = 1
297 # Set of available indices
File numpy/random/mtrand.pyx:855, in numpy.random.mtrand.RandomState.choice()
TypeError: choice() got an unexpected keyword argument 'seed'
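The traceback above says that the legacy `np.random.choice(a, size=None, replace=True, p=None)` function has no `seed` keyword. One common remedy is to draw from a seeded `numpy.random.Generator` instead. The sketch below assumes that the `rng` name seen in `greedyMinSubset` is, or can be replaced by, such a generator; that is an assumption about the package internals, not a confirmed fix.

```python
import numpy as np

seed = 123456789
rng = np.random.default_rng(seed)  # seeded generator (assumed stand-in for `rng`)
datasetLength, initialSize = 200, 3

# Generator.choice takes no seed argument; the generator carries the seed
selected_indices = rng.choice(datasetLength, size=initialSize, replace=False)

# Indicator vector marking the initially selected points
z = np.zeros(datasetLength, dtype=int)
z[selected_indices] = 1
```

Re-creating the generator with the same seed reproduces the same initial points, which keeps the replicability that the `seed` parameter is meant to provide.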
Outliers Objective#
Applying this objective allows us to select a subset of 40 points with the highest local outlier factor (LOF).
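When the loss is a plain sum of precomputed per-point scores, the subset with the highest total is simply the points with the highest individual scores. The sketch below illustrates this with NumPy only. The score used here (mean distance to the five nearest neighbours) is a simplified stand-in chosen for illustration, not the outlierness measure computed by the package, and the four planted outliers are made-up data.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))
# Plant four obvious outliers far from the central blob (illustrative data)
points[:4] = np.array([[10, 10], [-10, 10], [10, -10], [-10, -10]], dtype=float)

# Pairwise Euclidean distances between all points
diffs = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# Stand-in outlier score: mean distance to the 5 nearest neighbours
nearest = np.sort(distances, axis=1)[:, 1:6]  # column 0 is the self-distance
scores = nearest.mean(axis=1)

# The 40 highest-scoring points maximize the summed score
subset_size = 40
subset = np.argsort(scores)[-subset_size:]
```

A swap-based solver is not needed for a purely additive score like this one, but it becomes useful as soon as the objective depends on how the selected points relate to each other.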
# Precalculate the outlierness (local outlier factor) of the full dataset
dataset.compute(outlierness = fss.objective.outlierness)
# Create a loss function that is just the sum of the LOF in the subset
solver.lossFunction = fss.UniCriterion(objective = np.sum,
solveArray = "outlierness")
solver.algorithm = fss.algorithm.greedySwap
# Solve for an outlier subset
subsetOutliers = solver.solve(dataset,
subsetSize = 40,
callback = fss.plot.RealTimePlotter(color).update)
subsetOutliers.save(directory=dataDirectory, name="outliersSubset")
Distinctness Objective#
Applying this objective allows us to select a subset of 60 points that are distant from their nearest neighbors in 2D space.
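As a rough illustration of what "distant from their nearest neighbors" means, the sketch below computes, for each point in a candidate subset, the distance to its nearest neighbour within that subset, using a precomputed distance matrix. The helper `nearest_neighbor_distances` and the four sample points are hypothetical and do not reproduce the package's `distinctness` objective.

```python
import numpy as np

def nearest_neighbor_distances(distances, subset):
    """Distance from each subset point to its nearest other subset point."""
    sub = distances[np.ix_(subset, subset)]  # fancy indexing returns a copy
    np.fill_diagonal(sub, np.inf)            # ignore each point's self-distance
    return sub.min(axis=1)

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [0.0, 5.0]])
diffs = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

crowded = nearest_neighbor_distances(distances, [0, 1, 2])  # two points nearly overlap
spread = nearest_neighbor_distances(distances, [0, 2, 3])   # all points far apart
```

The second subset has larger nearest-neighbour distances for every point, so a distinctness-style objective would prefer it over the first.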
# Precompute the distance matrix, then create a unicriterion loss function with the distinctness objective
dataset.compute(distances = fss.metric.distanceMatrix)
solver.lossFunction = fss.UniCriterion(objective = fss.objective.distinctness,
solveArray = "distances",
selectBy = "matrix")
# Solve for distinctness subset
subsetDistinct = solver.solve(dataset=dataset,
subsetSize=60,
callback = fss.plot.RealTimePlotter(color).update)
subsetDistinct.save(directory=dataDirectory, name="distinctSubset")
Plot#
Now we visualize these three example objectives by plotting the dataset and subsets in 3 scatterplots.
# Plot the three different resulting subsets as scatterplots
titleSize = 24
subtitleSize = 18
titles = ["Hull", "Outliers", "Distinctness"]
subsets = [subsetHull, subsetOutliers, subsetDistinct]
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(6, 3))
fig.text(0.49, 1, '1. Express', ha='center', va='center', fontsize=titleSize)
for i, ax in enumerate(fig.axes):
    ax.grid(visible=False)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(titles[i], fontsize=subtitleSize)
    fss.plot.scatter(ax = ax,
                     color = color,
                     dataset = dataset,
                     subset = subsets[i],
                     alpha = 0.6)
plt.savefig(figuresDirectory / "express.pdf", bbox_inches="tight")
Blend#
For visualizations with multiple criteria, objectives can be combined to achieve more complex outcomes. Weight parameters balance the criteria so that each objective influences the subset to the desired degree. Here we create three subsets that blend a distribution objective (using the earth mover's distance as the metric) with the distinctness objective from the previous section. Each subset uses different weights and therefore blends the two objectives differently.
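The blending itself can be thought of as a weighted sum of objective values. The following sketch shows that mechanic with two simplified stand-in objectives: a quantile-gap measure in place of the earth mover's distance and a negative mean nearest-neighbour distance in place of distinctness. All function names here are hypothetical, and the way the package's `MultiCriterion` class combines objectives internally is assumed rather than confirmed.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
diffs = data[:, None, :] - data[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))
quantiles = np.linspace(0.05, 0.95, 19)

def distribution_loss(subset):
    # Stand-in for a distribution objective: mean gap between the
    # per-axis quantiles of the subset and those of the full dataset
    gap = (np.quantile(data[subset], quantiles, axis=0)
           - np.quantile(data, quantiles, axis=0))
    return float(np.abs(gap).mean())

def crowding_loss(subset):
    # Stand-in for distinctness: negative mean nearest-neighbour distance
    sub = distances[np.ix_(subset, subset)]  # copy, safe to modify
    np.fill_diagonal(sub, np.inf)
    return float(-sub.min(axis=1).mean())

def blended_loss(subset, objectives, weights):
    # Weighted sum of the individual objective values
    return sum(w * f(subset) for f, w in zip(objectives, weights))

objectives = [distribution_loss, crowding_loss]
subset = rng.choice(len(data), size=80, replace=False)
for weights in ([100, 1], [10, 1], [1, 1]):
    print(weights, blended_loss(subset, objectives, weights))
```

Because the two objectives live on different scales, the weights both set priorities and compensate for scale, which is one reason a ratio as large as 100 to 1 can be reasonable.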
subsetSize = 80 # The size of subsets being selected with blended objectives
# Use distribution and distinctness objectives
objectives = [fss.objective.earthMoversDistance, fss.objective.distinctness]
# Parameters of the distribution and distinctness objectives
parameters = [{"dataset": dataset.original},
{"solveArray": "distances", "selectBy": "matrix"}]
# Create the multicriterion loss function from the objectives and weight them
solver.lossFunction = fss.MultiCriterion(objectives = objectives,
parameters = parameters,
weights=[100, 1])
# Solve for the blended distribution and distinctness subset
subsetBlend1 = solver.solve(dataset,
subsetSize=subsetSize,
callback = fss.plot.RealTimePlotter(color).update)
subsetBlend1.save(directory=dataDirectory, name="blend1Subset")
# Update the weights to place less emphasis on the distribution objective
solver.lossFunction = fss.MultiCriterion(objectives = objectives,
parameters = parameters,
weights=[10, 1])
# Solve for the blended distribution and distinctness subset
subsetBlend2 = solver.solve(dataset,
subsetSize=subsetSize,
callback = fss.plot.RealTimePlotter(color).update)
subsetBlend2.save(directory=dataDirectory, name="blend2Subset")
# Update the weights so that the two objectives are weighted equally
solver.lossFunction = fss.MultiCriterion(objectives = objectives,
parameters = parameters,
weights=[1, 1])
# Solve for the blended distribution and distinctness subset
subsetBlend3 = solver.solve(dataset,
subsetSize=subsetSize,
callback = fss.plot.RealTimePlotter(color).update)
subsetBlend3.save(directory=dataDirectory, name="blend3Subset")
Plot#
Now we visualize these three example blended subsets by plotting the dataset and subsets in 3 scatterplots.
# Plot the three subsets with different blends of the two objectives
titles = ["More Distribution,\nLess Distinct", "",
"Less Distribution,\nMore Distinct"]
subsets = [subsetBlend1, subsetBlend2, subsetBlend3]
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(6, 3))
fig.text(0.49, 1, '2. Blend', ha='center', va='center', fontsize=titleSize)
for i, ax in enumerate(fig.axes):
    ax.grid(visible=False)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(titles[i], fontsize=subtitleSize)
    fss.plot.scatter(ax = ax,
                     color = color,
                     dataset = dataset,
                     subset = subsets[i],
                     alpha = 0.6)
plt.savefig(figuresDirectory / "blend.pdf", bbox_inches="tight")
Tune#
Along with the weights used to blend objectives, the parameters of the objectives and of the optimization formulation can be tuned to select a subset that is more effective for a given visualization. For example, the subset size, loss bounds, and subset weight parameters all control how large the subset is for a particular visualization. Objectives may also have their own parameters, such as the number of clusters in a clustering objective, the amount of separation in a distinctness objective, or the smoothing parameter of a kernel density estimate. These provide additional flexibility to tune the objectives to a desirable level. Here we tune the subset size by setting the weight given to this fundamental characteristic of a subset to three different levels.
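One way to picture how a size weight changes the outcome is a toy greedy rule: keep adding the point farthest from the current subset, and stop once that point's distance no longer exceeds the per-point cost `weight`. This is a farthest-first traversal written only for illustration. The function `grow_subset` is hypothetical and is not the `greedyMixed` algorithm used in this notebook.

```python
import numpy as np

def grow_subset(points, weight, initial=0):
    """Greedily add the farthest point while its gain exceeds the size penalty."""
    selected = [initial]
    # Distance from every point to its nearest selected point
    gap = np.linalg.norm(points - points[initial], axis=1)
    while True:
        candidate = int(np.argmax(gap))
        if gap[candidate] <= weight:  # adding a point costs `weight`
            break
        selected.append(candidate)
        gap = np.minimum(gap, np.linalg.norm(points - points[candidate], axis=1))
    return selected

rng = np.random.default_rng(0)
points = rng.uniform(size=(200, 2))
sizes = {w: len(grow_subset(points, w)) for w in (0.5, 0.25, 0.05)}
print(sizes)
```

A larger weight stops the traversal earlier, giving fewer and more widely separated points, while a smaller weight lets the subset grow and become denser.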
# Create a unicriterion loss function with the distinctness objective
solver.lossFunction = fss.UniCriterion(objective = fss.objective.distinctness,
solveArray = "distances",
selectBy = "matrix")
solver.algorithm = fss.algorithm.greedyMixed
# Solve for subsets of 3 different sizes by varying the subset size weight
subsetDistinct1 = solver.solve(dataset=dataset, weight=0.5, initialSize=3,
callback = fss.plot.RealTimePlotter(color).update)
subsetDistinct1.save(directory=dataDirectory, name="distinct1Subset")
subsetDistinct2 = solver.solve(dataset=dataset, weight=0.25, initialSize=3,
callback = fss.plot.RealTimePlotter(color).update)
subsetDistinct2.save(directory=dataDirectory, name="distinct2Subset")
subsetDistinct3 = solver.solve(dataset=dataset, weight=0.05, initialSize=3,
callback = fss.plot.RealTimePlotter(color).update)
subsetDistinct3.save(directory=dataDirectory, name="distinct3Subset")
Plot#
Now we visualize these three example subsets of different sizes by plotting the dataset and subsets in 3 scatterplots.
# Plot the three subsets with different subset sizes based on the tuning
titles = ["Fewer Points,\nMore Distinct", "", "More Points,\nLess Distinct"]
subsets = [subsetDistinct1, subsetDistinct2, subsetDistinct3]
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(6, 3))
fig.text(0.49, 1, '3. Tune', ha='center', va='center', fontsize=titleSize)
for i, ax in enumerate(fig.axes):
    ax.grid(visible=False)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(titles[i], fontsize=subtitleSize)
    fss.plot.scatter(ax = ax,
                     color = color,
                     dataset = dataset,
                     subset = subsets[i],
                     alpha = 0.6)
plt.savefig(figuresDirectory / "tune.pdf", bbox_inches="tight")