# Comparison

As mentioned before, PROVEE is a user-friendly system for 2D embedding projections. 

However, PROVEE is not the only library that aims to visualize embeddings. To build the optimal embedding projector, we looked at existing tools and examined what worked well in them. But most importantly: what was missing? Others' mistakes can be a fruitful source of information. Mistakes can only imPROVEE our library!

There are several libraries and tools that try to visualize embeddings, but there are only a few tools where visualization of embeddings is the main purpose. 

The most important libraries/tools we have looked at to create PROVEE:
- Vec2graph
- Whatlies
- TensorFlow Embedding Projector
- Parallax

We examined each library and tool for its scalability, user-friendliness and responsiveness, and noted its overall advantages and disadvantages. We were also interested in their back-ends and in how they store and transfer data.

##### Vec2graph
Vec2graph is a library by [Katricheva et al. (2020)][vec2graph] for visualizing word embeddings as graphs. The 2D graph is built from the cosine similarity between a word and its neighbours, and between the neighbours themselves. Graphs can contain nodes that link to other graphs, and they are displayed using an .html file. Vec2graph shines in ease of use, but its projections are not suited to displaying many data points at once.
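
The idea of a similarity graph can be illustrated with a minimal sketch in plain Python. This is not Vec2graph's API: the function names, the toy vectors, and the `top_n` and `threshold` parameters are assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neighbour_graph(vectors, query, top_n=2, threshold=0.0):
    """Hypothetical helper: edges between a query word, its top_n nearest
    neighbours, and between those neighbours, kept if similarity >= threshold."""
    others = [w for w in vectors if w != query]
    ranked = sorted(
        others,
        key=lambda w: cosine_similarity(vectors[query], vectors[w]),
        reverse=True,
    )
    nodes = [query] + ranked[:top_n]
    edges = []
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            sim = cosine_similarity(vectors[a], vectors[b])
            if sim >= threshold:
                edges.append((a, b, sim))
    return edges


# Toy 2D "embeddings", invented for this example.
toy_vectors = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "dog": [0.6, 0.8],
    "car": [0.0, 1.0],
}

edges = neighbour_graph(toy_vectors, "cat", top_n=2, threshold=0.5)
for a, b, sim in edges:
    print(f"{a} -- {b}: {sim:.3f}")
```

Every node adds edges to all other nodes in its neighbourhood, which hints at why such a graph becomes hard to read once many data points are shown at the same time.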
 
##### Whatlies
[Whatlies][df1] is a project developed by Rasa. The goal of the project is to create an API that supports many language backends, such as spaCy, Gensim and fastText, to display word or sentence embeddings. Whatlies enables users to easily display data in interactive 2D graphs. The axes can be defined by dimension reduction methods, such as PCA or UMAP, but also by special queries. For instance, 'man' can be the y-axis and 'woman' can be the x-axis. A distinctive feature of Whatlies is its support for vector arithmetic on embeddings, the results of which can be visualized directly in the interactive plots. It scales well with the size of the input, but its overall goal is to visualize smaller groups of words. The drawback is that the library depends on a lot of backends, which can be problematic when packages get updated. Furthermore, the tool requires some programming knowledge, but it has clear documentation on its GitHub Pages site.
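
To make word-defined axes and vector arithmetic concrete, here is a small sketch in plain Python. It does not use the Whatlies API. It assumes that a word's coordinate on an axis is its scalar projection onto the axis word's vector, which may differ from what Whatlies computes, and the toy vectors and helper names are invented for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def scalar_projection(v, axis):
    """Signed length of v along the direction of axis (an assumed axis metric)."""
    return dot(v, axis) / math.sqrt(dot(axis, axis))


def to_2d(v, x_axis, y_axis):
    """Place a vector in a 2D plot whose axes are themselves embeddings."""
    return (scalar_projection(v, x_axis), scalar_projection(v, y_axis))


# Toy 3D "embeddings", invented for this example.
toy = {
    "man": [1.0, 0.0, 0.2],
    "woman": [0.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
}

# 'woman' as the x-axis and 'man' as the y-axis, as in the text above.
king_xy = to_2d(toy["king"], x_axis=toy["woman"], y_axis=toy["man"])
queen_xy = to_2d(toy["queen"], x_axis=toy["woman"], y_axis=toy["man"])

# Vector arithmetic: king - man + woman, plotted on the same axes.
derived = add(sub(toy["king"], toy["man"]), toy["woman"])
derived_xy = to_2d(derived, x_axis=toy["woman"], y_axis=toy["man"])

print("king:", king_xy)
print("queen:", queen_xy)
print("king - man + woman:", derived_xy)
```

In this toy data the derived vector lands on 'queen' by construction; with real embeddings it would usually only land nearby.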

##### TensorFlow Embedding Projector
The [Embedding Projector][tens] is part of TensorBoard. It can graphically represent embeddings in a 2D or 3D space. These embeddings can be anything, as long as they can be converted to a tab-separated (TSV) file. The tool scales well and is suited to displaying many data points at once. The user can interactively explore the embedding space, vary many parameters, and easily switch between PCA and UMAP or between 2D and 3D. A disadvantage of the tool is the limited set of dimensionality reduction methods available for preprocessing the data points before they are displayed.
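
As an illustration of the tab-separated input, the sketch below converts a list of vectors and labels into two TSV strings: one with a vector per line and one with a label per line. The function name and the sample data are assumptions for this example, and the layout (no header row for a single-column label file) reflects our understanding of what the Embedding Projector accepts rather than a guarantee.

```python
def to_projector_tsv(labels, vectors):
    """Hypothetical helper: return (vectors_tsv, metadata_tsv) as strings.

    vectors_tsv holds one vector per line with tab-separated values;
    metadata_tsv holds the matching label on the same line number.
    """
    if len(labels) != len(vectors):
        raise ValueError("need exactly one label per vector")
    vectors_tsv = "\n".join(
        "\t".join(str(x) for x in vec) for vec in vectors
    ) + "\n"
    metadata_tsv = "\n".join(labels) + "\n"
    return vectors_tsv, metadata_tsv


# Sample data, invented for this example.
labels = ["cat", "dog"]
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

vectors_tsv, metadata_tsv = to_projector_tsv(labels, vectors)
print(vectors_tsv, end="")
print(metadata_tsv, end="")
```

The two strings can then be written to files (for example `vectors.tsv` and `metadata.tsv`, names chosen here for illustration) and loaded in the projector.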

##### Parallax
Last but not least, there is [Parallax][par]. Parallax is a tool to display word embedding spaces, suited to many embeddings with high dimensions and to displaying many data points at once. Most interesting is the article accompanying the tool, by [Molino et al. (2019)][parallax]. The axes can be obtained using PCA or t-SNE, but the authors also propose a Cartesian approach to the specification of the axes: axes that are the result of algebraic formulas on the embedding vectors. This results in axes that can be, for example, the average of two words or the most frequently occurring word in the data set. The major drawback of the tool is its slow loading and reaction time when using more than 10,000 points. Another difficulty is that some programming experience is required to use the tool.
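
A formula-defined axis can be sketched in a few lines of plain Python. This is not Parallax's code: the `average` helper and the toy vectors are invented, and using cosine similarity to the axis vector as the coordinate is an assumption made for this example.

```python
import math


def average(*vectors):
    """Element-wise mean of several vectors, used here as an axis formula."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2D "embeddings", invented for this example.
toy = {
    "sweet": [1.0, 0.0],
    "sugar": [0.9, 0.1],
    "sour": [0.0, 1.0],
    "lemon": [0.1, 0.9],
    "candy": [0.7, 0.3],
}

# Each axis is the result of a formula on embeddings: the average of two words.
x_axis = average(toy["sweet"], toy["sugar"])
y_axis = average(toy["sour"], toy["lemon"])

# Assumed coordinate: similarity of each word to each axis vector.
coords = {
    word: (cosine_similarity(vec, x_axis), cosine_similarity(vec, y_axis))
    for word, vec in toy.items()
}

for word, (x, y) in coords.items():
    print(f"{word}: x={x:.3f}, y={y:.3f}")
```
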

Our current project, PROVEE, tries to overcome the major drawbacks of the studied tools by providing a user-friendly, responsive tool that is suitable for many embeddings with high dimensionality. The embeddings can be anything: not only word embeddings, but also image embeddings, DNA embeddings, sentence embeddings, you name it. Programming experience is not required, which makes the tool easy to use.

[//]: # (These are reference links used in the body of this note and get stripped out when the markdown processor does its job)

   [parallax]: <https://arxiv.org/abs/1905.12099>
   [tens]: <https://projector.tensorflow.org/>
   [par]: <https://github.com/uber-research/parallax>
   [df1]: <https://rasahq.github.io/whatlies/>
   [vec2graph]: <https://link.springer.com/chapter/10.1007/978-3-030-39575-9_20>