This repository includes:
- The implementation of CIME
- Documentation
- Installation
- How to cite?
Check out our paper for further details about the implementation and use cases of CIME.
Check out the DEMO website of CIME, which includes the datasets used in the use cases.
Check out the SDF generation examples if you want to try CIME with your own dataset.
Check out the example datasets from the paper's use cases.
The ChemInformatics Model Explorer (CIME for short) is an extension of the Projection Space Explorer that allows users to interactively explore a fixed subspace of chemical compounds. Users can apply a 2D projection to the provided data and additionally show the high-dimensional data in a LineUp table. Furthermore, users can select datapoints and show the 2D compound structures of all selected items, aligned to each other, in a side-view. If the data provides them, users can change the representation in the side-view to show atom-level attributions in the 2D compound structure. This is useful, for example, when comparing neighboring compounds to check whether the machine learning model explanations generated for them make sense. The grouping tool makes it easier to interact with item neighborhoods.
Instructions for installing the application are provided at the end of this documentation.
This section explains the general layout of the tool and the basic controls with which you can interact with the tool.
- Left Menu Drawer (orange): Shows tabs that contain different groups of actions
- Center View (yellow): Shows the current projection and allows the user to interact with the low dimensional projection of the data items
- Table Component (blue): Can be dragged up from the bottom of the window to show a LineUp table of the high dimensional space of the data items
The following describes a list of controls:
- hover over item: shows a detailed view of the item
- hover over group center: shows group label
- left-click on item: select this item
- left-click + shift on item: toggle the selection status (i.e. if the item is selected, it is removed from selection; if the item is not selected, it is added to the selection)
- left-click on group-center: select the whole group
- left-click + shift on group-center: add the group to the selection
- left-click + drag on group-center: draw a storytelling arrow to another group center
- left-click + drag: new selection of items
- left-click + shift + drag: toggles the selection (i.e. unselected points that are within the lasso are added to the selection and selected points that are within the lasso are deselected)
- right-click + drag: allows you to move the whole scatterplot
- mouse wheel: zoom in and out to get a more/less detailed view of the items in the scatterplot
- right-click on background or item: opens a context menu that lets you create a group from the selected points
- right-click on group center: opens the group context menu that lets you delete a group or start the storytelling feature
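The shift-based toggle behaviour in the list above corresponds to a symmetric difference between the current selection and the clicked or lassoed items. The following is a minimal sketch of that logic, not CIME's actual frontend code; the function names are hypothetical, and it assumes that a plain click or lasso replaces the existing selection.

```python
def click_select(selection, item, shift=False):
    """Return the new selection after a left-click on `item`."""
    if shift:
        # Shift toggles membership: selected items are removed, others are added.
        return selection ^ {item}
    # Assumption: a plain click replaces the current selection.
    return {item}


def lasso_select(selection, lassoed, shift=False):
    """Return the new selection after dragging a lasso around `lassoed`."""
    if shift:
        # Toggle every item inside the lasso.
        return selection ^ set(lassoed)
    # Assumption: a plain lasso starts a new selection.
    return set(lassoed)


current = {1, 2, 3}
print(sorted(click_select(current, 2, shift=True)))          # [1, 3]
print(sorted(lasso_select(current, [3, 4, 5], shift=True)))  # [1, 2, 4, 5]
```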
When the website loads, a default dataset called "test.sdf" is loaded. Additionally, users can load datasets that were uploaded previously, or they can upload their own custom dataset. The list of uploaded files includes all SDF files that are available in the backend (from any user!). Files can be deleted with the delete button next to the filename. The list can also be refreshed manually with the refresh button next to "Uploaded Files" (this is only necessary if another user uploads a file during a simultaneous session and the current user needs this exact file).
If a user wants to upload a custom file they have to use the file format that is described in the “Data Format” subsection. We provide an SDF generation example to get users started with their own datasets.
Data is handed to the system using a Structure-Data File (SDF) that contains a collection of chemical compounds and additional properties that can be customized. New files are first uploaded to the Python backend, which runs with Bottle (https://bottlepy.org/docs/dev/), and are then processed with the help of the RDKit framework (https://www.rdkit.org/). For big files, the initial upload and preprocessing can take several minutes. Loading a file that has already been uploaded is much faster.
Properties can be compound-specific (i.e. for the whole datapoint) or atom-specific (i.e. one value for each atom in the compound). Details are described in the next subsections.
A small example can be found in “backend/test.sdf”. Datasets used in CIME's article are available in the data repository: https://www.doi.org/10.17605/OSF.IO/KNS6M
These properties can be used for projection and can be shown in the LineUp table (like solubility, atom weight, or any other property that is important to the user). Properties without semantic meaning like fingerprints or the embedding space of a compound can be used for projection, but are not shown in the table to reduce unnecessary information and loading times. Such properties can be specified with the “fingerprint” modifier as described in the “Modifiers” subsection.
Compound-specific properties can contain arbitrary values; however, the naming should be consistent across all compounds (i.e. each property should be present for each compound).
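Before uploading a dataset, it can help to check that every compound carries the same set of property names. The following is a minimal sketch of such a check. It assumes the properties of each compound have already been read into a plain dictionary; the function name and the example data are hypothetical.

```python
def find_missing_properties(compounds):
    """Map compound index -> sorted list of property names that compound lacks."""
    # Collect every property name that occurs anywhere in the dataset.
    all_names = set()
    for props in compounds:
        all_names.update(props)

    missing = {}
    for index, props in enumerate(compounds):
        absent = sorted(all_names - set(props))
        if absent:
            missing[index] = absent
    return missing


compounds = [
    {"solubility": -2.1, "x": 0.3, "y": 1.2},
    {"solubility": -0.7, "x": 1.1},  # "y" is missing for this compound
]
print(find_missing_properties(compounds))  # {1: ['y']}
```

An empty result means the naming is consistent for all compounds.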
There are special properties that are handled differently by the system:
- Including properties x and y tells the system to initialize the scatterplot according to these values.
- The property groupLabel specifies the group each compound belongs to.
Atom-specific properties are recognized by the backend if the property name starts with atom.dprop. Those properties are interpreted as attribution scores and shown on top of the compound structure with a heatmap and contour lines (see section “Details” for more information).
Atom properties must contain one value for each atom of the compound. They can be easily generated with RDKit: https://www.rdkit.org/docs/RDKit_Book.html#atom-properties-and-sdf-files.
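To illustrate what such a property looks like inside an SDF file, the sketch below formats a list of per-atom scores as an SDF data item whose tag carries the atom.dprop prefix. It is a simplified, RDKit-free illustration with a hypothetical function name: it does not wrap long lines or handle missing values, both of which RDKit's Chem.CreateAtomDoublePropertyList takes care of, so RDKit is the better choice for real datasets.

```python
def format_atom_dprop(name, values, num_atoms):
    """Build a simplified SDF data item holding one score per atom."""
    # The backend expects exactly one value for each atom of the compound.
    if len(values) != num_atoms:
        raise ValueError(
            f"expected {num_atoms} values for '{name}', got {len(values)}"
        )
    scores = " ".join(f"{float(v):g}" for v in values)
    # Header line with the tag, one data line, then a blank line as terminator.
    return f">  <atom.dprop.{name}>\n{scores}\n\n"


# Hypothetical attribution scores for a compound with three atoms.
print(format_atom_dprop("rep_0", [0.25, -0.1, 0.0], num_atoms=3), end="")
```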
In the frontend there is an autocomplete user input that groups atom properties. Values for the autocomplete are extracted as follows (e.g. example property "atom.dprop.rep_0"):
- atom.dprop is dropped because it is just a modifier that is needed by the backend
- group name: substring that includes everything before the last underscore (e.g. "rep")
- value: substring after the last underscore (e.g. "0")
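The extraction rules above can be sketched in a few lines. This is only an illustration of the described naming scheme, not the frontend's actual code; the function names are hypothetical, and the handling of names without any underscore (empty group name) is an assumption.

```python
ATOM_PREFIX = "atom.dprop."  # modifier needed by the backend, dropped for display


def split_atom_property(name):
    """Split e.g. 'atom.dprop.rep_0' into (group, value) == ('rep', '0')."""
    if not name.startswith(ATOM_PREFIX):
        raise ValueError(f"not an atom property: {name!r}")
    rest = name[len(ATOM_PREFIX):]
    # rpartition splits at the LAST underscore. Without any underscore the
    # group is empty and the whole remainder becomes the value (assumption).
    group, _, value = rest.rpartition("_")
    return group, value


def autocomplete_groups(names):
    """Collect the values of all atom properties under their group name."""
    groups = {}
    for name in names:
        group, value = split_atom_property(name)
        groups.setdefault(group, []).append(value)
    return groups


names = ["atom.dprop.rep_0", "atom.dprop.rep_1", "atom.dprop.my_model_7"]
print(autocomplete_groups(names))
# {'rep': ['0', '1'], 'my_model': ['7']}
```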
Modifiers are used to group compound properties. This enables the system to provide features that enhance usability (e.g. when projecting the data, users can choose which properties should be used for the projection; with grouping, users can (de-)select entire groups, which is important if a group consists of hundreds of properties, as in the case of fingerprints). Some modifiers have special functions, which are explained later in this section.
By default the system recognizes the following modifiers: "fingerprint", "rep", "pred", "predicted", "measured", "smiles". When choosing a file a dialog window opens where users can specify custom modifiers in addition to the default set of modifiers.
To decorate a property with a modifier, the modifier has to be prepended to the property name and separated by an underscore (e.g. “fingerprint_1”, “fingerprint_2” etc).
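The naming convention can be illustrated with a small sketch that assigns each property name to its modifier. It uses the default modifiers listed above; the function names and the example property names are hypothetical, and the actual matching logic in CIME may differ.

```python
DEFAULT_MODIFIERS = ("fingerprint", "rep", "pred", "predicted", "measured", "smiles")


def modifier_of(name, modifiers=DEFAULT_MODIFIERS):
    """Return the modifier a property name is decorated with, or None."""
    # Try longer modifiers first so that a custom modifier containing an
    # underscore is not shadowed by a shorter one sharing the same prefix.
    for modifier in sorted(modifiers, key=len, reverse=True):
        if name.startswith(modifier + "_"):
            return modifier
    return None


def group_by_modifier(names, modifiers=DEFAULT_MODIFIERS):
    """Group property names by modifier; undecorated names go under None."""
    groups = {}
    for name in names:
        groups.setdefault(modifier_of(name, modifiers), []).append(name)
    return groups


names = ["fingerprint_1", "fingerprint_2", "smiles_canonical", "measured_logS", "x"]
groups = group_by_modifier(names)
print(groups["fingerprint"])  # ['fingerprint_1', 'fingerprint_2']
print(groups[None])           # ['x']
```

Grouping in this way is what makes it possible to (de-)select, for instance, all fingerprint bits at once.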
The predefined smiles modifier has a special function: if a property is decorated with "smiles_*" the system will recognize the property as a SMILES string and thus show the compound structure in the LineUp table.
If there is no fingerprint modifier in the properties of a dataset, the system will create them automatically using the built-in RDKit function: https://rdkit.readthedocs.io/en/latest/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints.
When the data is loaded the x and y properties are used as initial positions for the scatterplot. If x and y are not specified they will be randomly initialized. The values for x and y can then be calculated with a projection method.
Currently, UMAP is the only projection method available in CIME. To implement the projection we used this library: https://github.com/PAIR-code/umap-js. The JavaScript library is a reimplementation of this Python library: https://github.com/lmcinnes/umap, with the difference that the JS library uses random seed points as initialization by default.
Before calculating the projection, users can choose the features which should be used for the projection. This can be done by selecting and deselecting the corresponding checkboxes. To select or deselect whole semantic groups of features, users can interact with the checkboxes next to the group name. Clicking on a group-row collapses/expands the list of items in this group.
Users can also choose whether a numerical feature should be normalized. Normalization standardizes all values of the feature (i.e. subtracts the mean and divides by the standard deviation).
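As a worked example of this standardization, the sketch below converts a list of feature values to z-scores. Two details are assumptions rather than documented behaviour: the population standard deviation is used, and a constant feature is mapped to zeros to avoid dividing by zero.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Mean is 5 and standard deviation is 2 for this example feature.
print(standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

After standardization, features measured on very different scales contribute comparably to the distances used by the projection.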
The range value indicates the minimum and maximum values of the feature.
Furthermore, users can adjust hyperparameters used for the projection. Noteworthy here is the checkbox Seed Position, which tells the system to initialize the projection with the current positions of the items instead of using a random initialization.
Parameters that cannot be defined by the user are set to the defaults suggested in https://umap-learn.readthedocs.io/en/latest/api.html.
The “Project” tab panel includes a view that shows the progress of a projection as soon as the projection starts to calculate. Here, the calculations can be paused and continued.
If there are item groups specified, the movement (trail) of the group centers during the projection can be visualized by enabling the Show Group Trail toggle.
Users also have the possibility to save current projections and change between the projection states of those savepoints.
In the "Encoding" tab panel users can change the marks and channels of the displayed data.
- shape by: select a categorical attribute and encode each value as a different mark
- brightness by: select a numerical attribute and scale the brightness (opacity) of each point by that value; the upper and lower limit of the brightness can be adjusted with the scale below; if nothing is selected, the slider can be adjusted to set the general brightness value of all points
- size by: select a numerical attribute and scale the size of each point by that value; the upper and lower limit of the size can be adjusted with the scale below; if nothing is selected, the slider can be adjusted to set the general size value of all points
- color by: select a categorical or numerical attribute that defines the color of the points; the colormap can be chosen below and depends on whether the attribute is numerical or categorical
- advanced coloring: if you color by a categorical attribute, this allows you to hide/show items with certain values
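To make the "brightness by" and "size by" channels more concrete, the sketch below maps a numerical attribute onto the range given by the lower and upper limit of the slider. It assumes a linear min-max mapping, which is an illustration only; the mapping CIME actually uses may differ, and the function name is hypothetical.

```python
def scale_channel(values, lower, upper):
    """Linearly map attribute values onto the range [lower, upper]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: if all values are equal, use the middle of the range.
        midpoint = (lower + upper) / 2
        return [midpoint for _ in values]
    span = hi - lo
    return [lower + (v - lo) / span * (upper - lower) for v in values]


# Hypothetical attribute values mapped to point sizes between 1 and 3.
print(scale_channel([0.0, 5.0, 10.0], lower=1.0, upper=3.0))  # [1.0, 2.0, 3.0]
```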
In the "Groups" tab panel users can adjust group settings, automatically define groups by clustering and select different stories.
One toggle allows users to show or hide items in the scatterplot. The other one allows users to show or hide group centers (grey diamonds).
Users can choose how the items of a selected group are displayed. If a user clicks on a group center (grey diamond), all items belonging to that group are highlighted. If Contour Plot is selected, the items belonging to that group are surrounded by contour lines. If Star Visualization is selected, lines are drawn from the group center to each item. If None is selected, the points belonging to the group are just highlighted.
Automatic Clustering of the projected features can be done in this panel. The algorithm used for clustering is HDBSCAN (https://hdbscan.readthedocs.io/en/latest/index.html). Parameters can be changed either by adjusting the slider (few clusters...many clusters), or by enabling the Advanced-Mode. Chosen parameters are always synchronized with the values in the advanced user inputs. Any other possible parameters that could be used for HDBSCAN are set to the default parameters that can be retrieved from the HDBSCAN docs.
A storybook is a set of groups and possible connections between those groups that were either created automatically or manually composed. This way, users can view different groupings by just switching between stories.
A new storybook can be created by clicking Add Empty. Users can manually add groups to a new or existing storybook by selecting points in the scatter plot and choosing "Create Group from Selection" from the context menu that opens with a right-click on the scatter plot.
The groups in a storybook are listed below the user select. Each item in the list represents one group. If a user clicks on a group, the corresponding points are highlighted in the scatter plot. Holding CTRL adds a group to the selection. Next to each group label there is a settings button where users can adjust group names, delete a group or filter the LineUp table by this group.
In this tab panel summary visualizations of selected points are shown. The user can choose to show this in an external window by clicking the corresponding toggle.
When points are selected users can see the 2D compound structure of the selected items, aligned to each other according to their maximum common substructure. Users can select compounds from this view if they check the corresponding checkboxes and filter by the selected compounds by clicking on Confirm Selection (green).
A user input lets you choose among all provided representations (yellow). The available representations are specified in the dataset and contain atom-level attribution scores for each compound. To choose a representation, users can either scroll through the list or filter the list by typing in the auto-complete text field. Representations are organized by groups that can be specified manually as described in the "Atom Properties" chapter.
The Settings button allows users to manually refresh the representation list (blue). Furthermore, users can adjust settings that are used in the backend. The Align Structure toggle is especially important, since the alignment might distort the compound structure. When this feature is disabled, the compound structures are no longer aligned to each other, but they are drawn as expected, without distortion.
Clicking on Add View (orange) places an additional view of the selected compounds next to the existing view and enables the user to choose and compare several representations at once. Additional views can be removed again using the Delete-symbol button. It is recommended to use this feature in the external window only because there is more space.
For high-dimensional data exploration, we included a LineUp table that can be viewed on-demand. To show the table you need to drag the component from the bottom of the window to increase the size of the table.
The table shows all properties that were included in the provided dataset except properties that have the "fingerprint" modifier. Fingerprints were excluded because their values usually do not contain semantic meaning and would take a lot of space in the table, which causes higher loading times and makes the table more complex.
All LineUp functionalities, such as filtering, searching, and sorting, are included. Grouping can be applied to all columns. Especially relevant are group by selected items and group by group labels, which actively use features of the Projection Space Explorer.
The Load All button automatically makes the table component visible - if it was not shown yet - and removes all filters.
The Load Selection button automatically makes the table component visible - if it was not shown yet - and filters the table by the selected items.
The Show Cell Values toggle can be enabled to show values in numerical table cells. If it is disabled, the values are only shown for highlighted rows.
The Export CSV button downloads the table in its current state as a .csv file. Current filters, ordering, and custom annotations are contained in this file.
Using the "smiles" modifier, users can manually specify which properties represent SMILES strings. For each column that contains SMILES, an additional "structure" column is created that shows the 2D structure next to the SMILES column. The SMILES columns have some additional features:
- Users can filter those columns by substructure (a valid SMILES string must be provided in the filter input).
- Changing the width of those columns dynamically adapts row heights, which provides a better view of the 2D structures.
- When grouping several rows, this column displays the maximum common substructure of all compounds in the group.
The table can be used interactively with the scatter plot that represents the projected space and the summary view that shows selected items:
- Hovering items in the table highlights the corresponding items in the other views as well and vice versa.
- Users can select items in the table, which are also selected in the other views and vice versa.
There are multiple ways to run CIME. Option 1 is the easiest method.
Once you have Docker installed, you can quickly run the following commands and have CIME ready to use.
To install the latest version of CIME:
docker pull jkuvdslab/cime
docker run -d -p 8080:8080 --name cime --detach jkuvdslab/cime
To update CIME:
docker rm --force cime
docker pull jkuvdslab/cime
docker run -d -p 8080:8080 --name cime --detach jkuvdslab/cime
To uninstall CIME:
docker rm --force cime
A docker image of CIME is available at Docker Hub.
Use a git tool to clone this repository to your computer.
git clone https://github.com/jku-vds-lab/cime.git
Then navigate to the Application folder in a terminal using
cd cime/Application/
and run the command to install the required packages
npm install
There is always a valid build in the repository, but in case you want to make changes, you can use the local build server. Start it with the command
npm run webpack:dev
Whenever a file is changed while this server is running, it will automatically build a new version and deploy it in the /dist folder.
To start the application you just need to open the index.html locally. The easiest way to do this is by using the live server provided by either Atom or Visual Studio Code.
In the backend, a Python server runs with the Bottle Framework. Many features that relate to the “Chem” aspects of the Projection Space Explorer are only available if the backend is running. Also, the feature to derive groups from clustering is only available in the backend.
To start the server you need to create a conda environment with the following dependencies:
- bottle=0.12.18
- rdkit=2020.09.5
- hdbscan=0.8.27
- joblib=0.17.0
- bottle-beaker=0.1.3
A requirements.txt is provided in the folder Application/backend.
Using this environment you only have to start the server by running
python backend-cime-dist.py
To combine frontend and backend in a docker image we provide a Dockerfile. Before creating the image you have to adjust some settings:
- In Application/backend/backend-cime-dist.py, the response_header_origin_localhost constant needs to be set to “http://localhost:8080”
- In Application/backend/backend-cime-dist.py, the line that starts the server needs to be replaced by run(app=app, host='0.0.0.0', port=8080)
- In Application/src/utils/backend-connect.ts, the BASE_URL constant needs to be set to an empty string (i.e. “”)
- In Application/src/utils/frontend-connect.ts, the BASE_PATH constant needs to be set to an empty string (i.e. “”)
In the root folder of the project, you can create the docker image by running
docker build -f Dockerfile -t cime .
and run the image with
docker run -d -p 8080:8080 --detach cime
The application will be available on ‘localhost:8080’.
You can cite CIME using the following bibtex:
@article{humerheberle2022cime,
author={Humer, Christina and Heberle, Henry and Montanari, Floriane and Wolf, Thomas and Huber, Florian and Henderson, Ryan and Heinrich, Julian and Streit, Marc},
journal={Journal of Cheminformatics},
title={{ChemInformatics Model Explorer (CIME)}: Exploratory analysis of chemical model explanations},
year={2022},
doi={10.1186/s13321-022-00600-z},
volume={14},
number={21},
}