Basic similarity search workflow

This example shows how to use the supplied command line tools to generate descriptors for molecule sets and run similarity searches on them with molecule queries. Parts of the steps described below are implemented in the script search-workflow.sh in the examples/ directory. The basic workflow consists of the following steps:

Import molecules and IDs from structure file (creating master molecule storage)
Calculate molecular descriptors to be used as targets for the search
Invoke similarity search on prepared storages
Diagnostic dump of prepared storages (optional)

For more details on the command line scripts involved, see their descriptions. For more details on performance, see the document Performance. An example that skips the preparation steps and calculates descriptors on the fly is also given. See also the document Details on searchStorage.

Create master molecule storage

The master molecule storage is used by other scripts (such as search) to retrieve structure sources and IDs. Structure IDs are stored in a similar data structure.

Notes

These examples also create a serialized file containing molecule names (as ID) by specifying parameter -name. Creating this file is optional.
The current version keeps all parsed structures and the optional IDs/names in memory during processing.
Since version 0.3.0 gzipped input files are automatically recognized. In some of the examples we use gzip to decompress the input file and read it from the standard input using command line option -in. When the input is read by the command line scripts (bin/<SCRIPT>.sh) the input file can be used directly.

Commands

# Retrieve IDs from SDF properties
gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz | bin/createMms.sh \
    -in - \
    -out drugbank-all-mms.bin \
    -prop COMMON_NAME:drugbank-all-commonname.bin

# Retrieve IDs from molecule name
gzip -dc data/molecules/nci/nci-250k.smi.gz | bin/createMms.sh \
    -in - \
    -out nci-250k-mms.bin \
    -name nci-250k-name.bin

Breakdown of the invocations

Command line part	Description
`gzip -dc <GZFILE>`	Decompress the content of `gzip` encoded file `<GZFILE>` and print it to the standard output.
`\|`	Pipe the standard output of the previous command into the standard input of the following command. See http://www.tldp.org/LDP/abs/html/io-redirection.html for details.
`bin/createMms.sh`	Tool shipped to process input file and store structures and optionally IDs in a proprietary binary file readable by other tools.
`\`	Indicates that the command continues on the following line.
`-in <INPUT>`	Specify the location of input structures to process.
`-`	Specify standard input.
`-out <BINFILE>`	Specify the binary file for the master molecule storage to write.
`-prop <PROPNAME>:<BINFILE>`	Specify an SDF property `<PROPNAME>` to be extracted and stored in a binary file `<BINFILE>`.
`-name <BINFILE>`	Extract and store molecule name in a binary file `<BINFILE>`.

Expected performance

Preprocessing the nci-250k dataset on a recent desktop machine is expected to take less than a minute.

Calculate fingerprints

Generated descriptors (fingerprints) are stored in a binary file. This file is later read by the search tool.

Notes

The settings for descriptor generation are determined by the specified OverlapAnalysisContext. The examples below use predefined contexts. Customization is possible using scripting hooks; see Basic overview of the concepts of overlap analysis context for details.
Currently molecular descriptors are calculated from the original structure file. Current version does not recognize gzipped input files. We use gzip to decompress the input file and read it from the standard input using command line option -in.
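Whether a given tool version detects gzipped input may vary, so a small wrapper that always hands plain text to -in - can be convenient. The following is a minimal sketch; the helper name cat_maybe_gz is hypothetical and not part of the shipped tools. It checks for the two gzip magic bytes (1f 8b) and decompresses only when they are present.

```shell
# Print a file to standard output, decompressing it first if it is gzipped.
# cat_maybe_gz is a hypothetical helper, not one of the shipped tools.
cat_maybe_gz() {
    # Read the first two bytes as hex; gzip streams start with 1f 8b.
    magic=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        gzip -dc "$1"
    else
        cat "$1"
    fi
}

# Possible usage with the commands of this section:
#   cat_maybe_gz data/molecules/nci/nci-250k.smi.gz | bin/buildStorage.sh \
#       -context createSimpleCfp7Context -in - -out nci-250k-cfp7.bin
```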

Commands

gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz  | bin/buildStorage.sh \
    -context createSimpleCfp7Context \
    -in - \
    -out drugbank-all-cfp7.bin

gzip -dc data/molecules/nci/nci-250k.smi.gz | bin/buildStorage.sh \
    -context createSimpleCfp7Context \
    -in - \
    -out nci-250k-cfp7.bin

Expected performance

Fingerprint calculation for the nci-250k dataset on a recent desktop machine is expected to be done well under a minute.

Breakdown of the invocations

Command line part	Description
`bin/buildStorage.sh`	Tool to process structure file input, calculate molecular descriptors (fingerprints) and store them in a binary file.
`-context <CONTEXT>`	Specify molecular descriptor, default comparison metric and other parameters to be used during calculation and later search. For details see document Basic overview of the concepts of overlap analysis context.
`-in <INFILE>`	Structure file to process.
`-out <BINFILE>`	Binary file to write, containing the calculated descriptors.

Invoke similarity search from command line

By default, targets are identified by their master index (0-based index in the input structure file). If a serialized ID or name storage (created along with the master molecule storage) is specified, the stored ID (or name) is retrieved and printed instead.
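Because the master index is 0-based, it can be mapped back to the input structure with ordinary text tools. The following is a minimal sketch for SMILES input, assuming one structure per line and no header line; the helper name smiles_at_index is hypothetical.

```shell
# Print the SMILES record at a 0-based master index, reading the file from stdin.
# Assumes one structure per line and no header line (an assumption about the input).
smiles_at_index() {
    # sed line numbers are 1-based, so shift the 0-based index by one.
    sed -n "$(( $1 + 1 ))p"
}

# Possible usage:
#   gzip -dc data/molecules/nci/nci-250k.smi.gz | smiles_at_index 0
```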

Commands

# Launch a simple search against the NCI database with a query molecule specified as a SMILES string
bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -qm "C1CCCC1"

# Use previously extracted IDs
bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -idstorage nci-250k-name.bin \
    -qm "C1CCCC1"

# Find the 10 most similar structures
bin/searchStorage.sh \
    -frombytes nci-250k-cfp7.bin \
    -idstorage nci-250k-name.bin \
    -mode MOSTSIMILARS \
    -count 10 \
    -qm "C1CCCC1"

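# For each of the first 10 molecules of the Drugbank dataset, search for the most similar structure in the NCI database
# (SDF records span multiple lines and end with a "$$$$" line, so awk counts records instead of lines)
gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz | awk '{ print } /^\$\$\$\$/ && ++n == 10 { exit }' | bin/searchStorage.sh \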
    -frombytes nci-250k-cfp7.bin \
    -idstorage nci-250k-name.bin \
    -qf -
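
SDF records span several lines and end with a "$$$$" line, so a query stream has to be cut at record boundaries rather than at a fixed line count. The following is a minimal sketch of a reusable filter for that; the function name sdf_head is hypothetical and not a shipped tool, and the record count is expected to be at least 1.

```shell
# Print the first N records of an SDF stream read from standard input.
# sdf_head is a hypothetical helper; N must be 1 or larger.
sdf_head() {
    awk -v n="$1" '
        { print }                               # copy every line through
        /^\$\$\$\$/ { if (++seen >= n) exit }   # stop after the n-th record delimiter
    '
}

# Possible usage:
#   gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz | sdf_head 10 | bin/searchStorage.sh \
#       -frombytes nci-250k-cfp7.bin -idstorage nci-250k-name.bin -qf -
```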

Breakdown of the invocations

Command line part	Description
`bin/searchStorage.sh`	Tool to invoke similarity searches against molecular descriptors stored in a binary file previously generated by `buildStorage`.
`-frombytes <BINFILE>`	Binary file containing molecular descriptors.
`-idstorage <BINFILE>`	Read target IDs from specified location.
`-qm <QMOLSOURCE>`	Import molecule from source `<QMOLSOURCE>` and use it as a query.
`-qf <QUERY>`	Import query molecules from specified location.
`-qf -`	Import query molecules from standard input.
`-mode MOSTSIMILARS`	Find the `n` most similar molecules for each query.
`-count 10`	Specify the max number of most similar structures to find.

Expected performance

Each of the above runs is expected to take a few seconds.

Diagnostics: dump contents of the serialized storages

The dumpStorage tool reads the specified binary files and prints an overview of their contents. Note that each given storage is fully read into memory (regardless of the printed line count).

Command

bin/dumpStorage.sh -in drugbank-all-cfp7.bin -in drugbank-all-mms.bin -in drugbank-all-commonname.bin

On the fly descriptor calculations

The examples above use the tools createMms and buildStorage to prepare descriptors and IDs for later search or for exposing through the Web UI / REST API. It is possible to skip these preparation steps and let searchStorage do the calculation. For further information on the parametrization of searchStorage see Details on searchStorage.

# Find the 10 most similar structures using asymmetric tversky with on the fly descriptor calculation
tabs 40
gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz | bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -metric "tversky,coeffT:0.01,coeffQ:0.99" \
    -tmf - \
    -tidprop COMMON_NAME \
    -mode MOSTSIMILARS \
    -count 10 \
    -qm "O1CC1 epoxy" \
    -qidname

The output:

Query                                   Target                                  Dissimilarity
epoxy                                   Sevelamer                               0.019607843137254943
epoxy                                   Colestipol                              0.027946537059538312
epoxy                                   Fosfomycin                              0.03264812575574361
epoxy                                   3-Oxiran-2ylalanine                     0.03730445246690739
epoxy                                   R-Styrene Oxide                         0.03961584633853543
epoxy                                   Oxiranpseudoglucose                     0.04534606205250602
epoxy                                   D-Limonene 1,2-Epoxide                  0.04534606205250602
epoxy                                   3,4-Epoxybutyl-Alpha-D-Glucopyranoside	0.06432748538011701
epoxy                                   (R)-4-Nitrostyrene oxide                0.06868451688009314
epoxy                                   (S)-4-Nitrostyrene oxide                0.06868451688009314

Breakdown of the invocations

Command line part	Description
`tabs 40`	Set tab stops of the terminal to 40 characters. This ensures that the columns of the output are visually aligned. See https://linux.die.net/man/1/tabs.
`gzip -dc <GZFILE>`	Decompress the content of `gzip` encoded file `<GZFILE>` and print it to the standard output.
`\|`	Pipe the standard output of the previous command into the standard input of the following command. See http://www.tldp.org/LDP/abs/html/io-redirection.html for details.
`bin/searchStorage.sh`	Tool to invoke similarity searches against molecular descriptors stored in a binary file previously generated by `buildStorage` or generated on the fly.
`\`	Indicates that the command continues on the following line.
`-context <CONTEXT>`	Specify molecular descriptor, default comparison metric and other parameters to be used during calculation and later search. For details see document Basic overview of the concepts of overlap analysis context.
`-metric <METRIC>`	Customize comparison metric.
`tversky,coeffT:0.01,coeffQ:0.99`	Asymmetric Tversky metric parametrized so that query-only features are heavily penalized, while target-only features are only slightly penalized.
`-tmf <MOLFILE>`	Read and parse targets from a molecule file.
`-tmf -`	Read the target molecules from standard input.
`-tidprop <PROPNAME>`	Extract target IDs from the given property of the parsed target molecules.
`-tidprop COMMON_NAME`	Property name to use for target IDs.
`-mode MOSTSIMILARS`	Find the `n` most similar molecules for each query.
`-count 10`	Specify the max number of most similar structures to find.
`-qm <QMOLSOURCE>`	Import molecule from source `<QMOLSOURCE>` and use it as a query.
`-qm "O1CC1 epoxy"`	`SMILES` structure source with molecule name specified.
`-qidname`	Use the molecule name of the query molecule(s) as query IDs.
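
To make the effect of the two coefficients concrete, the following minimal sketch evaluates the textbook Tversky index on plain feature counts and reports 1 minus the index as dissimilarity. It assumes that coeffQ weights query-only features and coeffT weights target-only features, as the description above suggests; how searchStorage computes the value internally is not confirmed here, and the function name tversky_dissimilarity is hypothetical.

```shell
# Tversky dissimilarity from feature counts (textbook definition, an assumption
# about how the metric is parametrized; not taken from the tool's source).
# Arguments: common query_only target_only coeffQ coeffT
# At least one weighted count must be non-zero to avoid division by zero.
tversky_dissimilarity() {
    LC_ALL=C awk -v c="$1" -v q="$2" -v t="$3" -v wq="$4" -v wt="$5" 'BEGIN {
        sim = c / (c + wq * q + wt * t)   # Tversky index
        printf "%.6f\n", 1 - sim          # dissimilarity = 1 - similarity
    }'
}

# One shared feature, no query-only features, two target-only features:
tversky_dissimilarity 1 0 2 0.99 0.01   # prints 0.019608
```

With these coefficients, extra features on the target side barely increase the dissimilarity, while features present only in the query would raise it sharply. This is why small query fragments rank large targets that contain them near the top.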

Output formatting

By default, dissimilarity values are printed using Java Double formatting. A custom format can be specified with the option -out-numeric-format <FORMAT>, which delegates to Java java.text.Format. The following example uses %.3f for a fixed 3-digit precision:

# Find the 5 most similar structures using asymmetric tversky with on the fly descriptor calculation
tabs 25
gzip -dc data/molecules/drugbank/drugbank-all.sdf.gz | bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -metric "tversky,coeffT:0.01,coeffQ:0.99" \
    -tmf - \
    -tidprop COMMON_NAME \
    -mode MOSTSIMILARS \
    -count 5 \
    -qm "O1CC1 epoxy" \
    -qidname \
    -out-numeric-format "%.3f"

The output:

Query                    Target                   Dissimilarity
epoxy                    Sevelamer                0.020
epoxy                    Colestipol               0.028
epoxy                    Fosfomycin               0.033
epoxy                    3-Oxiran-2ylalanine      0.037
epoxy                    R-Styrene Oxide          0.040
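
The rounding applied by a %.3f pattern can be previewed without running a search. The sketch below uses the shell's own printf, whose C-style %.3f conversion is assumed here to behave like the Java formatting for values such as these; the helper name format_score is hypothetical.

```shell
# Format a dissimilarity value with three fractional digits.
# LC_ALL=C keeps the decimal separator a dot regardless of the user's locale.
format_score() {
    LC_ALL=C printf '%.3f\n' "$1"
}

format_score 0.019607843137254943   # prints 0.020
```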

Heatmap visualization

As an experimental feature a heatmap of the search results can be calculated using options -heatmap-image <FILE> and -heatmap-image-cellsize <CELLSIZE>. Search modes MOSTSIMILAR, MOSTSIMILARS and FULLMATRIX are all supported. Please note that heatmap rendering is not recommended for very large datasets. The approximate pixel count of the resulting image is <QUERIES> * <TARGETS> * <CELLSIZE> * <CELLSIZE> which is recommended to be kept below a few tens of megapixels.
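
The pixel count estimate above is easy to check before starting a run. The following is a minimal sketch using shell integer arithmetic; the function name heatmap_pixels and the example sizes are hypothetical.

```shell
# Approximate pixel count of a heatmap image:
# <QUERIES> * <TARGETS> * <CELLSIZE> * <CELLSIZE>
# Arguments: queries targets cellsize
heatmap_pixels() {
    echo $(( $1 * $2 * $3 * $3 ))
}

# A 13 x 13 matrix with 15 pixel cells:
heatmap_pixels 13 13 15   # prints 38025
```

For comparison, 1000 queries against 1000 targets with a cell size of 5 already gives 25000000 pixels, which is in the range the recommendation above suggests staying below.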

Self overlap of the `vitamins` dataset

bin/searchStorage.sh \
    -context createSimpleCfp7Context \
    -qmf data/molecules/vitamins/vitamins.smi \
    -qidname \
    -tmf data/molecules/vitamins/vitamins.smi \
    -tidname \
    -mode FULLMATRIX \
    -out vitamins-fullmatrix.txt \
    -heatmap-image vitamins-fullmatrix.png \
    -heatmap-image-cellsize 15 \
    -heatmap-image-query-ids-length 250 \
    -heatmap-image-target-ids-length 250

The layout of the generated image is adjusted to use larger than default cell sizes and to leave enough space to accommodate the long structure ID strings of the dataset. For details on heatmap image generation see the document Details on searchStorage.
