Verification and benchmarking of concurrent implementations

This document describes the verification and benchmarking of the concurrent descriptor generation stack. The goal of the verification is to ensure that the generated descriptors match the reference descriptors. Reference descriptors are calculated using the chemaxon.descriptors API in a small, single-threaded tool called stdg. A self-contained example script executes the comparison workflow described below; launch it with the -h option to get command line help.

Input file

A single input file containing molecules is used. A label for the input set is also required.

export INPUTSET=pubchem-compound-rnd-10k
export INFILE="$INPUTSET.smi.gz"

Note that the execution statistics will reference the input set label.
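Before starting a long run it can help to confirm that the input file is readable and holds the expected number of records. The following is a minimal sketch, assuming one molecule per line in the compressed file; count_records is a hypothetical helper and not part of the tools described here.

```shell
# Hypothetical helper (not part of the described tools): count the records
# in a gzip-compressed input file, assuming one molecule per line.
count_records() {
    # tr strips the padding some wc implementations put before the number
    gzip -dc "$1" | wc -l | tr -d ' '
}

# Usage, with the variables exported above:
#   count_records "$INFILE"
```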

Descriptor configuration

Matching descriptor configurations are needed for the stdg and buildStorage tools.

Configuration for maccs-166

export FINGERPRINT=maccs-166                                           # Descriptor label used in files and in stat data
export STDJS="std.aromatizeBasic()"                                    # Standardization for reference calculation
export DESC=com.chemaxon.descriptors.alternates.Maccs166BinaryString   # Class used in reference calculation
export CFGSTRING=""                                                    # Config string for the class above
export CONTEXT=createMaccs166Context                                   # Overlap analysis context used in multithreaded calculation
export DESCF=BINARYSTRING                                              # Descriptor format used for exporting using dumpStorage

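Note that the com.chemaxon.descriptors.alternates.Maccs166BinaryString class invokes MACCS-166 recognition directly, with no standardization, while the createMaccs166Context factory uses basic aromatization as its only standardization. The reference calculation therefore applies std.aromatizeBasic() explicitly, so that both calculations work on identically standardized structures.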

Configuration for cfp-7-1

export FINGERPRINT=cfp-7-1
export STDJS="std.identityStandardizer()"
export DESC=com.chemaxon.descriptors.alternates.CfpDsWrapper
export CFGSTRING="$(cat "data/cfp-7-1.xml")"
export CONTEXT=createSimpleCfp7Context

Configuration for ecfp-4

export FINGERPRINT=ecfp-4
export STDJS="std.identityStandardizer()"
export DESC=com.chemaxon.descriptors.alternates.CfpDsWrapper
export CFGSTRING="$(cat "data/ecfp-4.xml")"
export CONTEXT=createSimpleEcfp4Context

Generate single-threaded reference descriptors

A binary string representation with TAB-separated IDs is generated using the stdg tool. Profiling data and execution statistics are also collected. Note that this tool uses minimal code to access descriptor generation.

gzip -dc "$INFILE" | bin/ \
    "-Doverlap-benchmark.input-set=${INPUTSET}" \
    "-Doverlap-benchmark.fingerprint=${FINGERPRINT}" \
    -v \
    -stdjs "$STDJS" \
    -desc "$DESC" \
    -cfgstring "$CFGSTRING" \
    -idsrc MOLNAME \
    -idloc TRAILING \
    -idsep TAB \
    -escape false \
    -out "$INPUTSET-$FINGERPRINT-stdg-fp.txt" \
    -prof "$INPUTSET-$FINGERPRINT-stdg-prof.txt" \
    -stat "$INPUTSET-$FINGERPRINT-stdg-stat.txt"
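A quick structural check of the reference output can catch problems before the comparison step. The sketch below assumes that, with -idloc TRAILING and -idsep TAB, each output line consists of the descriptor string, a TAB, and the molecule ID; this layout is inferred from the option names and not confirmed by the tool documentation. check_fp_format is a hypothetical helper.

```shell
# Hypothetical helper: report malformed lines and duplicate IDs in a
# descriptor text file. Assumed line layout: <descriptor string> TAB <id>
check_fp_format() {
    awk -F '\t' '
        NF != 2    { bad++ }    # line does not have exactly two fields
        seen[$2]++ { dup++ }    # ID in the second field was seen before
        END { printf "lines=%d malformed=%d duplicate_ids=%d\n", NR, bad, dup }
    ' "$1"
}

# Usage, with the variables exported above:
#   check_fp_format "$INPUTSET-$FINGERPRINT-stdg-fp.txt"
```

Duplicate IDs are worth flagging because the later export step relies on the imported IDs to label descriptors.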

Create storages

Import master molecule storage and master ID storage for the input set. The imported IDs will be used in the later descriptor data dump step.

gzip -dc "$INFILE" | bin/ \
    "-Doverlap-benchmark.input-set=${INPUTSET}" \
    -in - \
    -out "$INPUTSET-mms.bin" \
    -name "$INPUTSET-name.bin" \
    -prof "$INPUTSET-mms-prof.txt" \
    -stat "$INPUTSET-mms-stat.txt"

Generate concurrent descriptors

Use the fast similarity search tool buildStorage in its default concurrent execution mode.

gzip -dc "$INFILE" | bin/ \
    "-Doverlap-benchmark.input-set=${INPUTSET}" \
    -context "$CONTEXT" \
    -in - \
    -out "$INPUTSET-$FINGERPRINT-fp.bin" \
    -prof "$INPUTSET-$FINGERPRINT-prof.txt" \
    -stat "$INPUTSET-$FINGERPRINT-stat.txt"

Export descriptors from generated storage

bin/ \
    -in "$INPUTSET-name.bin" \
    -in "$INPUTSET-$FINGERPRINT-fp.bin" \
    -descout "$INPUTSET-$FINGERPRINT-fp.txt" \
    -descf "$DESCF"

Compare string representations

The string representation created by the stdg tool and the one exported from the concurrently generated descriptor storage must be identical.

diff -s "$INPUTSET-$FINGERPRINT-stdg-fp.txt" "$INPUTSET-$FINGERPRINT-fp.txt"
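For scripted runs, the comparison can be turned into a pass/fail check. The following is a minimal sketch, relying only on the standard behavior of cmp and diff (exit status 0 for identical files); verify_fp is a hypothetical wrapper and not part of the tools described here.

```shell
# Hypothetical wrapper: compare the reference and the exported descriptor
# files, print a one-line verdict, and return non-zero on mismatch.
verify_fp() {
    ref="$1"
    gen="$2"
    if cmp -s "$ref" "$gen"; then
        echo "PASS: descriptors are identical"
    else
        # Count the changed lines reported by diff on either side.
        # diff exits with 1 when the files differ, hence the "|| true".
        n=$(diff "$ref" "$gen" | grep -c '^[<>]' || true)
        echo "FAIL: $n lines differ (both files counted)"
        return 1
    fi
}

# Usage, with the variables exported above:
#   verify_fp "$INPUTSET-$FINGERPRINT-stdg-fp.txt" "$INPUTSET-$FINGERPRINT-fp.txt"
```

Because the check is byte-wise, any difference in line order or ID formatting between the two files is reported as a mismatch.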

Inspect benchmark data

Launch the embedded server with the collected profiling data:

bin/ -profres *-prof.txt

For details see Profiling and execution statistics.