Verification and benchmarking of concurrent implementations
This document describes the verification and benchmarking of the concurrent descriptor generation stack. The goal of the verification is to ensure that the generated descriptors match the reference descriptors. Reference descriptors are calculated using the chemaxon.descriptors API in a small, single-threaded tool called stdg. The self-contained example script verify-concurrent-generation.sh executes the comparison workflow described below. Launch it with option -h to get command-line help.
Input file
A single input file containing molecules will be used. A label for the input set is also required.
export INPUTSET=pubchem-compound-rnd-10k
export INFILE="$INPUTSET.smi.gz"
Note that the execution statistics will contain a reference to the input set.
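Before starting a long run, it can be useful to confirm that the input file is readable and to know how many records it holds. The following is a minimal sketch, assuming the input is a gzipped, line-oriented file with one molecule per line (as is usual for SMILES); count_records is a hypothetical helper and not part of the toolset.

```shell
# Hypothetical helper (not part of the toolset): count the records in a
# gzipped, line-oriented molecule file. Assumes one molecule per line.
count_records() {
    # tr strips the padding some wc implementations add around the count
    gzip -dc "$1" | wc -l | tr -d '[:space:]'
}
```

For example, count_records "$INFILE" prints the number of lines in the input set; the same number of lines would be expected in the descriptor text files generated later, provided every molecule is processed successfully.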
Descriptor configuration
Matching descriptor configurations are needed for the stdg and buildStorage tools.
Configuration for maccs-166
export FINGERPRINT=maccs-166 # Descriptor label used in files and in stat data
export STDJS="std.aromatizeBasic()" # Standardization for reference calculation
export DESC=com.chemaxon.descriptors.alternates.Maccs166BinaryString # Class used in reference calculation
export CFGSTRING="" # Config string for class above
export CONTEXT=createMaccs166Context # Overlap analysis context used in multithreaded calculation
export DESCF=BINARYSTRING # Descriptor format used for exporting using dumpStorage
Note that the com.chemaxon.descriptors.alternates.Maccs166BinaryString class directly invokes MACCS-166 recognition with no standardization, while the createMaccs166Context factory uses basic aromatization as the only standardization. This is why the reference calculation sets STDJS to std.aromatizeBasic().
Configuration for cfp-7-1
export FINGERPRINT=cfp-7-1
export STDJS="std.identityStandardizer()"
export DESC=com.chemaxon.descriptors.alternates.CfpDsWrapper
export CFGSTRING="$(cat "data/cfp-7-1.xml")"
export CONTEXT=createSimpleCfp7Context
export DESCF=INTDECIMALSTRING
Configuration for ecfp-4
export FINGERPRINT=ecfp-4
export STDJS="std.identityStandardizer()"
export DESC=com.chemaxon.descriptors.alternates.CfpDsWrapper
export CFGSTRING="$(cat "data/ecfp-4.xml")"
export CONTEXT=createSimpleEcfp4Context
export DESCF=INTDECIMALSTRING
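When several fingerprints are verified in one session, the three configurations above can be selected by label instead of being pasted by hand. The following is a minimal sketch: the function name select_fingerprint is an assumption introduced here for illustration, while the variable values are copied from the configurations above.

```shell
# Hypothetical wrapper (not part of the toolset): set and export the
# configuration variables for one fingerprint label.
select_fingerprint() {
    case "$1" in
        maccs-166)
            STDJS="std.aromatizeBasic()"
            DESC=com.chemaxon.descriptors.alternates.Maccs166BinaryString
            CFGSTRING=""
            CONTEXT=createMaccs166Context
            DESCF=BINARYSTRING
            ;;
        cfp-7-1)
            STDJS="std.identityStandardizer()"
            DESC=com.chemaxon.descriptors.alternates.CfpDsWrapper
            # Fail early if the XML configuration file cannot be read
            CFGSTRING=$(cat "data/cfp-7-1.xml") || return 1
            CONTEXT=createSimpleCfp7Context
            DESCF=INTDECIMALSTRING
            ;;
        ecfp-4)
            STDJS="std.identityStandardizer()"
            DESC=com.chemaxon.descriptors.alternates.CfpDsWrapper
            CFGSTRING=$(cat "data/ecfp-4.xml") || return 1
            CONTEXT=createSimpleEcfp4Context
            DESCF=INTDECIMALSTRING
            ;;
        *)
            echo "Unknown fingerprint label: $1" >&2
            return 1
            ;;
    esac
    FINGERPRINT=$1
    export FINGERPRINT STDJS DESC CFGSTRING CONTEXT DESCF
}
```

Calling select_fingerprint maccs-166 leaves the environment in the same state as the first configuration above. An unknown label returns a non-zero status and leaves FINGERPRINT unchanged.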
Generate single-threaded reference descriptors
A binary String representation with TAB-separated IDs is generated using the stdg tool. Profiling data and execution statistics are also collected. Note that this tool uses minimal code to access descriptor generation.
gzip -dc "$INFILE" | bin/stdg.sh \
"-Doverlap-benchmark.input-set=${INPUTSET}" \
"-Doverlap-benchmark.fingerprint=${FINGERPRINT}" \
-v \
-stdjs "$STDJS" \
-desc "$DESC" \
-cfgstring "$CFGSTRING" \
-idsrc MOLNAME \
-idloc TRAILING \
-idsep TAB \
-escape false \
-out "$INPUTSET-$FINGERPRINT-stdg-fp.txt" \
-prof "$INPUTSET-$FINGERPRINT-stdg-prof.txt" \
-stat "$INPUTSET-$FINGERPRINT-stdg-stat.txt"
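Before comparing outputs, it can help to confirm the shape of the reference file. The following is a minimal sketch, assuming each output line holds one descriptor string followed by a TAB and the ID (as the -idloc TRAILING and -idsep TAB options suggest) and that the descriptor string itself contains no TAB. check_fp_format is a hypothetical helper, not part of the toolset.

```shell
# Hypothetical check (not part of the toolset): succeed only if every line
# has exactly two non-empty, TAB-separated fields (descriptor, then ID).
check_fp_format() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 == "" { bad++ }
        END { exit (bad > 0) }
    ' "$1"
}
```

For example, check_fp_format "$INPUTSET-$FINGERPRINT-stdg-fp.txt" returns a non-zero status if any line deviates from the assumed layout. If the assumption about the layout does not hold for a given descriptor format, the check needs to be adapted.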
Create storages
Import master molecule storage and master ID storage for the input set. The imported IDs will be used in the later descriptor data dump step.
gzip -dc "$INFILE" | bin/createMms.sh \
"-Doverlap-benchmark.input-set=${INPUTSET}" \
-in - \
-out "$INPUTSET-mms.bin" \
-name "$INPUTSET-name.bin" \
-prof "$INPUTSET-mms-prof.txt" \
-stat "$INPUTSET-mms-stat.txt"
Generate concurrent descriptors
Use the fast similarity search tool buildStorage in its default concurrent execution mode.
gzip -dc "$INFILE" | bin/buildStorage.sh \
"-Doverlap-benchmark.input-set=${INPUTSET}" \
-context "$CONTEXT" \
-in - \
-out "$INPUTSET-$FINGERPRINT-fp.bin" \
-prof "$INPUTSET-$FINGERPRINT-prof.txt" \
-stat "$INPUTSET-$FINGERPRINT-stat.txt"
Export descriptors from generated storage
bin/dumpStorage.sh \
-in "$INPUTSET-name.bin" \
-in "$INPUTSET-$FINGERPRINT-fp.bin" \
-descout "$INPUTSET-$FINGERPRINT-fp.txt" \
-descf "$DESCF"
Compare string representations
The string representation created by the stdg tool and the one exported from the concurrently generated descriptor storage must be identical.
diff -s "$INPUTSET-$FINGERPRINT-stdg-fp.txt" "$INPUTSET-$FINGERPRINT-fp.txt"
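On large input sets a failed comparison can produce a very long diff. The following is a minimal sketch of a wrapper that reports success briefly and, on a mismatch, prints only the beginning of the diff; compare_fp is a hypothetical helper and the limit of 10 lines is an arbitrary choice.

```shell
# Hypothetical wrapper (not part of the toolset): compare two descriptor
# text files and keep the output short when they differ.
compare_fp() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        # Print at most the first 10 lines of the diff, then signal failure
        diff "$1" "$2" | head -n 10
        return 1
    fi
}
```

For example, compare_fp "$INPUTSET-$FINGERPRINT-stdg-fp.txt" "$INPUTSET-$FINGERPRINT-fp.txt" prints identical and returns 0 when the verification passes, so it can also be used as a condition in a script.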
Inspect benchmark data
Launch the embedded server with the collected profiling data:
bin/gui.sh -profres *-prof.txt
For details, see Profiling and execution statistics.