Using custom binary descriptors
Custom binary fingerprints and float vector descriptors can also be handled. Note that the custom descriptors expose only the serialization mechanisms of the underlying representations. No descriptor generation (from Molecules) is available in this case, so queries must also be given as custom descriptors. Parts of the steps described below are implemented in the self-contained example scripts custom-binaryfp-workflow-vitamins.sh and custom-binaryfp-workflow-nci250k.sh found in the examples directory. Note that these example scripts currently fail under Windows + Cygwin.
The basic workflow described below contains the following steps:
- Create textual representations of descriptors generated by the diagnostic tool stdg
- Also create textual descriptor representation to be used as queries
- Import custom descriptors
- Search custom descriptors
The examples use the vitamins dataset containing 30 molecules. Performance data for larger sets can be found below.
Create an input file
Use the diagnostic tool stdg.sh to generate sample binary descriptor representations. It creates a binary-string-based input file from chemical fingerprints, with each line in the format 010101....01010 <ID>. This tool uses the legacy descriptors API (chemaxon.descriptors) to generate descriptors.
cat data/vitamins.smi | bin/stdg.sh \
-in - \
-cfg data/cfp-7-1.xml \
-desc com.chemaxon.descriptors.alternates.CfpBsWrapper \
-idloc TRAILING \
-idsrc MOLNAME \
-escape false \
-processdesc "d.replace(/\\|/g, '')" \
-v \
-out vitamins-custom-binstring.txt
Note that the exposed descriptor generator writes bits organized into groups of 8, separated by | characters, in the form 01001101|01111000|000.... The desired output format is a plain bitstring (0100110101111000000...) containing no such separators. Option -processdesc <SCRIPT> provides a JavaScript hook to transform the descriptor String representation. In the passed script, reference d contains the original String representation (before escaping/ID appending), and the expected return value is the representation to be used. More information on JavaScript's replace() function and on the regular expressions passed to it can be found in the JavaScript String Reference, in the JavaScript RegExp reference and in MDN's JavaScript Guide.
Breakdown of the -processdesc option:
Expression | Details |
---|---|
"d.replace(/\\|/g, '')" |
Command line argument passed to the executable. This is escaped for the shell. |
d.replace(<PATTERN>, <REPLACEMENT>) |
JS function to replace <PATTERN> with <REPLACEMENT> in string d. |
d.replace(/\|/g, '') |
The value after the shell processes the escaped \\ character. This value is passed to the JavaScript hook. |
/\|/g |
The regular expression processed by the scripting hook. |
/.../g |
Regular expression literal (/.../) and flag to indicate global search (g). |
\| |
Escaped | character which is matched. |
So the given JS hook deletes all occurrences of the | character by replacing them with an empty string.
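The two processing stages in the breakdown above can be reproduced without the tool. The following is a minimal sketch in plain POSIX shell: it prints the text that the shell hands over after processing the double-quoted argument, and then emulates the effect of the hook on a made-up sample descriptor using sed (an emulation only, not the JavaScript engine used by stdg).

```shell
# The argument exactly as typed on the command line (double-quoted).
arg="d.replace(/\\|/g, '')"

# Inside double quotes the shell turns \\ into a single backslash,
# so this is the script text that reaches the JavaScript hook.
printf '%s\n' "$arg"        # prints: d.replace(/\|/g, '')

# Emulate the hook on a sample descriptor string (illustrative value).
# In a basic regular expression a plain | is a literal character.
d='01001101|01111000|00000000'
stripped=$(printf '%s\n' "$d" | sed 's/|//g')
printf '%s\n' "$stripped"   # prints: 010011010111100000000000
```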
An alternative approach to removing the separator character is to write to standard output (-out -) and use the tr command (... -out - | tr -d "|" > vitamins-custom-binstring.txt).
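One difference between the two approaches may be worth keeping in mind: the -processdesc hook sees only the descriptor (before the ID is appended), while tr filters whole output lines. The sketch below uses made-up sample lines to show that spaces in an ID survive, but a | inside an ID would be deleted as well.

```shell
# Sample output lines in the grouped format, each followed by an ID.
# The second ID is a hypothetical one that itself contains a | character.
line1='01001101|01111000 Vitamin C'
line2='01001101|01111000 A|B'

# tr -d deletes every listed character anywhere on the line.
cleaned1=$(printf '%s\n' "$line1" | tr -d '|')
cleaned2=$(printf '%s\n' "$line2" | tr -d '|')

printf '%s\n' "$cleaned1"   # prints: 0100110101111000 Vitamin C
printf '%s\n' "$cleaned2"   # prints: 0100110101111000 AB
```

For the vitamins names this makes no difference; for datasets whose IDs may contain the separator, the hook-based approach is the safer choice.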
IDs created from input names are also written.
Note that this tool uses single-threaded execution.
Import custom descriptors
Note that the underlying context must be composed using a JavaScript hook (specified by -contextjs <SCRIPT>). This must be valid JavaScript code which returns the OverlapAnalysisContext instance to be used (as the value of the last expression). Many initialized references and helper functions are available; use option -h to print command line help for details. Since the input is an arbitrary line-oriented text file which might contain additional data, the methods used for accessing the descriptor and, optionally, the ID parts must be specified explicitly. This is done by using splitters.
bin/importStorage.sh \
-in vitamins-custom-binstring.txt \
-splitter com.chemaxon.overlap.splits.FirstToken \
-idsplitter com.chemaxon.overlap.splits.AllButFirstToken \
-out vitamins-custom-fp.bin \
-id vitamins-custom-id.bin \
-contextjs "ctx_from_descpb(bld_bv.length(1024).endianness(en_BIG_ENDIAN).stringFormat(sf_STRICT_BINARY_STRING))"
Note that writing IDs (using options -id and -idsplitter) is optional. Note that IDs might contain whitespace (like Vitamin C), so using splitter SecondToken instead of AllButFirstToken would compromise them (selecting only Vitamin instead of the full remaining part Vitamin C).
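The difference between the splitters can be illustrated on a single input line. The sketch below emulates the described behavior with POSIX parameter expansion; it assumes tokens are separated by a single space, which may not match how the actual splitter classes treat whitespace, and the bitstring is shortened for readability.

```shell
# One line of the import file: descriptor, then an ID containing a space.
line='0100110101111000 Vitamin C'

# Emulates FirstToken: everything before the first space (the descriptor).
first=${line%% *}

# Emulates AllButFirstToken: everything after the first space (the full ID).
rest=${line#* }

# Emulates SecondToken: only the second space-separated token.
second=${rest%% *}

printf 'descriptor=%s\nid=%s\nsecond=%s\n' "$first" "$rest" "$second"
# prints:
# descriptor=0100110101111000
# id=Vitamin C
# second=Vitamin
```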
Breakdown of the contents of the passed JavaScript fragment creating the OverlapAnalysisContext used:
Script part | Description |
---|---|
ctx_from_descpb(..) |
Helper function which creates a default OverlapAnalysisContext from the associated DescriptorParameters builder. |
bld_bv |
A builder instance for BvParameters in default state. |
.length(..) |
Update builder with length parameter (see apidoc). |
.endianness(..) |
Update builder with endianness parameter (see apidoc). |
en_BIG_ENDIAN |
Constant which can be passed to .endianness(..) (see apidoc). |
.stringFormat(..) |
Update builder with string format parameter (see apidoc). |
sf_STRICT_BINARY_STRING |
Constant which can be passed to .stringFormat(..) (see apidoc). |
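Since the context declares a fixed length and a strict binary string format, a quick check of the input file before import may catch leftover separators or truncated lines. The following is a minimal sketch: check_bitstrings is a hypothetical helper (not part of the toolkit), and it assumes the descriptor is the first whitespace-separated token of each line.

```shell
# check_bitstrings FILE LEN   (hypothetical helper, not a toolkit command)
# Succeeds only if the first token of every line consists solely of
# 0/1 characters and is exactly LEN characters long.
check_bitstrings() {
    awk -v len="$2" '
        $1 !~ /^[01]+$/ || length($1) != len {
            printf "line %d: unexpected descriptor\n", NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Possible usage with the file name and length from the import command:
# check_bitstrings vitamins-custom-binstring.txt 1024 && echo "looks fine"
```

The length passed to the helper is expected to agree with the value given to .length(..) in the context script.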
Import associated master molecule storage
Master molecule storage can be created with the createMms tool when structures are available. Note that currently the order of custom descriptors and the order of molecules must match; ID based matching is not available.
cat data/vitamins.smi | bin/createMms.sh -in - -out vitamins-mms.bin
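Because matching is purely positional, it can be worth confirming that both inputs list the same entries in the same order. The sketch below is one way to do that under two assumptions that are not stated in the text: each line of the structure file is a SMILES string followed by the molecule name, and the descriptor file was written with trailing IDs taken from those names. same_order is a hypothetical helper.

```shell
# same_order SMI_FILE DESC_FILE   (hypothetical helper)
# Strips the first whitespace-separated token from every line of both
# files and succeeds only if the remaining ID sequences are identical.
same_order() {
    a=$(sed 's/^[^[:space:]]*[[:space:]]*//' "$1")
    b=$(sed 's/^[^[:space:]]*[[:space:]]*//' "$2")
    [ "$a" = "$b" ]
}

# Possible usage with the files from this workflow:
# same_order data/vitamins.smi vitamins-custom-binstring.txt || echo "order differs"
```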
Diagnostic dump storages
Peek into the contents of created storages.
bin/dumpStorage.sh \
-in vitamins-custom-fp.bin \
-in vitamins-custom-id.bin \
-in vitamins-mms.bin
Create descriptor for querying
Query descriptors have no associated IDs. We use the "Vitamin E - alpha-Tocopherol" structure, slightly modified (an additional carbon atom is attached):
echo "Oc2c(c(c1O[C@](CCc1c2C)(C)CCC[C@H](C)CCC[C@H](C)CCCC(C)C)C)CC" | bin/stdg.sh \
-in - \
-cfg data/cfp-7-1.xml \
-desc com.chemaxon.descriptors.alternates.CfpBsWrapper \
-idloc NONE \
-escape false \
-processdesc "d.replace(/\\|/g, '')" \
-v \
-out query-desc.txt
Query descriptor storage
Inline query descriptors are set using parameter -qd. Query descriptors stored in a file can be read using -qdf. Note that query molecules (-qm or -qmf) cannot be used, since we don't know how to generate the descriptors for them.
bin/searchStorage.sh \
-frombytes vitamins-custom-fp.bin \
-qd `cat query-desc.txt`
bin/searchStorage.sh \
-frombytes vitamins-custom-fp.bin \
-qdf query-desc.txt
searchStorage can use IDs instead of plain structure indices. Parameter -idstorage can specify the associated ID storage.
bin/searchStorage.sh \
-frombytes vitamins-custom-fp.bin \
-idstorage vitamins-custom-id.bin \
-qdf query-desc.txt