Store additional data

Please note that additional data storage is a work in progress and has limited support.

Prior to version 0.3.0 the molecule stores supported storing and retrieving only a single arbitrary string (referred to as the ID) for each molecule. Since version 0.3.2, additional properties stored when an overlap analysis is calculated can be used in the overlap visualization.

Limited support

Additional data can be attached only to molecules imported by calculateOverlap.sh.
Additional data can be displayed in the molecule details dialog of the molecules page.
Numeric additional data can be used in the overlap analysis visualization.
Additional data can be queried through the REST API.

Specify additional data

The calculateOverlap.sh tool can attach extra data to both the query and the target sets. See the command line help for the options -qprop <SPEC> and -tprop <SPEC>. Property specifications are applied to each processed molecule and are stored in the serialized files.

Specification format

Each specification parameter is a string in the format <NAME>:<VALUETYPE>:<CALCTYPE>[:<CALCSPEC>], where

<NAME>: The name that identifies the stored additional property for the query or the target molecule sets. Names must be unique across the molecule sets, so two stored properties cannot have the same name.
<VALUETYPE>: Expected/interpreted type of the stored property. Must be explicitly specified. Applicable types:
- int: 32 bit signed integer value
- long: 64 bit signed integer value
- double: 64 bit floating point value
- string: Unicode character sequence
<CALCTYPE>: Calculator/extractor used to get the value for each input molecule.
<CALCSPEC>: Optional parameter/argument passed to the calculator/extractor.

Currently supported calculators, their value types and parameter strings:

`<CALCTYPE>`	Value types	Description and argument (`<CALCSPEC>`) interpretation
`molname`	`string`	Use the name stored in the input molecule. No argument accepted.
`sdfprop`	`int`, `long`, `double`, `string`	Use the value stored in an SDF property in the input molecule. The argument is the property name which is mandatory.
`chemterm`	`int`, `long`, `double`, `string`	Execute a Chemical Terms expression evaluation on the input molecule. The argument is the expression to be evaluated which is mandatory. Note that the imported molecule will be passed for Chemical Terms evaluation WITHOUT SDF properties. Please note that you might need additional licenses for executing certain calculations.
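
The format and the calculator table above can be checked mechanically before a long command line is assembled. The following is a minimal Python sketch, not the parser used by calculateOverlap.sh: the names PropSpec and parse_prop_spec are hypothetical, and it assumes that only the first three colons separate fields, so that an argument may itself contain colons.

```python
from dataclasses import dataclass
from typing import Optional

# Value types and calculators as listed in the tables above.
VALUE_TYPES = {"int", "long", "double", "string"}
CALCULATORS = {
    # calculator: (allowed value types, argument mandatory?)
    "molname": ({"string"}, False),
    "sdfprop": (VALUE_TYPES, True),
    "chemterm": (VALUE_TYPES, True),
}


@dataclass(frozen=True)
class PropSpec:  # hypothetical container, not part of the tool
    name: str
    value_type: str
    calc_type: str
    calc_spec: Optional[str] = None


def parse_prop_spec(spec: str) -> PropSpec:
    """Split and validate one <NAME>:<VALUETYPE>:<CALCTYPE>[:<CALCSPEC>] string."""
    # Assumption: only the first three colons are separators, so the
    # argument (for example an expression) may contain ':' itself.
    parts = spec.split(":", 3)
    if len(parts) < 3:
        raise ValueError(f"expected NAME:VALUETYPE:CALCTYPE, got {spec!r}")
    name, value_type, calc_type = parts[:3]
    calc_spec = parts[3] if len(parts) == 4 else None
    if not name:
        raise ValueError("property name must not be empty")
    if value_type not in VALUE_TYPES:
        raise ValueError(f"unknown value type {value_type!r}")
    if calc_type not in CALCULATORS:
        raise ValueError(f"unknown calculator {calc_type!r}")
    allowed_types, needs_arg = CALCULATORS[calc_type]
    if value_type not in allowed_types:
        raise ValueError(f"{calc_type} does not produce {value_type}")
    if needs_arg and not calc_spec:
        raise ValueError(f"{calc_type} requires an argument")
    if not needs_arg and calc_spec is not None:
        raise ValueError(f"{calc_type} accepts no argument")
    return PropSpec(name, value_type, calc_type, calc_spec)


spec = parse_prop_spec("exact_mass:double:sdfprop:PUBCHEM_EXACT_MASS")
print(spec)
```

A specification such as x:int:molname is rejected by this sketch because the table lists only string as a value type for molname.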

Example

The following invocation is based on the example script rest-api-example.sh (see Self contained examples for details), where various SDF properties are extracted and stored:

bin/calculateOverlap.sh \
    -name pubchem1k-in-drugbank-cfp7 \
    -descname cfp7 \
    -context createSimpleCfp7Context \
    -queryname pubchem1k \
    -qmf data/molecules/pubchem-compound/pubchem-compound-rnd-1k.sdf.gz \
    -qidname \
    -qprop compound_cid:int:sdfprop:PUBCHEM_COMPOUND_CID \
    -qprop compound_canonicalized:int:sdfprop:PUBCHEM_COMPOUND_CANONICALIZED \
    -qprop cactvs_complexity:double:sdfprop:PUBCHEM_CACTVS_COMPLEXITY \
    -qprop cactvs_hbond_acceptor:int:sdfprop:PUBCHEM_CACTVS_HBOND_ACCEPTOR \
    -qprop cactvs_hbond_donor:int:sdfprop:PUBCHEM_CACTVS_HBOND_DONOR \
    -qprop cactvs_rotatable_bond:int:sdfprop:PUBCHEM_CACTVS_ROTATABLE_BOND \
    -qprop exact_mass:double:sdfprop:PUBCHEM_EXACT_MASS \
    -qprop molecular_weight:double:sdfprop:PUBCHEM_MOLECULAR_WEIGHT \
    -qprop cactvs_tpsa:double:sdfprop:PUBCHEM_CACTVS_TPSA \
    -qprop monoisotopic_weight:double:sdfprop:PUBCHEM_MONOISOTOPIC_WEIGHT \
    -qprop total_charge:int:sdfprop:PUBCHEM_TOTAL_CHARGE \
    -qprop heavy_atom_count:int:sdfprop:PUBCHEM_HEAVY_ATOM_COUNT \
    -qprop atom_def_stereo_count:int:sdfprop:PUBCHEM_ATOM_DEF_STEREO_COUNT \
    -qprop atom_udef_stereo_count:int:sdfprop:PUBCHEM_ATOM_UDEF_STEREO_COUNT \
    -qprop bond_def_stereo_count:int:sdfprop:PUBCHEM_BOND_DEF_STEREO_COUNT \
    -qprop bond_udef_stereo_count:int:sdfprop:PUBCHEM_BOND_UDEF_STEREO_COUNT \
    -qprop isotopic_atom_count:int:sdfprop:PUBCHEM_ISOTOPIC_ATOM_COUNT \
    -qprop cactvs_tauto_count:int:sdfprop:PUBCHEM_CACTVS_TAUTO_COUNT \
    -targetname drugbank \
    -tmf data/molecules/drugbank/drugbank-common_name.smi.gz \
    -tidname \
    -tprop chemterms_atom_count:int:chemterm:atomCount \
    -tprop chemterms_rotatable_bonds:int:chemterm:rotatableBondCount \
    -tprop chemterms_ring_count:int:chemterm:ringCount \
    -tprop chemterms_mass:double:chemterm:mass \
    -out pubchem1k-drugbank-overlap.bin
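
Long lists of -qprop options like the one above can be generated rather than typed. The sketch below is only an illustration in Python: the helper qprop_args is hypothetical, and the naming rule (drop the PUBCHEM_ prefix, convert to lower case) is simply read off the invocation above, not a requirement of the tool.

```python
import shlex

# SDF property name -> value type, a subset of the invocation above.
SDF_PROPS = {
    "PUBCHEM_COMPOUND_CID": "int",
    "PUBCHEM_CACTVS_COMPLEXITY": "double",
    "PUBCHEM_EXACT_MASS": "double",
}


def qprop_args(sdf_props, prefix="PUBCHEM_", flag="-qprop"):
    """Return a flat argument list: [flag, spec, flag, spec, ...]."""
    args = []
    for sdf_name, value_type in sdf_props.items():
        # Stored name: the SDF name without the prefix, in lower case.
        stored = sdf_name[len(prefix):] if sdf_name.startswith(prefix) else sdf_name
        args += [flag, f"{stored.lower()}:{value_type}:sdfprop:{sdf_name}"]
    return args


args = qprop_args(SDF_PROPS)
# Print one option per line, ready to paste into a shell script.
print(" \\\n    ".join(shlex.join(args[i:i + 2]) for i in range(0, len(args), 2)))
```

Passing flag="-tprop" would produce the corresponding options for the target set.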

Execution time is a few seconds. Launching the embedded server:

bin/gui.sh -in pubchem1k-drugbank-overlap.bin -port 8085

Display additional data

Properties are shown in the molecule table view and in the molecule details dialog of the pubchem1k molecule set (note that the "hydrogen display" setting was set to "Display dehydrogenized structures" by default; the cell size of the molecule table was increased):

Additional data display for pubchem1k

Properties shown for the drugbank-1 dataset:

Additional data display for drugbank

Use additional data in overlap analysis visualization

Numeric stored additional data can also be selected as a histogram dimension. In the following example the non-overlapping queries (those for which no similar target was found) are selected. The selection is further narrowed to queries having small cactvs_complexity and cactvs_rotatable_bond values (coming from PubChem). The distribution of atom_count (calculated by Chemical Terms) for the remaining most similar targets is also shown.

(Note that the "hydrogen display" setting was set to "Display dehydrogenized structures"):

Additional data display in knn visualization

Use k-NN analysis for property space exploration

The current visualization capabilities of the k-NN analysis can be used for property space analysis of a single molecule set. For this demonstration we will use the shipped chembl-21 dataset as a query set with properties calculated by Chemical Terms. As a target set we will use a single molecule consisting of a single carbon atom:

bin/calculateOverlap.sh \
    -name propspace-of-chembl-cfp7 \
    -descname cfp7 \
    -context createSimpleCfp7Context \
    -queryname chembl \
    -qmf data/molecules/chembl/chembl-21.smi.gz \
    -qidname \
    -qprop chemterms_atom_count:int:chemterm:atomCount \
    -qprop chemterms_rotatable_bonds:int:chemterm:rotatableBondCount \
    -qprop chemterms_ring_count:int:chemterm:ringCount \
    -qprop chemterms_mass:double:chemterm:mass \
    -targetname carbon \
    -tm "C a carbon" \
    -tidname \
    -out propspace-of-chembl-overlap.bin

Execution time (on an i7-4790 desktop machine) is around 4 minutes.

bin/gui.sh -in propspace-of-chembl-overlap.bin -port 8085

This analysis, as well as a similar one on the nci-250k dataset, is included in the self-contained example script examples/overlap-example.sh.

After opening the k-NN visualization page, the following changes are made to the page layout:

- Four additional histograms are added.
- The displayed histograms are set to show the individual query properties and the query indices.
- Histogram y axes are set to logarithmic (with the exception of the query indices).
- The molecule table is set to show 0 most similar targets (so only the selected queries are shown) and to show 4 records.
- The molecule table cell size is increased.
- Components are reordered: molecule table on the left, histograms on the right.

Property space exploration

This example visualization handles nearly 8M data points on the client (browser) side (~1.6M molecules * 5 displayed dimensions). The response time of the UI for certain interactions falls into the 1-2 second range. This is expected to improve in future versions.

Accessing additional data through REST API

See the documentation of the REST API endpoint molecules for details. A few examples using curl are given below. To try them, launch the examples/rest-api-example.sh script first.

Metadata on stored properties

curl -g "http://localhost:8085/rest/molecules/pubchem1k" | python -m json.tool

{
    "absentids": 0,
    "absentmols": 0,
    "description": "pubchem1k (from pubchem1k-drugbank-overlap.bin)",
    "name": "pubchem1k",
    "propnames": [
        "compound_cid",
        "compound_canonicalized",
        "cactvs_complexity",
        "cactvs_hbond_acceptor",
        "cactvs_hbond_donor",
        "cactvs_rotatable_bond",
        "exact_mass",
        "molecular_weight",
        "cactvs_tpsa",
        "monoisotopic_weight",
        "total_charge",
        "heavy_atom_count",
        "atom_def_stereo_count",
        "atom_udef_stereo_count",
        "bond_def_stereo_count",
        "bond_udef_stereo_count",
        "isotopic_atom_count",
        "cactvs_tauto_count"
    ],
    "props": [
        {
            "extractor": "Get SDF property \"PUBCHEM_COMPOUND_CID\" as java.lang.Integer. Missing values allowed: false",
            "name": "compound_cid",
            "numeric": true,
            "type": "java.lang.Integer"
        },
        {
            "extractor": "Get SDF property \"PUBCHEM_COMPOUND_CANONICALIZED\" as java.lang.Integer. Missing values allowed: false",
            "name": "compound_canonicalized",
            "numeric": true,
            "type": "java.lang.Integer"
        },

        ...

        {
            ...
        }
    ],
    "size": 1000,
    "url": "rest/molecules/pubchem1k"
}
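
The metadata response can also be consumed programmatically, for example to list the properties that are usable as histogram dimensions. This is a minimal Python sketch: the helper names fetch_metadata and numeric_property_names are hypothetical, and it assumes that every entry of props carries the name and numeric fields seen in the response above.

```python
import json
import urllib.request


def fetch_metadata(base_url, molset):
    """GET rest/molecules/<molset> and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/rest/molecules/{molset}") as resp:
        return json.load(resp)


def numeric_property_names(metadata):
    """Names of the stored properties flagged as numeric, in stored order."""
    return [p["name"] for p in metadata["props"] if p["numeric"]]


# Abbreviated copy of the response shown above, so the sketch runs offline.
sample = {
    "name": "pubchem1k",
    "props": [
        {"name": "compound_cid", "numeric": True, "type": "java.lang.Integer"},
        {"name": "compound_canonicalized", "numeric": True, "type": "java.lang.Integer"},
    ],
}
print(numeric_property_names(sample))

# With the embedded server running:
# print(numeric_property_names(fetch_metadata("http://localhost:8085", "pubchem1k")))
```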

Single property on a single structure

curl -g "http://localhost:8085/rest/molecules/pubchem1k/10/props/compound_cid" | python -m json.tool

{
    "value": 72819133
}

Properties for multiple molecules (index range)

curl -X POST \
     -H "Content-Type: application/x-www-form-urlencoded" \
     -d 'start=10' \
     -d 'maxcount=20' \
     -g "http://localhost:8085/rest/molecules/pubchem1k/props/compound_cid/get-properties-on-index-range" | python -m json.tool

{
    "count": 20,
    "start": 10,
    "values": [
        72819133,
        26175566,
        66777683,
        43934213,
        68924857,
        73745083,
        81901454,
        6127448,
        52060933,
        23288074,
        66710358,
        68665148,
        76605832,
        73377643,
        60711711,
        64497837,
        62627366,
        31893941,
        20771615,
        81785057
    ]
}
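
The same request can be built with the Python standard library. The sketch below only constructs the request and interprets a response; the function names are hypothetical, and it assumes that values[i] belongs to molecule index start + i, which is consistent with the single-property example above (index 10 has the value 72819133).

```python
import urllib.parse
import urllib.request


def index_range_request(base_url, molset, prop, start, maxcount):
    """Build (but do not send) the POST request for a range of molecule indices."""
    url = f"{base_url}/rest/molecules/{molset}/props/{prop}/get-properties-on-index-range"
    body = urllib.parse.urlencode({"start": start, "maxcount": maxcount}).encode("ascii")
    # Same method and Content-Type header as in the curl call.
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def values_by_index(response):
    """Map each molecule index to its value (assumes values[i] is index start + i)."""
    return {response["start"] + i: v for i, v in enumerate(response["values"])}


req = index_range_request("http://localhost:8085", "pubchem1k", "compound_cid", 10, 20)
print(req.get_method(), req.full_url, req.data)

# Abbreviated response: the first three values from the output above.
print(values_by_index({"count": 3, "start": 10, "values": [72819133, 26175566, 66777683]}))
```

With the embedded server running, the request can be sent with urllib.request.urlopen(req) and the body decoded with json.load.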

Multiple properties for multiple molecules

curl -X POST \
     -H "Content-Type: application/x-www-form-urlencoded" \
     -d 'indices[]=10&indices[]=11&indices[]=20' \
     -d 'props[]=compound_cid&props[]=molecular_weight' \
     -g "http://localhost:8085/rest/molecules/pubchem1k/get-multiple-props" | python -m json.tool

{
    "props": {
        "compound_cid": {
            "10": 72819133,
            "11": 26175566,
            "20": 66710358
        },
        "molecular_weight": {
            "10": 940.668,
            "11": 460.951,
            "20": 169.26
        }
    }
}
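
The response is keyed by property first and by molecule index second, and the indices arrive as JSON strings. A client that wants one record per molecule can pivot it. The following Python sketch uses hypothetical helper names and builds the form body with the same repeated indices[] and props[] field names that the curl call sends.

```python
import urllib.parse


def multi_props_body(indices, props):
    """Form-encode repeated indices[] and props[] fields."""
    fields = [("indices[]", i) for i in indices] + [("props[]", p) for p in props]
    # urlencode percent-encodes the brackets (indices%5B%5D=10); after form
    # decoding the field names are the same as with the literal "[]".
    return urllib.parse.urlencode(fields)


def rows_by_index(response):
    """Pivot {"props": {prop: {index: value}}} into {index: {prop: value}}."""
    rows = {}
    for prop, by_index in response["props"].items():
        for index, value in by_index.items():
            # JSON object keys are strings; convert them back to integers.
            rows.setdefault(int(index), {})[prop] = value
    return rows


print(multi_props_body([10, 11, 20], ["compound_cid", "molecular_weight"]))

response = {  # the response shown above
    "props": {
        "compound_cid": {"10": 72819133, "11": 26175566, "20": 66710358},
        "molecular_weight": {"10": 940.668, "11": 460.951, "20": 169.26},
    }
}
for index, row in sorted(rows_by_index(response).items()):
    print(index, row)
```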