Store additional data

Note that additional data storage is a work in progress and has limited support.

Prior to version 0.3.0, molecule stores supported storing and retrieving only a single arbitrary string for each molecule (referred to as the ID). As of version 0.3.2, additional properties stored when an overlap analysis is calculated can be used in the overlap visualization.

Limited support

Specify additional data

The calculateOverlap.sh tool can also add extra data to the query and target sets. See the command line help for the options -qprop <SPEC> and -tprop <SPEC>. Property specifications are applied to each processed molecule, and the resulting values are stored in the serialized files.

Specification format

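Each specification parameter is a string in the format <NAME>:<VALUETYPE>:<CALCTYPE>[:<CALCSPEC>], where <NAME> is the name under which the property is stored, <VALUETYPE> is the type of the stored value, <CALCTYPE> selects the calculator and the optional <CALCSPEC> is the argument passed to the calculator.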

Currently supported calculators, their value types and parameter strings:

<CALCTYPE> | Value types | Description and argument (<CALCSPEC>) interpretation
molname | string | Use the name stored in the input molecule. No argument is accepted.
sdfprop | int, long, double, string | Use the value stored in an SDF property of the input molecule. The argument is the property name and is mandatory.
chemterm | int, long, double, string | Evaluate a Chemical Terms expression on the input molecule. The argument is the expression to evaluate and is mandatory. Note that the imported molecule is passed to Chemical Terms evaluation WITHOUT its SDF properties. Additional licenses might be needed to execute certain calculations.
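The specification format and the calculator table above can be checked before a long run is started. The following Python sketch is only an illustration and is not part of the tool: the names PropSpec and parse_prop_spec are hypothetical, and it assumes that only the first three colons act as separators, so that an argument containing a colon stays in one piece. How calculateOverlap.sh itself treats such arguments is not stated here.

```python
from typing import NamedTuple, Optional

# Calculators and their value types, copied from the table above.
SUPPORTED = {
    "molname": {"string"},
    "sdfprop": {"int", "long", "double", "string"},
    "chemterm": {"int", "long", "double", "string"},
}
# Calculators with a mandatory argument; molname accepts none.
ARG_REQUIRED = {"sdfprop", "chemterm"}


class PropSpec(NamedTuple):
    """Hypothetical container for one parsed specification."""
    name: str
    value_type: str
    calc_type: str
    calc_spec: Optional[str]


def parse_prop_spec(spec: str) -> PropSpec:
    """Split and validate a <NAME>:<VALUETYPE>:<CALCTYPE>[:<CALCSPEC>] string."""
    # Assumption: only the first three colons are separators.
    parts = spec.split(":", 3)
    if len(parts) < 3:
        raise ValueError(f"expected at least 3 fields: {spec!r}")
    name, value_type, calc_type = parts[:3]
    calc_spec = parts[3] if len(parts) == 4 else None
    if not name:
        raise ValueError("empty property name")
    if calc_type not in SUPPORTED:
        raise ValueError(f"unknown calculator: {calc_type!r}")
    if value_type not in SUPPORTED[calc_type]:
        raise ValueError(f"{calc_type} does not produce {value_type!r}")
    if calc_type in ARG_REQUIRED and not calc_spec:
        raise ValueError(f"{calc_type} requires an argument")
    if calc_type not in ARG_REQUIRED and calc_spec is not None:
        raise ValueError(f"{calc_type} accepts no argument")
    return PropSpec(name, value_type, calc_type, calc_spec)


print(parse_prop_spec("exact_mass:double:sdfprop:PUBCHEM_EXACT_MASS"))
```

A specification such as x:int:molname is rejected by this sketch because molname is listed with the string value type only.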

Example

The following invocation is based on the example script rest-api-example.sh (see Self contained examples for details), in which various SDF properties are extracted and stored:

bin/calculateOverlap.sh \
    -name pubchem1k-in-drugbank-cfp7 \
    -descname cfp7 \
    -context createSimpleCfp7Context \
    -queryname pubchem1k \
    -qmf data/molecules/pubchem-compound/pubchem-compound-rnd-1k.sdf.gz \
    -qidname \
    -qprop compound_cid:int:sdfprop:PUBCHEM_COMPOUND_CID \
    -qprop compound_canonicalized:int:sdfprop:PUBCHEM_COMPOUND_CANONICALIZED \
    -qprop cactvs_complexity:double:sdfprop:PUBCHEM_CACTVS_COMPLEXITY \
    -qprop cactvs_hbond_acceptor:int:sdfprop:PUBCHEM_CACTVS_HBOND_ACCEPTOR \
    -qprop cactvs_hbond_donor:int:sdfprop:PUBCHEM_CACTVS_HBOND_DONOR \
    -qprop cactvs_rotatable_bond:int:sdfprop:PUBCHEM_CACTVS_ROTATABLE_BOND \
    -qprop exact_mass:double:sdfprop:PUBCHEM_EXACT_MASS \
    -qprop molecular_weight:double:sdfprop:PUBCHEM_MOLECULAR_WEIGHT \
    -qprop cactvs_tpsa:double:sdfprop:PUBCHEM_CACTVS_TPSA \
    -qprop monoisotopic_weight:double:sdfprop:PUBCHEM_MONOISOTOPIC_WEIGHT \
    -qprop total_charge:int:sdfprop:PUBCHEM_TOTAL_CHARGE \
    -qprop heavy_atom_count:int:sdfprop:PUBCHEM_HEAVY_ATOM_COUNT \
    -qprop atom_def_stereo_count:int:sdfprop:PUBCHEM_ATOM_DEF_STEREO_COUNT \
    -qprop atom_udef_stereo_count:int:sdfprop:PUBCHEM_ATOM_UDEF_STEREO_COUNT \
    -qprop bond_def_stereo_count:int:sdfprop:PUBCHEM_BOND_DEF_STEREO_COUNT \
    -qprop bond_udef_stereo_count:int:sdfprop:PUBCHEM_BOND_UDEF_STEREO_COUNT \
    -qprop isotopic_atom_count:int:sdfprop:PUBCHEM_ISOTOPIC_ATOM_COUNT \
    -qprop cactvs_tauto_count:int:sdfprop:PUBCHEM_CACTVS_TAUTO_COUNT \
    -targetname drugbank \
    -tmf data/molecules/drugbank/drugbank-common_name.smi.gz \
    -tidname \
    -tprop chemterms_atom_count:int:chemterm:atomCount \
    -tprop chemterms_rotatable_bonds:int:chemterm:rotatableBondCount \
    -tprop chemterms_ring_count:int:chemterm:ringCount \
    -tprop chemterms_mass:double:chemterm:mass \
    -out pubchem1k-drugbank-overlap.bin
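Long lists of -qprop options like the one above follow a regular pattern, so they can be generated instead of typed. The Python sketch below is an illustration only: the helper name qprop_args is hypothetical, and the naming rule (drop the PUBCHEM_ prefix, then lowercase) is read off the example invocation rather than required by the tool.

```python
import shlex


def qprop_args(sdf_props, flag="-qprop", prefix="PUBCHEM_"):
    """Return [flag, spec, flag, spec, ...] for (sdf_property, value_type) pairs."""
    args = []
    for sdf_name, value_type in sdf_props:
        # Naming convention of the example above: drop the prefix, lowercase.
        name = sdf_name[len(prefix):] if sdf_name.startswith(prefix) else sdf_name
        args += [flag, f"{name.lower()}:{value_type}:sdfprop:{sdf_name}"]
    return args


props = [
    ("PUBCHEM_COMPOUND_CID", "int"),
    ("PUBCHEM_EXACT_MASS", "double"),
]
# shlex.join quotes each argument for safe pasting into a shell command.
print(shlex.join(qprop_args(props)))
```

The returned list can also be appended to an argument list passed to subprocess.run, which avoids shell quoting altogether.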

Execution takes a few seconds. Launch the embedded server:

bin/gui.sh -in pubchem1k-drugbank-overlap.bin -port 8085

Display additional data

Properties are shown in the molecule table view and the molecule details dialog of the pubchem1k molecule set (note that the "hydrogen display" setting was set to "Display dehydrogenized structures" by default, and the cell size of the molecule table was increased):

Additional data display for pubchem1k

Properties shown for the drugbank-1 dataset:

Additional data display for drugbank

Use additional data in overlap analysis visualization

The distribution of stored numeric additional data can also be selected as a histogram dimension. In the following example the non-overlapping queries (those for which no similar target was found) are selected. The selection is further narrowed to queries with small cactvs_complexity and cactvs_rotatable_bond values (both coming from PubChem). The distribution of atom_count (calculated by Chemical Terms) over the remaining most similar targets is also shown.

(Note that the "hydrogen display" setting was set to "Display dehydrogenized structures"):

Additional data display in knn visualization

Use k-NN analysis for property space exploration

The current visualization capabilities of the k-NN analysis can be used for property space analysis of a single molecule set. For this demonstration we will use the shipped chembl-21 dataset as the query set, with properties calculated by Chemical Terms. As the target set we will use a single molecule consisting of a single carbon atom:

bin/calculateOverlap.sh \
    -name propspace-of-chembl-cfp7 \
    -descname cfp7 \
    -context createSimpleCfp7Context \
    -queryname chembl \
    -qmf data/molecules/chembl/chembl-21.smi.gz \
    -qidname \
    -qprop chemterms_atom_count:int:chemterm:atomCount \
    -qprop chemterms_rotatable_bonds:int:chemterm:rotatableBondCount \
    -qprop chemterms_ring_count:int:chemterm:ringCount \
    -qprop chemterms_mass:double:chemterm:mass \
    -targetname carbon \
    -tm "C a carbon" \
    -tidname \
    -out propspace-of-chembl-overlap.bin

Execution time (on an i7-4790 desktop machine) is around 4 minutes. Launch the embedded server:

bin/gui.sh -in propspace-of-chembl-overlap.bin -port 8085

This analysis, as well as a similar one on the nci-250k dataset, is included in the self contained example script examples/overlap-example.sh.

After opening the k-NN visualization page the following changes are made to the page layout

Property space exploration

This example visualization handles nearly 8M data points on the client (browser) side (~1.6M molecules * 5 displayed dimensions). The UI response time for certain interactions is in the 1-2 second range. This is expected to improve in future versions.

Accessing additional data through REST API

See the REST API endpoint molecules for detailed documentation. A few examples using curl are given below. To try them, launch the examples/rest-api-example.sh script.

Metadata on stored properties

curl -g "http://localhost:8085/rest/molecules/pubchem1k" | python -m json.tool
{
    "absentids": 0,
    "absentmols": 0,
    "description": "pubchem1k (from pubchem1k-drugbank-overlap.bin)",
    "name": "pubchem1k",
    "propnames": [
        "compound_cid",
        "compound_canonicalized",
        "cactvs_complexity",
        "cactvs_hbond_acceptor",
        "cactvs_hbond_donor",
        "cactvs_rotatable_bond",
        "exact_mass",
        "molecular_weight",
        "cactvs_tpsa",
        "monoisotopic_weight",
        "total_charge",
        "heavy_atom_count",
        "atom_def_stereo_count",
        "atom_udef_stereo_count",
        "bond_def_stereo_count",
        "bond_udef_stereo_count",
        "isotopic_atom_count",
        "cactvs_tauto_count"
    ],
    "props": [
        {
            "extractor": "Get SDF property \"PUBCHEM_COMPOUND_CID\" as java.lang.Integer. Missing values allowed: false",
            "name": "compound_cid",
            "numeric": true,
            "type": "java.lang.Integer"
        },
        {
            "extractor": "Get SDF property \"PUBCHEM_COMPOUND_CANONICALIZED\" as java.lang.Integer. Missing values allowed: false",
            "name": "compound_canonicalized",
            "numeric": true,
            "type": "java.lang.Integer"
        },

        ...

        {
            ...
        }
    ],
    "size": 1000,
    "url": "rest/molecules/pubchem1k"
}
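The metadata response above flags each stored property as numeric or not, which can be used to list the properties that qualify as histogram dimensions. The Python sketch below is an illustration under stated assumptions: the helper name numeric_prop_names is hypothetical, and the sample is a trimmed copy of the response shown above rather than live server output.

```python
import json


def numeric_prop_names(metadata):
    """Names of stored properties flagged as numeric, in stored order."""
    return [p["name"] for p in metadata.get("props", []) if p.get("numeric")]


# Trimmed copy of the response shown above (two of the props entries).
sample = json.loads("""
{
  "name": "pubchem1k",
  "size": 1000,
  "props": [
    {"name": "compound_cid", "numeric": true, "type": "java.lang.Integer"},
    {"name": "compound_canonicalized", "numeric": true, "type": "java.lang.Integer"}
  ]
}
""")

print(numeric_prop_names(sample))
```

Against a running server, the dictionary could instead be obtained by passing the result of urllib.request.urlopen on the same URL to json.load.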

Single property on a single structure

curl -g "http://localhost:8085/rest/molecules/pubchem1k/10/props/compound_cid" | python -m json.tool
{
    "value": 72819133
}

Properties for multiple molecules (index range)

curl -X POST \
     -H "Content-Type: application/x-www-form-urlencoded" \
     -d 'start=10' \
     -d 'maxcount=20' \
     -g "http://localhost:8085/rest/molecules/pubchem1k/props/compound_cid/get-properties-on-index-range" | python -m json.tool
{
    "count": 20,
    "start": 10,
    "values": [
        72819133,
        26175566,
        66777683,
        43934213,
        68924857,
        73745083,
        81901454,
        6127448,
        52060933,
        23288074,
        66710358,
        68665148,
        76605832,
        73377643,
        60711711,
        64497837,
        62627366,
        31893941,
        20771615,
        81785057
    ]
}
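The index-range response carries only a start index and a list of values. The Python sketch below pairs each value with its molecule index. It assumes the values are listed in consecutive index order beginning at start, which agrees with the single-property example (index 10 holds 72819133). The helper name values_by_index is hypothetical and the sample is trimmed from the response above.

```python
def values_by_index(response):
    """Map molecule index -> property value for an index-range response."""
    start = response["start"]
    # Assumption: values[i] belongs to the molecule at index start + i.
    return {start + i: value for i, value in enumerate(response["values"])}


# First three values of the response shown above (count adjusted accordingly).
sample = {"count": 3, "start": 10, "values": [72819133, 26175566, 66777683]}

for index, value in values_by_index(sample).items():
    print(index, value)
```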

Multiple properties for multiple molecules

curl -X POST \
     -H "Content-Type: application/x-www-form-urlencoded" \
     -d 'indices[]=10&indices[]=11&indices[]=20' \
     -d 'props[]=compound_cid&props[]=molecular_weight' \
     -g "http://localhost:8085/rest/molecules/pubchem1k/get-multiple-props" | python -m json.tool
{
    "props": {
        "compound_cid": {
            "10": 72819133,
            "11": 26175566,
            "20": 66710358
        },
        "molecular_weight": {
            "10": 940.668,
            "11": 460.951,
            "20": 169.26
        }
    }
}
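The request body and the response of this call can be handled programmatically as well. The Python sketch below builds the form-encoded body with the indices[] and props[] keys used in the curl command above, and pivots the per-property response into one record per molecule. Both helper names (multiple_props_body and rows_by_index) are hypothetical, and the sample response is copied from the output above.

```python
from urllib.parse import urlencode


def multiple_props_body(indices, props):
    """Form-encode the request body using the indices[] / props[] keys."""
    pairs = [("indices[]", i) for i in indices] + [("props[]", p) for p in props]
    # safe="[]" keeps the brackets literal, as in the curl command above.
    return urlencode(pairs, safe="[]")


def rows_by_index(response):
    """Pivot {prop: {index: value}} into {index: {prop: value}}."""
    rows = {}
    for prop, by_index in response["props"].items():
        for index, value in by_index.items():
            # JSON object keys are strings; convert them back to integers.
            rows.setdefault(int(index), {})[prop] = value
    return rows


# Response copied from the output above.
sample = {
    "props": {
        "compound_cid": {"10": 72819133, "11": 26175566, "20": 66710358},
        "molecular_weight": {"10": 940.668, "11": 460.951, "20": 169.26},
    }
}

print(multiple_props_body([10, 11, 20], ["compound_cid", "molecular_weight"]))
print(rows_by_index(sample)[20])
```

The body string can be sent with any HTTP client together with the Content-Type header application/x-www-form-urlencoded, as the curl example does.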