Self-contained examples
The `examples/` directory contains example scripts. These scripts implement workflow-specific, self-contained examples and serve as a starting point for evaluating the provided functionality. The examples use a working directory to store downloaded and generated files. The working directory is usually a subdirectory of the distribution's `examples-tmp/` directory.
Most of these scripts can be launched without further arguments after unpacking the distribution and installing the supplied license file. Some of them require publicly available datasets to be downloaded first, and their behavior can be customized with command line options; both are detailed below. See document Getting started guide for an overview of some of these examples.
The self-contained example scripts ultimately use the MadFast command line tools as the building blocks of the implemented workflows. See document Command line interfaces (CLIs) for details.
Please note that a possible race condition in file I/O might result in an exception (`Exception in thread "main" java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Empty stream cannot be read.`) when reading files or standard input.
Command line processing examples
These examples do not launch the embedded server or the web-based UI. They quit after processing and searching are done.
- `search-workflow.sh`: Complete example of the basic search usage introduced in Basic search workflow. This prepares the input set `drugbank-all` and invokes similarity searches against it.
- `custom-binaryfp-workflow-vitamins.sh`: Custom binary fingerprint based workflow using the small `vitamins` dataset. This script creates a custom binary fingerprint, imports it and invokes a search. For details see document Using custom binary descriptors.
- `custom-floatv-workflow.sh`: Custom float vector descriptor based workflow using a small dataset (`data/floatdesc.txt`) containing 2D vectors. This script imports the dataset and invokes a search. For details see document Using custom float descriptors.
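As a minimal sketch, the three command line examples could be launched one after another as shown below. The `APP_HOME` variable and the `run_example` helper are assumptions made for this illustration; only the script names come from the list above. Scripts that cannot be found are skipped, so the sketch can also be tried outside an unpacked distribution.

```shell
#!/bin/sh
# Assumption: APP_HOME points to the unpacked distribution root
# (defaults to the current directory).
APP_HOME="${APP_HOME:-.}"

# Hypothetical helper: run one example script without further arguments,
# or report that it was not found.
run_example() {
    script="$APP_HOME/examples/$1"
    if [ -x "$script" ]; then
        echo "Running $1"
        "$script"
    else
        echo "Skipping $1 (not found at $script)"
    fi
}

for name in search-workflow.sh \
            custom-binaryfp-workflow-vitamins.sh \
            custom-floatv-workflow.sh
do
    run_example "$name"
done
```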
REST API/Web UI examples
These examples launch the embedded web server contained in tool `gui.sh`. They calculate descriptors and launch an embedded server which provides a REST API and a Web UI for real-time searches. Scripts exposing larger datasets collect profiling and execution statistics and expose them on the Web UI. The following scripts use only the shipped datasets, so they can be launched without downloading any further public datasets:
Script | Memory used | Preparation time | Load time | Molecule count | Descriptor count | Molecule sets | Descriptors |
---|---|---|---|---|---|---|---|
rest-api-example.sh | java default | < 20 s | 1 s | 30 | 51 | vitamins, N/A (custom descriptors with no attached molecules) | CFP-7, custom binary and floats |
rest-api-small.sh | 0.5 G | < 1 min | 1 s | 249 k | 249 k | nci-250k | CFP-7 |
rest-api-medium.sh | 1 G | 5 min | 4 s | 1.9 M | 3.8 M | vitamins, drugbank-all, pubchem-rnd-100k, nci-250k, chembl_21 | CFP-7, ECFP-4 |
rest-api-medium-maccs.sh | 2 G | 20 min | 4 s | 1.9 M | 5.8 M | vitamins, drugbank-all, pubchem-rnd-100k, nci-250k, chembl_21 | CFP-7, ECFP-4, MACCS-166 |
The following scripts exercise overlap analysis calculation (see Introduction to overlap analysis):
Script | Memory used | Preparation time |
---|---|---|
rest-api-example.sh | java default | < 20 s |
overlap-example.sh | java default | 40 min |
The following scripts expose one or more publicly available datasets which are not included in the distribution. These datasets must be downloaded prior to execution, either manually (as described in document Prepare example molecule sets) or by using script `examples/download-molecules.sh`. Column Sets to download lists the options to pass to this script.
Script | Memory used | Preparation time | Load time | Sets to download | Molecule count | Descriptor count | Molecule sets | Descriptors |
---|---|---|---|---|---|---|---|---|
rest-api-large.sh | 8 G | 11 min | 38 s | -E | ~ 19 M | ~ 19 M | vitamins, drugbank-all, pubchem-rnd-100k, nci-250k, chembl_21, emolecules-plus | CFP-7 |
rest-api-large-ecfp.sh | 10 G | 25 min | 28 s | -E | ~ 19 M | ~ 38 M | vitamins, drugbank-all, pubchem-rnd-100k, nci-250k, chembl_21, emolecules-plus | CFP-7, ECFP-4 |
rest-api-large-ecfp-maccs.sh | 12 G | 2 h 30 min | 40 s | -E | ~ 19 M | ~ 58 M | vitamins, drugbank-all, pubchem-rnd-100k, nci-250k, chembl_21, emolecules-plus | CFP-7, ECFP-4, MACCS-166 |
rest-api-xlarge.sh | 20 G | 40 min | 22 s | -E -S | ~ 26 M | ~ 26 M | vitamins, drugbank-all, pubchem-rnd-100k, nci-250k, chembl_21, emolecules-plus, surechembl | CFP-7 |
rest-api-xlarge-ecfp.sh | 20 G | 1 h 00 min | 32 s | -E -S | ~ 26 M | ~ 52 M | vitamins, drugbank-all, pubchem-rnd-100k, nci-250k, chembl_21, emolecules-plus, surechembl | CFP-7, ECFP-4 |
rest-api-xlarge-ecfp-maccs.sh | 20 G | 5 h 00 min | 51 s | -E -S | ~ 26 M | ~ 78 M | vitamins, drugbank-all, pubchem-rnd-100k, nci-250k, chembl_21, emolecules-plus, surechembl | CFP-7, ECFP-4, MACCS-166 |
rest-api-xxlarge.sh | 28 G | 57 min | 57 s | -E -S -Z | ~ 54 M | ~ 54 M | vitamins, drugbank-all, pubchem-rnd-100k, nci-250k, chembl_21, emolecules-plus, surechembl, zinc-all | CFP-7 |
rest-api-xxlarge-ecfp.sh | 28 G | 1 h 26 min | 62 s | -E -S -Z | ~ 54 M | ~ 108 M | vitamins, drugbank-all, pubchem-rnd-100k, nci-250k, chembl_21, emolecules-plus, surechembl, zinc-all | CFP-7, ECFP-4 |
rest-api-xxlarge-ecfp-maccs.sh | 28 G | 6 h 56 min | 239 s | -E -S -Z | ~ 54 M | ~ 162 M | vitamins, drugbank-all, pubchem-rnd-100k, nci-250k, chembl_21, emolecules-plus, surechembl, zinc-all | CFP-7, ECFP-4, MACCS-166 |
Notes:
- Memory used is the amount of memory granted to tool `gui.sh` and other tools using option `-Xmx...`. Please note that some examples use further JVM memory adjustments (setting `-XX:NewRatio=...`).
- Preparation time is the time required to preprocess the structures and launch the embedded server. Times were measured on a desktop machine equipped with an Intel Core i7-4790 CPU and 32 GB RAM, running Ubuntu Linux. For more details on performance on various hardware see document Performance.
- Load time is the time required by the embedded server to load all the serialized binary files and profiling data on startup. Note that the JVM and the embedded server need a few additional seconds to start listening.
- Molecule count is the total number of molecules exposed.
- Descriptor count is the total number of descriptors exposed.
- Sets to download contains the command line arguments to pass to `download-molecules.sh` to download and prepare the publicly available datasets required to run the example.
- Data sets shipped with this distribution are expected to be found in file `<APP_HOME>/data/molecules/<SET_NAME>/<FILE_NAME>.smi.gz`.
- Downloaded datasets are expected to be found in either of the following two locations: `<MOLS_DIR>/<SET_NAME>/<FILE_NAME>.smi.gz` or `<MOLS_DIR>/<FILE_NAME>.smi.gz`, where the value of `<MOLS_DIR>` defaults to `<APP_HOME>/examples-tmp/download-molecules/` and can be customized by option `-m <MOLS_DIR>`. With this behavior a data set downloaded by `download.sh` is found without further configuration, while manually downloaded sets can be placed in a flat directory structure.
- Sets `pubchem-compound` and `gbb` are not used by the example scripts.
- The `rest-api-XXXX` example scripts do not launch the default browser unless option `-b` is specified.
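The two-location lookup for downloaded datasets described in the notes above can be sketched as follows. This is an illustration, not the code used by the shipped scripts: the function name `find_dataset` is hypothetical, checking the nested location before the flat one is an assumption, and `my-set` and `my-file` are placeholder names.

```shell
#!/bin/sh
# Usage: find_dataset MOLS_DIR SET_NAME FILE_NAME
# Prints the first existing candidate path and returns 0;
# returns 1 when the file is found in neither location.
find_dataset() {
    mols_dir=$1
    set_name=$2
    file_name=$3
    for candidate in \
        "$mols_dir/$set_name/$file_name.smi.gz" \
        "$mols_dir/$file_name.smi.gz"
    do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Example call with placeholder names; MOLS_DIR is taken relative to the
# distribution root here, mirroring the documented default location.
MOLS_DIR="${MOLS_DIR:-examples-tmp/download-molecules}"
find_dataset "$MOLS_DIR" my-set my-file \
    || echo "my-file.smi.gz not found under $MOLS_DIR"
```

Because both candidates are checked, a file placed directly in `<MOLS_DIR>` is found just as well as one stored in a per-set subdirectory.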
Other scripts
- `download-molecules.sh`: Download and prepare public molecule sets not supplied with this distribution. For usage help launch with option `-h`. For details of the downloaded sets see document Prepare molecules.
- `verify-concurrent-generation.sh`: Execute the comparison workflow detailed in document Verification and benchmarking of concurrent implementations.
Notes on non-workflow specific content of the scripts
The scripts usually provide the following common functionality.
- Option `-h` prints command line help and exits.
- They place generated files into a working directory. By default this directory is created in `examples-tmp/<SCRIPT_NAME>` under the distribution root. If option `-w <WORKDIR>` is specified, the generated files are placed in the specified directory.
- They expect the `data` and `bin` directories of the distribution beside the directory of the example script. (The example scripts can normally be found in directory `examples` of the distribution directory.) The location of the distribution home (with these directories) can alternatively be specified using option `-a <APPHOME>`.
- They write detailed verbose information during execution to the output and to a log file. The log file can be specified with option `-l <LOGFILE>`.
- Before launching long-running tasks they usually check whether the output to be written already exists and, if so, skip the invocation.
- For some scripts option `-m` specifies the directory containing input molecule files which are not part of the distribution.
- For script `rest-api-example.sh` option `-j` can specify a Marvin JS license file which will be exposed to the launched Web UI.
- For scripts `rest-api-example.sh` and `rest-api-medium.sh` option `-e` invokes the embedded server with option `-earlyStart`. For details see document Asynchronous server loading.
- Scripts collect and store system details like processor type, processor count and available memory. They are stored as system properties in execution statistics files. See Profiling and execution statistics for details.
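To make the common options more concrete, the sketch below shows one way such handling could be written in POSIX shell. It is an illustration under stated assumptions, not the code of the shipped scripts: the function names `parse_common_options` and `run_unless_exists` are hypothetical, the distribution root is assumed to be the current directory unless `-a` is given, and `<SCRIPT_NAME>` is assumed to be the script name without the `.sh` suffix.

```shell
#!/bin/sh
# Usage: parse_common_options SCRIPT_NAME [options...]
# Sets WORKDIR, APPHOME and LOGFILE from options -w, -a and -l.
parse_common_options() {
    script_name=$1
    shift
    APPHOME="."   # assumption: launched from the distribution root
    WORKDIR=""
    LOGFILE=""
    OPTIND=1      # reset so the function can be called repeatedly
    while getopts "hw:a:l:" opt; do
        case "$opt" in
            h) echo "Usage: $script_name [-h] [-w WORKDIR] [-a APPHOME] [-l LOGFILE]"
               return 2 ;;
            w) WORKDIR=$OPTARG ;;
            a) APPHOME=$OPTARG ;;
            l) LOGFILE=$OPTARG ;;
            *) return 1 ;;
        esac
    done
    # Default working directory: examples-tmp/<SCRIPT_NAME> under the root.
    : "${WORKDIR:=$APPHOME/examples-tmp/$script_name}"
}

# Run a (possibly long-running) command only if its output is missing.
# Usage: run_unless_exists OUTPUT_PATH command [args...]
run_unless_exists() {
    output=$1
    shift
    if [ -e "$output" ]; then
        echo "Skipping: $output already exists"
    else
        "$@"
    fi
}

parse_common_options search-workflow -a /opt/madfast
echo "$WORKDIR"   # prints /opt/madfast/examples-tmp/search-workflow
```

Guarding each expensive step with a check such as `run_unless_exists` lets an interrupted run be restarted without repeating completed steps, which matches the skip behavior described in the list above.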