Self-contained examples

The examples/ directory contains example scripts. These scripts implement workflow-specific, self-contained examples and serve as a starting point for evaluating the provided functionality. The examples use a working directory to store downloaded and generated files. The working directory is usually a subdirectory of the distribution's examples-tmp/ directory.

Most of these scripts can be launched without further arguments after unpacking the distribution and installing the supplied license file. Some of them require publicly available datasets to be downloaded first, and the behavior of the scripts can be customized; see details below. See document Getting started guide for an overview of some of these examples.

The self-contained example scripts ultimately use the MadFast command line tools as the building blocks of the implemented workflows. See document Command line interfaces (CLIs) for details.

Please note that a possible race condition in file I/O might result in an exception (Exception in thread "main" java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Empty stream cannot be read.) when reading files or standard input.

Command line processing examples

These examples do not launch the embedded server or the web-based UI. They exit once processing and searching are done.

search-workflow.sh: Complete example of the basic search usage introduced in Basic search workflow. This prepares the input set drugbank-all and invokes similarity searches against it.
custom-binaryfp-workflow-vitamins.sh: Custom binary fingerprint based workflow using the small vitamins dataset. This script creates a custom binary fingerprint, imports it and invokes a search. For details see document Using custom binary descriptors.
custom-floatv-workflow.sh: Custom float vector descriptor based workflow using a small dataset (data/floatdesc.txt) containing 2D vectors. This script imports the dataset and invokes a search. For details see document Using custom float descriptors.

REST API/Web UI examples

These examples launch the embedded web server contained in tool gui.sh. They calculate descriptors and launch an embedded server which provides a REST API and a Web UI for real-time searches. Scripts exposing larger datasets collect profiling and execution statistics and expose them on the Web UI. The following scripts use only the shipped datasets, so they can be launched without downloading any further public datasets:

Script	Memory used	Preparation time	Load time	Molecule count	Descriptor count	Molecule sets	Descriptors
`rest-api-example.sh`	java default	< 20 s	1 s	30	51	`vitamins`, N/A (custom descriptors with no attached molecules)	`CFP-7`, custom binary and floats
`rest-api-small.sh`	0.5 G	< 1 min	1 s	249 k	249 k	`nci-250k`	`CFP-7`
`rest-api-medium.sh`	1 G	5 min	4 s	1.9 M	3.8 M	`vitamins`, `drugbank-all`, `pubchem-rnd-100k`, `nci-250k`, `chembl_21`	`CFP-7`, `ECFP-4`
`rest-api-medium-maccs.sh`	2 G	20 min	4 s	1.9 M	5.8 M	`vitamins`, `drugbank-all`, `pubchem-rnd-100k`, `nci-250k`, `chembl_21`	`CFP-7`, `ECFP-4`, `MACCS-166`

The following scripts exercise overlap analysis calculation (see Introduction to overlap analysis):

Script	Memory used	Preparation time
`rest-api-example.sh`	java default	< 20 s
`overlap-example.sh`	java default	40 min

The following scripts expose one or more publicly available datasets which are not included in the distribution. These datasets must be downloaded prior to execution, either manually (as described in document Prepare example molecule sets) or by using script examples/download-molecules.sh. The Sets to download column lists the options to pass to this script.

Script	Memory used	Preparation time	Load time	Sets to download	Molecule count	Descriptor count	Molecule sets	Descriptors
`rest-api-large.sh`	8 G	11 min	38 s	`-E`	~ 19 M	~ 19 M	`vitamins`, `drugbank-all`, `pubchem-rnd-100k`, `nci-250k`, `chembl_21`, `emolecules-plus`	`CFP-7`
`rest-api-large-ecfp.sh`	10 G	25 min	28 s	`-E`	~ 19 M	~ 38 M	`vitamins`, `drugbank-all`, `pubchem-rnd-100k`, `nci-250k`, `chembl_21`, `emolecules-plus`	`CFP-7`, `ECFP-4`
`rest-api-large-ecfp-maccs.sh`	12 G	2 h 30 min	40 s	`-E`	~ 19 M	~ 58 M	`vitamins`, `drugbank-all`, `pubchem-rnd-100k`, `nci-250k`, `chembl_21`, `emolecules-plus`	`CFP-7`, `ECFP-4`, `MACCS-166`
`rest-api-xlarge.sh`	20 G	40 min	22 s	`-E -S`	~ 26 M	~ 26 M	`vitamins`, `drugbank-all`, `pubchem-rnd-100k`, `nci-250k`, `chembl_21`, `emolecules-plus`, `surechembl`	`CFP-7`
`rest-api-xlarge-ecfp.sh`	20 G	1 h 00 min	32 s	`-E -S`	~ 26 M	~ 52 M	`vitamins`, `drugbank-all`, `pubchem-rnd-100k`, `nci-250k`, `chembl_21`, `emolecules-plus`, `surechembl`	`CFP-7`, `ECFP-4`
`rest-api-xlarge-ecfp-maccs.sh`	20 G	5 h 00 min	51 s	`-E -S`	~ 26 M	~ 78 M	`vitamins`, `drugbank-all`, `pubchem-rnd-100k`, `nci-250k`, `chembl_21`, `emolecules-plus`, `surechembl`	`CFP-7`, `ECFP-4`, `MACCS-166`
`rest-api-xxlarge.sh`	28 G	57 min	57 s	`-E -S -Z`	~ 54 M	~ 54 M	`vitamins`, `drugbank-all`, `pubchem-rnd-100k`, `nci-250k`, `chembl_21`, `emolecules-plus`, `surechembl`, `zinc-all`	`CFP-7`
`rest-api-xxlarge-ecfp.sh`	28 G	1 h 26 min	62 s	`-E -S -Z`	~ 54 M	~ 108 M	`vitamins`, `drugbank-all`, `pubchem-rnd-100k`, `nci-250k`, `chembl_21`, `emolecules-plus`, `surechembl`, `zinc-all`	`CFP-7`, `ECFP-4`
`rest-api-xxlarge-ecfp-maccs.sh`	28 G	6 h 56 min	239 s	`-E -S -Z`	~ 54 M	~ 162 M	`vitamins`, `drugbank-all`, `pubchem-rnd-100k`, `nci-250k`, `chembl_21`, `emolecules-plus`, `surechembl`, `zinc-all`	`CFP-7`, `ECFP-4`, `MACCS-166`
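
The download-then-launch sequence described above can be sketched as follows. This is a minimal illustration, not part of the distribution: the helper functions run and launch_example and the variables APP_HOME and DRY_RUN are hypothetical names introduced here. By default the sketch only prints the commands it would execute; the script names and the -E option are taken from the table above.

```shell
#!/bin/sh
# Sketch: fetch the extra datasets an example needs, then launch the example.
# APP_HOME, DRY_RUN, run and launch_example are illustrative names only.
APP_HOME="${APP_HOME:-.}"    # assumed distribution root
DRY_RUN="${DRY_RUN:-1}"      # 1: only print commands, 0: execute them

# Print a command line; execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# launch_example <script name> <download options...>
# The download options come from the "Sets to download" column.
launch_example() {
    script="$1"
    shift
    run "$APP_HOME/examples/download-molecules.sh" "$@" &&
        run "$APP_HOME/examples/$script"
}

# rest-api-large.sh needs the sets selected by option -E
launch_example rest-api-large.sh -E
```

Setting DRY_RUN=0 would execute the two commands instead of printing them. Because the download and preparation steps can take a long time for the larger examples, it may be preferable to run them manually the first time.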

Notes:

Memory used is the amount of memory granted to tool gui.sh and other tools using option -Xmx.... Please note that some examples use further JVM memory adjustments (setting -XX:NewRatio=...).
Preparation time is the time required for preprocessing the structures and launching the embedded server. Times were measured on a desktop machine equipped with an Intel Core i7-4790 CPU and 32 GB RAM, running Ubuntu Linux. For more details on performance on various hardware see document Performance.
Load time is the time required by the embedded server to load all the serialized binary files and profiling data on startup. Note that the JVM and the embedded server require a few additional seconds to start listening.
Molecule count is the total number of molecules exposed.
Descriptor count is the total number of descriptors exposed.
Sets to download contains command line arguments to pass to download-molecules.sh to download and prepare publicly available datasets required to run the example.
Datasets shipped with this distribution are expected to be found in file <APP_HOME>/data/molecules/<SET_NAME>/<FILE_NAME>.smi.gz.
Downloaded datasets are expected to be found in either of the following two locations: <MOLS_DIR>/<SET_NAME>/<FILE_NAME>.smi.gz or <MOLS_DIR>/<FILE_NAME>.smi.gz, where the value of <MOLS_DIR> defaults to <APP_HOME>/examples-tmp/download-molecules/ and can be customized with option -m <MOLS_DIR>. With this behavior a dataset downloaded by download-molecules.sh is found without further configuration, while manually downloaded sets can be placed in a flat directory structure.
Sets pubchem-compound and gbb are not used by the example scripts.
The rest-api-XXXX example scripts will not launch the default browser unless option -b is specified.
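
The lookup of downloaded datasets in the two locations noted above can be sketched as a small shell function. This is only an illustration of the documented directory layout: the function name find_dataset is hypothetical, and trying the nested location before the flat one is an assumption, since the order of precedence used by the example scripts is not stated here.

```shell
#!/bin/sh
# find_dataset <MOLS_DIR> <SET_NAME> <FILE_NAME>
# Hypothetical helper: prints the first existing candidate path.
# Assumption: the nested layout is tried before the flat layout.
find_dataset() {
    mols_dir="$1"
    set_name="$2"
    file_name="$3"
    for candidate in \
        "$mols_dir/$set_name/$file_name.smi.gz" \
        "$mols_dir/$file_name.smi.gz"
    do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    printf 'Not found: %s.smi.gz under %s\n' "$file_name" "$mols_dir" >&2
    return 1
}
```

A call such as find_dataset "$APP_HOME/examples-tmp/download-molecules" <SET_NAME> <FILE_NAME> would print the path of the file if it exists in either layout and return a non-zero status otherwise.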

Other scripts

download-molecules.sh: Download and prepare public molecule sets not supplied with this distribution. For usage help launch with option -h. For details of the downloaded sets see document Prepare molecules.
verify-concurrent-generation.sh: Execute comparison workflow detailed in document Verification and benchmarking of concurrent implementations.

Notes on non-workflow-specific content of the scripts

The scripts usually provide the following common functionalities.

Option -h will print command line help and exit.
They place generated files into a working directory. By default this directory is created in examples-tmp/<SCRIPT_NAME> under the distribution root. If option -w <WORKDIR> is specified, the generated files are placed in the specified directory.
They expect the data and bin directories of the distribution beside the directory of the example script. (The example scripts are normally found in directory examples under the distribution directory.) The location of the distribution home (containing these directories) can alternatively be specified using option -a <APPHOME>.
They write detailed verbose information during execution to the output and to a log file. The log file can be specified with option -l <LOGFILE>.
Before launching long-running tasks they usually check whether the output to be written already exists and, if so, skip the invocation.
For some scripts option -m specifies the directory containing input molecule files which are not part of the distribution.
For script rest-api-example.sh option -j can specify a Marvin JS license file which will be exposed to the launched Web UI.
For scripts rest-api-example.sh and rest-api-medium.sh option -e invokes the embedded server with option -earlyStart. For details see document Asynchronous server loading.
Scripts collect and store system details like processor type, processor count and available memory. They are stored as system properties in execution statistics files. See Profiling and execution statistics for details.
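
The "skip if the output already exists" behavior mentioned in the list above can be sketched as follows. This is a minimal illustration of the general pattern, not the code used by the example scripts: the function name run_unless_exists is hypothetical, and treating any existing file as complete output is an assumption.

```shell
#!/bin/sh
# run_unless_exists <output file> <command...>
# Hypothetical helper: runs the command only if the output file is missing.
# Assumption: an existing output file is considered complete.
run_unless_exists() {
    output="$1"
    shift
    if [ -e "$output" ]; then
        printf 'Skipping, output already exists: %s\n' "$output"
        return 0
    fi
    "$@"
}

# Demonstration in a temporary working directory
workdir="$(mktemp -d)"
result="$workdir/result.txt"
run_unless_exists "$result" sh -c 'echo computed > "$1"' _ "$result"
run_unless_exists "$result" sh -c 'echo recomputed > "$1"' _ "$result"
cat "$result"
rm -rf "$workdir"
```

The second invocation is skipped, so the file keeps the content written by the first one. With such a guard, removing an output file (or choosing a fresh working directory with option -w <WORKDIR>) is what forces a step to run again.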