Glossary of most common terms

This document collects the interpretations of the possibly ambiguous terms used in the documentation and in the fast similarity search / overlap / descriptors API.

Crossfilter: A JavaScript library providing fast multidimensional filtering for coordinated views. See http://crossfilter.github.io/crossfilter/ for a deeper introduction and examples.

Dimension: A projection of the data handled by a Crossfilter-based visualization. Typically the data can be considered as rows in a table, where each row spans multiple columns. The columns, data derived from them, and their combinations can be considered as dimensions. Data in separate dimensions can be visualized (in a histogram) or filtered. See the Crossfilter API docs for details at https://github.com/crossfilter/crossfilter/wiki/API-Reference#wiki-dimension.

Array valued dimension: In these Crossfilter dimensions the value belonging to each row is an array of primitive values. See the Crossfilter API docs for details at https://github.com/crossfilter/crossfilter/wiki/API-Reference#wiki-dimension_with_arrays.

Missing values: Real-life datasets might contain rows which do not have data for certain dimensions. Since the underlying Crossfilter library expects data to be present for each row, the MadFast WebUI provides a workaround for storage and visualization. See the Crossfilter docs for details at https://github.com/crossfilter/crossfilter/wiki/Crossfilter-Gotchas#natural-ordering-of-dimension-and-group-values.
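
The natural-ordering requirement referenced above can be illustrated with a minimal sketch. It maps missing values to a sentinel so that every row yields a comparable value. This is only one possible approach and is not necessarily how the MadFast WebUI implements its workaround; the names `MISSING` and `safeAccessor` and the sample rows are hypothetical.

```javascript
// Hypothetical sentinel; -Infinity sorts below every finite number.
const MISSING = -Infinity;

// Wrap an accessor so missing values (undefined, null, NaN) map to the sentinel.
function safeAccessor(accessor) {
  return (row) => {
    const v = accessor(row);
    return v === undefined || v === null || Number.isNaN(v) ? MISSING : v;
  };
}

// Sample rows; row "b" has no logp value.
const rows = [{ id: "a", logp: 2.1 }, { id: "b" }, { id: "c", logp: 0.4 }];
const logp = safeAccessor((row) => row.logp);

// Every row now yields a number, so ordering is well defined.
const sorted = rows.map(logp).sort((x, y) => x - y);

// Missing rows stay identifiable (and therefore selectable) through the sentinel.
const missingCount = rows.filter((row) => logp(row) === MISSING).length;
```

A wrapped accessor like this could be handed to any code that sorts or groups by the value, while the sentinel keeps the missing rows addressable as a group of their own.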

Word cloud: A visualization component in MadFast WebUI for discrete values of a Crossfilter dimension. Each unique value is represented by a label which can be clicked to filter the matching rows. Missing values are represented by a special label which is also selectable.

Binary fingerprint: A fingerprint in which one or more bits are associated with the features found in the input molecule.
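
The idea of associating bits with features can be sketched as a toy hashed fingerprint. This is an illustration only, not the algorithm of any fingerprint shipped with the tool: the feature strings, the fingerprint length `FP_LENGTH` and the choice of the FNV-1a hash are all assumptions.

```javascript
// Illustrative only: a toy hashed binary fingerprint, not a real chemical fingerprint.
const FP_LENGTH = 64; // hypothetical fingerprint length in bits

// FNV-1a string hash (32 bit), used here only to pick a bit position.
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Set one bit per feature; different features may collide on the same bit.
function toyFingerprint(features) {
  const bits = new Uint8Array(FP_LENGTH);
  for (const f of features) {
    bits[fnv1a(f) % FP_LENGTH] = 1;
  }
  return bits;
}

// Hypothetical feature labels standing in for substructures of a molecule.
const fp = toyFingerprint(["C-C", "C=O", "c:c"]);
const setBits = fp.reduce((n, b) => n + b, 0);
```

Because several features can hash to the same position, the number of set bits is at most the number of distinct features.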

Fingerprint: A kind of descriptor which is derived from features found in the input molecule.

Descriptor: A molecular descriptor is a piece of information derived from a molecule which can be compared against another descriptor. Such a comparison results in a similarity or dissimilarity value. Note that the term descriptor can refer both to the class of the derived information and to the actual instance of such information calculated from a specific molecule.

Custom descriptor: A descriptor which is calculated by an external tool. The raw representation of a custom descriptor can be imported for use in similarity calculations. See Custom binary descriptors and Custom float descriptors for details.

Metric: A rule which assigns a similarity/dissimilarity score (a scalar value) to a pair of descriptors. Since some metrics are asymmetric, the two descriptors compared are distinguished: target and query sides are defined. The document Metric customization describes the settings of various metrics.

Similarity/dissimilarity score: A scalar value assigned to a pair of descriptors by a suitable metric. The more similar the two descriptors are, the greater the similarity score and the lower the dissimilarity score. Note that the minimum of these scores is not necessarily 0.0, the maximum is not necessarily 1.0, and the ranges of similarity and dissimilarity scores are not necessarily the same. There are cases (for example, comparing vector descriptors with unbounded coordinate values using the Euclidean metric) when the dissimilarity (distance) can be interpreted easily while the similarity cannot.
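
The relationship between metrics, asymmetry and score ranges can be sketched with three well-known measures. Tanimoto and Tversky are picked here only as familiar examples of a symmetric and an asymmetric similarity. Modeling descriptors as sets of "on" bit indices, the function names, and the convention of returning 1 for two empty descriptors are assumptions of this sketch, not the tool's API.

```javascript
// Descriptors are modeled as Sets of "on" bit indices (an assumption for this sketch).
function intersectionSize(a, b) {
  let n = 0;
  for (const x of a) if (b.has(x)) n++;
  return n;
}

// Tanimoto similarity: symmetric, with values in [0, 1].
function tanimoto(a, b) {
  const c = intersectionSize(a, b);
  const union = a.size + b.size - c;
  return union === 0 ? 1 : c / union; // convention chosen here for two empty sets
}

// Tversky similarity: asymmetric when alpha !== beta, so query and target sides matter.
function tversky(query, target, alpha, beta) {
  const c = intersectionSize(query, target);
  const denom = c + alpha * (query.size - c) + beta * (target.size - c);
  return denom === 0 ? 1 : c / denom;
}

// Euclidean distance: a dissimilarity with no upper bound for unbounded coordinates.
function euclidean(u, v) {
  return Math.sqrt(u.reduce((s, x, i) => s + (x - v[i]) ** 2, 0));
}

const query = new Set([1, 2, 3]);
const target = new Set([2, 3, 4, 5]);

const sim = tanimoto(query, target);     // 2 shared bits / 5 bits in the union
const dissim = 1 - sim;                  // one common way to derive a dissimilarity
const qt = tversky(query, target, 1, 0); // 2 / 3
const tq = tversky(target, query, 1, 0); // 2 / 4, differs from qt
const dist = euclidean([0, 0], [3, 4]);  // 5
```

Swapping the query and target sides changes the Tversky score but not the Tanimoto score. The Euclidean distance has a natural zero for identical vectors but no natural maximum, which is why turning it into a similarity requires an extra convention.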

Descriptor generator: An object responsible for calculating a specific descriptor for input molecules, collecting the available dissimilarity calculations and supporting the handling of various descriptor representations. See apidoc.

Descriptor comparator: An object responsible for comparing two descriptors (in the bare or rich form) in a type-safe way by checking guard objects. A DescriptorComparator instance represents a metric. A DescriptorComparator instance can compare descriptors which were generated by the same DescriptorGenerator that created the comparator. See apidoc.

Bare form of descriptors: A descriptor representation providing consistency checks (guard objects) and only the information needed for dissimilarity calculation. For example, the bare form of Ecfp descriptors is the BinaryVectorDescriptor.

Rich form of descriptors: A descriptor representation extending the bare form, possibly providing further functionality not necessarily needed for dissimilarity calculation, such as feature retrieval. Currently rich forms are present in the type system but they do not provide such additional functionality.

Unguarded form of descriptors: Usually the internal representation of the descriptors. Consistency-related contracts cannot be enforced, but efficient storage and manipulation of descriptors is possible. For example, the unguarded form of most binary fingerprints is long[].
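
A packed-word representation similar in spirit to a Java long[] can be sketched in JavaScript with a BigUint64Array. The helper names `packBits` and `popCount` are hypothetical, and the bit layout (bit i stored in word i / 64 at position i % 64) is an assumption of this sketch rather than a documented layout of the tool.

```javascript
// Pack "on" bit indices into 64-bit words, analogous to a Java long[].
function packBits(onBits, bitCount) {
  const words = new BigUint64Array(Math.ceil(bitCount / 64));
  for (const i of onBits) {
    words[i >> 6] |= 1n << BigInt(i & 63); // word index i / 64, bit position i % 64
  }
  return words;
}

// Count the set bits in the packed representation.
function popCount(words) {
  let n = 0;
  for (let w of words) {
    while (w !== 0n) {
      w &= w - 1n; // clear the lowest set bit
      n++;
    }
  }
  return n;
}

const packed = packBits([0, 5, 64, 127], 128);
const onBitCount = popCount(packed);
```

The array carries no record of which generator or settings produced it, so nothing prevents two incompatible fingerprints from being compared. That is the trade-off the unguarded form makes for compact storage and fast bit operations.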

Unguarded extractor: The transformation from a descriptor's bare form to its unguarded form. Note that multiple valid unguarded forms are possible.

Scripting hook: Many of the command line tools allow the injection of executable code to customize their settings. Such hooks are written in JavaScript and are interpreted by the script engine built into Java. Brief documentation of the context of the executed hooks is printed along with the help by option -h. It might be necessary to consult the Java APIdoc and the examples in the documentation. Heavy usage of scripting hooks can be found in the Basic overview of the concepts of overlap analysis context documentation.

Regular expression: A regular expression is a powerful tool for text processing in scripting hooks. Further details of the regular expressions used in JavaScript can be found in the JavaScript String Reference, in the JavaScript RegExp reference and in MDN's JavaScript Guide.
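
A short sketch of typical regular expression usage in JavaScript follows. The input strings and the naming convention they follow are hypothetical, and the code does not use any hook-specific context object. It is deliberately written in older (ES5-style) syntax on the assumption that an embedded script engine may not support newer language features.

```javascript
// Hypothetical input: molecule names of the form "<PREFIX>-<number> <free text>".
var names = ["MOL-25 aspirin", "MOL-113 caffeine", "no id here"];

// Capture the prefix and the numeric part; anchored at the start of the string.
var idPattern = /^([A-Z]+)-(\d+)\b/;

var ids = names.map(function (name) {
  var m = idPattern.exec(name);
  return m ? Number(m[2]) : null; // null when the name does not match
});

// Global flag: replace every run of whitespace, not just the first one.
var normalized = "  CCO   ethanol ".trim().replace(/\s+/g, " ");
```

Here `ids` holds the numeric identifier for each matching name and `null` for the name that does not follow the pattern, and `normalized` is the input with surrounding whitespace removed and inner whitespace runs collapsed to single spaces.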

Standardizer: An object providing standardization by transforming an input molecule. A typical standardization used with similarity searches is aromatization.

Aromatization: Assigning a special (*aromatic*) bond type to bonds in aromatic rings instead of the single and double bonds.

Profiling: Collecting periodic measures during the execution of a running program for optimization of settings or for troubleshooting. Measures include various JVM parameters, like memory usage, garbage collection activity, system load or the progression of the tasks executed. For details see document Profiling and execution statistics.

Execution statistics: A single structured record of various measures of an execution and its environment. It includes command line arguments, system properties, environment variables and execution time/speed measurements. For details see document Profiling and execution statistics.

Serialized file: A binary file, usually in a proprietary format containing data readable by the JVM. The contents of a serialized file are usually read into one or more objects by the running program. Note that the fast similarity search tools use custom serialization and enforce several further contracts.

Master molecule storage: A serialized file containing chemical structures, usually in SMILES format. The stored structures are indexed by the master index, which is a 0-based contiguous integer sequence. The file can be read efficiently without the need to parse each structure at read time. The createMMS tool is used to parse a structure file and create a master molecule storage. For a usage example see Basic similarity search workflow.

Master ID storage: A serialized file containing arbitrary strings, usually associated with structures in a master molecule storage. The createMMS tool is used to parse a structure file and extract molecule names or SDF properties to create a master ID storage. For a usage example see Basic similarity search workflow.
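
The role of the master index shared by the two storages can be sketched with parallel arrays. This is only a conceptual stand-in and does not reflect the real serialized formats; the arrays, their contents and the `lookup` helper are hypothetical.

```javascript
// Hypothetical in-memory stand-ins for a master molecule storage and a master ID storage.
const moleculeStorage = ["CCO", "c1ccccc1", "CC(=O)O"];  // SMILES, position = master index
const idStorage = ["ethanol", "benzene", "acetic acid"]; // IDs at the same master index

// Look up both the structure and its ID by the 0-based master index.
function lookup(masterIndex) {
  if (!Number.isInteger(masterIndex) || masterIndex < 0 || masterIndex >= moleculeStorage.length) {
    throw new RangeError("master index out of range: " + masterIndex);
  }
  return {
    index: masterIndex,
    smiles: moleculeStorage[masterIndex],
    id: idStorage[masterIndex],
  };
}

const hit = lookup(1);
```

Because the index is a gap-free sequence starting at 0, a search result only needs to carry integers; structures and IDs can be resolved afterwards by position.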

Descriptor storage: A serialized file containing molecular descriptors (fingerprints) for molecules. The file also contains a serialized form of the associated OverlapAnalysisContext.

Embedded server: A Web/HTTP server which is used as a component of a software product. This distribution uses Jetty for this purpose. Since the Web server is contained in the software, no separate server installation and configuration is needed to get started. The distribution can also be used as standalone software; it does not need to be deployed into a preexisting server/container. The embedded server in this distribution provides a standalone GUI-like interface and a REST API for remote clients.

GUI: Graphical User Interface.

Web UI: Used to denote the web technologies (HTTP/HTML/DHTML/JavaScript/REST) based user interface provided by the embedded server. The server (in this case the embedded server) and the client (web browser displaying the user interface) can run on the same machine without further installation/deployment, like a traditional GUI based application.

REST: An architectural style for designing web services over HTTP. For details on this architectural style see the Wikipedia article. For details on the REST API provided by this tool see the document REST API example and the REST API documentation.

Asynchronous request: Capability of the embedded REST server (introduced in version 0.3.4) to return responses for requests before the requested calculation/task is finished. A REST API request which launches such a task returns a response which contains details about the task itself. To poll the status and fetch the results, the REST API client must make further REST API calls to the task descriptor API endpoint. For further details see Asynchronous search tasks.
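
The submit-then-poll pattern described above can be sketched with an in-memory stand-in for the server. No real endpoint names, response fields or HTTP calls are implied: `createFakeServer`, `submit`, `status` and `pollUntilDone` are hypothetical, and the simulated task simply finishes after a fixed number of status checks.

```javascript
// In-memory stand-in for a server; no real endpoint names or response fields are implied.
function createFakeServer(stepsNeeded) {
  const tasks = new Map();
  let nextId = 0;
  return {
    // Launching returns immediately with a task descriptor, not with results.
    submit(query) {
      const id = "task-" + nextId++;
      tasks.set(id, { id, query, remaining: stepsNeeded, done: false, result: null });
      return { id, done: false };
    },
    // Each status call advances the simulated background work by one step.
    status(id) {
      const t = tasks.get(id);
      if (!t) throw new Error("unknown task: " + id);
      if (!t.done && --t.remaining <= 0) {
        t.done = true;
        t.result = "results for " + t.query;
      }
      return { id: t.id, done: t.done, result: t.result };
    },
  };
}

// Client side: poll the task descriptor until the task is finished or attempts run out.
function pollUntilDone(server, id, maxPolls) {
  for (let polls = 1; polls <= maxPolls; polls++) {
    const s = server.status(id);
    if (s.done) return { polls, result: s.result };
  }
  throw new Error("task " + id + " not finished after " + maxPolls + " polls");
}

const server = createFakeServer(3);
const task = server.submit("CCO");
const outcome = pollUntilDone(server, task.id, 10);
```

A real client would make HTTP requests and wait between polls instead of looping immediately. The sketch only shows the control flow: the launching call returns a task descriptor, and the results become available through later status calls.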

Task: A possibly time-consuming background computation/processing (such as a similarity search) done by the embedded server.