Prepare example molecule sets

Overview

Several large molecule sets are available from public sources, and some of them are used as input in the examples. This document describes how to download and prepare the sets listed below. For the sets included in this distribution, the path of the structure file is shown.

Name Version / access date Processed file Molecule count Download size Processed size
vitamins data/molecules/vitamins/vitamins.smi 30 N/A 3 k
antibiotics data/molecules/antibiotics/antibiotics.smi 146 N/A 16 k
who-essential-medicines data/molecules/who-essential-medicines/who-essential-medicines.smi 342 N/A 26 k
drugbank-all data/molecules/drugbank/drugbank-all.sdf.gz 7127 2.3 M 2.3 M
nci-250k data/molecules/nci/nci-250k.smi.gz 249 k 3.2 M 2.8 M
chembl 21 data/molecules/chembl/chembl-21.smi.gz 1.5 M 507 M 24 M
chebi 2019-03-07 chebi.smi.gz 101 k 67 M 1.3 M
emolecules-plus 2019-08-01 emolecules-plus.smi.gz 22.4 M 242 M 248 M
surechembl 2019-03-07 surechembl.smi.gz 18.8 M 1.4 G 216 M
zinc-all 2014-11-28 zinc-all.smi.gz 16.6 M 142 M 142 M
pubchem-compound 2019-03-07 pubchem-compound.smi.gz 97.1 M 69 G 877 M
pubchem-compound-rnd pubchem-compound-rnd.smi.gz 97.1 M N/A 1.8 G
pubchem-compound-rnd-1k data/molecules/pubchem-compound/pubchem-compound-rnd-1k.smi.gz 1 k N/A 20 k
pubchem-compound-rnd-10k data/molecules/pubchem-compound/pubchem-compound-rnd-10k.smi.gz 10 k N/A 189 k
pubchem-compound-rnd-100k data/molecules/pubchem-compound/pubchem-compound-rnd-100k.smi.gz 100 k N/A 1.9 M
pubchem-compound-rnd-1000k pubchem-compound-rnd-1000k.smi.gz 1 M N/A 19 M
gdb-13 gdb-13.smi.gz 977 M 2.7 G 2.7 G
gdb-12 gdb-12.smi.gz 123 M N/A 334 M

Download script

Script examples/download-molecules.sh can download and prepare the molecule sets described here. Launch the script with option -h to access usage help.

Vitamins

File data/molecules/vitamins/vitamins.smi contains 30 molecules in <SMILES> <NAME> format. It needs no further preparation. A gzipped version (data/molecules/vitamins/vitamins.smi.gz) is also available. The contents of the file are based on the page http://en.wikipedia.org/wiki/Vitamins.

Antibiotics

File data/molecules/antibiotics/antibiotics.smi contains 146 molecules in <SMILES> <NAME> format. It needs no further preparation. A gzipped version (data/molecules/antibiotics/antibiotics.smi.gz) is also available. The contents of the file are based on the page https://en.wikipedia.org/wiki/List_of_antibiotics.

WHO Model List of Essential Medicines

File data/molecules/who-essential-medicines/who-essential-medicines.smi contains 342 structures from the WHO Model List of Essential Medicines (*adult list* of 19th edition, April 2015), based on https://en.wikipedia.org/wiki/WHO_Model_List_of_Essential_Medicines, created using ChemAxon ChemCurator.

Drugbank all

DrugBank "Open Data dataset" is available as a zipped SDF file. For details see http://www.drugbank.ca/. The download page can be found at http://www.drugbank.ca/releases/latest#open-data. Preparation involves repacking the .zip archive into gzipped SDF format. Two gzipped SMILES versions are also created, in which the structure name is taken from field COMMON_NAME and from field DRUGBANK_ID, respectively.

Please note that the SMILES versions currently miss a few structures.

wget http://www.drugbank.ca/releases/5-0-1/downloads/all-open-structures -O drugbank-all.sdf.zip
unzip -p drugbank-all.sdf.zip | gzip > drugbank-all.sdf.gz
gzip -dc drugbank-all.sdf.gz | bin/prepareMolecules.sh -in - -out - -namefromprop COMMON_NAME | gzip -9 > drugbank-common_name.smi.gz
gzip -dc drugbank-all.sdf.gz | bin/prepareMolecules.sh -in - -out - -namefromprop DRUGBANK_ID | gzip -9 > drugbank-drugbank_id.smi.gz
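Since the SMILES versions may miss a few structures, it can help to compare record counts between the SDF file and a converted file. The following is only a minimal sketch; the helper names count_sdf_records and count_smiles_lines are hypothetical and not part of this distribution, and the file names in the usage comment assume the commands above were run.

```shell
# Hypothetical helpers (not part of this distribution). Both read a gzipped
# file from standard input and print a single number.

# An SDF record is terminated by a line starting with "$$$$".
count_sdf_records() {
    gzip -dc | awk '/^\$\$\$\$/ { n++ } END { print n + 0 }'
}

# A SMILES file holds one structure per non-empty line.
count_smiles_lines() {
    gzip -dc | awk 'NF > 0 { n++ } END { print n + 0 }'
}

# Example usage, assuming the files created by the commands above exist:
#   count_sdf_records  < drugbank-all.sdf.gz
#   count_smiles_lines < drugbank-drugbank_id.smi.gz
```

A difference between the two numbers would indicate how many structures were dropped during conversion.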

According to the http://www.drugbank.ca/releases/latest#open-data page:

The DrugBank Open Data datasets are public domain datasets that can be used freely in your application or project (including commercial use). It is released under a Creative Common’s CC0 International License. To the extent possible under law, the person who associated CC0 with the DrugBank Open Data has waived all copyright and related or neighboring rights to the DrugBank Open Data. This work is published from: Canada.

A repackaged (.sdf.gz) version of the downloaded DrugBank Open Data dataset is, in accordance with this license, currently available in directory data/molecules/drugbank/.

NCI-250k

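NCI Release 1; ~250k structures in gzipped SMILES, see details at http://cactus.nci.nih.gov/download/nci/. The commands below swap the two columns of the downloaded file so that the SMILES comes first, prefix the identifier with NCI, and remove the square brackets around single-letter B, C, N, O, P, S and F atoms: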

wget http://cactus.nci.nih.gov/download/nci/NCISMA99.sdz
gzip -dc NCISMA99.sdz | awk '{print $2 " NCI" $1}' | sed "s/\[\([BCNOPSF]\)\]/\1/g" | gzip > nci-250k.smi.gz
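To see what the awk and sed steps do, they can be tried on a single line. This is a sketch only: the function name nci_fix_smiles is hypothetical, and the sample line is made up under the assumption (read off the awk command above) that the input has the identifier in the first column and the SMILES in the second.

```shell
# Same awk/sed steps as above, wrapped in a hypothetical helper function.
nci_fix_smiles() {
    # Swap the columns and prefix the ID, then drop brackets around
    # single-letter B, C, N, O, P, S and F atoms.
    awk '{ print $2 " NCI" $1 }' | sed 's/\[\([BCNOPSF]\)\]/\1/g'
}

# Made-up sample line, not taken from the NCI file.
printf '42 [C][C][O]\n' | nci_fix_smiles
# prints: CCO NCI42
```

Bracketed atoms that carry more than a single letter, such as charged or two-letter atoms, are left unchanged by the sed expression.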

According to the publisher of this dataset the structures are in the public domain. The structures in fixed SMILES format are available in directory data/molecules/nci.

ChEMBL

Structures from ChEMBLdb release 21 are available in a gzipped SDF file. The FTP site of the downloadable content is at ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_21/. For details see https://www.ebi.ac.uk/chembl/ and https://www.ebi.ac.uk/chembl/downloads. Structures are converted to SMILES with the included tool prepareMolecules, preserving ChEMBL IDs. Please note that the SMILES version might miss a few structures.

wget ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_21/chembl_21.sdf.gz -O chembl-21.sdf.gz
gzip -dc chembl-21.sdf.gz | bin/prepareMolecules.sh -in - -out - -namefromprop chembl_id | gzip -9 > chembl-21.smi.gz

According to the ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_21/LICENSE and ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_21/REQUIRED.ATTRIBUTION files, the data is covered by the Creative Commons Attribution-ShareAlike 3.0 Unported license. The downloaded and SMILES-converted structure data is available in directory data/molecules/chembl. The required attributions are listed in the REQUIRED.ATTRIBUTION file.

ChEBI

Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on small chemical compounds. Structures in SDF format can be downloaded from the FTP site ftp://ftp.ebi.ac.uk/pub/databases/chebi/SDF/:

wget ftp://ftp.ebi.ac.uk/pub/databases/chebi/SDF/ChEBI_complete.sdf.gz
gzip -dc ChEBI_complete.sdf.gz | awk '{
    if ($0 == "> <SMILES>") { getline ; SMI = $0; }
    else if ($0 == "> <ChEBI ID>") { getline ; CID = $0; }
    else if ($0 == "$$$$") { print SMI " " CID; SMI = ""; CID = "" }
}' | gzip > chebi.smi.gz
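With the awk script above, a record that has no SMILES field would produce a line with an empty first column. The following variant is a sketch of one way to skip such records; the function name chebi_sdf_to_smi is hypothetical, and whether the ChEBI file actually contains such records is not checked here.

```shell
# Hypothetical variant of the extraction above: reads uncompressed SDF from
# standard input and prints "<SMILES> <ChEBI ID>" lines, skipping records
# for which no SMILES value was seen.
chebi_sdf_to_smi() {
    awk '
        $0 == "> <SMILES>"   { getline; smi = $0; next }
        $0 == "> <ChEBI ID>" { getline; cid = $0; next }
        $0 == "$$$$" {
            if (smi != "") print smi " " cid
            smi = ""; cid = ""
        }'
}

# Example usage, assuming the file downloaded above exists:
#   gzip -dc ChEBI_complete.sdf.gz | chebi_sdf_to_smi | gzip > chebi.smi.gz
```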

eMolecules Plus

The free version of the eMolecules Plus Database can be downloaded from https://www.emolecules.com/info/plus/download-database. The first line of the gzipped file is a header. The commands below remove this line and join the two ID columns (version_id and parent_id) with a - character to form the structure name.

wget http://downloads.emolecules.com/free/2019-08-01/version.smi.gz
gzip -dc version.smi.gz | tail -n +2 | awk '{ print $1 " " $2 "-" $3 }' | gzip -9 > emolecules-plus.smi.gz
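The effect of the header removal and ID joining can be checked on a small sample. This is a sketch with made-up data: the function name emolecules_to_smi is hypothetical, and the header text and ID values are invented for illustration (only the column order is taken from the awk command above).

```shell
# Same tail/awk steps as above, wrapped in a hypothetical helper function.
emolecules_to_smi() {
    # Drop the header line, then print "<SMILES> <version_id>-<parent_id>".
    tail -n +2 | awk '{ print $1 " " $2 "-" $3 }'
}

# Made-up sample input: one header line and one structure line.
printf 'isosmiles version_id parent_id\nCCO 101 100\n' | emolecules_to_smi
# prints: CCO 101-100
```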

SureChEMBL

The SureChEMBL compound data dump can be downloaded in txt and sdf formats from the FTP site ftp://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/data/. Details are available in file ftp://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/data/README. The txt format is a tab-separated file containing ID, SMILES, InChI and InChIKey information. During processing the first line of each file is dropped and the first two fields are used to construct the desired output in the form of <SMILES> <ID> lines. Use the FTP directory to get the list of files to download and process.

rm -f surechembl.smi.gz
wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/data/ | \
    tr "\"" "\n" | \
    grep "^ftp://ftp\.ebi\.ac\.uk.*\.txt\.gz$" | \
    sed -e 's|.*/\(.*\)|\1|' | \
    while read file
    do
        echo "Download/process $file"
        if [ ! -e "$file" ]
        then
            wget "ftp://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/data/$file"
        fi
        gzip -dc "$file" | tail -n +2 | awk '{ print $2 " " $1 }' | gzip -9 >> surechembl.smi.gz
    done
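The loop above appends each separately compressed chunk to surechembl.smi.gz with >>, which is why the output file is removed first. This relies on gzip decompressing a file made of several concatenated gzip members as one continuous stream. A small sketch with made-up IDs illustrates the behavior; the function name concat_demo is hypothetical.

```shell
# Made-up demo: append two separately gzipped chunks to one file and
# decompress the result as a single stream.
concat_demo() {
    tmp=$(mktemp -d) || return 1
    printf 'CCO SCHEMBL1\n' | gzip -9 >  "$tmp/demo.smi.gz"
    printf 'CCN SCHEMBL2\n' | gzip -9 >> "$tmp/demo.smi.gz"
    gzip -dc "$tmp/demo.smi.gz"    # prints both lines, in order
    rm -rf "$tmp"
}

concat_demo
```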

According to the README file at ftp://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/data/README:

The data content in SureChEMBL is licensed under a highly permissive Creative Commons license - specifically the "CC Attribution-ShareAlike 3.0 Unported license", see LICENSE file. The required attribution should contain the url of the SureChEMBL resource (https://www.surechembl.org/) and should be visible on the entry portal for a web resource in which SureChEMBL is integrated, or contained with the documentation for any further distribution.

The required attribution according to the file at ftp://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/data/REQUIRED.ATTRIBUTION:

The data in SureChEMBL is covered by the licence in the file LICENSE.

Under the -BY clause, we request attribution for subsequent use of SureChEMBL data.

For publications using SureChEMBL data, the primary current citation is:

  1. G. Papadatos, M. Davies, N. Dedman, J. Chambers, A. Gaulton, J. Siddle, R. Koks, S. A. Irvine, J. Pettersson, N. Goncharoff, A. Hersey, J. P. Overington (2016). SureChEMBL: a large-scale, chemically annotated patent document database. Nucleic Acids Research Database Issue, 44, D1220-D1228, DOI:10.1093/nar/gkv1253, PMID:26582922.

If SureChEMBL is incorporated into other works, we ask that the SureChEMBL IDs are preserved, and that the release date of SureChEMBL is clearly displayed.

Zinc-all

ZINC All Purchasable subset in gzipped SMILES (Reference pH 7 set), see download link at http://zinc.docking.org/subsets/all-purchasable (Downloads tab).

wget http://zinc.docking.org/db/bysubset/6/6_p0.smi.gz -O zinc-all.smi.gz

PubChem Compound

PubChem Compound (homepage: http://www.ncbi.nlm.nih.gov/pccompound) can be downloaded in gzipped SDF format from the PubChem FTP site ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/. Specifications are available at ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/. To get a gzipped SMILES file, the fields PUBCHEM_OPENEYE_ISO_SMILES and PUBCHEM_COMPOUND_CID are collected from the SDF files using an awk script. Note that this set contains close to 100M structures, so the download size is over 60 GB (over 5000 files) and the SMILES extraction can take more than two hours. For the sake of simplicity, the awk script used below is not compliant with the full SDfile data header and data value specification.

wget -nd -np -r ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/*
gzip -dc *.sdf.gz | awk '{
    if ($0 == "> <PUBCHEM_OPENEYE_ISO_SMILES>") { getline ; SMI = $0; }
    else if ($0 == "> <PUBCHEM_COMPOUND_CID>") { getline ; CID = $0; }
    else if ($0 == "$$$$") { print SMI " " CID; SMI = ""; CID = "" }
}' | gzip -9 > pubchem-compound.smi.gz
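The script above matches data header lines by exact string comparison. As a sketch of a slightly more tolerant approach, the variant below matches any line that starts with > and contains the field name in angle brackets, so extra spaces or trailing items on the header line do not prevent a match. The function name pubchem_sdf_to_smi is hypothetical, and the variant is still not fully compliant: it reads only the first line of each data value.

```shell
# Hypothetical, more tolerant variant of the extraction above: reads
# uncompressed SDF from standard input and prints "<SMILES> <CID>" lines.
pubchem_sdf_to_smi() {
    awk '
        /^>.*<PUBCHEM_OPENEYE_ISO_SMILES>/ { getline; smi = $0; next }
        /^>.*<PUBCHEM_COMPOUND_CID>/       { getline; cid = $0; next }
        $0 == "$$$$" { print smi " " cid; smi = ""; cid = "" }'
}

# Example usage, assuming the SDF files downloaded above exist:
#   gzip -dc *.sdf.gz | pubchem_sdf_to_smi | gzip -9 > pubchem-compound.smi.gz
```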

PubChem Compound random ordering

A randomized ordering of the PubChem Compound structures can be created by shuffling the extracted SMILES file. Note that the execution time of the following script can be more than an hour.

gzip -dc pubchem-compound.smi.gz | \
    awk 'BEGIN { srand(0) ; }{ printf "%f%f %s\n", rand(), rand(), $0 }' | \
    sort | \
    sed -r 's/^[01]\.[0-9]+[01]\.[0-9]+ //' | \
    gzip -9 > pubchem-compound-rnd.smi.gz

Because a fixed seed (srand(0)) is used for random number generation (rand()), the shuffling script above is expected to produce the same ordering for the same input across multiple runs. Prefixes of the resulting file are random subsets of the input set.
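The reproducibility claim can be checked on a small input by running the shuffle twice and comparing the results. This is a sketch under a few assumptions: the function name shuffle_lines is hypothetical, the sample lines are made up, LC_ALL=C is added to make the sort order independent of the locale, and the sed expression is written in basic regular expression syntax instead of using the -r option. The same awk implementation is assumed for both runs.

```shell
# Hypothetical helper: the shuffle from above, reading from standard input.
shuffle_lines() {
    awk 'BEGIN { srand(0) } { printf "%f%f %s\n", rand(), rand(), $0 }' |
        LC_ALL=C sort |
        sed 's/^[01]\.[0-9]*[01]\.[0-9]* //'
}

# Made-up sample input.
input=$(printf '%s\n' 'C m1' 'CC m2' 'CCC m3' 'CCO m4' 'CCN m5')

first=$(printf '%s\n' "$input" | shuffle_lines)
second=$(printf '%s\n' "$input" | shuffle_lines)

if [ "$first" = "$second" ]; then
    echo "same ordering in both runs"
else
    echo "orderings differ"
fi
```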

PubChem Compound random subsets

Derive random subsets by taking prefixes of the randomly ordered version described above. These random subsets can be used for benchmarking and verification.

gzip -dc pubchem-compound-rnd.smi.gz | head -1000 | gzip -9 > pubchem-compound-rnd-1k.smi.gz
gzip -dc pubchem-compound-rnd.smi.gz | head -10000 | gzip -9 > pubchem-compound-rnd-10k.smi.gz
gzip -dc pubchem-compound-rnd.smi.gz | head -100000 | gzip -9 > pubchem-compound-rnd-100k.smi.gz
gzip -dc pubchem-compound-rnd.smi.gz | head -1000000 | gzip -9 > pubchem-compound-rnd-1000k.smi.gz

Note that the above execution fails when set -o pipefail is set because head closes its input prematurely. See https://stackoverflow.com/questions/41516177/bash-zcat-head-causes-pipefail.
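If the commands have to run with set -o pipefail, one possible workaround is to replace head with a filter that keeps reading its input until the end. The sketch below uses a hypothetical helper named take_prefix; it is not part of this distribution.

```shell
# Hypothetical helper: print the first N lines of standard input, then keep
# reading (and discarding) the rest so the upstream command never writes
# to a closed pipe.
take_prefix() {
    awk -v n="$1" 'NR <= n'
}

# Example usage, assuming pubchem-compound-rnd.smi.gz exists:
#   set -o pipefail
#   gzip -dc pubchem-compound-rnd.smi.gz | take_prefix 1000 | gzip -9 > pubchem-compound-rnd-1k.smi.gz
```

The trade-off is that the whole input is decompressed for every subset, so this is slower than head for small prefixes of a large file.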

According to ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/README-Compound-SDF "Fair Use Disclaimer":

Databases of molecular data on the NCBI FTP site include such examples as nucleotide sequences (GenBank), protein sequences, macromolecular structures, molecular variation, gene expression, and mapping data. They are designed to provide and encourage access within the scientific community to sources of current and comprehensive information. Therefore, NCBI itself places no restrictions on the use or distribution of the data contained therein. However, some submitters of the original data may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted. NCBI is not in a position to assess the validity of such claims and, therefore, cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in the molecular databases.

Random subsets 1k, 10k and 100k are available in directory data/molecules/pubchem-compound.

PubChem Compound random subsets SDF source

SDF source (containing SDF properties) of the 1k random subset is retrieved from PubChem Power User Gateway (see https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html).

gzip -dc pubchem-compound-rnd-1k.smi.gz | \
    awk '{ print $2 }' | \
    while read cid
    do
        curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/${cid}/record/SDF" >> pubchem-compound-rnd-1k.sdf
        # According to PubChem Power User Gateway usage policy
        # (at https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest ) the service is not designed for very large
        # volumes (millions) of requests. Users are requested to make no more than 5 requests per
        # second. The 1s sleep ensures that at most 1 request is made per second.
        sleep 1
    done
gzip pubchem-compound-rnd-1k.sdf

The SDF source of random subset 1k is available in directory data/molecules/pubchem-compound.

GDB-13 and GDB-12

According to http://gdb.unibe.ch/downloads/:

GDB-13 enumerates small organic molecules up to 13 atoms of C, N, O, S and Cl following simple chemical stability and synthetic feasibility rules. With 977 468 314 structures, GDB-13 is the largest publicly available small organic molecule database to date.

970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Blum L. C.; Reymond J.-L. J. Am. Chem. Soc., 2009, 131, 8732-8733.

After accepting Terms and Conditions download "Entire GDB-13 (including all C/N/O/Cl/S molecules)". The archive file contains several individual files for various atom counts. By excluding 13.smi we get enumeration up to 12 atoms. Note that converting to the flat smi.gz file takes several minutes.

wget http://gdbtools.unibe.ch:8080/cdn/gdb13.tgz
tar xvzfO gdb13.tgz --exclude 13.smi | gzip > gdb-12.smi.gz
tar xvzfO gdb13.tgz | gzip > gdb-13.smi.gz