Scaffolds in GFP Chromophore Formation Inhibitors

Assignment 7

Introduction: A molecular scaffold generally consists of the ring structures and any linkers between rings in a particular molecule. Scaffolds are of interest in drug and agrochemical discovery because they can often be associated with specific biological activities (Hu, 2011). For this study we extract scaffolds from the compounds of a particular bioassay and determine if particular scaffolds can be associated with biological activity in the compounds.

Methods: Bioassay 434968 (Fluorescence Cell-Free Homogeneous Counter Screen to Identify Inhibitors of GFP Chromophore Formation) was downloaded from the PubChem website with the active and inactive compounds in separate files. The application Strip-It from Silicos IT was used to generate scaffolds from the compounds. Finally, a list was generated of the scaffolds that occurred in multiple active compounds in the assay but in no inactive compounds.

Results: A "highly active" scaffold was considered one that occurred in at least ten active compounds without occurring in a single inactive compound. Eight such scaffolds were identified (Table 1).
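The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the script used for the assignment: the function name and the toy SMILES strings are made up, and it assumes each compound has already been reduced to a single scaffold string (for example, from Strip-It output).

```python
from collections import Counter

def highly_active_scaffolds(active_scaffolds, inactive_scaffolds, min_actives=10):
    """Return {scaffold: count} for scaffolds found in at least `min_actives`
    active compounds and in no inactive compound.

    Each argument is a list holding one scaffold SMILES per compound
    (an assumption made for this sketch).
    """
    active_counts = Counter(active_scaffolds)
    seen_in_inactives = set(inactive_scaffolds)
    return {
        scaffold: count
        for scaffold, count in active_counts.items()
        if count >= min_actives and scaffold not in seen_in_inactives
    }

# Toy data, not taken from the assay.
toy_actives = ["c1ccccc1"] * 12 + ["C1CCCCC1"] * 15 + ["c1ccncc1"] * 3
toy_inactives = ["C1CCCCC1"] * 2 + ["c1ccoc1"] * 4

selected = highly_active_scaffolds(toy_actives, toy_inactives)
print(selected)
```

In the toy data only the first scaffold qualifies: the second also appears among the inactives, and the third occurs in too few active compounds.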

Scaffold SMILES string
Table 1: SMILES strings for highly active scaffolds

Figure 1: Images of the four most common scaffolds

Discussion: The bioassay consisted of 1,764 active compounds and 349 inactive compounds. It is thus unsurprising that there were many scaffolds that occurred only in active compounds. However, the scaffold C1CCC(CC1)CCC1CCCC2C1CCCC2 occurred in 36 compounds without occurring in a single inactive compound. If we picked 36 compounds at random we could expect to find 30 active compounds and six inactive compounds, and the chance that all 36 would be active is well under 1%. It therefore seems likely that this scaffold, at least, occurs in a statistically significant number of active compounds.
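The expected counts for a random draw of 36 compounds can be checked with a short calculation. The sketch below also estimates the probability that all 36 randomly drawn compounds would be active, treating the draw as sampling without replacement; that model is an assumption of this sketch and is not necessarily the test performed by the supplemental script.

```python
from math import comb

n_active = 1764
n_inactive = 349
n_total = n_active + n_inactive
sample_size = 36

# Expected composition of a random sample of 36 compounds.
expected_active = sample_size * n_active / n_total
expected_inactive = sample_size * n_inactive / n_total

# Probability that every one of the 36 sampled compounds is active
# (hypergeometric: sampling without replacement).
p_all_active = comb(n_active, sample_size) / comb(n_total, sample_size)

print(f"expected active:   {expected_active:.1f}")
print(f"expected inactive: {expected_inactive:.1f}")
print(f"P(all 36 active):  {p_all_active:.4f}")
```

This treats the scaffold as if it had been chosen in advance. Because many scaffolds were examined, a stricter analysis would also correct for multiple comparisons.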

Supplemental files: A Python script was created to perform the test.


Strip-It software package v1.0.2 retrieved from

Hu, 2011. Systematic Identification of Scaffolds Representing Different Types of Structure-Activity Relationships. Retrieved from

Molecular images produced with Jmol.

Analyzing a pharmacophore based on a PknB assay from Mycobacterium tuberculosis

Assignment 6

Introduction: A pharmacophore is a set of molecular features that describes how a potential ligand might bind to a macromolecule. The application LigandScout allows the automatic generation of a pharmacophore from a set of compounds. We generate a pharmacophore from a tuberculosis bioassay and perform a virtual screen using it.

Methods: BioAssay 624753 (Binding constant for PKNB(M.tuberculosis) kinase domain) was downloaded from the PubChem website and split into separate active and inactive compound sets. The active compounds were loaded into LigandScout and clustered. The clustering process revealed a group that contained fourteen active molecules; thirteen of these were used to create a pharmacophore (Figure 1). The fourteenth was reserved for use in a test set.

The pharmacophore that was created contained five features clustered into two general areas. One area contained two adjacent hydrogen bond acceptors with a nearby, directionally-constrained hydrogen bond acceptor, while the other area contained a hydrogen bond acceptor adjacent to a donor.

Figure 1. A generated pharmacophore

To establish the pharmacophore's validity, it was tested against a set of 1035 compounds with known activity levels. The compounds were provided in a single file in which the first 35 compounds were active, so the file was split using a text editor into active and inactive compounds, and the two files were converted to LDF format. These files were loaded into LigandScout and a screen was performed.

Results: Of the 1035 compounds, the pharmacophore classified 363 (35.07%) as active and the remainder as inactive. Of the 363, twenty were actually active. The screen therefore had a sensitivity of 0.57 (20 of the 35 actives found) and a specificity of 0.66 (657 of the 1000 inactives correctly rejected). This compares unfavorably to similar pharmacophores generated by Seal et al. (2013), which had a sensitivity as high as 0.74 and a specificity as high as 0.86.
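The sensitivity and specificity follow directly from the four counts quoted in the text. A minimal sketch of the arithmetic (the function and argument names are illustrative only):

```python
def screen_metrics(total, actives, hits, true_hits):
    """Sensitivity and specificity of a virtual screen.

    total     -- compounds screened
    actives   -- compounds known to be active
    hits      -- compounds the pharmacophore classified as active
    true_hits -- hits that are actually active
    """
    inactives = total - actives
    false_positives = hits - true_hits
    true_negatives = inactives - false_positives
    sensitivity = true_hits / actives
    specificity = true_negatives / inactives
    return sensitivity, specificity

# Counts from the screen described above.
sens, spec = screen_metrics(total=1035, actives=35, hits=363, true_hits=20)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
# prints: sensitivity = 0.57, specificity = 0.66
```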

Figure 2: ROC curve of the generated pharmacophore

%Yield Of Actives
Goodness of Hit(GH SCORE)
Table 1: Statistics of the generated pharmacophore
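For reference, the two statistics named above can be computed from the same counts. The sketch below uses the commonly cited Güner-Henry form of the goodness-of-hit score, GH = [Ha(3A + Ht) / (4 Ht A)] x [1 - (Ht - Ha) / (D - A)], where D is the database size, A the number of actives, Ht the number of hits, and Ha the number of active hits. It is an assumption that LigandScout reports the score in this form, and the values are recomputed from the counts in the text rather than copied from the table.

```python
def yield_of_actives(true_hits, hits):
    """Percentage of the hit list that is actually active (Ha / Ht * 100)."""
    return 100.0 * true_hits / hits

def gh_score(true_hits, hits, actives, total):
    """Guner-Henry goodness-of-hit score (assumed formula, see lead-in)."""
    recall_precision_term = true_hits * (3 * actives + hits) / (4 * hits * actives)
    false_positive_penalty = 1 - (hits - true_hits) / (total - actives)
    return recall_precision_term * false_positive_penalty

# Counts from the screen: D = 1035, A = 35, Ht = 363, Ha = 20.
pct_yield = yield_of_actives(true_hits=20, hits=363)
gh = gh_score(true_hits=20, hits=363, actives=35, total=1035)
print(f"% yield of actives = {pct_yield:.2f}")
print(f"GH score           = {gh:.3f}")
```

With these counts the yield comes to about 5.5% and the GH score to about 0.12. A GH score of 1 would indicate an ideal hit list, so the low value fits the modest sensitivity and specificity reported above.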

The enrichment factor measures how much better the pharmacophore is at finding active compounds than random selection would be. It is computed by taking a given percentage of the compounds ranked most likely to be active, determining the actual hit rate within that group, and dividing by the fraction of active compounds in the whole dataset.
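As a rough illustration of that definition, the sketch below computes the enrichment factor from a list of activity labels already sorted from most to least likely to be active. The function name and the toy ranking are invented for the example and are not data from the screen.

```python
def enrichment_factor(ranked_labels, fraction):
    """Enrichment factor for the top `fraction` of a ranked compound list.

    ranked_labels -- 1 for active, 0 for inactive, best-scored compound first
    fraction      -- portion of the list to examine, e.g. 0.05 for the top 5%
    """
    n_total = len(ranked_labels)
    n_top = max(1, int(n_total * fraction))
    hit_rate_top = sum(ranked_labels[:n_top]) / n_top
    base_rate = sum(ranked_labels) / n_total
    return hit_rate_top / base_rate

# Toy ranking: 10 compounds, 3 of them active.
toy_ranking = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

ef_half = enrichment_factor(toy_ranking, 0.5)
ef_all = enrichment_factor(toy_ranking, 1.0)
print(ef_half, ef_all)
```

In the toy ranking, the top half contains all three actives (hit rate 0.6 against a base rate of 0.3), giving an enrichment factor of 2. Taking the whole list always gives 1.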

Table 2: Enrichment factor at various hit percentages

An attempt was made to run the pharmacophore against the Asinex platinum collection dataset, but LigandScout was unable to complete the screen on a four-core CPU with 8 gigabytes of memory. Presumably a machine with more memory or processing power would be needed.

An analysis of the predictive capabilities of various types of chemical fingerprinting based on inhibitors of Plasmodium falciparum

Assignment 5

Introduction: Predicting the activity of chemical compounds may be accomplished by comparing various features of the compound to identical features in compounds already known to be active or inactive. The effectiveness of this approach is influenced by the set and number of features chosen for comparison. This study compares the effectiveness of three different feature sets, or fingerprints, created from a set of compounds in a single bioassay.

Methods: BioAssay 504318 (Inhibitors of Plasmodium falciparum) was downloaded from the PubChem website. The assay included 92 active and 1395 inactive compounds. The OpenBabel application was used to remove salts from the compounds and to remove duplicates. The result of this filtration was two SDF files, one containing active compounds and one containing inactive compounds.

Next, the two files were converted into a standard space-separated text file containing three items for each compound: a SMILES string, the compound's CID, and whether the compound was active or inactive. At this stage the number of hetero-atoms in each compound was also calculated, and compounds with fewer than ten or more than 60 hetero-atoms were removed. This work was done using a Python script and the RDKit package.

The resulting file, consisting of 219 compounds, was processed by an R script. For each compound, three fingerprints were generated using the rcdk package: a MACCS fingerprint, an Extended fingerprint, and a PubChem fingerprint. The data for each fingerprint type was then split into 80% training data and 20% test data, and several random forest runs were made using samples of the training data. The final results were given by taking the average of the run results.
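The original analysis was done in R. The Python sketch below only illustrates the bookkeeping of the 80/20 split and the averaging over repeated runs. The function names, the seed, the number of runs, and the sample size are all hypothetical, and the classifier is replaced by a caller-supplied `evaluate` function, since the random forest settings are not given in the text.

```python
import random
from statistics import mean

def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def average_over_runs(train, evaluate, n_runs=5, sample_size=100, seed=0):
    """Draw `n_runs` random samples of the training data, score each with
    `evaluate` (a stand-in for training and testing a model), and average."""
    rng = random.Random(seed)
    scores = [evaluate(rng.sample(train, sample_size)) for _ in range(n_runs)]
    return mean(scores)

# 219 placeholder compound indices, matching the size of the filtered dataset.
compounds = list(range(219))
train, test = split_train_test(compounds)
print(len(train), len(test))  # prints: 175 44
```

A real run would pass an `evaluate` function that fits a random forest on the sample and returns its accuracy on the held-out 20%.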

Results: The three fingerprints ranged in overall activity prediction accuracy from 76% to 81% (Table 1), with MACCS performing at the low end and PubChem at the high. In correctly detecting inactive compounds, the ranges were similar (76%-80%) and the fingerprints ranked in the same order. In correctly detecting active compounds, the range was 75% to 100%, with the Extended fingerprint having the best result and the MACCS fingerprint faring the worst.

F1 Score
Table 1: Sensitivity and specificity of various fingerprint methods

Discussion: The PubChem fingerprint showed slightly greater accuracy than the other two fingerprint types, and also slightly greater specificity. The Extended fingerprint identified all active compounds correctly, although it misclassified about as many inactive compounds as active as either of the other two did. The MACCS fingerprint, which has only around 20% of the feature count of the others, did not fare as well; however, it was not as far off as might be expected given its much smaller feature count. It seems likely that the MACCS fingerprint would be faster to calculate and almost as accurate in the analysis of a large dataset.

Figure 1: Comparison of true and false positive rates

Supplemental Files:
Python script to filter compounds by hetero-atom count and combine active and inactive compounds

R script for random forest testing and graph generation

Structural Feature Heatmaps in PubChem Bioassays

Assignment 2

Overview: An assay of compounds that sensitize Mycobacterium tuberculosis to certain antibiotics was made available on PubChem. A search for structural similarities in the compounds was performed using the R statistical environment and the packages rpubchem and rcdk, which give R access to the PubChem REST interface and the Chemistry Development Kit, respectively.

Description: The compounds were compared using the following strategy. The assay was downloaded from the PubChem website, and the SMILES strings representing each compound were retrieved into a dataset. The compounds were divided into Active and Inactive sets, and the fingerprint package was then run against the SMILES strings to assign each compound a set of structural features. Within each set, the compounds' features were compared pairwise for similarity and the results were displayed as a heatmap. Script source code is available on GitHub.
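The pairwise comparison behind such a heatmap can be sketched as follows. The original work was done in R, and the text does not say which similarity measure was used. This Python sketch assumes the Tanimoto coefficient, a common choice for binary fingerprints, and represents each fingerprint as a set of "on" bit positions. The toy fingerprints are invented.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

def similarity_matrix(fingerprints):
    """Square matrix of pairwise similarities; this is the data a heatmap plots."""
    return [[tanimoto(a, b) for b in fingerprints] for a in fingerprints]

# Toy fingerprints, not derived from the assay compounds.
toy_fingerprints = [{1, 2, 3}, {2, 3, 4}, {7, 8}]

matrix = similarity_matrix(toy_fingerprints)
for row in matrix:
    print(row)
```

Each cell ranges from 0 (no shared features) to 1 (identical fingerprints). The diagonal is always 1, so any structure in a heatmap comes from the off-diagonal cells.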

Results: Similarities were extremely superficial among the active compounds, with only a few areas of even minimal similarity (Figure 1). This probably indicates that there were not enough active compounds to find matches: just 30 compounds in the assay were active. The inactive set, with 127 compounds, fared slightly better and demonstrated a few areas of similarity (Figure 2).

Figure 1: Active Compound Heatmap

Figure 2: Inactive Compound Heatmap