Structural diversity of biologically interesting datasets: a scaffold analysis approach
Journal of Cheminformatics 2011, 3:30 doi: 10.1186/1758-2946-3-30
Varun Khanna and Shoba Ranganathan

Introduction and methods
This paper examines the role of metabolites and natural products (NPs) in drug design and in the design of lead compound libraries. Its central premise is that every natural product and metabolite is recognized by at least one protein in the biosphere. Because metabolites and NPs have been optimized by nature to bind biological targets, lead libraries designed from their scaffolds and fragments are likely to yield molecules with better ADMET properties.
The study considered several datasets: drugs (taken from DrugBank and KEGG DRUG), metabolites (HMDB, HumanCyc, BiGG), toxic compounds (DSSTox, FDA Carcinogenicity, ITER, Super Toxicity), natural products (ZINC NP database), leads (BIONET, Maybridge), NCI and ChEMBL. Duplicate entries, organic ions, metal ions, and corrupted or missing structures were removed. After filtering, the data were clustered with the “Clara” program in Pipeline Pilot using ECFP_4 or FCFP_4 fingerprints. The physicochemical analysis of the clustered sets covered the Lipinski properties (molecular weight, number of hydrogen bond acceptors, number of hydrogen bond donors and AlogP, a hydrophobicity measure) together with other descriptors: molecular polar surface area, molecular solubility, number of rings and number of rotatable bonds. A scaffold analysis was also performed and its results analysed.

Similarity Analysis

The paper takes a fragment-based approach in which compounds are broken down into low-molecular-weight, drug-like fragments such as ring systems, functional groups, side chains, linkers and fingerprints.
The diversity analysis found that the ChEMBL dataset generated more fragments than any other dataset and therefore appears to be the most diverse. The metabolites produced the fewest fragments, which suggests that metabolite compounds are less diverse and occupy a limited region of chemical space. The other datasets were moderately diverse. A Tanimoto analysis was also performed on the datasets, using the fragment-count formulation given in the equation below.


[Equation image: Tanimoto_relation.png]



Here x_iA and x_iB are the numbers of times the i-th fragment occurs in A and B, respectively, and the sums run over the n elements of each fingerprint.
FCFP fingerprints were generated and the Tanimoto measure was calculated between the various datasets. The drugs and the toxic substances showed a similarity of 0.91, while the drugs were least similar to the metabolites. The fragments found in the metabolites were least similar to those of the natural products.
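
To make the count-based comparison concrete, here is a minimal Python sketch. It assumes the equation is the usual generalised Tanimoto coefficient for count vectors, that is, the sum of x_iA * x_iB divided by (sum of x_iA^2 + sum of x_iB^2 - sum of x_iA * x_iB). This summary does not reproduce the formula, so that form is an assumption. The function name and the fragment counts are hypothetical and are not taken from the paper.

```python
from collections import Counter


def tanimoto_counts(a, b):
    """Generalised Tanimoto coefficient for two fragment-count mappings.

    Assumed form: sum(xA*xB) / (sum(xA^2) + sum(xB^2) - sum(xA*xB)).
    """
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    norm_a = sum(v * v for v in a.values())
    norm_b = sum(v * v for v in b.values())
    denom = norm_a + norm_b - dot
    # Two empty fingerprints have no fragments in common; report 0.0.
    return dot / denom if denom else 0.0


# Hypothetical fragment counts for two datasets (illustration only).
dataset_a = Counter({"benzene": 3, "pyridine": 1, "carboxyl": 2})
dataset_b = Counter({"benzene": 2, "carboxyl": 2, "imidazole": 1})

similarity = tanimoto_counts(dataset_a, dataset_b)
print(f"Tanimoto similarity: {similarity:.3f}")
```

With this form, identical count vectors score 1 and vectors with no shared fragments score 0, which is the scale on which a value such as 0.91 would be read.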

Physicochemical properties Analysis

Lipinski's rule of five (Ro5) predicts the oral bioavailability of a drug. In the clustered sets, 25% of the drugs do not comply with the Ro5, and 68% of the metabolites lie outside the rules. After removal of the lipids, however, the proportion of non-compliant metabolites fell to 20%. Around 26% of the toxic compounds fail the Ro5, compared with only 16% of the natural products and 19% of the lead molecules. The study also found that metabolites showed higher solubility, higher molecular surface area and lower molecular complexity than drugs.
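
The Ro5 percentages above come from counting rule violations per compound. The sketch below uses the standard Lipinski thresholds (molecular weight at most 500, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors, logP at most 5). How many violations the paper tolerates before a compound counts as failing is not stated in this summary, so `max_violations` is an assumed parameter. The descriptor values are made up for illustration.

```python
def ro5_violations(mol_weight, h_donors, h_acceptors, logp):
    """Count how many of the four Lipinski criteria a compound violates."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        h_donors > 5,       # more than 5 hydrogen bond donors
        h_acceptors > 10,   # more than 10 hydrogen bond acceptors
        logp > 5,           # logP above 5
    ]
    return sum(checks)


def fraction_failing(compounds, max_violations=0):
    """Fraction of compounds with more than `max_violations` violations.

    `max_violations` is an assumption; the summary does not state the cutoff.
    """
    if not compounds:
        raise ValueError("need at least one compound")
    failing = sum(
        1
        for c in compounds
        if ro5_violations(c["mw"], c["hbd"], c["hba"], c["logp"]) > max_violations
    )
    return failing / len(compounds)


# Hypothetical descriptor values (not from the paper).
compounds = [
    {"name": "cpd_1", "mw": 180.2, "hbd": 1, "hba": 4, "logp": 1.2},
    {"name": "cpd_2", "mw": 650.0, "hbd": 2, "hba": 12, "logp": 3.0},
    {"name": "cpd_3", "mw": 420.5, "hbd": 6, "hba": 8, "logp": 5.6},
    {"name": "cpd_4", "mw": 510.0, "hbd": 3, "hba": 9, "logp": 4.0},
]

strict = fraction_failing(compounds)                       # any violation fails
lenient = fraction_failing(compounds, max_violations=1)    # one violation allowed
print(f"strict: {strict:.0%}, lenient: {lenient:.0%}")
```

The gap between the strict and lenient figures shows why the cutoff matters when comparing failure rates across datasets.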

Scaffold Analysis

The scaffold analysis shows that the drugs have the highest scaffold percentage (50%), followed by the toxic compounds (42%), with the metabolites lowest (14%). Higher values indicate greater diversity of the compounds in chemical space. Singletons accounted for more than 70% in the ChEMBL and NCI datasets. In the natural products, metabolites and leads, recurring scaffolds accounted for 64%, 39% and 34%, respectively, meaning that the compounds are concentrated in certain regions of chemical space.
A search for aromatic rings indicated that 85% of the drugs have an aromatic ring as scaffold, while the figure for the lead compounds was 97.4%. Among the top five scaffolds analysed, benzene is the most abundant in all of the datasets, followed by pyridine, steroids, purines and imidazoles.
Of the 296 non-redundant scaffolds found in the metabolites, 42% are shared by the drugs and 23% by the leads, which indicates that structures have been optimized to become more metabolite-like. A large part of the metabolite scaffolds is also present in the natural products (around 47%), NCI (78%) and ChEMBL (73%).
Taken together, these analyses suggest that natural compounds and metabolites are important molecules in drug discovery, since most biological targets bind one of these compounds. The scaffolds of NPs and metabolites are therefore valuable for designing lead libraries.
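
The scaffold statistics in this paragraph can be illustrated with a short Python sketch. It assumes each molecule has already been reduced to a scaffold identifier, for example a canonical SMILES string of its ring framework; the extraction step itself is not shown. The exact definitions behind the paper's percentages are not given in this summary, so the ratios computed here (scaffolds per molecule, singleton and recurring fractions of the distinct scaffolds) are one plausible reading. The scaffold labels are hypothetical.

```python
from collections import Counter


def scaffold_stats(scaffolds, top_n=2):
    """Summarise scaffold diversity for one dataset.

    `scaffolds` holds one scaffold identifier per molecule.
    """
    if not scaffolds:
        raise ValueError("need at least one molecule")
    counts = Counter(scaffolds)
    n_molecules = len(scaffolds)
    n_scaffolds = len(counts)
    # A singleton scaffold occurs in exactly one molecule.
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "scaffolds_per_molecule": n_scaffolds / n_molecules,
        "singleton_fraction": singletons / n_scaffolds,
        "recurring_fraction": (n_scaffolds - singletons) / n_scaffolds,
        "top": counts.most_common(top_n),
    }


# Hypothetical scaffold labels, one per molecule (illustration only).
scaffolds = [
    "benzene", "benzene", "pyridine", "benzene",
    "purine", "pyridine", "imidazole", "steroid",
]

stats = scaffold_stats(scaffolds)
print(stats)
```

A high scaffolds-per-molecule ratio or singleton fraction points to a diverse set, while a high recurring fraction points to compounds clustered around a few frameworks.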
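
The shared-scaffold percentages can be read as set overlaps measured against the metabolite scaffolds. The sketch below shows that calculation, assuming scaffolds are compared by exact identifier match; the matching procedure used in the paper is not described in this summary. The function name and scaffold sets are hypothetical.

```python
def shared_fraction(reference, other):
    """Fraction of the reference scaffolds that also occur in `other`."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference set must not be empty")
    return len(reference & set(other)) / len(reference)


# Hypothetical non-redundant scaffold sets (illustration only).
metabolite_scaffolds = {"benzene", "purine", "steroid", "imidazole"}
drug_scaffolds = {"benzene", "pyridine", "purine", "indole"}
lead_scaffolds = {"benzene", "pyridine"}

shared_with_drugs = shared_fraction(metabolite_scaffolds, drug_scaffolds)
shared_with_leads = shared_fraction(metabolite_scaffolds, lead_scaffolds)
print(f"drugs: {shared_with_drugs:.0%}, leads: {shared_with_leads:.0%}")
```

The measure is asymmetric: it is normalised by the size of the reference set, so swapping the arguments generally gives a different value.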