CREDO: A Structural Interactomics Database For Drug Discovery
Monday, 20 February 2012
Blog has moved
The CREDO blog has moved to http://aschreyer.github.com/credodb .
Tuesday, 1 November 2011
Speeding up fingerprint similarity searches in the OpenEye PostgreSQL extension with GIST indexes
PostgreSQL has a unique feature, namely support for GIST (Generalized Search Tree) indexes, which, unlike other index types, are completely agnostic about the data type and the queries being used. The GIST API requires only seven functions to define a new index for a data type, and implementing those for binary fingerprints was pretty straightforward, at least after spending a whole weekend on it. Prior to this, however, I created a new data type for OpenEye fingerprints in PostgreSQL and added all the other functions (and operators) necessary for screening, i.e. input/output methods, fingerprint-generating functions (Path, Circular, Tree, MACCS166) and all GraphSim TK similarity metrics (Cosine, Dice, Euclidean, Manhattan, Tanimoto, Tversky).
The GIST API expects the condition used in the WHERE clause to return a Boolean value. The similarity metrics, however, return float values, so variables holding the similarity limits had to be added internally. The oefp GIST implementation checks for consistency with the query by comparing the calculated similarity value (key, query) against these variables. They are set in the PostgreSQL OpenEye extension with:
As you can see, Tversky similarity is supported by the GIST index as well.
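The idea of turning a float-valued metric into a Boolean index condition can be sketched in Python. This is only an illustration and not the oefp code: fingerprints are modelled as sets of on-bit positions, TANIMOTO_LIMIT is a hypothetical stand-in for the internal limit variable, and the bound used for inner nodes (whose keys I assume to be the union of all fingerprints below them) is one common choice rather than a description of the actual implementation.

```python
# Hypothetical stand-in for the similarity limit kept inside the extension.
TANIMOTO_LIMIT = 0.7


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def upper_bound(union_key: frozenset, query: frozenset) -> float:
    """Highest Tanimoto any fingerprint below an inner node can reach.

    Every fingerprint x below the node is a subset of union_key, so
    |x & query| <= |union_key & query| and |x | query| >= |query|.
    """
    return len(union_key & query) / len(query) if query else 0.0


def consistent(key: frozenset, query: frozenset, is_leaf: bool) -> bool:
    """Boolean answer the index needs: can this entry still match?"""
    if is_leaf:
        # Leaf keys are real fingerprints, so the exact value is compared.
        return tanimoto(key, query) >= TANIMOTO_LIMIT
    # Inner keys only allow an optimistic bound; prune when even that fails.
    return upper_bound(key, query) >= TANIMOTO_LIMIT
```

Because the inner-node test is only an upper bound, it can let a subtree through that later yields no hits, but it never discards a subtree that contains one.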
Benchmarking the new index with similarity queries
All the following queries were executed on a virtual machine with 2 GB of memory and a default PostgreSQL configuration. The table that was screened contained the circular fingerprints of 985,961 molecules from ChEMBL. Screening was started by restarting the server and using a randomly selected molecule to load the index into memory. The time to build a GIST index is an important benchmark as well; for the mentioned number of fingerprints it took 82203.132 ms (about 82.2 seconds), which is quite fast considering that a virtual machine was used. The following function was used to execute the similarity queries against ChEMBL:
PostgreSQL keeps recently used table and index pages in memory, so it is important not to re-use the same SMILES string over and over again; otherwise the results will not accurately reflect the performance of the GIST index. The following function returns a random SMILES string from ChEMBL:
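The original database function is not reproduced here. As a rough client-side equivalent, the sketch below draws distinct query molecules at random and times one search per molecule. The names time_distinct_queries, search and smiles_pool are hypothetical; search stands for whatever callable runs the similarity query.

```python
import random
import time
from typing import Callable, List, Sequence, Tuple


def time_distinct_queries(
    search: Callable[[str], object],
    smiles_pool: Sequence[str],
    n_queries: int,
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Run `search` once for each of n randomly drawn SMILES strings.

    Returns (smiles, elapsed milliseconds) pairs in the order executed.
    """
    rng = random.Random(seed)
    # sample() draws without replacement, so no pool entry is used twice.
    queries = rng.sample(list(smiles_pool), n_queries)
    timings = []
    for smiles in queries:
        start = time.perf_counter()
        search(smiles)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings.append((smiles, elapsed_ms))
    return timings
```

Fixing the seed keeps a benchmark run repeatable while still avoiding repeated queries within the run.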
Below are the results for five randomly selected molecules with varying size and complexity. I am very happy with the average performance of the GIST index, given that this is the first iteration. The result for the fragment with a very common structure indicates that in this particular case the search had to descend into considerably more subtrees of the index. 2.7 seconds is still a very good result for a search against almost 1M fingerprints.
Tversky similarity
Below is an example of the GIST query support for the Tversky similarity. The Tversky similarity is very useful because it can be used to force a sub- or superstructure behaviour depending on the parameters alpha and beta. In addition, it is the same as the Tanimoto similarity with alpha=beta=1 and the same as the Dice similarity with alpha=beta=0.5. This feature is shown in the statements below:
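The SQL statements themselves are not reproduced here. Independently of the extension, the relationship between the three metrics can be checked with a few lines of Python, modelling fingerprints as sets of on-bit positions. The function names are my own and not part of any toolkit.

```python
def tversky(a: frozenset, b: frozenset, alpha: float, beta: float) -> float:
    """Tversky similarity: c / (c + alpha * |a only| + beta * |b only|)."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity: shared bits over bits set in either."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a: frozenset, b: frozenset) -> float:
    """Dice similarity: twice the shared bits over the summed bit counts."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


query = frozenset({1, 2, 3, 4})
target = frozenset({3, 4, 5})

same_as_tanimoto = tversky(query, target, 1.0, 1.0)  # equals tanimoto()
same_as_dice = tversky(query, target, 0.5, 0.5)      # equals dice()
# alpha=1, beta=0 ignores extra bits in the target: a query whose bits are
# all present in the target scores 1.0 (substructure-like behaviour).
substructure_like = tversky(frozenset({3, 4}), target, 1.0, 0.0)
```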
This was just the first version of the GIST implementation for OpenEye fingerprints and there are certainly more aspects of it that can be optimised, the picksplit and penalty functions in particular. I am quite pleased with the results nevertheless!
Labels:
chembl,
cheminformatics,
openeye,
postgresql extension
Monday, 24 October 2011
KNN-GIST operators for the RDKit PostgreSQL cartridge
The latest version of PostgreSQL (9.1) introduced an extension to GIST called KNN-GIST. The great advantage of KNN-GIST is that the rows to be returned are already sorted in the required order and the N-nearest neighbours of a query are already known (used with LIMIT). I have now written the necessary code to add KNN-GIST ORDER BY operators to the PostgreSQL cartridge (currently only for the bfp data type). The two queries below, together with their plans, show the difference. In the first, the query is executed without KNN-GIST and the query plan shows that the rows returned by the index have to be sorted first in order to get the 10 most similar. With KNN-GIST, however, rows are returned by the index already sorted (as shown in the query plan), leading to a 30-40% performance increase with my initial version.
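Why an ordered index scan avoids the extra sort can be illustrated with a small best-first search in Python. This is a sketch of the general nearest-neighbour technique, not the cartridge code: the "index" is a flat list of leaf pages, each page is summarised by the union of its fingerprints, and the function names are hypothetical.

```python
import heapq
from itertools import count


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def min_distance(union_key, query):
    """Lower bound on (1 - Tanimoto) for every fingerprint in a page."""
    if not query:
        return 0.0
    return 1.0 - len(union_key & query) / len(query)


def nearest(pages, query, k):
    """Return the k nearest (distance, id) pairs in ascending order.

    `pages` is a list of leaf pages, each a list of (id, fingerprint).
    """
    tie = count()  # unique tie-breaker so the heap never compares payloads
    heap = []
    for page in pages:
        key = frozenset().union(*(fp for _, fp in page))
        heapq.heappush(heap, (min_distance(key, query), next(tie), page, None))
    hits = []
    while heap and len(hits) < k:
        dist, _, page, row_id = heapq.heappop(heap)
        if page is not None:
            # Expand the page: queue its rows with their exact distances.
            for rid, fp in page:
                exact = 1.0 - tanimoto(fp, query)
                heapq.heappush(heap, (exact, next(tie), None, rid))
        else:
            # A row popped before any remaining page bound is final.
            hits.append((dist, row_id))
    return hits
```

Rows leave the queue in order of increasing distance, so the search can stop after k rows and pages whose bound is worse than the k-th hit are never opened.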
The changes will be pushed to my RDKit repository later and they should soon appear in the official SVN repository (or a future release).
Labels:
postgresql,
postgresql extension,
rdkit
Wednesday, 14 September 2011
ChEMBLdb integration
One of my goals for CREDO is to implement an (almost) seamless transition between the structural interactions in CREDO and the activity data from ChEMBLdb. Today I finished an extension for credoscript that extends the ORM to include the ChEMBL database schema and a lot of other goodies. The ChEMBL extension is not stand-alone and will work with neither MySQL nor Oracle. There is, however, pychembl, which works with the former. Here are a couple of code examples:
Cheminformatics routines
The cheminformatics routines from the ChemCompAdaptor class are also implemented in the ChEMBL equivalent MoleculeAdaptor, which allows querying ChEMBL from within credoscript, based on sub-, superstructure or similarity with the help of fingerprints. All methods are implemented with the RDKit PostgreSQL cartridge, which means that all molecules and fingerprints are indexed and clustered.
Activity cliffs
An interesting application of the CREDO/ChEMBL integration is the analysis of so-called activity cliffs. These cliffs can easily be identified (and quantified) in ChEMBL with the server-side cheminformatics routines and the assays, in which the cliffs appear, linked to protein-ligand complexes in CREDO. One could then overlay the chemical structures (shape or MCS) to investigate which contacts would be gained or lost. Activity cliffs are calculated on the fly, i.e. the ActivityCliff class is mapped against a query and not a table.
Monday, 22 August 2011
Protein-ligand complexes and buried surface areas
The buried accessible surface area of a protein-ligand complex is known to be linked to thermodynamic parameters. Olsson et al., for example, investigated the correlation between the change in solvation and thermodynamic properties of protein-ligand complexes. One observation was that synthetic ligands, compared to biological ligands, gain affinity mostly through more favourable entropy changes upon binding, i.e. burial of apolar surface area (hydrophobic interactions). Endogenous ligands, on the other hand, normally gain more affinity (which can vary a lot) through polar interactions, whereas synthetic, exogenous ligands are much more limited in this respect because of severe ADME constraints. The relationship between binding and buried surface area is also very relevant in the context of the disruption of protein-protein interactions, where synthetic, comparatively small molecules have to compete with large polypeptide chains that form many polar interactions.
To cut a long story short, having the buried surface area of all protein-ligand complexes in the PDB calculated would be very useful for analysis and comparison. Therefore I decided to add this data as a new feature to CREDO, calculated as surface contributions for all atoms in a protein-ligand complex that are interacting. Hence, it is possible to calculate the buried surface area not only of the whole complex but also of specific atoms, e.g. only atoms that belong to the ligand or the binding site, or only (a)polar atoms. A credoscript example is shown below for PDB entry 3CS9. The code example also contains a new feature in CREDO that I would call meta atom types (the Atom.is_polar attribute) - more about this in a future blog post.
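The credoscript example is not reproduced here. The underlying idea, summing per-atom surface contributions over an arbitrary selection of atoms, can be sketched in plain Python. AtomSurface and buried_area are hypothetical names and not part of credoscript, and the numbers are made up rather than taken from 3CS9.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class AtomSurface:
    """Hypothetical per-atom record; not a credoscript class."""
    name: str
    is_ligand: bool
    is_polar: bool
    buried_area: float  # buried surface contribution in square angstroms


def buried_area(
    atoms: Iterable[AtomSurface],
    keep: Callable[[AtomSurface], bool] = lambda atom: True,
) -> float:
    """Sum the buried surface contributions of the atoms selected by `keep`."""
    return sum((atom.buried_area for atom in atoms if keep(atom)), 0.0)


# Made-up values for illustration only; they are not taken from 3CS9.
atoms = [
    AtomSurface("N1", is_ligand=True, is_polar=True, buried_area=4.0),
    AtomSurface("C7", is_ligand=True, is_polar=False, buried_area=12.5),
    AtomSurface("OG1", is_ligand=False, is_polar=True, buried_area=6.0),
    AtomSurface("CB", is_ligand=False, is_polar=False, buried_area=9.5),
]

total = buried_area(atoms)
ligand_apolar = buried_area(atoms, lambda a: a.is_ligand and not a.is_polar)
polar_only = buried_area(atoms, lambda a: a.is_polar)
```

Storing contributions per atom is what makes the subsets cheap: any selection is just a filter followed by a sum.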
Labels:
buried surface area,
credo,
credoscript
Monday, 15 August 2011
Source code for PostgreSQL Eigen extension now publicly available
The source code for the PostgreSQL Eigen extension is now publicly available on bitbucket. The extension requires Eigen 3.x and PostgreSQL 9.x. It is at a fairly early stage and does not come with documentation but if you are familiar with extending PostgreSQL and the Eigen template library then the code should be pretty self-explanatory and easy to extend. The source code of the pgeigen extension is released under the MIT license.
Labels:
eigen,
postgresql extension
Friday, 12 August 2011
Tuesday, 9 August 2011
Tuesday, 14 June 2011
Poor man's LINGO with the PostgreSQL pg_trgm extension
LINGO is a method used in cheminformatics to compare molecules with the help of their canonical SMILES strings. The algorithm works by fragmenting SMILES strings into overlapping substrings of a defined size. The resulting LINGO profile can then be compared with others to determine chemical similarity or even physicochemical properties.
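The fragmentation step can be sketched in a few lines of Python. This is a simplified illustration with assumptions of my own: the substring length q=4 is a default chosen here, and the comparison is a plain set-based Tanimoto, whereas the published LINGO similarity also takes substring counts into account.

```python
from collections import Counter


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count all overlapping substrings of length q in a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingo_similarity(smiles_a: str, smiles_b: str, q: int = 4) -> float:
    """Set-based Tanimoto over the distinct substrings of both strings."""
    a, b = set(lingos(smiles_a, q)), set(lingos(smiles_b, q))
    union = len(a | b)
    return len(a & b) / union if union else 0.0
```

For example, lingos("c1ccccc1") yields the substrings c1cc, 1ccc, cccc (twice) and ccc1. Strings shorter than q produce an empty profile and therefore a similarity of 0.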
Interestingly, there is an extension for PostgreSQL called pg_trgm that provides functions and operators for determining the similarity of text based on trigram matching. More importantly, it also comes with different index operator classes including the latest KNNGIST to speed up similarity searches.
This approach is not as sophisticated as other methods that, for example, set all ring numbers in a SMILES string to 1, but the results are nevertheless very promising. The clear advantage is that cheminformatics routines are not required and that it is extremely fast - the query shown above takes less than 200 ms to return the top 20 hits out of more than 650,000 SMILES strings on a slow database server.
The hits from the query using the ChEMBL database are shown below using the new SVG functionality in Open Babel.
Labels:
chembl,
cheminformatics,
lingo,
open babel,
postgresql extension
Wednesday, 1 June 2011
Binding site similarity searching in CREDO
A fairly common task in structural biology is to identify putative ligands of a binding site - either because the function of the protein is unknown or because the protein might be promiscuous and binding other ligands as well, for example. There are more or less two ways of achieving this: by comparing the structure of the ligands (shape or topology) or by comparing the features of the binding pocket. Many algorithms have been developed in the past for both strategies. The Similarity ensemble approach (SEA) from the Shoichet lab is a more prominent example of the ligand-based approach. These methods have the advantage of not requiring any structural information but are crucially dependent on already existing ligands.
More interesting from the CREDO point of view are the algorithms that utilise structural information to compare and cluster protein ligand-binding sites. In addition, I was only interested in those methods that are alignment-free and encode the binding site features into an object that can easily be serialised and stored in a database. The FuzCav method, which describes a binding site as a fingerprint of pharmacophore triplet counts, fulfilled this criterion and was subsequently implemented (using the new Eigen PostgreSQL extension).
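What a fingerprint of pharmacophore triplet counts looks like can be sketched as follows. This is a loose illustration and not the FuzCav definition: the feature labels, the bin width and the way a triplet is reduced to a key (sorted labels plus sorted binned side lengths) are all assumptions made here for brevity.

```python
from collections import Counter
from itertools import combinations
from math import dist


def triplet_counts(points, bin_width=2.0):
    """Count pharmacophore triplets by feature labels and binned side lengths.

    `points` is a list of (feature_label, (x, y, z)) tuples. The key does
    not depend on the order in which the points are listed.
    """
    counts = Counter()
    for (la, pa), (lb, pb), (lc, pc) in combinations(points, 3):
        labels = tuple(sorted((la, lb, lc)))
        sides = tuple(sorted(
            int(dist(p, q) // bin_width)
            for p, q in ((pa, pb), (pa, pc), (pb, pc))
        ))
        counts[(labels, sides)] += 1
    return counts
```

Because the result is just a mapping from a fixed set of possible keys to counts, it needs no structural alignment and can be flattened into a vector and stored in a database column.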
After benchmarking my implementation using old CREDO, some modifications and enhancements were necessary. More (dis)similarity metrics were added as well as a completely new fingerprint type based on the position of side chain reference atoms. This type of fingerprint relies much less on the conservation of C-alpha atoms and is capable of picking up more distant proteins. Data & group fusion methods to combine several metrics or queries were also implemented.
Two examples are shown below: the first one shows the best hits (first hit is query) for the C-alpha version, the second the hits for the side chain representatives version. The underlying database contains only a small subset of the PDB, so the queries are only shown to demonstrate the technical implementation. The SQL-based version is fairly quick (thanks to Eigen) and currently requires 0.5 s on my virtual machine for the queries shown here and roughly 2000 fingerprints (with 4833 positions).
The FuzCav method works with secondary structure elements/fragments as well, which allows the comparison between protein-protein interfaces and ligand binding sites. Since secondary structure fragments are already stored in CREDO, the technical implementation is trivial but more on this in a future blog post.
Binding site similarity searching using C-alpha atoms
Binding site similarity searching using side chain representative atoms