Denson's Bioinformatics Blog: 2014

Length-dependent prediction of protein intrinsic disorder

Due to the functional importance of intrinsically disordered proteins or protein regions, prediction of intrinsic protein disorder from amino acid sequence has become an area of active research as witnessed in the 6th experiment on Critical Assessment of Techniques for Protein Structure Prediction (CASP6). Since the initial work by Romero et al. (Identifying disordered regions in proteins from amino acid sequences, IEEE Int. Conf. Neural Netw., 1997), our group has developed several predictors optimized for long disordered regions (>30 residues) with prediction accuracy exceeding 85%. However, these predictors are less successful on short disordered regions (≤30 residues). A probable cause is a length-dependent amino acid compositions and sequence properties of disordered regions.

BMC Bioinformatics | Full text | Length-dependent prediction of protein intrinsic disorder

Extremely randomized trees Pierre Geurts Damien Ernst Louis Wehenkel

This paper proposes a new tree-based ensemble method for supervised classifica- tion and regression problems. It essentially consists of randomizing strongly both attribute and cut-point choice while splitting a tree node. In the extreme case, it builds totally random- ized trees whose structures are independent of the output values of the learning sample. The strength of the randomization can be tuned to problem specifics by the appropriate choice of a parameter. We evaluate the robustness of the default choice of this parameter, and we also provide insight on how to adjust it in particular situations. Besides accuracy, the main strength of the resulting algorithm is computational efficiency. A bias/variance analysis of the Extra-Trees algorithm is also provided as well as a geometrical and a kernel characterization of the models induced.

orbi.ulg.ac.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf

Large-scale prediction of long disordered regions in proteins using random forests

Many proteins contain disordered regions that lack fixed three-dimensional (3D) structure under physiological conditions but have important biological functions. Prediction of disordered regions in protein sequences is important for understanding protein function and in high-throughput determination of protein structures. Machine learning techniques, including neural networks and support vector machines have been widely used in such predictions. Predictors designed for long disordered regions are usually less successful in predicting short disordered regions. Combining prediction of short and long disordered regions will dramatically increase the complexity of the prediction algorithm and make the predictor unsuitable for large-scale applications. Efficient batch prediction of long disordered regions alone is of greater interest in large-scale proteome studies.

BMC Bioinformatics | Full text | Large-scale prediction of long disordered regions in proteins using random forests

Research Collaboratory for Structural Bioinformatics Protein Databank (RCSB PDB)

The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. These are the molecules of life that are found in all organisms including bacteria, yeast, plants, flies, other animals, and humans. Understanding the shape of a molecule helps to understand how it works. This knowledge can be used to help deduce a structure's role in human health and disease, and in drug development. The structures in the archive range from tiny proteins and bits of DNA to complex molecular machines like the ribosome.

The PDB archive is available at no cost to users. The PDB archive is updated each week at the target time of Wednesday 00:00 UTC (Coordinated Universal Time). The most recent release is timestamped and linked on every page in the top right header.

The PDB was established in 1971 at Brookhaven National Laboratory under the leadership of Walter Hamilton and originally contained 7 structures. After Hamilton's untimely death, Tom Koetzle began to lead the PDB in 1973, and then Joel Sussman in 1994. In 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) became responsible for the management of the PDB. In 2003, the wwPDB was formed to maintain a single PDB archive of macromolecular structural data that is freely and publicly available to the global community. It consists of organizations that act as deposition, data processing and distribution centers for PDB data.

In addition, the RCSB PDB supports a website where visitors can perform simple and complex queries on the data, analyze, and visualize the results. Details about the history, function, progress, and future goals of the RCSB PDB can be found in our Annual Reports and Newsletters.

The PDB Advisory Notice defines the conditions for using data from the PDB archive. Our Policies & References page describes copyright restrictions on RCSB PDB materials, our privacy policy, and citation information. Data deposition and release policies are available from deposit.pdb.org.

RCSB PDB staff are located at Rutgers, The State University of New Jersey and the University of California, San Diego. Job listings for open positions are posted online.

The Rutgers group has been featured on RU-tv and Biophysical Society TV.

http://www.rcsb.org/

SPINE-D: Intrinsic disorder prediction using neural networks

Short and long disordered regions of proteins have different preference for different amino acid residues. Different methods often have to be trained to predict them separately. In this study, we developed a single neural-network-based technique called SPINE-D that makes a three-state prediction first (ordered residues and disordered residues in short and long disordered regions) and reduces it into a two-state prediction afterwards. SPINE-D was tested on various sets composed of different combinations of Disprot annotated proteins and proteins directly from the PDB annotated for disorder by missing coordinates in X-ray determined structures. While disorder annotations are different according to Disprot and X-ray approaches, SPINE-D's prediction accuracy and ability to predict disorder are relatively independent of how the method was trained and what type of annotation was employed but strongly depend on the balance in the relative populations of ordered and disordered residues in short and long disordered regions in the test set. With greater than 85% overall specificity for detecting residues in both short and long disordered regions, the residues in long disordered regions are easier to predict at 81% sensitivity in a balanced test dataset with 56.5% ordered residues but more challenging (at 65% sensitivity) in a test dataset with 90% ordered residues. Compared to eleven other methods, SPINE-D yields the highest area under the curve (AUC), the highest Mathews correlation coefficient for residue-based prediction, and the lowest mean square error in predicting disorder contents of proteins for an independent test set with 329 proteins. In particular, SPINE-D is comparable to a meta predictor in predicting disordered residues in long disordered regions and superior in short disordered regions. SPINE-D participated in CASP 9 blind prediction and is one of the top servers according to the official ranking. In addition, SPINE-D was examined for prediction of functional molecular recognition motifs in several case studies. The server and databases are available at

http://sparks-lab.org/SPINE-D/

Database of Protein Disorder

The Database of Protein Disorder (DisProt) is a curated database that provides information about proteins that lack fixed 3D structure in their putatively native states, either in their entirety or in part. DisProt is a collaborative effort between Center for Computational Biology and Bioinformatics at Indiana University School of Medicine and Center for Information Science and Technology at Temple University.

Disprot - Database of Protein Disorder

Wednesday, September 17, 2014