Supporting data for "A close look at protein function prediction evaluation protocols".

Dataset type: Software, Proteomic
Data released on August 27, 2015

Kahanda I; Funk C; Ullah F; Verspoor K; Ben-Hur A (2015): Supporting data for "A close look at protein function prediction evaluation protocols". GigaScience Database. https://doi.org/10.5524/100153

DOI: 10.5524/100153

The recently held Critical Assessment of Functional Annotation challenge (CAFA2) required its participants to submit predictions for a large number of target proteins regardless of whether they had previous annotations. This is in contrast to the original CAFA challenge, in which participants were asked to submit predictions only for proteins with no existing annotations. The CAFA2 task is more realistic, in that it more closely mimics the accumulation of annotations over time. In this study we compare these tasks in terms of their difficulty, and determine whether cross-validation provides a good estimate of performance. The CAFA2 task is a combination of two sub-tasks: making predictions on annotated proteins and making predictions on previously unannotated proteins. We analyze the performance of several function prediction methods in these two scenarios. Our results show that several methods (GOstruct, binary SVMs, and guilt by association) struggle to reach, on these two tasks, the accuracy they achieve in cross-validation, and that predicting novel annotations for previously annotated proteins is a harder problem than predicting annotations for uncharacterized proteins. We also find that different methods have different performance characteristics in these tasks, and that cross-validation is not adequate for estimating performance or ranking methods.
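The two sub-tasks described above can be illustrated with a small sketch. It is not the authors' evaluation code: it assumes annotations are available as two snapshots (an earlier and a later one) mapping protein IDs to sets of GO term IDs, and it ignores details such as treating each GO sub-ontology separately or propagating terms to their ancestors. The function name `split_targets` and the protein IDs `P1` to `P3` are hypothetical.

```python
from typing import Dict, Set, Tuple

# Assumed representation: protein ID -> set of GO term IDs.
Annotations = Dict[str, Set[str]]


def split_targets(before: Annotations,
                  after: Annotations) -> Tuple[Annotations, Annotations]:
    """Split newly gained annotations into the two sub-tasks.

    Returns (unannotated, annotated):
      unannotated: proteins with no terms in `before` that gained terms
      annotated:   proteins with terms in `before` that gained new terms
    Each maps a protein ID to the set of terms it gained.
    """
    unannotated: Annotations = {}
    annotated: Annotations = {}
    for protein, terms_after in after.items():
        terms_before = before.get(protein, set())
        gained = terms_after - terms_before
        if not gained:
            continue  # nothing new to predict for this protein
        if terms_before:
            annotated[protein] = gained
        else:
            unannotated[protein] = gained
    return unannotated, annotated


# Toy snapshots with hypothetical protein IDs.
before = {"P1": {"GO:0003677"}, "P2": set()}
after = {
    "P1": {"GO:0003677", "GO:0006355"},
    "P2": {"GO:0005515"},
    "P3": {"GO:0016301"},
}
unannotated, annotated = split_targets(before, after)
```

In this toy example `P1` falls into the annotated sub-task because it already had a term in the earlier snapshot, while `P2` and `P3` fall into the previously unannotated sub-task.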

Additional details

Read the peer-reviewed publication(s):

Kahanda, I., Funk, C. S., Ullah, F., Verspoor, K. M., & Ben-Hur, A. (2015). A close look at protein function prediction evaluation protocols. GigaScience, 4(1). https://doi.org/10.1186/s13742-015-0082-5

Additional information:

https://zenodo.org/record/20553

http://sourceforge.net/projects/strut/

http://pyml.sourceforge.net/

http://bionlp.sourceforge.net/nlp-pipelines/

Files
History

File Name	Description	Sample ID	Data Type	File Format	Size	Release Date	File Attributes	Download
readme.txt			Readme	TEXT	915 B	2015-08-11	MD5 checksum: 4ca4704c6b77b5da37b5877f09ff7fbc
KahandaCAFA2GigaScience2014_data.tar.bz2	Contains all the input data (both features and labels) and the predictions from the three methods (GOstruct, SVM and GBA).		Other	TAR	1.01 GB	2015-08-11	MD5 checksum: e38ef72e57fa31c22843c2e1b02408f2
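
The MD5 checksums listed in the table can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library. The file names and checksums are copied from the table; the helper names `md5_of_file` and `verify_downloads` are illustrative, and the script assumes the files were saved under their original names in the directory it is pointed at.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed in the Files table.
EXPECTED_MD5 = {
    "readme.txt": "4ca4704c6b77b5da37b5877f09ff7fbc",
    "KahandaCAFA2GigaScience2014_data.tar.bz2": "e38ef72e57fa31c22843c2e1b02408f2",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that the 1.01 GB archive is never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists in `directory`
    and its MD5 matches the listed value, otherwise False."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A mismatch usually indicates an incomplete or corrupted download, in which case downloading the file again is the simplest remedy.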

Date	Action
August 27, 2015	Dataset published
September 25, 2015	Manuscript link added: 10.1186/s13742-015-0082-5