Motivation of PCTFPeval

Computational identification of cooperative transcription factor (TF) pairs helps us understand the combinatorial regulation of gene expression in eukaryotic cells. Many advanced algorithms have been proposed to predict cooperative TF pairs in yeast. However, it is still difficult to conduct a comprehensive and objective performance comparison of different algorithms because sufficient performance indices and adequate overall performance scores are lacking. To solve this problem, in our previous study we adopted or proposed eight performance indices and designed two overall performance scores to compare the performance of 14 existing algorithms for predicting cooperative TF pairs in yeast. Most importantly, our performance comparison framework can be applied to comprehensively and objectively evaluate the performance of a newly developed algorithm. However, to use our framework, researchers must first invest considerable effort in constructing it. To save researchers time and effort, here we develop a web tool that implements our performance comparison framework, featuring fast data processing, a comprehensive performance comparison and an easy-to-use web interface.

What is PCTFPeval?

PCTFPeval (Predicted Cooperative Transcription Factor Pair evaluator) is a web tool for comprehensively comparing the performance of a newly developed algorithm with that of various existing algorithms for predicting cooperative transcription factor pairs in yeast. The user-friendly web interface allows users to input a list of predicted cooperative TF pairs from their algorithm and select (i) the compared algorithms among the 15 existing algorithms, (ii) the performance indices among the eight existing indices, and (iii) the overall performance scores from two possible choices. The comprehensive performance comparison results are then generated in tens of seconds and shown as both bar charts and tables. The original comparison results for each compared algorithm and each selected performance index can be downloaded as text files for further analyses.

Fifteen existing algorithms used for performance comparison

PCTFPeval provides 15 existing algorithms for users to conduct a performance comparison. As far as we know, this is the most comprehensive collection of existing algorithms whose lists of predicted cooperative TF pairs (PCTFPs) in yeast are available. The details of these 15 algorithms are given in the following table.

Algorithm | Data Sources Integrated | Method Description | # of PCTFPs
Banerjee and Zhang (NAR 2003) | ChIP-chip data and gene expression data | They inferred a cooperative TF pair under the assumption that the genes regulated by both TFs are more coexpressed than those genes regulated by either TF alone. | 31
Harbison et al. (Nature 2004) | ChIP-chip data | They inferred a cooperative TF pair under the assumption that their binding sites occur more frequently in the same promoter region than random expectation. | 94
Nagamine et al. (NAR 2005) | ChIP-chip data and PPI data | They inferred a cooperative TF pair under the assumption that the existence of interaction between two TFs suggests that they contribute to the same or similar biological process. | 24
Tsai et al. (PNAS 2005) | ChIP-chip data and gene expression data | They used statistical methods to identify yeast cell cycle TFs and synergistic TF pairs. | 18
Chang et al. (Bioinformatics 2006) | ChIP-chip data and gene expression data | They employed a stochastic system model to assess TF cooperativity. | 55
He et al. (IEEE GCCW 2006) | ChIP-chip data and gene expression data | They adopted the gene expression data to predict the cooperative TF pairs by testing whether the expression of the target genes is significantly influenced by their cooperative effect with the multivariate method, ANOVA. | 30
Yu et al. (NAR 2006) | ChIP-chip data | They proposed a method called Motif-PIE, which predicts interacting TF pairs by using a motif discovery procedure. | 300
Wang J (JBI 2007) | ChIP-chip data, gene expression data and TFBS data | They developed a new framework to infer the combinatorial control of TFs by integrating heterogeneous functional genomic datasets. | 14
Elati et al. (Bioinformatics 2007) | Gene expression data | They adopted a data mining system to learn transcriptional regulation relationship from gene expression data. | 20
Datta and Zhao (Bioinformatics 2007) | ChIP-chip data | They used a log-linear model to study cooperative binding among TFs and developed an Expectation-Maximization algorithm for statistical inferences. | 25
Chuang et al. (BMC Bioinformatics 2009) | ChIP-chip data, gene expression data and PWM data | They developed a fuzzy logic approach called ANFIS to identify potential transcriptional interactions. | 13
Wang Y et al. (NAR 2009) | ChIP-chip data, TFBS data, PPI data and MIPS complex catalogue data | They developed a supervised learning approach to predict TF cooperativity using Bayesian networks. | 159
Yang et al. (Cell Research 2010) | ChIP-chip data and TF knockout data | They predicted cooperativity between TFs by identifying the most statistically significant overlap of the target genes regulated by two TFs in ChIP-chip data and TF knockout data. | 186
Chen et al. (Bioinformatics 2012) | ChIP-chip data | They facilitated identification of interactions between TFs by using the motif discovery method when detecting the overlapping targets of TFs based on ChIP-chip data. | 221
Lai et al. (BMC Systems Biology 2014) | ChIP-chip data, TF knockout data, nucleosome occupancy data and TFBS data | They inferred a cooperative TF pair under the assumption that (i) these two TFs have a significantly higher number of common target genes than random expectation and (ii) their binding sites (in the promoters of their common target genes) tend to be co-depleted of nucleosomes in order to make these binding sites simultaneously accessible to TF binding. | 27

Eight existing performance indices used for performance evaluation

PCTFPeval implements eight existing performance indices for users to evaluate the performance of an algorithm for predicting cooperative TF pairs in yeast. As far as we know, this is the most comprehensive collection of existing performance indices. These eight performance indices can be divided into two types: TF-based and target-gene-based (TG-based).

The TF-based type has four performance indices, which are based on the overlap of the PPI partners of a PCTFP, the shortest path length of a PCTFP in the PPI network, the functional similarity of a PCTFP, and the overlap between a set of PCTFPs and a benchmark set of 27 known cooperative TF pairs.

The TG-based type has four performance indices which are based on the overlap of a PCTFP’s target genes, the expression coherence of a PCTFP’s common target genes, the functional coherence of a PCTFP’s common target genes, and the PPI coherence of a PCTFP’s common target genes.

TF-based performance index 1

Yeast genes are frequently regulated through combinations of TFs. Because the existence of a protein-protein interaction between two TFs often reflects functional similarity and implies participation in the same regulation, PPI data can be used to detect the cooperativity of two TFs. Therefore, in our previous study, we proposed a performance index based on the overlap of the PPI partners of a PCTFP, using physical PPI data retrieved from the BioGRID database. Using the hypergeometric distribution, a score S is assigned to a PCTFP to represent the significance of the overlap of their PPI partners as follows:

\[
S = -\log_{10} P, \qquad P = \sum_{i=c}^{\min(N_1, N_2)} \frac{\binom{N_1}{i}\binom{N - N_1}{N_2 - i}}{\binom{N}{N_2}}
\]

where P is the P-value calculated using the hypergeometric distribution, c is the number of common PPI partners of the two TFs in a PCTFP, N_1 is the number of PPI partners of the first TF, N_2 is the number of PPI partners of the second TF and N = 6575 is the number of unique genes in the Saccharomyces Genome Database (SGD). The greater the S, the more significant the cooperativity of a PCTFP. To evaluate the performance of a set of PCTFPs from an algorithm, where each PCTFP has been given a score S, we took the mean of these scores as the final score of this performance index.
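The calculation can be illustrated with a short Python sketch. It assumes that P is the upper tail of the hypergeometric distribution (the probability of observing at least c common partners) and that S is its negative decimal logarithm; the function names `ppi_overlap_score` and `index_score` are hypothetical and are not part of PCTFPeval.

```python
from math import comb, log10

SGD_GENE_COUNT = 6575  # N in the text: number of unique genes in SGD


def ppi_overlap_score(c, n1, n2, n=SGD_GENE_COUNT):
    """Return S = -log10(P) for the PPI partners overlap of one TF pair.

    Assumption: P is the hypergeometric upper tail, i.e. the chance of
    seeing at least c common partners given n1 and n2 partners drawn
    from n genes.
    """
    if not 0 <= c <= min(n1, n2):
        raise ValueError("c must lie between 0 and min(n1, n2)")
    # Numerator of P as an exact integer: sum over overlaps i = c..min(n1, n2).
    tail = sum(comb(n1, i) * comb(n - n1, n2 - i)
               for i in range(c, min(n1, n2) + 1))
    total = comb(n, n2)
    # Subtracting logs of exact integers avoids float underflow
    # when P is extremely small.
    return log10(total) - log10(tail)


def index_score(scores):
    """Final index score of an algorithm: mean of its per-pair scores."""
    return sum(scores) / len(scores)
```

For instance, with a toy universe of 10 genes and two TFs that each have the same 3 partners, `ppi_overlap_score(3, 3, 3, n=10)` evaluates to log10(120), since only one of the 120 possible partner sets gives a complete overlap.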

TF-based performance index 2

Aguilar and Oliva (2008) observed that a biologically plausible cooperative TF pair may have a shorter path length in the physical PPI network than random expectation. This motivated us to implement a performance index based on the shortest path length of a PCTFP in the physical PPI network. The physical PPI data were retrieved from the BioGRID database. A score S is assigned to a PCTFP as the inverse of the shortest path length between its two TFs in the physical PPI network. The greater the S, the more significant the cooperativity of a PCTFP. To evaluate the performance of a set of PCTFPs from an algorithm, where each PCTFP has been given a score S, we took the mean of these scores as the final score of this performance index.
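A minimal sketch of this score follows, assuming the PPI network is held as an undirected adjacency dictionary and that a pair with no connecting path scores 0. The text does not say how unreachable pairs are handled, so that choice is an assumption, and the function names are hypothetical.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Breadth-first search on an unweighted graph.

    Returns the number of edges on a shortest path, or None when the
    target cannot be reached from the source.
    """
    if source not in graph or target not in graph:
        return None
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return None


def path_length_score(graph, tf1, tf2):
    """Score S = 1 / (shortest path length) for one TF pair."""
    length = shortest_path_length(graph, tf1, tf2)
    # Assumption: unreachable pairs (None) and degenerate pairs (length 0)
    # receive a score of 0.
    if not length:
        return 0.0
    return 1.0 / length


# Toy network with placeholder protein names (not real BioGRID data).
toy_ppi = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
```

In the toy network, directly interacting proteins A and B score 1, A and C (two edges apart) score 0.5, and A and the isolated protein D score 0.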

TF-based performance index 3

Apart from PPI data, GO annotations are often used to computationally measure the semantic similarity of genes. Functionally similar TFs are more likely to cooperate with each other to regulate genes. Therefore, in our previous study, we proposed a performance index based on the functional similarity of a PCTFP. The functional similarity score of a PCTFP is adopted from Yang et al.’s study. The greater the functional similarity score, the more significant the cooperativity of a PCTFP. To evaluate the performance of a set of PCTFPs from an algorithm, where each PCTFP has been given a functional similarity score, we took the mean of these scores as the final score of this performance index.

TF-based performance index 4

This index is adopted from Yang et al.’s study. Yang et al. compiled a high-quality benchmark dataset of 27 cooperative TF pairs from the MIPS functional complex catalogue. They then developed a procedure based on Fisher’s exact test to calculate the P-value that represents the significance of the overlap between a set of PCTFPs (from an algorithm) and the benchmark dataset. Here we define a score S as the negative logarithm of the P-value. The greater the S, the more significant the overlap between a set of PCTFPs (from an algorithm) and the benchmark dataset.

TG-based performance index 5

This index (adopted from Balaji et al.’s study) is based on the significance of the overlap of a PCTFP’s target genes, i.e. the significance of the association of a PCTFP in regulating common target genes. In Balaji et al.’s study, a specific network transformation procedure was used to construct a co-regulatory network called Cnet, which describes the significant associations among TFs in regulating common target genes. They produced a co-regulatory coefficient dataset with 3459 TF pairs. We employed this dataset to assign a co-regulatory coefficient to each PCTFP. The greater the co-regulatory coefficient, the more significant the cooperativity of a PCTFP. To evaluate the performance of a set of PCTFPs from an algorithm, where each PCTFP has been given a co-regulatory coefficient, we took the mean of these coefficients as the final score of this performance index.

TG-based performance index 6

Various studies have suggested that the transcriptional cooperativity of a TF pair can be assessed not only by the significance of the overlap of their target genes but also by the significance of the expression coherence among their common target genes. Therefore, in our previous study, we proposed an index to calculate the significance of the expression coherence among the common target genes of a PCTFP. The common target genes of a PCTFP were retrieved from the YEASTRACT database.

This index calculates the expression coherence score (ECS) among the common target genes of a PCTFP. Let A be the set of all possible gene pairs formed by any two common target genes of a PCTFP. For each gene pair in A, its co-expression score is retrieved from the SPELL database. The ECS is then defined as the fraction of gene pairs in A with a co-expression score higher than a threshold T, which was set to the 95th percentile of the co-expression scores of the 39 million gene pairs deposited in the SPELL database. Note that 0 <= ECS <= 1. The greater the ECS, the more significant the cooperativity of a PCTFP. To evaluate the performance of a set of PCTFPs from an algorithm, where each PCTFP has been assigned an ECS, we took the mean of these ECSs as the final score of this performance index.
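The ECS definition can be sketched in a few lines of Python. The sketch is generic in the pairwise score: it only needs a lookup table of scores for unordered gene pairs and a threshold T. Two choices are assumptions not stated in the text: a gene pair with no recorded score is counted as not exceeding T, and a PCTFP with fewer than two common target genes receives a score of 0. The function name is hypothetical.

```python
from itertools import combinations


def coherence_score(common_targets, pair_scores, threshold):
    """Fraction of target-gene pairs whose pairwise score exceeds threshold.

    common_targets: iterable of gene names shared by the two TFs.
    pair_scores:    dict mapping frozenset({gene1, gene2}) to a score.
    threshold:      T, e.g. the 95th percentile of all pairwise scores.
    """
    # The set A: every unordered pair of distinct common target genes.
    pairs = list(combinations(sorted(set(common_targets)), 2))
    if not pairs:
        return 0.0  # assumption: no pairs means no measurable coherence
    above = sum(
        1 for g1, g2 in pairs
        # Assumption: pairs absent from the lookup never exceed T.
        if pair_scores.get(frozenset((g1, g2)), float("-inf")) > threshold
    )
    return above / len(pairs)


# Toy co-expression scores with placeholder gene names (not SPELL data).
toy_scores = {
    frozenset(("g1", "g2")): 0.9,
    frozenset(("g1", "g3")): 0.2,
    frozenset(("g2", "g3")): 0.8,
}
```

With the toy scores and T = 0.5, two of the three gene pairs exceed the threshold, so `coherence_score(["g1", "g2", "g3"], toy_scores, 0.5)` is 2/3.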

TG-based performance index 7

Various studies have suggested that the transcriptional cooperativity of a TF pair can be assessed not only by the significance of the overlap of their target genes but also by the significance of the functional coherence among their common target genes. Therefore, in our previous study, we proposed an index to calculate the significance of the functional coherence among the common target genes of a PCTFP. The common target genes of a PCTFP were retrieved from the YEASTRACT database.

This index calculates the functional coherence score (FCS) among the common target genes of a PCTFP. Let A be the set of all possible gene pairs formed by any two common target genes of a PCTFP. For each gene pair in A, its functional similarity score is retrieved from Yang et al.’s study. The FCS is then defined as the fraction of gene pairs in A with a functional similarity score higher than a threshold T, which was set to the 95th percentile of the functional similarity scores of the 13 million gene pairs deposited in Yang et al.’s study. Note that 0 <= FCS <= 1. The greater the FCS, the more significant the cooperativity of a PCTFP. To evaluate the performance of a set of PCTFPs from an algorithm, where each PCTFP has been assigned an FCS, we took the mean of these FCSs as the final score of this performance index.

TG-based performance index 8

Various studies have suggested that the transcriptional cooperativity of a TF pair can be assessed not only by the significance of the overlap of their target genes but also by the significance of the PPI coherence among their common target genes. Therefore, in our previous study, we proposed an index to calculate the significance of the PPI coherence among the common target genes of a PCTFP. The common target genes of a PCTFP were retrieved from the YEASTRACT database.

This index calculates the PPI coherence score (PCS) among the common target genes of a PCTFP. Let A be the set of all possible gene pairs formed by any two common target genes of a PCTFP. For each gene pair in A, its PPI similarity score is defined as the negative decimal logarithm of the P-value (calculated using the hypergeometric distribution) that represents the significance of the overlap between the PPI partners of the two genes. The PCS is then defined as the fraction of gene pairs in A with a PPI similarity score higher than a threshold T, which was set to the 95th percentile of the PPI similarity scores of the 14 million gene pairs that we precompiled using the physical PPI data retrieved from the BioGRID database. Note that 0 <= PCS <= 1. The greater the PCS, the more significant the cooperativity of a PCTFP. To evaluate the performance of a set of PCTFPs from an algorithm, where each PCTFP has been assigned a PCS, we took the mean of these PCSs as the final score of this performance index.

Two existing overall performance scores used for representing the comprehensive performance comparison results

Our tool implements two existing overall performance scores (Lai et al. 2014) to summarize the comparison results of the selected performance indices. The first one, called the comprehensive ranking score, is defined as the sum of an algorithm’s rankings in the selected performance indices. The ranking of an algorithm in an index is k if its performance ranks k-th among all the compared algorithms in that index. For example, the ranking of the best performing algorithm is 1. Therefore, the smaller the comprehensive ranking score, the better the overall performance of an algorithm.
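As an illustration, the comprehensive ranking score can be computed from a table of original scores as sketched below. The sketch assumes that a higher original score is better in every index and that tied algorithms share the same rank; the text does not describe tie handling, so this is an assumption, and the function name is hypothetical.

```python
def comprehensive_ranking_scores(original_scores):
    """Sum each algorithm's rank over all selected indices.

    original_scores: dict mapping index name -> {algorithm: original score}.
    Returns a dict mapping algorithm -> comprehensive ranking score
    (smaller is better).
    """
    totals = {}
    for index_scores in original_scores.values():
        for algorithm, score in index_scores.items():
            # Rank = 1 + number of algorithms with a strictly higher score
            # (assumption: ties share a rank).
            rank = 1 + sum(1 for other in index_scores.values() if other > score)
            totals[algorithm] = totals.get(algorithm, 0) + rank
    return totals


# Toy scores for three placeholder algorithms on two placeholder indices.
toy_table = {
    "index_a": {"A": 3.0, "B": 2.0, "C": 1.0},
    "index_b": {"A": 0.1, "B": 0.5, "C": 0.3},
}
ranking = comprehensive_ranking_scores(toy_table)
```

Here algorithm A ranks 1st and 3rd (total 4), B ranks 2nd and 1st (total 3), and C ranks 3rd and 2nd (total 5), so B has the best overall performance under this score.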

The second overall performance score is called the comprehensive normalized score (CNS) defined as the sum of the normalized scores in the selected performance indices. The CNS of the algorithm i is calculated as follows:

\[
\mathrm{CNS}(i) = \sum_{j=1}^{L} NS_j(i), \qquad NS_j(i) = \frac{OS_j(i)}{\max\{OS_j(1), OS_j(2), \ldots, OS_j(n)\}}
\]

where NS_j(i) and OS_j(i) are the normalized score and the original score of algorithm i calculated using index j, respectively; n is the number of algorithms being compared; and L is the number of selected indices. Note that 0 <= NS_j(i) <= 1 and NS_j(i) = 1 if and only if algorithm i is the best performing algorithm in index j (i.e. it has the highest original score calculated using index j). The larger the CNS, the better the performance of an algorithm.
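The CNS can be sketched in Python under the assumption that each original score is normalized by the highest original score in the same index, which is consistent with the stated properties (normalized scores lie between 0 and 1, and only the best algorithm reaches 1). The sketch further assumes non-negative original scores and assigns 0 when the highest score in an index is 0; the function name is hypothetical.

```python
def comprehensive_normalized_scores(original_scores):
    """Sum each algorithm's normalized score over all selected indices.

    original_scores: dict mapping index name -> {algorithm: original score}.
    Returns a dict mapping algorithm -> CNS (larger is better).
    """
    cns = {}
    for index_scores in original_scores.values():
        best = max(index_scores.values())
        for algorithm, score in index_scores.items():
            # Assumption: normalize by the best score in this index;
            # fall back to 0 if no algorithm has a positive score.
            normalized = score / best if best > 0 else 0.0
            cns[algorithm] = cns.get(algorithm, 0.0) + normalized
    return cns


# Toy scores for three placeholder algorithms on two placeholder indices.
toy_original = {
    "index_a": {"A": 3.0, "B": 2.0, "C": 1.0},
    "index_b": {"A": 0.1, "B": 0.5, "C": 0.3},
}
cns = comprehensive_normalized_scores(toy_original)
```

In this toy example B obtains the largest CNS (2/3 + 1), ahead of A (1 + 0.2) and C (1/3 + 0.6). Unlike the ranking score, the CNS retains how far each algorithm is from the best one in every index.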

A conceptual flowchart of PCTFPeval

The conceptual flowchart of our tool is shown in Figure 1. The user-friendly web interface allows users to input a list of predicted cooperative TF pairs from their algorithm. Three settings must then be specified. First, users choose the compared algorithms among the 15 existing algorithms. Second, users choose the performance indices among the eight existing indices. Finally, users choose the overall performance score: the comprehensive ranking score or the comprehensive normalized score. After submission, our tool conducts a comprehensive performance comparison of the user’s algorithm with the compared algorithms using the selected performance indices. The comprehensive performance comparison results are then generated in tens of seconds and shown as both bar charts and tables.

Figure 1: The conceptual flowchart of our tool.

The flowchart shows the procedure of using our tool to conduct a comprehensive performance comparison of the user’s algorithm to many existing algorithms using various performance indices.

A case study

In our tool, a list of 40 TF pairs is provided as sample data. For demonstration purposes, we regard the sample data as the list of predicted cooperative TF pairs from a new algorithm and conduct a comprehensive performance comparison of this new algorithm with the existing algorithms using our tool. As shown in Figure 2, users input the sample data into our tool and select (i) 10 existing algorithms for comparison, (ii) eight performance indices for evaluation, and (iii) the comprehensive ranking score as the overall performance score.

Figure 2: The input and three settings of our tool.

To use our tool, users have to (a) input a list of the predicted cooperative TF pairs (PCTFPs) from their algorithm and select (b) the compared algorithms among the 15 existing algorithms, (c) the performance indices among the eight existing indices, and (d) the overall performance scores from the comprehensive ranking score and the comprehensive normalized score.

After submission, the comprehensive comparison results are generated and shown as both bar charts and tables (see Figure 3). It can be seen that the new algorithm performs well in the first five performance indices but less well in the last three. The overall performance of the new algorithm ranks third among the 11 algorithms being compared. From these comprehensive comparison results, researchers can immediately see that there is still room to improve the performance of their new algorithm.

Figure 3: The output of our tool.

Here we input the sample data (a list of 40 TF pairs) as a list of predicted cooperative TF pairs (PCTFPs) from a user’s algorithm and select 10 existing algorithms, eight performance indices, and the comprehensive ranking score as the overall performance score. (a) The comprehensive performance comparison results are shown as a bar chart and a table. It can be seen that the overall performance of the user’s algorithm ranks third among the 11 algorithms being compared. (b) When clicking the hyperlink “Index5”, users get the performance comparison results (shown as both a bar chart and a table) using only index 5. It can be seen that the user’s algorithm is the best performing algorithm in index 5. (c) When clicking the hyperlink “Details of the score of Index5 for each algorithm”, users get a text file containing the original scores (calculated using index 5) of all PCTFPs of each algorithm being compared.