Using Semi-supervised Discriminant Analysis to Predict Subcellular Localization of Gram-negative Bacterial Proteins

In this paper, an effective dimension reduction approach called semi-supervised discriminant analysis (SDA) is employed to deal with the protein subcellular localization problem. First, a protein sequence encoding method that combines pseudo amino acid composition (PseAAC) and dipeptide composition (DC) is introduced to represent a protein. Second, the SDA algorithm is applied to extract the essential discriminant features from the combined feature set consisting of PseAAC and DC. Finally, the K-nearest neighbor (K-NN) classifier is used to identify the subcellular localization of Gram-negative bacterial proteins. The proposed method can effectively utilize both the manifold structure and the class information of the protein samples to guide the prediction of protein subcellular localization. To evaluate the prediction performance of the proposed algorithm, a jackknife test based on the nearest neighbor algorithm is employed on the Gram-negative bacterial protein data set. The results show that a high total accuracy can be obtained in a low-dimensional feature space, which indicates that the proposed approach is effective and practical.


Introduction
Gram-negative bacteria are a class of bacteria that do not retain crystal violet dye in the Gram staining protocol. Many Gram-negative bacteria can cause disease in a host organism. Reliable subcellular localization of a Gram-negative bacterial protein based on its sequence information can provide valuable clues about its function and is helpful for drug development. It is therefore important to develop methods for accurately predicting the subcellular localization of Gram-negative bacterial proteins. To date, a number of effective computational approaches have been presented for protein subcellular localization prediction [1][2][3]. However, accurately predicting protein subcellular localization in an automatic fashion remains a challenge.
Different types of information sources, such as textual descriptions of proteins, sequence-based features, and gene ontology annotations, can be used to construct features for protein subcellular localization prediction; among these, sequence-based features are widely used in many biological applications. Representative sequence-based features include sorting signals, dipeptide composition, amino acid composition, and pseudo amino acid composition. Nakashima and Nishikawa proposed representing a protein sequence by its amino acid composition (AAC) [4][5][6]. It has been shown that AAC is closely related to protein subcellular localization. Although AAC is a very simple and effective approach for protein sequence encoding that has achieved promising performance in many applications, it does not consider sequence-order information. To address this problem, Shen and Chou developed the pseudo amino acid composition (PseAAC), which can represent a protein sample in a more effective way [7][8][9]. Since the concept of Chou's PseAAC was introduced, researchers have proposed various PseAAC approaches to enhance the performance of protein classification.
Dipeptide composition (DC) is another sequence-based feature that has been successfully used for protein sequence encoding. DC records the occurrence frequencies of all pairs of consecutive residues in a protein [10][11][12], yielding a 400-dimensional feature vector for every protein. DC has also been applied to predict the subcellular localization of proteins.
Although sequence-based features have been successfully applied to many biological sequence problems, they still contain noise and are not directly suitable for classification. Recently, dimension reduction methods have been introduced to extract the essential features of protein samples. For example, Ma used principal component analysis (PCA) to extract PseAAC features from proteins and then employed an Elman recurrent neural network (ERNN) to identify the protein sequences [13]. Wang used the linear discriminant analysis (LDA) method to extract essential low-dimensional features of Gram-negative bacterial proteins [14]. In [15], Wang used a kernel-based nonlinear dimensionality reduction method to capture the nonlinear characteristics of Gram-negative bacterial proteins, showing that nonlinear dimensionality reduction is quite promising for dealing with complicated biological problems.
In fact, PCA provides an efficient way to compress biological data without losing much information. Different from PCA, LDA aims to extract the most discriminative features using data label information, so it is more suitable for classification tasks. In this paper, a modified version of LDA, namely semi-supervised discriminant analysis (SDA) [16], is used to solve the protein subcellular localization problem. The presented method supplies a novel technique for extracting essential discriminant features from combined vectors consisting of PseAAC and DC. To evaluate the prediction performance of the proposed algorithm, a jackknife test based on the nearest neighbor algorithm is employed on the Gram-negative bacterial protein data set.
The results indicate that the proposed approach achieves a high prediction performance and is effective and practical.

Pseudo amino acid composition (PseAAC)
In order to make effective use of the sequence-order information of proteins, Chou [7] proposed the PseAAC approach to represent a protein sequence. According to the concept of PseAAC, a protein P with L amino acid residues $S_1 S_2 S_3 \ldots S_L$, where $S_i$ denotes the residue at sequence position $i$, can be represented as

$$P = \left[ p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\Lambda} \right]^{T} \tag{1}$$

where the $20 + \Lambda$ components reflect the relative residue frequencies and the sequence-order information of the protein:

$$p_u = \begin{cases} \dfrac{f_u}{\sum_{k=1}^{20} f_k + w \sum_{j=1}^{\Lambda} \tau_j}, & 1 \le u \le 20 \\[2ex] \dfrac{w\,\tau_{u-20}}{\sum_{k=1}^{20} f_k + w \sum_{j=1}^{\Lambda} \tau_j}, & 20 + 1 \le u \le 20 + \Lambda \end{cases}$$

where $f_k$ is the occurrence frequency of the $k$-th native amino acid and $\tau_j$ is the $j$-tier sequence correlation factor. The weight factor $w$ controls the contribution of the sequence-order effect and is set to 0.05 as in Ref. [9]. In this paper, the parameter $\Lambda$ is set to 10, giving a 30-dimensional (30D) PseAAC feature vector.
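The construction above can be sketched in code. The snippet below is a minimal illustration rather than the exact implementation used in the paper: the property scale `PROP` is a made-up stand-in for the real physicochemical scales (hydrophobicity, hydrophilicity, side-chain mass) that type-1 PseAAC averages over, and the correlation function is simplified to a single squared difference.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical normalized property scale, a stand-in for the real
# physicochemical scales used in type-1 PseAAC (assumption for illustration).
PROP = {aa: (i - 9.5) / 9.5 for i, aa in enumerate(AMINO_ACIDS)}

def pseaac(seq, lam=10, w=0.05):
    """(20 + lam)-dimensional PseAAC vector of a protein sequence."""
    L = len(seq)
    # f_k: occurrence frequencies of the 20 native amino acids.
    f = np.array([seq.count(aa) for aa in AMINO_ACIDS]) / L
    # tau_j: j-tier sequence-correlation factors (simplified correlation).
    tau = np.array([
        np.mean([(PROP[seq[i]] - PROP[seq[i + j]]) ** 2 for i in range(L - j)])
        for j in range(1, lam + 1)
    ])
    denom = f.sum() + w * tau.sum()
    return np.concatenate([f, w * tau]) / denom

v = pseaac("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")  # 30D vector, components sum to 1
```

By construction, the 20 + Λ components are non-negative and sum to one, so the vector is a valid composition.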

Dipeptides composition (DC)
DC is a very effective discrete model which has been successfully applied to predict protein structure information. The main advantage of DC is that it makes use of local-order information across the whole protein. Suppose the length of the protein P is L; then it contains L − 1 dipeptides, i.e. $\langle S_1, S_2 \rangle, \langle S_2, S_3 \rangle, \ldots, \langle S_{L-1}, S_L \rangle$. The feature vector of DC can be calculated as

$$DC(i) = \frac{n_i}{L - 1}, \quad i = 1, 2, \ldots, 400$$

where $i$ indexes the 400 possible dipeptides and $n_i$ denotes the number of occurrences of the $i$-th dipeptide. After obtaining the PseAAC and DC features of a protein, we fuse them to form a combined feature vector. As a result, a protein sequence is encoded by a 30 + 400 = 430D feature vector. In this article, we then use SDA to extract more discriminant features from this combined representation.
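A dipeptide-composition encoder is straightforward to sketch; the version below is an illustrative implementation, normalizing by the L − 1 dipeptides so that the 400 components sum to one.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = [a + b for a in AMINO_ACIDS for b in AMINO_ACIDS]  # all 400 pairs

def dipeptide_composition(seq):
    """400D vector of dipeptide occurrence frequencies."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):      # a length-L sequence has L-1 dipeptides
        counts[seq[i:i + 2]] += 1
    n = np.array([counts[dp] for dp in DIPEPTIDES], dtype=float)
    return n / (len(seq) - 1)

dc = dipeptide_composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

The 30D PseAAC vector and this 400D DC vector can then be concatenated (e.g. with `np.concatenate`) to give the 430D combined representation used in this paper.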

Semi-supervised discriminant analysis (SDA)
SDA can be seen as an extension of LDA that aims to capture the global and local structure of the data simultaneously. As is well known, LDA is a supervised method that seeks the optimal transformation maximizing the between-class scatter while minimizing the within-class scatter. However, it can only exploit the global Euclidean structure of the data. To address this problem, Cai et al. [16] developed the SDA algorithm, which extends LDA with a locality-preserving regularizer constructed from the training samples.
Given a training set $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$, each column of $X$ is a sample vector. Suppose there are $c$ known pattern classes, the number of training samples in the $i$-th class is $n_i$, and the total number of training samples is $n$. The between-class scatter matrix and total scatter matrix are defined as

$$S_b = \sum_{i=1}^{c} n_i (u_i - u)(u_i - u)^{T}$$

$$S_t = \sum_{i=1}^{n} (x_i - u)(x_i - u)^{T}$$

where $u_i$ denotes the mean vector of the training samples in the $i$-th class and $u$ denotes the mean vector of all training samples.
The objective function of SDA is

$$\max_{p} \; \frac{p^{T} S_b \, p}{p^{T} S_t \, p + \alpha H(p)}$$

where $p$ is a column of the projection matrix $P$, $H(p)$ is a regularizer that incorporates the intrinsic geometric structure inferred from unlabeled data points so that the manifold structure can be optimally preserved, and $\alpha$ is the regularization parameter that balances the contribution of the model complexity and the empirical loss.
In order to incorporate the manifold structure information of both labeled and unlabeled samples, the regularizer $H(p)$ is defined as

$$H(p) = \sum_{i,j} \left( p^{T} x_i - p^{T} x_j \right)^2 S_{ij} = 2\, p^{T} X L X^{T} p$$

where $L = D - S$ is the graph Laplacian, $D$ is a diagonal matrix with entries $D_{ii} = \sum_{j} S_{ij}$, and $S$ is the adjacency matrix defined by

$$S_{ij} = \begin{cases} 1, & \text{if } x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i) \\ 0, & \text{otherwise} \end{cases}$$

where $N_k(x_i)$ denotes the set of $k$ nearest neighbors of $x_i$. Absorbing the constant factor into $\alpha$, the objective function of SDA can then be expressed as

$$\max_{p} \; \frac{p^{T} S_b \, p}{p^{T} \left( S_t + \alpha X L X^{T} \right) p}$$

By means of the Lagrangian multiplier method, the projection matrix $P$ is constructed from the eigenvectors $p_1, p_2, \ldots, p_d$ of $(S_t + \alpha X L X^{T})^{-1} S_b$ associated with the $d$ largest eigenvalues, i.e. $P = [p_1, p_2, \ldots, p_d]$. The new representation of a sample $x_i$ can then be expressed as

$$y_i = P^{T} x_i$$
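The whole procedure, from the k-NN graph to the eigenvectors of $(S_t + \alpha X L X^{T})^{-1} S_b$, can be sketched as follows. This is a small illustrative implementation, not the authors' code; the dense pairwise-distance computation and the tiny ridge term added for numerical stability are choices of this sketch.

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric k-NN adjacency matrix S (S_ij = 1 iff i, j are neighbours)."""
    n = X.shape[1]
    D2 = ((X.T[:, None, :] - X.T[None, :, :]) ** 2).sum(-1)  # squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[1:k + 1]  # skip the point itself
        S[i, nbrs] = 1
    return np.maximum(S, S.T)

def sda(X, y, alpha=1.0, d=2, k=5):
    """X: m x n data (columns are samples); y: labels, -1 marks unlabelled."""
    m, n = X.shape
    u = X.mean(axis=1, keepdims=True)
    St = (X - u) @ (X - u).T                       # total scatter
    Sb = np.zeros((m, m))
    for c in np.unique(y[y >= 0]):                 # between-class scatter (labelled)
        Xc = X[:, y == c]
        uc = Xc.mean(axis=1, keepdims=True)
        Sb += Xc.shape[1] * (uc - u) @ (uc - u).T
    S = knn_graph(X, k)
    Lap = np.diag(S.sum(1)) - S                    # graph Laplacian L = D - S
    A = St + alpha * X @ Lap @ X.T + 1e-8 * np.eye(m)  # ridge for stability
    vals, vecs = np.linalg.eig(np.linalg.solve(A, Sb))
    order = np.argsort(-vals.real)[:d]             # d largest eigenvalues
    P = vecs[:, order].real                        # projection matrix
    return P, P.T @ X                              # low-dimensional data

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 40))                      # 10 features, 40 samples
y = np.array([0] * 15 + [1] * 15 + [-1] * 10)      # last 10 samples unlabelled
P, Y = sda(X, y, alpha=1.0, d=2)
```

Note how the unlabelled samples contribute only through the Laplacian term, while the between-class scatter is built from labelled samples alone, which is exactly the semi-supervised character of SDA.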

Evaluation criteria
In order to investigate the performance of the proposed method, the overall prediction accuracy and the subclass accuracy are used to assess the prediction system. The overall prediction accuracy is defined as

$$Q = \frac{T}{n} \times 100\%$$

where $T$ is the number of query sequences whose subcellular locations have been correctly recognized and $n$ is the total number of sequences in the test dataset. The subclass accuracy reflects the accuracy rate of each subclass. Suppose $T_i$ query protein sequences are correctly recognized in location $i$; then

$$Q_i = \frac{T_i}{n_i} \times 100\%$$

where $n_i$ is the number of sequences in location $i$.
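Both criteria are simple counting ratios; a short sketch with made-up labels for illustration:

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Q = T / n: fraction of correctly recognized sequences."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

def subclass_accuracy(y_true, y_pred, location):
    """Q_i = T_i / n_i: accuracy within one subcellular location."""
    mask = np.asarray(y_true) == location
    return (np.asarray(y_pred)[mask] == location).mean()

# Invented toy labels purely for illustration.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
q = overall_accuracy(y_true, y_pred)          # 5 of 6 correct
q0 = subclass_accuracy(y_true, y_pred, 0)     # 2 of 3 location-0 sequences correct
```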

Results and discussion
The benchmark dataset described in section 2.1 is used to test the performance of the proposed method, and MATLAB is used for the data analysis. There are three popular cross-validation methods: the independent dataset test, the subsampling test, and the jackknife test. Among the three, the jackknife test is the most widely used because it always yields a unique result for a given benchmark dataset. Therefore, the jackknife test has been widely adopted by researchers to examine the accuracy of various prediction methods, and we also use it to test the performance here.
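A jackknife (leave-one-out) test with a nearest-neighbor predictor can be sketched as follows; the toy data set is invented purely for illustration, and each sample is held out once and predicted from all the others.

```python
import numpy as np

def jackknife(X, y, predict):
    """Leave-one-out test: returns the fraction of held-out samples
    correctly predicted from the remaining samples."""
    n = len(y)
    correct = 0
    for i in range(n):
        keep = np.arange(n) != i
        correct += predict(X[keep], y[keep], X[i]) == y[i]
    return correct / n

def nn_predict(X_train, y_train, x):
    """1-nearest-neighbour prediction (Euclidean distance)."""
    d = ((X_train - x) ** 2).sum(axis=1)
    return y_train[np.argmin(d)]

# Toy 1-D data: two well-separated clusters (invented for illustration).
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
acc = jackknife(X, y, nn_predict)
```

Because every sample is tested exactly once against a fixed training set, the jackknife result is deterministic, which is why it yields a unique accuracy for a given benchmark dataset.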

Results of the proposed method
The 430D combined feature set is used as the input to SDA, and the KNN classifier is then applied for classification. It should be pointed out that the choice of the number of neighbors K is a crucial issue for the KNN classifier [17]. In this paper, the value of K was experimentally set to 1 because it yields comparatively better recognition results.
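A generic KNN decision rule, of which the K = 1 setting used here is a special case, might look like the following sketch (toy data invented for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=1):
    """Majority vote among the k nearest training samples;
    k = 1 reduces to the plain nearest-neighbour rule."""
    d = ((X_train - x) ** 2).sum(axis=1)       # squared Euclidean distances
    nearest = y_train[np.argsort(d)[:k]]       # labels of the k closest samples
    return Counter(nearest).most_common(1)[0][0]

# Toy 1-D data (invented for illustration).
X = np.array([[0.0], [0.2], [5.0], [5.2], [5.4]])
y = np.array([0, 0, 1, 1, 1])
a = knn_predict(X, y, np.array([0.1]), k=1)
b = knn_predict(X, y, np.array([5.1]), k=3)
```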
In general, the recognition rate varies with the dimension of the feature subspace generated by SDA. Table 1 shows the recognition rates and the corresponding dimensionalities of the proposed method.
As can be seen, the best result obtained in the optimal subspace is 97.70% at dimensionality 7, which is much lower than the dimension of the original combined feature set. Moreover, the performance of the proposed method rises quickly as the dimensionality increases from 1 to 7, which indicates that SDA can discover the intrinsic structure of the protein sequences.

The usefulness of dimension reduction
In order to illustrate that SDA can extract more effective features and enhance the prediction accuracy, we compare it with the results obtained in the original 430D combined feature space. Table 2 shows the subcellular localization accuracies and the overall accuracies of the two methods, where the dimensionality of SDA is set to 7. It can be seen from Table 2 that the overall accuracy obtained without dimensionality reduction is 59.11%, which is much lower than the result of the proposed method. We can therefore conclude that the prediction quality is improved by dimensionality reduction. Furthermore, the recognition rates of Fimbrium and Flagellum are remarkably enhanced by our method, which indicates that the proposed method performs well at recognizing bacterial proteins belonging to locations with only a few samples.

Comparison with other dimension reduction methods
In this subsection, the SDA-KNN predictor is compared with other dimension reduction based classification methods, namely PCA and LDA. Table 4 gives the prediction results of the three methods. From Table 4 we can see that the total accuracies for PCA, LDA, and SDA are 66%, 93.57%, and 97.70%, respectively. The success rate of the proposed approach is 31.70% and 4.13% higher than the success rates of the PCA and LDA algorithms, respectively. For the individual subcellular locations, the performance of SDA is also superior to that of PCA and LDA. The experiments show that SDA outperforms PCA and LDA and is more suitable for solving the classification problems of complex biological patterns.

Conclusions
In this paper, a semi-supervised dimension reduction method named SDA is utilized for predicting the subcellular localization of Gram-negative bacterial proteins. The proposed method adopts SDA to extract more discriminative features from a combined vector consisting of pseudo amino acid composition and dipeptide composition. A jackknife test is performed on the protein data set, and the experiments illustrate that the proposed method is effective. Moreover, the SDA method can be combined with other protein sequence encoding and prediction algorithms to become a useful tool for dealing with complicated biological problems.