Discovery of Gene-disease Associations from Biomedical Texts

Owing to the rapid growth of biomedical publications, biologists must retrieve up-to-date information from a vast literature to avoid overlooking significant work. Automatic extraction of information from biomedical texts has therefore become increasingly important. This paper focuses on automatically identifying the relationships between human genetic diseases and genes in the biomedical literature. The experimental data are retrieved from the literature referenced by the Mendelian Inheritance in Man (MIM) entries of the morbid file in the Online Mendelian Inheritance in Man (OMIM) database. We propose a hybrid method that combines rule learning with statistical techniques. To collect the corpus, we first find the sentences that contain both a human genetic disease and a gene that are related according to the morbid file; these are regarded as the correct sentences. Second, sentences that do not contain a related disease-gene pair from the morbid file are randomly selected and regarded as the incorrect sentences. Next, the Memory-Based Shallow Parser is used to analyze these sentences and produce the information needed for rule learning. A rule learner, the ALEPH system, then generates a set of rules, which are applied to capture pairs of human genetic diseases and genes that occur within one sentence. After that, a statistical approach, called the Z-score method, determines whether the pairs are valid. Finally, experiments are conducted under several constraints and with different numbers of rules. The evaluation metrics are precision, recall, and F-score.


Introduction
The number of biomedical publications available on the World Wide Web is increasing rapidly. For example, more than 21 million biomedical articles are currently available in MEDLINE. At such a growth rate, biologists who have to read through all the retrieved texts manually to find the information they need will find it very difficult to keep up with new findings. As a consequence, considerable research interest focuses on developing methods for automatically processing the biological and medical scientific literature, a process often referred to as biomedical text mining. Predicting new gene-disease associations has long been a principal goal in computational biology [1]. The goal of this paper is to address this problem, i.e., to discover the candidate genes likely to be associated with specific diseases.
Recent studies have proposed several approaches to investigating the relationships between genes and diseases. Some use protein-protein interactions to predict gene-disease relationships [2,3,4]. Others compute similarity values between genes and diseases based on Gene Ontology (GO) [5] or Disease Ontology (DO) [6] terms [7,8,9]. Other controlled vocabularies such as MeSH [10] have been used for linking proteins to disease terminologies [11,12,13]. Further information such as gene expression [14,15,16], protein/genome sequences [17,18,19] and positional information [20] also serves as important evidence for relating genes to diseases. In addition, some researchers use text mining techniques to extract gene-disease associations from the biomedical literature [13,21,22], some apply machine learning methods [1,23,24], and network-based approaches to analyzing relationships between genes and diseases are proposed in [25,26,27]. These works demonstrate that associating genes with diseases is an active area of research, because it can lead to a better understanding of diseases and can reduce both the time and the cost of developing effective drugs and treatments.
The experimental data in this paper are obtained from Online Mendelian Inheritance in Man (OMIM) [28] and MEDLINE. OMIM is one of the most well-known databases that contain gene-disease annotations. The full-text overviews in OMIM contain information on all known Mendelian disorders and over 12,000 genes. It is a comprehensive knowledge base of human genes and genetic diseases. MEDLINE is currently one of the most comprehensive textual sources of biomedical information. It includes more than 19 million citations from more than 7,300 different publications dating from 1966, and it is updated weekly by the U.S. National Library of Medicine (NLM). Mining information from MEDLINE documents is therefore valuable for biologists.
We observed that some studies investigate the extraction of other biological relations of interest rather than focusing on gene-disease relationships. Most of them apply one of the following four kinds of approaches. First, manually generated template-based/rule-based approaches apply patterns written by domain experts to extract concepts connected by a specific relation from text [29,30,31]. Second, automatic template/rule-learning methods create similar templates/rules automatically by generalizing patterns/rules from the text surrounding concept pairs known to have the relationship of interest [32,33,34]. Third, statistical methods identify relationships by looking for concepts that co-occur more often than would be predicted by chance [35]. Finally, Natural Language Processing-based (NLP-based) methods parse the text to generate a structure from which relationships can be extracted [36,37,38]. Among these approaches, we are interested in combining rule learning and statistical methods to extract relationships between genes and diseases. Rules have several advantages: they are easy for humans to interpret, they represent knowledge modularly, and they can be applied using tractable inference procedures. With the rule learning technique, we can obtain the rules relevant to gene-disease associations automatically. Furthermore, this study introduces a statistical method to improve the results returned by the rule learning technique.
Figure 1 illustrates the overall architecture of our approach to identifying gene-disease relationships. First, we preprocess each document in the corpus. In this phase, we make use of the OMIM database, which is a popular resource in the biomedical domain. Next, we select a set of positive sentences and a set of negative sentences for rule learning. We refer to the morbid data from OMIM, which records some correct relationships between genes and diseases; the morbid data can therefore serve as a gold standard for evaluation. Then a memory-based shallow parser (MBSP, http://www.clips.ua.ac.be/pages/MBSP) is adopted to produce tagged sentences with the parsing information. After that, a rule learner, the ALEPH system, is used to learn the relationships between genes and diseases so that a set of rules is obtained. For the test documents, we first preprocess the full texts to identify the genes and diseases appearing in them. The rules learned in the previous steps are then used to generate the candidate gene-disease pairs. After that, we apply the statistical approach, the Z-score method, to filter out some candidate gene-disease pairs. Finally, the gene-disease associations in the corpus are extracted.

Training Data
As mentioned before, the morbid text records some correct relationships between genes and diseases. From the OMIM website, we refer to the phenotype and gene MIM (Mendelian Inheritance in Man) numbers in the morbid text, and then download 302 MEDLINE documents. These documents contain 2,532 sentences that include both disease and gene names. To avoid bias in the learning phase, we select equal numbers of positive and negative sentences. Hence 1,000 sentences with correct gene-disease associations are selected as the positive examples for learning, and 1,000 sentences with incorrect gene-disease pairs are chosen as the negative examples.

Testing Data
To perform an outside test, the testing data must differ from the training data. From the morbid text, we refer to 108 phenotype and gene MIM numbers and retrieve the 108 corresponding MEDLINE full texts. This number was chosen because these texts contain 919 sentences with correct gene-disease associations, which is close to the 1,000 positive sentences used for learning.

Methods
The proposed method is divided into six steps: (1) preprocessing the corpus, (2) selecting correct and incorrect sentences for learning, (3) parsing the sentences, (4) rule learning, (5) generating candidate gene-disease associations with the learned rules, and (6) filtering the candidate pairs with the statistical method.

Corpus Preprocessing
Our preprocessing procedure for each document consists of gene name tagging and disease name tagging. Since the OMIM database contains a large number of gene-disease annotations, we extract the gene name list and the disease name list from OMIM. The resulting lists contain 13,943 gene names and 5,592 disease names, respectively.
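The tagging step can be illustrated with a minimal dictionary-matching sketch. The paper does not state how names are matched, so the case-insensitive, longest-match-first strategy, the function name build_tagger and the toy name lists below are all assumptions made for illustration only.

```python
import re

def build_tagger(names, label):
    """Return a function that finds dictionary names in a sentence."""
    # Try longer names first so "breast cancer" wins over "cancer".
    ordered = sorted(set(names), key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # (?<!\w) and (?!\w) keep matches from starting or ending inside a word.
    regex = re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)", re.IGNORECASE)

    def tag(sentence):
        # Each mention is (matched text, label, character offset).
        return [(m.group(1), label, m.start()) for m in regex.finditer(sentence)]

    return tag

# Toy lists standing in for the OMIM gene and disease name lists (hypothetical).
tag_genes = build_tagger(["RAD54L", "HR54"], "GENE")
tag_diseases = build_tagger(["cancer", "breast cancer"], "DISEASE")

sentence = "HR54 and RAD54L may cause breast cancer."
gene_mentions = tag_genes(sentence)
disease_mentions = tag_diseases(sentence)
```

With lists as large as those extracted from OMIM, a single compiled alternation may become slow; a trie or hash-based lookup would be a natural replacement under the same interface.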

Correct and Incorrect Sentence Selection for Learning
With the morbid data, the relevant diseases and genes can be annotated in the corpus. As mentioned in Section 3.1, two sets of sentences are prepared for rule learning. The first set contains 1,000 positive sentences in which an associated gene and disease appear in the same sentence. The second set contains 1,000 negative sentences in which the gene-disease pairs appearing in the same sentence do not match the gene-disease pairs annotated in the morbid data.
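The selection of positive and negative sentences can be sketched as follows. This is only an illustration under stated assumptions: a sentence is treated as positive if at least one of its co-occurring gene-disease pairs is in the morbid data (the paper does not say how mixed sentences are handled), and the function names, the seed parameter and the input layout are hypothetical.

```python
import random

def split_sentences(tagged_sentences, morbid_pairs):
    """Split co-occurrence sentences into positive and negative examples.

    tagged_sentences: iterable of (text, genes, diseases) tuples.
    morbid_pairs: set of (gene, disease) pairs taken from the morbid data.
    """
    positives, negatives = [], []
    for text, genes, diseases in tagged_sentences:
        if not genes or not diseases:
            continue  # only sentences with both a gene and a disease qualify
        pairs = {(g, d) for g in genes for d in diseases}
        # Assumption: one matching pair is enough to call the sentence positive.
        if pairs & morbid_pairs:
            positives.append(text)
        else:
            negatives.append(text)
    return positives, negatives

def balance(positives, negatives, n, seed=0):
    """Draw n sentences from each class so the learner sees balanced data."""
    rng = random.Random(seed)
    return rng.sample(positives, n), rng.sample(negatives, n)

morbid = {("RAD54L", "breast cancer")}
data = [
    ("s1", ["RAD54L"], ["breast cancer"]),
    ("s2", ["HR54"], ["lymphoma"]),
    ("s3", ["RAD54L"], []),
]
pos, neg = split_sentences(data, morbid)
train_pos, train_neg = balance(pos, neg, 1)
```

In the setting described above, n would be 1,000 for each class.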

Parsing the Sentences
Before parsing, we clear the gene and disease tags added to the sentences in the previous stage, which yields 2,000 untagged sentences. We then use the memory-based shallow parser (MBSP) to parse them. Taking the sentence "HR54 and RAD54L may cause breast cancer." as an example, the parsing result is shown in Figure 2. The MBSP tags are explained in Table 1. In Table 1, MBSP determines that HR54 is a plural noun (NNS). Its chunk belongs to a noun phrase (NP). It is not part of a prepositional noun phrase (PNP). Its relation tag is NP-SUBJ-1, which means that it is the subject of the sentence. It is not an anchor. Finally, the word lemma hr54 is the canonical lexical form of HR54. In this study, we adopt the parsing information that includes the part-of-speech, the chunk and the SVO (Subject-Verb-Object) relation.
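To make the six tags concrete, the sketch below reads a tagged sentence into one record per word. The slash-separated serialization, the field order and the tag values for "may" and "cause" are assumptions for illustration; only the tags for HR54 come from the description above.

```python
from typing import NamedTuple

class Token(NamedTuple):
    word: str
    pos: str       # part-of-speech tag, e.g. NNS
    chunk: str     # chunk tag, e.g. I-NP
    pnp: str       # prepositional noun phrase tag, "O" if none
    relation: str  # relation tag, e.g. NP-SUBJ-1
    anchor: str    # anchor tag, "O" if none
    lemma: str     # canonical lexical form

def parse_tagged_sentence(line):
    """Split 'word/POS/chunk/PNP/relation/anchor/lemma' items into Tokens."""
    tokens = []
    for item in line.split():
        fields = item.split("/")
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}: {item!r}")
        tokens.append(Token(*fields))
    return tokens

def words_with_relation(tokens, prefix):
    """Return the words whose relation tag starts with the given prefix."""
    return [t.word for t in tokens if t.relation.startswith(prefix)]

# Hypothetical tagged fragment of the example sentence.
example = (
    "HR54/NNS/I-NP/O/NP-SUBJ-1/O/hr54 "
    "may/MD/I-VP/O/VP-1/O/may "
    "cause/VB/I-VP/O/VP-1/O/cause"
)
tokens = parse_tagged_sentence(example)
subjects = words_with_relation(tokens, "NP-SUBJ")
```

A word that itself contains a slash would break this simple split, so a real reader would need the parser's own escaping convention.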

Rule Learning
Inductive logic programming (ILP) [39] is the machine learning framework for our rule learning algorithm. In the ILP framework, a hypothesis H (i.e., a set of rules) is induced from examples E and background knowledge B such that B and H together imply E (B ∧ H ⊨ E). Given B and E, we use an ILP system, the ALEPH system, to learn H, so that some candidate rules are generated. After the learning step, ALEPH outputs information about the learning, i.e., the rule forms and the numbers of positive and negative sentences covered by each rule.
To calculate the score of each rule, we propose Equation (1), where S is the score, and Pos and Neg are the numbers of positive and negative sentences covered by the rule, respectively. The denominator is used for normalization. Figure 3 shows an example of the learning result. The first line indicates that the rule is for the verb "cause." The rule covers 51 positive sentences and 31 negative sentences in the training data. The second line computes the value of the denominator in Equation (1). The third line shows the score S. The following lines list the IDs of the 51 positive sentences: s1, s6, …, and s892.

Rule-based Approach
In this phase, the testing data are first preprocessed as described in Section 4.1. To find the sentences with correct gene-disease pairs, we filter out the sentences with negative meanings. A sentence that includes negative words (e.g., not, neither, never) is considered negative. Furthermore, anaphora resolution is performed with a simple heuristic: the preceding gene or disease name is taken as the antecedent. We then apply the previously learned rules to retrieve gene-disease pairs. Figure 4 shows a sentence from which a gene-disease pair can be retrieved by the rule "cause" in Figure 3.
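The negation filter and the application of a verb rule can be sketched as follows. This is not the learned ALEPH rule form, which is not reproduced here: the rule is reduced to a set of verb lemmas, the negation list contains only the three example words given above, anaphora resolution is omitted, and all names are hypothetical.

```python
import re

# Only the example negative words from the text; a fuller list is assumed to exist.
NEGATION_WORDS = {"not", "neither", "never"}

def is_negative(sentence):
    """Return True if the sentence contains one of the negative words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return any(word in NEGATION_WORDS for word in words)

def apply_verb_rule(sentence, verb_lemmas, genes, diseases, rule_verbs):
    """Propose every gene-disease pair of a sentence that fires a verb rule.

    verb_lemmas: lemmas of the verbs in the sentence (from the parser).
    rule_verbs: set of verb lemmas for which a rule was learned.
    """
    if is_negative(sentence):
        return []  # sentences with negative meaning are filtered out
    if not rule_verbs & set(verb_lemmas):
        return []  # no learned rule applies to this sentence
    return [(g, d) for g in genes for d in diseases]

candidates = apply_verb_rule(
    "HR54 and RAD54L may cause breast cancer.",
    ["may", "cause"], ["HR54", "RAD54L"], ["breast cancer"], {"cause"},
)
rejected = apply_verb_rule(
    "RAD54L does not cause lymphoma.",
    ["do", "cause"], ["RAD54L"], ["lymphoma"], {"cause"},
)
```

Contracted forms such as "doesn't" are not caught by this word list and would need separate handling.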

Statistical Approach
Following the statistical method proposed by Al-Mubaid and Singh [40], we try to measure the relatedness between genes and diseases. First, we extract the disease names appearing in the gene-disease pairs produced by the above rule-based approach. Then we retrieve 20 MEDLINE abstracts that mention the extracted diseases in OMIM and call them the interest set S_1. Next, we randomly select 182 MEDLINE abstracts that do not contain the extracted diseases, and call them the control set S_2. Furthermore, a gene list (G_1, G_2, …, G_7939) extracted from the morbid file is used to count the numbers of gene mentions in the interest set and the control set, respectively. With these data, the expectation (ex) and evidence (ev) values are calculated for each gene with Equations (2) and (3), where G_i is the ith gene in the gene list, tf means term frequency, tf_t(G_i) is the total number of mentions of G_i in the interest set (S_1) and the control set (S_2), and tf_1(G_i) is the number of mentions of G_i in S_1. The expectation measures how often G_i is normally expected to appear in S_1, while the evidence measures how often G_i actually appears in S_1. The larger the difference between ex and ev, the more likely it is that G_i and the disease have a significant association. Thus, we compute the normalized difference f as shown in Equation (4).
After computing f for each gene, we sort the genes according to their f values and use the Z-score metric to determine the significant f values:

$$Z(G_i) = \frac{f(G_i) - \mathrm{mean}(f)}{\mathrm{SD}(f)} \quad (5)$$

where mean(f) is the average of the f values of all genes and SD(f) is the standard deviation of the f values. The Z-score measures how many standard deviations the f value of a gene lies above the mean f value over all genes, and thus indicates statistical significance. In our experiments, we consider a gene to have a significant association with the disease if its Z-score is 1.0 or more.
Computer Science and Information Technology 4(1): 1-8, 2016
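The Z-score filter can be sketched as below. The sketch takes the f values as given, since the formulas for ex, ev and f are not reproduced here, and it assumes the population standard deviation because the text does not say which variant is used. The function names and the toy f values are hypothetical.

```python
from statistics import mean, pstdev

def z_scores(f_values):
    """Map each gene to (f - mean(f)) / SD(f) over all genes."""
    values = list(f_values.values())
    mu = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        # All f values are identical, so no gene stands out.
        return {gene: 0.0 for gene in f_values}
    return {gene: (f - mu) / sd for gene, f in f_values.items()}

def significant_genes(f_values, threshold=1.0):
    """Return genes whose Z-score reaches the threshold, highest first."""
    scores = z_scores(f_values)
    selected = [gene for gene, z in scores.items() if z >= threshold]
    return sorted(selected, key=lambda gene: scores[gene], reverse=True)

# Toy f values for four genes of one disease.
toy_f = {"G1": 0.0, "G2": 0.0, "G3": 0.0, "G4": 4.0}
selected = significant_genes(toy_f)
```

In this toy example only G4 lies at least one standard deviation above the mean, so it is the only gene kept as significantly associated with the disease.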

Evaluation Metrics
We use the standard precision, recall and F-score evaluation measures, which are defined as follows.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (6)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (7)$$

$$\text{F-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (8)$$

where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. In this experiment, true positives are the sentences with correct gene-disease pairs proposed by our system. False positives are the sentences with incorrect gene-disease pairs proposed by our system. False negatives are the sentences with gene-disease pairs that are in the answer key but are not proposed by our system. The F-score summarizes the performance by combining precision and recall.
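The three measures can be computed with a few lines of code. The sketch below uses the standard definitions with the F-score taken as the harmonic mean of precision and recall; the function names and the zero-division convention are assumptions.

```python
def precision(tp, fp):
    """Fraction of proposed items that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of answer-key items that the system proposes."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Consistency check with the best reported precision and recall (in percent).
best_f = f_score(73.45, 65.37)
```

Applied to the best reported precision of 73.45% and recall of 65.37%, this gives about 69.17%, which agrees with the reported F-score.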

Results and Discussion
The experimental results are listed in Table 2. Column "ID" gives the experiment number. Column "Experiment" describes the setting of each experiment. Column "Positive" is the number of correct gene-disease pairs the system proposes. Column "Negative" is the number of incorrect gene-disease pairs the system proposes. Columns "Precision," "Recall" and "F-score" are the values computed with Equations (6), (7) and (8), respectively.
In Table 2, experiments 1-4 use the rule-based approach only, and the labels 6, 9, 18 and "all" give the numbers of applied rules. Figure 5 shows the nine rules used in experiments 2 and 6. Experiments 5-8 employ both the rule-based and the statistical (i.e., Z-score) methods. Experiment 9 uses only the Z-score method (with the threshold > 1) to filter the candidate gene-disease pairs.
Comparing experiment 1 to 5, 2 to 6, 3 to 7, and 4 to 8 shows that adding the Z-score method increases the precision and the F-score, while the recall decreases slightly. This demonstrates that the Z-score can filter out some incorrect gene-disease pairs that are mentioned in the same sentence. Comparing experiments 5-8 to experiment 9, the precision, recall and F-score are higher when rules are involved, which shows that the learned rules are helpful for identifying gene-disease associations. In short, both the rule-based and the statistical approaches are useful, since experiments 7 and 8, which combine them, achieve the best performance.

Conclusions
This paper aims to automatically identify the associations between human genetic diseases and genes in the biomedical literature. We first use the MBSP system to obtain the parsing information. Then we apply a rule learning algorithm to obtain the learned rules. After that, the Z-score method is used to filter out the incorrect gene-disease pairs. The best results are 73.45% precision, 65.37% recall and 69.17% F-score, which are achieved by combining the learned rules and the Z-score method. This shows that the proposed methods are effective in identifying gene-disease associations. In the future, we plan to apply more elaborate anaphora resolution to improve the recall. Enlarging the set of training sentences is another feasible direction.