Cross-species Comparison of CpG Density in the Promoter Regions of Protein Kinase Oncogenes

In this report, we investigated the CpG density in the promoter regions (PRs) and downstream flanking regions (DFRs) of 61 human protein kinase oncogenes (PKOs) and of the same genes in three other species: chimpanzee, mouse and horse. The number of CpGs in the PRs of human PKOs was much higher than in those of chimpanzee, mouse and horse, suggesting that the changes in CpG density among the four species are associated with species evolution. Human PKOs with a relatively high number of CpGs in the PRs showed stronger gene expression than the mouse PKOs in tumour tissues, but not in normal tissues. Furthermore, human PKOs with an extremely high density of CpGs in the PRs exhibited much lower expression in tumour tissues than in normal tissues. Our data provide an initial indication that the occurrence and density of CpGs in the PRs of PKOs play an important role in regulating gene expression associated with tumorigenesis. A better understanding of the density and spatial arrangement of CpGs in the PRs of PKOs and other oncogenes involved in tumorigenesis is therefore important for developing preventive and therapeutic strategies for human cancer.


Introduction
A high CpG density in the gene promoter region (PR) is believed to be associated with high gene expression [1][2][3]. The PR with high CpG density stretches from 0.5 to 2 kilobase pairs (kbp) upstream of the gene transcription start site (TSS) [4][5][6]. When a section of this region longer than 200 bp has a G+C content of 70%, the section is often referred to as a CpG island (CGI) [7]. CGIs are found in two-thirds of housekeeping genes and a quarter of tissue-specific genes [8].
The presence of CpG dinucleotides in CGIs is significant because the cytosines in a CGI are not often methylated. The high CpG density makes the whole CGI serve as a 'Passover' sign during the de novo methylation of the blastocyst genome by the DNA cytosine methyltransferases Dnmt3a and Dnmt3b [9]. Thus, any mutation that adds cytosines to the CGI is likely to increase the density of CpGs in the gene PRs. While dinucleotide mismatches in young cells are repaired quickly, ageing cells seldom or never correct their genomic errors. This is because the majority of the genes involved in DNA repair, as well as the methyltransferase Dnmt1, are rarely transcribed in senescent cells [10][11][12]. Consequently, these genomic changes are passed down from generation to generation, which contributes to a high CpG density in the gene PRs. The high CpG density causes high gene expression reminiscent of cancer genes. It has been shown that CpG island methylation in the PRs is likely to be an important mechanism in modulating gene expression in several cancers [13,14]. In addition, an apparent drop in the number of CpGs in the PR causes inactivation of tumour suppressor genes, aberrations in cellular pathways such as DNA repair, cell cycle, apoptosis and cell division, and imprinted genetic disorders [15][16][17][18].
Oncogenesis is related to various processes, including the accumulation of deoxyribonucleic acid (DNA) nucleotide changes in the genome of precancerous cells. Mechanisms by which nucleotide (genetic/epigenetic) changes cause cancer have been proposed for proto-oncogenes, tumour suppressor genes and stability genes. However, the variation of CpG dinucleotides in the PRs of human protein kinase genes has not yet been investigated in depth. The aim of this study is to investigate whether the occurrence and density of CpGs in the PRs of protein kinase oncogenes (PKOs) contribute to human oncogenesis. We quantified the number of CpG dinucleotides in the 1K base pairs upstream of the TSS and in the downstream flanking region of 61 PKOs in four mammalian species. The information acquired from this study has the potential to improve the understanding of oncogenesis in humans.

Screening the Protein Kinase Oncogenes from the Human Kinome
The protein kinase oncogenes (PKOs) used in this study were screened from the 518 protein kinase (PK) genes of the human kinome published by Manning and colleagues [19]. Each of the 518 PK genes was searched for in the Online Mendelian Inheritance in Man (OMIM) database at the National Center for Biotechnology Information (NCBI) website (http://www.ncbi.nlm.nih.gov/). The information retrieved from the database included a full description of the gene, its functions (oncogenic or not) and peer-reviewed journal articles confirming the pro-oncogenic potential of the gene. Based on this information, each gene was either included in or excluded from the list of possible PKOs to be investigated. The same process was repeated for all 518 PK genes. The resulting list of thoroughly screened human PKOs acted as a template for downloading the DNA sequences of the orthologous PKOs from mouse, horse and chimpanzee.

Obtaining DNA Sequences from the Ensembl Database
The DNA sequences of human PKOs were retrieved from Ensembl release 54 (GRCh37, http://www.ensembl.org/Homo_sapiens/Info/Index), where 61 human PKOs were selected for intensive analysis and cross-species comparison.
The retrieved data contained gene markers and the real gene links from different assemblies, which were used to obtain accurate gene sequences. To determine the accurate position of the TSS, each downloaded DNA sequence was carefully compared with the alignment of the same gene in the Evidence Viewer (EV) and Sequence Viewer (SV) of the NCBI website. The downstream section of the PR (10-25 bases immediately upstream of the TSS) was aligned back to the corresponding section of the same gene shown in the EV and SV. A complete alignment confirmed the correctness of the downloaded gene-including DNA sequence.
Each downloaded gene-including DNA sequence comprises 5K bases (bs) upstream of the transcription start site (TSS), the full length of the gene (exons and introns) and 1K bs downstream of the 3' end of the gene (stop codon). If the inter-gene distance (the distance between the TSS of the gene being downloaded and the 3' end of the upstream gene) is less than 5K bs, the downloaded DNA sequence instead includes the full sequence within the inter-gene distance, the full length of the gene and 1K bs downstream of the 3' end of the gene. In each gene-including DNA sequence, the sequence upstream of the TSS is termed the putative PR and the sequence downstream of the 3' end of the gene is termed the downstream flanking region (DFR) [20,21]. The 1K bs sequences at either end of the gene were selected for counting the density of CpGs. The following key elements of each gene sequence were fed into a Matlab program for analysis: (1) 15 bs from the start codon (ATG) and the stop codon; (2) 1K bs from either flanking region of the transcript; (3) the gene description; (4) the location of the gene on either the plus (+1) or minus (-1) strand; (5) the number of exons; (6) the location of the gene on the chromosome; and (7) the coordinates of each exon and intron. The same process was repeated for the 61 PKOs in human, mouse, horse and chimpanzee. The results were stored in both Matlab and Excel tab-delimited formats for convenient access.
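The flank selection and CpG counting described above can be illustrated with a minimal Python sketch. This is not the Matlab program used in the study; the function names (`reverse_complement`, `count_cpg`, `flanking_regions`), the 1-based inclusive coordinate convention and the use of the gene boundaries as the TSS and 3' end are assumptions made only for illustration.

```python
# Hypothetical helpers, not the authors' Matlab code.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_cpg(seq: str) -> int:
    """Count CpG dinucleotides, i.e. a C immediately followed by a G (5'->3')."""
    return seq.upper().count("CG")


def flanking_regions(chrom: str, gene_start: int, gene_end: int,
                     strand: int, flank: int = 1000):
    """Return (PR, DFR) in the orientation of the gene.

    gene_start and gene_end are assumed to be 1-based, inclusive coordinates
    on the plus strand. For a gene on the minus strand (-1) the upstream
    region lies to the right of the gene in plus-strand coordinates, so both
    flanks are swapped and reverse-complemented.
    """
    left = chrom[max(0, gene_start - 1 - flank):gene_start - 1]
    right = chrom[gene_end:gene_end + flank]
    if strand == 1:
        return left, right
    return reverse_complement(right), reverse_complement(left)


if __name__ == "__main__":
    # Toy chromosome: 5-base flank, 9-base "gene", 5-base flank.
    toy = "ACGCG" + "ATGAAATAA" + "TTCGA"
    pr, dfr = flanking_regions(toy, gene_start=6, gene_end=14, strand=1, flank=5)
    print(count_cpg(pr), count_cpg(dfr))
```

With `flank=1000` the two returned sequences correspond to the 1K bs PR and DFR windows whose CpG counts are compared throughout the report.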

Downloading Gene Expression Data
To investigate whether the CpG density in the 1K bp upstream of the TSS is related to a high expression level, twenty (20) PKOs with promoter region (PR) CpG counts ranging from 1 to 138 were selected. The expression data of these PKOs were downloaded from the GeneHub-GEPIS website. Gene expression data were retrieved for 34 normal and tumour tissues/organs from human and mouse. The gene expression data of the PKOs from both normal and tumour tissues were derived from human and mouse expressed sequence tags (ESTs); data for the other two species are currently unavailable. These data were used to examine the quantitative relationship between the expression levels of the PKOs and the CpG density in their PRs.

How Gene Expression Data Are Determined From GeneHub-GEPIS Database
The algorithms used to calculate the gene expression profiles were previously described by Zhang and colleagues [22]. These expression data were integrated into the human and mouse gene databases by GeneHub, a component of the GeneHub-GEPIS database. The merged databases (GEPIS and GeneHub) incorporate gene specificity and biological characteristics relevant to the expression data, which made these data suitable for this research [22].

CpG Density at the Promoter Region (PR) of Protein Kinase Oncogenes
The study screened 61 protein kinase oncogenes from the human kinome, which were used as templates to download the homologous genomic sequences from chimpanzee, mouse and horse. The DNA sequences of these species were fed into a self-written program for analysis. A CpG island is a DNA sequence equal to or longer than the starting window (SW), in which the observed/expected (O/E) value of CpGs is equal to or higher than 0.6 and G%+C% is greater than 50 [23]. The method used for CpG island identification followed previous reports [18,20] with some modifications. A SW of 200 bs was applied from the 5' end of the DNA sequence, and the G%+C% and the O/E value of the CpGs were calculated within the SW while the SW moved step by step towards the TSS. To avoid purely mathematical CpG islands, 7 CpGs/200 bs was selected as a reasonable cutoff for a CpG island identified in the SW [23]. Two individual CpG islands were connected if they were separated by less than 50 bs. The CpG island information, which included the number of CpGs in each CpG island, the density of the CpG island (number of CpGs/100 bs), the position of each CpG within the CpG island, and the starting and ending coordinates of the CpG island along the gene sequence, was collected for statistical analysis using two-way ANOVA.
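The sliding-window identification described above can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the self-written program used in the study: the O/E value is assumed to follow the commonly used formula (number of CpGs x window length) / (number of C x number of G), the window is assumed to advance one base per step, and all function and parameter names are hypothetical.

```python
def window_stats(win: str):
    """Return (CpG count, G+C percentage, observed/expected CpG ratio) for a window."""
    n = len(win)
    c, g = win.count("C"), win.count("G")
    cpg = win.count("CG")
    gc_percent = 100.0 * (c + g) / n
    # Assumed O/E formula; 0.0 when the window contains no C or no G.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return cpg, gc_percent, obs_exp


def find_cpg_islands(seq: str, window: int = 200, step: int = 1,
                     min_gc: float = 50.0, min_oe: float = 0.6,
                     min_cpg: int = 7, max_gap: int = 50):
    """Scan seq 5'->3' and return CpG islands as a list of dicts.

    A window qualifies when it holds at least min_cpg CpGs, G%+C% > min_gc
    and O/E >= min_oe. Qualifying windows that overlap, or that are separated
    by fewer than max_gap bases, are merged into one island. Coordinates are
    0-based, half-open.
    """
    seq = seq.upper()
    spans = []
    for start in range(0, len(seq) - window + 1, step):
        cpg, gc, oe = window_stats(seq[start:start + window])
        if cpg >= min_cpg and gc > min_gc and oe >= min_oe:
            end = start + window
            if spans and start - spans[-1][1] < max_gap:
                spans[-1][1] = end          # extend the current island
            else:
                spans.append([start, end])  # open a new island
    islands = []
    for start, end in spans:
        n_cpg = seq[start:end].count("CG")
        islands.append({
            "start": start,
            "end": end,
            "n_cpg": n_cpg,
            "cpg_per_100": 100.0 * n_cpg / (end - start),
        })
    return islands


if __name__ == "__main__":
    # Toy sequence: an AT-rich background around a 200-base CG-rich core.
    toy = "AT" * 150 + "CG" * 100 + "AT" * 150
    for island in find_cpg_islands(toy):
        print(island)
```

Because an island is taken here as the union of qualifying windows, its reported boundaries can extend into the flanking sequence beyond the CpG-rich core; a stricter implementation might trim the ends afterwards.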
Human
Figure 1 shows the CpG density and distribution in both the PRs and DFRs of the 61 human PKOs. On average, the 61 human PKOs had 51.5 CpGs in their PRs and 15.8 in their DFRs. Forty-six PKOs (75.4%) showed a very high CpG density, with over 20 CpGs identified in their PRs, while fifteen PKOs (24.6%) had a low CpG density, with fewer than 20 CpGs (Figure 1). Five PKOs, HIPK2, IGF1R, PKCE, CAMK1B and PIM1, showed the highest density of CpGs in their PRs, with 139, 134, 119, 116 and 103 CpGs respectively, while 40, 13, 20, 18 and 23 CpGs were identified in their DFRs. Three PKOs had a very low density of CpGs in their PRs, with MER containing only 1 and FRK and TRKA having 10 each, while 10, 5 and 8 CpGs respectively were quantified in their DFRs (Fig 1).

Figure 1. Human PK oncogene CpG density at the promoter regions (PRs) and downstream flanking regions (DFRs). The CpG density clusters (see closed brace on the right hand side) are labelled as high and low CpG density (CGD). Low CGD means fewer than 20 dinucleotides in the 1 kbp upstream of the TSS; high CGD means more than 20 dinucleotides in the 1 kbp upstream of the TSS. Each of the dark and blue vertical spikes represents a pair of CpG dinucleotides, and each horizontal line represents the 1 kb length of the PR or DFR. PK genes are plotted in descending order of the CpG density in their PRs. The number in brackets next to the low or high CGD label is the number of genes in that cluster. The result shows that 75% of the human PKOs have high CGD and 25% have low CGD in their PRs.
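The high/low CGD clustering and the reported percentages can be checked with a short Python sketch. The PR counts below are the eight values quoted in the text for human PKOs; the function names are hypothetical, and treating a count of exactly 20 as low is an assumption, since the text only defines the clusters as more than and fewer than 20.

```python
def classify_cgd(n_cpg: int, threshold: int = 20) -> str:
    """Label a promoter as 'high' or 'low' CpG density (CGD).

    Assumption: a count equal to the threshold is treated as 'low'.
    """
    return "high" if n_cpg > threshold else "low"


def percent(part: int, total: int) -> float:
    """Percentage of part in total, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


# CpG counts in the 1 kbp PR of the human PKOs named in the text.
pr_counts = {
    "HIPK2": 139, "IGF1R": 134, "PKCE": 119, "CAMK1B": 116, "PIM1": 103,
    "MER": 1, "FRK": 10, "TRKA": 10,
}
clusters = {gene: classify_cgd(n) for gene, n in pr_counts.items()}

if __name__ == "__main__":
    for gene, label in clusters.items():
        print(f"{gene}: {pr_counts[gene]} CpGs -> {label} CGD")
    # 46 of 61 PKOs are high CGD and 15 of 61 are low CGD.
    print(percent(46, 61), percent(15, 61))
```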

Horse
The results clearly suggest that the human PKOs have the highest density of CpGs in their PRs, followed by the chimpanzee and mouse PKOs, with the horse PKOs the lowest. In addition, the CpG density in the PRs is not associated with that in the DFRs.

Comparison of the CpG Distribution in Selected PKOs
We compared the CpG distribution in the PRs and DFRs of selected PKOs in the four species in order to look for clues to their evolution [19,24]. Five PKOs were selected in which substantial cross-species changes seemed to have taken place in the PRs and DFRs. For instance, the human AKT1 oncogene showed a higher number of CpGs in the PR than in any of the other three mammalian species (Fig 5, top group). Similarly, the CpG density in the PRs of FAK1 and HIPK2 has remained high in human and mouse but become low in chimpanzee and horse (Fig 5, 2nd and 3rd groups from top). However, the TRKA oncogene appeared to have lost the majority of its CpGs in the PRs of human and mouse, while the density remained high in chimpanzee and horse (Fig 5, 2nd group from bottom). LKB1 differed from the previous observations: the chimpanzee LKB1 has a similarly high density of CpGs in both the PR and the DFR, a pattern clearly different from that of the human, mouse and horse genes, which are comparatively low in both the PRs and DFRs (Fig 5, bottom group).

Figure 5. Comparison of CpG density in the PR and DFR of 5 selected PKOs among four species. The labels on the right hand side show clusters of PKOs that have decreased, increased or shifted CpG density at the PRs. The human AKT1 shows a higher number of CpGs in the PR than in any of the other three mammalian species (top group). The CpG density in the PRs of FAK1 and HIPK2 has remained high in human and mouse but become low in chimpanzee and horse (2nd and 3rd groups from top). TRKA appears to have lost the majority of its CpGs in the PRs of human and mouse, while the density remained high in chimpanzee and horse (2nd group from bottom). LKB1 differed from the previous observations: the chimpanzee LKB1 has a similarly high density of CpGs in both the PR and the DFR, a pattern clearly different from that of the human, mouse and horse genes, which are comparatively low in both the PRs and DFRs (bottom group).

The CpG-rich PKO Promoters and Their Expression Level in Tumour and Normal Tissues
We next determined whether and how the CpG density in the PRs of the PKOs is associated with gene expression in tumour and normal tissues in mouse and human only, because no data are available for chimpanzee and horse. In human tumour tissues, the PKOs having more than 57 CpGs in their PRs did not show high levels of gene expression, suggesting that a very high density of CpGs in the PR does not favour PKO expression in tumour tissues (Fig 6a). Four peaks of gene expression, for PKOs having 12, 25, 36 and 46 CpGs, were observed when the CpG density ranged from 10 to 46 (Fig 6a). In human normal tissues, peaks of gene expression were observed for the PKOs having 10, 15, 25, 35, 46, 57, 68 and 94 CpGs in their PRs (Fig 6b). The three PKOs containing the higher numbers of 57, 68 and 94 CpGs had much higher gene expression, and the PKO containing 57 CpGs in particular showed extremely high gene expression (Fig 6b). In contrast, these PKOs showed very low expression in tumour tissues (Fig 6a). Generally, the PKOs that are highly expressed in tumour tissues tend to have lower expression in normal tissues, in association with the CpG density in their PRs (Fig 6a, b).
PKOs with 20-76 CpGs in their PRs have a much higher expression level in human than in mouse in tumour tissues (Fig 6a). In normal tissues, however, the PKOs of the two species have a similar expression pattern and level, except for two genes (Fig 6b). The human PKO containing 57 CpGs showed extremely high expression in normal tissues (Fig 6b), at least five times higher than that of the mouse PKO containing 57 CpGs. In contrast, a human PKO containing 25 CpGs had only one third of the expression level of the mouse PKO containing 25 CpGs (Fig 6b). On average, the PKOs with a high CpG density in the PRs have higher expression in human than in mouse in both tumour and normal tissues (see Table 1), indicating that the association between CpG density in the PRs and PKO gene expression is species-related.

Discussion
Previous studies have reported that unmethylated CpGs in the PR serve as a genomic footprint of replication origins [25,26]. In this report, we used 61 human PKOs as templates to compare the number of CpGs in the PRs of the same genes in the chimpanzee, horse and mouse genomes. The human PKOs had a significantly higher density of CpGs in their PRs than the chimpanzee, mouse and horse PKOs. For instance, human focal adhesion kinase (FAK2) has more than 50 CpGs in its PR while the horse FAK2 has only 11 (see Appendix 1). However, not all of the examined human PKOs have a higher density of CpGs in their PRs than the other three species. A typical example is human TRKA, whose PR has a substantially decreased CpG density with only 11 CpGs detected, compared with the horse TRKA gene, which has 41 CpGs in its PR. The significant difference in the CpG density of the PKO PRs among the four species is probably due to evolutionary impacts and changes. It is well documented that the evolutionary tree of the human genome is distinct from that of other mammalian species [27]. It has been reported that evolutionary impacts and changes often target the gene promoter region, leading to a diverse arrangement of the nucleotides in this region [5,28]. At the molecular level, genetic transposition and recombination during gene replication may play an important role in generating the high density of CpGs in both the PRs and DFRs of human PKOs [29,30]. Genetic transposition has also contributed to the very high density of CpGs in the DFR of the chimpanzee LKB1 gene [31,32]. Thus, it is likely that genetic transposition plays a key role in the finding of the present study that the chimpanzee LKB1 gene has 63 CpGs in the PR and 133 CpGs in the DFR, while the human LKB1 gene has only 47 and 11 in these two regions.
In the present study, we first observed that human PKOs with a relatively high CpG density in the PRs are highly expressed in tumour tissues, consistent with previous studies reporting that a high CpG density in the gene PR is associated with high gene expression [1][2][3]. This is probably due to the accumulation of methylated cytosines in the CpG-rich promoter regions, which potentially increases the turnover rates of proto-oncogenes [33]. However, we further observed that human PKOs with a very high CpG density in the PRs exhibit very high expression in normal tissues. Thus, it appears that CpG occurrence and density in the PKO PRs may play a differential role in regulating gene expression in human tumour and normal tissues. It has been reported that the occurrence and density of CpGs in the PRs of the estrogen receptor genes and the insulin-like growth factor 2 gene vary in an age-dependent manner in human tumours [34][35][36][37]. The variability of CpG density in the PR contributes in part to the high cancer risk associated with age over 70 in humans [38]. Thus, the occurrence and density of CpGs in the PRs of PKOs and other oncogenes may potentially serve as an indicator of oncogene expression associated with the development of sporadic cancers in humans.

Conclusions
The occurrence and density of CpGs in the PRs of the 61 PKOs differ from one gene to another within individual species. On average, human PKOs have a much higher density of CpGs in the PRs than those of the other three species (chimpanzee, horse and mouse). The human PKOs with a relatively high density of CpGs (between 12 and 46) in the PRs are more highly expressed in tumour tissues than in normal tissues. In contrast, the human PKOs with an extremely high density of CpGs (>57) in the PRs are more highly expressed in normal tissues than in tumour tissues. The mechanism linking CpG density to the gene expression of the PKOs in humans remains to be investigated.