NewsSE: An Ontology-based Search Engine for News



Introduction
The creation of the World Wide Web (WWW) was a major step toward information sharing. It led to the development of techniques that make information sharing easier and more powerful, and Information Retrieval (IR) is one of the main tools for reaching this goal. In recent years, as the volume of information on the Web has increased rapidly, the use of search engines for exploring the Internet has become more urgent. This information explosion makes finding correct and relevant content on the Web more complex. These shortcomings show the need for new approaches and technologies that improve the performance of searching for information on the Web. One of these technologies is the Semantic Web. A new generation of search engines uses Semantic Web tools such as RDF, ontologies, annotation and inference engines in order to present the most relevant results to users. The goal of using these techniques is to bring the generated results closer to the user's queries.
The data on the Web can be divided into different categories, one of which is news. Nowadays, the Internet is the largest repository of news. This huge amount of news indicates the necessity of powerful Information Retrieval mechanisms for searching it precisely.
Many search engines still use traditional methods for searching information on the Web. These traditional (keyword-based) search engines use different methods and techniques, but all of them work on the basis of the general IR model. One of the most popular search engines, widely regarded as the best, is Google [1]. The company achieved better results for many searches with an innovation called PageRank. This iterative algorithm ranks Web pages based on the number and PageRank of the other Web sites and pages that link to them, on the premise that good or desirable pages are linked to more often than others. Google also maintained a simple and user-friendly interface to its search engine, whereas many of its competitors embedded a search engine in a Web portal.
News browsing and searching is one of the most important Internet activities. The huge amount of news available online reflects the users' need for a plurality of information and opinions. News search engines are a direct link to a fresh and unfiltered stream of information. There are many commercial news search engines. Google News retrieves news from more than 4,000 sources, organizes it into categories and automatically builds a Web page with the most important news for each category. Yahoo News runs an analogous service on more than 5,000 sources. Microsoft recently announced NewsBot, a news engine that provides personalized news according to profiles built for each user [2].
Given the importance of news search engines, they have been the subject of numerous studies. NewsInEssence [3] is a system for finding and summarizing clusters of related news articles. QCS [4] is a software tool for streamlined IR from generic document sets. The work in [5] proposes a tool for automatically extracting news from Web sites. NewsJunkie [6] is a system that personalizes news for users by identifying the novelty of stories in the context of stories that users have already reviewed. SmartWeb [7] is another effort, which develops a number of domain-specific ontologies that are relevant for mobile and intelligent user interfaces to open-domain question-answering and information services on the Web.
According to previous research, none of the news search engines considers the relationship between events and news. The objective of this paper is to achieve high retrieval performance by using two kinds of ontologies: a Domain Ontology and an Event Ontology. This study presents a complete Ontology-based framework for the extraction and retrieval of semantic information in the limited domain of news by considering events. We applied the framework to the football domain and observed improvements over traditional keyword-based approaches. The framework uses the Event Ontology and the Domain Ontology as its two main building blocks in order to increase the coverage of the results. The outline of this paper is as follows. In the next section, the NewsSE framework is presented. Experimental results are discussed in Section III. Finally, conclusions are drawn in Section IV.

Proposed Method (NewsSE)
Although search engines are based on various techniques, almost none of them takes into account the structure of the information being searched. Moreover, the majority of search engines that operate on news are not customized for searching news. Therefore they suffer from low precision and performance.
In this study, we introduce a new framework designed specifically for searching news. The framework consists of several processes, all of which use semantic techniques. We focus on the events related to news and extract the relations between them in order to increase performance. The processes are described in Table 1.
In the following sections we explain each of these processes. Figure 1 shows the NewsSE framework and its processes. By using Semantic Web methods, we aim to replace attention to the surface form of queries with attention to their meaning.

Ontology
An Ontology holds all concepts and the relations between them. Ontologies play a central role in Semantic Web applications by providing shared knowledge about objects in the real world, which promotes reusability and interoperability among different modules. Therefore, the quality of the Ontology should be the first concern in any semantic application.
The ontologies created and used in this study have been evaluated based on human judgment. If we can demonstrate the positive effects of using these two limited ontologies in this research, the results may generalize to other ontologies and domains.
Since covering all types of news is very difficult, this study focuses on a limited scope, and we chose the football domain for this reason. Therefore, in the rest of this paper all news examples are related to football.
NewsSE uses two different ontologies. The first is called the Domain Ontology and the second is called the Event Ontology. In the following parts, we explain each of these ontologies.

Domain Ontology
This Ontology contains all concepts, relations and instances (individuals) related to a specific domain (here, the football domain). We use Protégé [8] to create this Ontology. The indexing and query enrichment processes both use this Ontology. The rules in the inference engine are run over this Ontology in order to enrich the query and to support the indexing process.

Event Ontology
One of the major goals of this research is to focus on events and the relations between them. We therefore designed an Ontology for extracting these events, called the Event Ontology. Figure 4 shows the body of this Ontology. Note that in this Ontology we focus on only five important events in the football domain.
Different relationships are available between the classes in the Event Ontology. Figure 5 shows the Event Ontology and its relationship with the Domain Ontology. The Event Ontology is used in the same way as the Domain Ontology.


NewsCr
The main goal of a crawler is to retrieve news from the Web and pass it to the Indexer module. A crawler is a Web application that works like a robot. Each crawler holds a list of addresses (URLs) that point to news resources. The crawler visits the Web at certain intervals, fetches fresh news in different formats and stores it in a temporary repository. Our crawler can fetch news in XML, RSS [9] and HTML formats.
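The fetch-and-store step can be illustrated for the RSS case with a minimal sketch. It only uses the Python standard library; the sample feed, the function name parse_rss_items and the name temporary_repository are assumptions made for illustration and are not part of NewsCr. A real crawler would download the feed from a resource URL instead of using an inline string.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 payload; a real crawler would download it from a news resource.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Sports Feed</title>
    <item>
      <title>Team A wins the derby</title>
      <link>http://example.com/news/1</link>
      <pubDate>Mon, 06 Sep 2010 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Coach resigns after defeat</title>
      <link>http://example.com/news/2</link>
      <pubDate>Mon, 06 Sep 2010 12:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(rss_text):
    """Return one dict per <item> with its title, link and publication date."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


# Stand-in for the temporary repository that feeds the Indexer module.
temporary_repository = parse_rss_items(SAMPLE_RSS)
```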
Determining the time interval between two successive crawls of the same resource is a major concern, and neglecting it reduces the efficiency of the search engine and wastes system resources. Therefore, in NewsCr, we introduce a dynamic methodology for specifying the time interval between two sequential crawls of a specific news resource. If a static time interval is selected between two sequential crawls, some crawls may be ineffective. Figure 6 shows our new methodology for determining the time interval.
As shown in Figure 6, t indicates the current time of the crawler, i is the number of the news resource, T specifies the local time of the news resource, P indicates the time portion and delta is the time interval assigned to the specified P (Table 2). The first step of this process is to calculate the time difference between the crawler's current time and the local time of the news resource. The second step is to determine the appropriate time portion for the news resource, and the final step is to pass the time interval assigned to that portion to the crawler scheduler. To find an appropriate time interval for each portion, we monitored the arrival rate of news at several news agencies (resources) and identified the peak times. Based on these findings, we divide time into three portions and then allocate a time interval (delta) to each resource.
For example, suppose that in a country like Iran the peak of news arrival at news agencies is between 6 am and 4 pm, and suppose we select a Static Time Interval (STI) of 20 minutes. Under these conditions there is a large gap between two sequential crawls during peak time. In contrast, around midnight there are many unnecessary crawls. These situations decrease the performance of the system and waste a lot of resources.
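The three steps of the dynamic interval selection can be sketched as follows. This is a minimal illustration under stated assumptions: the portion boundaries and the delta values in PORTIONS are hypothetical stand-ins for Table 2, the resource time is derived from a UTC offset, and the function names are invented for this example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical portion table (stand-in for Table 2):
# (start hour, end hour, delta in minutes), in the local time of the resource.
PORTIONS = [
    (6, 16, 5),    # assumed peak hours: crawl often
    (16, 24, 20),  # assumed evening portion: moderate rate
    (0, 6, 60),    # assumed night portion: crawl rarely
]


def resource_local_time(crawler_time_utc, utc_offset_hours):
    """Step 1: shift the crawler time t by the resource's offset to obtain T."""
    return crawler_time_utc + timedelta(hours=utc_offset_hours)


def crawl_interval(crawler_time_utc, utc_offset_hours, portions=PORTIONS):
    """Steps 2 and 3: find the portion P that contains T and return its delta."""
    local = resource_local_time(crawler_time_utc, utc_offset_hours)
    for start, end, delta_minutes in portions:
        if start <= local.hour < end:
            return timedelta(minutes=delta_minutes)
    raise ValueError("portion table does not cover hour %d" % local.hour)


t = datetime(2010, 9, 6, 5, 0, tzinfo=timezone.utc)
delta_peak = crawl_interval(t, 3.5)   # hypothetical resource at UTC+3.5, local 08:30
delta_night = crawl_interval(t, 0)    # hypothetical resource at UTC+0, local 05:00
```

With this table, the same crawler time yields a short interval for the resource that is in its peak portion and a long interval for the resource that is in its night portion, which is the behaviour a static interval cannot provide.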

NewsQ
Users enter their queries via the User Interface (UI). Query Enrichment is the process of enriching a user's query. In this study, we introduce a new methodology for enriching the user's query, called NewsQ. NewsQ contains several sub-processes, which are shown in Figure 7. The phases shown in Figure 7 are as follows:
In the first two steps, we create a set of terms by tokenizing the query and eliminating all stopwords. The output of this part is a set of terms Q = {q_1, q_2, ..., q_i}.
By using WordNet, we find all synonyms of the terms and create the set of synonyms related to each q_i. The output of this process is a set S of synonyms s_i.
After determining the type of each q_i and s_i, the system allocates the appropriate weight to each of them. This process is controlled and executed by rules in the Inference Engine (rules and the inference engine are explained in the next section).
In this stage, we define D as the set of documents related to each q_i and s_i, and then we calculate the sum of the weights related to each document.
At the final stage, we choose an appropriate threshold. All documents with a weight greater than this threshold form the results.
To illustrate this process, an example is given in Fig. 8.
The output of the Query Enrichment process is an unordered set of results with their weights. The ranking process uses these weights to order the results.
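The phases of NewsQ can be sketched in a few lines. This is an illustrative sketch, not the NewsSE implementation: the stop list, the SYNONYMS dictionary (a stand-in for WordNet), the TERM_WEIGHT table (a stand-in for the weights assigned by the inference rules), the default weight and the inverted index are all assumptions.

```python
from collections import defaultdict

STOPWORDS = {"the", "of", "in", "a", "by"}          # tiny illustrative stop list
SYNONYMS = {"goal": ["score"]}                      # stand-in for WordNet lookups
TERM_WEIGHT = {"goal": 3, "score": 3, "messi": 4}   # stand-in for rule-assigned weights
DEFAULT_WEIGHT = 1                                  # assumed weight for untyped terms

# Hypothetical inverted index: term -> ids of the news items that contain it.
INVERTED_INDEX = {
    "goal": {"n1", "n2"},
    "score": {"n3"},
    "messi": {"n1", "n3"},
}


def enrich_query(query, threshold):
    # Tokenize and drop stopwords to obtain the term set Q.
    q_terms = [w for w in query.lower().split() if w not in STOPWORDS]
    # Collect the synonyms S of every term in Q.
    s_terms = [s for q in q_terms for s in SYNONYMS.get(q, [])]
    # Weight each term and sum the weights per related document (the set D).
    doc_weight = defaultdict(int)
    for term in q_terms + s_terms:
        for doc_id in INVERTED_INDEX.get(term, ()):
            doc_weight[doc_id] += TERM_WEIGHT.get(term, DEFAULT_WEIGHT)
    # Keep only documents whose total weight exceeds the threshold.
    return {d: w for d, w in doc_weight.items() if w > threshold}


results = enrich_query("the goal of Messi", threshold=3)
```

In this toy setting the synonym "score" lets document n3 reach the same weight as n1, while n2, which matches only one query term, stays at the threshold and is dropped.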

Inference and Rules
The rule and inference engine is used in the indexing process and in Query Enrichment. An inference engine consists of a set of rules, each of which is fired when its condition is satisfied. These rules help the system to extract new knowledge from the Ontology and the knowledge base. For example, Query Enrichment uses rules for term weighting, and the indexing process uses rules to extract events from news and relationships between concepts. The following list shows several rules used in NewsQ.
If q_i is an Individual then λ=3
If q_i is a specific Class then λ=3
If q_i is a specific Individual then λ=4
Inference is a costly process, especially if the number of assertions is large. This problem should be handled carefully to maintain the scalability of the system. The set of rules is presented in more detail in the evaluation section.
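A condition-action reading of these weighting rules can be sketched as follows. The three λ values are taken from the list above; the string labels used to encode term types, the first-match firing policy, the default weight of 0 and the example terms are assumptions made for this sketch only.

```python
# Each rule pairs a condition on a term's ontology type with the weight it assigns.
# The lambda values follow the rule list; the type labels are assumed encodings.
RULES = [
    (lambda kind: kind == "individual", 3),
    (lambda kind: kind == "specific_class", 3),
    (lambda kind: kind == "specific_individual", 4),
]


def fire_rules(kind, rules=RULES, default=0):
    """Return the weight of the first rule whose condition holds, else `default`."""
    for condition, weight in rules:
        if condition(kind):
            return weight
    return default


# Hypothetical typing of query terms, as an ontology lookup might return it.
term_types = {
    "goalkeeper": "specific_class",
    "player_x": "specific_individual",
    "weather": "unknown",
}
weights = {term: fire_rules(kind) for term, kind in term_types.items()}
```

Keeping the rules in a plain list makes the cost of inference proportional to the number of rules per term, which is one simple way to keep the scalability concern mentioned above visible.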

NewsDex
Indexing is one of the most important processes of any retrieval system. The key technique used for indexing is "term weighting". The value (denoted a_ik) of the element for a keyword t_i in a document d_k is determined by a term weighting algorithm. We first introduce some notation. We denote by f_ik the term frequency of a term t_i in the document d_k (a news item), that is, the number of occurrences of t_i in d_k divided by the total number of words in d_k. We denote by n_i the number of documents that contain the keyword t_i, and by N the total number of documents in the corpus. So n_i/N is the proportion of documents containing the term t_i among all documents in the corpus, which we call the document frequency. NewsDex uses the Boolean weighting method. This is a simple method: it assigns a weight of 1 if the word appears in the document (news item) and 0 if it does not.
Eq. (1) shows how this method calculates the weight:

$$a_{ik} = \begin{cases} 1 & \text{if } f_{ik} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

For indexing news, we use this method because computing term frequency within a single statement is not reasonable [10]. We only need to know whether a word appears in a news title or not. Because we would like to incorporate semantic information (concepts and events) into the indexing process, we add some features to this method in order to improve indexing. Figure 9 shows the overall process of NewsDex.
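The quantities defined above can be computed directly. The following sketch assumes that each news item is already tokenized into a list of words; the function names and the three-document corpus are illustrative only.

```python
def term_frequency(term, doc_tokens):
    """f_ik: occurrences of the term divided by the number of words in the document."""
    return doc_tokens.count(term) / len(doc_tokens) if doc_tokens else 0.0


def document_frequency(term, corpus):
    """n_i / N: share of the documents in the corpus that contain the term."""
    n_i = sum(1 for doc in corpus if term in doc)
    return n_i / len(corpus)


def boolean_weight(term, doc_tokens):
    """a_ik: 1 if the term appears in the document, 0 otherwise."""
    return 1 if term in doc_tokens else 0


# Hypothetical tokenized news titles.
corpus = [
    ["team", "wins", "final", "match"],
    ["coach", "praises", "team", "team"],
    ["transfer", "window", "opens"],
]
tf_team = term_frequency("team", corpus[1])
df_team = document_frequency("team", corpus)
bool_team = [boolean_weight("team", doc) for doc in corpus]
```

The second title repeats "team", so its term frequency differs from the first title, but the Boolean weight is 1 for both, which matches the argument that only presence matters for short news titles.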
As shown in Fig. 9, the process of indexing contains 4 stages:

Tokenization
In this stage, each news item is tokenized into several phrases. We use a parser to perform tokenization.

Eliminating Stopwords
A stopword is a commonly used word (such as "the") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. When building the index, most engines are programmed to remove certain words from any index entry. The list of words that are not to be added is called a stop list. Stopwords are deemed irrelevant for searching purposes because they occur frequently in the language for which the indexing engine has been tuned. In this stage we therefore remove all stopwords by comparing each term of the news item with the stop list.

Calculating the Weight of Terms, Concepts and Events in Each News
To calculate the weights, we apply the Boolean method to each news item. In addition to weighting the terms in each news item, we weight the concepts and events related to it. To find all concepts and events related to a news item, we use the Domain Ontology and the Event Ontology respectively.

Index Matrix
After each indexing process, the index matrix is updated. Table 3 shows a forward index matrix. Each row of the matrix in Table 3 corresponds to a specific document, and each column corresponds to a term, a concept or an event.
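A forward index with term, concept and event columns can be sketched as below. The two lookup dictionaries are hypothetical stand-ins for the Domain Ontology and the Event Ontology, and the "term:", "concept:" and "event:" column prefixes are a naming choice made for this example, not the layout of Table 3.

```python
# Hypothetical ontology lookups: surface term -> concept label or event label.
DOMAIN_CONCEPTS = {"goalkeeper": "Player", "striker": "Player", "stadium": "Venue"}
EVENT_TRIGGERS = {"scored": "Goal", "injured": "Injury"}


def index_news(tokens):
    """Return the Boolean feature set (terms, concepts, events) of one news item."""
    features = {"term:" + t for t in tokens}
    features |= {"concept:" + DOMAIN_CONCEPTS[t] for t in tokens if t in DOMAIN_CONCEPTS}
    features |= {"event:" + EVENT_TRIGGERS[t] for t in tokens if t in EVENT_TRIGGERS}
    return features


def build_forward_index(news):
    """Rows are news ids, columns are sorted features, cells are 0/1 weights."""
    per_doc = {doc_id: index_news(tokens) for doc_id, tokens in news.items()}
    columns = sorted(set().union(*per_doc.values()))
    matrix = {doc_id: [1 if c in feats else 0 for c in columns]
              for doc_id, feats in per_doc.items()}
    return columns, matrix


# Two hypothetical news items after tokenization and stopword removal.
news = {
    "n1": ["striker", "scored"],
    "n2": ["goalkeeper", "injured"],
}
columns, matrix = build_forward_index(news)
```

Both items share the column "concept:Player" although they have no term in common, which illustrates how concept columns can connect news that keyword columns alone would keep apart.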

Storing Indices in XML Format
The structure of the semantic index is highly important for retrieval performance. We constructed an XML (Extensible Markup Language) format for storing the indices of each news item. Table 4 shows an example of the index structure for a specific news item.
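Storing and reloading such an index can be sketched with the Python standard library. The element and attribute names used here (news, terms, term, concepts, concept, events, event, weight) are assumptions for illustration and do not reproduce the schema of Table 4.

```python
import xml.etree.ElementTree as ET


def index_to_xml(news_id, terms, concepts, events):
    """Serialize the index entries of one news item to an XML string.

    Element and attribute names are illustrative, not the schema of Table 4.
    """
    root = ET.Element("news", attrib={"id": news_id})
    for tag, values in (("term", terms), ("concept", concepts), ("event", events)):
        group = ET.SubElement(root, tag + "s")
        for value in values:
            # Boolean weighting: every stored entry carries weight 1.
            entry = ET.SubElement(group, tag, attrib={"weight": "1"})
            entry.text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_index(xml_text):
    """Read the index back into a dict of lists keyed by entry type."""
    root = ET.fromstring(xml_text)
    return {tag: [e.text for e in root.iter(tag)] for tag in ("term", "concept", "event")}


xml_text = index_to_xml("n1", ["striker", "scored"], ["Player"], ["Goal"])
restored = xml_to_index(xml_text)
```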

NewsRank
All news items selected as results in the previous steps must be ranked at this stage before they are presented in the output. We therefore present a new algorithm for ranking news, called NewsRank. In this study we assume that there are two different preferences for ordering the results:
News freshness: if users select this option, all results are ordered by the NewsRank algorithm alone.
Similarity with the user's query: if users select this option, another process is carried out before NewsRank is applied. In this process, a threshold is chosen and all news items with an overall weight greater than this threshold are ordered by the NewsRank algorithm.
The work done by a search engine pays off only if the engine uses a suitable ranking algorithm. If all processes are executed well but the results are not properly presented to users, all efforts are wasted. After all other tasks are finished, the generated results must be ordered according to a specific algorithm. Figure 10 shows the NewsRank algorithm. Since ordering and ranking in a news search engine differ from those in other kinds of search engines, we choose two important parameters for each news item: the time at which the news occurred (its degree of freshness) and the news resource (its degree of credibility). In the algorithm of Figure 10, these two parameters are denoted by α and β. The algorithm works as follows:
First, all generated results are ordered based on the parameter α (degree of freshness). At the end of this stage, the news items with the smallest α are presented in the highest positions.
In the second stage, news items with equal α are reordered based on the second parameter β.
This mechanism ensures that more recent and more reliable news is placed in higher positions. Specifying the NewsRank parameters is a manual process. Reliable resources have a larger β. The β parameter helps the engine to control how results are shown. For example, in some situations we may change the β of a resource in order to increase or decrease its importance. If we decide not to show the results of a specific resource, we can set its β to zero.
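The two-stage ordering can be sketched as a single sort with a composite key. The sketch assumes that α is the age of the news item in hours, so that a smaller α means fresher news, that ties on α are broken by descending β, and that items from resources with β equal to zero are hidden; the sample items and values are hypothetical.

```python
def news_rank(items):
    """Order by ascending alpha, break ties by descending beta, drop beta == 0."""
    visible = [item for item in items if item["beta"] != 0]
    return sorted(visible, key=lambda item: (item["alpha"], -item["beta"]))


# Hypothetical results: alpha is assumed to be the age in hours,
# beta the manually assigned credibility of the resource.
results = [
    {"title": "Match report", "alpha": 2, "beta": 0.6},
    {"title": "Transfer rumour", "alpha": 1, "beta": 0.0},
    {"title": "Injury update", "alpha": 2, "beta": 0.9},
    {"title": "Final score", "alpha": 1, "beta": 0.8},
]
ranked_titles = [item["title"] for item in news_rank(results)]
```

Sorting on the tuple (α, -β) performs both stages at once: Python compares the first component and only consults the second when the α values are equal.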

Results and Discussions
In this section, we assess the experimental results and compare the proposed framework with two well-known news search engines: Google News and Yahoo! News. Before starting, it is important to note that both the Google and Yahoo search engines have very large stores, and since the resources stored in these two engines are not accessible, we focus only on the first 50 results of each query. In the following parts, we first introduce the set of queries used for the evaluation and then evaluate the performance of NewsSE.
In order to evaluate the proposed system, 20,000 sports news items were collected from various sources.

Evaluation Queries
We started the evaluation process by preparing the 8 queries shown in Table 5. The queries fall into 8 different classes, and each query serves as the indicator for its class. Next to each query, we also give the corresponding keyword query that was actually used in the evaluation. Table 6 presents the Mean Average Precision (MAP) [11,12,13] for three different types of search engines. Figure 11 shows the evaluation results comparing the MAP of TRAD, CB and NewsSE. As shown in Figure 11, the situation differs from query to query. For the first three queries, there is no considerable difference between CB and EB because the framework cannot extract any event from these queries. For queries Q4, Q5, Q6 and Q7, we can see the effect of adding the Event Ontology to the framework: because the Event Ontology extends the coverage of the results, it yields a better AP. The last query shows the effect of the Inference Engine: the name of the goalkeeper is not given explicitly in this query, but the system can infer it from the Domain Ontology. Finally, we compare our framework with Google News and Yahoo! News. Table 7 shows the MAP for these three engines.
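For reference, Average Precision and MAP can be computed as in the following sketch. It uses the common definition in which AP is normalized by the number of relevant documents for the query; whether the evaluation above normalizes in exactly this way, for example within the first 50 results, is not stated, so this is an assumption. The rankings and relevance sets are invented.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP: sum of precision@k at each relevant hit, divided by the number of relevant items."""
    hits, precision_sum = 0, 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs):
    """MAP: arithmetic mean of AP over all (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings with their relevant news ids.
runs = [
    (["n1", "n4", "n2"], {"n1", "n2"}),   # AP = (1/1 + 2/3) / 2
    (["n7", "n5"], {"n5"}),               # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
```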

Conclusions
In this research, we presented a complete framework for performing semantic search in news. In addition to focusing on news, the framework treats events as major components of news. Besides the use of Semantic Web techniques, which enhance the efficiency of search operations, the particular attention paid to news and events has a large impact on the quality of the results generated for each query. For these reasons, the Mean Average Precision (MAP) of the proposed framework is 86.81 percent, which is higher than that of Google News and Yahoo! News. Furthermore, the Event Ontology increases Average Precision by about 0.8 percent compared to other traditional search engines.