Integrating Canonical Text Services into CLARIN’s Search Infrastructure

Today’s digital research infrastructures target a variety of user groups. A key requirement for achieving acceptance and active participation among them is to provide interfaces to digital resources that are both user-friendly and machine-readable. This is especially true for highly integrated infrastructures like the CLARIN project. The Canonical Text Service protocol (CTS) is an established system in the document-based Digital Humanities that addresses many of the associated problems, such as dealing with varying levels of text granularity, persistent identification, address resolution and simple interfaces for integration into various automatic workflows. The paper shows the advantages of integrating a CTS instance into CLARIN and also demonstrates additional benefits of this CTS implementation in the form of built-in text mining techniques.


Introduction
The current landscape of digital resources in the humanities can be characterized as rather scattered and oriented towards highly specific research interests. Despite strong efforts to build digital research infrastructures (such as CLARIN and DARIAH) that overcome this heterogeneity and provide integrated research environments, it can be assumed that the majority of textual resources in this field, although often encoded in standard formats such as those of the TEI guidelines [13], are still not available via standardized interfaces and cannot be found by means of existing search functionality. Naturally, it is a key task to convince and motivate researchers from a wide variety of subfields of the humanities to provide their valuable data to the wider community. Consequently, several approaches have been developed and are in active use that minimize the effort needed for a thorough integration into existing environments and workflows and that offer obvious benefits as motivation for interested resource providers. The following issues are considered especially problematic and relevant by the authors and are addressed in this paper:
• Many of the current solutions treat textual resources as atomic, i.e. all provided interfaces are focused on the complete resource. The inherent structure of textual data is left to be processed by external tools or extracted manually by the user. Although this is acceptable for some use cases, a highly integrated research environment loses much of its power and applicability for research questions if it ignores this structure.
• Textual resources do not have a typical granularity. Even for rather similar textual resources (like web-based corpora or document-centric collections), a "default structure" on which analysis or resource aggregation can take place cannot be assumed. As a consequence, many approaches require and assume a standard format as the foundation for all provided applications and interfaces.
• Granularity has to be addressed as a basic feature of (almost) all textual resources. Current infrastructures make use of several identification and resolving systems (like Handle, DOI, URNs etc.), but fine-grained identification and retrieval of (almost) arbitrary parts is hardly supported or has to be modeled artificially using features that these systems provide. As a consequence, even textual resources already provided in CLARIN are often not directly accessible or combinable because of the heterogeneity of the reference solutions used or the level of granularity supported.

Canonical Text Services
The Canonical Text Service protocol [11] describes a framework for web-based identification and retrieval of passages of text cited by canonical reference, as is typical in classical studies and other literary disciplines. To achieve this, it uses persistent CTS URNs to reference specific text passages, such as a complete document or a more specific text part. The following example illustrates this reference system. The CTS URN urn:cts:pbc:bible.parallel.eng.kingjames: specifies the complete edition/document "King James Version of the Christian Bible". This URN corresponds to the reference that is provided by alternative systems that work with complete documents.
In contrast to those systems, CTS makes it possible to reference any static text part with URNs like urn:cts:pbc:bible.parallel.eng.kingjames:1, which references the first book in this document, or urn:cts:pbc:bible.parallel.eng.kingjames:1.4.2, which references the second sentence of the fourth chapter of the first book of this Bible translation. Static URNs generally point to structural elements of the text, like chapter, paragraph, sentence or stanza, but are not restricted to a specific schema. Additionally, CTS URNs can reference spans between two static URNs, like urn:cts:pbc:bible.parallel.eng.kingjames:1-1.4.2.
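The structure of the static and span URNs described above can be made explicit with a small parser. The following is only a sketch under stated assumptions, not part of the CTS implementation discussed in this paper: the names CtsUrn and parse_cts_urn are hypothetical, and the parser deliberately accepts nothing beyond static references and spans.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CtsUrn:
    """Hypothetical container for the parts of a static or span CTS URN."""
    namespace: str                          # e.g. "pbc"
    work: Tuple[str, ...]                   # e.g. ("bible", "parallel", "eng", "kingjames")
    start: Tuple[str, ...]                  # citation path; () means the whole document
    end: Optional[Tuple[str, ...]] = None   # set only for spans

    @property
    def depth(self) -> int:
        """Citation level of the (first) reference; 0 for a whole document."""
        return len(self.start)


def _to_path(reference: str) -> Tuple[str, ...]:
    # "1.4.2" -> ("1", "4", "2"); an empty reference yields an empty path.
    return tuple(reference.split(".")) if reference else ()


def parse_cts_urn(urn: str) -> CtsUrn:
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    passage = parts[4] if len(parts) > 4 else ""
    if "@" in passage:
        raise ValueError("only static references and spans are handled in this sketch")
    start, dash, end = passage.partition("-")
    return CtsUrn(
        namespace=parts[2],
        work=tuple(parts[3].split(".")),
        start=_to_path(start),
        end=_to_path(end) if dash else None,
    )
```

Parsing urn:cts:pbc:bible.parallel.eng.kingjames:1.4.2 with this sketch gives a citation path of three elements, whereas the document-level URN ending in a colon gives an empty path. The hierarchy of text granularity is therefore readable directly from the identifier.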
Using sub passage notation, any possible text passage can be requested, like urn:cts:pbc:bible.parallel.eng.kingjames:1@the[2]-2.4.2@a[3], which translates to the text passage from the second "the" in book one to the third character "a" in the second sentence of the fourth chapter of the second book. Sub passages are resolved using the exact string, which normally corresponds to a word but may also be a string of words or a single letter. For the remainder of this paper only static CTS URNs are relevant; URNs for spans of text passages and sub passage notation are not included. The relationship between CTS URNs and different levels of text granularity is also depicted in figure 1 for an excerpt of the English King James translation of the Bible.
One central aspect of CTS is that it was developed by humanists and reflects their perspective on text references, as described for instance by [3]. This perspective might differ from the one used in other communities. Specifically, many semantics are implicitly or explicitly encoded in CTS URNs, which differs from the understanding, often assumed in CLARIN, that identifiers should be opaque. Combining CTS with CLARIN has the potential to create an important connection between the two philosophies. It opens up the tools that are provided in CLARIN to the digital classicist community and also provides access to that community's valuable data.
The CTS implementation used here (described in [12]) proved to be efficient and scalable even for large text collections. Additionally, it became possible to implement several features that are not part of the CTS protocol but are useful additions relying on certain properties of CTS, like text alignment based on text structure (see section 7). The data sets that are already available as instances of CTS, and can therefore be imported into CLARIN with references to individual text parts, include Perseus, the Parallel Bible Corpus, the Deutsches Textarchiv and many others.
Even if the data is already included in CLARIN, like the documents of the Deutsches Textarchiv, importing it via CTS provides resources at a finer granularity and therefore additional research value. New data sets can be imported from TEI/XML in a configurable workflow or created from any format with project-specific import scripts. For this work a subset of the Parallel Bible Corpus is used as an example. This subset consists of 20 Bible translations that each contain more than 30,000 text parts and are either older than 70 years or published as public domain.

CTS as a Uniform Text Communication Interface
The usefulness of integrated support for CTS can be illustrated with Monica Berti's Digital Athenaeus Index Digger [1]. Its columns Read Greek Text (Perseus), Read Greek Text (Frontend UniLeipzig) and Annotate with Perseids provide hyperlinks to different tools that work with different data sets and are developed by different research groups. Yet, because CTS is specified as a technology-independent protocol, the same text passage can be linked across these tools by providing the same CTS URN as a parameter in the hyperlink. This makes it possible to connect and combine the results of different tools and significantly increases interoperability between research groups. It also reduces the effort that new research projects have to put into retrieving text resources and converting formats.
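The linking mechanism described above amounts to passing one and the same CTS URN to several tools as a URL parameter. A minimal sketch follows; the tool names, base URLs and parameter names are placeholders invented for illustration and do not correspond to the actual addresses of the tools mentioned in this section.

```python
from typing import Dict, Tuple
from urllib.parse import urlencode

# Hypothetical tool endpoints: (base URL, name of the query parameter carrying the URN).
# Real tools define their own URLs and parameter names.
TOOLS: Dict[str, Tuple[str, str]] = {
    "reader": ("https://example.org/reader", "urn"),
    "annotator": ("https://example.org/annotate", "passage"),
}


def tool_links(cts_urn: str) -> Dict[str, str]:
    """Build one hyperlink per tool, all pointing at the same text passage."""
    return {
        name: f"{base}?{urlencode({param: cts_urn})}"
        for name, (base, param) in TOOLS.items()
    }
```

Because the URN is the only passage-specific input, supporting an additional tool in this sketch means adding one entry to the table; no conversion of the text itself is involved.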

The CLARIN Infrastructure
CLARIN is the short name for the "Common Language Resources and Technology Infrastructure", a research infrastructure for scholars in the humanities and social sciences [8]. It offers easy and sustainable access to digital language data (e.g. in written, spoken or multimodal form) as well as advanced tools to discover, explore, exploit, annotate or analyse data sets. CLARIN follows the approach that all integrated digital language resources and tools from all over Europe and beyond are accessible through a single sign-on online environment that supports researchers in the respective fields. To this end, a networked federation of language data repositories and of service and knowledge centers was created, with access for all members of the academic communities in all participating countries. In CLARIN, tools and data from the different centres are interoperable, so that they can be combined to perform complex operations in support of researchers' work.
CLARIN is one of the research infrastructures of the European Research Infrastructures Roadmap by ESFRI, the European Strategy Forum on Research Infrastructures. This roadmap contains five research infrastructures in the area of social sciences (CESSDA, European Social Survey, and SHARE) and humanities (CLARIN and DARIAH). On the European level CLARIN's governance and coordination body is CLARIN ERIC. An ERIC is an international legal entity, established by the European Commission in 2009. In 2012 CLARIN ERIC was established and took up the vital mission to develop and maintain this international infrastructure.
By now, the CLARIN infrastructure is fully operational in many countries, among them Germany with its consortium CLARIN-D. As a result, a large number of participating centers offer access to data, tools and expertise. At the same time, CLARIN is still being established in countries that joined only recently, and CLARIN's datasets and services are constantly updated, extended and improved.

CLARIN Integration
For a thorough integration in CLARIN, the granularity of text resources contained in CTS instances has to be exposed mainly via metadata, as comparable interfaces for retrieval of the actual text material are already provided by every CTS instance. For describing all kinds of resources, CLARIN makes use of the Component Metadata Infrastructure (CMDI) framework [5]. It makes it possible to build and use component-based metadata schemata for all kinds of resources, including descriptions of complete corpora, tools or services. For the concrete realisation the popular CMD profile "OLAC-DcmiTerms" (clarin.eu:cr1:p_1288172614026) was used. The REST-based design of the CTS protocol and its implementations reduces the effort that is necessary to include CTS instances in the center-based CLARIN infrastructure. The CLARIN center Leipzig makes strong use of web services both for its internal structure and for the external interfaces it provides. For the incorporation of a potentially unlimited number of CTS instances, this approach was extended by creating a wrapper web service as the main interface for the internal center infrastructure. For all metadata-centric external views the default repository system is still used; it provides a transparent interface to the CTS resources via standard interfaces like OAI-PMH.
The implemented solution allows the creation of CMD-compliant metadata on every level of granularity that is provided by the CTS instance. For the time being it was decided to expose only the top two levels via metadata. For the example depicted in section 2 this means that every specific edition of the Bible and each of its books is described by its own metadata file and can be accessed and found in search engines. This includes the typical descriptive metadata, all relevant references to resource-specific CTS services and the hierarchical interlinkage of metadata files as it is supported by the Virtual Language Observatory [4].
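The decision to expose only the top two levels can be illustrated as follows. This is a simplified sketch and not the wrapper web service described above: records are plain dictionaries rather than serialized CMDI, the function name build_records is hypothetical, and it is assumed for illustration that the level-1 references of an edition are numbered consecutively from 1. The keys hasPart and isPartOf follow the DCMI terms of the same name.

```python
from typing import Dict, List


def build_records(edition_urn: str, title: str, book_labels: List[str]) -> List[Dict[str, object]]:
    """Return one record for the edition plus one record per level-1 passage (book)."""
    # Assumption: books are addressed as <edition_urn>1, <edition_urn>2, ...
    book_urns = [f"{edition_urn}{number}" for number in range(1, len(book_labels) + 1)]

    # Top level: the edition points down to all of its books.
    edition_record: Dict[str, object] = {
        "identifier": edition_urn,
        "title": title,
        "hasPart": book_urns,
    }
    # Second level: each book points back up to its edition.
    book_records: List[Dict[str, object]] = [
        {"identifier": urn, "title": f"{title}: {label}", "isPartOf": edition_urn}
        for urn, label in zip(book_urns, book_labels)
    ]
    return [edition_record] + book_records
```

Deeper levels (chapters, sentences) remain reachable through the CTS instance itself; they are simply not given metadata records of their own in this sketch.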

Results
The result of the process is a collection of metadata files that are created from the meta information contained in the text inventory of the CTS instance. These files can be accessed using the Virtual Language Observatory. As an example, the textgroup urn:cts:bible is translated into a single CMD record, and the same holds for each document-level CTS URN in this textgroup, like urn:cts:pbc:bible.parallel.arb.norm:, urn:cts:pbc:bible.parallel.ceb.norm: and urn:cts:pbc:bible.parallel.ces.norm:. For each document, the CTS requests for the CTS URNs on citation level 1 (each of which corresponds to one book of the Bible, like Genesis) are added as resources, along with the CTS request for the complete document and the TEI/XML source file that was imported into CTS. The citation level can be chosen arbitrarily. In this work it seemed reasonable to work with the 60 to 70 references per document to the books of the Bible instead of more than 1,000 references per document on citation level 2 or even more than 30,000 references per document on citation level 3.
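The choice of citation level and the resulting resource list can be sketched as follows. The endpoint URL is a placeholder, the function names are hypothetical, and the request syntax (request=GetPassage with a urn parameter) is assumed from the CTS protocol rather than taken from the server configuration used in this work.

```python
from typing import Dict, List
from urllib.parse import urlencode

CTS_ENDPOINT = "https://example.org/cts"  # hypothetical CTS instance


def pick_citation_level(refs_per_level: Dict[int, int], max_refs: int) -> int:
    """Return the deepest citation level whose reference count stays within max_refs."""
    fitting = [level for level, count in refs_per_level.items() if count <= max_refs]
    if not fitting:
        raise ValueError("no citation level fits the given limit")
    return max(fitting)


def passage_request(urn: str) -> str:
    """Build a passage request URL for one CTS URN (parameter names are assumed)."""
    return f"{CTS_ENDPOINT}?{urlencode({'request': 'GetPassage', 'urn': urn})}"


def document_resources(document_urn: str, level1_refs: List[str], tei_source: str) -> List[str]:
    """Resources of one document: the whole text, each level-1 passage, the TEI/XML source."""
    return (
        [passage_request(document_urn)]
        + [passage_request(document_urn + ref) for ref in level1_refs]
        + [tei_source]
    )
```

With illustrative counts in the orders of magnitude given above, for example {1: 66, 2: 1100, 3: 30500}, and a limit of 100 references per document, pick_citation_level selects level 1, which matches the choice made in this work.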

CTS Applications
Several applications were developed that rely on properties of the described implementation. Because they use CTS as a standardized input, they can be used with any data that is stored in the system. Each of these applications is provided along with the CTS implementation and can be used as soon as a new server instance is created. In particular, no additional data has to be pre-calculated.
One of these applications is the Candidate Text Alignment Browser, which visualises textual variants in one language using the TRAViz library [9]. As a major benefit for users from the Digital Classics, this visualization intuitively reflects diachronic changes in biblical texts over almost 400 years. Further parameters make it possible to change the size of the text passage; since TRAViz is relatively memory intensive, however, using long passages as input is not recommended.
The CTS server also integrates the Parallel Alignment Browser, which can be used to align text passages from selected documents independent of their language. For better readability and easier creation of CTS URNs, the Canonical Text Reader and Citation Exporter CTRaCE 17 was developed. This tool renders the output of a CTS instance in a more appealing way, lets users traverse the documents and easily create CTS URNs for a selected text passage, and generally provides "an interface to intuitively make this [meaning: CTS] capability accessible to humanities scholars" [10]. Integrating support for Canonical Text Services in CLARIN makes it possible to connect these and future applications with the numerous tools and methods already provided in the CLARIN infrastructure.

Future Work
The current setup of Leipzig's CTS server is fully functional and will be a valuable part of CLARIN's constantly growing resource landscape. The integration of more resources in the CTS instance is already in progress and the described work also significantly simplifies the inclusion of CTS instances hosted by other data providers in CLARIN.
For an even tighter integration, the current focus of development is on providing interfaces to more relevant CLARIN components. It is expected that especially the preparation of an endpoint compliant with the CLARIN Federated Content Search FCS 18 or a wrapper for the execution environment WebLicht [7] will boost the usefulness of the system. Furthermore, continuous efforts are being made to provide more integrated analysis and visualization components in the CTS server.
17 /browser /?ctsURL=../../pbc/cts/
18 http://weblicht.sfs.uni-tuebingen.de/Aggregator/