Discursive Constructions of Culture: Semantic Modelling for Historical Travel Guides

Besides introducing the Baedeker Corpus, a digital collection of early German travel guides on non-European countries, this paper addresses linguistic and in particular semantic annotation, domain-specific taxonomy building, corpus enrichment by embedding external resources, and the reasons why advanced textual studies of this genre are of interest. Early Baedeker handbooks are valuable rarities today because only a small number of copies escaped the frequent maltreatment of being cut up to save luggage weight on local trips. Those that have endured give us a vivid impression of cultural narratives from the turn of the 19th century, and they tell more than one story. The project presented here approaches this complex issue by focusing on data development in general and on the leading actors in the travel guides, people and notable sights, in particular. Observing widely recognized standards for scholarly editing and relying on pan-European infrastructures to enhance data interchange and reusability, the aim is to make a well-equipped and freely accessible language resource available that fosters cross-disciplinary research on cultural representation and identity-constructing discourses.


Introduction
The Baedeker Corpus, which is currently being developed in the travel!digital-project, 1 draws attention to historical travel guides, a part of literature often neglected by analog as well as digital humanities. In response, corpus building and design principles aim at reducing the shortage of appropriate digital data that allow for historical analyses, and at providing an incentive for further investigations in this field. Considering the wide distribution of travel guides in general and the cultural significance of Baedeker handbooks in particular, it is surprising that scarcely any research has yet been devoted to these sources. Outstanding quality, incontestable reliability and a very attractive appearance (see Fig. 1) contributed to their good reputation.
Karl Baedeker (1801-1859) began his publishing business in Koblenz in 1827. Appearing first in German and French, and from 1861 onwards also in English, the editions soon became well known and highly influential. The Baedeker set the standard and defeated all competition both inside and outside the German-speaking countries. Thus, a history of travel guides without Baedeker would not be complete. Baedeker's first travel book, which soon became his most popular title, was a reprint of Rheinreise von Mainz bis Cöln. Handbuch für Schnellreisende [1], written by the German historian Johann August Klein. While this edition was extended with a course map of the river Rhine but otherwise remained unchanged, in the course of later revised editions Baedeker gave practical advice and added useful information and descriptions of the most important sights. The amendments to this volume as well as the extensive information material for the guidebooks that followed were based on Karl Baedeker's own travels. [cf. 2: 23 f.] Although he adopted many design elements (e.g. the red cover, the name "handbook", the asterisks) from his English contemporary John Murray, Baedeker attained the leading position in the travel book sector. At the end of the 19th century, Murray's handbooks disappeared from the European market. As business increased, the regularly revised volumes were supervised by scholars and reached a high level of standardization from outer covers to inner organization.
Despite this longstanding success story, substantial contributions to the history of the genre 2 as well as its specific language of expression are still limited. In the field of textual scholarship, travel guides were for a long time not regarded as a literary form in their own right within the classical canon. Therefore, they played a minor role in comparison to travel narratives. Complaining that "there is no general history of this significant genre of tourist literature", Koshar [5: 15 f.] refers to the "bad reputation" travel guides have. In addition, he argues that "the variability and constantly increasing number of guidebooks frustrate the researcher's attempt to grasp the genre as a conceptual whole". Previous guidebook studies frequently focused on the classification of structural and functional features [cf. 6-9], and the majority of diachronic approaches are limited to one region or country. 3 Since investigations rely on small analog corpora, textual and linguistic analyses as well as cultural-historical interpretations remain too restricted and are usually anecdotal in nature. In contrast, the few corpus-based and corpus-driven approaches in computational linguistics that work with large data volumes concentrate on contemporary travel guides and related texts. Often confined to translation studies, they exclude historical development and change [cf. 13-16]. One reason for this lack of diachronic corpus studies can be seen in the absence of appropriate electronic data, which is probably due to the numerous difficulties in digitizing complexly structured historical guidebooks.
Moving away from regional restrictions and conventional methodologies, the travel!digital-project aims for a larger geographical diversification and the application of up-to-date text technology in order to create a sound basis for more systematic studies and to make this valuable part of cultural heritage available to various disciplines. Travel guides offer detailed descriptions of different places, people/s and cultures. In doing so, they do not represent neutral assessments. On the contrary, they are part of cross-cultural communication including curiosity, openness and empathy on the one hand and misunderstanding, prejudice and antagonistic relations of power on the other. Far more than mere historical sources, guidebooks mediate a variety of complex discourses. Introducing cultures and people/s, recommending itineraries, assessing sites and festivals, cultural and social conditions, suggesting modes and attitudes of behavior to adopt in these places and on these occasions at the same time disclose the authors' own expectations and fears. In addition, they contain many interesting details illustrating what was considered age-old, traditional or modern at that time and selecting what was worth seeing.
As complex inter-texts and significant discourse-historical artifacts, 4 travel guides represent "codified and authorized versions of local culture and history". [18: 94] Reflecting dominant discourses, (re)producing and (re)constructing them, they play a central role in shaping the tourist experience and directing the tourist gaze. [cf. 19,20] In addition to external influences, inner qualities also contribute to the complexity of the text type, 5 which shows features of the travelogue, atlas, geographical survey, art-history guide, restaurant and hotel guide, tourist brochure, address book, and civic primer. [cf. 21: 307] For all these reasons, the various readings of history, tradition and culture, including tourism and cultural heritage as well as colonialism and orientalism, are of particular interest.

Materials and Methods
The digital collection presented in this paper brings together all first editions of German travel guides on non-European countries which were released by the Baedeker publishing house before World War I. The seven volumes, dating from the period between 1875 and 1914, comprise 4,237 pages and more than 1.7 million running words and cover various regions, offering a balanced picture of different cultural areas. (See Table 1) It was Fritz Baedeker, third son of Karl and company manager from 1869 onwards, who expanded the range to distant regions. In part, this selection reflects the project leader's academic background in Cultural Anthropology. In part, it is due to the fact that a topical focus results in a manageable corpus size. In addition, early handbooks on non-European destinations have so far been treated as a minor matter in critical literature.
Corpus building and digitization processes (image creation, automatic text recognition, basic annotation) were already carried out in the early 2000s within the framework of the Austrian Academy Corpus AAC, a predecessor institution of the Austrian Centre for Digital Humanities. Since structural annotation has meanwhile been completed for the entire Baedeker Corpus as well, substantial data enrichment by means of linguistic and in particular semantic markup is at the core of the ongoing project. Aside from applying other well-established standards to ensure long-term preservation and free accessibility of the digitized material, we were especially interested in using the potential of semantic technologies for exploring the German repertoire of cultural discourses at the turn of the 19th century. The project is funded by a support program of the Austrian Academy of Sciences that promotes sustainable digital research and data creation across humanities disciplines in accordance with international standard recommendations and best practices. Since not all of the components involved may be familiar to readers outside the DH community, the following sections, which give an overview of data formats and tools, the layers of annotation, and the web application, also include introductory information on these aspects.

Formats, Tools and Layers of Annotation
High-quality digital images produced with a Zeutschel-OMNISCAN-7000 book scanner (8-bit, 256 greyscale) and stored in TIFF-format (400 dpi) were used as the basis for optical character recognition (OCR), which was carried out using the software ABBYY FineReader [22]. As expected, changing typefaces, typographical properties and very small font sizes had a negative impact on the accuracy level achieved. Thus, text improvement was an inevitable step in each project stage. Fortunately, at least semi-automatic means of supporting time-consuming proofreading and manual corrections were available in all phases. The machine-readable text obtained from automatic recognition was exported to XML (Extensible Markup Language) [23], currently the state-of-the-art format for representing and sharing structured information on the Web. XML and related specifications for transforming and navigating XML documents such as XSLT (Extensible Stylesheet Language for Transformations) [24] and XPath (XML Path Language) [25] are recommendations of the World Wide Web Consortium W3C [26], which is the most important international standards organization in this context.
The markup added to the plain text is based on the guidelines of the Text Encoding Initiative TEI (version P5) [27]. This standard set of rules for encoding texts of any genre defines markup elements for significant particularities at several levels and at different granularities including structural information (titles, paragraphs, verse lines, chapters, references etc.), typographical properties (highlighting, quotations etc.), and semantically distinct units (dates, named entities like persons, places etc.). Expressed in XML syntax, TEI supports searching, text analysis and online publication, as it allows the encoded texts to be transformed, rearranged, and converted for various presentation and navigation purposes. The entire TEI encoding scheme is extensive but at the same time modular and flexible, thus providing possibilities for customization and modification. In our case, customization simply meant that we selected a manageable subset of TEI components and generated a schema appropriate to our individual project requirements. In this context, one of our central aims was to preserve the original state of the texts in the digital representation to as high a degree as possible. Therefore, aside from structural, typographic and orthographic features, we recorded, for example, apparent errors and provided corrections for the sake of retrieval. At the phrase level, we encoded dates as well as names of persons, indicating whether the person is a historical, religious/mythological or literary figure. In addition, a wide range of group designations and selected sights were marked up with appropriate elements, forming the groundwork for the more structured thesaurus discussed in detail below.
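The encoding decisions described above can be pictured with a minimal sketch. It assumes that apparent errors are recorded with the TEI elements choice, sic and corr, and that persons and dates use persName and date. The paper does not name the elements or attribute values of the project schema, so the sample sentence, the misprint and the value "historical" are invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample: the sentence, the misprint and the type value are
# illustrative assumptions, not data from the Baedeker Corpus.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">Die Moschee wurde unter '
    '<persName type="historical">Sultan Ahmed I.</persName> im Jahre '
    '<date when="1616">1616</date> '
    '<choice><sic>vollendt</sic><corr>vollendet</corr></choice>.</p>'
)


def local_name(element):
    """Strip the namespace part from an ElementTree tag."""
    return element.tag.split("}")[-1]


def reading_text(root, use_corrections=True):
    """Flatten an element to text, keeping <corr> or <sic> from each <choice>."""
    skipped = "sic" if use_corrections else "corr"
    parts = []

    def walk(node):
        if local_name(node) != skipped:
            parts.append(node.text or "")
            for child in node:
                walk(child)
        parts.append(node.tail or "")  # text following the element always stays

    walk(root)
    return " ".join("".join(parts).split())


root = ET.fromstring(SAMPLE)
persons = [(p.get("type"), p.text) for p in root.iter(f"{{{TEI_NS}}}persName")]
dates = [d.get("when") for d in root.iter(f"{{{TEI_NS}}}date")]

print(reading_text(root))                         # corrected reading for retrieval
print(reading_text(root, use_corrections=False))  # reading as printed
print(persons, dates)
```

Keeping both readings in the markup is what allows the printed state to be preserved while searches run against the corrected form.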
Both XML and TEI are application-, platform- and vendor-independent open formats. There are thus numerous standard-compliant tools available for creating, managing, delivering, and displaying XML/TEI data, and many of them are free of charge. The overall project did not develop or introduce new standards or tools but rather used and combined existing and widely recognized components to create a new textual resource, which is its actual result.
In order to enhance comprehensive searchability, we added fundamental layers of linguistic annotation: segmenting the text into units called "tokens" (tokenization), reducing inflected word forms to a common base form called "lemma" (lemmatization), and identifying the word class of each token (part-of-speech tagging). For the latter, we referred to the Stuttgart-Tübingen Tag Set STTS, a quasi-standard for tagging German texts, which includes 54 categories and therefore offers the opportunity to make detailed distinctions. [28] We implemented this step by using the tokenEditor [29], a sophisticated tool developed at the ACDH, which imports TEI-conformant XML documents and automatically enriches tokenized input data with word class and lemma information. The tool's main and, to us, most important function is the possibility of manually correcting the linguistic information. In our data, mismatches are highly frequent due to historical orthography, numerous named entities and countless abbreviations. Thus, the various useful features of the editor such as filtering, sorting and batch assignment significantly ease manual post-processing.
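The three annotation layers and the review step can be illustrated with a small sketch. It is not the tokenEditor and does not reproduce its interface: the regular-expression tokenizer, the five-entry lexicon and the `needs_review` flag are assumptions made for illustration. Only the tag names (ART, VVFIN, APPR, NE, $.) are taken from STTS.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Tiny hand-made lexicon (an assumption): lower-cased form -> (lemma, STTS tag).
LEXICON = {
    "die": ("die", "ART"),
    "leben": ("leben", "VVFIN"),
    "in": ("in", "APPR"),
    "ceylon": ("Ceylon", "NE"),
    ".": (".", "$."),
}


@dataclass
class Token:
    form: str
    lemma: Optional[str]
    pos: Optional[str]

    @property
    def needs_review(self) -> bool:
        """Tokens without a tag are left for manual correction."""
        return self.pos is None


def tokenize(text):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def annotate(text):
    """Attach lemma and word class where the lexicon knows the form."""
    tokens = []
    for form in tokenize(text):
        lemma, pos = LEXICON.get(form.lower(), (None, None))
        tokens.append(Token(form, lemma, pos))
    return tokens


tokens = annotate("Die Wedda leben in Ceylon.")
unknown = [t.form for t in tokens if t.needs_review]
print(unknown)  # forms a human annotator would still have to check
```

In this toy run the group designation is the only form the lexicon cannot resolve, which mirrors the observation that named entities are a frequent source of mismatches.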
In parallel, we have been working on controlled vocabularies using the Simple Knowledge Organization System SKOS [30] for structuring and organizing significant semantic components. "SKOS is an area of work developing specifications and standards to support the use of knowledge organization systems (KOS) such as thesauri, classification schemes, subject heading systems and taxonomies within the framework of the Semantic Web." [31] SKOS adopts principles such as equivalence, hierarchical and associative relationships for expressing semantic structures, but in contrast to traditional approaches, this standard makes embedded knowledge explicit in a formalized, machine-understandable way. SKOS uses the Resource Description Framework RDF [32], another W3C standard, which provides a model for creating statements about resources and their properties in the form of subject-predicate-object expressions, also known as RDF triples. The statements the tribe (subject) has the name (predicate) Wedda (object) and the Wedda (subject) live in (predicate) Ceylon (object) are simple examples. "Asserting an RDF triple says that some relationship, indicated by the predicate, holds between the resources denoted by the subject and object. This statement corresponding to an RDF triple is known as an RDF statement" [33]. Resources, which can range from physical objects to abstract concepts or other entities like numbers and strings, can be subject and/or object of multiple statements, thus forming a semantic network. All of the components involved are identified by Uniform Resource Identifiers URIs [34], which enable information systems to retrieve entities and to interpret how they are interrelated. In addition, this allows for the combination of resources from different datasets in order to enhance semantic expressiveness. Thus, SKOS/RDF supports the publication, alignment, exchange, and reuse of machine- as well as human-readable vocabularies, e.g. as Linked Data [35].
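The two example statements can be written down as triples in a few lines. The sketch uses plain Python tuples rather than an RDF library. The namespace `https://example.org/baedeker/` and the predicate `livesIn` are hypothetical and are not taken from the project's vocabulary; the SKOS namespace and the property names `prefLabel` and `broader` are those of the W3C standard.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
EX = "https://example.org/baedeker/"  # hypothetical namespace for illustration

# Each statement is a (subject, predicate, object) tuple.
TRIPLES = {
    (EX + "Wedda", SKOS + "broader", EX + "tribe"),
    (EX + "Wedda", SKOS + "prefLabel", "Wedda"),
    (EX + "Wedda", EX + "livesIn", EX + "Ceylon"),  # hypothetical predicate
    (EX + "Ceylon", SKOS + "prefLabel", "Ceylon"),
}


def match(subject=None, predicate=None, obj=None):
    """Return all triples that agree with the given parts; None is a wildcard."""
    pattern = (subject, predicate, obj)
    return sorted(
        triple
        for triple in TRIPLES
        if all(want is None or want == have for want, have in zip(pattern, triple))
    )


def label(resource):
    """Resolve a URI to its preferred label, falling back to the URI itself."""
    hits = match(resource, SKOS + "prefLabel")
    return hits[0][2] if hits else resource


# Follow the link from the group to the place and read the place's label.
homes = [label(o) for _, _, o in match(EX + "Wedda", EX + "livesIn")]
print(homes)
```

Because the object of one statement (the URI for Ceylon) is the subject of another, following such links is what turns individual statements into the semantic network described above.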
The subtitle of the travel!digital-project "Exploring People and Monuments in Baedeker Guidebooks" summarizes the taxonomy strategy we pursued and implemented by using the SKOS standard described above: The focus on people/s and notable sights highlights two semantic fields that can be identified as prominent elements not only of travel guides but of cultural discourse itself. Both domains are essential for talking about culture and both of them are particularly suitable for data modelling.
Humans, for example, are addressed in a variety of different forms in the guidebooks. References to classes are frequently used in making generalizations about groups and about individuals as well. Group members are assumed to share the same characteristic features and may therefore represent the class as a whole. Concepts and labels include both nouns and adjectives, and associations among them are indicated by means of the property skos:related. Table 2 lists definitions of skos:topConcept(s) and shows selected examples of concepts, labels and terms.
The situation is similarly varied for monuments and notable sights. Since assessments and classifications of cultural heritage objects are integral parts of cultural representation, they form a separate unit in the concept scheme. Because of their sheer number, we have selected only the outstanding sights, which are marked as worth seeing with asterisks (*, **) in the printed editions. 6 The topical spheres here range from architecture 7 and artworks 8 to accommodations, landscapes, and breath-taking views.
The systematic recording of the extensive lexical inventory identified in the travel guides resulted in the Baedeker Thesaurus. To give an idea of its abundance, Fig. 2 shows a visualization of group designations based on two volumes, Asia Minor and India.
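How noun and adjective entries of such a thesaurus might hang together can be sketched as follows. The class names, the keys and the German word forms are invented for illustration and do not reproduce actual entries of the Baedeker Thesaurus. The sketch only mirrors the SKOS notions of preferred and alternative labels, a broader concept, and the symmetric property skos:related.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    pref_label: str
    alt_labels: set = field(default_factory=set)
    broader: Optional[str] = None
    related: set = field(default_factory=set)


class Thesaurus:
    """Toy concept scheme keyed by hypothetical concept identifiers."""

    def __init__(self):
        self.concepts = {}

    def add(self, key, pref_label, alt_labels=(), broader=None):
        self.concepts[key] = Concept(pref_label, set(alt_labels), broader)

    def relate(self, first, second):
        # skos:related is symmetric, so the link is stored in both directions.
        self.concepts[first].related.add(second)
        self.concepts[second].related.add(first)

    def lookup(self, surface_form):
        """Find the concepts whose labels match a word form from the text."""
        wanted = surface_form.casefold()
        return sorted(
            key
            for key, concept in self.concepts.items()
            if wanted in {l.casefold() for l in {concept.pref_label} | concept.alt_labels}
        )


thesaurus = Thesaurus()
thesaurus.add("groups", "Gruppenbezeichnungen")  # hypothetical top concept
thesaurus.add("araber", "Araber", alt_labels={"Arabern"}, broader="groups")
thesaurus.add("arabisch", "arabisch", alt_labels={"arabische", "arabischen"}, broader="groups")
thesaurus.relate("araber", "arabisch")

print(thesaurus.lookup("arabischen"))
print(thesaurus.concepts["araber"].related)
```

A lookup by inflected form followed by the related link leads from an adjective in the running text to the corresponding group designation, which is the kind of association the concept scheme is meant to make explicit.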

Data Representation in the Web Application
All of the components presented so far form essential parts of the web application we are creating. It is based on the corpus_shell framework [37] using the FCS/SRU protocol that is part of the CLARIN-ERIC 9 infrastructure, a Europe-wide initiative. [38,39] The online edition includes the digital texts together with their facsimiles and detailed metadata and combines source-oriented approaches with the applied semantic technologies. [cf. 40] It provides query capabilities inside both the text and the linguistic annotation layers. In addition, there are some classical indexes available including word forms, lemmas, word classes, and personal names. (See Fig. 3A) Optionally, their frequency distributions can be displayed in bar diagrams.
6 Although invented by John Murray, the asterisks became famous as Baedeker brands. Murray used two stars for inns he knew from personal experience, and one star for those recommended to him. It was Karl Baedeker who offered a whole system to lead first-time tourists to notable sites.
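A query against an FCS/SRU endpoint is an ordinary HTTP request with standardized parameters. The following minimal sketch builds such a request URL for the SRU searchRetrieve operation. The endpoint address and the search term are placeholders, and the sketch does not describe the actual configuration of the corpus_shell installation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_sru_url(endpoint, cql_query, start_record=1, maximum_records=10):
    """Assemble an SRU 1.2 searchRetrieve request for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return f"{endpoint}?{urlencode(params)}"


# Placeholder endpoint and term, chosen only for illustration.
url = build_sru_url("https://example.org/fcs", "Moschee", maximum_records=5)
sent = parse_qs(urlsplit(url).query)
print(url)
```

The request itself is not sent here; the response to such a request would be an XML document listing the matching records.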
The most important and innovative feature is the integrated Baedeker Thesaurus. The taxonomy gives a structured overview of the rich domain-specific vocabulary. It offers definitions, reveals relationships and adds further information using external data from the Linked Open Data LOD cloud. For practical reasons, all entries are linked to corresponding occurrences in the Baedeker Corpus in order to support efficient navigation and comfortable reading. (See Fig. 3B, 3C) By transforming names of people/s and sights in the travel guides into links to the LOD cloud, the thesaurus connects occurrences in the texts to other online resources, providing enhanced access to the Baedeker Corpus and additional information via the guidebooks' main actors. LOD datasets such as the Virtual International Authority File VIAF, the Thesaurus for the Social Sciences GESIS, the Art & Architecture Thesaurus AAT, the UNESCO Thesaurus, and DBpedia will open up fresh perspectives on the genre.
Aiming at easy access and data sharing, metadata creation is based on CLARIN Metadata Components. [41] Resource descriptions adapt relevant CMDI schemas according to the project's needs, providing detailed information on historical prints, the quality of the digital documents, the extent of annotations, and the applied tools. In the spirit of open access, all data will be available to others under a Creative Commons License [42] after the project's end in November 2017. Since the project is set in the framework of Austria's DARIAH-EU [43] and CLARIN-ERIC activities, corpus data and all reusable output such as the thesaurus, documentation, and schemas represent Austrian in-kind contributions to the European infrastructures. While much still remains to be done, we decided to give insight into our work at an early stage. Thus, as of December 2016, a test version of the Baedeker Corpus is accessible online. [44] It includes two volumes, Konstantinopel und Kleinasien (1905) and Indien (1914), and a first part of the taxonomy with about 400 sights and more than 1,000 group designations. In the course of 2017, the year of the 190th anniversary of the Baedeker Company, the remaining volumes of the corpus will be published gradually.

Conclusions
Along with the basic layers of linguistic annotation, the domain-specific vocabulary and content contextualization by Linked Open Data are appropriate and efficient instruments for exploring cultural diversity in historical travel guides from different angles. Digitally available data, fully equipped and freely accessible to interdisciplinary research, will make it easier to exploit such textual resources to the full. Without anticipating conclusions, the semantic data model presented in this paper aims at supporting fine-grained examinations of central components that have a lasting influence on cultural perceptions of "Other" and "Self". Thus, the intended audience for this new language resource may include literary studies, discourse analysis, lexicography, linguistics, cultural anthropology, and historical geography. 10 We expect that the applied semantic technologies have great potential to reveal much about a discourse that goes far beyond travel literature.