Channel: Work progress – Digital Intellectuals

Publication of my digital edition – Developing my TEI Publisher application

In a recent blog post, I introduced the ‘publication’ phase of the digital edition pipeline I am currently developing within the DAHN project. I explained how we are using TEI Publisher to publish our corpora (d’Estournelles’ letters and Berlin Intellectuals’ letters and texts). At that time, I was only discovering TEI Publisher and had barely scratched the surface of what it can do. My article ended with the generation of an application to host my edition, which still had to be modified and adjusted to what we want for our project.

In the few months since, I have read the documentation, explored the example websites provided by TEI Publisher, and tested ODD and template modifications with the example files given in the test area and with my own files. Finally, together with an intern (Manon) who was hired to help me shape our application, we made many modifications, added new features and transformed the generated generic standalone application into an instance dedicated to ego documents.

In this new post, we will not present every change we made for the application but key parts of those changes, as well as some personal features, sometimes less important in terms of running the application but interesting regarding content.

Major development for the application

First and foremost, we will present the elements that are essential to our application, that allow our corpus and its files to correctly and clearly appear for the visitor of the website, easing their way into our digital edition, one step at a time.

Choosing the right templates

In a web application like TEI Publisher, a template has to be used to display the content of the XML files that have been uploaded. A template is a model created to define how the data will be presented; it is an essential part of every website because it will decide how the user will see the information. 

When you generate your own application with TEI Publisher, it will require, in addition to the ODD, the template that will be set as a default for the website. This does not mean that we have to create it from scratch because many templates are offered with the “Playground” of TEI Publisher. Moreover, this is not a definitive decision, because once you set a template, you can still modify it and if needed, it is possible to add others, just like we did.

At the moment, our web application is composed of six templates, each with a specific task. The most important template is the default one, “letter.html”, created to display the content of our corpora, which are mostly correspondence. This is a fairly simple template, with three main elements: text, facsimile and metadata.

Display of an XML file by the default template

Since it is the default template, every time a new XML file is added to our application, it will be displayed with it, unless we declare otherwise.

However, our application is not only composed of files from the actual corpus but also of adjacent files, such as indexes and documentation, which require a specific template because “letter.html” is not suitable. The template “documentation.html” is only used for two XML files of our application: the encoding guidelines and the application documentation. This template is simply made of a table of contents and the content of the file. The table of contents can be expanded down to the lowest level in the hierarchy of the documentation, and every entry is a link that leads directly to that part of the text.

Part of the documentation

The last major part of our application is the indexes (persons, places, organizations, bibliographies). The general presentation is the same for all these documents: a “table of contents” listing all the entries in the index file, followed by a detailed list of all the information included in these entries. Nonetheless, we made a template for each kind of index because we do not always want to show that information in the same way. For places, we split the entries into two columns and insert a map on which each place can be found thanks to its geographical location. For persons, there are three columns of entries, sorted alphabetically. For organizations, it depends on whether the geographical location is available in the file: if so, the display is similar to the place template but with only one column; if not, it is similar to the person template but with two columns. Finally, the index of bibliographies consists of a single column with all the information available about the book, article, text, etc. for each entry.

Template for the index of places

These templates should be useful for users who want to add their corpus and its indexes to our website, because they can choose among several ways of displaying their indexes.

Adding new information with multi presentation

We have now chosen our templates and we have one defined for the files in our corpus, which, as already mentioned, presents facsimile, text and metadata. However, the text part still carries a lot of information, maybe too much for the visitor. Indeed, the encoding of a text does not only contain tags for paragraphs, titles, etc., i.e. the text structure, but also annotations and detailed information such as named entity tagging (like the entities listed in the indexes), notes (from the author but also from the archivist or some editors in the project) and some specificities inherent to manuscripts (deletions, additions, gaps, unclear words, etc.).

Luckily, TEI Publisher offers the possibility to deal with this mass of information via multi-panel presentations. The idea and technique have been used in the seminal digital edition Van Gogh Letters, which is presented in the “Demo collection” and can be easily grasped thanks to the available ODD (“vangogh.odd”) and the attached template (“vangogh.html”). Multi-panel presentation works with a system of modes, which are declared in the ODD. We specify that, in a given mode, a tag is displayed one way, and in another mode it is displayed another way. Finally, in the template, each panel is assigned a mode that determines how the tags are displayed in it.
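To make the idea of modes more concrete, here is a minimal Python sketch of the principle only. It is not how TEI Publisher processes an ODD: the rule table, the HTML wrappers and the sample tokens are all invented for illustration, and only the three mode names come from our application.

```python
# Hypothetical rendering rules: (TEI element, mode) -> how the text is shown.
# In TEI Publisher these choices live in the ODD; this table is illustrative only.
RULES = {
    ("del", "corrections"): lambda text: f"<s>{text}</s>",
    ("persName", "entities"): lambda text: f"<mark>{text}</mark>",
}


def render(element: str, text: str, mode: str) -> str:
    """Render one element for a given mode; fall back to plain text."""
    rule = RULES.get((element, mode))
    return rule(text) if rule else text


def render_panel(tokens, mode: str) -> str:
    """Render the same sequence of (element, text) pairs for one panel."""
    return " ".join(render(element, text, mode) for element, text in tokens)


# Invented sample: a person name, a deleted word and an ordinary word.
tokens = [("persName", "Anna"), ("del", "wrote"), ("w", "writes")]
for mode in ("entities", "corrections", "notes"):
    print(mode, "->", render_panel(tokens, mode))
```

The same tokens give a different display in each panel, because each panel passes its own mode to the rendering step.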

For the documents themselves, we decided on three modes and panels, corresponding to the three kinds of information mentioned earlier: notes, corrections and entities. This is especially useful because much of the data provided in those modes is shown in a bubble when the pointer hovers over the word concerned. However, some words in the text can carry a mix of the three kinds of information (for example, a person name that was misspelled by the author and that required extra information from a member of the project). Without the modes, all of this information would appear in separate bubbles overlapping one another. With the modes and the panels, the user can choose which data they want to see while browsing each page.

Same text with each presentation

 

For the corpus files, we also added a multi presentation for the “illustration” part of the template, i.e. where the facsimile is shown. The user can now choose to display a map instead, on which the places mentioned in the document are pinpointed. We are also thinking of adding other features. This idea of multi presentation is not limited to corpus files: we are also considering it for indexes, such as the person index, which usually contains, in addition to the entries for the persons, <relation> tags that specify the links between the different entities. The idea, in this case, would be to show the entries in one panel and display a graph of the relations in another panel. However, this is still in the development phase.
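As this feature is still being developed, the following is only a sketch of its first step: turning <relation> tags into a list of edges that a graph could be drawn from. It assumes the standard TEI attributes `name`, `active`, `passive` and `mutual`; the identifiers and relation names in the sample are invented.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sample data; real identifiers would come from the person index.
SAMPLE = """<listRelation xmlns="http://www.tei-c.org/ns/1.0">
  <relation name="parent" active="#p1" passive="#p2"/>
  <relation name="spouse" mutual="#p1 #p3"/>
</listRelation>"""


def relation_edges(xml_text: str):
    """Return (source, relation name, target) triples for every <relation>."""
    root = ET.fromstring(xml_text)
    edges = []
    for rel in root.iter(f"{TEI}relation"):
        name = rel.get("name", "")
        mutual = rel.get("mutual")
        if mutual:
            # A mutual relation links every participant to every other one.
            ids = mutual.split()
            edges.extend((a, name, b) for a in ids for b in ids if a != b)
        else:
            # A directed relation goes from each active to each passive entity.
            for a in rel.get("active", "").split():
                for b in rel.get("passive", "").split():
                    edges.append((a, name, b))
    return edges


for edge in relation_edges(SAMPLE):
    print(edge)
```

A list of triples like this is enough input for most graph-drawing libraries, which is why it seems a reasonable intermediate format.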

Establishing the facets for the corpus

Facets for the Berlin intellectuals corpus

A corpus can consist of many documents, adding up to 10 or 20 pages when the whole collection is listed. This can be impractical, especially when the visitor wishes to find one particular file quickly: browsing through all the pages would be time-consuming. To help with that situation, TEI Publisher offers the possibility to filter the results within the collection through the use of facets. They are automatically added when the application is generated, but the only default filters are languages and keywords. However, it is possible to go into the code of the application and add your own filters, by selecting the relevant tags in the header. For our web application, we have chosen to classify by author, recipient (considering we deal with letters), place of writing, date of writing (by year, then month, then day), languages, conservation site and editorial status of the letters (transcribed, unfinished or approved).

The advantage of those filters is that they only appear in the interface if the tags called in the code are present in at least one of the files in the collection. This is why, even though keywords are still declared, they don’t appear in the facets: they are not part of the encoding of the corpus files. This keeps the interface from being overcrowded with useless information, which would defeat the purpose of the facets.

Those facets nonetheless have a flaw: it is not possible to choose more than one option for each filter. For example, suppose you have a corpus written between 1914 and 1918: if you choose to see all the letters from 1914, you cannot also add the letters from 1915 to your filter. You have to uncheck “1914” and check “1915”. We are trying to fix this problem and, more broadly, to address it as part of a wider filter-related challenge. It is still possible to combine several filters and select, for example, one author in particular, with one date and one place of writing, to narrow the search.
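The behaviour of the facets described in this section can be summed up in a small Python sketch. It only mimics the logic (empty dimensions are hidden, one value per filter, filters combined with "and"); it is not the code of TEI Publisher, and the letters, field names and values are invented.

```python
from collections import Counter

# Invented header values for three letters; "keyword" is declared but never filled.
LETTERS = [
    {"id": "l1", "author": "A", "year": "1914", "place": "Paris"},
    {"id": "l2", "author": "A", "year": "1915", "place": "Berlin"},
    {"id": "l3", "author": "B", "year": "1914", "place": "Paris"},
]
DIMENSIONS = ["author", "year", "place", "keyword"]


def facet_counts(docs, dimensions):
    """Count values per dimension; a dimension without any value is left out."""
    counts = {}
    for dim in dimensions:
        values = Counter(doc[dim] for doc in docs if doc.get(dim))
        if values:
            counts[dim] = dict(values)
    return counts


def apply_filters(docs, selection):
    """Keep documents that match every selected filter (one value per filter)."""
    return [
        doc for doc in docs
        if all(doc.get(dim) == value for dim, value in selection.items())
    ]


print(facet_counts(LETTERS, DIMENSIONS))
print([d["id"] for d in apply_filters(LETTERS, {"year": "1914", "place": "Paris"})])
```

Because `selection` maps each dimension to a single value, selecting both 1914 and 1915 is impossible in this model, which is exactly the limitation discussed above.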

Creating a validator adapted to our encoding guidelines

To be sure that every element above works fully, we need a validation mechanism that rejects every file that would not be displayed properly. Two validation methods already exist in TEI Publisher: explicit validation and implicit validation. Explicit validation uses extension functions and requires programming and XQuery knowledge to run correctly. Implicit validation is triggered when new files are inserted into the database; all its mechanisms are already implemented, and it only needs to be configured properly.

We chose implicit validation because it is much simpler to implement: during upload, TEI Publisher checks whether the input files conform to the given schemas. If they don’t, they are simply not added to the database. We just had to change these schemas so that they fit our encoding perfectly. To do so, we converted our pre-existing validation schema into the schema format understood by TEI Publisher, and then added it to the validation schemas already present in TEI Publisher.
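The principle of implicit validation, a check that runs at upload time and silently refuses invalid files, can be illustrated with a toy Python sketch. Note the assumption: real validation is done against a schema, whereas this stand-in only checks well-formedness and the presence of two required elements. The `REQUIRED` list, the `upload` function and the in-memory "database" are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
REQUIRED = ["teiHeader", "text"]  # hypothetical stand-in for a real schema


def is_valid(xml_text: str) -> bool:
    """Toy check: well-formed, rooted in <TEI>, with the required children."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    if root.tag != f"{TEI}TEI":
        return False
    return all(root.find(f"{TEI}{name}") is not None for name in REQUIRED)


def upload(database: dict, name: str, xml_text: str) -> bool:
    """Store the file only if it passes validation; report what happened."""
    if not is_valid(xml_text):
        return False
    database[name] = xml_text
    return True


db = {}
good = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/><text/></TEI>'
bad = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text/></TEI>'
print(upload(db, "good.xml", good), upload(db, "bad.xml", bad), sorted(db))
```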

Personalisation for an improved, more personalized version

Now that we have managed to deal with the most important part of our application and the way it runs, we can work on more trivial and secondary topics, such as the appearance of our website, the output formats or the optional information that we can provide to our users.

IMPROVING AND ADDING TO WHAT ALREADY EXISTS
Working with the CSS

On a website, two main elements need to be defined in order to display the content properly: templates and stylesheets. We have already detailed the templates we created to display our diverse XML files; we will now present the style we decided to establish for the many HTML tags.

When the application was generated, a CSS file, “theme.css”, was created; it is used for all the templates called in our application. This file contains the same CSS used for TEI Publisher, with its options, colors, links, etc. Although it is a perfectly usable file, we decided to rewrite it completely in order to achieve two things: to discover how the website works, by searching for every HTML tag and changing its style to see the impact on the website, and to personalize our application in all of its details so that the result fits our theme and our liking.

The “theme.css” file is long and extensive: it contains many declarations and covers a lot of HTML elements, some of which are not self-explanatory or whose mention in a template is not necessarily expected. Even though we wanted to adapt the style to our liking, we did not want to change absolutely everything in this file, so we had to work out which statement referred to what in order to modify the right tags. Fortunately, some parts of the CSS were grouped by type of element (menubar, toolbar, table of contents, etc.), which made it easier to adapt them in a new CSS file named “pec.css”. In this file, we defined the default display for every tag used in our web application. However, we sometimes changed those settings within a specific template: a tag used in the corpus template may not have the same importance in an index template, so its style needs to change in one of those cases. Fortunately, every template can declare a <custom-style> that overrules the declarations made in the main CSS file (if the same tag is styled in both the main CSS file and the template, the style declared in the template is the one displayed).

So, even though we had to go into detail to choose a style for every template, it allowed us to really put our mark on the web application, which now reflects our editorial choices on every page.

Offering multilingualism

TEI Publisher is a publishing toolbox available to all, no matter the country you work from or the language you speak. The documentation and most parts of the website are only available in English, but other parts, notably the “Demo collection” and the “Playground”, can be found in several languages (English, French, German, Polish, Greek, etc.), which makes TEI Publisher easier to use because users can display it in a familiar language. Once you generate your own application, the templates and default pages are also available in a variety of languages. However, as soon as you change the content of one of the tags, whether it is a title, a small paragraph or a header, you have to set up new translations; otherwise that part stays in its initial language, whatever language option is chosen (English is our preferred initial language, as a world language spoken by many).

Implementation of the multilingualism module

Example of the encoding of the multilingualism module (English)

However, it is possible to add your own translation files to support the multilingualism of your website. Just as TEI Publisher does, all you need to do is create an identifier for each part you want to translate and add the translations to the dedicated file(s).

Example of the encoding of the multilingualism module (French)

For our website, we translated the parts containing identifiers into English, French and German, as these are the three main languages in which the texts we edit were written. So, even though we have only three languages set up, they were chosen with our target audience in mind.
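The mechanism of identifiers and translation files, including what happens when a translation is missing, can be sketched in a few lines of Python. This is only an illustration of the lookup logic, not the actual module used by TEI Publisher; the identifiers and the translated strings are invented.

```python
# Invented translation tables, one per language, keyed by identifier.
TRANSLATIONS = {
    "en": {"menu.home": "Home", "toolbar.cite": "Cite this document"},
    "fr": {"menu.home": "Accueil", "toolbar.cite": "Citer ce document"},
    "de": {"menu.home": "Startseite"},  # "toolbar.cite" not translated yet
}


def translate(key: str, lang: str, default_lang: str = "en") -> str:
    """Look up an identifier in the chosen language, then in the default one."""
    for table in (TRANSLATIONS.get(lang, {}), TRANSLATIONS[default_lang]):
        if key in table:
            return table[key]
    # No translation anywhere: show the identifier itself.
    return key


print(translate("menu.home", "fr"))
print(translate("toolbar.cite", "de"))
```

The second call falls back to English, which mirrors what a visitor sees when a part of the page has no translation in the selected language.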

Proposing various outputs for the corpus

TEI Publisher is a great tool because it not only displays XML files as HTML pages, but also exports them in different output formats. In theory, there are six: XML, XSL-FO, PDF (from XSL-FO), ePub, TeX and PDF (from LaTeX). In practice, not every output is easy to export: the XML can be opened in eXide but is not directly downloadable; the XSL-FO is useful for seeing how the transformation from XML to FO was carried out, but is not really useful as an export; and the PDF from LaTeX does not seem to work, either for our web application or for the example files in TEI Publisher, unless we manage to adjust it as suggested in an issue on the GitHub repository of TEI Publisher [1]. The XSL-FO output also ran into issues, but only in particular cases that we singled out in our application.

When downloading some of our XML files, we sometimes get an error message telling us that the XSL-FO tree is empty, meaning that TEI Publisher was not able to do the transformation. After some investigation, we discovered that this happens only when certain tags are present in the XML tree, namely elements from lists and tables (list, item, table, row, cell, etc.). This issue has been reported [2] and we are currently waiting for a reply and a fix.

For the other XML files, which include neither lists nor tables, the output works well and renders nearly all the elements of the XML. The same goes for ePub and TeX. However, even when all the elements are rendered, the presentation is not always satisfactory. TEI Publisher lets us work on the output and improve it: in the ODD, when declaring an element, we can choose which output the declaration targets. If we don’t specify anything, the declaration is applied to every output (web, fo, tex, epub), but we can also restrict it to a specific one, ePub for example. This way, we can define every output in detail, whether for corpus files or for indexes. Moreover, the output usually only prints the body of the text, but we may want to include information from the metadata in the header, which can also be set up in the ODD.
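The rule "a declaration without output applies everywhere, a declaration with an output applies only there" can be expressed as a simple lookup. The Python sketch below is a simplification made for illustration (the actual processing of an ODD is richer than a dictionary lookup), and the declarations in the table are invented.

```python
# Invented declarations: (element, output) -> behaviour.
# An output of None stands for "no output specified", i.e. every output.
DECLARATIONS = {
    ("head", None): "heading",
    ("head", "epub"): "block",
    ("note", None): "note",
    ("note", "fo"): "inline",
}


def behaviour_for(element: str, output: str):
    """Prefer the declaration made for this output, else the generic one."""
    specific = DECLARATIONS.get((element, output))
    if specific is not None:
        return specific
    return DECLARATIONS.get((element, None))


for output in ("web", "fo", "tex", "epub"):
    print(output, behaviour_for("head", output), behaviour_for("note", output))
```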

Thanks to all those little tweaks, we have control over our web application, from the display of the file to the export in multiple formats, and we can always adjust it in the way that we prefer.

CREATING NEW PERSPECTIVES FOR THE APPLICATION
Improving the user experience

The web application generated by TEI Publisher is pretty basic at first. You usually have a homepage that displays the content of your corpus with facets, and once you are in a document, you have the option to zoom in or out, and to navigate between the pages, but that’s pretty much it. This is a simple canvas that we decided to greatly develop in order to offer a better user experience and a more practical website.

Firstly, we improved the navigation inside the application. Since we have multiple collections, we thought it would be easier to access them directly from the menu bar instead of having to go back to the homepage every time. Therefore, next to the “Home” button, we added a “Corpus” button: a drop-down menu listing all the collections included in our application. Clicking on a collection leads to a page containing links to “corpus”, “indexes”, “history” or “about” (the last two are presented in the next part). We can then quickly navigate to another collection, and a new item can be declared in the menu whenever a collection is added. Next to that button, we added a “Documentation” button (also a drop-down menu) that offers three links: two lead to the documentation that is part of the application, presented later in this post, and the third leads to the TEI Publisher documentation available on the web.

Menubar and the use of the different buttons

Then, once in a document, we added a toolbar that provides additional information about the file itself and how it is displayed: access to the metadata found in the header of the XML file (author and date of the file, data about the document, history of the collection, encoding team and rules); the licence attributed to the file; and the information needed to cite the file correctly (title, author, responsible editor, encoding project, file version, link). There is also a legend for the colors used in the templates (which mark specificities inherent to manuscripts and named entities). These are small pieces of extra information that can be useful for the visitor and improve their experience.

Toolbar and the use of the different buttons

Finally, as our web application is specialized in ego documents, we added an additional navigation to our main template, linked to a specific tag in the XML file. Indeed, a correspondence is usually composed of several documents, presented in a certain order that is indicated by a number in the file or by an obvious timeline. The <correspContext> tag allows the encoder to indicate, in a given file, which document was written just before and which one follows right after. We use this tag in our template to go directly to the previous or next letter in that timeline, which is especially useful if you have a corpus that includes several authors and you want to stay within a specific part of the correspondence.
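As an illustration of what this navigation relies on, here is a short Python sketch that reads the previous and next letters from a <correspContext>. It assumes the links are encoded as <ref type="prev"> and <ref type="next"> with a `target` attribute, as in the examples of the TEI Guidelines; the file names are invented, and this is not the code of our template.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: the file names are placeholders.
SAMPLE = """<correspDesc xmlns="http://www.tei-c.org/ns/1.0">
  <correspContext>
    <ref type="prev" target="letter_011.xml"/>
    <ref type="next" target="letter_013.xml"/>
  </correspContext>
</correspDesc>"""


def neighbours(xml_text: str) -> dict:
    """Return the targets of the previous and next letters, if declared."""
    root = ET.fromstring(xml_text)
    links = {"prev": None, "next": None}
    for ref in root.findall(".//tei:correspContext/tei:ref", NS):
        kind = ref.get("type")
        if kind in links:
            links[kind] = ref.get("target")
    return links


print(neighbours(SAMPLE))
```

A letter at the start or end of a correspondence simply has no "prev" or no "next" entry, so the corresponding link stays empty.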

Providing additional information

So far, we have mostly mentioned the changes and adaptations we made to display the corpora and the indexes of our projects correctly. However, we decided to do more and to provide further information, linked to the topic of the XML files or simply to the TEI Publisher instance itself.

First of all, when a project is uploaded, the documents added are the corpus and the indexes. Apart from some of the metadata, this does not give much information about the content of the files, their history or even the background of the project itself. To improve this, we decided to add two new templates inside a collection: one for the history of the corpus and one for the history of the project. The first presents a general summary of the content of the corpus; the second presents the objectives of the project and its contributors. Each is an HTML file, written like any web page. The format of this presentation is not limited to one style: we did not present the data about Paul d’Estournelles de Constant’s corpus and the Berlin intellectuals’ corpus in the same way. However, the CSS should be largely the same and can be copied from what has already been done. This is also a way to present the authors of the corpus, additional resources where information about the project can be found, or other kinds of useful information.

Then, we also wanted to provide helpful information about the web application itself, mostly about how to add data to it. As I already mentioned in my last post, I wrote guidelines for ego documents specifying encoding rules for those types of documents. The guidelines consist of a documentation of all the tags that can and should be used for the encoding of ego documents, and a schema that regulates which tags can be used and which cannot. This part is linked to the validator presented earlier in the article, because the declarations made in that document are what determines whether an XML file complies with our application or not. In addition to the guidelines, we also wrote documentation for the application itself. Our goal is to make it available to other people who have corpora of ego documents. Therefore, if we want them to use our web application without too much difficulty, we have to provide extensive documentation on how to use and personalize it. Thus, with those two documents, every person visiting our website will be well equipped to easily add their corpus of ego documents, provided that it is properly encoded.

Authors

This blog post has been written jointly with my intern, Manon Ovide, a 2nd-year master’s degree student in digital humanities at the University of Tours, hired to work on the “publication” phase of the DAHN project, i.e. the development of the TEI Publisher web application.

  1. Issue #47: Can Not download PDF (LaTeX Version) in Tei-Publisher application: https://github.com/eeditiones/tei-publisher-app/issues/47
  2. Issue #45: XSL-FO can’t handle <list> and <item>: https://github.com/eeditiones/tei-publisher-app/issues/45
