Encoding the corpus

I am now at the second part of the trinity previously mentioned in this article: the encoding. Some of the first posts (there and there) already covered part of this work, but they dealt mainly with the content of the XML tree used for each letter of the corpus. This post addresses the means used to transform the text transcription, obtained with the method described in the previous post, into an XML-TEI encoded text, which will then be used in the final post.

To do so, I have created a three-step workflow that allows an easy and quick encoding of the letters, one by one. The first step consists of creating the XML file, with the right template and almost all of the file's metadata. Next, I encode the text that I exported from eScriptorium. Finally, I insert the encoded text into the template, correct some mistakes and add the manual tags to the encoding.

First step: Creating the XML file

For a given corpus, two XML files that each contain one document from the corpus will not differ much in the content of their metadata (<teiHeader>). As I mentioned in this post, I made an XML tree template for my corpus, in which only a few metadata fields have to be filled in to give each file its own data (the template with no letter-specific data can be found at the end of this document).

Instead of having to fill the specific information manually, I developed a script, combined with a CSV file, that allows me to create all the files from the corpus (right now, this amounts to 520 letters) while filling some specific data in each document.

The CSV is a key element for the creation of the XML file. It contains the number of pages and two other elements, each repeated several times in different forms: the letter's number and its date. For each letter, I have three numbers: the number of the current letter, of the preceding letter and of the following letter. The date is given in four formats: YYYY-MM-DD, French, English, and French with no spaces, which I use to name my files. Finally, in that same last format, I have the dates of the preceding and following letters.


CSV file for the letters

With the help of a script that reads the CSV row by row and inserts the values, as variables and combinations of variables, into a given XML tree (the template), I am able to create one file per letter, named after the number and the date of the letter (format = LettreN_DDmonthYYYY). At this stage, I can already encode the following information in the <teiHeader>:

The title of the letter with its number and its date, both in English and French
The date of the letter in <docDate>, <origDate> and <date> from <correspDesc>
The preceding and following letter in <correspContext>
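A minimal sketch of such a script, in Python, could look like the following. The column names, the heavily reduced template (which is not a valid TEI header on its own) and the sample row with its dates are all assumptions made for illustration; they are not the actual CSV or template used for the corpus.

```python
import csv
import io
from string import Template

# Hypothetical, heavily reduced template: the real one holds the full <teiHeader>.
TEMPLATE = Template("""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <title xml:lang="fr">Lettre n°$number du $date_fr</title>
    <title xml:lang="en">Letter $number of $date_en</title>
    <correspDesc>
      <correspAction type="sent"><date when-iso="$date_iso"/></correspAction>
      <correspContext>
        <ref type="prev" target="Lettre${prev_number}_$prev_date.xml"/>
        <ref type="next" target="Lettre${next_number}_$next_date.xml"/>
      </correspContext>
    </correspDesc>
  </teiHeader>
</TEI>
""")

# Invented sample row; the column names are assumptions, not the real CSV header.
SAMPLE = """number,prev_number,next_number,pages,date_iso,date_fr,date_en,date_compact,prev_date,next_date
42,41,43,2,1915-03-12,12 mars 1915,"March 12, 1915",12mars1915,10mars1915,15mars1915
"""


def build_files(csv_text):
    """Return a {filename: xml_string} dict with one entry per CSV row."""
    files = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # File name follows the LettreN_DDmonthYYYY pattern.
        name = f"Lettre{row['number']}_{row['date_compact']}.xml"
        # Each $placeholder in the template is replaced by the matching column.
        files[name] = TEMPLATE.substitute(row)
    return files


if __name__ == "__main__":
    for name, xml in build_files(SAMPLE).items():
        print(name)
        print(xml)
```

Keeping every letter-specific value in the CSV means that the whole set of files can be regenerated at any time without touching the template.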


XML Tree created

Moreover, because this process is quick and easy (approximately 30 seconds to create the 520 files), I can run it multiple times without any issue. This is practical because one more important piece of information is added to the tree. Every time an action is taken on an XML file, it is good practice to encode a <change> tag in the <revisionDesc> to keep track of the process. In the XML tree used by the script, I wrote a <change> tag with the following form:

<change when-iso="" who="#floriane.chiffoleau">Creation of the file</change>

Since I don’t encode every letter in one session but rather a few letters at a time, it is important to stay consistent and to have a creation date for the file that corresponds to the day when I start working on the letter. So, every time I start working on some letters, I generate new templates with the current date in order to be thorough.
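As an illustration, filling the empty when-iso attribute with the current date could be done as follows. This is only a sketch: the function name creation_change is hypothetical, and the actual script may handle the date differently.

```python
from datetime import date

# The <change> line from the template, with a placeholder for the date.
CHANGE = '<change when-iso="{when}" who="#floriane.chiffoleau">Creation of the file</change>'


def creation_change(day=None):
    """Return the <change> element stamped with the given day (default: today)."""
    day = day or date.today()
    # isoformat() yields YYYY-MM-DD, which is what when-iso expects.
    return CHANGE.format(when=day.isoformat())


if __name__ == "__main__":
    print(creation_change())
```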

Once this is done, I have a proper XML file with mostly filled metadata fields and I now need to work on the body part of the file.

Second step: Encoding the text

To encode my text, I need a document in a text format. To get it, as I previously said, I export from eScriptorium the result of the process described in the previous article about transcribing the corpus.

It can be necessary to check the text for errors missed during the previous task. One is incorrect line numbering, an error sometimes made by eScriptorium (in that case, I just have to reorder the lines to follow the correct numbering and get a coherent text). The most frequent one is a word or phrase that was added above a line and needs to be moved into the line it belongs to, so that it does not disrupt the encoding.

After that, I can use the text tagging script I created to encode my corpus. It is a fairly simple script that works mostly with regular expressions and find/replace commands. As I have mentioned several times, my corpus contains numerous recurring elements, present in every letter (with some exceptions). The main problem is that those recurring elements are not always written in the same way.

For example, Paul d’Estournelles de Constant addresses every letter in the corpus to Nicholas Murray Butler. In every one of them, he addresses the letter directly to him with “à Monsieur le Président Nicholas Murray BUTLER”. However, while this set phrase is always present, its wording sometimes varies:

“à Monsieur le Président N. Murray BUTLER”
“à Monsieur le Président N.Murray BUTLER.”
“à Monsieur le Président Nicholas Murray BUTLER,”

The first name is not always written in full and sometimes only the initial is present; the set phrase can end with a comma, a period or nothing. This is where the regular expression is useful: I wrote it so that the script finds the phrase in every text, no matter how it is written. The same goes for the date and the page numbering, which change every time but keep the same structure. Then, with those regular expressions, I apply a find/replace command so that the script finds the exact expression in the text and encodes it with the corresponding tags. Finally, the script also tags the paragraphs, by finding line endings with a punctuation mark that indicates the end of a sentence, and replaces every newline with a line break (<lb/>), with a specification when a word was cut off at the end of the line by d’Estournelles.
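To make the idea concrete, here is a sketch of a regular expression that matches the four wordings of the set phrase quoted above. The pattern is my own reconstruction, not the one from the actual script, and the <salute> tag used as a wrapper is an assumption chosen only for the example.

```python
import re

# Matches the set phrase whether the first name is written in full or
# abbreviated ("N." with or without a following space), and with an
# optional trailing comma or period.
ADDRESS = re.compile(
    r"à Monsieur le Président\s+"
    r"(?:Nicholas|N\.)\s*"
    r"Murray\s+BUTLER"
    r"[.,]?"
)


def tag_address(text):
    """Wrap every occurrence of the set phrase in a (hypothetical) <salute> tag."""
    return ADDRESS.sub(lambda m: f"<salute>{m.group(0)}</salute>", text)


if __name__ == "__main__":
    variants = [
        "à Monsieur le Président Nicholas Murray BUTLER",
        "à Monsieur le Président N. Murray BUTLER",
        "à Monsieur le Président N.Murray BUTLER.",
        "à Monsieur le Président Nicholas Murray BUTLER,",
    ]
    for variant in variants:
        print(tag_address(variant))
```

The find step and the replace step are a single call here: the expression locates the phrase however it is written, and the replacement reuses the exact matched text inside the tags.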


Text format of the letter 120


Letter 120 after application of the text tagging script
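The paragraph and line-break tagging performed in this step can be sketched as follows. The rules are assumptions drawn from the description of the script: a line ending in ".", "!" or "?" closes a paragraph, every other newline becomes <lb/>, and a line ending in a hyphen is treated as a cut-off word, marked here with the TEI attribute break="no". The actual script may use other criteria, and the sample text is invented.

```python
import re


def join_lines(lines):
    """Join the lines of one paragraph with <lb/> elements."""
    out = lines[0]
    for nxt in lines[1:]:
        # A trailing hyphen is taken to mean the word continues on the next line.
        sep = '<lb break="no"/>' if out.endswith("-") else "<lb/>"
        out += sep + nxt
    return out


def tag_paragraphs(text):
    """Group lines into <p> elements, closing one at each sentence-final line."""
    paragraphs, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        current.append(line)
        if re.search(r"[.!?]$", line):
            paragraphs.append(current)
            current = []
    if current:  # text that does not end with sentence-final punctuation
        paragraphs.append(current)
    return "\n".join("<p>" + join_lines(p) + "</p>" for p in paragraphs)


if __name__ == "__main__":
    print(tag_paragraphs("Mon cher ami,\nje vous re-\nmercie.\nA bientôt."))
```

A rule this simple will sometimes start a paragraph where there is none, for instance when a sentence happens to end at the end of a line, which is one reason a later comparison with the image is needed.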

Third step: Correcting the encoding

The final step consists of corrections and additions to what has been encoded. First, I combine the results of steps one and two by inserting the tagged text into its corresponding XML file. I then have a script that corrects the errors made by the preceding script and closes open tags; however, one element has to be tagged manually beforehand, because I did not find a way to do it automatically: the letter title, located in the opener. Once this is done, I can apply my new script to the XML file.

This script combines find/replace commands, applied to the file as text in order to correct the mistakes made previously, with BeautifulSoup commands, applied to the file as XML in order to add new encoding information.

The mistakes made by the preceding script include improper paragraph tags added before the <salute> or the <closer>, which must be removed or they will break the encoding, and line breaks placed just before page beginning tags, which are redundant and out of place.
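These two corrections can be expressed as find/replace operations with regular expressions. The sketch below assumes that the improper paragraph tag is a stray opening <p> placed directly before <salute> or <closer>, and that the redundant line break is a plain <lb/> directly before a <pb> tag; the exact form of these errors in the real files may differ.

```python
import re


def fix_tagging(xml_text):
    """Remove two kinds of tagging errors left by the previous script."""
    # 1. A stray opening <p> immediately before <salute> or <closer>.
    xml_text = re.sub(r"<p>\s*(?=<(?:salute|closer)\b)", "", xml_text)
    # 2. A line break immediately before a page beginning; the whitespace
    #    between the two tags is kept as it is.
    xml_text = re.sub(r"<lb\s*/>(?=\s*<pb\b)", "", xml_text)
    return xml_text


if __name__ == "__main__":
    print(fix_tagging("<p><salute>Cher ami,</salute>"))
    print(fix_tagging('fin de page<lb/>\n<pb n="2"/>'))
```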

Then, with BeautifulSoup, I can automatically assign the page number to the <pb> tag, if it is given, and search for the place where the letter was written in order to add it to the <origPlace> tag and to <correspDesc>. While doing so, I add a new <change> tag to the <revisionDesc>, with a description of what was done (“First encoding of the transcription and some specific metadata”) and the current date (entered in the terminal when I run the script). If, after that, some uncorrected errors still make the file non-compliant, they need to be rectified manually.
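The sketch below illustrates two of these operations: filling <origPlace> and adding the new <change> tag. To keep the example free of external dependencies, it uses Python's standard xml.etree.ElementTree module instead of BeautifulSoup, so it is not the code of the actual script. The function name, the sample document and the place "Paris" are assumptions, and the place and date are passed as arguments rather than searched for in the text or typed in the terminal.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements without a prefix, as in the source files.
ET.register_namespace("", TEI_NS)
NS = {"tei": TEI_NS}

# Invented, minimal document used only to exercise the function.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader>
<origPlace/>
<revisionDesc><change when-iso="2021-01-15" who="#floriane.chiffoleau">Creation of the file</change></revisionDesc>
</teiHeader></TEI>"""


def record_first_encoding(xml_text, place, when):
    """Fill <origPlace> and append a <change> entry to <revisionDesc>."""
    root = ET.fromstring(xml_text)
    orig_place = root.find(".//tei:origPlace", NS)
    if orig_place is not None:
        orig_place.text = place
    revision = root.find(".//tei:revisionDesc", NS)
    if revision is None:
        raise ValueError("no <revisionDesc> in this file")
    change = ET.SubElement(
        revision,
        f"{{{TEI_NS}}}change",
        {"when-iso": when, "who": "#floriane.chiffoleau"},
    )
    change.text = "First encoding of the transcription and some specific metadata"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(record_first_encoding(SAMPLE, "Paris", "2021-01-20"))
```

Because ET.fromstring raises a ParseError on input that is not well-formed, a file with an unclosed tag is detected before any metadata is written to it.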

I now have a fully formed XML file, with adequate encoding and a complete tree. The last task consists of an image-to-text comparison. Some tags need to be changed if they have not been placed correctly, for instance when the beginning of a line has been treated as a new paragraph although it is not one. Peculiarities of the letter also need manual tagging, whether they are deleted and/or added words, underlined words or sentences, or elements added afterwards by d’Estournelles, such as handwritten characters correcting a word or handwritten notes. Lastly, I fill in the metadata that need more precision and, after a final proofreading, I can consider the encoding of the transcription complete.

With this, I now have a defined process to encode the letters swiftly and easily, with numerous automatic procedures that leave me mostly with a correction job, since every transformation of the files is done by one or several scripts. All I have to do is check that the files are correct and XML-TEI compliant. Once the transcription is encoded according to how the text is written in the letter, it reaches the first level of finalization of the transcription: “proposed”. The next level relates to what I will explain in the next and last part of this blog post series: the annotation.
