I work on a project called “Digital edition of historical manuscripts”, which aims to publish several corpora on a public platform. They all have one thing in common: they are ego documents, that is, personal writings such as memoirs, correspondence and similar documents. My own work focuses on one corpus, a correspondence from a century ago, written regularly during World War I by a French senator and Nobel Peace Prize winner to an American friend, the president of Columbia University and a future Nobel Peace Prize winner himself. These letters depict his vision of the war and its consequences, as a pacifist, a non-combatant, a politician and a patriot (I will get into the details in a future post).
The correspondence has to be made available on the platform, and that requires several tasks. My only source material is the set of photographs taken of the letters written by Paul d’Estournelles de Constant. I have to transcribe, correct and encode those letters, and then publish them on the platform.
Those tasks, not easy in a normal situation, are made even more difficult by the size of the corpus. Paul d’Estournelles was a rather prolific writer: he wrote 579 letters to his correspondent, Nicholas Murray Butler, from the start of the war, in August 1914, to the official cease-fire and the signing of the peace treaty, in June 1919. Furthermore, these letters are not homogeneous in length: some are 2 pages long, while others run to 28 or even 50 pages, which means even more work for me.
One of the first tasks I decided to tackle is the encoding of the correspondence. This encoding is not a one-step assignment but a multi-layered job whose steps must be carried out in a specific order, though not all in one go. I will talk here about the first steps I took. My goal, in the end, is to encode those letters with a Python script that relies on dedicated packages and regular expressions, to ease and speed up my work. To do so, I first need to establish an XML Tree model for the letters, so that I know which elements the script will have to write.
First, we have to consider the composition of a letter: the source material must be examined to identify what the encoding will need. Looking at several letters, we can see that some elements recur, as is common in letters. The layout and the structure don’t change, and Paul d’Estournelles starts all his letters the same way: the number of the letter, the letterhead, the date and place of writing, the title, the salutation, and then the text. This means that the beginning of the body encoding will include recurring elements that look like this:
<div type="transcription">
<pb n="1" facs=" .jpg"/>
<head rend="center"> </head>
<opener>
<dateline rend="align(right)"><placeName> </placeName>, <date when-iso="1919-02-04"></date></dateline>
<title rend="align(center)"><hi rend="underline"> </hi></title>
<salute rend="indent"></salute>
</opener>
<p>...</p>
</div>
Even though that part is pretty much settled, many more elements have to be taken into consideration to create an XML Tree model.
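As an illustration, here is a minimal sketch of how the recurring opening shown above could be generated with Python's standard `xml.etree.ElementTree` module. It is not the project's actual script: the function name `build_letter_skeleton`, its parameters and the placeholder values are assumptions made for the example. Only the tags and attributes come from the model above.

```python
import xml.etree.ElementTree as ET


def build_letter_skeleton(facs, head, place, iso_date, date_text, title, salute):
    """Return a <div type="transcription"> element holding the recurring opening."""
    div = ET.Element("div", {"type": "transcription"})
    ET.SubElement(div, "pb", {"n": "1", "facs": facs})
    ET.SubElement(div, "head", {"rend": "center"}).text = head

    opener = ET.SubElement(div, "opener")
    dateline = ET.SubElement(opener, "dateline", {"rend": "align(right)"})
    place_name = ET.SubElement(dateline, "placeName")
    place_name.text = place
    place_name.tail = ", "  # the comma between place and date in the model
    ET.SubElement(dateline, "date", {"when-iso": iso_date}).text = date_text

    title_el = ET.SubElement(opener, "title", {"rend": "align(center)"})
    ET.SubElement(title_el, "hi", {"rend": "underline"}).text = title
    ET.SubElement(opener, "salute", {"rend": "indent"}).text = salute

    ET.SubElement(div, "p")  # body paragraphs are filled in later
    return div


# Placeholder values only; none of them is taken from a real letter.
div = build_letter_skeleton(
    facs="PAGE.jpg", head="LETTERHEAD", place="PLACE",
    iso_date="1919-02-04", date_text="DATE AS WRITTEN",
    title="TITLE", salute="SALUTATION",
)
print(ET.tostring(div, encoding="unicode"))
```

Building the elements with a library instead of concatenating strings means that the output is always well-formed and that special characters in the text are escaped automatically.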
Second, in order to work properly and not be completely “alone”, I drew on a number of documents and websites to help me carry out the encoding:
- First of all, I needed to have the source material at hand to be able to encode anything. I chose two different letters, distant in time and each with some elements the other lacks, to get a broader view of what I have to encode.
- Then, I used an XML file from the digital edition ‘Letters and texts. Intellectual Berlin around 1800’, a project linked to ours that contains a number of files that will later be included in our digital edition. I followed most of the encoding used for those letters, since it contains nearly all the information and markup that I need.
- Together with that file, I also used the editorial guidelines written during the preparation of the ‘Letters and texts’ edition, as support and as an explanation of markup that I couldn’t understand or didn’t know how to use.
- To validate my encoding, I also checked a lot of information against the TEI guidelines, to make sure I was using each attribute or tag correctly.
- Finally, my source is a correspondence, and the TEI guidelines are still incomplete on that subject, so some elements can be difficult to encode (for example, the letterhead). To get answers on that specific point, I looked for information in the various articles of Encoding Correspondence, which offer different markup options for elements like the opener, the closer or pre-printed forms.
In a third and final step, with all this material collected, I could begin creating my model. To do so, I chose to encode letter number 477, to see what kind of information I would have to put in the text part, including line and page breaks and special markup for things like quotation marks:
- I copied and pasted the XML file from ‘Letters and texts’ and I removed all the text parts inside the tags.
- I did my encoding in two parts. First, I encoded the metadata. That’s an important part because most of the information will always be the same, no matter which letter is encoded. Then, I did the body; it is less generic because the text will usually differ quite a lot from one letter to another, but it lets me identify the recurring tags that I will have to handle with regex to be sure that the text is correctly encoded.
- Since I don’t work alone on this project and need to check some parts of my work with my supervisors, I added comments to the encoding to justify some decisions, either to explain why I used a tag or to flag that I didn’t know what to do and would like advice.
- One important element is missing from my XML Tree model: named entities, such as persons, places or organisations, and the references attached to them. I think that this can be addressed later, perhaps with another script, so I decided to leave them untagged.
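The first step of this list, emptying a borrowed XML file of its text while keeping its tags, was done by hand here, but it could also be scripted. The following is only a sketch under assumptions: `empty_template` is a hypothetical helper, and the sample string is an invented fragment, not a file from ‘Letters and texts’.

```python
import xml.etree.ElementTree as ET


def empty_template(xml_string):
    """Parse an XML string and drop all character data, keeping tags and attributes."""
    root = ET.fromstring(xml_string)
    for element in root.iter():
        element.text = None  # text before the element's first child
        element.tail = None  # text following the element's closing tag
    return root


# Hypothetical fragment standing in for a borrowed file.
sample = '<div type="transcription"><p>Some <hi rend="underline">text</hi> here</p></div>'
template = empty_template(sample)
print(ET.tostring(template, encoding="unicode"))
```

Note that the default ElementTree parser discards XML comments, so any explanatory comments would have to be added to the model after this step, not before it.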
With all of that taken into consideration, I now have an XML Tree model. However, building that model raised some questions that will affect the transformation from a text file to an XML file: some elements in the text part are so specific that it will probably be impossible to handle them in the script. It will therefore be necessary to keep a list of all those elements, to help with the review and correction of each XML file once it has been encoded.
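To make the idea of such a review list more concrete, here is a rough sketch of what a regex-based pass could look like. Everything about the transcription format is an assumption made for the example: I suppose that a line reading `[page N]` marks the start of page N, and that anything else left in square brackets is a note the script cannot encode. The names `encode_lines`, `PAGE_MARK` and `LEFTOVER` are hypothetical. `<pb/>` and `<lb/>` are the TEI elements for page and line breaks.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical transcription conventions, not taken from the project:
#   "[page N]" alone on a line marks the start of page N
#   any other text in square brackets is a note the script cannot encode
PAGE_MARK = re.compile(r"^\[page (\d+)\]$")
LEFTOVER = re.compile(r"\[[^\]]*\]")


def encode_lines(text):
    """Return (xml_lines, review) for a plain-text transcription."""
    xml_lines, review = [], []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        page = PAGE_MARK.match(line)
        if page:
            xml_lines.append(f'<pb n="{page.group(1)}"/>')
            continue
        if LEFTOVER.search(line):
            # Too specific for the script: record it for manual correction.
            review.append((number, line))
        xml_lines.append(f"<lb/>{escape(line)}")
    return xml_lines, review


sample = "[page 2]\nfirst line\nsecond [sketch in margin] line\n"
xml_lines, review = encode_lines(sample)
print("\n".join(xml_lines))
print("To review:", review)
```

The script encodes what it recognises and records the line number of everything else, so the review list is produced as a by-product of the conversion and does not have to be compiled separately.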