In all the posts I have published so far, I have mentioned multiple stages of my work, as well as some difficulties I encountered while trying to carry it out. More recently, I described the problems I ran into during the creation of my transcription model and ended by saying that I would have to keep training it until I reached something satisfying. After many (many) training iterations, I obtained an effective model that handles pretty much every specific part of a transcription, even though it still makes mistakes (which is normal, since it is quite difficult to get a perfect model). After also developing a segmentation model (see this article), I am ready to begin the next and heaviest part, the ultimate stage of my work, which will consist of a trinity:
- Transcription
- Encoding
- Annotation
I will not cover them all in one go, but rather write a post for each part. This post will focus on the first part of this trinity: the transcription.
The transcription part consists of turning an image of a text into machine-readable text, in a format that can be exploited afterwards. To do so, there are two steps: segmentation and text recognition. They are usually applied simultaneously with Kraken and other OCR software, but for my corpus I needed specific models for each part to get a transcription as precise as possible. That is why I worked a long time to train functional and efficient models, both for the segmentation (as explained in this article) and for the transcription (as explained in this article and in this one). Those models, along with the logs that document the steps taken to produce them, are available in the GitHub repository of the project.
However, even if this task can be broken down into two parts, those parts still need to be used together mid-task so that the results are as effective as possible. When used together, one can help improve the other and vice versa, as I will show later in this post.
Lastly, I think it is important to clarify one thing: most of what I will be doing for the transcription is done with eScriptorium, which I already presented in this article. My models work on the command line, but the GUI is there to ease transcription, so it is better to use it, for both accuracy and speed.
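For readers who would rather script this pipeline than go through the GUI, here is a minimal sketch of how the two models could be chained with kraken's Python API. It is only an illustration under my own assumptions: the file names are hypothetical and the exact API can differ between kraken versions, so check the kraken documentation before reusing it.

```python
from PIL import Image
from kraken import blla, rpred
from kraken.lib import models, vgsl

# Hypothetical file names; the actual models are available in the
# GitHub repository of the project.
im = Image.open("letter_page.jpg")

# Baseline segmentation with the custom segmentation model
seg_model = vgsl.TorchVGSLModel.load_model("segmentation_model.mlmodel")
segmentation = blla.segment(im, model=seg_model)

# Text recognition on the segmented lines with the transcription model
rec_model = models.load_any("transcription_model.mlmodel")
for record in rpred.rpred(rec_model, im, segmentation):
    print(record.prediction)
```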
The segmentation
To segment a text, there are two methods: manual segmentation and model segmentation. These methods are peculiar in that one's advantage is the other's disadvantage: manual segmentation is more precise than model segmentation but takes much more time, and vice versa.
I mainly used manual segmentation when I created my training data, because that work needed to be thorough: all the elements, whether the baselines or the content, would later be used for training (and because the segmentation model that I will talk about afterwards had not yet been created when I generated my training data). However, since the transcription needs to be as automated as possible, it was essential to have a model for it. The SCRIPTA team (the team that financed eScriptorium) developed a segmentation model, and although it was compatible with my corpus and worked pretty well, my team and I decided to develop our own model. That way, we increase the precision of our segmentation and we add a new element to our process (I will not go into the details of how I developed this model, since I already dedicated an article to it, but I refer you to it1).
Once I have produced my segmentation model, I can apply it to the letters with the “Segment” button of eScriptorium. The time taken to segment all the documents varies from one letter to another, because it depends on the content of the page (a full page of lines will take longer than just the end of a letter with two lines and a signature) and on the number of pages in the letter (a three-page letter will logically be faster to segment than a fifteen-page letter).
Before going into the double correction with the text recognition, I need to check the segmentation for easy-to-spot errors. Prior to the segmentation model, one of the most common mistakes was the departmental archives stamp being treated as part of the text to segment (since something is written in it), or the symbols used to separate two paragraphs in some letters. However, and it is partly for this reason that I did the training, the new model leaves those parts out, because it learned during training that they were not among the elements I wanted segmented.
Though the model no longer makes those mistakes, it still makes some that need to be corrected before importing the transcription. One of the most problematic segmentation errors is a baseline that does not go all the way to the end of the line.
As is visible in the image above, the model does not always segment all the words at the end of lines, as can be seen here with “Belgique”, “hécatombes”, “existences” and “n’ya”. This will be a problem later on: either the word is just missing its plural “s”, which will not be flagged as an error when I correct the text but will give me a wrong transcription, or a key element of the word will be missing and it will confuse the model during the text recognition. As demonstrated in the image below, some words lose their plural form and others, like “Belgique”, are not recognized at all. It is necessary to fix the segmentation so that all the words are properly taken into account during the text recognition.
After that, I can also correct other small issues visible from the segmentation alone, so that all the obvious rectifications are made before moving on to the double correction. For example, the model sometimes splits one line in two, so I need to replace those with one long line to keep the segmentation consistent. Finally, the segmentation sometimes needs additions, because the model may have missed some elements, such as a word added above a sentence.
The double correction
Then, I move on to the next part, where segmentation and text recognition work together to improve the segmentation as much as possible, so that the transcription is as precise as it can be. This part is done in two steps, which need to be repeated twice:
- apply the transcription model
- correct the segmentation
This is a basic task, but the transcription (once a proper transcription model adapted to the corpus has been developed) helps a lot with the correction, because sometimes the model will wrongly transcribe a common line that it is supposed to know, and the segmentation will be at fault. That way, I can verify the segmentation and make sure that there will be fewer corrections to make on the transcription, and above all no transcription errors due to the segmentation.
Firstly, I need to take care of the irregularly positioned points in the baseline, which give the word an awkward shape and can sometimes confuse the transcription.
Then, I do the most important corrections, which I will explain with an example, illustrated below. I took the first line of the first letter of the d’Estournelles corpus, which is a dateline that reads “Lettre N° 1 (15 Août 1914)”.
When the model does the segmentation, it marks the baseline of the line, then its polygon and mask are built around it. In step 1, it is visible that the polygon and mask of the line are rather large and even overstep the separation line underneath. Once I apply the text recognition (step 2), the transcription makes no sense: almost all the characters have been wrongly transcribed, making it pure gibberish. This means that the segmentation needs to be corrected to obtain an accurate recognition. If I check step 1 again, I can find what needs to be changed. There are two problems, both related to the size of the polygon: it takes into account the separation line underneath as well as the lines from the verso that show through the page, and it tries to transcribe all of that along with the actual text. Those extra parts, taken together, explain why the transcription in step 2 was so poor.
It is however possible to improve that segmentation to obtain a smaller and more accurate polygon. If I manually insert a segmentation line above and below the line and recalculate the polygon, it gives me a mask that is better fitted and restricted to the one line I am interested in. I then just need to delete the newly added segmentation lines, and the result is step 3. Once this is done, I only have to apply the transcription model again to see if I get a better result, and step 4 shows that, indeed, this time everything has been adequately transcribed. This double correction is mostly used against the background noise that can get in the way of the transcription.
It is very important to follow a specific order for these two manipulations, and the correction of the irregular points must come first, because every time I touch the baseline, a new polygon is generated. So, if I spent a long time making sure the polygon was exactly as I wanted it, taking the background noise into account, changing a point in the baseline would create a new polygon, likely similar to the first one generated, which was unsatisfactory.
This double correction gives us the possibility to be very meticulous with our segmentation and to minimize word and character errors, which should not be a problem afterwards if the trained transcription model is good enough.
The text recognition
The final part of this task is the text recognition, which has already been initiated with the double correction. After working for a long time to develop a fitting transcription model, this task should not be too difficult, since the text recognition should already be pretty accurate thanks to the training. However, it still needs to be done, so I will now work on correcting the text, first semi-automatically and then manually, to have a precise verification.
The text format for the correction is XML ALTO, which I used to produce my training data. I will extract the @CONTENT attribute from the <String> tag to get the text from the file and make modifications.
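As an illustration of that extraction step, here is a minimal sketch in Python, assuming an ALTO file exported by eScriptorium (the file name is hypothetical; the real scripts are the ones in the GitHub repository):

```python
import xml.etree.ElementTree as ET

def extract_text(alto_path):
    """Rebuild the page text from an ALTO file: each <String> element
    stores its word in the @CONTENT attribute."""
    tree = ET.parse(alto_path)
    lines = []
    # The {*} wildcard matches any ALTO namespace version (Python >= 3.8)
    for text_line in tree.iterfind(".//{*}TextLine"):
        words = [s.get("CONTENT", "") for s in text_line.iterfind("{*}String")]
        lines.append(" ".join(words))
    return "\n".join(lines)

print(extract_text("page_001.xml"))  # hypothetical file name
```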
To rectify the text, I first work with the semi-automatic solution: two fairly simple scripts, one that searches for the wrong words and one that corrects them (both are available in a folder of the GitHub repository), with a manual task in between for the whole thing to work properly. The first script uses a spellchecker package. It parses a text, reads all the words, compares them to a word-frequency list submitted to it and registers as keys in a dictionary all the words it does not recognize. Then, using the Levenshtein distance principle2, it stores the best correction as each key’s value. It is then necessary to manually sort out the resulting dictionary: some words are considered wrong only because they do not exist in the word-frequency list, because they are names or surnames, or simply because they are cut at the end of a line. Once the dictionary contains only good corrections, I apply the second script, which is a simple search and replace: the text and the dictionary are read, and every time a key from the dictionary is found in the text, it is replaced by its value.
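To give an idea of how those two scripts work, here is a minimal sketch, assuming the pyspellchecker package and its built-in French word-frequency list (the function names are mine; the actual scripts are the ones in the GitHub repository):

```python
from spellchecker import SpellChecker  # pyspellchecker package

spell = SpellChecker(language="fr")  # ships with a French word-frequency list

def build_correction_dict(text):
    """First script (sketch): register unrecognized words as keys and the
    closest candidate (by edit distance) as their value."""
    corrections = {}
    for word in spell.unknown(text.split()):
        best = spell.correction(word)
        if best and best != word:
            corrections[word] = best
    return corrections

def apply_corrections(text, corrections):
    """Second script (sketch): plain search and replace, to be run only
    after the dictionary has been manually sorted out."""
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text
```

In practice, a whole-word replacement (with a regular expression, for instance) would avoid touching substrings of longer words, but the principle is the same.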
Following that work, I do a new correction, this time manually, comparing the text to the image, to be sure that everything has been changed appropriately and to obtain a word-for-word transcription. I correct words that do not correspond to what is written on the page but were not flagged as wrong because they exist in the dictionary, and I check the punctuation and change it if needed.
Finally, in order to keep track of this transcription, I import the corrected files back into eScriptorium in the XML ALTO format, with an explicit label such as “corrected_transcription”, to be safe. That way, those documents can be reused later, as training data or for other purposes if needed.
The first part of the trinity is now finished: after following all the previous steps, I am able to transform all the images from my corpus into a text format, with an adequate segmentation and an accurate transcription, stored in a personal account on eScriptorium. This will then allow me to export the documents again, this time in plain-text format, which gives not one file per page, as XML ALTO does, but one file containing all the text. That file will subsequently be a key element for the next part of the trinity: the encoding.
- How to produce a model for the segmentation: link
- A measure of the difference between two sequences: the number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.