In order to facilitate the automatic transcription of d’Estournelles’ correspondence, we identified the need to develop a dedicated segmentation model, in addition to a transcription model. This post builds on several elements already covered in this article, so some details will not be developed again here.
Just as I explained for the transcription model, there are two ways to produce a segmentation model:
- Creating a model with just the available training data
- Fine-tuning an existing model with the training data
Either way, we first need to create training data from our corpus. To do so, I am reusing the data I created for my transcription model, with the main difference that here we focus on the baselines, the polygons and the masks rather than on the text itself.
The training data are available in ALTO XML format, so the model training module will consider the @BASELINE attribute of the <TextLine> tag rather than the @CONTENT attribute (as it does for the transcription model).
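To illustrate where that attribute lives, here is a minimal, hand-written ALTO fragment (the coordinates and content are hypothetical, for illustration only, not an actual file from the corpus):

```shell
# Write a simplified ALTO fragment (hypothetical values)
cat > sample_alto.xml <<'EOF'
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page>
      <TextBlock>
        <TextLine BASELINE="120 340 850 345">
          <String CONTENT="Mon cher ami,"/>
        </TextLine>
      </TextBlock>
    </Page>
  </Layout>
</alto>
EOF

# Segmentation training reads the BASELINE coordinates, not CONTENT
grep -o 'BASELINE="[^"]*"' sample_alto.xml
```

The same file can thus serve both trainings: the transcription model reads @CONTENT, the segmentation model reads @BASELINE.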
In those training data, the segmentation is mostly manual. Each baseline follows its text line exactly, from the starting point to the end point, with intermediary points added when the text is curved, so that every part of the line is accounted for.
If the second method (fine-tuning) is chosen, we need a segmentation model to start from. In our case, we will work with “cBAD_27.mlmodel”, developed by the SCRIPTA team, a model that we already use on our corpus. It was used to produce part of the training data (85% manual segmentation, 15% “cBAD” segmentation). That automatic segmentation still required some corrections (such as removing the departmental archive stamp, present on almost every page of the corpus, as I mentioned here) so that the text segmentation is flawless, which is precisely why we want and need to fine-tune the model.
Now that we have everything we need, the training can start.
The first method (training from scratch) can be more demanding than the second if we want a conclusive result, because many documents and lines are needed for the model to be effective. For my first training, I had 80 documents, with about 25 lines for full pages and 5 to 10 lines for the first and last pages of the letters: the training ran for 13 epochs and produced a model designated as the “model_best”.
During a training, Kraken produces a model for every completed epoch. At some point, it considers that one of these models represents the best it could achieve in terms of learning; it names that model “model_best”, and this is usually the model we then use. However, when I later applied that “model_best” to one of the letters, there was no result: the segmentation was reported as done, but nothing appeared on the image. I concluded that there was not enough training data to train the model from scratch.
Later on, I added about 40 manually segmented documents to my training set, which gave me around 130 documents, and this time the resulting model did work. The training lasted 32 epochs, and there was a real difference from the previous run because the accuracy was not the same at all: the first “model_best” reached 15.49% accuracy, while the second reached 41.28%, a figure much closer to what the second method produces, as we will see next. I applied that model to a letter to test it, and the segmentation was quite good, which suggests that 130 documents are enough to train a model from scratch.
The second method (fine-tuning) is efficient with less training data, mostly because the base model is already well suited to our corpus and we only need to adjust it a little.
The training is run on a cluster with a GPU, so that the whole process is faster and the model is produced more easily. The file paths are stored in an XMLLIST file, a plain-text file that lists one XML path per line; in our case, it points to the ALTO XML files.
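Such an XMLLIST file can be built with a single shell command (the `data/` directory and the file names below are hypothetical, for illustration only):

```shell
# Create a few dummy ALTO files to stand in for the corpus (hypothetical layout)
mkdir -p data
touch data/letter_001.xml data/letter_002.xml data/letter_003.xml

# An .xmllist file is simply one XML path per line
find data -name '*.xml' | sort > segmentation.xmllist
cat segmentation.xmllist
```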
I will demonstrate how everything works with that second method. The command line for the training looks like this:
$ ketos segtrain -f alto --device cuda:0 -t segmentation.xmllist -e segvalid.xmllist -i cBAD_27.mlmodel
Breakdown of the command

ketos
: Model training command with Kraken

segtrain
: Segmentation training command with Kraken

-f alto
: Option that specifies the format of the training data. Here, we are using ALTO XML, but other formats are supported.

--device cuda:0
: By default, this option is set to ‘CPU’, so it is not needed when the training is done on a regular computer. However, when the training is done on a GPU, this option is required.

-t segmentation.xmllist
: The “-t” or “--training-files” option is used to call the training data. It can be given more than once, with additional paths. Here, my XML file paths are stored in the file “segmentation.xmllist”.

-e segvalid.xmllist
: The “-e” or “--evaluation-files” option specifies the data used to evaluate the model during the training. By default, Kraken applies a 90/10 partition and keeps 10% of the training data for evaluation, but with this option we can choose ourselves what Kraken will use for the evaluation.

-i cBAD_27.mlmodel
: The “-i” or “--load” option loads an existing model to continue training from. That way, we specify that the training must start from another model (not necessary when training a model from scratch).
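If we want to build the evaluation list ourselves rather than rely on Kraken’s default 90/10 partition, a simple split of the path list can be sketched like this (the file names match those used above, but the split logic is my own, not part of Kraken):

```shell
# Hypothetical full list of ALTO paths, one per line
printf 'doc_%02d.xml\n' $(seq 1 20) > all_pages.xmllist

# Keep 90% of the paths for training, 10% for evaluation
total=$(wc -l < all_pages.xmllist)
eval_count=$(( total / 10 ))
train_count=$(( total - eval_count ))

head -n "$train_count" all_pages.xmllist > segmentation.xmllist
tail -n "$eval_count"  all_pages.xmllist > segvalid.xmllist

wc -l segmentation.xmllist segvalid.xmllist
```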
Training results
- Training time: 10 minutes
- Number of epochs: 24
- Best model (“model_best”): model 14
- Reported accuracy: 49.02%
Application on the corpus
Firstly, when the model is applied to the corpus, we can notice how the baseline is positioned. The baseline produced by manual segmentation runs through the line, as if the text were struck through, while the baseline produced by the “cBAD” model sits lower. The new model seems to have retained the manual segmentation’s baseline placement from its training, and that is how it now appears.
Secondly, we can see the effectiveness of the new model. It correctly recognizes the text regions and lines, but it also managed to learn which parts of the segmentation we did not want to keep. The new model ignores the recurrent page elements that require neither segmentation nor transcription, elements that always had to be removed when segmented by the “cBAD” model (such as the archival stamp or the asterism).
Finally, when comparing the corrections needed on segmentations from the “cBAD” model and from the new fine-tuned model, the latter takes less time to correct, which proves, once again, its efficiency.