Through the DAHN project, we are developing a pipeline for the digital edition of archival documents, with TEI Publisher as the final publication step. This pipeline is divided into six steps, most of which I have already presented in previous posts: segmentation and transcription, with the difficulty of establishing a model for either of them, post-processing, with the transcription of my corpus, encoding, with the creation of my XML tree, and publication, with blog posts dedicated to the TEI Publisher app (first and second). However, there is still one step, the first in the pipeline, that I have never talked about in my articles: digitization. The main reason is that, until recently, we, the project members, hadn’t decided on the tool we wanted to use to store and disseminate the images of our corpus. We knew we needed IIIF technology to ensure high quality, but we didn’t know what we would implement to provide it. Eventually, we decided to use NAKALA, a tool created by Huma-Num that I have already mentioned several times.
What is NAKALA?
Where does it come from?
[Figure: NAKALA logo]
NAKALA is part of the tool suite created and launched by Huma-Num, an infrastructure that plays a key role in our project: we are already using one of its tools, Sharedocs, to store private data from the project, and we intend to deploy the TEI Publisher application through one of the publication services it offers.
Why and how to use it?
NAKALA is a tool created to share, publish and promote all kinds of digital data (images, sound, video, text, etc.). Huma-Num provides documentation for it that explains at length why it was created and what its components are: the interface, the API, the metadata, the identifiers, and information on how data is managed on the website (import, deletion, dissemination). Once you have been granted an account, all you need to do is log in, and you have access to the interface and/or the API (after generating an access key).
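For readers who go the API route, here is a minimal sketch of what an authenticated call can look like from a Python script, assuming the requests library, the api.nakala.fr base URL and an X-API-KEY header carrying the access key; the endpoint used for the test call is only an illustration, so check the NAKALA API documentation before relying on it.

```python
import requests

# Assumptions: production base URL of the NAKALA API and an access key
# generated from the user profile in the interface.
API_URL = "https://api.nakala.fr"
API_KEY = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # placeholder key

HEADERS = {"X-API-KEY": API_KEY}

# Any authenticated call can serve as a sanity check that the key works;
# the endpoint below is only an example.
response = requests.get(f"{API_URL}/users/me", headers=HEADERS)
print(response.status_code)
```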
Why did we choose it?
NAKALA was considered a perfect fit for several reasons. First of all, as I said earlier, Huma-Num is already integrated in our project, so it made sense to turn to something we had previously worked with. Then, we wanted a tool that offers IIIF links, to ensure that our images would be available in high quality with an easy-to-find identifier. Finally, we wanted an easily accessible data repository, one that everybody could use freely. The only requirement for getting an account on NAKALA is to justify the need for it, which means presenting a research project that requires importing images. In the case of our publication platform, anyone wishing to add new corpora to it should have an encoding project running, and therefore images to match the encoding, which gives them a reason to ask for access to NAKALA.
How to operate outside of NAKALA to gather metadata for the images?
My goal here is to present how to work with NAKALA. To do so, I will illustrate it with the Berlin Intellectuals corpus, which I added to NAKALA in its entirety using two methods that I will present one after the other. The Berlin Intellectuals edition is not composed of a single collection but of many, by several authors and with multiple types of texts and data. In that situation, and in order to ease my work (so that I don’t have to look up each time what kind of information I am supposed to add to the website), I established a few steps to follow before adding a new corpus.
My corpus had already been completely encoded, which is a plus because I knew exactly where to find the information I needed. NAKALA requires certain metadata to add new images (title, author, date, licence, type), and I can add others that I find useful or necessary. Fortunately, this kind of information is encoded with specific tags in TEI XML. All I had to do was create a script that extracts all this information, file after file, so that I can store it in a CSV afterwards. The script prints all the collected metadata in the terminal (one line = one file); then all you need to do is copy/paste it and save the file as a CSV1. The spreadsheet may need some cleaning up afterwards, whether because of an error or an omission in one of the XML files or simply a flaw in the terminal output. Once that is done, to better organize my work, I also decided to split this spreadsheet into several smaller ones, one for each author in my corpus (provided they are represented by more than five to ten letters in the corpus).
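As an illustration, here is a minimal sketch of such an extraction script in Python with lxml, writing the CSV directly instead of printing to the terminal; the XPath expressions, file paths and column names are only examples and have to be adapted to the actual encoding of the corpus.

```python
import csv
import glob
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
FIELDS = ["file", "title", "author", "date", "language", "licence"]

def first(tree, xpath):
    """Return the first match of an XPath query, or an empty string."""
    result = tree.xpath(xpath, namespaces=TEI_NS)
    return str(result[0]).strip() if result else ""

rows = []
for path in sorted(glob.glob("corpus/*.xml")):
    tree = etree.parse(path)
    rows.append({
        "file": path,
        # Illustrative XPaths: adapt them to the encoding of the corpus.
        "title": first(tree, "//tei:titleStmt/tei:title/text()"),
        "author": first(tree, "//tei:titleStmt/tei:author//tei:persName/text()"),
        "date": first(tree, "//tei:correspAction[@type='sent']/tei:date/@when"),
        "language": first(tree, "//tei:langUsage/tei:language/@ident"),
        "licence": first(tree, "//tei:availability/tei:licence/@target"),
    })

with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```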
[Figure: Extract from the CSV of Dorothea Tieck]
The metadata matches all the fields I planned to fill in NAKALA when adding my images: title (in three languages), genre (type), author, date, language(s), topic (keywords), rightsHolder and accessRights.
If I were planning to add my corpus via the interface, those would be all the steps I would need to take. However, considering the large amount of data I had to add, I chose to use the API; to simplify my work, I took the model found in the POST /datas part of the API and added my metadata to it with a script.
[Figure: Model for the upload in the NAKALA API]
The script contains a more elaborate version of this model, mostly because I have several elements to put in “metas”: while there is only one element in the model, I multiplied them in my file, one for each field I want to fill.
In the “files” part, I removed “description” and “embargoed” since I don’t need them for my upload, but this is up to the user (especially if you wish to delay the publication of the images); the rest of “files” stays unchanged, as this part will be dealt with during the import in the API.
The “collectionIds” part requires a collection to have been created beforehand in the interface, indicating the classification I want to give to my document. However, adding the data to a collection is not a mandatory step, even though it can be of help later on.
The “rights” part is a bonus that is only necessary if I want to add a user to my document. I am already considered the owner of the document, so I don’t have to put my ID in this part. In my case, I added another admin: a group I created, composed of all the members of the DAHN project.
Finally, before being able to generate this metadata, it is necessary to create, through the interface, the authors of the documents (with their ORCID) if they don’t already exist. By doing so, it becomes easier afterwards to add each author and the information needed for them to be valid, as shown in the script.
Once all of those steps are done, I can execute the script to generate my series of metadata and move on to the next step: adding new data to NAKALA.
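To make the shape of the generated metadata more concrete, here is a hedged sketch of how one record could be built in Python, following the model above; the property URIs, the structure of the author value, the role name and the exact spelling of the collection key are assumptions drawn from my reading of the NAKALA documentation and should be checked against the current API model.

```python
def build_record(row, sha1s, collection_id, group_id):
    """Build one POST /datas payload from a CSV row and the uploaded files."""
    metas = [
        {"propertyUri": "http://nakala.fr/terms#title",
         "value": row["title"], "lang": "de",
         "typeUri": "http://www.w3.org/2001/XMLSchema#string"},
        {"propertyUri": "http://nakala.fr/terms#type",
         # COAR resource type for "image" (placeholder value, to be checked).
         "value": "http://purl.org/coar/resource_type/c_c513",
         "typeUri": "http://www.w3.org/2001/XMLSchema#anyURI"},
        {"propertyUri": "http://nakala.fr/terms#creator",
         # Author entry created beforehand in the interface, with its ORCID.
         "value": {"givenname": row["givenname"],
                   "surname": row["surname"],
                   "orcid": row.get("orcid", "")}},
        {"propertyUri": "http://nakala.fr/terms#created",
         "value": row["date"],
         "typeUri": "http://www.w3.org/2001/XMLSchema#string"},
        {"propertyUri": "http://nakala.fr/terms#license",
         "value": "CC-BY-4.0",
         "typeUri": "http://www.w3.org/2001/XMLSchema#string"},
    ]
    return {
        "status": "pending",  # deposit as private; "published" makes it public
        "metas": metas,
        # One entry per uploaded file, identified by the sha1 returned by the API.
        "files": [{"sha1": sha1} for sha1 in sha1s],
        # Collection created beforehand in the interface; the exact key name
        # should be checked against the API model shown above.
        "collectionsIds": [collection_id],
        # Group of the DAHN project, given admin rights on the data.
        "rights": [{"id": group_id, "role": "ROLE_ADMIN"}],
    }
```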
Adding new data: interface or API, a choice depending on the task at hand
NAKALA offers several ways to add new data to its database. Some were developed by outside groups to facilitate their own uploads, but in this post I will only present the two main ways offered by NAKALA, both of which I used while working with the tool. Each can be more useful depending on what you want to add (many files with little metadata, a lot of metadata for one or two files, or a mix of both).
Numerous files, little metadata: favoring the use of the interface
[Figure: NAKALA Interface]
The interface is most effective when a large upload needs to be made, because it is possible to add all the files at once (which, as I will explain later, is not possible with the API). All the metadata has to be completed field by field. Five fields are mandatory (title, type, author, date, licence) and others can be added according to what I want to present and develop for the data.
Then, NAKALA offers the possibility of adding a group to the data. A group can be created by any NAKALA user and is composed of the people added by that user, provided they have a Huma-Num account. Once the group is added, it is also possible to define its level of control over the data: it can be reader-only, but it can also be labelled editor or administrator of the data, meaning the members of the group can modify the data if they want. The user who adds the data is and always remains its owner. As I previously mentioned, for my import I created a group for the DAHN project and gave it “admin” control.
Finally, before depositing (private) or publishing (public) the data, it can be added to a collection. This establishes a classification within the data and can later help with retrieving the new identifiers of the images.
Many files, different metadata each time and a lot of information: the API is the way
Working with the API for the import is a three-step task (four if there are several images): one step in NAKALA, one step with the metadata, and one (or two) more step(s) in NAKALA.
First of all, I have to add the file to NAKALA, or more precisely to a “waiting room” where the file waits for its metadata before being truly added to my account. This is done with “POST /datas/uploads” in the API.
[Figure: NAKALA API (upload)]
Once the upload is done, the file receives a new identifier, stored in “sha1”. This has an equivalent in the metadata file that I create, so all I need to do is copy/paste that new identifier into the file, replacing the value “string”. If the document only has two or three files, it is possible to add new curly braces ‘{}’ in “files”, each with a “sha1” containing one of the new identifiers.
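A minimal sketch of that upload step, assuming the requests library, the api.nakala.fr base URL, an X-API-KEY header and a multipart field named “file”; the loop collects one “sha1” per file, which covers the multi-file case illustrated below. The file names are placeholders.

```python
import requests

API_URL = "https://api.nakala.fr"
HEADERS = {"X-API-KEY": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"}  # placeholder

def upload_files(paths):
    """Send each file to the 'waiting room' and collect the returned sha1s."""
    sha1s = []
    for path in paths:
        with open(path, "rb") as f:
            resp = requests.post(f"{API_URL}/datas/uploads",
                                 headers=HEADERS,
                                 files={"file": f})
        resp.raise_for_status()
        sha1s.append(resp.json()["sha1"])
    return sha1s

# Example: upload both page images of a letter before posting its metadata.
sha1s = upload_files(["letter_001_page1.jpg", "letter_001_page2.jpg"])
```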
[Figure: Example of two files called in the metadata]
Then, as the metadata has already been generated, all I need to do is copy/paste the metadata for the files into POST /datas and execute it.
[Figure: NAKALA API (data)]
If the information has been entered correctly and everything goes well, the response should be 201, which means that the data has been registered by the server and given an identifier. However, as was sometimes the case for me, the execution can result in a 500 error. One of the main reasons for this is a malfunction during the upload of the file; in that case, it can be necessary to send the file to the “waiting room” again and execute POST /datas once more to obtain a 201. In other cases, it is also possible that one of the metadata fields was filled in incorrectly, which prevents the 201.
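Here is a hedged sketch of that creation call with the response check described above; the record argument is the payload built earlier, the base URL and header are the same assumptions as before, and the exact shape of the response body should be verified against the API output.

```python
import requests

API_URL = "https://api.nakala.fr"
HEADERS = {"X-API-KEY": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"}  # placeholder

def create_data(record):
    """POST the metadata (with its sha1 references) and report the outcome."""
    resp = requests.post(f"{API_URL}/datas", headers=HEADERS, json=record)
    if resp.status_code == 201:
        # The response body contains the identifier assigned to the new data.
        return resp.json()
    if resp.status_code == 500:
        # Often a failed upload: re-send the file to POST /datas/uploads
        # and execute POST /datas again.
        raise RuntimeError(f"Server error, check the uploaded files: {resp.text}")
    # Any other code usually points to an invalid or missing metadata field.
    raise RuntimeError(f"Unexpected response {resp.status_code}: {resp.text}")
```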
As a fourth step, if the same data has many files attached to it, and since it would take too long to upload them one after the other, it is possible to use the interface. The new data has an identifier that I can use to access its record in the NAKALA interface and add the missing files in one batch.
After the import: how to retrieve the newly created identifiers for the images
I have added all the images from the Berlin Intellectuals corpus to NAKALA, but now I need to retrieve the identifiers of the images so that I can add them to my encoded files, in order to have the links and have the facsimiles appear properly in the TEI Publisher platform.
Once again, this is the kind of task that seems almost impossible, especially considering that I added a massive number of files. Retrieving the identifiers one by one would be an onerous task that I luckily don’t have to do. I mentioned above the benefit of adding the files to a collection: this is a time saver when harvesting the new identifiers.
[Figure: NAKALA API (collection)]
The only information needed is the identifier of the collection. For my work, I created a collection for each author, for one major reason: some images have the same name from one author to another. This is a problem, because I plan to replace the name of each file with its new identifier using a Python script and regexes, and if two files have the same name, one of them will be wrongly identified, which is detrimental to the publication platform. So, with a collection created for each author, I can safely extract the identifiers and then do the replacement author by author.
This step produces a JSON file with a set number of results per page (default: 10; maximum: 25); if more than one page is necessary, the others have to be produced by entering the desired page number in one of the fields. In those files, the only information to retrieve is the name of each file and its new identifier, which requires a heavy cleanup. Thankfully, most of the information contained is redundant, and I wrote a script with some regexes that quickly deletes the useless parts. A manual cleaning of the remaining noisy data leaves me with the filenames and their identifiers.
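Since the API returns JSON anyway, the same harvesting can also be scripted directly; the sketch below assumes a GET /collections/{id}/datas endpoint with “page” and “limit” parameters as described above, and a response shape (a “data” list, with the identifier and the file names in each entry) that should be verified against the actual output. The collection identifier is a placeholder.

```python
import requests

API_URL = "https://api.nakala.fr"
HEADERS = {"X-API-KEY": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"}  # placeholder

def harvest_identifiers(collection_id):
    """Map each file name of a collection to the identifier of its data."""
    mapping = {}
    page = 1
    while True:
        resp = requests.get(f"{API_URL}/collections/{collection_id}/datas",
                            headers=HEADERS,
                            params={"page": page, "limit": 25})
        resp.raise_for_status()
        # Assumed response shape: a "data" list with one entry per data,
        # each carrying its "identifier" and its "files".
        datas = resp.json().get("data", [])
        if not datas:
            break
        for data in datas:
            for file_info in data.get("files", []):
                mapping[file_info["name"]] = data["identifier"]
        page += 1
    return mapping

# One collection per author, so the replacement can be done author by author.
tieck_ids = harvest_identifiers("10.34847/nkl.xxxxxxxx")  # placeholder id
```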
After transforming the file into a Python dictionary (key = filename; value = identifier), I use a simple find/replace script with the dictionary to add the new identifiers to the XML files. Once this is done, I can upload the files to the platform and view my facsimiles.
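A minimal version of that find/replace step, treating the XML as plain text so the surrounding markup is untouched; the dictionary entry and file paths below are placeholders for illustration.

```python
import glob

def replace_filenames(xml_paths, mapping):
    """Replace each image file name with its NAKALA identifier in the TEI files."""
    for path in xml_paths:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        for filename, identifier in mapping.items():
            text = text.replace(filename, identifier)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

# 'mapping' comes from the collection harvesting step (filename -> identifier).
mapping = {"letter_001_page1.jpg": "10.34847/nkl.xxxxxxxx"}  # placeholder entry
replace_filenames(glob.glob("corpus/tieck/*.xml"), mapping)
```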
[Figure: View of the text and the facsimile on the platform]
At first, this tool can be hard to handle, but once the techniques are assimilated and you understand how it operates, adding new data and retrieving the identifiers is pretty easy and swiftly done.
- Documentation explaining how to use this script and the following ones can be found on my Github here.