
Recognizing and encoding the corpus’ named entities


Many months ago, I posted an article about the work I had to do to obtain a publishable edition of my corpus. I called it a trinity because it would be divided into three parts: 

  • transcription;
  • encoding;
  • annotation.

That article covered the first part of the trinity and, about a month later, I posted the second part: the encoding. Now, a year and a half later, I am posting the last part of this trinity, finally finishing this series of posts.

This ‘annotation’ part should really be called “Named Entity Recognition” or “NER”. NER is a task of information extraction. In a given text, it locates entities and classifies them into categories such as persons, locations, organizations, etc. For the d’Estournelles corpus, I decided to retrieve four categories of entities and to encode them as follows: persons with <persName>, places with <placeName>, organizations with <orgName> and works (publications) with <title>.

Choosing the right tool for the NER task: the use of spaCy

Several tools are designed to carry out named entity recognition, whether through a graphical user interface (GUI) or a command line interface (CLI). For our corpus, I initially selected and tried three different tools: spaCy, Stanza and entity-fishing.

Entity-fishing was interesting, notably for its connection to Wikidata, but it was quickly abandoned: it was too complicated to download and use locally, and relying on the online demo would not have been efficient either. I was then left with spaCy and Stanza, two NLP libraries working with Python. They operate in much the same way and produce virtually the same outputs: reading the texts with a language model, the tool retrieves the named entities and classifies them by category (LOC, PER, ORG, MISC). Even though the tools are similar, the models are different and, ultimately, the results of spaCy seemed better than those of Stanza, so I decided to use the former for the NER task.
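As a minimal sketch, this is roughly what the spaCy side of the task looks like; the French model fr_core_news_lg and the example sentence are placeholders of my own, and any French model with an NER component would behave the same way:

```python
# Minimal sketch of NER with spaCy, assuming the French model has been
# downloaded beforehand (python -m spacy download fr_core_news_lg).
import spacy

nlp = spacy.load("fr_core_news_lg")
doc = nlp("Joseph Caillaux est arrivé à Toulon hier avec le président de la République.")

for ent in doc.ents:
    # ent.text is the entity span, ent.label_ its category (PER, LOC, ORG, MISC)
    print(ent.text, ent.label_)
```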

Retrieving the named entities of the corpus

In order to use spaCy and obtain the results in an easily exploitable format, I wrote a script. Here is how it works:

  • It is given a language model trained to recognize named entities on documents such as news, media, blogs, comments, etc.
  • It reads the files of the input folder as texts.
  • Once it recognizes an entity (which can be one word or several), it writes it in the output file and, next to it, after a space, it gives its corresponding label (‘PER’, ‘LOC’, etc.).
  • As I didn’t want the results spread over multiple files (e.g. one input file = one output file), I used the Python function open() with the mode “a” (append) instead of the mode “w” (write). This way, spaCy reads a file, retrieves the named entities, writes them in the output file and then goes on to the next file, repeating the action again and again until everything in the folder has been processed (see the sketch below).

Thus, the output is a text file containing all of the recognized named entities. However, this is in no way a final version: it requires a lot of corrections and modifications before it can be used to encode our named entities.
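Put together, the script looks roughly like the following sketch; the folder and file names are placeholders, not those of the actual corpus:

```python
# Rough sketch of the extraction script: it reads every text file in a folder,
# runs the spaCy pipeline on it and appends "entity LABEL" lines to a single
# output file, as described above.
from pathlib import Path
import spacy

nlp = spacy.load("fr_core_news_lg")

for txt_file in sorted(Path("corpus_txt").glob("*.txt")):
    text = txt_file.read_text(encoding="utf-8")
    doc = nlp(text)
    # Mode "a" appends to the output instead of overwriting it, so the
    # entities of every letter end up in the same file.
    with open("entities_output.txt", "a", encoding="utf-8") as out:
        for ent in doc.ents:
            out.write(f"{ent.text} {ent.label_}\n")
```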

Obtaining a clean, encoding-ready list of named entities

a) Semi-automatic cleaning of the file

I needed to make a series of corrections to quickly remove a lot of useless data. First, with a series of regexes explained here, I removed unnecessary spaces that could have complicated the processing of the output file.

Then, I had to remove all the duplicates. The output file is a collection of all the named entities retrieved from the corpus and, since some entities are mentioned more than once in the letters, spaCy, if it worked correctly, retrieved the same entity with the same label on numerous occasions. As I am building a database of our corpus, these repetitions need to be removed. To do so, I used a script that removes duplicates while preparing the file for the next step: a migration towards a spreadsheet.

The spreadsheet makes it possible to process the output more easily, because it has extra features that help me (whether I use Microsoft Office Excel or LibreOffice Calc). With the previous script, I also changed the space between the entity and the label into a tabulation. This means that, after the migration to the spreadsheet, I had two columns: one with the entity and one with the label.
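A possible sketch of this clean-up step, assuming the raw output produced above (the file names are again placeholders): it removes duplicate lines and replaces the last space of each line with a tab, so that the file opens as two columns in a spreadsheet.

```python
# Deduplicate the "entity LABEL" lines and turn them into a two-column TSV.
seen = set()

with open("entities_output.txt", encoding="utf-8") as raw, \
     open("entities_clean.tsv", "w", encoding="utf-8") as clean:
    for line in raw:
        line = line.strip()
        if not line or line in seen:
            continue
        seen.add(line)
        # The label is the last token; everything before it is the entity.
        entity, label = line.rsplit(" ", 1)
        clean.write(f"{entity}\t{label}\n")
```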

With Excel (which I used) and those columns, I could take advantage of two features:

  • Sort the data alphabetically (whether it is the entities or the labels)
  • Highlight the cells that contain duplicates. These duplicates are different from the previous ones: this time, the highlighting reveals entities that were retrieved with different labels, as well as entities that Excel considers duplicates but are not, because its duplicate detection is not case-sensitive.
b) Manual removal of the wrongly retrieved named entities

Once the semi-automatic tasks were completed, I needed to correct the file manually. This is a rather long task, especially if the file has a lot of entries (more than 7,000 in our case). It consists of reading the entries one by one and deleting what looks like errors in the named entity recognition. The most common errors are usually found under the “MISC” label, and sorting the labels alphabetically helps to find them more quickly and to get rid of the useless entities. For our corpus, this was the case for dates that were retrieved, or for long parts of sentences that had no reason to be considered named entities. Of course, it is possible to have doubts about some entries and their relevance for the database I am building. When confronted with such cases, I kept the entities and decided that I would check their relevance later, when I started building the indexes. With those corrections, the number of entries dropped drastically: I only had about 2,500 left.

c) Pooling the many versions of an entity into one regular expression

As I just said, I had 2,500 entries, whether they were persons, places, organizations or works. However, that does not mean that I had 2,500 separate entities in my file. In most cases, several entries represented the same entity, written differently in the corpus. Persons could be mentioned with a Mr/Mrs, with their first name or with their title (President, Doctor, etc.). Places could be mentioned with or without an article. Organizations could be mentioned with the short or the long version of their name. Sometimes, the author of the corpus also made mistakes when writing the material (e.g. “Anglererre” instead of “Angleterre”). The quality of the model and the small Levenshtein distance between the misspelling and the correct form made it possible for spaCy to still recognize it as an entity, so I kept it, since it would be useful later on for the encoding of the text.

The task here was to pool all of these versions of an entity into one regular expression, in order to use it afterwards to encode the corpus. The first step was to spot all the variants and to gather them. The best way for me to do so was to group them by last name and to put in parentheses, after the last name, the information that could appear before it, which makes the transformation into regular expressions easier. This took some time, but it diminished the odds of having the same entity with multiple identifiers in the encoding and in the index. After that, I could write my regular expressions, taking into account all the singularities of each entity.

 

An example case with the entity “Joseph Caillaux” 

  • “Caillaux” → No change
  • “Joseph Caillaux” → “Caillaux (Joseph )”
  • “J. Caillaux” → “Caillaux (J. )”
  • “J.Caillaux” → “Caillaux (J.)”

Thus, the instances of “Caillaux” are not split between the “C” section and the “J” section but are all present in the former. Then, I can write my regular expressions. For example, the four forms above give the following regex: “(J(\.|oseph)( )?)?Caillaux”

Decryption of the regex:

  • “J(\.|oseph)” → the “J” is always there, but it has to be followed either by a dot or by the rest of the letters that make “Joseph”
  • ( )? → there might be a space between the first name and the surname
  • “(J(\.|oseph)( )?)?Caillaux” → the first name, followed or not by a space, can be present next to the last name, but it is not mandatory for the entity to be matched. Finding “Caillaux” alone is enough for an encoding and “Joseph Caillaux” is also encoded, but “J Caillaux” would not work, as this form is not covered by the regex (the short test below confirms this).
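A quick way to check this behaviour is to test the regex (with the dot escaped) against the different forms; this is only an illustration, not part of the original workflow:

```python
import re

pattern = re.compile(r"(J(\.|oseph)( )?)?Caillaux")

for form in ["Caillaux", "Joseph Caillaux", "J. Caillaux", "J.Caillaux", "J Caillaux"]:
    print(form, "->", bool(pattern.fullmatch(form)))
# The first four forms match; "J Caillaux" does not, as explained above.
```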

 

Once each entity had its regular expression, I could divide my entity file into four tabs, one for each category, and begin the next part of this NER task.

Creating the entity databases of our corpus

a) Assigning an identifier

Once I had all the entities of our corpus, I could start to really create our databases. The first step was to give each entity a unique identifier: it appears in the XML files and links each occurrence to the index entry containing its information. I chose a simple solution for the identifier, a letter joined with a number assigned arbitrarily (mostly in alphabetical order for our corpus). I have four types of entities, so each category has its own letter: “p” for persons, “l” for locations, “g” for organizations/groups and “w” for works. The numbering starts at “0001” and goes up from there.
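As an illustration, assigning the identifiers can be done in a few lines of Python; the file names and CSV layout below are assumptions, only the “letter plus zero-padded number” convention comes from the description above.

```python
import csv

# Cleaned list of place names, one per line, sorted alphabetically.
with open("places.txt", encoding="utf-8") as src:
    places = sorted(line.strip() for line in src if line.strip())

with open("places_ids.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "name"])
    for number, name in enumerate(places, start=1):
        # "l" for location + a zero-padded counter: l0001, l0002, ...
        writer.writerow([f"l{number:04d}", name])
```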

First entries of the CSV for places

b) Building the indexes

I had CSV files with entities and identifiers, and I needed to transform them into XML tags for our indexes. I developed several scripts (one for each category) that process the information from the CSV files. Each script creates the index entries (one tag per entity), assigns the identifier as the @xml:id and puts the name of the entity in the corresponding tag (<persName> for a person, <title> for a work, etc.). Then, I could build the index.
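The scripts themselves are category-specific, but the general idea can be sketched as follows for the persons; the CSV layout (an “id” and a “name” column) and the surrounding <listPerson> element are assumptions based on the description above, not the original code.

```python
import csv
from xml.sax.saxutils import escape

entries = []
with open("persons_ids.csv", newline="", encoding="utf-8") as src:
    for row in csv.DictReader(src):
        # One index entry per entity: the identifier becomes the @xml:id and
        # the name goes into the corresponding tag (<persName> here).
        entries.append(
            f'<person xml:id="{row["id"]}"><persName>{escape(row["name"])}</persName></person>'
        )

with open("index_persons.xml", "w", encoding="utf-8") as out:
    out.write("<listPerson>\n  " + "\n  ".join(entries) + "\n</listPerson>\n")
```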

The first step was to normalize the names of the entities, because I did not want to keep the regular expressions developed for the encoding. In that case, it was logical to choose the most common form of the entity (e.g. “A(cadémie|CADÉMIE) F(rançaise|RANCAISE)” → “Académie Française”). The only exception concerns the places. The entities are collected from a French corpus but, for better accessibility, I decided to write the indexes in English. The names of persons, works and organizations must not be translated, as they would no longer be recognizable, but it is logical to put the place names in English (e.g. “(E|É)(TATS|tats)(-| )(U|u)(NIS|nis)( d’A(mérique|MÉRIQUE))?” → “United States of America”).

The second step was to add the information for each type of entity. I established, for each one, the tags it should contain.

1. Places

A place requires the @type of place it is, a name, its country, the geographical coordinates of its location and, if it has one, a GeoNames identifier (a sketch of such an entry is given after the list of exceptions below).

There are exceptions: 

  • for a continent, there are no coordinates given as the area is too big
  • for a street, I add <address> to the location to be more precise
  • there is no country given when the place is already a country or a sea
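To give an idea of the result, here is a hedged sketch of what a place entry could look like once this information has been added; the exact element and attribute names are assumptions based on the description above, and the GeoNames link is a placeholder.

```python
import xml.etree.ElementTree as ET

place = ET.Element("place", {"xml:id": "l0001", "type": "city"})
ET.SubElement(place, "placeName").text = "Paris"
ET.SubElement(place, "country").text = "France"
# Geographical coordinates (latitude, longitude) of the location.
ET.SubElement(place, "geo").text = "48.8566 2.3522"
# Link to the GeoNames record, when the place has one (placeholder URL).
ET.SubElement(place, "idno", {"type": "geonames"}).text = "https://www.geonames.org/XXXXXXX"

print(ET.tostring(place, encoding="unicode"))
```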

Example of a precise address

Example of a country

2. Persons

The minimum a person requires is a name, even if it is just a last name. Additionally, other forms of the name can be given, as well as a nationality, dates and places of birth and death, sex, occupation, affiliation and education, important event(s) of their life and, if they have one, a VIAF identifier. It is possible that not all of this information is available.

For the name, the only requirement is that the first <persName> in the <person> tag is provided in the format “Last Name, First Name” in order to help with the alphabetical order of the persons once the index has been published. When only the last name is known, it is encoded like this: “Last Name, (Unknown name)”.
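A tiny helper of the kind one might use to produce this form is sketched below; it is only an illustration of the convention, not the original script.

```python
def index_form(first_name: str, last_name: str) -> str:
    # "Last Name, First Name", or the fallback used when only the last name is known.
    return f"{last_name}, {first_name}" if first_name else f"{last_name}, (Unknown name)"

print(index_form("Joseph", "Caillaux"))  # Caillaux, Joseph
print(index_form("", "Caillaux"))        # Caillaux, (Unknown name)
```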

Example of a person with complete information

3. Organizations

An organization has a name, a description, a location and a VIAF identifier. The location is not required, because sometimes I do not have any information about the place where the organization is or was located. When it is there, it is usually presented in the same manner as for places (placeName, country, coordinates).

Example of an organisation with a placeName

4. Works

A work is provided inside a <monogr> tag, in which there is a title, an attribute for the type of work it is and information about the publication (publishing place and date). The Paul d’Estournelles de Constant corpus includes mostly periodicals (type=”j”).

Example of an entry for a periodical

 

Building those indexes was also practical because, while searching for information about the entities, I found that some of those kept during the cleaning of the files were actually nonexistent or wrong. I could then remove them from the list of named entities before they were encoded and avoid some false positives.

Encoding the named entities

This step was probably the easiest because everything had been prepared during the previous steps and all I had to do was to apply it to the corpus. 

First, just like in the previous step, I created several scripts that also process the information from the CSV files. This time, they helped create the functions for the encoding; those functions are subsequently called in the named entity encoding script.

The technique used here to encode the named entities can be applied to two scenarios:

  • the corpus has already been encoded, only the named entities are missing
  • the corpus is still only available in its text format

When I proceeded to the named entity recognition of my corpus, I had already encoded most of it, so I was in the first scenario. Accordingly, I developed a script to deal with it. It is quite simple: using the BeautifulSoup module, the script reads the <body> as text, searches for the entities matched by the regexes and encodes them. Then, it writes a new file in which it recreates the XML tree with the newly encoded named entities, and adds information to the <revisionDesc> to document this new change.
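A simplified sketch of this first scenario is given below. The file names, the dictionary of regexes and the @ref attribute are assumptions of mine; the real script also updates the <revisionDesc>, which is omitted here.

```python
import re
from bs4 import BeautifulSoup

# Each entity regex is mapped to the tag used to encode it and its identifier.
ENTITY_PATTERNS = {
    r"(J(\.|oseph)( )?)?Caillaux": ("persName", "p0001"),
}

with open("letter_0001.xml", encoding="utf-8") as src:
    soup = BeautifulSoup(src, "xml")

body = soup.find("body")
body_text = str(body)

for pattern, (tag, identifier) in ENTITY_PATTERNS.items():
    # Wrap every match in the corresponding tag; in practice, care must be
    # taken not to match text that is already inside a tag or an attribute.
    body_text = re.sub(
        pattern,
        lambda m: f'<{tag} ref="#{identifier}">{m.group(0)}</{tag}>',
        body_text,
    )

body.replace_with(BeautifulSoup(body_text, "xml").find("body"))

with open("letter_0001_ner.xml", "w", encoding="utf-8") as out:
    out.write(str(soup))
```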

Now, it is also possible that the NER task was done before any encoding of the corpus. In that case, it is logical to want to encode the named entities at the same time as the rest of the corpus. Having the named entities encoded by functions and individual lines in the script makes this possible. In the second part of this series of posts, I presented the steps to encode the corpus and mentioned the text tagging script, which works mostly with regular expressions. It is therefore entirely possible to use our technique for the named entities, because it operates in exactly the same way: I only need to import the functions and add the four lines that do the transformation near the end of the main part of the script. After executing the script, my corpus is encoded with both the structure and the named entities.

Conclusion

I now have a corpus encoded with extra information. However, it may still need some modifications, because errors or omissions can be found later. In some cases, a regex covered several different entities because of the closeness of their forms (e.g. “Mr and Mrs Last Name” or “Toul|Toulon|Toulouse”). When this happened, I left the @xml:id blank to avoid wrongly encoding the identifier. It is now mandatory to fix those omissions to make sure that the XML files are valid. In order to help with this task, I created a script that harvests the unreferenced named entities. Once this is done, to have a fully encoded corpus, it is necessary to encode the referencing strings, which are entities designated by a generic name that the NER tool could not recognize (for example “my father”, “his wife”, “my school”, etc.). This has to be done manually, because an automatic tool is not able to identify the entity mentioned: it depends on the context and on knowledge of the corpus.
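A possible sketch of such a harvesting script is given below; it follows the same assumed @ref convention as the earlier sketch and simply lists, file by file, the entity tags whose reference is empty or missing.

```python
from pathlib import Path
from bs4 import BeautifulSoup

ENTITY_TAGS = ["persName", "placeName", "orgName", "title"]

for xml_file in sorted(Path("corpus_xml").glob("*.xml")):
    soup = BeautifulSoup(xml_file.read_text(encoding="utf-8"), "xml")
    body = soup.find("body")
    if body is None:
        continue
    for tag in body.find_all(ENTITY_TAGS):
        # An empty or missing reference marks an entity still to be resolved.
        if not tag.get("ref"):
            print(f"{xml_file.name}: <{tag.name}> {tag.get_text(strip=True)}")
```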

After those steps have been completed, the corpus will officially be ready for publication and to be put online for display. Additional annotation could be added later on, to provide information about the content of the letters and to shed some light on some of the ideas and elements mentioned by the author.

