Introduction
He is, of course, a recording artist and a guest actor in Game of Thrones. His shape, without going into details, is pretty human. But you knew that already. You made the connection between these words and the entity they represent. However, this isn’t quite as easy and straightforward a task for a computer. Enter Named Entity Recognition (NER) to save the day. NER is essentially a way to teach a computer which real-world entities words refer to.
What is NER?
We can first look at the formal definition:
“NER is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.” (Wikipedia)
That wasn’t very helpful at first glance. Let’s try a simple example:
In 2025, John Doe traveled to Greece and visited the Acropolis, where the Parthenon is.
Given the context, we might be interested in different types of entities. If what we are after are semantics, then we simply need to understand which words signify persons, which signify places, etc.
On the other hand, in some cases we might need syntactic entities like nouns, verbs, etc.
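To make this concrete, here is a quick sketch using a pretrained English pipeline from spaCy, an NLP library we will come back to later in this post. The exact labels you get depend on the model version, so treat the outputs in the comments as indicative:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "In 2025, John Doe traveled to Greece and visited the Acropolis, "
    "where the Parthenon is."
)

# Semantic entities: who, where, when
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "2025" DATE, "John Doe" PERSON, "Greece" GPE

# Syntactic categories: nouns, verbs, etc.
for token in doc:
    print(token.text, token.pos_)  # e.g. "traveled" VERB, "Acropolis" PROPN
```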
Why NER?
Okay, now we know what NER is, but what does it do, in real life? Well, plenty of things:
- Efficient searching: This applies to any service that uses a search engine and has to answer a large number of queries. By extracting the relevant entities from a document corpus, we can split it into smaller homogeneous segments. Then, at query time we can reduce the search space and time by only looking into the most relevant segments.
- Recommendation systems: News publishers, streaming services and online shops are just a few examples of services that could benefit from NER. Clustering articles, shows or products by the entities they contain helps a recommendation engine deliver great suggestions to users, based on the content they prefer.
- Research: Each year the already tremendous volume of papers, research journals and publications increases further. Automatically identifying entities such as research areas, topics, institutions, and authors can help researchers navigate through this vast interconnected network of publications and references.
Now we’re getting somewhere. We know what NER is and have a few good ideas about where it can be used. But why and where are we at ORFIUM using it?
NER applications at ORFIUM
Text matching
In some of our services we use Natural Language Processing (NLP) methodologies to match recording or composition catalogs with other catalogs, internally and externally. NER can aid this process by extracting the most relevant industry-related entities, like song titles and artists, which can then be used as features for our current algorithms and models.
Data cleaning
The great volume of data we ingest daily often contains irrelevant and superfluous information. An example of this is YouTube catalogs, where video titles usually contain more than just song titles or artist names, and sometimes no useful information at all. By extracting the entities most relevant to the music industry, we essentially remove the noise, which leads to better metadata, as well as a more trustworthy knowledge base.
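As a rough illustration, assuming an `nlp` pipeline that already recognizes our entity types (the function name and the example output below are hypothetical), cleaning could amount to keeping only the recognized spans:

```python
def clean_video_title(nlp, raw_title: str) -> list[tuple[str, str]]:
    """Keep only the spans the NER model recognizes; everything else is noise."""
    doc = nlp(raw_title)
    return [(ent.label_, ent.text) for ent in doc.ents]

# Hypothetical output for a noisy YouTube title:
# clean_video_title(nlp, "Artist - Song (Official Video) [HD] NEW 2017!!!")
# -> [("PERSON", "Artist"), ("TITLE", "Song"), ("VERSION", "Official Video")]
```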
Approaches and Limitations
Depending on the context and the text structure, various approaches can be employed, but they usually fall into two general categories, each with its own strengths and drawbacks: rule-based and machine learning approaches.
In rule-based approaches, a set of rules is derived based on standard NLP strategies or domain-specific knowledge and then used to recognize possible entities. For example, names and organizations are capitalized, and dates are written in formats like YYYY/MM/DD (see the sketch after the lists below).
- Pros:
- Straightforward and easy to implement for well-structured text
- Domain knowledge can be easily integrated
- Usually computationally fast and efficient
- Cons:
- Rule sets can get very large, very fast for complicated text structures, requiring a lot of work
- General-purpose rule sets are not easily adaptable to specific domains
- Changes to the text structure further complicate rule additions and interactions
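To ground the idea, here is a minimal, illustrative rule set in Python. The patterns are deliberately naive, precisely to show how quickly such rules hit their limits:

```python
import re

DATE_RE = re.compile(r"\b\d{4}/\d{2}/\d{2}\b")             # dates like 2025/06/01
NAME_RE = re.compile(r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b")  # runs of capitalized words

def rule_based_entities(text: str) -> list[tuple[str, str]]:
    """A very small rule set: dates by format, names by capitalization."""
    entities = [("DATE", m.group()) for m in DATE_RE.finditer(text)]
    entities += [("PERSON", m.group()) for m in NAME_RE.finditer(text)]
    return entities

# rule_based_entities("John Doe visited Greece on 2025/06/01")
# -> [("DATE", "2025/06/01"), ("PERSON", "John Doe")]
# Note: "Greece" is missed, and any capitalized bigram becomes a "person" --
# exactly the kind of gap that makes rule sets grow fast.
```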
In machine learning approaches, a model is trained using a dataset annotated specifically for the task at hand. The model learns the different ways in which relevant entities appear in text and can then be used to identify them in the future (an example of what such annotated data can look like follows the lists below).
- Pros:
- Training process is domain-agnostic with easily customizable entity tags
- Well-suited for unstructured text and easily adaptable to structure changes
- Pre-trained models can be customized and used to speed up the training process
- Cons:
- The process requires large amounts of annotated entries to create a robust model
- May require annotators with specific domain expertise
- Training process can be costly in terms of time and money depending on the use-case
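For reference, this is roughly what a single annotated entry looks like in the training format of spaCy, the library we use later in this post. Character offsets mark each entity span; the title itself is made up:

```python
from spacy.lang.en import English
from spacy.tokens import DocBin

# One annotated entry: text plus (start, end, label) character spans.
TRAIN_DATA = [
    ("Artist Name - Song Title (Live)",
     {"entities": [(0, 11, "PERSON"), (14, 24, "TITLE"), (26, 30, "VERSION")]}),
]

# Serialize to the binary .spacy format used for training.
nlp = English()
db = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(start, end, label=label)
             for start, end, label in annotations["entities"]]
    doc.ents = [span for span in spans if span is not None]
    db.add(doc)
db.to_disk("train.spacy")
```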
Our Project
What we wanted to accomplish was to build a baseline entity extraction process which could potentially later be used to improve our matching and other services.
Dataset
A good starting point for that would be the YouTube catalogs we ingest. These are catalogs of unmatched sound recordings. As mentioned earlier, video title structures are usually a bit chaotic. Therefore, this use case is an excellent candidate to test the potential and limitations of NER.
In the video titles, the most relevant entities we would like to identify are TITLE (the recording title), PERSON, and VERSION (remix, official video, live, etc.).
We investigated both a rule-based and a machine learning approach. For their evaluation, however, we needed an annotated dataset tailored to our use case. For that reason we turned to Label Studio and our Operations Team. Label Studio is an open-source online data annotation tool with an intuitive UI, where we uploaded a catalog sample. The sample was split into sub-tasks which were then handled by our Operations Team.
At this point, we would like to say a big thank you to the Operations Team for their help. Dataset annotations are almost always quite tedious and repetitive work, but an incredibly important first step in our testing.
Rule-based approach
For the construction of our rules, we first needed to investigate whether there was any kind of structure in the video title text. We found a few patterns.
Information inside parentheses
The first thing we noticed is that when parentheses ( (), [], {} ) were present, they mostly contained featured artists or version information, like live, acoustic, remix, etc. This information was rarely found outside parentheses.
For these reasons we wrote a few simple rules for attributes inside parentheses (sketched in code after the list):
- If they contained any version keywords (live, acoustic, etc.), tag them as VERSION
- If “feat” was present, then tag the tokens after it as PERSON
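A minimal sketch of these two rules (the keyword list is illustrative, not our full set):

```python
import re

PARENS_RE = re.compile(r"[(\[{]([^)\]}]+)[)\]}]")
VERSION_KEYWORDS = {"live", "acoustic", "remix", "cover", "official"}

def tag_parenthesized(title: str) -> list[tuple[str, str]]:
    """Tag text inside (), [] or {} as VERSION or PERSON."""
    tags = []
    for match in PARENS_RE.finditer(title):
        inner = match.group(1).strip()
        words = set(re.findall(r"[a-z]+", inner.lower()))
        if words & VERSION_KEYWORDS:
            tags.append(("VERSION", inner))
        elif inner.lower().startswith("feat"):
            # Everything after "feat"/"feat." is the featured artist
            tags.append(("PERSON", re.sub(r"(?i)^feat\.?\s*", "", inner)))
    return tags

# tag_parenthesized("Song Title (feat. Jane Doe) [Live]")
# -> [("PERSON", "Jane Doe"), ("VERSION", "Live")]
```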
Segmentation
One other thing we noticed was that some entries could be split into segments using certain delimiters ( -, |, / ), generally into 2-4 segments. Also, “|” and “/” have higher priority than “-”. When split by | or /, the first segment mostly contained recording titles and sometimes also artists. When split by -, the picture was not quite as clear, since titles and artists appeared both in the first segment and in the rest. The most prevalent case, however, was the artist appearing in the first segment and the title in the second.
Based on the above, we have the following rules for splittable entries (sketched in code after the list):
- When split by | or /, tag tokens in the first segment as TITLE and tokens in the second segment as PERSON
- When split by -, tag tokens in the first segment as PERSON and tokens in the second segment as TITLE
Finally, tokens in entries that did not fall into any of the above categories were tagged as TITLE.
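Put together, a simplified version of the segmentation logic looks like this (real titles can have 2-4 segments; this sketch only handles the two-segment case):

```python
import re

def tag_segments(title: str) -> list[tuple[str, str]]:
    """'|' and '/' take priority over '-'; no delimiter means TITLE."""
    if re.search(r"[|/]", title):
        first, rest = re.split(r"[|/]", title, maxsplit=1)
        return [("TITLE", first.strip()), ("PERSON", rest.strip())]
    if " - " in title:
        first, rest = title.split(" - ", 1)
        return [("PERSON", first.strip()), ("TITLE", rest.strip())]
    return [("TITLE", title.strip())]

# tag_segments("Ed Sheeran - Shape of You")
# -> [("PERSON", "Ed Sheeran"), ("TITLE", "Shape of You")]
```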
Machine learning approach
Our work for the machine learning approach was much more straightforward. We decided to go with transfer learning: taking a state-of-the-art model, pre-trained (usually on public, general-purpose datasets), and extending its training with a custom dataset. This is very efficient, since we don’t have to spend time training a model from scratch, but still get to tailor it to our needs.
For that purpose, we used spaCy, a well-established open-source Python library for NLP. It supports multiple languages and NLP algorithms, including NER, and its models can be retrained and integrated with a few lines of code. It also offers some models optimized for accuracy and others for speed. We retrained spaCy models using the annotated dataset provided by our Operations Team.
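As a rough sketch of the retraining step (the model name, data, and number of passes here are illustrative, and a spaCy v3 project would typically drive this through the `spacy train` CLI instead):

```python
import random
import spacy
from spacy.training import Example

# Start from a pretrained pipeline and fine-tune only its NER component.
nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
for label in ("TITLE", "VERSION"):  # PERSON is already known to the model
    ner.add_label(label)

TRAIN_DATA = [  # in practice: the annotated catalog sample
    ("Artist Name - Song Title (Live)",
     {"entities": [(0, 11, "PERSON"), (14, 24, "TITLE"), (26, 30, "VERSION")]}),
]

with nlp.select_pipes(enable=["ner"]):      # freeze everything but NER
    optimizer = nlp.resume_training()
    for _ in range(20):                     # a few passes over the data
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer)
```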
Results
Both approaches performed very well and identified the majority of the TITLE entities. As far as PERSON and VERSION entities are concerned, the rule-based approach struggled a bit, while the machine learning one did a decent job. We also faced a few common issues with both approaches, which made their predictions less accurate.
Conclusion
Here is where today’s journey comes to an end. We had a chance to briefly introduce the concept of Named Entity Recognition, describe a few of its general and more custom uses, and learn that, despite the variety of approaches, they all come with caveats, and we usually have to make compromises depending on our needs. Is our text well-structured? Are our entities generic or do they require specific domain knowledge? How do different approaches adapt to changes? Are we able to annotate our own datasets?
We also started this article with a question. Did NER help us answer it? Our models certainly tried. Both our rule-based and machine learning approaches gave us the following result when asked to identify the entities in “Ed Sheeran – Shape of You”:

Ed Sheeran → PERSON, Shape of You → TITLE
But what do we know? They seem to perform very well, so they might be right.
Theodoros Palamas
Machine Learning Researcher/Data Scientist @ ORFIUM