Using Spacy in Python To Extract Named Entities in Spanish
The Spacy small Spanish language model has some difficulty with contemporary news texts that are not Eurocentric or US-based. This lack of accuracy with contemporary figures likely owes in part to a less thorough scrape of Wikipedia, and to the changes that have taken place since 2018 in Mexico, Bolivia, and other LATAM countries with highly variant dialects of Spanish. Regardless, that dataset can and does garner some results for the purpose of this exercise, which means we can toy around a bit with some publicly available data.
Entity Hash For Spanish Text
In this informal exercise, we will hack our way through some Spanish text. Specifically, we will make use of NER capacities sourced from public data – no rule-based analysis – along with some functions I find useful for visualizing Named Entities in Spanish text. We have prepared a Spanish news text on the topic of ‘violence’, or violent crime, sourced from publicly available Spanish news content in Mexico.
Using spacy, you can hash the entities extracted from a corpus. We will use the lighter Spanish language model from Spacy’s natural language toolkit. This language model is a statistical description of Wikipedia’s Spanish corpus, which is likely slanted towards White Hispanic speech, so beware of its bias.
First, import the libraries:
import spacy
import spacy.attrs

nlp = spacy.load('es_core_news_sm')
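As a quick sanity check, you can push a short Spanish string through the pipeline loaded above and inspect `doc.ents`. The sentence below is an invented example, and the exact entities returned will depend on the model.

# Run a short Spanish sentence through the `nlp` pipeline loaded above and
# print each detected entity with its label (e.g. PER, ORG, LOC, MISC).
doc = nlp("Andrés Manuel López Obrador se reunió con empresarios en Ciudad de México.")
for ent in doc.ents:
    print(ent.text, ent.label_)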
With the libraries in place, we can import the module ‘org_per’. The module comes from this Github repo.
The work of identifying distinct entities is done in a function that filters for organizations and people, which are tagged as ‘ORG’ and ‘PER’, respectively, in spacy’s data.
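The repo’s implementation is not reproduced here, but a minimal sketch of what such a filter could look like follows. The function name `sacalasentidades` mirrors the module’s; the body is my assumption about how the filtering might be written (reusing the `nlp` pipeline loaded above), not the repository’s actual code.

def sacalasentidades(raw_corpus):
    # Rough sketch: walk each line of text, keep only the entities the
    # model tags as ORG or PER, and return their surface strings.
    entities = []
    for line in raw_corpus:
        doc = nlp(line)
        for ent in doc.ents:
            if ent.label_ in ('ORG', 'PER'):
                entities.append(ent.text)
    return entities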
The variable ‘raw_corpus‘ is the argument you provide, which should be some Spanish text data. If you don’t have any, visit the repository and load that file object.
import org_per

raw_corpus = open('corpus_es_noticias_mx.txt', 'r', encoding='utf-8').read().split("\n")[1:]

entities = org_per.sacalasentidades(raw_corpus)

# use list of entities that are ORG or PER and count up
# each individual token.
tokensdictionary = org_per.map_entities(entities)
As noted before, the model’s training data has its origins in Wikipedia. This means that newer, more contemporary kinds of text may not be sufficiently well covered – breadth does not imply depth, because statistical models rely on new input bearing some passing resemblance to data they were trained on, and contemporary text may never have been seen.
Anecdotally, over a small corpus, we see performance below 80 percent accuracy for this language model. Presumably, a larger sampling of Wikipedia ES data would perform better, but certain trends in contemporary news text make it necessary to temper that expectation.
The output returned from running `org_per.map_entities(entities)` will look like this:
{"Bill Clinton": 123, "Kenneth Starr" : 12, }
The actual hashing is a simple enough method: each Named Entity’s text is placed in a dictionary with its frequency count as the value. Within your dictionary, you may get parses of Named Entities that are incorrect. That is to say, they are not properly delimited, because the Named Entity language model has no example resembling your parse. For instance, Lopez Obrador – the current president of Mexico – is not easily recognized as ‘PER’.
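A minimal sketch of that counting step, assuming the entity list produced earlier, might look like the following. This is a guess at the shape of `map_entities`, not the repository’s implementation.

from collections import Counter

def map_entities(entities):
    # Rough sketch: hash each entity string to the number of times it appears.
    return dict(Counter(entities))

# e.g. map_entities(["Bill Clinton", "Bill Clinton", "Kenneth Starr"])
# returns {"Bill Clinton": 2, "Kenneth Starr": 1}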
Accuracy
This is measured very simply by tabulating how far you agree with the returned Named Entities: the difference between the expected and the returned values is your error rate. A rough sketch of that tabulation appears below; more on accuracy metrics in the next post.
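Here the expected set is something you label by hand; both sets of names below are hypothetical placeholders.

# Hypothetical hand-labelled entities vs. what the model returned.
expected = {"Lopez Obrador", "Fiscalia General", "Guanajuato"}
returned = {"Fiscalia General", "Guanajuato", "Palacio Nacional"}

correct = expected & returned
accuracy = len(correct) / len(expected)
error_rate = 1 - accuracy
print(f"accuracy: {accuracy:.2f}, error rate: {error_rate:.2f}")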