Tokenizing Text, Finding Word Frequencies Within Corpora

One way to think about tokenization is to consider it as finding the smallest possible unit of analysis for computational linguistics tasks. As such, we can think of tokenization as among the first steps (along with normalization) in the average NLP pipeline or computational linguistics analysis. This process breaks text down into a form that both the computer and the analyst can interpret.

NLTK Is Very Comprehensive

NLTK is likely the best place to start for both understanding and customizing NLP pipelines. Please review their documentation on tokenization (the NLTK tokenization examples). While I recommend reviewing NLTK, you should also keep up with engineers, who mostly use TensorFlow. Yes, you must learn two packages at once if you are a linguist in the IT industry.
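
For a concrete starting point, here is a minimal sketch of NLTK's word-level tokenizer; it assumes NLTK is installed and the 'punkt' tokenizer data has been downloaded.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer models

sample = "Apoyar no es delinquir, señalan grupos feministas."
print(word_tokenize(sample, language='spanish'))
# ['Apoyar', 'no', 'es', 'delinquir', ',', 'señalan', 'grupos', 'feministas', '.']

Note that NLTK keeps case and emits punctuation as separate tokens, which is a different set of defaults than the TensorFlow utility shown later.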

Learn TensorFlow

“Learn TensorFlow. Given the effort you will place into learning how to combine computing with linguistics, you are also, in a strange way, lightening the load by proceeding in parallel with industry trends: a popular topic will likely come with ample documentation. Consider that most engineers will have a frame of reference for tokenization that is not necessarily grounded in linguistics, but is instead based on interactions with industry-centric examples intended to prepare data for machine learning.”

Industry Perceptions

Thus, if you do not know tokenization both in terms of how engineers perceive it and how linguists work with the concept, you will likely be perceived not only as not knowing how to program, but also as not knowing your own subject matter as a linguist. While this is obviously not true, perception matters, so you must make the effort to reach engineers at their level when collaborating.

# -*- coding: utf-8 -*-
"""
Created on Fri Sep 10 23:53:10 2021

@author: Ricardo Lezama
"""
import tensorflow as tf 

text = """ 
    A list comprehension is a syntactic construct available in
    some programming languages for creating a list based on existing lists. 
    It follows the form of the mathematical set-builder notation (set comprehension) as
    distinct from the use of map and filter functions.
    """
 
# text_to_word_sequence lowercases the text, strips punctuation and splits on whitespace.
content = tf.keras.preprocessing.text.text_to_word_sequence(text)

Obviously, we do not want to repeat the one-liner above over and over again within our Python script. Thus, we repackage that line as the body of a function titled ‘tokenize_lacartita‘ as follows:



def tokenize_lacartita(text):
    """Tokenize a raw text string with Keras' text_to_word_sequence.

    Arg: a text string that will be converted into a list of individual tokens,
         e.g. ['this', 'is', 'text', 'tokenized', 'by', 'tensorflow']

    Returns: a list of lowercased tokens with punctuation removed.
    """
    keras_tok = tf.keras.preprocessing.text.text_to_word_sequence(text)
    return keras_tok

The output of this tokenization function is shown below. As you can see, the result is a list of individual strings, lowercased and with punctuation removed, since punctuation is eliminated by default in the tokenization process.

['morelia',
 'apoyar',
 'no',
 'es',
 'delinquir',
 'señalan',
 'grupos',
 'feministas',
 'a',
 'sheinbaum',
 'capacitan',
 'a',
 'personal',
 'de',
 'la',
 'fiscalía',
 'cdmx']

Word Frequency and Relative Percentage

We can create a function to find word frequencies. Granted, the Counter class in Python's collections module can do this already, but, for educational purposes, we include a function that tracks a word's frequency within a list. The if-condition below lets us increment a count whenever we see our target word in the word list. In this case, we examine a series of headlines related to Mexico that were gathered and classified by hand by Mexican university students.

def word_frequency(target_word, word_list):
    """
    Count how many times target_word appears in word_list.

    Args: target_word, the word to count; word_list, the list of tokens to search.

    Returns: a string pairing the word with its count, e.g. 'amlo: 12'.
    """
    count = 0
    for word in word_list:
        if word == target_word:
            count += 1
    return '{0}: {1}'.format(target_word, count)

The word_frequency function receives “AMLO”, or its normalized version ‘amlo’, alongside the word list as the second argument. The frequency of the string is listed next to the term when it is returned. Obviously, you can add more elaborate details to the body of the function.

word_frequency("amlo", saca)
Out[164]: 'amlo: 12'
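
For comparison, the standard library's collections.Counter gives the same counts without a custom function; here is a minimal sketch over a token list like the one shown earlier:

from collections import Counter

tokens = ['capacitan', 'a', 'personal', 'de', 'la', 'fiscalía', 'cdmx', 'a']
counts = Counter(tokens)
print(counts['a'])            # 2
print(counts.most_common(2))  # [('a', 2), ('capacitan', 1)]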

Tokenization In Native Python

At times, an individual contributor must know how to tokenize without writing a custom function or invoking a complex module with heavy dependencies. There may be times when linguists are working within siloed environments, which implies that you would not have the privileges to install libraries, like TensorFlow, in a generic Linux environment. In these cases, use native Python; the term refers to built-in functions and modules that require no installation beyond an up-to-date version of Python.

In fact, you may indeed need to rely more directly on the raw text you are attempting to tokenize. At times, there are different orthographic marks that are relevant and necessary to find. For example, if you want to split on a space, ” “, or a period, “.”, you can do so by calling the string's split method.

def word(text): 
    return text.split(" ")

Every string has a split method that you can invoke for free. Furthermore, you can call a method named ‘strip’ to clean out leading and trailing whitespace. Please see the examples below.

def sentence(text): 
    text_strip = text.strip()
    return text_strip.split(" ")

Frequency Counts For Named Entities Using Spacy/Python Over MX Spanish News Text

In this post, we review some straightforward Python code that allows a user to process text and retrieve named entities alongside their numerical counts. The main dependencies are Spacy, the small, compact version of its Spanish language model built for Named Entity Recognition, and the plotting library Matplotlib, if you're looking to further structure and visualize the data.

Motivation(s)

Before we begin, it may be relevant to understand why we would want to extract these data points in the first place. Oftentimes, there is a benefit to quickly knowing which named entities a collection (or even a Hadoop-sized bucket) of stories references. For instance, one benefit is the ability to quickly visualize the relative importance of an entity to these stories without having to read all of them.

Even if done automatically, the process of Named Entity Recognition is still guided by very basic principles, I think. For instance, the very basic reasoning behind retelling events for an elementary school summary applies to the domain of Named Entity Recognition. That is covered below in the Wh-question section of this post.

Where did you get this data?

Another important set of questions is: what data are we analyzing, and how did we gather this dataset?

Ultimately, a great number of computational linguists and NLP practitioners are interested in compiling human-rights-centered corpora to create tools that can quickly analyze newsflow on these points. When dealing with sensitive topics, the data has to center on the topics relevant to at-need populations. This specific dataset centers on ‘Women/Women’s Issues’ as specified by the group at SugarBearAI.

As for where they obtained this data, the answer is as follows: this dataset of six hundred articles, drawn from the LaCartita Db (which contains several thousand hand-tagged articles), was annotated by hand. The annotators are a group of Mexican graduates from the UNAM and IPN universities. A uniform consensus amongst the news taggers was required before an article was introduced into the set of documents. There were 3 women and 1 man in the group of analysts, all of them with prior experience gathering data in this domain.

While the six hundred Mexican Spanish news headlines analyzed here are unavailable on GitHub, a smaller set is provided on that platform for educational purposes. In all cases, the data was tokenized and normalized with a fairly sophisticated set of Spanish-centric regular expressions.

Please feel free to reach out to that group at research@lacartita.com for more information on this hand-tagged dataset.

Wh-Questions That Guide News Judgments

With all of the context on data and motivations in mind, we review some points on news judgment that can help with the selection of texts for analysis and guide the interpretation of automatically extracted data points.

Basic news judgment is often informed by the following Wh-questions:

1.) What occurred in this news event? (Topic Classification; Event Extraction)

2.) Who was involved in the news?

3.) When did this news event take place?

4.) Where did it take place?

5.) Why did this take place?

If you think of these questions at a fairly high level of abstraction, then you'll allow me to posit that the first two questions are often the domain of Topic Classification and Named Entity Recognition, respectively. This post deals with the latter, but assumes that the documents from which we extract named entities are already organized on the basis of some unifying topic. This is what makes the activity useful in the first place.

In other words, you – the user of this library – are expected to provide a collection of documents already organized under some concept or topic. You would then rely on your knowledge of that topic to make sense of any frequency analysis of named entities, important terms (TF-IDF), etc., as is typical when handling large amounts of unstructured news text. These concepts – NER and TF-IDF – are commonly referenced in Computational Linguistics and Information Retrieval, and they frequently overlap in applied settings. For instance, TF-IDF and NER pipelines power software applications that summarize complex news events in real time. So it's important to know that there are all sorts of open source libraries that handle these tasks for the average user or researcher.
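
As a small illustration of the TF-IDF half of that pairing (an aside: this sketch assumes scikit-learn, which the rest of this post does not otherwise use), term weights can be computed over a handful of headlines like so:

from sklearn.feature_extraction.text import TfidfVectorizer

headlines = [
    "capacitan a personal de la fiscalía cdmx",
    "grupos feministas marchan en la cdmx",
    "apoyar no es delinquir señalan grupos feministas",
]
vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(headlines)
# Terms shared across headlines ('cdmx', 'grupos', 'feministas', 'la') receive lower
# weights than terms that appear in only one headline.
print(vectorizer.get_feature_names_out())
print(weights.toarray().round(2))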

Leveraging Spacy’s Lightweight Spanish Language Model

The actual hard work involves identifying distinct entities; the task of identifying Named Entities involves statistical processes that try to generalize the typical morphological shape a Named Entity takes in text.

In this example, my script is powered by the smaller Spanish language model from Spacy. One thing that should be noted is that the model's training text has its origins in Wikipedia. This means that newer, contemporary types of text may not be sufficiently well covered – breadth doesn't imply depth in analysis. Anecdotally, over this fairly small headline-only corpus sourced by hand with UNAM and IPN students, which contains text on the Mexican president, Andres Manuel Lopez Obrador, Covid and local crime stories, we see performance below 80 percent accuracy from the small Spanish language model. The small, 600-headline-strong sample is the set of example headlines referenced throughout this post.

NER_News Module

Using the scripts below, you can extract persons and organizations: Spacy lets you pull the entities recognized in a corpus and count them. We will use the lighter Spanish language model from Spacy. This post assumes that you've dealt with the basics of Spacy installation alongside its required models; if not, see Spacy's installation documentation. With that in place, we should expect the lines below to run without problem:

import spacy
import spacy.attrs
nlp = spacy.load('es_core_news_sm')

In this example, we use clean texts that are ‘\n’ (“new line”) separated. We count and identify the entities, either writing the results of the NER process to memory or printing them to the console. The following line reads in the corpus and individuates each relevant piece of the text. If these were encyclopedic or news articles, the split would probably capture paragraph- or sentence-level breaks in the text:

raw_corpus = open('corpora/titularesdemx.txt','r', encoding='utf-8').read().split("\n")[1:]

The next step involves placing each NER-tagged string in a dictionary with its frequency count as the value. This dictionary will be the result of running our ‘sacalasentidades’ method over the raw corpus. The method extracts geopolitical entities, like a country, or PER-tagged entities, like a world leader.

import spacy
import spacy.attrs
nlp = spacy.load('es_core_news_sm')

import org_per
raw_corpus = open('corpora/titularesdemx.txt','r', encoding='utf-8').read().split("\n")[1:]
entities = org_per.sacalasentidades(raw_corpus)

# Use the list of entities that are ORG or GEO and count up each individual token.
tokensdictionary = org_per.map_entities(entities)

The formatted output of the tokensdictionary object will look like this:

{'AMLO': 11,
 'Desempleo': 1,
 'Perú': 1,
 'América Latina': 3,
 'Banessa Gómez': 1,
 'Resistir': 2,
 'Hacienda': 1,
 'Denuncian': 7,
 'Madero': 1,
 'Subastarán': 1,
 'Sánchez Cordero': 4,
 'Codhem': 1,
 'Temen': 2,
 'Redes de Derechos Humanos': 1,
 'Gobernación': 1,
 'Sufren': 1,
 '¡Ni': 1,
 'Exigen': 2,
 'Defensoras': 1,
 'Medicina': 1,
 'Género': 1,
 'Gabriela Rodríguez': 1,
 'Beatriz Gasca Acevedo': 1,
 'Diego "N': 1,
 'Jessica González': 1,
 'Sheinbaum': 3,
 'Esfuerzo': 1,
 'Incendian Cecyt': 1,
 'Secretaria de Morelos': 1,
 'Astudillo': 1,
 'Llaman': 3,
 'Refuerzan': 1,
 'Mujer Rural': 1,
 'Inician': 1,
 'Violaciones': 1,
 'Llama Olga Sánchez Cordero': 1,
 'Fuentes': 1,
 'Refuerza Michoacán': 1,
 'Marchan': 4,
 'Ayelin Gutiérrez': 1,
 'Maternidades': 1,
 'Coloca FIRA': 1,
 'Coloquio Internacional': 1,
 'Ley Olimpia': 3,
 'Toallas': 1,
 'Exhorta Unicef': 1,
 'Condena CNDH': 1,
 'Policías de Cancún': 1,
 'Exposición': 1,
 'Nadia López': 1,
 'Aprueba la Cámara': 1,
 'Patriarcales': 1,
 'Sofía': 1,
 'Crean Defensoría Pública para Mujeres': 1,
 'Friedrich Katz': 1,
 'Historiadora': 1,
 'Soledad Jarquín Edgar': 1,
 'Insuficientes': 1,
 'Wikiclaves Violetas': 1,
 'Líder': 1,
 'Alcaldía Miguel Hidalgo': 1,
 'Ventana de Primer Contacto': 1,
 'Parteras': 1,
 'App': 1,
 'Consorcio Oaxaca': 2,
 'Comité': 1,
 'Verónica García de León': 1,
 'Discapacidad': 1,
 'Cuánto': 1,
 'Conasami': 1,
 'Amnistía': 1,
 'Policía de Género': 1,
 'Parteras de Chiapas': 1,
 'Obligan': 1,
 'Suspenden': 1,
 'Contexto': 1,
 'Clemencia Herrera': 1,
 'Fortalecerán': 1,
 'Reabrirá Fiscalía de Chihuahua': 1,
 'Corral': 1,
 'Refugio': 1,
 'Alicia De los Ríos': 1,
 'Evangelina Corona Cadena': 1,
 'Félix Salgado Macedonio': 5,
 'Gabriela Coutiño': 1,
 'Aída Mulato': 1,
 'Leydy Pech': 1,
 'Claman': 1,
 'Insiste Morena': 1,
 'Mariana': 2,
 'Marilyn Manson': 2,
 'Deberá Inmujeres': 1,
 'Marcos Zapotitla Becerro': 1,
 'Vázquez Mota': 1,
 'Dona Airbnb': 1,
 'Sergio Quezada Mendoza': 1,
 'Incluyan': 1,
 'Feminicidios': 1,
 'Contundente': 1,
 'Teófila': 1,
 'Félix Salgado': 1,
 'Policía de Xoxocotlán': 1,
 'Malú Micher': 1,
 'Andrés Roemer': 1,
 'Basilia Castañeda': 1,
 'Salgado Macedonio': 1,
 'Menstruación Digna': 1,
 'Detenidas': 1,
 'Sor Juana Inés de la Cruz': 1,
 'María Marcela Lagarde': 1,
 'Crean': 1,
 'Será Rita Plancarte': 1,
 'Valparaiso': 1,
 'México': 1,
 'Plataformas': 1,
 'Policías': 1,
 'Karen': 1,
 'Karla': 1,
 'Condena ONU Mujeres': 1,
 'Llaman México': 1,
 'Sara Lovera': 1,
 'Artemisa Montes': 1,
 'Victoria': 2,
 'Andrea': 1,
 'Irene Hernández': 1,
 'Amnistía Internacional': 1,
 'Ley de Amnistía': 1,
 'Nació Suriana': 1,
 'Rechaza Ss': 1,
 'Refugios': 1,
 'Niñas': 1,
 'Fiscalía': 1,
 'Alejandra Mora Mora': 1,
 'Claudia Uruchurtu': 1,
 'Encubren': 1,
 'Continúa': 1,
 'Dulce María Sauri Riancho': 1,
 'Aprueba Observatorio de Participación Política de la Mujer': 1,
 'Plantean': 1,
 'Graciela Casas': 1,
 'Carlos Morán': 1,
 'Secretaría de Comunicaciones': 1,
 'Diego Helguera': 1,
 'Hidalgo': 1,
 'LGBT+': 1,
 'Osorio Chong': 1,
 'Carla Humphrey Jordán': 1,
 'Lorenzo Córdova': 1,
 'Edomex': 1,
 'CEPAL': 1,
 'Delitos': 1,
 'Murat': 1,
 'Avanza México': 1,
 'Miguel Ángel Mancera Espinosa': 1,
 'Reconoce INMUJERES': 1,
 'Excluyen': 1,
 'Alejandro Murat': 1,
 'Gómez Cazarín': 1,
 'Prevenir': 1,
 'Softbol MX': 1,
 'Martha Sánchez Néstor': 1}

Errors in the Spacy Model

One of the interesting errors in the Spacy-powered NER process is the erroneous tagging of ‘Plantean’ as a named entity when, in fact, this string is a verb. Similarly, ‘Delitos’ and ‘Excluyen’ receive ORG or PER tags. Possibly, the morphological shape and orthographic tendencies of headlines throw off the small language model. Thus, even with this small test sample, we can see the limits of out-of-the-box open source solutions for NLP tasks. This shows the value added by language analysts and data scientists in organizations dealing with even more specific or specialized texts.

Handling a Large Number of Entries in Matplotlib

One issue is that there will be more Named Entities recognized than is useful or even possible to graph.

Despite the fact that we have a valuable dictionary above, we still need to go further and trim it down in order to figure out what is truly important. In this case, the next Python snippet is helpful in cutting out all dictionary entries with a frequency count of only 1. There are occasions in which a minimum value must be set.

For instance, suppose you have 1000 documents with 1000 headlines. Your NER analyzer must read through these headlines, which ultimately are not a lot of text. Therefore, the minimum count you would like to eliminate is likely to be 1, while if you were analyzing the entirety of the document body, you may want to raise the minimum threshold for a dictionary value's frequency.

The following dictionary comprehension uses a for-loop-type structure to filter out any term whose frequency is 1, the most common frequency. This is appropriate for headlines.

    filter_ones = {term:frequency for term, frequency in data.items() if frequency > 1}

While this dictionary filtering process is fine for headlines, a higher cutoff is needed for body text. A body of 10,000 words or more implies that the threshold for the minimum frequency value should be higher, for example 10:

    filter_ones = {term:frequency for term, frequency in data.items() if frequency > 10}

The resulting dictionary can then be rendered as a matplotlib figure using the function below:

import numpy as np
import matplotlib.pyplot as plt

def plot_terms_body(topic, data):
    """
    Plot entity frequencies as a horizontal bar chart. The np.arange calls
    determine how to space the ticks programmatically; intervals are derived
    from the counts within the dictionary.

    Args: 'topic' is the name of the plot/category. 'data' is a dictionary
          mapping terms to their frequency counts.
    """
    # Drop low-frequency terms first, then keep only terms above the average
    # count of the remaining entries so the bar plot stays readable.
    filter_ones = {term: frequency for term, frequency in data.items() if frequency > 10}
    filtered = {term: frequency for term, frequency in data.items()
                if frequency > round(sum(filter_ones.values()) / len(filter_ones))}
    print(round(sum(filtered.values()) / len(filtered)),
          "Average count of the terms that survive filtering.")
    terms = filtered.keys()
    frequency = filtered.values()
    y_pos = np.arange(len(terms), step=1)
    # Tick marks run from the minimum to the maximum surviving count.
    x_pos = np.arange(min(filtered.values()), max(filtered.values()),
                      step=round(sum(filtered.values()) / len(filtered)))
    plt.barh(y_pos, frequency, align='center', alpha=1)
    plt.yticks(y_pos, terms, fontsize=12)
    plt.xticks(x_pos)
    plt.xlabel('Frecuencia en encabezados')
    plt.title(str(topic), fontsize=14)
    plt.tight_layout()
    plt.show()
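
A hypothetical call is shown below; it assumes an entity-frequency dictionary built the same way as tokensdictionary, but over article bodies, since the headline dictionary above is too sparse for the hard-coded threshold of 10.

# Hypothetical: body_entity_counts would be built like tokensdictionary, but over article bodies.
plot_terms_body('Mujeres', body_entity_counts)
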
Named Entities or Frequent Terms

We are able to extract the most common GEO- or PER-tagged Named Entities in a ‘Women’-tagged set of documents sourced from Mexican Spanish news text.

Surprise, surprise: the terms ‘Exigen‘, ‘Llaman‘ and ‘Marchan‘ cause problems due to their morphological and textual shape; the term ‘Victoria‘ is orthographically identical and homophonous to a proper name, but in this case it is not a Named Entity. These false positives in the NER process from Spacy reflect how language models should be trained over specific texts for better performance. Perhaps an NER model trained over headlines would fare better. The data was already cleaned as part of the collection process detailed earlier, so normalization and tokenization were handled beforehand.

N-Gram Analysis Over Sensitive Topics Corpus

I was recently able to do some analysis over the Sugar Bear AI violence corpus, a collection of documents classified by analysts over at the SugarBear AI group. Over the past year, the group has been manually classifying thousands of Mexican Spanish news documents that deal with the new topics of today: “Coronavirus”, “WFH”, as well as pressing social issues that have only recently come to the fore of mainstream news. Since the start of the pandemic, they've made some of their material available online, but the content is mostly for sale, as the Mexico City based group needs to eat. That aside, they produce super-clean and super-specific ‘Sensitive Topics‘ data.

There is a growing consensus that solid data and analysis are needed for NLP activities that involve specific topics, like human rights or documents referencing cultural minorities in any region. For instance, we cannot reasonably expect webscraped data from Reddit to fulfill the NLP needs of the Mexican Spanish speaking communities residing outside of official borders. As a whole, the worldview of Reddit as encoded in its texts will not adequately reflect the needs or key ideas held within that population. Basic NLP tasks, then, like mining for best autocompleted answers, searches or key terms, cannot be sourced from such a demographically specific (White, 18-45, Male) corpus.

Stopwords

We will work over a couple of hundred documents and try to see what inferences can be drawn from this short sample. We definitely need new stopword lists. These can help filter out ‘noise’ that is specific to the domain we're researching. The list(s) can be expanded over time – NLTK's default Spanish list is not enough – and, while there may be some incentive to work with an existing list, the bigrams below show why we would need a domain-specific list, one in Spanish and developed by analysts who looked at this data continually.
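
A minimal sketch of how such a list might be seeded from NLTK's default Spanish list and then grown with domain terms; the added words below are purely illustrative placeholders, not the analysts' actual list.

from nltk.corpus import stopwords

domain_stopwords = set(stopwords.words('spanish'))
# Hypothetical reporting-verb noise an analyst might flag after reviewing this corpus.
domain_stopwords.update({'señaló', 'informó', 'aseguró'})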

N-Gram Analysis

As alluded to above, we will start with an N-gram analysis.

To put it simply, N-grams are sequences of strings that appear with some frequency in a given text. The strings are called ‘tokens’; the specific guidelines for how strings are defined as tokens depend on the developers of a corpus, who may have specific applications in mind when developing it. In the case of the Sugar Bear AI, the most relevant application is the development of a news classifier for sensitive topics in Mexico.
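
A tiny illustration with NLTK's ngrams helper, over an invented token list:

from nltk.util import ngrams

tokens = ['grupos', 'feministas', 'marchan', 'en', 'cdmx']
print(list(ngrams(tokens, 2)))
# [('grupos', 'feministas'), ('feministas', 'marchan'), ('marchan', 'en'), ('en', 'cdmx')]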

We used NLTK to sample what a basic N-gram analysis would yield. As is typical in news text, this corpus contained dates and non-alphanumeric content, but no emojis or content associated with social media. The analysis from the NLTK ngrams module is sufficient for our purposes. That being said, some modest use of the tokenize module and regular expressions did the trick of ‘normalizing’ the text. Linguists should leverage basic notions of regular expressions:

        if not re.match("[^a-zA-Z]", i) and not re.match("[^a-zA-Z]", j):

The violence corpus is fairly small but it is not static. The Mexico City based Sugar Bear AI group continues to annotate content, at the document level, from alternative and mainstream Mexican news sources. In the set referenced here, there are about 400 documents from 8 different news sources in Mexico. The careful approach to its curation and collection, however, ensures that any researcher working with the corpus will have a balanced and not commonly analyzed collection of texts on sensitive topics.

Code Description

The code here presents a manner in which a list (any list of documents) can be processed by NLTK's ngrams module, while also being normalized with some basic tokenization and stopword removal:

import re
import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams

def find_bigrams(documents):
    bigram_list = []
    stop_words = set(stopwords.words('spanish'))
    for doc in documents:
        tokens = nltk.word_tokenize(doc)
        filtered_sentence = [w for w in tokens if w not in stop_words]
        bigrams = ngrams(filtered_sentence, 2)
        for i, j in bigrams:
            # Keep only bigrams whose members both start with an ASCII letter.
            if not re.match("[^a-zA-Z]", i) and not re.match("[^a-zA-Z]", j):
                bigram_list.append("{0} {1}".format(i, j))
    return bigram_list

# 'raw_corpus' is assumed to be a list of document strings, as in the earlier sections.
alpha = find_bigrams(raw_corpus)
...
frente múltiples
visité cerca
absoluto atrocidades
consejeras rendirán
ser derechos
Vazquez hijo
detenga amenazas
puntos pliego
odisea huyen
anunció separación
Covid-19 debe
víctima frente
Acapulco sentenciar
logró enviar
garantías acceso
documentaron atención
Atendiendo publicación
llamado Estados
interior Estaciones
encuentran mayoría
informativas pandemia
relación familiar
comunidad Agua
día finalizar
Jornada Ecatepec
lenta momento
guerras territoriales
Pese Facebook
pedido retiro
Rendón cuenta
A.C. Mujeres
error quizá
iniciativas enfocadas
consciente anticapitalista
afectar mujer
Justicia u
alertas tempranas
mediante expresión
Nahuatzen comunidad
garantizar repetición
alza indicadores
noche martes
creó Guardia
asegura feminicidio
Unión Fuerza
pronunciamiento Red
carbono equivalente
condiciones desarrollo
comparativo agresiones
recorte refugios
agregó pesar
Examples of n-grams from the violence corpus.

The n-grams above represent a small sampling of typical language within the ‘violence’ corpus. The usual steps to normalize the content were taken, but, as already mentioned, a domain-specific stop-word list should be developed.

Interesting Patterns In Authorship, Anonymity

Another interesting insight about this small dataset is that the document set, or topic, has a usual set of authors: either it's a few authors from a small blog, or large institutions publishing without attribution. You can read the Spanish texts here. Below, I share a graph of how attribution works and which media groups cover the topic of ‘Violence’ in Mexico.

Most large institutions, like El Financiero or even La Jornada, are capable of finding insight on impactful events, like a major violation of human rights, but they do so anonymously. Larger institutions are better able to publish content without attribution, which translates into potentially less risk for their staff.

Authorship Attribution: Large Institutions Can Post Anonymously

Using Spacy in Python To Extract Named Entities in Spanish

The Spacy small language model has some difficulty with contemporary news texts that are neither Eurocentric nor US based. Likely, this lack of accuracy with contemporary figures owes in part to a less thorough scrape of Wikipedia and to the changes that have taken place since 2018 in Mexico, Bolivia and other countries with highly variant dialects of Spanish in LATAM. Regardless, the model can and does garner some results for the purpose of this exercise. This means that we can toy around a bit with some publicly available data.

Entity Hash For Spanish Text

In this informal exercise, we will try to hack our way through some Spanish text: specifically, making use of NER capacities sourced from public data – no rule-based analysis – with some functions I find useful for visualizing Named Entities in Spanish text. We have prepared a Spanish news text on the topic of ‘violence’, or violent crime, sourced from publicly available Spanish news content in Mexico.

Using Spacy, you can hash the entities extracted from a corpus. We will use the lighter Spanish language model from Spacy. This language model is a statistical description of Wikipedia's Spanish corpus, which is likely slanted towards White Hispanic speech, so beware of its bias.

First, import the libraries:

import spacy
import spacy.attrs
nlp = spacy.load('es_core_news_sm')

With the libraries in place, we can import the module ‘org_per’. This module comes from the accompanying GitHub repo.

The work of identifying distinct entities is done in a function that filters for geographical entities and people. These tags are labeled as ‘GEO’ and ‘PER’, respectively, in Spacy's data.
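
The org_per source is the authoritative version, but a sketch of what such a filter might look like directly in Spacy follows; note that I am assuming the small Spanish model's label names here ('PER' and 'LOC', among others), which may not match org_per exactly.

def extract_person_place_entities(lines, nlp):
    """Hypothetical stand-in for org_per.sacalasentidades: keep person and place entities."""
    kept = []
    for line in lines:
        doc = nlp(line)
        for ent in doc.ents:
            # Assumed label names; adjust to whatever tags org_per actually filters on.
            if ent.label_ in ('PER', 'LOC'):
                kept.append(ent.text)
    return kept

# e.g. entities = extract_person_place_entities(raw_corpus, nlp)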

The variable ‘raw_corpus‘ is the argument you provide, which should be some Spanish text data. If you don't have any, visit the repository and load the file object provided there.

import org_per
raw_corpus = open('corpus_es_noticias_mx.txt','r', encoding='utf-8').read().split("\n")[1:]
entities = org_per.sacalasentidades(raw_corpus)

# Use the list of entities that are ORG or PER and count up
# each individual token.
tokensdictionary = org_per.map_entities(entities)

As noted before, the model's training text has its origins in Wikipedia. This means that newer, more contemporary types of text may not be sufficiently well covered – breadth doesn't imply depth in analysis, because stochastic models rely on some passing resemblance to data that may never have been seen.

Anecdotally, over a small corpus, we see performance below 80 percent accuracy for this language model. Presumably, a larger sampling of Wikipedia ES data would perform better, but certain trends in contemporary news text make it necessary to temper this expectation.

The output returned from running `org_per.map_entities(entities)` will look like this:

{"Bill Clinton": 123,
"Kenneth Starr" : 12,
}

The actual hashing is a simple enough method: each NER-tagged string is placed in a dictionary with its frequency count as the value. Within your dictionary, you may get parses of Named Entities that are incorrect; that is to say, they are not properly delimited, because the Named Entity language model does not have an example matching your parse. For instance, Lopez Obrador – the current president of Mexico – is not easily recognized as ‘PER’.
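
A minimal sketch of that dictionary-building step, assuming entities is the flat list of entity strings from the previous step; the real org_per.map_entities may differ in its details.

from collections import Counter

def map_entities_sketch(entities):
    """Hypothetical counterpart to org_per.map_entities: entity string -> frequency count."""
    return dict(Counter(entities))

# e.g. map_entities_sketch(['AMLO', 'AMLO', 'Sheinbaum']) -> {'AMLO': 2, 'Sheinbaum': 1}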

Accuracy

This is measured very simply by tabulating how often you agree with the returned Named Entities. The difference between expected and returned values is your error rate. More on accuracy metrics in the next post.
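
As a rough sketch of that tabulation, suppose you review a sample of the returned entities and mark each one correct or incorrect:

def simple_accuracy(judgments):
    """judgments: booleans, True where a returned entity was judged correct."""
    return sum(judgments) / len(judgments)

# e.g. 8 correct out of 10 reviewed entities: 0.8 accuracy, 0.2 error rate.
print(simple_accuracy([True] * 8 + [False] * 2))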