Leveraging NVIDIA Downloads

A common issue when installing TensorFlow in the Anaconda Python environment is an error message citing a missing DLL file. You will then see the same error when invoking any Spacy language models that need TensorFlow installed properly.

Thus, running the code below will produce an error message if the proper dependencies are not installed:

import spacy
import spacy.attrs
nlp = spacy.load('es_core_news_sm')

The error message below will appear if the NVIDIA GPU Developer kit is not installed:

"W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found"
"I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine."

The issue is the lack of a GPU developer kit from NVIDIA; the toolkit download is linked below.

CUDA Toolkit 11.4 Update 1 Downloads | NVIDIA Developer
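
Once the toolkit is installed, a quick sanity check, assuming TensorFlow 2.x, is to list the GPUs TensorFlow can see:

import tensorflow as tf
# an empty list means TensorFlow still cannot find a usable GPU
print(tf.config.list_physical_devices('GPU'))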

Frequency Counts For Named Entities Using Spacy/Python Over MX Spanish News Text

In this post, we review some straightforward Python code that lets a user process text and retrieve named entities alongside their frequency counts. The main dependencies are Spacy, its small, compact Spanish language model built for Named Entity Recognition, and Matplotlib, the plotting library, if you’re looking to further visualize the data.

Motivation(s)

Before we begin, it may be relevant to understand why we would want to extract these data points in the first place. Oftentimes, there is a benefit to quickly knowing which named entities a collection (or even a Hadoop-sized bucket) of stories references. For instance, one benefit is being able to quickly visualize the relative importance of an entity to these stories without having to read all of them.

Even when done automatically, the process of Named Entity Recognition is still guided by very basic principles. The same basic reasoning behind retelling events for an elementary school summary applies to the domain of Named Entity Recognition; this is covered in the Wh-question section of this post below.

Where did you get this data?

Another important set of questions is what data are we analyzing and how did we gather this dataset?

Ultimately, a great number of computational linguists and NLP practitioners are interested in compiling human-rights-centered corpora to create tools that quickly analyze newsflow on these points. When dealing with sensitive topics, the data has to center on topics relevant to at-need populations. This specific dataset centers on ‘Women/Women’s Issues’ as specified by the group at SugarBearAI.

As for where they obtained this data: the six hundred articles in this dataset come from the LaCartita Db – which contains several thousand hand-tagged articles – and were annotated by hand. The annotators are a group of Mexican graduates from the UNAM and IPN universities. A uniform consensus among the news taggers was required before an article entered the set of documents. The group of analysts comprised three women and one man, all with prior experience gathering data in this domain.

While the six hundred Mexican Spanish news headlines analyzed are unavailable on Github, a smaller set is provided on that platform for educational purposes. In all cases, the data was tokenized and normalized with a fairly sophisticated set of Spanish-centric regular expressions.

Please feel free to reach out to that group at research@lacartita.com for more information on this hand-tagged dataset.

Wh-Questions That Guide News Judgments

With all of the context on data and motivations in mind, we review some points on news judgment that can help with the selection of texts for analysis and guide the interpretation of automatically extracted data points.

Basic news judgment is often informed by the following Wh-questions:

1.) What occurred in this news event? (Topic Classification; Event Extraction)

2.) Who was involved in the news?

3.) When did this news event take place?

4.) Where did it take place?

5.) Why did this take place?

Considered at a fairly high level of abstraction, the first two questions are often the domain of Topic Classification and Named Entity Recognition, respectively. This post deals with the latter, but assumes that the documents from which we extract named entities are already organized around some unifying topic. That organization is what makes the activity worthwhile in the first place.

In other words, you – the user of this library – provide a collection of documents already organized under some concept or topic. You then rely on your knowledge of that topic to make sense of any frequency analysis of named entities, important terms (TF-IDF), etc., as is typical when handling large amounts of unstructured news text. These concepts – NER and TF-IDF – are commonly referenced in Computational Linguistics and Information Retrieval, and they frequently overlap in applied settings. For instance, TF-IDF and NER pipelines power software applications that summarize complex news events in real time. All sorts of open source libraries handle these tasks for the average user or researcher.
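
Since TF-IDF comes up alongside NER here, a minimal TF-IDF sketch may help, assuming scikit-learn is installed (it is not otherwise used in this post) and using three invented toy headlines:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['marchan mujeres en cdmx',
        'exigen justicia para mujeres',
        'sheinbaum anuncia nuevo programa']
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # rows are documents, columns are term weights
# terms concentrated in one document receive higher weights than terms spread across all
print(vectorizer.vocabulary_)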

Leveraging Spacy’s Lightweight Spanish Language Model

The actual hard work lies in identifying distinct entities: Named Entity Recognition involves statistical processes that try to generalize what a typical Named Entity’s morphological shape looks like in text.

In this example, my particular script is powered by the smaller language model from Spacy. One thing that should be noted is that its training text has its origins in Wikipedia. This means that newer, contemporary types of text may not be sufficiently well covered – breadth doesn’t imply depth in analysis. Anecdotally, over this fairly small headline-only corpus sourced by hand with UNAM and IPN students – containing text on the Mexican president, Andres Manuel Lopez Obrador, Covid, and local crime stories – we see performance below 80 percent accuracy from the small Spanish language model. The small sample of 600 headlines is linked here: example headlines referenced.

NER_News Module

Using the scripts below, you can extract persons and organizations from a corpus and count them. We will use the lighter Spanish language model from Spacy’s natural language toolkit. This post assumes that you’ve dealt with the basics of Spacy installation alongside its required models; if not, visit here. We should therefore expect the lines below to run without problems:

import spacy
import spacy.attrs
nlp = spacy.load('es_core_news_sm')

In this example, we use clean texts separated by ‘\n’ (new lines). We count and identify the entities, writing the results of the NER process to memory or printing them to the console. The following line reads the corpus and individuates each relevant piece of the text. If these were encyclopedic or news articles, the split would probably capture paragraph- or sentence-level breaks in the text:

raw_corpus = open('corpora/titularesdemx.txt','r', encoding='utf-8').read().split("\n")[1:]

The next step places each NER-extracted string in a dictionary with its frequency count as the value. This dictionary is the result of running our ‘sacalasentidades’ method over the raw corpus. The method extracts geopolitical entities, like a country, and PER-tagged entities, like a world leader.
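
The real org_per module lives in the Github repo referenced later in this post; what follows is only a hedged sketch of what its two functions might look like, with the function names taken from this post and the tag filter assumed (the small Spanish model labels people PER, organizations ORG, and places LOC):

from collections import Counter

def sacalasentidades(lines):
    """Collect entity strings tagged PER, ORG, or LOC from each line of text."""
    entities = []
    for line in lines:
        doc = nlp(line)  # 'nlp' is the es_core_news_sm pipeline loaded above
        entities.extend(ent.text for ent in doc.ents
                        if ent.label_ in ('PER', 'ORG', 'LOC'))
    return entities

def map_entities(entities):
    """Map each entity string to its frequency count."""
    return dict(Counter(entities))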

import spacy
import spacy.attrs
nlp = spacy.load('es_core_news_sm')

import org_per
raw_corpus = open('corpora/titularesdemx.txt','r', encoding='utf-8').read().split("\n")[1:]
entities = org_per.sacalasentidades(raw_corpus)
 
# use list of entities that are ORG or GEO and count up each individual token.

tokensdictionary = org_per.map_entities(entities) 

The formatted output of the tokensdictionary object will look like this:

{'AMLO': 11,
 'Desempleo': 1,
 'Perú': 1,
 'América Latina': 3,
 'Banessa Gómez': 1,
 'Resistir': 2,
 'Hacienda': 1,
 'Denuncian': 7,
 'Madero': 1,
 'Subastarán': 1,
 'Sánchez Cordero': 4,
 'Codhem': 1,
 'Temen': 2,
 'Redes de Derechos Humanos': 1,
 'Gobernación': 1,
 'Sufren': 1,
 '¡Ni': 1,
 'Exigen': 2,
 'Defensoras': 1,
 'Medicina': 1,
 'Género': 1,
 'Gabriela Rodríguez': 1,
 'Beatriz Gasca Acevedo': 1,
 'Diego "N': 1,
 'Jessica González': 1,
 'Sheinbaum': 3,
 'Esfuerzo': 1,
 'Incendian Cecyt': 1,
 'Secretaria de Morelos': 1,
 'Astudillo': 1,
 'Llaman': 3,
 'Refuerzan': 1,
 'Mujer Rural': 1,
 'Inician': 1,
 'Violaciones': 1,
 'Llama Olga Sánchez Cordero': 1,
 'Fuentes': 1,
 'Refuerza Michoacán': 1,
 'Marchan': 4,
 'Ayelin Gutiérrez': 1,
 'Maternidades': 1,
 'Coloca FIRA': 1,
 'Coloquio Internacional': 1,
 'Ley Olimpia': 3,
 'Toallas': 1,
 'Exhorta Unicef': 1,
 'Condena CNDH': 1,
 'Policías de Cancún': 1,
 'Exposición': 1,
 'Nadia López': 1,
 'Aprueba la Cámara': 1,
 'Patriarcales': 1,
 'Sofía': 1,
 'Crean Defensoría Pública para Mujeres': 1,
 'Friedrich Katz': 1,
 'Historiadora': 1,
 'Soledad Jarquín Edgar': 1,
 'Insuficientes': 1,
 'Wikiclaves Violetas': 1,
 'Líder': 1,
 'Alcaldía Miguel Hidalgo': 1,
 'Ventana de Primer Contacto': 1,
 'Parteras': 1,
 'App': 1,
 'Consorcio Oaxaca': 2,
 'Comité': 1,
 'Verónica García de León': 1,
 'Discapacidad': 1,
 'Cuánto': 1,
 'Conasami': 1,
 'Amnistía': 1,
 'Policía de Género': 1,
 'Parteras de Chiapas': 1,
 'Obligan': 1,
 'Suspenden': 1,
 'Contexto': 1,
 'Clemencia Herrera': 1,
 'Fortalecerán': 1,
 'Reabrirá Fiscalía de Chihuahua': 1,
 'Corral': 1,
 'Refugio': 1,
 'Alicia De los Ríos': 1,
 'Evangelina Corona Cadena': 1,
 'Félix Salgado Macedonio': 5,
 'Gabriela Coutiño': 1,
 'Aída Mulato': 1,
 'Leydy Pech': 1,
 'Claman': 1,
 'Insiste Morena': 1,
 'Mariana': 2,
 'Marilyn Manson': 2,
 'Deberá Inmujeres': 1,
 'Marcos Zapotitla Becerro': 1,
 'Vázquez Mota': 1,
 'Dona Airbnb': 1,
 'Sergio Quezada Mendoza': 1,
 'Incluyan': 1,
 'Feminicidios': 1,
 'Contundente': 1,
 'Teófila': 1,
 'Félix Salgado': 1,
 'Policía de Xoxocotlán': 1,
 'Malú Micher': 1,
 'Andrés Roemer': 1,
 'Basilia Castañeda': 1,
 'Salgado Macedonio': 1,
 'Menstruación Digna': 1,
 'Detenidas': 1,
 'Sor Juana Inés de la Cruz': 1,
 'María Marcela Lagarde': 1,
 'Crean': 1,
 'Será Rita Plancarte': 1,
 'Valparaiso': 1,
 'México': 1,
 'Plataformas': 1,
 'Policías': 1,
 'Karen': 1,
 'Karla': 1,
 'Condena ONU Mujeres': 1,
 'Llaman México': 1,
 'Sara Lovera': 1,
 'Artemisa Montes': 1,
 'Victoria': 2,
 'Andrea': 1,
 'Irene Hernández': 1,
 'Amnistía Internacional': 1,
 'Ley de Amnistía': 1,
 'Nació Suriana': 1,
 'Rechaza Ss': 1,
 'Refugios': 1,
 'Niñas': 1,
 'Fiscalía': 1,
 'Alejandra Mora Mora': 1,
 'Claudia Uruchurtu': 1,
 'Encubren': 1,
 'Continúa': 1,
 'Dulce María Sauri Riancho': 1,
 'Aprueba Observatorio de Participación Política de la Mujer': 1,
 'Plantean': 1,
 'Graciela Casas': 1,
 'Carlos Morán': 1,
 'Secretaría de Comunicaciones': 1,
 'Diego Helguera': 1,
 'Hidalgo': 1,
 'LGBT+': 1,
 'Osorio Chong': 1,
 'Carla Humphrey Jordán': 1,
 'Lorenzo Córdova': 1,
 'Edomex': 1,
 'CEPAL': 1,
 'Delitos': 1,
 'Murat': 1,
 'Avanza México': 1,
 'Miguel Ángel Mancera Espinosa': 1,
 'Reconoce INMUJERES': 1,
 'Excluyen': 1,
 'Alejandro Murat': 1,
 'Gómez Cazarín': 1,
 'Prevenir': 1,
 'Softbol MX': 1,
 'Martha Sánchez Néstor': 1}

Errors in the Spacy Model

One of the interesting errors in the Spacy-powered NER process is the erroneous tagging of ‘Plantean’ as a named entity when, in fact, this string is a verb. Similarly, ‘Delitos’ and ‘Excluyen’ receive ORG or PER tags. Possibly, the morphological shape and orthographic tendencies of headlines throw off the small language model. Thus, even with this small test sample, we can see the limits of out-of-the-box open source solutions for NLP tasks. This shows the value added by language analysts and data scientists in organizations dealing with even more specific or specialized texts.

Handling a Large Number of Entries in Matplotlib

One issue is that there will be more Named Entities recognized than is useful or even possible to graph.

Although the dictionary above is valuable, we still need to trim it down to figure out what is truly important. The next Python snippet cuts out all dictionary entries with a frequency count of only 1. There are occasions in which a higher minimum value must be set.

For instance, suppose you have 1,000 documents, each contributing a headline. Your NER analyzer must read through these headlines, which ultimately are not a lot of text, so the minimum count to eliminate is likely to be 1. If you were analyzing the entirety of each document body instead, you might raise the minimum frequency threshold.

The following dictionary comprehension filters out any term whose frequency is 1, the most common count. This is appropriate for headlines.

    filter_ones = {term:frequency for term, frequency in data.items() if frequency > 1}
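
Applied to the headline dictionary shown earlier, this keeps only the repeated entities; a subset of the result:

    {'AMLO': 11,
     'Denuncian': 7,
     'Félix Salgado Macedonio': 5,
     'Marchan': 4,
     'Sánchez Cordero': 4,
     'Ley Olimpia': 3,
     'Sheinbaum': 3}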

While this filtering threshold works for headlines, a higher one is needed for body text: 10,000 or more words implies that the minimum frequency threshold should be higher than 10.

    filter_ones = {term:frequency for term, frequency in data.items() if frequency > 10}

The resulting dictionary can then be presented as a matplotlib figure using the function below:

import numpy as np
import matplotlib.pyplot as plt

def plot_terms_body(topic, data):
    """
    The np.arange calls determine how to best and programmatically
    plot the data. Intervals are determined by the counts within the dictionary.

    Args:
        topic: the name of the plot/category.
        data: a dictionary mapping terms to frequency counts.
    """
    # The bar plot should be optimized for the max and min size of
    # individual counts, so low-frequency terms are filtered out first.
    filter_ones = {term: frequency for term, frequency in data.items() if frequency > 10}
    filtered = {term: frequency for term, frequency in data.items()
                if frequency > round(sum(filter_ones.values()) / len(filter_ones))}
    print(round(sum(filtered.values()) / len(filtered)),
          "Average count as a result of total terms minus once-identified terms divided by all terms.")
    terms = filtered.keys()
    frequency = filtered.values()
    y_pos = np.arange(len(terms), step=1)
    # min dictionary value, max filtered value
    x_pos = np.arange(min(filtered.values()), max(filtered.values()),
                      step=round(sum(filtered.values()) / len(filtered)))
    plt.barh(y_pos, frequency, align='center', alpha=1)
    plt.yticks(y_pos, terms, fontsize=12)
    plt.xticks(x_pos)
    plt.xlabel('Frecuencia en encabezados')
    plt.title(str(topic), fontsize=14)
    plt.tight_layout()
    plt.show()
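
A hedged usage sketch: ‘body_counts’ is a hypothetical frequency dictionary built over full article bodies, where counts above the frequency-10 threshold are common (the headline dictionary above would not survive the filter, since only ‘AMLO’ exceeds 10):

# 'body_counts' is a hypothetical dictionary built over article bodies
plot_terms_body('Mujeres en noticias MX', body_counts)
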
Named Entities or Frequent Terms

We are able to extract the most common GEO- or PER-tagged Named Entities in a ‘Women’-tagged set of documents sourced from Mexican Spanish news text.

Surprise, surprise: the terms ‘Exigen‘, ‘Llaman‘, and ‘Marchan‘ cause problems due to their morphological and textual shape; the term ‘Victoria‘ is orthographically identical and homophonous to a proper name, but in this case it is not a Named Entity. These false positives in Spacy’s NER process reflect how language models should be trained over specific texts for better performance. Perhaps an NER model trained over headlines would fare better. The data was already cleaned through the collection process detailed earlier, so normalization and tokenization were handled beforehand.

Using Spacy in Python To Extract Named Entities in Spanish

The Spacy small language model has some difficulty with contemporary news texts that are neither Eurocentric nor US-based. This lack of accuracy with contemporary figures likely owes in part to a less thorough scrape of Wikipedia and to the relative changes that have taken place since 2018 in Mexico, Bolivia, and other LATAM countries with highly variant dialects of Spanish. Regardless, the model can and does garner some results for the purpose of this exercise, which means we can toy around a bit with some publicly available data.

Entity Hash For Spanish Text

In this informal exercise, we will try to hack our way through some Spanish text, making use of NER capacities sourced from public data – no rule-based analysis – via some functions I find useful for visualizing Named Entities in Spanish text. We have prepared a Spanish news text on the topic of ‘violence’ or violent crime, sourced from publicly available Spanish news content in Mexico.

Using spacy, you can hash the entities extracted from a corpus. We will use the lighter Spanish language model from Spacy’s natural language toolkit. This language model is a statistical description of Wikipedia’s Spanish corpus, which is likely slanted towards White Hispanic speech, so beware its bias.

First, import the libraries:

import spacy
import spacy.attrs
nlp = spacy.load('es_core_news_sm')

With the libraries in place, we can import the module ‘org_per’. This module references this Github repo.

The work of identifying distinct entities is done in a function that filters for geographical entities and people. These are labeled ‘GEO’ and ‘PER’, respectively, in spacy’s data.

The variable ‘raw_corpus‘ is the argument you provide, which should be some Spanish text data. If you don’t have any, visit the repository and load that file object.

import org_per
raw_corpus = open('corpus_es_noticias_mx.txt','r', encoding='utf-8').read().split("\n")[1:]
entities = org_per.sacalasentidades(raw_corpus)
 
# use list of entities that are ORG or PER and count up
# each individual token.

tokensdictionary = org_per.map_entities(entities) 

As noted before, the model’s training text has its origins in Wikipedia. This means that newer, more contemporary types of text may not be sufficiently well covered – breadth doesn’t imply depth in analysis, because stochastic models rely on some passing resemblance to data they may never have seen.

Anecdotally, over a small corpus, we see performance below 80 percent accuracy from this language model. Presumably, a larger sampling of Wikipedia ES data would perform better, but certain trends in contemporary news text make it necessary to temper this expectation.

The output returned from running `org_per.map_entities(entities)` will look like this:

{"Bill Clinton": 123,
"Kenneth Starr" : 12,
}

The actual hashing is a simple enough method: place each NER-extracted string in a dictionary with its frequency count as the value. Within your dictionary, you may get parses of Named Entities that are incorrect – that is, not properly delimited, because the language model does not have an example resembling your parse. For instance, Lopez Obrador – the current president of Mexico – is not easily recognized as ‘PER’.
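
A quick hedged way to see this behavior, reusing the nlp pipeline loaded above over an invented headline:

doc = nlp('López Obrador anuncia nueva conferencia matutina')  # hypothetical headline
print([(ent.text, ent.label_) for ent in doc.ents])
# depending on the model version, the name may be mis-delimited or missed entirely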

Accuracy

Accuracy is measured very simply by tabulating how much you agree with the returned Named Entities. The difference between expected and returned values is your error rate. More on accuracy metrics in the next post.
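
A minimal sketch of that tabulation, with invented expected/returned sets:

# entities a human annotator expected vs. what the model returned (toy sets)
expected = {'Sánchez Cordero', 'Sheinbaum', 'Ley Olimpia'}
returned = {'Sánchez Cordero', 'Sheinbaum', 'Llaman'}

agreement = len(expected & returned) / len(returned)  # share of returned entities you agree with
error_rate = 1 - agreement
print(agreement, error_rate)  # 0.666..., 0.333...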

Introduction to Python


Python is a fairly flexible and fast programming language, considering that it is a high-level language. This brief overview of the language presents an exercise that opens a file, a task that is almost routine in any complex job.

Now, if you are looking for a more advanced exercise, or one that covers a particular topic, I suggest this link. I also have a project where I identify the entities in a text, but the documentation is in English.

“High-level programming languages are characterized by a semantic structure very similar to the way humans write, which allows algorithms to be coded in a more natural way, instead of coding them in the binary language of machines or at the assembly-language level.”

UNAM, F. J. (2004). Enciclopedia del lenguaje C. México: Alfaomega/RaMa.

Basically, this means that Python condenses operations that require more lines of code in other programming languages into a more compact and friendly format. In part, this limits the language’s relevance for certain industrial applications, but that is not very relevant for a beginning or advanced programmer with goals in NLP or Artificial Intelligence.

The important thing is that you can quickly carry out a task that would be impossible to complete by hand.

Python Modules/Libraries

A module or library in Python is a ready-made code base (perhaps already included with your Python installation) that lets the user carry out certain tasks.

A module is a Python object with arbitrarily named attributes that you can bind and reference.

COVANTEC

The term ‘library’ or module exists in every programming language.

This program uses a statement that invokes another module: ‘import X’ is the pattern. For example, import csv means the user wants to import the csv module, which specializes in opening and reading csv files.

import csv
# csv, a library for texts delimited by ','
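
A short sketch of using the module, where 'ejemplo.csv' is a hypothetical file:

import csv

with open('ejemplo.csv', 'r', encoding='utf-8') as f:
    for row in csv.reader(f):
        print(row)  # each row arrives as a list of strings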

Flexibility

A great deal can be accomplished with the libraries already installed in any downloaded version of Python. What’s more, it is also possible (though not recommended in the long run) to use Python without engaging deeply with its object-oriented structure.

That is, you do not have to define very sophisticated classes or methods and can still get good use out of the language.

In other words, Python modules can be used as if Python were a scripting language like BASH. If none of these sentences makes much sense for lack of context, that is fine.

For now, it is enough to know that Python is quite flexible and easy to understand. A few lines of code can bring an easy (but repetitive) task to an adequate resolution.

Interpreted – Not Compiled

Python is a language that, without a compiler, can run and generate an analysis or result of some kind with great ease. For example, the interaction below demonstrates an arithmetic operation being carried out.

2 + 2 in Python:
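
A sketch of that interactive session:

>>> 2 + 2
4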

Python in Anaconda: an interactive session where a programmer runs a sequence of commands to obtain the final result.

Finally, I recommend using Anaconda-Spyder, since it comes with extensive documentation and libraries.

Spyder is a development environment equipped with the modules a data analyst needs, along with good documentation and access to new libraries. The Anaconda group maintains a repository with the most up-to-date and relevant libraries.

Anaconda is available here, all the way at the bottom of the page.

Syntax for Opening and Closing a File in Python

A simple task like opening a file can be done with a single line of code, or several, depending on how you prefer to organize your code.

# NOT CODE: documentation or comments go here
file = open('ejemplo.txt', 'r')

# you can access methods of the 'file' object with '.'
file.read()

Alternatively, the syntax allows:

file = open('ejemplo.txt', 'r').read()

There is also the option of adding arguments that correctly read the kind of text in question, since there are distinct codecs depending on a language’s characters. Note that we have not invoked any module for these operations.

file = open('ejemplo.txt', 'r', encoding='utf-8').read()
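
None of the examples above closes the file explicitly; a brief sketch of doing so, either manually or with a 'with' block that closes the file automatically:

file = open('ejemplo.txt', 'r', encoding='utf-8')
texto = file.read()
file.close()  # release the file handle explicitly

# or let a 'with' block close it on exit
with open('ejemplo.txt', 'r', encoding='utf-8') as f:
    texto = f.read()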

Importing Preinstalled Libraries: regular expressions

One library that ships with Python is ‘regular expressions‘, named simply “re”, which is imported as follows.

import re 

The library’s substitution method is invoked as an attribute, that is, with the ‘.’ symbol attached to the ‘re’ module.

objeto = re.sub('abc', 'cba', 'este es el texto que se manipula abc')

In the example, the first argument to re.sub(...) is the string to be replaced, the second argument is the replacement, and the third argument refers to the text being manipulated: ‘este es el texto que se manipula abc’.
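
Printing the result confirms the substitution:

print(objeto)  # este es el texto que se manipula cba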