Tokenizing Text, Finding Word Frequencies Within Corpora

One way to think about tokenization is to consider it as finding the smallest possible unit of analysis for computational linguistics tasks. As such, we can think of tokenization as among the first steps (along with normalization) in the average NLP pipeline or computational linguistics analysis. This process helps break down text into a manner interpretable for the computer and analyst.

NLTK Is Very Comprehensive

NLTK is likely the best place to start for both understanding and customizing NLP pipelines. Please review their documentation on tokenization here: NLTK – Tokenization Example. While I recommend reviewing NLTK, you should also keep up with Engineers who mostly use Tensorflow. Yes, you must learn two packages at once if you are a linguist in the IT industry.

Learn TensorFlow

“Learn TensorFlow. Given the effort you will place into learning how to combine computing with linguistics, you are also in a strange way lightening the load by proceeding in parallel with industry trends. Likely, a popular topic will contain ample documentation. Consider that most engineers will have a frame of reference for tokenization that is not necessarily grounded in Linguistics, but instead based on interactions with industry centric examples with an intent to prepare data for Machine Learning.”

Industry Perceptions

Thus, if you do not know tokenization both in terms of how engineers perceive tokenization and how linguists work with the concept, then you will likely be perceived as not only not knowing how to program, but also not knowing about your own subject matter as a linguist. While this is obviously not true, perception matters so you must make the effort to reach engineers at their level when collaborating.

# -*- coding: utf-8 -*-
"""
Created on Fri Sep 10 23:53:10 2021

@author: Ricardo Lezama
"""
import tensorflow as tf 

text = """ 
    A list comprehension is a syntactic construct available in
    some programming languages for creating a list based on existing lists. 
    It follows the form of the mathematical set-builder notation (set comprehension) as
    distinct from the use of map and filter functions.
    """
 
content = tf.keras.preprocessing.text.text_to_word_sequence(text)

Obviously, we do not want to repeat the one liner above over and over again within our individual python script. Thus, we neatly repackage the one line as the main line in a function titled ‘tokenize_lacartita‘ as follows:



def tokenize_lacartita(text):
    """ open_lacartita_data references a function  to open our txt documents. 
    
    Arg: This should be a text string that will be converted into a list with individual tokens.
          ex.) ['This', 'is', 'text', 'tokenized', 'by', 'tensorflow']
    
    Returns: New line separated documents. 
    """
    keras_tok  = tf.keras.preprocessing.text.text_to_word_sequence(text)
    return keras_tok

The data we will receive for using this tokenization module is shown below. As you can see, there are individual strings, lowercased and no punctuation as this is by default eliminated in the tokenization process.

['morelia',
 'apoyar',
 'no',
 'es',
 'delinquir',
 'señalan',
 'grupos',
 'feministas',
 'a',
 'sheinbaum',
 'capacitan',
 'a',
 'personal',
 'de',
 'la',
 'fiscalía',
 'cdmx']

Word Frequency and Relative Percentage

We can create a function to find word frequencies. Granted, the counter module in Python can do this already, but, for educational purposes, we include a function to track a word’s frequency within a list. The if-condition below can permit us to count whenever we see our target word within the word list. In this case, we examine a series of headlines related to Mexico that were gathered and classified by hand by Mexican University students.

def word_frequency(word, word_list):
    """
    Function that counts word frequencies.

    Arg: target word to count within word list.
    
    Return: return a count. 
    """
    count = 0
    for word in word_list:
        if word == target_word:
            count += 1
    return count

The word_frequency function receives “AMLO” or it’s normalized version: ‘amlo’ alongside the word list as the second argument. The frequency of the string is listed next to the term when it is returned. Obviously, you can add more elaborate details to the body of the function.

word_frequency("amlo", saca)
Out[164]: 'amlo: 12'

Tokenization In Native Python

At times, an individual contributor must know how to not write a custom function or invoke tokenization from a complex module with heavy libraries. There may be times that linguists are working within siloed environments. This implies that you would not have write privileges to install libraries, like TensorFlow, in a generic linux environment. In this cases, use native python – the term references built in functions or modules that require no additional installation outside of having the most updated version of python.

In fact, you may indeed need to rely more on the raw text you are attempting to tokenize. At times, there are different orthographic marks that are relevant and necessary to find. For example, were you to split based on a space, ” “, or a period, “.”, you can do so by calling the split attribute.

def word(text): 
    return text.split(" ")

All strings contain a split attribute that you can invoke for free in general. Furthermore, you can run a method called ‘strip’ and cleanout a lot of whitespaces. Please see the examples below.

def sentence(text): 
    text_strip = text.strip()
    return text_strip.split(" ")

Getting Started With The Command Line

First of all, what is a ‘command line’? Visually, within a Microsoft Operating System, the command line looks like this:

Pictured is a command line. You can access this by searching ‘CMD’ on your windows search bar.

You can access the command line by typing ‘CMD’ on your Windows Desktop.

As the name implies, a command line is an interface where a user inputs literal commands to accomplish a computing task. The same or similar amount of tasks are often possible via mouse clicks, keyboard commands outside of the CMD screen, etc. Prior to the mouse, the command line was the primary mode of interacting with a computer in the 1970’s prior to the invention of the mouse, point and click method for accessing files in a computer. You may reasonably ask: ‘If people can accomplish basic computing tasks with a simple mouse click and scroll of the screen, then why would anyone use the command line’? The answer is that in modern computing, heavier or more complex tasks can be accomplished more easily by providing specific instructions that can not easily be accomplished with mouse clicks.

Let’s explore a simple and commonly used command to just get started in the command line.

The MV command

The MV command, an often used command for server administrators and grunt level programmers (like me), can help move files around and between computers. Think of programming at this level as steps or tasks accomplished. Individually, this command seems insignificant, but often times, commands are used in harmony with other commands.

Like I alluded to above in the introduction, there is some parity between the tasks accomplished with a mouse and the command line. They are both a type of interface with the computer/server.

For example, one could drag and drop a series of files, one by one, into a different folder based on some criteria. However, let’s say you have a large set of files with some common denominator in terms of text or naming conventions. While you could continue to drag and drop, there are moments in which there are simply too many files to easily view within a screen.

This simple command called ‘mv’ or short for move permits you to move files within a command line interface to other directories.

mv FILENAME somedirectory/

In general, first you type the command type, in this case ‘mv‘, then to the right of that command you specify which files you will move.

mv FILENAME ZIP/

Below is a screenshot of a real world example, a screenshot of a directory that contains these files and the directory where I want to place these files. You may have noticed that there is a ‘*‘ symbol right before zip and then the directory ZIP.

The ‘*‘ symbol is called the Kleene star or wildcard character. The wildcard character matches any character and any number of characters simultaneously. This * character tells the computer to search for a filename with any number and type of strings prior to the string I specified, ‘zip’. It will therefore move all the zip files into the ZIP/ directory.

‘MV’ command where we specify the movement of zip files into ZIP

Here is the literal codeblock:

mv *zip ZIP/

As you can see, the file names that can be moved are ‘g2p-seq2seq-master.zip’, ‘spa-eng.zip’, ‘spanish_g2p.zip’ and ‘NER_news-main.zip’. Its a lot a easier to just type *zip and go about your business that away. Imagine if instead of 4 files it was 400 files you needed to move. In that scenario, therein lies the utility of such a simple command that can “catch-all” the filenames.

Please note, you can also type the file names individually. Here is an example:

mv 'g2p-seq2seq-master.zip' 'spa-eng.zip' 'spanish_g2p.zip' 'NER_news-main.zip' ZIP/

Hopefully, this brief overview of how to use the ‘MV’ command is helpful. Feel free to reach out at lezama@lacartita.com with any questions. Thanks!

Frequency Counts For Named Entities Using Spacy/Python Over MX Spanish News Text

On this post, we review some straightforward code written in python that allows a user to process text and retrieve named entities alongside their numerical counts. The main dependencies are Spacy, a small compact version of their Spanish language model built for Named Entity Recognition and the tabular data processing library, Matplotlib, if you’re looking to further structure the data.

Motivation(s)

Before we begin, it may be relevant to understand why we would want to extract these data points in the first place. Often times, there is a benefit to quickly knowing what named entity a collection (or even a hadoop sized bucket) of stories references. For instance, one of the benefits is the quick ability to visualize the relative importance of an entity to these stories without having to read all of them.

Even if done automatically, the process of Named Entity Recognition is still guided by very basic principles, I think. For instance, the very basic reasoning surrounding a retelling of events for an elementary school summary applies to the domain of Named Entity Recognition. That is mentioned below in the Wh-question section of this post.

Where did you get this data?

Another important set of questions is what data are we analyzing and how did we gather this dataset?

Ultimately, a great number of computational linguists or NLP practitioners are interested in compiling human rights centered corpora to create tools that analyze newsflow quickly on these points. When dealing with sensitive topics, the data has to center on those related topics for at-need populations. This specific dataset centers around ‘Women/Women’s Issues’ as specified by the group at SugarBearAI.

As for ‘where did they obtain this data?, that question is answered as follows: this dataset of six hundred articles existing in the LaCartita Db – which contains several thousand hand-tagged articles – is annotated by hand. The annotators are a group of Mexican graduates from UNAM and IPN universities. A uniform consensus amongst the news taggers was required for its introduction into the set of documents. There were 3 women and 1 man within the group of analysts, with all of them having prior experience gathering data in this domain.

While the six hundred Mexican Spanish news headlines analyzed are unavailable on github, a smaller set is provided on that platform for educational purposes. Under all conditions, the data was tokenized and normalized with a fairly sophisticated set of Spanish centric regular expressions.

Please feel free to reach out to that group in research@lacartita.com for more information on this hand-tagged dataset.

Wh-Questions That Guide News Judgments

With all of the context on data and motivations in mind, we review some points on news judgment that can help with the selection of texts for analysis and guide the interpretation of automatically extracted data points.

Basic news judgment is often informed by the following Wh-questions:

1.) What occurred in this news event? (Topic Classification; Event Extraction)

2.) Who was involved in the news?

3.) When did this news event take place?

4.) Where did it take place?

5.) Why did this take place?

If you think of these questions at a fairly high level of abstraction, then you’ll allow me to posit that the first two questions are often the domain of Topic Classification and Named Entity Recognition, respectively. This post will deal with the latter, but assume that this issue of extracting named entities deals with documents already organized on the basis of some unifying topic. This is why it’s useful to even engage in the activity.

In other words, you – the user of this library – will be open to providing a collection of documents already organized under some concept/topic. You would be relying on your knowledge of that topic to make sense of any frequency analysis of named entities, important terms (TF-IDF) etc. as is typical when handling large amounts of unstructured news text. These concepts – NER and TFIDF – are commonly referenced in Computational Linguistics and Information Retrieval; they overlap in applied settings frequently. For instance, TF-IDF and NER pipelines power software applications that deal in summarizing complex news events in real time. So, it’s important to know that there are all sorts of open source libraries that handle these tasks for any average user or researcher.

Leveraging Spacy’s Lightweight Spanish Language Model

The actual hard work involves identifying distinct entities; the task of identifying Named Entities involves statistical processes that try to generalize what a typical Named Entity’s morphological shape in text involves.

In this example, my particular script is powered by the smaller language model from Spacy. One thing that should be noted is that the text has its origins in Wikipedia. This means that newer contemporary types of text may not be sufficiently well covered – breadth doesn’t imply depth in analysis. Anecdotally, over this fairly small headline-only corpus sourced by hand with UNAM and IPN students that contains text on the Mexican president, Andres Manuel Lopez Obrador, Covid and local crime stories, we see performance below 80 percent accuracy from the small Spanish language model. Here’s the small 600 headline strong sample: example headlines referenced.

NER_News Module

Using the below scripts, you can extract persons and organizations. Using spacy, you can extract the entities extracted from a corpus. We will use the lighter Spanish language model from Spacy’s natural language toolkit. This post assumes that you’ve dealt with the basics of Spacy installation alongside its required models. If not, visit here. Therefore, we should expect the below lines to run without problem:

import spacy
import spacy.attrs
nlp = spacy.load('es_core_news_sm')

In this example, we use clean texts that are ‘\n’ (“new line separated”) separated texts. We count and identify the entities writing to either memory or printing to console the results of the NER process. The following line contains the first bit of code referencing Spacy and individuates each relevant piece of the text. Suppose these were encyclopedic or news articles, then the split would probably capture paragraph or sentence level breaks in the text:

raw_corpus = open('corpora/titularesdemx.txt','r', encoding='utf-8').read().split("\n")[1:]

The next step involves placing NER text with its frequency count as a value in a dictionary. This dictionary will be result of running our ‘sacalasentidades’ method over the raw corpus. The method extracts GEO-political entities, like a country, or PER-tagged entities, like a world leader.

import spacy
import spacy.attrs
nlp = spacy.load('es_core_news_sm')

import org_per
raw_corpus = open('corpora/titularesdemx.txt','r', encoding='utf-8').read().split("\n")[1:]
entities = org_per.sacalasentidades(raw_corpus)
 
# use list of entities that are ORG or GEO and count up each invidividual token.     

tokensdictionary = org_per.map_entities(entities) 

The object tokensdictionary formatted output will look like this:

{'AMLO': 11,
 'Desempleo': 1,
 'Perú': 1,
 'América Latina': 3,
 'Banessa Gómez': 1,
 'Resistir': 2,
 'Hacienda': 1,
 'Denuncian': 7,
 'Madero': 1,
 'Subastarán': 1,
 'Sánchez Cordero': 4,
 'Codhem': 1,
 'Temen': 2,
 'Redes de Derechos Humanos': 1,
 'Gobernación': 1,
 'Sufren': 1,
 '¡Ni': 1,
 'Exigen': 2,
 'Defensoras': 1,
 'Medicina': 1,
 'Género': 1,
 'Gabriela Rodríguez': 1,
 'Beatriz Gasca Acevedo': 1,
 'Diego "N': 1,
 'Jessica González': 1,
 'Sheinbaum': 3,
 'Esfuerzo': 1,
 'Incendian Cecyt': 1,
 'Secretaria de Morelos': 1,
 'Astudillo': 1,
 'Llaman': 3,
 'Refuerzan': 1,
 'Mujer Rural': 1,
 'Inician': 1,
 'Violaciones': 1,
 'Llama Olga Sánchez Cordero': 1,
 'Fuentes': 1,
 'Refuerza Michoacán': 1,
 'Marchan': 4,
 'Ayelin Gutiérrez': 1,
 'Maternidades': 1,
 'Coloca FIRA': 1,
 'Coloquio Internacional': 1,
 'Ley Olimpia': 3,
 'Toallas': 1,
 'Exhorta Unicef': 1,
 'Condena CNDH': 1,
 'Policías de Cancún': 1,
 'Exposición': 1,
 'Nadia López': 1,
 'Aprueba la Cámara': 1,
 'Patriarcales': 1,
 'Sofía': 1,
 'Crean Defensoría Pública para Mujeres': 1,
 'Friedrich Katz': 1,
 'Historiadora': 1,
 'Soledad Jarquín Edgar': 1,
 'Insuficientes': 1,
 'Wikiclaves Violetas': 1,
 'Líder': 1,
 'Alcaldía Miguel Hidalgo': 1,
 'Ventana de Primer Contacto': 1,
 'Parteras': 1,
 'App': 1,
 'Consorcio Oaxaca': 2,
 'Comité': 1,
 'Verónica García de León': 1,
 'Discapacidad': 1,
 'Cuánto': 1,
 'Conasami': 1,
 'Amnistía': 1,
 'Policía de Género': 1,
 'Parteras de Chiapas': 1,
 'Obligan': 1,
 'Suspenden': 1,
 'Contexto': 1,
 'Clemencia Herrera': 1,
 'Fortalecerán': 1,
 'Reabrirá Fiscalía de Chihuahua': 1,
 'Corral': 1,
 'Refugio': 1,
 'Alicia De los Ríos': 1,
 'Evangelina Corona Cadena': 1,
 'Félix Salgado Macedonio': 5,
 'Gabriela Coutiño': 1,
 'Aída Mulato': 1,
 'Leydy Pech': 1,
 'Claman': 1,
 'Insiste Morena': 1,
 'Mariana': 2,
 'Marilyn Manson': 2,
 'Deberá Inmujeres': 1,
 'Marcos Zapotitla Becerro': 1,
 'Vázquez Mota': 1,
 'Dona Airbnb': 1,
 'Sergio Quezada Mendoza': 1,
 'Incluyan': 1,
 'Feminicidios': 1,
 'Contundente': 1,
 'Teófila': 1,
 'Félix Salgado': 1,
 'Policía de Xoxocotlán': 1,
 'Malú Micher': 1,
 'Andrés Roemer': 1,
 'Basilia Castañeda': 1,
 'Salgado Macedonio': 1,
 'Menstruación Digna': 1,
 'Detenidas': 1,
 'Sor Juana Inés de la Cruz': 1,
 'María Marcela Lagarde': 1,
 'Crean': 1,
 'Será Rita Plancarte': 1,
 'Valparaiso': 1,
 'México': 1,
 'Plataformas': 1,
 'Policías': 1,
 'Karen': 1,
 'Karla': 1,
 'Condena ONU Mujeres': 1,
 'Llaman México': 1,
 'Sara Lovera': 1,
 'Artemisa Montes': 1,
 'Victoria': 2,
 'Andrea': 1,
 'Irene Hernández': 1,
 'Amnistía Internacional': 1,
 'Ley de Amnistía': 1,
 'Nació Suriana': 1,
 'Rechaza Ss': 1,
 'Refugios': 1,
 'Niñas': 1,
 'Fiscalía': 1,
 'Alejandra Mora Mora': 1,
 'Claudia Uruchurtu': 1,
 'Encubren': 1,
 'Continúa': 1,
 'Dulce María Sauri Riancho': 1,
 'Aprueba Observatorio de Participación Política de la Mujer': 1,
 'Plantean': 1,
 'Graciela Casas': 1,
 'Carlos Morán': 1,
 'Secretaría de Comunicaciones': 1,
 'Diego Helguera': 1,
 'Hidalgo': 1,
 'LGBT+': 1,
 'Osorio Chong': 1,
 'Carla Humphrey Jordán': 1,
 'Lorenzo Córdova': 1,
 'Edomex': 1,
 'CEPAL': 1,
 'Delitos': 1,
 'Murat': 1,
 'Avanza México': 1,
 'Miguel Ángel Mancera Espinosa': 1,
 'Reconoce INMUJERES': 1,
 'Excluyen': 1,
 'Alejandro Murat': 1,
 'Gómez Cazarín': 1,
 'Prevenir': 1,
 'Softbol MX': 1,
 'Martha Sánchez Néstor': 1}

Erros in Spacy Model

One of the interesting errors in the SPACY powered NER process is the erroneous tagging of ‘ Plantean’ as a named entity when, in fact, this string is a verb. Similarly, ‘Delitos’ and ‘Excluyen’ are tagged as ORG or PER tags. Possibly, the morphological shape, orthographic tendency of headlines throws off the small language model. Thus, even with this small test sample, we can see the limits of out-of-the-box open source solutions for NLP tasks. This shows the value added of language analysts, data scientists in organizations dealing with even more specific or specialized texts.

Handling Large Number of Entries On Matplotlib

One issue is that there will be more Named Entities recognized than is useful or even possible to graph.

Despite the fact that we have a valuable dictionary above, we still need to go further and trim down the dictionary in order to figure out what is truly important. In this case, the next Python snippet is helpful in cutting out all dictionary values that contain a frequency count of only ‘1’. There are occasions in which a minimum value must be set.

For instance, suppose you have 1000 documents with 1000 headlines. Your NER analyzer must read through these headlines which ultimately are not a lot of text. Therefore, the minimum count you would like to eliminate is likely to be ‘1’ while if you were analyzing the entirety of the document body, then you may want to raise the minimum threshold for a dictionary value’s frequency.

The following dictionary comprehension places a for-loop type structure that filters out on the basis of the term frequency being anything but ‘1’, the most common frequency. This is appropriate for headlines.

    filter_ones = {term:frequency for term, frequency in data.items() if frequency > 1}

While this dictionary filtering process is better for headlines, a higher filter is needed for body text. 10,000 words or more potentially words implies that the threshold for the minimum frequency value is higher than 10.

    filter_ones = {term:frequency for term, frequency in data.items() if frequency > 10}

The resulting dictionary now presented as a matplotlib figure is shown:

 

def plot_terms_body(topic, data):
    """
    The no.aranage attribution is to understand how to best and programmatically 
    plot the data. Intervals are determined by the counts within the dictionary. 
    
    Args: Topic is the name of the plot/category. The 'data' is  a list of ter 
    
     """
    #The bar plot should be optimized for the max and min size of
    #individual 
    filter_ones = {term:frequency for term, frequency in data.items() if frequency > 10}  
    filtered = {term:frequency for term, frequency in data.items() if frequency > round(sum(filter_ones.values())/len(filter_ones))  }   
    print(round(sum(filtered.values())/len(filtered)), "Average count as result of total terms minus once identified terms divided by all terms.")
    terms = filtered.keys()
    frequency = filtered.values()   
    y_pos = np.arange(len(terms),step=1)
    # min dictionary value, max filtered value ; 
    x_pos = np.arange(min(filtered.values()), max(filtered.values()), step=round(sum(filtered.values())/len(filtered)))
    plt.barh(y_pos, frequency, align='center', alpha=1)
    plt.yticks(y_pos, terms, fontsize=12)
    plt.xticks(x_pos)
    plt.xlabel('Frecuencia en encabezados')
    plt.title(str(topic), fontsize=14)
    plt.tight_layout()
    plt.show()
Named Entities or Frequent Terms

We are able to extract the most common GER or PER tagged Named Entities in a ‘Women’ tagged set of documents sourced from Mexican Spanish news text.

Surprise, surprise, the terms ‘Exigen‘, ‘Llaman‘, ‘Marchan‘ cause problems due to their morphological and textual shape; the term ‘Victoria‘ is orthographically identical and homophonous to a proper names, but in this case, it is not a Named Entities. These false positives in the NER process from Spacy just reflects how language models should be trained over specific texts for better performance. Perhaps, an NER model trained over headlines would fare better. The data was already cleaned due to a collection process detailed below so normalization and tokenization were handled beforehand.

Guide for Raspberry Pi Assembly

Guide on installation of Raspberry Pi and assembly of components.

Parts

In this guide, we will show a step by step assembly and installation of the Operating System for a Raspberry Pi 4 Model B (8gb Ram).

For this particular installation, we will need a Vilros kit which consists of the following pieces:

• One black aluminum case with a large cooling fan to cool down the unit.
• Power Supply of 5v-3A (USB-C connection).
• A micro HDMI adaptor for normal HDMI screens.
• One Raspberry Pi 4 Model B (8GB ram).

We also must have a microSD card as the Raspberry units disk space. We will launch our OS from said microSD card. The Vilros kit does not include one.

Assembly

The first thing we must do is verify that the Raspberry Pi be in optimal conditions, without damage of any kind due to the shipment process or manufacturing issues that have recently become relevant. Verify that there is no SdCard of any kind, but as mentioned, the Vilros kit does not contain an SD card.

Raspberry Pi 8 model 8GB

Vilros Assembly Kit

At this point, we begin to assemble the Raspberry Pi CPU/Controller into the Vilros aluminum case.

Vilros Kit + Raspberry Pi

The bottom of the aluminum case well mounted.

Once the case is securely attached to the Raspberry Pi, we need to also make sure that the heat dissipaters pictured below are ready.

Heat Dissippaters placed on top of Raspberry Pi Processors.

Each processor will have a dissipater glued to a chip surface. This permits for performance enhancements due to cooler operating conditions.

Ventilation

Now we must connect the cooling fan to the upper part of the case. The kit comes with an image that explains which of the metal pins to connect your fan to along with where to place a cable to ground the electricity (presumably).

Manual specifying the right way to mount the fan.

You’ll need to connect pin #1 to slot number 14.

The pin number assignment is literally at the bottom of the pins over the green surface of the board.

Connect the fan to #14 of the Raspberry Pi pins.

The picture above shows how to assemble the fan and connect it to one of the exposed pins on the Raspberry Pi. Make sure that these cables never intertwine with other unused pins or get somehow crushed in your assembly process.

This board can now be closed up with the case itself.

Assembled case final point.

The kit includes screws and screwdriver. You will also have to screw on bolts to hold the case firm:

Screws go in red zone.

There are some optional stoppers you can place on top of the entry point for these screws. In this bottom image, youll see only the screwed on bolts since because I’ll be opening this case constantly, there will be a needless amount of work for me to access components.

Insert the SD card where our Operating System will be loaded.

This is the Raspberry Pi 4 Model power supply.