Tokenizing Text, Finding Word Frequencies Within Corpora

One way to think about tokenization is as finding the smallest useful unit of analysis for a computational linguistics task. As such, tokenization is among the first steps (along with normalization) in the average NLP pipeline or computational linguistics analysis. This process breaks text down into units that both the computer and the analyst can interpret.

NLTK Is Very Comprehensive

NLTK is likely the best place to start for both understanding and customizing NLP pipelines. Please review their documentation on tokenization here: NLTK – Tokenization Example. While I recommend reviewing NLTK, you should also keep up with engineers, who mostly use TensorFlow. Yes, you must learn two packages at once if you are a linguist in the IT industry.

Learn TensorFlow

“Learn TensorFlow. Given the effort you will put into learning how to combine computing with linguistics, you also, in a strange way, lighten the load by proceeding in parallel with industry trends. A popular topic will likely have ample documentation. Consider that most engineers will have a frame of reference for tokenization that is not necessarily grounded in linguistics, but is instead based on industry-centric examples intended to prepare data for machine learning.”

Industry Perceptions

Thus, if you do not understand tokenization both as engineers perceive it and as linguists work with it, you will likely be perceived as not knowing how to program and, worse, as not knowing your own subject matter as a linguist. While this is obviously not true, perception matters, so you must make the effort to reach engineers at their level when collaborating.

# -*- coding: utf-8 -*-
"""
Created on Fri Sep 10 23:53:10 2021

@author: Ricardo Lezama
"""
import tensorflow as tf 

text = """ 
    A list comprehension is a syntactic construct available in
    some programming languages for creating a list based on existing lists. 
    It follows the form of the mathematical set-builder notation (set comprehension) as
    distinct from the use of map and filter functions.
    """
 
content = tf.keras.preprocessing.text.text_to_word_sequence(text)

Obviously, we do not want to repeat the one-liner above over and over again within our individual Python scripts. Thus, we neatly repackage it as the main line of a function titled ‘tokenize_lacartita’ as follows:



def tokenize_lacartita(text):
    """ Tokenize a text string with the Keras text_to_word_sequence helper.
    
    Arg: This should be a text string that will be converted into a list with individual tokens.
    
    Returns: A list of lowercased tokens with punctuation removed.
          ex.) ['this', 'is', 'text', 'tokenized', 'by', 'tensorflow']
    """
    # Relies on the `import tensorflow as tf` line from the script above.
    keras_tok = tf.keras.preprocessing.text.text_to_word_sequence(text)
    return keras_tok

The output we receive from this tokenization function is shown below. As you can see, it is a list of individual strings, lowercased and without punctuation, since punctuation is removed by default during tokenization.

['morelia',
 'apoyar',
 'no',
 'es',
 'delinquir',
 'señalan',
 'grupos',
 'feministas',
 'a',
 'sheinbaum',
 'capacitan',
 'a',
 'personal',
 'de',
 'la',
 'fiscalía',
 'cdmx']
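
If you want to see roughly what is happening under the hood, the default behavior described above can be approximated with nothing but the standard library. This is only a sketch under my own assumptions about the defaults (lowercase everything, replace a fixed set of punctuation characters with spaces, split on whitespace). The name `simple_word_sequence` and the `FILTERS` constant are hypothetical and not part of TensorFlow or any other library.

```python
# Rough stand-in for the default behavior of text_to_word_sequence.
# FILTERS is the punctuation set we ASSUME is stripped by default.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=FILTERS):
    """Lowercase the text, blank out filtered characters, split on whitespace."""
    lowered = text.lower()
    # Map every filtered character to a space so neighboring words stay separate.
    table = str.maketrans({char: " " for char in filters})
    cleaned = lowered.translate(table)
    # split() with no argument collapses runs of whitespace and drops empty strings.
    return cleaned.split()


headline = "Apoyar no es delinquir, señalan grupos feministas a Sheinbaum."
print(simple_word_sequence(headline))
# ['apoyar', 'no', 'es', 'delinquir', 'señalan', 'grupos', 'feministas', 'a', 'sheinbaum']
```

Note that accented characters such as ‘ñ’ survive untouched, because only the characters listed in `FILTERS` are replaced.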

Word Frequency and Relative Percentage

We can create a function to find word frequencies. Granted, the Counter class in Python’s collections module can do this already, but, for educational purposes, we include a function to track a word’s frequency within a list. The if-condition below adds one to the count whenever we see our target word within the word list. In this case, we examine a series of headlines related to Mexico that were gathered and classified by hand by Mexican university students.

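def word_frequency(target_word, word_list):
    """
    Function that counts word frequencies.

    Arg: target word to count, followed by the word list to search.
    
    Return: the target word and its count as a string, e.g. 'amlo: 12'.
    """
    count = 0
    # Compare every token in the list against the target word.
    for word in word_list:
        if word == target_word:
            count += 1
    return f"{target_word}: {count}"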

The word_frequency function receives “AMLO” or its normalized version, ‘amlo’, as the first argument and the word list as the second argument. The frequency of the string is listed next to the term when it is returned. You can, of course, add more elaborate details to the body of the function.

word_frequency("amlo", saca)
Out[164]: 'amlo: 12'
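
For comparison, here is a minimal sketch of the same idea using Counter from the standard library, plus a relative percentage (the word's count divided by the total number of tokens). The helper name `relative_percentage` is hypothetical, and the token list is simply the sample headline output shown earlier, not the full headline corpus.

```python
from collections import Counter

# Tokens copied from the sample tokenized headlines shown earlier in this post.
tokens = ['morelia', 'apoyar', 'no', 'es', 'delinquir', 'señalan', 'grupos',
          'feministas', 'a', 'sheinbaum', 'capacitan', 'a', 'personal',
          'de', 'la', 'fiscalía', 'cdmx']


def relative_percentage(target_word, word_list):
    """Return the percentage of tokens in word_list equal to target_word."""
    if not word_list:
        return 0.0
    counts = Counter(word_list)
    # A Counter returns 0 for words it has never seen.
    return 100 * counts[target_word] / len(word_list)


counts = Counter(tokens)
print(counts["a"])                                  # 2
print(counts.most_common(1))                        # [('a', 2)]
print(round(relative_percentage("a", tokens), 2))   # 11.76
```

The relative percentage makes counts comparable across word lists of different sizes, which a raw count alone cannot do.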

Tokenization In Native Python

At times, an individual contributor must know how to tokenize without writing an elaborate custom function or invoking a complex module with heavy libraries. Linguists sometimes work within siloed environments, where you may not have the privileges needed to install libraries, like TensorFlow, in a generic Linux environment. In these cases, use native Python: the term refers to built-in functions or modules that require no additional installation beyond an up-to-date version of Python.

In fact, you may need to rely more on the raw text you are attempting to tokenize. At times, there are different orthographic marks that are relevant and necessary to find. For example, to split on a space, “ ”, or a period, “.”, you can call the string’s split method.

def word(text): 
    return text.split(" ")
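
The same method accepts any separator string. As a minimal sketch, the hypothetical helper `split_sentences` below splits on a period to get rough sentence-like chunks. Text with abbreviations or decimal numbers would need something more careful than this.

```python
def split_sentences(text):
    """Split on periods and drop empty or whitespace-only pieces."""
    pieces = text.split(".")
    # The final period leaves an empty string behind, so filter it out.
    return [piece.strip() for piece in pieces if piece.strip()]


sample = "Apoyar no es delinquir. Capacitan a personal de la fiscalía."
print(split_sentences(sample))
# ['Apoyar no es delinquir', 'Capacitan a personal de la fiscalía']
```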

All strings have a split method that you can call without importing anything. Furthermore, you can call a method named ‘strip’ to remove leading and trailing whitespace. Please see the example below.

def sentence(text): 
    text_strip = text.strip()
    return text_strip.split(" ")
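
One detail worth knowing: splitting on a literal space behaves differently from calling split with no argument. The short sketch below uses a made-up string (`raw` is only an illustration) to show the difference.

```python
raw = "  apoyar  no es\tdelinquir \n"

# Splitting on a literal space keeps an empty string wherever spaces repeat,
# and leaves tabs attached to the neighboring tokens.
print(raw.strip().split(" "))
# ['apoyar', '', 'no', 'es\tdelinquir']

# split() with no argument splits on any run of whitespace,
# so no strip() call is needed beforehand.
print(raw.split())
# ['apoyar', 'no', 'es', 'delinquir']
```

If your raw text contains tabs, newlines, or doubled spaces, the no-argument form is usually the safer default.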