N-Gram Analysis Over SugarBearAI Corpus

I recently ran some analysis over the Sugar Bear AI violence corpus, a collection of documents classified by analysts at the Sugar Bear AI group. The group has manually classified thousands of documents over the past year, and since the start of the pandemic it has made some of that material available online.

There is a growing consensus that solid, curated data is needed for more precise or specific topics. For instance, we cannot rely solely on web-scraped data from Reddit.

Put simply, N-grams are contiguous sequences of N tokens in a text, and an N-gram analysis typically counts how often each sequence appears. The tokens are the strings into which the text is split; the guidelines for what counts as a token depend on the developers of a corpus, who may have specific applications in mind. In the case of Sugar Bear AI, the most relevant application is the development of a news classifier for sensitive topics in Mexico.
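
As a rough illustration of that definition, here is a minimal sketch in plain Python that slides a window of N tokens over a token list and counts the results. The function name `ngrams_from_tokens` and the sample sentence are made up for this example; they are not part of the corpus or of NLTK.

```python
from collections import Counter


def ngrams_from_tokens(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical, already-tokenized sentence (not taken from the corpus).
tokens = ["la", "guardia", "nacional", "llegó", "y",
          "la", "guardia", "nacional", "salió"]

# Count how often each bigram (n = 2) occurs.
bigram_counts = Counter(ngrams_from_tokens(tokens, 2))

for bigram, count in bigram_counts.most_common(2):
    print(" ".join(bigram), count)
```

A list shorter than N simply yields no N-grams, which is why very short documents contribute little to an analysis like this one.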

We used NLTK to sample what a basic N-gram analysis would yield. As is typical of news text, this corpus contained lots of dates and non-alphanumeric content, but no emojis or other content associated with social media. The analysis available from the NLTK ngrams function is sufficient for our purposes. However, some modest use of the NLTK tokenize module and regular expressions did the trick of ‘normalizing’ the text.
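
To give a feel for that kind of normalization, here is a small sketch that keeps only purely alphabetic tokens and drops dates, numbers and punctuation. It assumes the text is already tokenized, and the pattern `WORD_RE` and the helper `keep_word_tokens` are hypothetical names for this example. It is also deliberately stricter than the filter in the code further down, which only checks the first character of each token.

```python
import re

# Assumption: a "word" is a run of letters, including Spanish accented letters.
WORD_RE = re.compile(r"^[A-Za-zÁÉÍÓÚÜÑáéíóúüñ]+$")


def keep_word_tokens(tokens):
    """Drop tokens that are not purely alphabetic (dates, numbers, punctuation)."""
    return [tok for tok in tokens if WORD_RE.match(tok)]


# Hypothetical tokens of the kind found in news text.
tokens = ["El", "12/03/2020", ",", "anunció", "separación", "(", "2020", ")", "."]
clean = keep_word_tokens(tokens)
print(clean)
```

A strict filter like this would also discard mixed tokens such as Covid-19, so how aggressive to be depends on which terms matter for the task.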

The final point I should make about the corpus is that it is fairly small: about 400 documents from 8 different news sources in Mexico. The careful approach to its curation and collection, however, means that any researcher working with the corpus will have a balanced and not commonly analyzed collection of texts on sensitive topics.

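import re

import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams

# Requires NLTK's tokenizer and stop-word data to be downloaded (see nltk.download).

def find_bigrams(texts):
    """Return bigrams, as "w1 w2" strings, from an iterable of document strings."""
    # Build the stop-word set once rather than once per document.
    stop_words = set(stopwords.words('spanish'))
    results = []
    for text in texts:
        tokens = nltk.word_tokenize(text)
        # Matching is case-sensitive, so capitalized stop words such as "Pese" are kept.
        filtered = [w for w in tokens if w not in stop_words]
        for i, j in ngrams(filtered, 2):
            # Skip the pair if either token starts with a character outside a-z/A-Z.
            if not re.match("[^a-zA-Z]", i) and not re.match("[^a-zA-Z]", j):
                results.append("{0} {1}".format(i, j))
    return results

# documents is a placeholder name: the list of corpus texts, one string per document.
alpha = find_bigrams(documents)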
...
frente múltiples
visité cerca
absoluto atrocidades
consejeras rendirán
ser derechos
Vazquez hijo
detenga amenazas
puntos pliego
odisea huyen
anunció separación
Covid-19 debe
víctima frente
Acapulco sentenciar
logró enviar
garantías acceso
documentaron atención
Atendiendo publicación
llamado Estados
interior Estaciones
encuentran mayoría
informativas pandemia
relación familiar
comunidad Agua
día finalizar
Jornada Ecatepec
lenta momento
guerras territoriales
Pese Facebook
pedido retiro
Rendón cuenta
A.C. Mujeres
error quizá
iniciativas enfocadas
consciente anticapitalista
afectar mujer
Justicia u
alertas tempranas
mediante expresión
Nahuatzen comunidad
garantizar repetición
alza indicadores
noche martes
creó Guardia
asegura feminicidio
Unión Fuerza
pronunciamiento Red
carbono equivalente
condiciones desarrollo
comparativo agresiones
recorte refugios
agregó pesar
Examples of N-grams from the violence corpus.

There were some interesting insights on who typically writes about violent events in Mexico. Most large institutions, like El Financiero or even La Jornada, are capable of finding