N-Gram Analysis Over SugarBearAI Corpus

I recently did some analysis over the Sugar Bear AI violence corpus, a collection of documents classified by analysts at the SugarBear AI group. The group has been manually classifying thousands of documents over the past year. Since the start of the pandemic, they’ve made some of their material available online.

There is a growing consensus that solid data is needed for more precise or specific topics. For instance, we cannot rely solely on web-scraped data from Reddit.

To put it simply, N-Grams are sequences of strings that appear with some frequency in a given text. The strings are called ‘tokens’; the specific guidelines for how strings are defined as tokens depend on the developers of a corpus, who may have specific applications in mind when building it. In the case of the Sugar Bear AI corpus, the most relevant application is the development of a news classifier for sensitive topics in Mexico.
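As a minimal illustration (plain Python here, rather than the NLTK pipeline used below), bigrams are just adjacent token pairs:

```python
def bigrams(tokens):
    # pair each token with its immediate successor
    return list(zip(tokens, tokens[1:]))

# a toy Spanish phrase, whitespace-tokenized for simplicity
tokens = "la violencia en la región".split()
print(bigrams(tokens))
# [('la', 'violencia'), ('violencia', 'en'), ('en', 'la'), ('la', 'región')]
```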


We used NLTK to sample what a basic N-Gram analysis would yield. As is typical of news text, this corpus contained lots of dates and non-alphanumeric content, but no emojis or content associated with social media. The analysis from the NLTK ngrams module is sufficient for our purposes, and some modest use of the tokenize module and regular expressions did the trick of ‘normalizing’ the text.

The final point I should make about the corpus is that it is fairly small: about 400 documents from 8 different news sources in Mexico. The careful approach to its curation and collection, however, ensures any researcher working with the corpus will have a balanced and not commonly analyzed collection of texts on sensitive topics.

import re
import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams

def find_bigrams(texts):
    results = []
    stop_words = set(stopwords.words('spanish'))
    for sentence in texts:
        tokens = nltk.word_tokenize(sentence)
        filtered_sentence = [w for w in tokens if w not in stop_words]
        for i, j in ngrams(filtered_sentence, 2):
            # keep only bigrams whose tokens do not start with a non-letter
            if not re.match("[^a-zA-Z]", i) and not re.match("[^a-zA-Z]", j):
                results.append("{0} {1}".format(i, j))
    return results

# corpus: the list of document strings loaded from the collection
alpha = find_bigrams(corpus)
frente múltiples
visité cerca
absoluto atrocidades
consejeras rendirán
ser derechos
Vazquez hijo
detenga amenazas
puntos pliego
odisea huyen
anunció separación
Covid-19 debe
víctima frente
Acapulco sentenciar
logró enviar
garantías acceso
documentaron atención
Atendiendo publicación
llamado Estados
interior Estaciones
encuentran mayoría
informativas pandemia
relación familiar
comunidad Agua
día finalizar
Jornada Ecatepec
lenta momento
guerras territoriales
Pese Facebook
pedido retiro
Rendón cuenta
A.C. Mujeres
error quizá
iniciativas enfocadas
consciente anticapitalista
afectar mujer
Justicia u
alertas tempranas
mediante expresión
Nahuatzen comunidad
garantizar repetición
alza indicadores
noche martes
creó Guardia
asegura feminicidio
Unión Fuerza
pronunciamiento Red
carbono equivalente
condiciones desarrollo
comparativo agresiones
recorte refugios
agregó pesar
Example bigrams from the violence corpus.

There were some interesting insights on who typically writes about violent events in Mexico. Most large institutions, like El Financiero or even La Jornada, are capable of finding

Sugar Ray Leonard: ‘Canelo Has My Vote As P4P’

Sugar Ray Leonard – Former 5 Weight Division World Champion

On the Ak & Barak show, Sugar Ray Leonard labeled Canelo the best pound-for-pound fighter in the world. Sugar Ray recalled his bouts against Roberto Duran and Marvin ‘Marvelous’ Hagler, and empathized with Hagler’s inability to tolerate his loss against him. He also praised the abilities of the main names within the lightweight division.

Canelo’s entrance into his bout with Callum Smith in San Antonio, Texas

After a dominating performance against Callum Smith, Canelo Alvarez has continued to garner nearly unanimous praise for his latest performances in multiple weight classes. Canelo’s distinctive approach, refined over a career of more than 15 years, is to methodically time and counter his opponents while maintaining elusive head movement, but without fleet-footed ‘running’.

Elderly Asian Woman Robbed, Beaten In Oakland, California

71-year-old woman robbed near 6th/International, Oakland, CA


CBS Oakland reporter Betty Yu posted on her Instagram account disturbing footage of a group of males who robbed and beat an elderly Asian woman as she exited a bank in broad daylight.

The recent spate of attacks comes on the heels of disparate racialized events involving Asian Americans. In the past few weeks, Asians have been subject to verbal harassment due to Coronavirus disinformation.

In a parallel trend, there has also been a crime spree and increase of physical assaults on mostly older Asian people walking alone. The underlying animosity is evident, but the targets have routinely been victimized due to their perceived vulnerability. This recent Oakland robbery may be part of this trend.

Many commentators hold that independent of the race of the aggressors, bad actors are reacting to propaganda related to China fearmongering and disinformation surrounding the pandemic.

Southern California: Chicanos Express Outrage Over Attacks On Asian Elderly

Throughout the day, Chicano communities have expressed outrage over violence against Asians stemming from false narratives around the pandemic. In Southern California, Chicano sentiment expresses solidarity with the Asian community. This could be observed through social media postings from influential people within the community, like El Indio Botanas & Cervezas.

Due to the vast amounts of disinformation, Asian communities in Northern and Southern California have been targets of racist attacks. The same kind of despicable behavior has been observed on the East Coast.

Different Asian communities find themselves subject to attacks in the pandemic reality.

Coverage has been sufficient to continue raising the issue at the national level. Readers can find more information about hate crimes against Asians in the United States in this link.

Assault On Elderly Strikes Nerve

Much of the online outrage centered on the targeted attacks on the Asian Elderly in Oakland, California. While all acts of racism are despicable, the Chicano community is often most disturbed by acts against the elderly.

While initially the attacks were thought to be centered in Oakland’s Chinatown, the trend has spread to Japantown of San Jose where just today there was vandalism on J-Town monuments. Thankfully, Raul Peralez, city councilmember representing District 3 (San Jose), has vocalized support for the Asian community, making it more likely that other regional politicians will step up support too.

Who’s Getting Covid-19 Fatigue?

There are a few ways to interpret the data we have amassed under the web application in Chicano Press. Today, we compare the aggregate infection counts to prior months, hoping to infer who is, in some sense, improving relative to the month before. I’m sure there is a more sophisticated way to do this analysis.

For now, here’s who’s generating the most Covid-19 infections based on that logic:

County         | 1/10/2021 | 2/6/2021  | Rate of Change
Santa Barbara  | 21,323    | 29,755    | 0.395441542
San Diego      | 188,600   | 245,334   | 0.300816543
Los Angeles    | 920,177   | 1,143,422 | 0.242610932
San Bernardino | 224,350   | 277,430   | 0.236594607
My home county, Riverside, is also really high in new infections, despite the holidays being somewhat far in the rearview.
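The ‘Rate of Change’ column above is just the relative growth between the two cumulative counts; a minimal sketch:

```python
def rate_of_change(old, new):
    # relative growth between two cumulative case counts
    return (new - old) / old

# Santa Barbara figures from the table above
print(round(rate_of_change(21323, 29755), 4))  # 0.3954
```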

Using Spacy in Python To Extract Named Entities in Spanish

The spaCy small language model has some difficulty with contemporary news texts that are neither Eurocentric nor US-based. Likely, this lack of accuracy with contemporary figures owes in part to a less thorough scrape of Wikipedia, and to changes that have taken place since 2018 in Mexico, Bolivia, and other LATAM countries with highly variant dialects of Spanish. Regardless, that dataset can and does garner some results for the purposes of this exercise. This means that we can toy around a bit with some publicly available data.

Entity Hash For Spanish Text

In this informal exercise, we will try to hack our way through some Spanish text, making use of NER capabilities sourced from public data (no rule-based analysis) along with some functions I find useful for visualizing Named Entities in Spanish text. We have prepared a Spanish news text on the topic of ‘violence’, or violent crime, sourced from publicly available Spanish news content in Mexico.

Using spaCy, you can hash the entities extracted from a corpus. We will use the lighter Spanish language model from spaCy’s natural language toolkit. This language model is a statistical description of Wikipedia’s Spanish corpus, which is likely slanted towards White Hispanic speech, so beware its bias.

First, import the libraries:

import spacy
import spacy.attrs
nlp = spacy.load('es_core_news_sm')

With the libraries in place, we can import the module ‘org_per’. This module references this GitHub repo.

The work of identifying distinct entities is done in a function that filters for Organizations and People, tagged ‘ORG’ and ‘PER’ respectively in spaCy’s scheme.
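The org_per module is external, but a minimal sketch of that filter might look like the following (filter_entities and the sample (text, label) pairs are hypothetical stand-ins, not the module’s actual API):

```python
# Hypothetical sketch of an ORG/PER filter over spaCy entity output;
# `doc_ents` stands in for (ent.text, ent.label_) pairs.
def filter_entities(doc_ents, keep=("ORG", "PER")):
    return [text for text, label in doc_ents if label in keep]

ents = [("Lopez Obrador", "PER"), ("Ecatepec", "LOC"), ("Guardia Nacional", "ORG")]
print(filter_entities(ents))  # ['Lopez Obrador', 'Guardia Nacional']
```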

The variable ‘raw_corpus‘ is the argument you provide, which should be some Spanish text data. If you don’t have any, visit the repository and load that file object.

import org_per

raw_corpus = open('corpus_es_noticias_mx.txt', 'r', encoding='utf-8').read().split("\n")[1:]
entities = org_per.sacalasentidades(raw_corpus)

# use the list of entities that are ORG or PER and count up
# each individual token.
tokensdictionary = org_per.map_entities(entities)

As noted before, the training text has its origins in Wikipedia. This means that newer, more contemporary types of text may not be sufficiently well covered: breadth doesn’t imply depth, because statistical models rely on at least a passing resemblance to data they have already seen.

Anecdotally, over a small corpus, we see performance below 80 percent accuracy for this language model. Presumably, a larger sampling of Wikipedia ES data would perform better, but certain trends in contemporary news text temper that expectation.

The output returned from running `org_per.map_entities(entities)` will look like this:

{"Bill Clinton": 123,
 "Kenneth Starr": 12,
 ...}

The actual hashing is simple enough: each Named Entity string is placed in a dictionary with its frequency count as the value. Within your dictionary, you may get parses of Named Entities that are incorrect. That is to say, they are not properly delimited, because the Named Entity language model does not have an example of your parse. For instance, Lopez Obrador, the current president of Mexico, is not easily recognized as ‘PER’.
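A minimal sketch of that frequency hash, using collections.Counter (org_per.map_entities is the module’s actual implementation):

```python
from collections import Counter

# count each extracted entity string, mirroring the dictionary output above
entities = ["Bill Clinton", "Kenneth Starr", "Bill Clinton"]
tokensdictionary = dict(Counter(entities))
print(tokensdictionary)  # {'Bill Clinton': 2, 'Kenneth Starr': 1}
```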


Accuracy is measured very simply, by tabulating how many of the returned Named Entities you agree with. The difference between expected and returned values is your error rate. More on accuracy metrics next post.
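As a sketch of that tally (the function name and the judgment list are hypothetical):

```python
# share of returned entities a reviewer marked as correct
def accuracy(judgments):
    return sum(judgments) / len(judgments)

print(accuracy([True, True, True, False]))  # 0.75, i.e. a 25% error rate
```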

Los Angeles Pre-Covid In Pictures

Chicanos and Asian community intermingle at the food court in Downtown Los Angeles, California

Pre-Covid, the city of Los Angeles was such a pleasant place to wander about. Within a radius of a few blocks, you can go from a used (but superb) bookstore into a food court with delicious food from all over Asia, Mexico, and Central America.

Another favorite: Plaza Olvera In The Morning

En route to one of my workplace sites, we can always observe the LA area’s beautiful Plaza Olvera. I remember when they did not permit the use of a drone, even a very small one, to grab an overview pic. Well, now no one is around because of Covid-19 infections, so no worries, security staff :).

Human-AI Symbiosis: Says Who?

In Whose Image? All People Should Provide Input For New Modern AI

Human-AI Symbiosis? All people should decide how modern humanity unfolds. This web article considers software industry developments from under a decade ago that may mold the future: http://ricardolezama.com/ml/wearables-speech-recognition-how-intels-loss-could-be-tesla-gain/

Why Linguistics (And Linguists) Are Always On The Back-Foot In An Enterprise Context

Linguistics is often questioned by practitioners of the natural sciences in informal and professional scenarios. Perhaps this is because the phenomena relevant to them are more readily observable through better-defined technical methodologies. For instance, a casual observer (an individual who has casually become acquainted with some little morsel of data about a language) may define Linguistics as just ‘adding an ‘s’ to the end of a word to make it plural’.

Similarly, asking musicians to recite every note on a scale (which many professionally trained or classically inclined musicians enjoy doing) does not guarantee they can play well to a live audience. So, if the casual definition of Linguistics is gaining ground, the priorities of the field, in terms of rationale and problem solving, are too centered on archiving data that exists without an explanation (data without science, basically).

Trivia In The Office

Figuring out when some rule from Oxford applies in a romance novel is fine trivia, but it is not ‘Linguistics’. To begin with, the scientific practice around explaining language behavior is very broad and interdisciplinary. We should not permit the discipline to be reduced to literally describing inscriptions.

While data and rules about languages are important, memorizing data feels like a somewhat pointless exercise. This is a sign that the field in some corners is overly defined by linguistic trivia about English – or some other pet language – rather than in terms of reproducible general principles that can be easily computed.

Linguistics Takes Time

Currently, my fear is that as Linguistics gains strength in the enterprise context, the finer points around rationale will be overtaken by boring data recitals. If so, we are in for a world of trivia rather than developments, largely as a result of the influence of non-linguist priorities on the discipline. The drive to subvert computational tools for linguistic ends does not exist.

Narrative is very important: understanding why a linguistic analysis matters helps ground these activities. The ability to contribute to NLP through a proper narrative for linguistic activities in the enterprise setting has not surfaced beyond ‘we need better data for this hungry machine learning algorithm’. It pays the bills, but it does not advance the field.

Wearables, Speech Recognition & Musk: How Intel’s Loss Could Be Tesla Gain

Despite its famously late arrival to mobile computing, Intel made certain strides before many others in the space of wearables from mid-2013 onwards. Much of this may have to do with the company’s strategic diversification, which took place that year.

Hundreds of Millions Poured Into Research & Development

Intel invested at the very least 100 million dollars in capital expenditures and personnel for its now-defunct ‘New Devices Group’, an experimental branch of Intel charged with creating speech- and AI-enabled devices.

While many high-profile people were hired, developments took place, and acquisitions were made, investors were either not aware of or not too pleased with the slow roll to market of any of these expenditures.

These capital-intensive moves into different technology spaces were possibly a proactive measure to avoid missing the ‘next big thing’, as Intel had by not providing the chipset for the Apple iPhone. At the time, Brian Krzanich was newly appointed as Intel’s CEO to help the company transition from these failures, rightly or wrongly attributed to the prior CEO, Paul S. Otellini.

Why Did Intel Invest In Wearables?

Once Krzanich became CEO of Intel in May 2013, he quickly moved to diversify Intel’s capabilities into non-chip-related activities. Nonetheless, these efforts were still an attempt to amplify the relevance of the company’s chipsets through participation in the various places where computing would become more ubiquitous: home automation, wearables, and mobile devices with specialized, speech-enabled features. The logic was that these computing demands would naturally lead to an increased appetite for powerful chipsets.

This uncharacteristic foray into the realm of ‘cognitive computing’ led to several research groups, academics and smaller start-ups being organized under the banner of the ‘New Devices Group’ (NDG). Personally, I was employed in this organization and find that the expertise and technology from NDG may regain relevance in today’s business climate.

Elon Musk’s Tweet: Indicative Of New Trends?

Elon Musk’s tweet on wearables.

For instance, Elon Musk recently tweeted a request for engineers experienced in wearable technologies to apply for his Neuralink company. On the surface, this may mean only researchers who have worked on Brain Machine Interfaces, but as Neuralink and competitors bore down on some of the core concepts surrounding wearables, subject matter experts in other fields may be required as well.

Human/AI Symbiosis

When we consider what Musk is discussing, it would be fair to ask what constitutes ‘Human’.

Without too pedantic an overview, I would assume that linguistics has something to do with describing humanity; specifically, the uniqueness of the human mind.

As corporate curiosity becomes better able to package more varied and sophisticated chunks of the human experience, the experiences yielded primarily through text and speech are best described by Computational Linguistics, and are already fairly well understood from a consumer product perspective. It’s fair to say that finding the points of contact between neurons (literal ones, not the metaphors from Machine Learning) firing under some mental state and some UI is the appreciable high-level goal for any venture into ‘Human-AI’ symbiosis.

Thorough descriptions of illocutionary meaning, temporal chains of events, negation, and various linguistic cues in both text and speech could have consistent neural representations that are captured routinely in brain imaging studies. Unclear, however, is how these semantic properties of language would surface in electrodes meant for consumer applications.

Radical Thinkers Needed

The need to either link existing technology or expand available products so that they exploit these very intrusive wearables (a separate moral point to consider) likely calls for lots of people to be employed in this exploratory phase. Since it’s exploratory, the best individuals may not be the usual checklist-based academics or industry researchers found in these corners. If the Pfizer-BioNTech development is any indication, sometimes it’s the researchers who are not standard that are most innovative.