Email Me: lezama at

Category: Computational Linguistics

  • Hallucination: the term for when artificial intelligence models get things wrong

    Although it is impressive that a chatbot responds to an input, academics, scientists, and experts in applied artificial intelligence have not defined their position on AI in psychological terms. Cognitive science was indeed the inspiration for the so-called ‘neural networks’ that define the architecture of some of […]


  • WebScraping As Sourcing Technique For NLP

    Introduction: In this post, we provide a series of web scraping examples and references for people looking to bootstrap text for a language model. The advantage is that a greater number of spoken-language domains can be covered. Newer vocabulary, or possibly very common slang, is picked up through this method since most corporate language […]


  • Word2Vec Mexican Spanish Model: Lyrics, News Documents

    A Corpus That Contains Colloquial Lyrics & News Documents For Mexican Spanish: This experimental dataset was developed by four social science specialists and one industry expert, myself, with different samples from Mexico-specific news texts and normalized song lyrics. The intent is to understand how small, phrase-level constituents will interact with larger, editorialized style […]


  • Tokenizing Text, Finding Word Frequencies Within Corpora

    One way to think about tokenization is to consider it as finding the smallest possible unit of analysis for computational linguistics tasks. As such, we can think of tokenization as among the first steps (along with normalization) in the average NLP pipeline or computational linguistics analysis. This process helps break down text into a manner […]


  • Frequency Counts For Named Entities Using Spacy/Python Over MX Spanish News Text

    In this post, we review some straightforward code written in Python that allows a user to process text and retrieve named entities alongside their numerical counts. The main dependencies are Spacy, a small, compact version of its Spanish language model built for Named Entity Recognition, and the plotting library Matplotlib, if you’re looking […]


  • N-Gram Analysis Over Sensitive Topics Corpus

    I was recently able to run some analysis on the Sugar Bear AI violence corpus, a collection of documents classified by analysts over at the SugarBear AI group. Over the past year, the group has been manually classifying thousands of Mexican Spanish news documents that deal with the new topics of today: “Coronavirus”, “WFH”, […]


  • Using Spacy in Python To Extract Named Entities in Spanish

    The Spacy small language model has some difficulty with contemporary news texts that are neither Eurocentric nor US-based. This lack of accuracy with contemporary figures likely owes in part to a less thorough scrape of Wikipedia and to relative changes that have taken place in Mexico, Bolivia and other countries with highly variant dialects […]