Category: Tokenization

  • Word2Vec Mexican Spanish Model: Lyrics, News Documents

    A Corpus That Contains Colloquial Lyrics & News Documents for Mexican Spanish. This experimental dataset was developed by four social science specialists and one industry expert (myself), using samples drawn from Mexico-specific news texts and normalized song lyrics. The intent is to understand how small, phrase-level constituents will interact with larger, editorialized style […] (a hedged Word2Vec training sketch appears after this list).

    Read more...

  • Tokenizing Text, Finding Word Frequencies Within Corpora

    One way to think about tokenization is as finding the smallest possible unit of analysis for computational linguistics tasks. As such, tokenization is among the first steps (along with normalization) in the average NLP pipeline or computational linguistics analysis. This process helps break text down in a manner […] (a short tokenization and word-frequency sketch appears after this list).

    Read more...
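
For readers curious how a model like the one described in the first post can be trained, below is a minimal sketch using gensim's Word2Vec on a Spanish-language corpus. The file name mx_spanish_corpus.txt, the preprocessing, and the hyperparameters are illustrative assumptions, not the settings behind the published model.

    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess

    # Assumed input: one normalized document (song lyric or news article) per line.
    with open("mx_spanish_corpus.txt", encoding="utf-8") as f:
        sentences = [simple_preprocess(line) for line in f if line.strip()]

    # Hyperparameters are illustrative defaults, not the published model's settings.
    model = Word2Vec(
        sentences=sentences,
        vector_size=100,   # embedding dimensionality
        window=5,          # context window around each target word
        min_count=2,       # drop tokens seen fewer than 2 times
        sg=1,              # skip-gram; sg=0 selects CBOW
        workers=4,
    )

    model.save("mx_spanish_w2v.model")
    # Nearest neighbors for a token assumed to appear in the corpus.
    print(model.wv.most_similar("ciudad", topn=5))

Skip-gram is often reported to handle smaller corpora and rarer words somewhat better than CBOW, which is why it is chosen in this sketch; either setting works with the same call.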
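
The second post's topic, tokenization and word frequencies, can be illustrated with a short self-contained sketch. The regex tokenizer and the sample sentence are assumptions for illustration; the post itself may use a different tokenizer or corpus.

    import re
    from collections import Counter

    def tokenize(text):
        """Lowercase the text and pull out word-character runs; a deliberately simple tokenizer."""
        return re.findall(r"\w+", text.lower())

    # Illustrative sample; in practice this would be read from a corpus file.
    text = "El corpus contiene letras de canciones y notas de prensa. El corpus es experimental."

    tokens = tokenize(text)      # smallest units of analysis for the pipeline
    freqs = Counter(tokens)      # token -> raw frequency within the text
    print(freqs.most_common(5))  # e.g. [('el', 2), ('corpus', 2), ...]

Counting with collections.Counter gives raw frequencies; normalization steps such as stop-word removal or stemming would typically follow this point in the pipeline.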