WebScraping As Sourcing Technique For NLP

Introduction

In this post, we provide a series of web scraping examples and reference for people looking to bootstrap text for a language model. The advantage is that a greater number of spoken speech domains could be covered. Newer vocabulary or possibly very common slang is picked up through this method since most corporate language managers do not often interact with this type of speech.

Most people would not consider Spanish necessarily under resourced. However, considering the word error rate in products like the Speech Recognition feature on a Hyundai, Mercedes Benz or Spanish text classification on social media platforms, which is skewed towards English centric content, there seems to certainly be a performance gap between contemporary #Spanish speech in the US and products developed for that demographic of speakers.

Lyrics are a great reference point for spoken #speech. This contrasts greatly with long form news articles, which are almost academic in tone. Read speech also carries a certain intonation, which does not reflect the short, abbreviated or ellipses patterning common to spoken speech. As such, knowing how to parse the letras.com pages may be a good idea for those refining and expanding language models with “real world speech”.

Overview:

  • Point to Letras.com
  • Retrieve Artist
  • Retrieve Artist Songs
  • Generate individual texts for songs until complete.
  • Repeat until all artists in artists file are retrieved.

The above steps are very abbreviated and even the description below perhaps too short. If you’re a beginner, feel free to reach out to lezama@lacartita.com. I’d rather deal with the beginner more directly; experienced python programmers should have no issue with the present documentation or modifying the basic script and idea to their liking.

Sourcing

In NLP, the number one issue will never be a lack of innovative techniques, community or documentation for commonly used libraries. The number one issue is and will continue to be a proper sourcing and development of training data.

Many practitioners have found that the lack of accurate, use case specific data are better than a generalized solution, like BERT or other large language models. These issues are most evident in languages, like Spanish, that do not have as high of a presence in the resources that compose BERT, like Wikipedia and Reddit.

Song Lyrics As Useful Test Case

At a high level, we created a list of relevant artists: Artists then looped through the list to search in lyrics.com whether they had any songs for them. Once we found that the request yielded a result, we looped through the individual songs for each artists.

Lyrics are a great reference point for spoken speech. This contrasts greatly with long form news articles, which are almost academic in tone. Read speech also carries a certain intonation, which does not reflect the short form, abbreviated or ellipsis that characterizes spoken speech. As such, knowing how to parse the https://letras.com resource may be a good idea for those refining and expanding language models with “real world speech”.

Requests, BS4

The proper acquisition of data can be accomplished with BeautifulSoup. The library has been around for over 10 years and it offers an easy way to process HTML or XML parse trees in python; you can think of BS as a way to acquire the useful content of an html page – everything bounded by tags. The requests library is also important as it is the way to reach out to a webpage and extract the entirety of the html page.

# -*- coding: utf-8 -*-
"""
Created on Sat Oct 16 22:36:11 2021
@author: RicardoLezama.com
"""
import requests
artist = requests.get("https://www.letras.com").text

The line `’requests.get(“https://letras.com”).text` does what the attribute ‘text’ implies; the call obtains the HTML files content and makes it available within the python program. Adding a function definition helps group this useful content together.

Functions For WebScraping

Creating a bs4 object is easy enough. Add the link reference as a first argument, then parse each one of these lyric pages on DIV. In this case, link=”letras.com” is the argument to pass along for the function. The function lyrics_url returns all the div tags with a particular class value. That is the text that contains the artists landing page, which itself can be parsed for available lyrics.



def lyrics_url(web_link):
    """
    This helps create a BS4 object. 
    
    Args: web_link containing references. 
    
    return: text with content. 
    """
    artist = requests.get(web_link).text
    check_soup = BeautifulSoup(artist, 'html.parser')
    return check_soup.find_all('div', class_='cnt-letra p402_premium')
    
letras.com the highlight portion is contained within <div> tag.

The image above shows the content within a potential argument for lyrics_url “https://www.letras.com/jose-jose/135222/”. See the github repository for more details.

Organizing Content

Drilling down to a specific artist requires basic knowledge of how Letras.com is set-up for organizing songs into a artists home page. The method artists_songs_url involves parsing through the entirety of a given artists song lists and drilling down further into the specific title.

In the main statement, we can call all these functions to loop through and iterate through the artists page and song functions to generate unique files, names for each song and its lyrics. The function generate_text will write into each individual one set of lyrics. Later, for Gensim, we can turn each lyrics file into a single coherent gensim list.



def artist_songs_url(web_link):
    """
    This helps land into the URL's of the songs for an artist.'
    
    Args: web link is the 
    
    Return songs from https://www.letras.com/gru-;/
    """
    artist = requests.get(web_link).text
    print("Status Code", requests.get(web_link).status_code)
    check_soup = BeautifulSoup(artist, 'html.parser') 
    songs = check_soup.find_all('li', class_='cnt-list-row -song')
    return songs
#@ div class="cnt-letra p402_premium

def generate_text(url):
    import uuid 
    songs = artist_songs_url(url)
    for a in songs:
        song_lyrics = lyrics_url(a['data-shareurl'])
        print (a['data-shareurl'])
        new_file = open(str(uuid.uuid1()) +'results.txt', 'w', encoding='utf-8')
        new_file.write(str(song_lyrics[0]))
        new_file.close()
        print (song_lyrics)
    return print ('we have completed the download for ', url )


def main():
    artistas = open('artistas', 'r', encoding='utf-8').read().splitlines()
    url = 'https://www.letras.com/'
    for a in artistas : 
        generate_text(url + a +"/")
        print ('done')
#once complete, run copy *results output.txt to consolidate lyrics into a single page. 


if __name__ == '__main__':
    sys.exit(main())  # 

New B.1.1.529 Coronavirus Variant Poised To Be Deadlier Than Delta

Markets, medical experts and governments are raising concerns over the latest Coronavirus variant.


As the public in the United States gathers in observance of Thanksgiving, South African experts and global governments are alarmed at the B.1.1.529 variant of the Coronavirus. Enough concern has been raised to shutdown air-traffic partially between UK/Europe and parts of Africa. The new variant was reported Wednesday, 8:11 pm PST and will eventually receive a Greek letter name, according to Bloomberg News.

Origin

First spotted in Botswana, the B.1.1.529 variant appears to have more of the protein spikes associated with more aggressive viruses, like the Delta variant. Roughly speaking, the spike protein allows for the virus to penetrate the cellular membrane of a healthy human cell. Afterwards, it inserts RNA into healthy human cells, which then causes issues in vulnerable lung tissues e.g Covid-19.

The B.1.1.529 variant is thought to have evolved from an untreated HIV/AIDS patient, according to Francois Balloux via Bloomberg News. Unfortunately, people who are immunocompromised can carry the coronavirus for longer periods of time, allowing for significantly different variants to emerge and eventually infect others.

This when paired with the fact other healthy individuals who are unvaccinated will contract this variant makes for a perfect storm of conditions to raise the prospect of a new wave of Covid-19 infections.

New Variant B.1.1.529 Raises Concerns Globally (Source: Guardian)

India Increases Testing, Israel Reacts

The country of India is now increasing testing for foreign travelers out of fear that this deadlier, more transmissible than the Delta variant will reach its already vulnerable populace.

Israel is also testing for the variant.

Affects Those Under 25

In places with fewer vaccinations, the populace under 25 is expected to see a spike of B.1.1.529 infections. For instance, South Africa has around 1/4 of its under 25 populace vaccinated and it is this population most affected with the variant in Gauteng. South African authorities have several confirmed cases, with laboratories expecting to confirm additional ones after sequencing is performed on new samples.

According to South African public health authorities:

“This variant is reported to have a significantly high number of mutations, and thus, has serious public health implications for the country, in view of recently relaxed visa restrictions and opening up of international travel.”

National Centre for Disease Control (NCDC) via OdhisaTv

Flights Between South Africa, UK Halted

The UK is now banning flights from six African countries. A strict quarantine will take place from these six countries: South Africa, Lesotho, Botswana – the original site for the variant – Mozambique, Namibia and Eswatini.

Markets React

Shares in Intercontinental are now down 6.7 percent in stock futures after a busy holiday season in the US, according to Dow Jones. If past behavior is an indicator, travelers will now do a double take on travel plans globally and in the domestic US as Christmas travel was gearing up as a windfall for airlines, hotels and oil companies/gas retailers.

Another Zero Day Exploit For Microsoft

Even Windows 11 is affected.

Apparently, one can open a command line window and deploy an exploit to raise permissions on a machine using a .exe file freely available on Github. Nice.

The exploit works on Windows 10, Windows 11 and Windows Server versions of this OS. The exploit consists of a low privileged user raising their own privileges by running basic commands on the CMD prompt. Fascinating.

Bleeping Computer Blog Finds Exploit

The exact issue is described by BleepingComputer yesterday in a much circulated blog post:

[BP] has tested the exploit and used it to open to command prompt with SYSTEM privileges from an account with only low-level ‘Standard’ privileges.

– Bleeping Computer

After Success Unifying Super Middleweight Division, Canelo Calculates Legacy With Cruiserweight Challenge

Saul “Canelo” Alvarez is now contemplating challenging a much bigger man who is the champion at the Cruiserweight weight division in the WBC.

The Canelo legacy keeps rising as the 31 year old Mexican enters his prime and relishes success compounded repeatedly after multiple successful defenses of titles. Most recently, the Mexican has unified the competitive 168lb pound division.

His fanbase is expanding globally, with English speakers placing support behind the ‘face of boxing’ amidst the usual controversies and biases that all combat sports tend to manifest.

The Mexican fanbase too looks on as they toil away at their jobs being the backbone of multiple regional and national economies. Every Canelo fight affirms some of the positive image that exists globally of the Mexican man. At least, that is how most sports watchers interpret the presence of Canelo in media depictions. To talk about this legend in development is to talk about the importance of boxing within the Mexican community. Thus, the moves he makes will define the sport for decades to come.

Early Details On Canelo’s Move To Cruiserweight

According to Michael Benson, Canelo is planning to weigh at 180lbs as he faces Ilunga Makabu, an opponent with a significant weight and height advantage, 200lbs and much taller.

Canelo vs Plant Is Finally Here

Canelo is now set to face his last and potentially most difficult fight for Super Middleweight supremacy: Caleb Plant.

The Super Middleweight unification bout is set to kick off tomorrow at around 6pm PT from Las Vegas, Nevada. It’s at 75 dollars, which is not terrible alongside a decent undercard. Already, Canelo is a four-division world champion but Plant is IBF champion. Whoever wins is the first undisputed super middleweight world champion in boxing history. The stakes can not be higher.

Canelo marks 168lbs vs Plant at 167lbs – fight night rehydration may add 10 pounds, but the muscle density is on Canelo’s side.

Weigh-In For #CaneloPlant.

At 168 pounds, Canelo looked bulky and ready to deliver powerful blows. He made weight spot-on, 168lbs is the Super Middleweight limit, even going as far as to still wear a heavy gold pendant at the scale. For his part, Caleb Plant weighed in at 167 pounds:

The current IBF title holder at 168 pounds looked muscular as well, but thinner and trim as he is over 6ft tall – a bit of a liability when fighting a compact, explosive opponent.Our best guess is that Caleb Plant’s 167lb frame is an indication that he will fight at distance – “run” as some detractors say – during the fight:

Regardless, this looks to be an historic night with one man ready to unify all the belts. Reportedly, Al Haymon and Eddy Reynoso have been planning or open to additional fights.

Resumes Heading In To Fight

Each fighter has a respectable resume, but the best belongs to the current Pound-for-Pound king, Saul “Canelo” Alvarez. He has most recently defeated 2 previously unbeaten Super Middleweights and defended his title against a formidable challenger in Avni Yildrim.

With respect to Plant, he does have 4 title defenses with his best win being over Jose Uzcategui. Mike Lee was also a respectable opponent, but one would be hard pressed to compare either one with Billy Joe Saunders or Callum Smith – the two Brits defeated by Alvarez.

Prediction

Canelo must KO Plant because as the PBC fighter, Plant is likely set to get the judges nod. Canelo must realize this and is predicting an 8th round KO. It’s tough to take this type of assertion seriously, with many cautioning other fighters about making such predictions. However, in Canelo’s case, most make an exception.

Personally, my fear is that this fight will be boring, with Plant excessively moving as his track leg physique indicated at the weigh-in. I hope I am wrong, but think I may not be.

Word2Vec Mexican Spanish Model: Lyrics, News Documents

A Corpus That Contains Colloquial Lyrics & News Documents For Mexican Spanish

This experimental dataset was developed by 4 Social Science specialists and one industry expert, myself, with different samples from Mexico specific news texts and normalized song lyrics. The intent is to understand how small, phrase level constituents will interact with larger, editorialized style text. There appears to be no ill-effect with the combination of varied texts.

We are working on the assumption that a single song is a document. A single news article is a document too.

In this post, we provide a Mexican Spanish Word2Vec model compatible with the Gensim python library. The word2vec model is derived from a corpus created by 4 research analysts and myself. This dataset was tagged at the document level for the topic of ‘Mexico’ news. The language is Mexican Spanish with an emphasis on alternative news outlets.

One way to use this WVModel is shown here: scatterplot repo.

Lemmatization Issues

We chose to not lemmatize this corpus prior to including in the word vector model. The reason is two-fold: diminished performance and prohibitive runtime length for the lemmatizer. It takes close to 8 hours for a Spacy lemmatizer to run through the entire set of sentences and phrases. Instead, we made sure normalization was sufficiently accurate and factored out major stopwords.

Training Example

Below we show some basic examples as to how we would train based on the data text. The text is passed along to the Word2Vec module. The relevant parameters are set, but the user/reader can change as they see fit. Ultimately, this saved W2V model will be saved locally.

In this case, the named W2Vec model “Mex_Corona_.w2v” is a name that will be referenced down below in top_5.py.

from gensim.models import Word2Vec, KeyedVectors

important_text = normalize_corpus('C:/<<ZYZ>>/NER_news-main/corpora/todomexico.txt')

#Build the model, by selecting the parameters.
our_model = Word2Vec(important_text, vector_size=100, window=5, min_count=2, workers=20)
#Save the model
our_model.save("Mex_Corona_.w2v")
#Inspect the model by looking for the most similar words for a test word.
#print(our_model.wv.most_similar('mujeres', topn=5))
scatter_vector(our_model, 'Pfizer', 100, 21) 

Corpus Details

Specifically, from March 2020 to July 2021, a group of Mexico City based research analysts determined which documents were relevant to this Mexico news category. These analysts selected thousands of documents, with about 1200 of these documents at an average length of 500 words making its way to our Gensim language model. Additionally, the corpus contained here is made out of lyrics with Chicano slang and colloquial Mexican speech.

We scrapped the webpages of over 300 Mexican ranchero and norteño artists on ‘https://letras.com‘. These artists ranged from a few dozen composers in the 1960’s to contemporary groups who code-switch due to California or US Southwest ties. The documents tagged as news relevant to the Mexico topic were combined with these lyrics with around 20 of the most common stopword removed. This greatly reduced the size of the original corpus while also increasing the accuracy of the word2vec similarity analysis.

In addition to the stop word removal, we also conducted light normalization. This was restricted to finding colloquial transcriptions and converting these to orthographically correct versions on song lyrics.

Normalizing Spanish News Data

Large corporations develop language models under guidance of product managers whose life experiences do not reflect that of users. In our view, there is a chasm between the consumer and engineer that underscores the need to embrace alternative datasets. Therefore, in this language model, we aimed for greater inclusion. The phrases are from a genre that encodes a rich oral history with speech commonly used amongst Mexicans in colloquial settings.

Song Lyrics For Colloquial Speech

This dataset contains lyrics from over 300 groups. The phrase length lyrics have been normalized to obey standard orthographic conventions. It also contains over 1000 documents labeled as relevant to Mexico news.

Coronavirus and similar words.

Github Lyrics Gensim Model

We have made the lyrics and news language model available. The model is contained here alongside some basic normalization methods on a module.

Colloquial Words

The similarity scores for a word like ‘amor’ (love) is shown below. In our colloquial/lyrics language model, we can see how ‘corazon’ is the closest to ‘amor’.

print(our_model.wv.most_similar('amor', topn=1))
[('corazon', 0.8519232869148254)]

Let’s try to filter through the most relevant 8 results for ‘amor’:

scatter_vector('mx_lemm_ner-unnorm_1029_after_.w2v', 'amor', 100, 8)
Out[18]: 
[('corazon', 0.8385680913925171),
 ('querer', 0.7986088991165161),
 ('jamas', 0.7974023222923279),
 ('dime', 0.788547158241272),
 ('amar', 0.7882217764854431),
 ('beso', 0.7817134857177734),
 ('adios', 0.7802879214286804),
 ('feliz', 0.7777709364891052)]

For any and all inquiries, please send me a linkedin message here: Ricardo Lezama. The word2vec language model file is right here: Spanish-News-Colloquial.

Here is the scatterplot for ‘amor’:

Scatterplot for ‘amor’.

Diversity Inclusion Aspect – Keyterms

Visualizing the data is fairly simple. The scatterplot method allows us to show which terms surface in similar contexts.

Diversity in the context of Mexican Spanish news text. Query: “LGBT”

Below, I provide an example of how to call the Word2Vec model. These Word2Vec documents are friendly to the Word2Vec modules.

from gensim.models import Word2Vec, KeyedVectors
coronavirus_mexico = "mx_lemm_ner-unnorm_1029_after_.w2v"
coronavirus = "coronavirus-norm_1028.w2v"
wv_from_text = Word2Vec.load(coronavirus)

#Inspect the model by looking for the most similar words for a test word.
print(wv_from_text.wv.most_similar('dosis', topn=5))
#Let us see what the 10-dimensional vector for 'computer' looks like.

Mikey Garcia – A Manny Pacquiao Style Loss

The Sandor Martin upset is reminiscent of the great Manny “Pacman” Pacquiao upset against Jeff Horn.

In Fresno, California, Mikey Garcia delivered a slow and methodical performance against Sandor Martin. For his part, Martin delivered on the expected southpaw style, consistent jab and constant lateral movement – this proved more valuable to the California judges who exhibited a preference for elusive activity. Martin scored the upset and is now heralded as Spain’s hero in the boxing world.

Mikey Garcia vs Sandor Martin on October 16, 2021

Boxing is Fair

There is a silver lining of sorts. In recent boxing memory, this Martin upset is not the first time that a relative unknown pulls off an upset against a fighter backed by a devoted fanbase: Jeff Horn’s win over Pacquiao did little to dull Pacquiao star power or even boxing ability. For different reasons, fans can argue giving Garcia a pass: he has returned (grudgingly) after a 20 month fight layoff.

In Pac’s case, fans reasoned he faced a ‘dirty’ fighter whose style was deemed ungentlemanly and home base of Australia provided an unfair judging hand. In a way, the purists who love ‘pure boxing’ and ‘pure officiating’ can signal to the fight as a fair outcome involving a well known boxing star. Garcia fans simply saw a man closing the distance and another engaged in lateral movement. No one saw an embarrassing loss. It was just a boring fight.

The demand stayed constant despite the loss to Horn. Pac-Man moved on and we saw him fight once more for years going towards a title belt in the same welterweight division. Ultimately, Pacquiao never gave Horn a rematch. Curiously, we observe here that it’s likely Garcia will not go out for a Martin.

It is possible that the loss served a few more ends than just catapulting Sandor Martin onto the radar of boxing fans: validate fair judges. It’s likely that Garcia lost the fight technically, but Martins’ style, evasiveness and jab kept Garcia from connecting more than twice a round on his head.

The Loss From An Unwanted Fight

I doubt there will be a rematch.

Like with Horn, the Mikey Garcia fanbase can turn the page and demand another fight with a more offensive minded opponent. Ultimately, the assertion from Robert Garcia about the fight not being even desired Mikey Garcia echoes the thoughts of many fans: Mikey just did not want this fight. According to Garcia, Mikey only took the fight because of the possibility of Bam Rodriguez receiving a title fight. That title fight did not materialize, with the ‘boxing’ and lateral movement winning the day too. A bad night, but undoubtedly, the fans will be happy with a better fight.

Robert Garcia on Mikey Not Wanting The Martin Fight

Robert Garcia Brutally Honest About Mikey Garcia Fight vs Martin EsNews Boxing – YouTube

Semantic Similarity & Visualizing Word Vectors

Introduction: Two Views On Semantic Similarity

In Linguistics and Philosophy of Language, there are various methods and views on how to best describe and justify semantic similarity. This tutorial will be taken as a chance to lightly touch upon very basic ideas in Linguistics. We will introduce in a very broad sense the original concept of semantic similarity as it pertains to natural language.

Furthermore, we will see how the linguistics view is drastically different from the state of the art Machine Learning techniques. I offer no judgments on why this is so. It’s just an opportunity to compare and contrast passively. Keeping both viewpoints in mind during an analysis is helpful. Ultimately, it maximizes our ability to understand valid Machine Learning output.

The Semantic Decomposition View

There is a compositional view that in its earliest 19th century incarnation is attributable to Gottleb Frege, in which the meaning of terms can be decomposed into simpler components such that the additive process of combining them yields a distinct meaning. Thus, two complex meanings may be similar to one another if they are composed of the same elements.

For example, the meaning of ‘king’ could be construed as an array of features, like the property of being human, royalty and male. Under this reasoning, the same features would carry over to describe ‘queen’, but the decomposition of the word would replace male with female. Thus, in the descriptive and compositional approach mentioned, categorical descriptions are assigned to words whereby decomposing a word reveals binary features for ‘human’, ‘royalty’ and ‘male’. Breaking down concepts represented by words into simpler meanings is what is meant with ‘feature decomposition’ in a semantic and linguistic context.

The Shallow Similarity View

Alternatively, Machine Learning approaches to semantic similarity involves a contextual approach towards the description of a word. In Machine Learning approaches, there is an assignment of shared indices between words. The word ‘king’ and ‘queen’ will appear in more contexts that are similar to one another than other words. In contrast, the words ‘dog’ or ‘cat’, which implies that they share more in common. Intuitively, we understand that these words have more in common due to their usage in very similar contexts. The similarity is represented as a vector in a graph. Each word can have closely adjacent vectors reflecting their similar or shared contexts.

Where both approaches eventually converge is in the ability for the output of a semantic theory or vector driven description of words matches with language users intuitions. In this tutorial and series of examples, we will observe how the Word2Vec module does fairly well with new, recent concepts that only recently appeared in mass texts. Furthermore, these texts are in Mexican Spanish, which implies that the normalization steps are unique to these pieces of unstructured data.

Working With Mexican Spanish In Word2Vec

In this series of python modules, I created a vector model from a Mexican Spanish news corpus. Each module has a purpose: normalization.py cleans text so that it can be interpretable for Word2Vec. Normalization also produces the output lists necessary to pass along to Gensim. Scatterplot.py visualizes the vectors.from the model. This corpus was developed as described below.

This dataset of 4000 documents is verified as being relevant to several topics. For this tutorial, there are three relevant topics: {Mexico, Coronavirus, and Politics}. Querying the model for words in this pragmatic domain is what is most sensible. This content exists in the LaCartita Db and is annotated by hand. The annotators are a group of Mexican graduates from UNAM and IPN universities. A uniform consensus amongst the news taggers was required for its introduction into the set of documents. There were 3 women and 1 man within the group of analysts, with all of them having prior experience gathering data in this domain.

While the 4000 Mexican Spanish news documents analyzed are unavailable on github, a smaller set is provided on that platform for educational purposes. Under all conditions, the data was tokenized and normalized with a set of Spanish centric regular expressions.

Please feel free to reach out to that group in research@lacartita.com for more information on this hand-tagged dataset.

Normalization in Spanish

There are three components to this script: normalizing, training a model and visualizing the data points within the model. This is why we have SkLearn and Matplotlib for visualization, gensim for training and custom python for normalization. In general, the pipeline cleans data, organizes it into a list of lists format that works for the Word2Vec module and trains a model. I’ll explain how each of those steps is performed below.

The normalize_corpus Method

Let’s start with the normalization step which can be tricky given the fact that the dataset can sometimes present diacritics or characters not expected in English. We developed a regular expression that permits us to search and find all the valid text from the Mexican Spanish dataset.

from gensim.models import Word2Vec

import numpy as np 
 
import re 

from sklearn.manifold import TSNE

import matplotlib.pyplot as plt

plt.style.use('ggplot')
   
def normalize_corpus(raw_corpus):
    """
    This function reads clean text. There is a read attribute for the text. 
 
    Argument: a file path that contains a well formed txt file.
    
    Returns: This returns a 'list of lists' format friendly to Gensim. Depending on the size of the  
    ""
    raw_corpus = open(raw_corpus,'r', encoding='utf-8').read().splitlines()
    #This is the simple way to remove stop words
    formatted_sentences=[]
    for sentences in raw_corpus:
        a_words = re.findall(r'[A-Za-z\-0-9\w+á\w+\w+é\w+\w+í\w+\w+ó\w+\w+ú\w+]+', sentences.lower())         
        formatted_sentences.append(a_words)
    return formatted_sentences

important_text = normalize_corpus(<<file-path>>)

Once we generate a list of formatted sentences, which consists of lists of lists containing strings (a single list is a ‘document’), we can use that total set of lists as input for a model. Building the model is likely the easiest part, but formatting the data and compiling it in a usable manner is the hardest. For instance, the document below is an ordered, normalized and tokenized list of strings from this Mexican Spanish News corpus. Feel free to copy/paste in case you want to review the nature of this document:

['piden', 'estrategia', 'inmediata', 'para', 'capacitar', 'policías', 'recientemente', 'se', 'han', 'registrado', 'al', 'menos', 'tres', 'casos', 'de', 'abuso', 'de', 'la', 'fuerza', 'por', 'parte', 'de', 'elementos', 'policiales', 'en', 'los', 'estados', 'de', 'jalisco', 'y', 'en', 'la', 'ciudad', 'de', 'méxico', 'el', 'economista', 'organizaciones', 'sociales', 'coincidieron', 'en', 'que', 'la', 'relación', 'entre', 'ciudadanos', 'y', 'policías', 'no', 'debe', 'ser', 'de', 'adversarios', 'y', 'las', 'autoridades', 'tanto', 'a', 'nivel', 'federal', 'como', 'local', 'deben', 'plantear', 'una', 'estrategia', 'inmediata', 'y', 'un', 'proyecto', 'a', 'largo', 'plazo', 'para', 'garantizar', 'la', 'profesionalización', 'de', 'los', 'mandos', 'policiacos', 'con', 'apego', 'a', 'los', 'derechos', 'humanos', 'recientemente', 'se', 'han', 'difundido', 'tres', 'casos', 'de', 'abuso', 'policial', 'el', 'primero', 'fue', 'el', 'de', 'giovanni', 'lópez', 'quien', 'fue', 'asesinado', 'en', 'jalisco', 'posteriormente', 'la', 'agresión', 'por', 'parte', 'de', 'policías', 'capitalinos', 'contra', 'una', 'menor', 'de', 'edad', 'durante', 'una', 'manifestación', 'y', 'el', 'tercero', 'fue', 'el', 'asesinato', 'de', 'un', 'hombre', 'en', 'la', 'alcaldía', 'coyoacán', 'en', 'la', 'cdmx', 'a', 'manos', 'de', 'policías', 'entrevistada', 'por', 'el', 'economista', 'la', 'presidenta', 'de', 'causa', 'en', 'común', 'maría', 'elena', 'morera', 'destacó', 'que', 'en', 'ningún', 'caso', 'es', 'admisible', 'que', 'los', 'mandos', 'policiales', 'abusen', 'de', 'las', 'y', 'los', 'ciudadanos', 'y', 'si', 'bien', 'la', 'responsabilidad', 'recae', 'sobre', 'el', 'uniformado', 'que', 'actúa', 'las', 'instituciones', 'deben', 'garantizar', 'la', 'profesionalización', 'de', 'los', 'elementos', 'los', 'policías', 'son', 'un', 'reflejo', 'de', 'la', 'sociedad', 'a', 'la', 'que', 'sirven', 'y', 'ello', 'refleja', 'que', 'hay', 'una', 'sociedad', 'sumamente', 'violenta', 'y', 'ellos', 'también', 'lo', 'son', 'y', 'no', 'lo', 'controlan', 'declaró', 'que', 'más', 'allá', 'de', 'que', 'el', 'gobernador', 'de', 'jalisco', 'enrique', 'alfaro', 'y', 'la', 'jefa', 'de', 'gobierno', 'de', 'la', 'cdmx', 'claudia', 'sheinbaum', 'condenen', 'los', 'hechos', 'y', 'aseguren', 'que', 'no', 'se', 'tolerará', 'el', 'abuso', 'policial', 'deben', 'iniciar', 'una', 'investigación', 'tanto', 'a', 'los', 'uniformados', 'involucrados', 'como', 'a', 'las', 'fiscalías', 'sobre', 'las', 'marchas', 'agregó', 'que', 'si', 'bien', 'las', 'policías', 'no', 'pueden', 'lastimar', 'a', 'las', 'personas', 'que', 'ejercen', 'su', 'derecho', 'a', 'la', 'libre', 'expresión', 'dijo', 'que', 'hay', 'civiles', 'que', 'no', 'se', 'encuentran', 'dentro', 'de', 'los', 'movimientos', 'y', 'son', 'agredidos', 'es', 'importante', 'decir', 'quién', 'está', 'tras', 'estas', 'manifestaciones', 'violentas', 'en', 'esta', 'semana', 'vimos', 'que', 'no', 'era', 'un', 'grupo', 'de', 'mujeres', 'luchando', 'por', 'sus', 'derechos', 'sino', 'que', 'fueron', 'grupos', 'violentos', 'enviados', 'a', 'generar', 'estos', 'actos', 'entonces', 'es', 'necesario', 'definir', 'qué', 'grupos', 'políticos', 'están', 'detrás', 'de', 'esto', 'puntualizó', 'el', 'coordinador', 'del', 'programa', 'de', 'seguridad', 'de', 'méxico', 'evalúa', 'david', 'ramírez', 'de', 'garay', 'dijo', 'que', 'las', 'autoridades', 'deben', 'de', 'ocuparse', 'en', 'plantear', 'una', 'estrategia', 'a', 'largo', 'plazo', 'para', 'que', 'las', 'instituciones', 'de', 'seguridad', 'tengan', 'la', 'estructura', 'suficiente', 'para', 'llevar', 'a', 'cabo', 'sus', 'labores', 'y', 'sobre', 'todo', 'tengan', 'como', 'objetivo', 'atender', 'a', 'la', 'ciudadanía', 'para', 'generar', 'confianza', 'entre', 'ellos', 'desde', 'hace', 'muchos', 'años', 'no', 'vemos', 'que', 'la', 'sociedad', 'o', 'los', 'gobiernos', 'federales', 'y', 'locales', 'tomen', 'en', 'serio', 'el', 'tema', 'de', 'las', 'policías', 'y', 'la', 'relación', 'que', 'tienen', 'con', 'la', 'comunidad', 'lo', 'que', 'estamos', 'viviendo', 'es', 'el', 'gran', 'rezago', 'que', 'hemos', 'dejado', 'que', 'se', 'acumule', 'en', 'las', 'instituciones', 'de', 'seguridad', 'indicó', 'el', 'especialista', 'apuntó', 'que', 'además', 'de', 'la', 'falta', 'de', 'capacitación', 'las', 'instituciones', 'policiales', 'se', 'enfrentan', 'a', 'la', 'carga', 'de', 'trabajo', 'la', 'falta', 'de', 'protección', 'social', 'de', 'algunos', 'uniformados', 'la', 'inexistencia', 'de', 'una', 'carrera', 'policial', 'entre', 'otras', 'deficiencias', 'la', 'jefa', 'de', 'la', 'unidad', 'de', 'derechos', 'humanos', 'de', 'amnistía', 'internacional', 'méxico', 'edith', 'olivares', 'dijo', 'que', 'la', 'relación', 'entre', 'policías', 'y', 'ciudadanía', 'no', 'debe', 'ser', 'de', 'adversarios', 'y', 'enfatizó', 'que', 'es', 'necesario', 'que', 'las', 'personas', 'detenidas', 'sean', 'entregadas', 'a', 'las', 'autoridades', 'correspondientes', 'para', 'continuar', 'con', 'el', 'proceso', 'señaló', 'que', 'este', 'lapso', 'es', 'el', 'de', 'mayor', 'riesgo', 'para', 'las', 'personas', 'que', 'son', 'detenidas', 'al', 'tiempo', 'que', 'insistió', 'en', 'que', 'las', 'personas', 'encargadas', 'de', 'realizar', 'detenciones', 'deben', 'tener', 'geolocalización', 'no', 'observamos', 'que', 'haya', 'una', 'política', 'sostenida', 'de', 'fortalecimiento', 'de', 'los', 'cuerpos', 'policiales', 'para', 'que', 'actúen', 'con', 'apego', 'a', 'los', 'derechos', 'humanos', 'lo', 'otro', 'que', 'observamos', 'es', 'que', 'diferentes', 'cuerpos', 'policiales', 'cuando', 'actúan', 'en', 'conjunto', 'no', 'necesariamente', 'lo', 'hacen', 'de', 'manera', 'coordinada']

We build the model with just a few lines of python code once the lists of lists are contained in an object. The next step is to provide these lists as the argument of the Word2Vec in the object important_text. The Word2Vec module has a few relevant commands and arguments, which I will not review in depth here.

from gensim.models import Word2Vec

important_text = normalize_corpus(<<file-path>>)

mexican_model = Word2Vec(important_text, vector_size=100, window=5, min_count=5, workers=10)

mexican_model.save("NewMod1el.w2v")

The scatterplot Method: Visualizing Data

The scatter plot method for vectors allows for quick visualization of similar terms. The scatterplot function uses as an argument a model that contains all the vector representations of the Spanish MX content.

def scatter_vector(modelo, palabra, size, topn):
    """ This scatter plot for vectors allows for quick visualization of similar terms. 
    
    Argument: a model containing vector representations of the Spanish MX content. word
    is the content you're looking for in the corpus.
    
    Return: close words    
    """
    arr = np.empty((0,size), dtype='f')
    word_labels = [palabra]
    palabras_cercanas = modelo.wv.similar_by_word(palabra, topn=topn)
    arr = np.append(arr, np.array([modelo.wv[palabra]]), axis=0)
    for wrd_score in palabras_cercanas:
        wrd_vector = modelo.wv[wrd_score[0]]
        word_labels.append(wrd_score[0])
        arr = np.append(arr, np.array([wrd_vector]), axis=0)
    tsne = TSNE(n_components=2, random_state=0)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    plt.scatter(x_coords, y_coords)
    for label, x, y in zip(word_labels, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
    plt.xlim(x_coords.min()+0.00005, x_coords.max()+0.00005)
    plt.ylim(y_coords.min()+0.00005, y_coords.max()+0.00005)
    plt.show()
    return palabras_cercanas

scatter_vector(modelo, 'coronavirus', 100, 21)

Coronavirus Word Vectors

The coronavirus corpus contained here is Mexico centric in it’s discussions. Generally, it was sourced from a combination of mainstream news sources, like La Jornada, and smaller digital only press, like SemMexico.

We used ‘Word2Vec’ to develop vector graph representations of words. This will allow us to rank the level of similarity between words with a number between 0 and 1. Word2Vec is a python module for indexing the shared context of words and then representing each as a vector/graph. Each vector is supposed to stand-in as a representation of meanning proximity based on word usage. We used Word2Vec to develop a semantic similarity representation for Coronavirus terminology within news coverage.

In this set of about 1200 documents, we created a vector model for key terms in the document; the printed results below show how related the other words are related to our target word ‘coronavirus‘. The most similar term was ‘covid-19’, virus and a shortening ‘covid’. The validity of these results were obvious enough and indicate that our document set contains enough content to represent our intuitions of this topic.

[('covid-19', 0.8591713309288025),
('virus', 0.8252751231193542),
('covid', 0.7919320464134216),
('sars-cov-2', 0.7188869118690491),
('covid19', 0.6791930794715881),
('influenza', 0.6357837319374084),
('dengue', 0.6119976043701172),
('enfermedad', 0.5872418880462646),
('pico', 0.5461580753326416),
('anticuerpos', 0.5339271426200867),
('ébola', 0.5207288861274719),
('repunte', 0.520190417766571),
('pandémica', 0.5115000605583191),
('infección', 0.5103719234466553),
('fumigación', 0.5102646946907043),
('alza', 0.4952083230018616),
('detectada', 0.4907490015029907),
('sars', 0.48677393794059753),
('curva', 0.48023557662963867),
('descenso', 0.4770597517490387),
('confinamiento', 0.4769912660121918)]
The word ‘coronavirus’ in Mexican Spanish text and its adjacent word vectors.

One of the measures for the merit of a large machine learning model is if the output aligns with the intuition of a human judgement. This implies that we should ask ourselves if the topmost ranked ‘similar’ words presented by this word2vec model matches up with our psychological opinion of ‘coronavirus’. Overwhelmingly, the answer is ‘yes’, since Covid and Covid19 nearly always mean the same thing, without a hyphen or if referenced as just ‘virus’ in some texts’.

Strong Normalization Leads To Better Vectors

Better normalization leads to better vectors.

This is verifiable in a scatterplot comparing the distinct text normalization that one intuits is best upon analyzing initial training data.

For example, many place names are effectively compound words or complex strings which can lead to misleading segmentation. This adds noise, effectively misaligning other words in the word vector model. Therefore, finding a quick way to ensure place names are represented accurately helps other unrelated terms surface away from their vector representation. Consider this below scatterplot where the names ‘baja california sur’ and ‘baja california’ are not properly tokenized:

Bad Segmentation Caused By Incomplete Normalization

Replacing the spaces between ‘Baja California Sur’, ‘Baja California’, and ‘Sur de California’, allows for other place names that pattern similarly to shine through in the scatterplot. This reflects more accurate word vector representations.

A better graph from replacing Baja California Sur to ‘bajacaliforniasur’ is a better way to capture the state name.

Shiba Coin Surge Explained: 4 Reasons Why General Market Trends Favor This Latest Surge

In the past 72 hours, the cutely named cryptocurrency, Shiba Inu Coin, an offshoot of the DOGE coin, overtook Bitcoin & Ethereum as the hottest coin in the market. It’s risen at times to nearly 300 percent of it’s original value. Here’s some possible reasons why:

Shiba Inu Coin surges up the ranks in popularity.
  1. Underbanked see new road: Many people find it more profitable to place a bit of their savings on a highly speculative coin as opposed to sitting idle on a bank that may or may not charge them fees for placing so little on their checking account. The underbanked, a category of people with scarce banking resources due to deposit minimums and withdrawal fees, are now migrating to cryptocurrency for both day to day expenditures and to leverage against inflation. Many individuals have rainy day funds deposited in some kind of cryptocurrency.
  2. New surge: Shiba Coin ($SHIB) is permitting people who lost out on the euphoria of the DOGE surge in value to gain on this new value drive. Many are stocking up on the currency which may or may not become traded on Robinhood as well, eventually. That’s a rumor circulating on various Reddit and Twitter forums.
  3. Cryptocurrency transfers replacing wire transfers: Emerging markets, like those in El Salvador, or middle income countries, like India, now see the emergence of Cryptocurrency transfers replacing wire transfers. Wallet to wallet transfers offer a cheaper and quicker way for families to transfer funds between themselves. Recently, Moneygram opted to settle cross-border wiretransfers with the US Dollar Coin. Customers can opt to receive a transfer in the stablecoin ‘USDC’ or cash out. Either way, the customer is now tapped into the Coinbase supported digital coin pegged to the value of the US dollar irrespective of what currency is used to perform the transfer. Many speculate that since families may be going around wire services, groups like Western Union and Moneygram now want in on the action. UC Davis researcher, Alfonso Aranda explains: “Money Gram and Western Union are feeling the heat of small nations making bitcoin legal tender or considering it”. This has been crucial to some countries, like El Salvaldor, that now supports small wealth transfers within families through a government app. Remittances are likely the next big wave of funds that may end up funneled through a blockchain and cryptocurrency.
  4. Musk. Elon Musk tweeted a picture of his dog, which is a Shiba Inu breed. The mechanics of how his influence and this pic led to a surge is beyond me, but even Bloomberg assumes this tweet is one of the causes for Shiba Inu Coin’s surge in value. Here’s the tweet: Elon Musk en Twitter: “Floki Frunkpuppy https://t.co/xAr8T0Jfdf” / Twitter
A simple picture of Elon Musk’s Shiba Inu may have driven the $SHIB coin’s latest surge in value.

Survey: US Based Mexican Average Salary Between 47k to 67k Annually

From May 2nd to May 5th of 2020, I gathered data with an online survey administered with the help of Mexican based Data Analysts who helped me recruit participants and review the data. My goal was to understand how COVID-19 affected my community’s economic status and employment prospects.

The descriptions here apply to 77 confirmed Mexicans, with 48 Mexican individuals verified through known connections and ‘friends-of-friends’ in social media. The overall sample was larger than 160 individuals surveyed online. All were Spanish speaking, thus, classified as ‘Latino’ at a minimum if they did not opt into the ‘Mexican’ category.

Survey Goals

We wanted to know the average salary of Mexicans based in the United States; we wanted to know how Mexicans fared during the peak periods of uncertainty in the pandemic. We consider the peak period of uncertainty to be early March 2020 to June 2020 during the pandemic when no vaccine was available and the most stringent lockdowns took place.

Ultimately, we suspect that the salary prospects of Mexicans during the COVID-19 lockdown were mitigated by their participation in the construction and education sectors of the economy.

Initial Motivations

We wanted to maximize privacy, but understand the moment faced by the Mexican community. During this time period, remittances to Mexico broke records. Unfortunately, many erroneously attributed Mexican resilience to a meager and occasional stimulus check. Those Mexicans with deeper ties to Mexico tended to not have access to the stimulus check; their legal status as economic migrants would not support the idea.

There was also an additional need to hone in on Mexicans specifically, as most research tends to homogenize distinct communities under ‘Latino’. In this research, the outcomes for the entire population of Spanish speaking individuals with ties outside of the US are worse than what is presented in this report.

Ultimately, the data found within this survey is in harmony with the fact that Mexicans somehow found more work or income to send as remittances to Mexico during the multiple lockdowns.

Survey Administered In Spanish

To confirm we were indeed surveying self-identified Mexican individuals, we administered the survey in Spanish; we also exclusively distributed this survey within closed networks of the Mexican community. However, we also went through a paid-tier of Survey Monkey to examine whether those results remained consistent or affected the overall trajectory of the average salary range for Mexicans in the US. The act of including Survey Monkey participants raised the minimum average salary range too.

Due to mandatory social distancing, the usage of online survey methods was the only acceptable way to take a survey data-based snapshot of our community.

Survey Results

The survey results indicated that about half of Mexicans were not impacted in their employment or hours worked during the Coronavirus due to their links to education, clerical, health and construction sectors.

The four main sectors: Health, Education, Clerical and Construction – mostly WFH or essential sectors.

From the ‘most-trusted’ group of 48 survey participants who were verified through social media and prior in-person interactions, we determined that the average salary range was from US 47,347 dollars to 67,243 dollars for Mexicans. Including, the Survey Monkey group raises the lower range to 48,000 and upper range stays flat.

Students, Restaurant Workers

If we remove individuals who are underemployed due to their status as students, restaurant workers (many of whom received some moderate amount of unemployment or switched sectors), then the minimum average salary is raised to over 56 thousand dollars.

Thus, we could surmise that the average Mexican is making over 56 thousand dollars if active and fully employed in the labor market.

To be more exact, here is the average salary range for the fully employed Mexicans surveyed:

57812.567243.94
Fully Employed averages.

48 percent of the Mexicans surveyed make over 50000 USD annually:

Survey participants were asked which salary range their annual income fell under. Nearly half of Mexicans surveyed selected more than 50,000 annually.

When averaged out, the bottom range for an Average Mexican salary – for the participants surveyed in both SM and via closed networks – is 48571 US dollars annually to 67243 dollars annually:

48571.4367243.94
Lower and Upper Bound for average salary range for Mexican individuals in United States.
Bottom 9 percent were students, restaurant employees impacted by Covid-19 Lockdowns

If we included Survey Monkey based participants, we see a bump in Mexican minimum salaries jump to 51000 dollars annually.

Lockdown Impacts

The Coronavirus impacted about half of the Mexicans – mostly women – surveyed when it came to the number of hours worked. Those who were impacted tended to have major losses in hours worked. The rest lost between 1 to less than 20 hours of work.

Women made up 72 to 63 percent of Mexican respondents who were impacted adversely by Covid-19

Our numbers vary due to the exclusion or inclusion of Survey Monkey based respondents. If we went outside of our vetted participants, we received noisy data from individuals who can not be confirmed as Mexican or whom may not be operating in good faith. Nonetheless, income averages creeped higher and unemployment rates lowered when including these unvetted Mexicans.