On this page, we review possibly objectionable content related to Mexicans. We have stored these tweets in a database. Many people make statements on Twitter ‘with a pinch of salt’. However, therein lies a powerful question: who gets to define what is simply a cheeky reference and what crosses the line into cementing or fomenting detrimental worldviews? A few simple questions will help the reader of the tweets decide what to report to Twitter:
Can this person make this statement in front of the demographic mentioned?
Is this person a member of the community referenced?
Can this person make this statement at work without an HR consult afterwards or some other kind of censure?
If the answer to any of the questions above is “no”, then the tweet is likely objectionable and worth passing along to Twitter.
The social media representation of communities is important. While freedom of speech is also important, and we should not seek to prevent statements from being uttered or tweeted, we can check their propagation: a racially biased or offensive view should always be met with a concerted rebuttal.
Check out the latest chatter from people using the word ‘Chicano’ on Twitter.
In an effort to highlight more content, we developed a few database queries to routinely retrieve uncontroversial tweets. Some of these contain frivolous references, others insightful comments. Unfortunately, on many social media platforms, some of the least informed content gets elevated in the general public’s consciousness. This page is an effort to add visibility to the reactions, concerns and ideas of the less prominent (“unliked”, less indexed) voices on Twitter, which are equally valid.
On this page, you can monitor content that contains the keyword ‘Chicano’ without any explicit content. For the more flagrant content, please visit this link. This relatively neutral content should be easy enough to follow. I sort this list of tweets programmatically; using the Twitter Search API, I am able to amass a daily sampling of tweets on these concepts.
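A hedged sketch of how such a daily sampling might be pulled with the Twitter Search API (the endpoint is Twitter's standard v1.1 search; the names `build_search` and `fetch_tweets`, and the bearer-token handling, are my own illustrative framing, not this page's actual code):

```python
import requests

SEARCH_URL = 'https://api.twitter.com/1.1/search/tweets.json'

def build_search(keyword, count=100):
    # Query parameters for one daily sample of recent tweets on a keyword.
    return {'q': keyword, 'count': count, 'result_type': 'recent'}

def fetch_tweets(keyword, bearer_token):
    # Pull the sample and keep only the tweet text.
    response = requests.get(
        SEARCH_URL,
        headers={'Authorization': 'Bearer ' + bearer_token},
        params=build_search(keyword),
    )
    response.raise_for_status()
    return [status['text'] for status in response.json()['statuses']]
```

Running `fetch_tweets('Chicano', token)` once a day would accumulate the kind of sampling described above.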
At any rate, these pages allow one to observe what topics are on the mind of the more vocal members of the community. Feel free to report any objectionable content to Twitter. The tweets are shown in their entirety, and the views expressed therein are not my own or those of my employer.
The content is refreshed roughly every 24 hours; you will see either today’s results or the previous day’s.
This article is reshared with permission from La Cartita. Originally published on that platform on 12/16/2017.
La Cartita — (6/30/2017) — PEGASUS is the world’s most advanced spyware, a special type of software designed to spy on cellular phones and computers without the user’s permission. The software is most often used to target a victim’s phone camera and microphone. The audio and video are recorded and then leveraged against the victim in some way. PEGASUS is designed by the NSO Group, a team of former and current Israeli soldiers from UNIT 8200, a signals intelligence unit of the Israeli army (Israel Defense Forces, or IDF). The company was (and may still be) subsidized by the Israeli government. All of the funds that develop Israel’s espionage capacity ultimately come from the large military aid package provided by the US government.
Francisco Partners LP, the real owners of PEGASUS
PEGASUS was recently the subject of a highly circulated article from the NY Times detailing how the NSO Group’s software was found to have been used by the Mexican government against activist lawyers and journalists. The NY Times article was based primarily on a report from the Citizen Lab group in Toronto. NSO Group works exclusively with governments. The first documented use of the software was against Ahmed Mansoor, a respected legal scholar who speaks out against torture.
Unfortunately, NSO Group does not operate independently of private capital: it was acquired by a private equity firm, Francisco Partners LP. The firm has several technology holdings, including a software unit spun off from Dell Computers.
II. CALPERS Puts 100 Million on Pegasus’ Owner; UC Regents 25 Million
CALPERS funds Francisco Partners LP, owners of the NSO Group
Francisco Partners LP has two publicly listed locations that function as its corporate offices: 1 Letterman Drive, Suite C-410, San Francisco, California, and another office in London. Its holdings are valued at 8 billion dollars. Ironically, the firm is increasingly well positioned to exploit commercial software, since it owns increasingly ubiquitous software and hardware platforms to which the NSO Group can presumably gain privileged access.
Francisco Partners LP has many government contacts. At least, one can assume this to be the case given the high number of public pension funds that have invested in the company. Most notably for some of our readers, CALPERS has paid 100,000,000 dollars into a Francisco Partners LP fund. The following is a cursory review of the amounts invested in Francisco Partners LP’s funds by US public pensions.
The Following Public Pensions Pay Into Francisco Partners LP Funds:
How to interpret the figures: the amount invested follows each fund’s name, and the rightmost column shows the date of the latest known investment made by the public pension fund into the Francisco Partners LP funds that finance company operations, e.g. capitalization, loan collateral, operating costs, etc.
California Public Employees’ Retirement System USD 100,000,000 9/30/2016
Oregon Public Employees Retirement System USD 100,000,000 12/31/2016
University of Texas Investment Management Co/The USD 75,000,000 5/31/2016
California State Teachers’ Retirement System USD 75,000,000 9/30/2016
Florida Retirement System USD 75,000,000 9/30/2016
New York City Fire Pension Fund USD 75,000,000 6/30/2016
Colorado Public Employees’ Retirement Association USD 50,000,000 12/31/2015
School Employees Retirement System of Ohio USD 40,000,000 12/31/2016
Regents of the University of California/The USD 25,000,000 9/30/2014
West Midlands Pension Fund USD 30,008,541 3/31/2016
University of Michigan USD 20,000,000 9/17/2009
Pennsylvania State Employees’ Retirement System USD 20,000,000 12/31/2015
Ohio Police & Fire Pension Fund USD 15,000,000 6/30/2014
III. The profit model for NSO Group: Hack More, Pay Less: Realizing Scale
Documents leaked to the NY Times revealed the NSO Group’s external clients and their fee structure. The NSO Group charges USD 500,000 to a client state that wishes to install its software on some piece of hardware. An additional USD 650,000 is assessed to intercept or hack 10 iPhones or 10 Androids. Finally, a client may be charged USD 800,000 more to hack 100 phones of any make or model. This pricing model encourages hacking more targets so that a government ‘gets its money’s worth’.
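As a rough illustration of that incentive (the tier boundaries and the function below are my own reading of the reported figures, not NSO's published price list):

```python
def nso_quote(phones):
    # Reported fees: USD 500,000 installation, USD 650,000 for the first
    # tier of 10 phones, and USD 800,000 more for the 100-phone tier.
    setup = 500_000
    if phones <= 10:
        return setup + 650_000
    return setup + 650_000 + 800_000

# Per-phone cost drops sharply with volume, the 'hack more, pay less' effect:
per_phone_small = nso_quote(10) / 10     # 115,000.0 per phone
per_phone_large = nso_quote(100) / 100   # 19,500.0 per phone
```

Under the reported fees, each additional target in the 100-phone tier costs a fraction of a target in the 10-phone tier.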
IV. Government of Mexico: Ayotzinapa Hacks
The Government of Mexico – even before it had a massive fiasco on its hands with the Ayotzinapa case of 2014 – has invested at least 80 million dollars in projects with the NSO Group since 2013. That figure could only have gone up as the EPN administration struggled to maintain power.
The Ayotzinapa case involves many dozens of lawyers and activist groups. A rough estimate from the Inter-American Commission on Human Rights holds that at least 196 people were affected on the night of September 26, 2014. These people and their extended families should presume themselves to be subjects of surveillance in one shape or another because of their legal connection and right to claim restitution. At the time of writing, many direct family members of the disappeared 43 have phones that exhibit strange behavior.
In this post, we provide a series of web scraping examples and a reference for people looking to bootstrap text for a language model. The advantage is that a greater number of spoken-speech domains can be covered. Newer vocabulary and very common slang are picked up through this method, since most corporate language managers do not often interact with this type of speech.
Most people would not consider Spanish under-resourced. However, considering the word error rate of products like the speech recognition feature in a Hyundai or Mercedes-Benz, or text classification on social media platforms generally, which is skewed toward English-centric content, there is certainly a performance gap between contemporary #Spanish speech in the US and the products developed for that demographic of speakers.
Lyrics are a great reference point for spoken #speech. This contrasts greatly with long-form news articles, which are almost academic in tone. Read speech also carries a certain intonation, which does not reflect the short, abbreviated, elliptical patterning common to spoken speech. As such, knowing how to parse the letras.com pages may be a good idea for those refining and expanding language models with “real world speech”.
1. Point to Letras.com.
2. Retrieve artist songs.
3. Generate individual texts for songs until complete.
4. Repeat until all artists in the artists file are retrieved.
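The steps above can be sketched as a small driver loop (a minimal sketch: `fetch_songs` and `write_song` are stand-ins for the real scraping functions described below, so the control flow can be read on its own):

```python
def scrape_all(artists, fetch_songs, write_song):
    # Step 4: repeat for every artist in the artists file.
    for artist in artists:
        # Step 2: retrieve the artist's songs from letras.com.
        songs = fetch_songs('https://www.letras.com/' + artist + '/')
        # Step 3: generate an individual text for each song.
        for song in songs:
            write_song(artist, song)

# Example run with stand-in functions instead of live requests:
written = []
scrape_all(['jose-jose'],
           fetch_songs=lambda url: ['El Triste', 'Amar y Querer'],
           write_song=lambda artist, song: written.append((artist, song)))
```

The real script below fills in `fetch_songs` and `write_song` with requests and BeautifulSoup calls.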
The above steps are very abbreviated, and even the description below is perhaps too short. If you’re a beginner, feel free to reach out to email@example.com; I’d rather work with beginners directly. Experienced Python programmers should have no issue with the present documentation or with modifying the basic script and idea to their liking.
In NLP, the number one issue will never be a lack of innovative techniques, community or documentation for commonly used libraries. The number one issue is and will continue to be a proper sourcing and development of training data.
Many practitioners have found that accurate, use-case-specific data works better than a generalized solution like BERT or other large language models. These issues are most evident in languages, like Spanish, that do not have as high a presence in the resources that compose BERT, like Wikipedia and Reddit.
Song Lyrics As Useful Test Case
At a high level, we created a list of relevant artists, then looped through the list to check whether letras.com had any songs for each of them. Once a request yielded a result, we looped through the individual songs for each artist.
The proper acquisition of data can be accomplished with BeautifulSoup. The library has been around for over 10 years, and it offers an easy way to process HTML or XML parse trees in Python; you can think of BS as a way to acquire the useful content of an HTML page – everything bounded by tags. The requests library is also important, as it is the way to reach out to a webpage and retrieve the entirety of the HTML page.
# -*- coding: utf-8 -*-
# Created on Sat Oct 16 22:36:11 2021
import uuid

import requests
from bs4 import BeautifulSoup

artist = requests.get("https://www.letras.com").text
The line `requests.get("https://letras.com").text` does what the attribute `text` implies: the call obtains the HTML file’s content and makes it available within the Python program. Adding a function definition helps group this useful content together.
Functions For WebScraping
Creating a bs4 object is easy enough. Add the link reference as a first argument, then parse each of these lyric pages on DIV tags. In this case, link="letras.com" is the argument to pass to the function. The function lyrics_url returns all the div tags with a particular class value. That is the text that contains the artist’s landing page, which itself can be parsed for available lyrics.
def lyrics_url(web_link):
    """This helps create a BS4 object.

    Args: web_link containing references.
    Return: text with content.
    """
    artist = requests.get(web_link).text
    check_soup = BeautifulSoup(artist, 'html.parser')
    return check_soup.find_all('div', class_='cnt-letra p402_premium')
The image above shows the content within a potential argument for lyrics_url, “https://www.letras.com/jose-jose/135222/”. See the GitHub repository for more details.
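To make the extraction concrete without a live request, here is what that find_all filter pulls out of a page shaped like a letras.com lyrics page (the HTML snippet below is fabricated for illustration; a real call would pass the URL above to lyrics_url):

```python
from bs4 import BeautifulSoup

sample_page = """
<html><body>
  <div class="header">navigation, ads, share buttons...</div>
  <div class="cnt-letra p402_premium"><p>Que triste fue decirnos adios...</p></div>
</body></html>
"""

soup = BeautifulSoup(sample_page, 'html.parser')
# Only the div carrying the lyrics class survives the filter.
divs = soup.find_all('div', class_='cnt-letra p402_premium')
lyrics_text = divs[0].get_text(strip=True)
```

Everything else on the page (navigation, ads, share buttons) is discarded; only the lyrics div remains.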
Drilling down to a specific artist requires basic knowledge of how Letras.com organizes songs into an artist’s home page. The function artist_songs_url involves parsing through the entirety of a given artist’s song list and drilling down further into each specific title.
In the main block, we can call these functions to iterate through the artist pages and song functions, generating a unique file and name for each song and its lyrics. The function generate_text writes one set of lyrics into each individual file. Later, for Gensim, we can turn the lyrics files into a single coherent Gensim list.
def artist_songs_url(web_link):
    """This helps land on the URLs of the songs for an artist.

    Args: web_link is the artist's page.
    Return: songs from https://www.letras.com/gru-;/
    """
    artist = requests.get(web_link).text
    print("Status Code", requests.get(web_link).status_code)
    check_soup = BeautifulSoup(artist, 'html.parser')
    # each song's lyrics page carries div class="cnt-letra p402_premium"
    songs = check_soup.find_all('li', class_='cnt-list-row -song')
    return songs
def generate_text(url):
    songs = artist_songs_url(url)
    for a in songs:
        song_lyrics = lyrics_url(a['data-shareurl'])
        new_file = open(str(uuid.uuid1()) + 'results.txt', 'w', encoding='utf-8')
        for div in song_lyrics:
            new_file.write(div.get_text())
        new_file.close()
    return print('we have completed the download for ', url)
if __name__ == '__main__':
    artistas = open('artistas', 'r', encoding='utf-8').read().splitlines()
    url = 'https://www.letras.com/'
    for a in artistas:
        generate_text(url + a + "/")
    # once complete, run `copy *results output.txt` to consolidate the lyrics into a single file
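The consolidation step in that final comment can also be done in Python, which has the advantage of producing the token-list-of-lists shape Gensim models expect (a minimal sketch, assuming the per-song files end in 'results.txt' and sit in the working directory; `consolidate_lyrics` is my own helper name, not part of the script above):

```python
import glob

def consolidate_lyrics(pattern='*results.txt'):
    # Read every per-song file and split each line into tokens,
    # yielding the list of token lists that gensim.models.Word2Vec expects.
    corpus = []
    for path in sorted(glob.glob(pattern)):
        with open(path, 'r', encoding='utf-8') as handle:
            for line in handle:
                tokens = line.strip().split()
                if tokens:
                    corpus.append(tokens)
    return corpus
```

From there, something like `gensim.models.Word2Vec(consolidate_lyrics())` would train directly on the scraped lyrics.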