Web Scraping As a Sourcing Technique For NLP

Introduction

In this post, we provide a series of web scraping examples and a reference for people looking to bootstrap text for a language model. The advantage is that a greater number of spoken-speech domains can be covered. Newer vocabulary and very common slang are picked up through this method, since most corporate language managers do not often interact with this type of speech.

Most people would not consider Spanish necessarily under-resourced. However, considering the word error rate in products like the speech recognition feature in a Hyundai or Mercedes-Benz, or Spanish text classification on social media platforms skewed towards English-centric content, there certainly seems to be a performance gap between contemporary Spanish speech in the US and products developed for that demographic of speakers.

Lyrics are a great reference point for spoken speech. This contrasts greatly with long-form news articles, which are almost academic in tone. Read speech also carries a certain intonation, which does not reflect the short, abbreviated or elliptical patterning common to spoken speech. As such, knowing how to parse the letras.com pages may be a good idea for those refining and expanding language models with “real world speech”.

Overview:

  • Point to Letras.com
  • Retrieve the artist
  • Retrieve the artist's songs
  • Generate an individual text file for each song until complete.
  • Repeat until all artists in the artists file are retrieved.

The above steps are very abbreviated, and even the description below is perhaps too short. If you’re a beginner, feel free to reach out to lezama@lacartita.com. I’d rather deal with beginners more directly; experienced Python programmers should have no issue with the present documentation or with modifying the basic script and idea to their liking.

Sourcing

In NLP, the number one issue will never be a lack of innovative techniques, community or documentation for commonly used libraries. The number one issue is, and will continue to be, the proper sourcing and development of training data.

Many practitioners have found that accurate, use-case-specific data beats a generalized solution, like BERT or other large language models. These issues are most evident in languages, like Spanish, that do not have as high a presence in the resources from which BERT is built, like Wikipedia and Reddit.

Song Lyrics As Useful Test Case

At a high level, we created a list of relevant artists, then looped through the list to search letras.com for songs by each of them. Once we found that a request yielded a result, we looped through the individual songs for each artist.


Requests, BS4

The proper acquisition of data can be accomplished with BeautifulSoup. The library has been around for over 10 years, and it offers an easy way to process HTML or XML parse trees in Python; you can think of BeautifulSoup as a way to acquire the useful content of an HTML page – everything bounded by tags. The requests library is also important, as it is the way to reach out to a webpage and retrieve the entirety of its HTML.

# -*- coding: utf-8 -*-
"""
Created on Sat Oct 16 22:36:11 2021
@author: RicardoLezama.com
"""
import sys
import uuid

import requests
from bs4 import BeautifulSoup

artist = requests.get("https://www.letras.com").text

The line `requests.get("https://www.letras.com").text` does what the `text` attribute implies: the call obtains the HTML content and makes it available within the Python program. Adding a function definition helps group this useful content together.

Functions For WebScraping

Creating a bs4 object is easy enough. Pass the link as the first argument, then parse each lyric page on its div tags. In this case, link="letras.com" is the argument to pass to the function. The function lyrics_url returns all the div tags with a particular class value: the text of the artist's landing page, which itself can be parsed for available lyrics.



def lyrics_url(web_link):
    """
    Create a BS4 object and pull the lyric blocks out of it.

    Args: web_link, the URL of a letras.com page.

    Returns: a list of div tags containing the lyric content.
    """
    artist = requests.get(web_link).text
    check_soup = BeautifulSoup(artist, 'html.parser')
    return check_soup.find_all('div', class_='cnt-letra p402_premium')
    
On letras.com, the highlighted portion is contained within a <div> tag.

The image above shows the content behind a potential argument for lyrics_url, “https://www.letras.com/jose-jose/135222/”. See the GitHub repository for more details.

Organizing Content

Drilling down to a specific artist requires basic knowledge of how Letras.com organizes songs into an artist's home page. The function artist_songs_url parses the entirety of a given artist's song list and drills down further into each specific title.

In the main statement, we call these functions to iterate through the artist pages and song functions, generating a unique file and name for each song and its lyrics. The function generate_text writes one set of lyrics into each individual file. Later, for Gensim, we can turn each lyrics file into a single coherent Gensim list.



def artist_songs_url(web_link):
    """
    Land on the URLs of the songs for an artist.

    Args: web_link, the artist's landing page, e.g. https://www.letras.com/gru-;/

    Returns: the li tags listing the artist's songs.
    """
    response = requests.get(web_link)
    print("Status Code", response.status_code)
    check_soup = BeautifulSoup(response.text, 'html.parser')
    songs = check_soup.find_all('li', class_='cnt-list-row -song')
    return songs

def generate_text(url):
    songs = artist_songs_url(url)
    for a in songs:
        song_lyrics = lyrics_url(a['data-shareurl'])
        print(a['data-shareurl'])
        # one uniquely named file per song
        with open(str(uuid.uuid1()) + 'results.txt', 'w', encoding='utf-8') as new_file:
            new_file.write(str(song_lyrics[0]))
        print(song_lyrics)
    print('we have completed the download for', url)


def main():
    artistas = open('artistas', 'r', encoding='utf-8').read().splitlines()
    url = 'https://www.letras.com/'
    for a in artistas:
        generate_text(url + a + "/")
        print('done')
    # once complete, run `copy *results output.txt` to consolidate lyrics into a single file.


if __name__ == '__main__':
    sys.exit(main())
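To later feed Gensim, we need the downloaded lyrics back in memory as a list of token lists, where one song is one document. Here is a minimal sketch of that step; the load_lyrics_corpus helper, the glob pattern and the use of simple_preprocess are assumptions, not part of the script above.

import glob

from bs4 import BeautifulSoup
from gensim.utils import simple_preprocess

def load_lyrics_corpus(pattern='*results.txt'):
    """Read each downloaded lyrics file into a list of token lists."""
    corpus = []
    for path in glob.glob(pattern):
        with open(path, 'r', encoding='utf-8') as song_file:
            # generate_text saved the raw <div> markup, so strip the tags first
            text = BeautifulSoup(song_file.read(), 'html.parser').get_text(' ')
        corpus.append(simple_preprocess(text))  # one song = one document
    return corpus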

Word2Vec Mexican Spanish Model: Lyrics, News Documents

A Corpus That Contains Colloquial Lyrics & News Documents For Mexican Spanish

This experimental dataset was developed by four social science specialists and one industry expert, myself, with different samples of Mexico-specific news texts and normalized song lyrics. The intent is to understand how small, phrase-level constituents interact with longer, editorialized text. There appears to be no ill effect from combining such varied texts.

We are working on the assumption that a single song is a document. A single news article is a document too.

In this post, we provide a Mexican Spanish Word2Vec model compatible with the Gensim Python library. The word2vec model is derived from a corpus created by four research analysts and myself. The dataset was tagged at the document level for the topic of ‘Mexico’ news. The language is Mexican Spanish, with an emphasis on alternative news outlets.

One way to use this WVModel is shown here: scatterplot repo.

Lemmatization Issues

We chose not to lemmatize this corpus prior to including it in the word vector model. The reason is two-fold: diminished performance and a prohibitive runtime for the lemmatizer. It takes close to 8 hours for a Spacy lemmatizer to run through the entire set of sentences and phrases. Instead, we made sure normalization was sufficiently accurate and factored out major stopwords.
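For reference, here is a minimal sketch of the lemmatization pass we decided against; the helper name and batch size are assumptions. Each line goes through the full Spacy pipeline, which is what makes the run so long on a large corpus.

import spacy

nlp = spacy.load('es_core_news_sm')

def lemmatize_lines(lines):
    """Return one list of lemmas per input line (slow on large corpora)."""
    lemmas = []
    for doc in nlp.pipe(lines, batch_size=500):
        lemmas.append([token.lemma_ for token in doc])
    return lemmas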

Training Example

Below we show a basic example of how we train on the text data. The text is passed along to the Word2Vec module. The relevant parameters are set here, but the reader can change them as they see fit. Ultimately, the W2V model is saved locally.

In this case, the W2Vec model name “Mex_Corona_.w2v” will be referenced further down in top_5.py.

from gensim.models import Word2Vec, KeyedVectors

# normalize_corpus and scatter_vector are helper functions from the repo
important_text = normalize_corpus('C:/<<ZYZ>>/NER_news-main/corpora/todomexico.txt')

# Build the model by selecting the parameters.
our_model = Word2Vec(important_text, vector_size=100, window=5, min_count=2, workers=20)
# Save the model.
our_model.save("Mex_Corona_.w2v")
# Inspect the model by looking for the most similar words to a test word.
# print(our_model.wv.most_similar('mujeres', topn=5))
scatter_vector(our_model, 'Pfizer', 100, 21)

Corpus Details

Specifically, from March 2020 to July 2021, a group of Mexico City based research analysts determined which documents were relevant to this Mexico news category. These analysts selected thousands of documents, with about 1200 of them, at an average length of 500 words, making their way into our Gensim language model. Additionally, the corpus contained here is made up of lyrics with Chicano slang and colloquial Mexican speech.

We scraped the pages of over 300 Mexican ranchero and norteño artists on ‘https://letras.com‘. These artists ranged from a few dozen composers of the 1960s to contemporary groups who code-switch due to California or US Southwest ties. The documents tagged as news relevant to the Mexico topic were combined with these lyrics, with around 20 of the most common stopwords removed. This greatly reduced the size of the original corpus while also increasing the accuracy of the word2vec similarity analysis.

In addition to the stopword removal, we also conducted light normalization. This was restricted to finding colloquial transcriptions in song lyrics and converting them to orthographically correct versions.
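A minimal sketch of that kind of normalization follows; the mappings below are hypothetical examples of colloquial transcriptions, not the actual table we used.

import re

# Hypothetical colloquial-to-standard mappings, for illustration only.
COLLOQUIAL = {
    r'\bpa\b': 'para',
    r'\bpos\b': 'pues',
    r'\bnomas\b': 'nomás',
}

def normalize_line(line):
    """Rewrite colloquial transcriptions to standard orthography."""
    for pattern, standard in COLLOQUIAL.items():
        line = re.sub(pattern, standard, line)
    return line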

Normalizing Spanish News Data

Large corporations develop language models under the guidance of product managers whose life experiences do not reflect those of users. In our view, there is a chasm between the consumer and the engineer that underscores the need to embrace alternative datasets. Therefore, in this language model, we aimed for greater inclusion. The phrases come from a genre that encodes a rich oral history, with speech commonly used amongst Mexicans in colloquial settings.

Song Lyrics For Colloquial Speech

This dataset contains lyrics from over 300 groups. The phrase-length lyrics have been normalized to obey standard orthographic conventions. The dataset also contains over 1000 documents labeled as relevant to Mexico news.

Scatterplot: ‘Coronavirus’ and similar words.

Github Lyrics Gensim Model

We have made the lyrics-and-news language model available. The model is contained here alongside some basic normalization methods in a module.

Colloquial Words

The similarity scores for a word like ‘amor’ (love) are shown below. In our colloquial/lyrics language model, we can see that ‘corazon’ is the closest word to ‘amor’.

print(our_model.wv.most_similar('amor', topn=1))
[('corazon', 0.8519232869148254)]

Let’s try to filter through the 8 most relevant results for ‘amor’:

scatter_vector('mx_lemm_ner-unnorm_1029_after_.w2v', 'amor', 100, 8)
Out[18]: 
[('corazon', 0.8385680913925171),
 ('querer', 0.7986088991165161),
 ('jamas', 0.7974023222923279),
 ('dime', 0.788547158241272),
 ('amar', 0.7882217764854431),
 ('beso', 0.7817134857177734),
 ('adios', 0.7802879214286804),
 ('feliz', 0.7777709364891052)]

For any and all inquiries, please send me a LinkedIn message here: Ricardo Lezama. The word2vec language model file is right here: Spanish-News-Colloquial.

Here is the scatterplot for ‘amor’:

Scatterplot for ‘amor’.

Diversity Inclusion Aspect – Keyterms

Visualizing the data is fairly simple. The scatterplot method allows us to show which terms surface in similar contexts.

Diversity in the context of Mexican Spanish news text. Query: “LGBT”

Below, I provide an example of how to load and call the Word2Vec model. These .w2v files load directly with Gensim’s Word2Vec module.

from gensim.models import Word2Vec, KeyedVectors
coronavirus_mexico = "mx_lemm_ner-unnorm_1029_after_.w2v"
coronavirus = "coronavirus-norm_1028.w2v"
wv_from_text = Word2Vec.load(coronavirus)

#Inspect the model by looking for the most similar words for a test word.
print(wv_from_text.wv.most_similar('dosis', topn=5))

Tokenizing Text, Finding Word Frequencies Within Corpora

One way to think about tokenization is as finding the smallest unit of analysis for computational linguistics tasks. As such, tokenization is among the first steps (along with normalization) in the average NLP pipeline or computational linguistics analysis. This process breaks text down into units interpretable by both the computer and the analyst.

NLTK Is Very Comprehensive

NLTK is likely the best place to start for both understanding and customizing NLP pipelines. Please review their documentation on tokenization here: NLTK – Tokenization Example. While I recommend reviewing NLTK, you should also keep up with engineers, who mostly use TensorFlow. Yes, you must learn two packages at once if you are a linguist in the IT industry.

Learn TensorFlow

“Learn TensorFlow. Given the effort you will place into learning how to combine computing with linguistics, you are also, in a strange way, lightening the load by proceeding in parallel with industry trends. A popular topic will likely have ample documentation. Consider that most engineers will have a frame of reference for tokenization that is not necessarily grounded in Linguistics, but instead based on interactions with industry-centric examples, with an intent to prepare data for Machine Learning.”

Industry Perceptions

Thus, if you do not know tokenization both as engineers perceive it and as linguists work with the concept, you will likely be perceived not only as not knowing how to program, but also as not knowing your own subject matter as a linguist. While this is obviously not true, perception matters, so you must make the effort to reach engineers at their level when collaborating.

# -*- coding: utf-8 -*-
"""
Created on Fri Sep 10 23:53:10 2021

@author: Ricardo Lezama
"""
import tensorflow as tf 

text = """ 
    A list comprehension is a syntactic construct available in
    some programming languages for creating a list based on existing lists. 
    It follows the form of the mathematical set-builder notation (set comprehension) as
    distinct from the use of map and filter functions.
    """
 
content = tf.keras.preprocessing.text.text_to_word_sequence(text)

Obviously, we do not want to repeat the one-liner above over and over within our Python script. Thus, we neatly repackage it as the main line of a function titled ‘tokenize_lacartita‘:



def tokenize_lacartita(text):
    """Tokenize a text string with TensorFlow's Keras preprocessing.

    Arg: a text string to be converted into a list of individual tokens,
          e.g. ['this', 'is', 'text', 'tokenized', 'by', 'tensorflow']

    Returns: a list of lowercased tokens.
    """
    keras_tok = tf.keras.preprocessing.text.text_to_word_sequence(text)
    return keras_tok

The data we receive from this tokenization module is shown below. As you can see, the result is a list of individual strings, lowercased and with no punctuation, which is eliminated by default in the tokenization process.

['morelia',
 'apoyar',
 'no',
 'es',
 'delinquir',
 'señalan',
 'grupos',
 'feministas',
 'a',
 'sheinbaum',
 'capacitan',
 'a',
 'personal',
 'de',
 'la',
 'fiscalía',
 'cdmx']

Word Frequency and Relative Percentage

We can create a function to find word frequencies. Granted, the Counter class in Python’s collections module can do this already (see the Counter one-liner after the example below), but, for educational purposes, we include a function to track a word’s frequency within a list. The if-condition below counts every occurrence of our target word within the word list. In this case, we examine a series of headlines related to Mexico that were gathered and classified by hand by Mexican university students.

def word_frequency(target_word, word_list):
    """
    Count how often target_word appears in word_list.

    Arg: the target word to count within the word list.

    Returns: the word alongside its count, e.g. 'amlo: 12'.
    """
    count = 0
    for word in word_list:
        if word == target_word:
            count += 1
    return f'{target_word}: {count}'

The word_frequency function receives “AMLO”, or its normalized version ‘amlo’, alongside the word list as the second argument. The frequency of the string is listed next to the term when it is returned. Obviously, you can add more elaborate details to the body of the function.

word_frequency("amlo", saca)
Out[164]: 'amlo: 12'
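For comparison, here is a sketch of the same count with the standard library’s Counter, where saca is the same tokenized word list as above:

from collections import Counter

counts = Counter(saca)   # frequency of every token in the list
print(counts['amlo'])    # the bare count, e.g. 12 for this sample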

Tokenization In Native Python

At times, an individual contributor must know how to tokenize without writing a custom function or invoking a complex module with heavy libraries. Linguists sometimes work within siloed environments, which implies that you may not have the privileges to install libraries, like TensorFlow, in a generic Linux environment. In these cases, use native Python – the term refers to built-in functions or modules that require no installation beyond an up-to-date version of Python.

In fact, you may need to rely more closely on the raw text you are attempting to tokenize. At times, different orthographic marks are relevant and necessary to find. For example, if you want to split on a space, ” “, or a period, “.”, you can do so by calling the split method.

def word(text): 
    return text.split(" ")

All strings contain a split method that you can invoke for free. Furthermore, you can call a method named ‘strip’ to clean out leading and trailing whitespace. Please see the example below.

def sentence(text): 
    text_strip = text.strip()
    return text_strip.split(" ")

Frequency Counts For Named Entities Using Spacy/Python Over MX Spanish News Text

In this post, we review some straightforward Python code that allows a user to process text and retrieve named entities alongside their counts. The main dependencies are Spacy, a compact version of its Spanish language model built for Named Entity Recognition, and the plotting library Matplotlib, if you’re looking to further visualize the data.

Motivation(s)

Before we begin, it may be relevant to understand why we would want to extract these data points in the first place. Oftentimes, there is a benefit to quickly knowing which named entities a collection (or even a Hadoop-sized bucket) of stories references. For instance, one benefit is the ability to visualize the relative importance of an entity to these stories without having to read all of them.

Even if done automatically, the process of Named Entity Recognition is still guided by very basic principles, I think. For instance, the basic reasoning surrounding a retelling of events in an elementary school summary applies to the domain of Named Entity Recognition. That is covered below in the Wh-question section of this post.

Where did you get this data?

Another important set of questions is what data we are analyzing and how we gathered this dataset.

Ultimately, a great number of computational linguists and NLP practitioners are interested in compiling human-rights-centered corpora to create tools that quickly analyze newsflow on these points. When dealing with sensitive topics, the data has to center on topics relevant to at-need populations. This specific dataset centers on ‘Women/Women’s Issues’ as specified by the group at SugarBearAI.

As for ‘where did they obtain this data?’, the question is answered as follows: this dataset of six hundred articles, drawn from the LaCartita Db – which contains several thousand hand-tagged articles – was annotated by hand. The annotators are a group of Mexican graduates from the UNAM and IPN universities. A uniform consensus amongst the news taggers was required for a document’s introduction into the set. There were 3 women and 1 man within the group of analysts, all of them with prior experience gathering data in this domain.

While the six hundred Mexican Spanish news headlines analyzed are unavailable on GitHub, a smaller set is provided on that platform for educational purposes. In all cases, the data was tokenized and normalized with a fairly sophisticated set of Spanish-centric regular expressions.

Please feel free to reach out to that group at research@lacartita.com for more information on this hand-tagged dataset.

Wh-Questions That Guide News Judgments

With all of the context on data and motivations in mind, we review some points on news judgment that can help with the selection of texts for analysis and guide the interpretation of automatically extracted data points.

Basic news judgment is often informed by the following Wh-questions:

1.) What occurred in this news event? (Topic Classification; Event Extraction)

2.) Who was involved in the news?

3.) When did this news event take place?

4.) Where did it take place?

5.) Why did this take place?

If you think of these questions at a fairly high level of abstraction, then you’ll allow me to posit that the first two questions are often the domain of Topic Classification and Named Entity Recognition, respectively. This post deals with the latter, but assumes that the documents whose named entities we extract are already organized on the basis of some unifying topic. This is why it’s useful to even engage in the activity.

In other words, you – the user of this library – will provide a collection of documents already organized under some concept/topic. You would rely on your knowledge of that topic to make sense of any frequency analysis of named entities, important terms (TF-IDF), etc., as is typical when handling large amounts of unstructured news text. These concepts – NER and TF-IDF – are commonly referenced in Computational Linguistics and Information Retrieval, and they frequently overlap in applied settings. For instance, TF-IDF and NER pipelines power software applications that summarize complex news events in real time. So, it’s important to know that there are all sorts of open source libraries that handle these tasks for any average user or researcher; a small sketch follows.
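As a quick illustration of the TF-IDF side, here is a sketch using Gensim (which we already rely on elsewhere); the two token lists are toy examples. Note that a term appearing in every document, like ‘cdmx’ here, gets zero weight and drops out.

from gensim import corpora, models

docs = [['feministas', 'marchan', 'cdmx'],
        ['capacitan', 'personal', 'fiscalia', 'cdmx']]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]
tfidf = models.TfidfModel(bow)
for doc in tfidf[bow]:
    # (token, weight) pairs; high weights mark document-specific key terms
    print([(dictionary[term_id], round(weight, 3)) for term_id, weight in doc])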

Leveraging Spacy’s Lightweight Spanish Language Model

The actual hard work involves identifying distinct entities; the task relies on statistical processes that try to generalize the typical morphological shape of a Named Entity in text.

In this example, my script is powered by the smaller Spanish language model from Spacy. One thing to note is that its training text has its origins in Wikipedia. This means that newer, contemporary types of text may not be sufficiently well covered – breadth doesn’t imply depth of analysis. Anecdotally, over this fairly small headline-only corpus, sourced by hand with UNAM and IPN students and containing text on the Mexican president, Andres Manuel Lopez Obrador, Covid and local crime stories, we see performance below 80 percent accuracy from the small Spanish language model. Here’s the small, 600-headline-strong sample: example headlines referenced.

NER_News Module

Using the scripts below, you can extract persons and organizations from a corpus. We will use the lighter Spanish language model from Spacy’s natural language toolkit. This post assumes that you’ve dealt with the basics of Spacy installation alongside its required models; if not, visit here. We should therefore expect the lines below to run without problem:

import spacy
import spacy.attrs
nlp = spacy.load('es_core_news_sm')

In this example, we use clean texts that are ‘\n’ (“newline”) separated. We count and identify the entities, writing the results of the NER process to memory or printing them to the console. The following line contains the first bit of code referencing Spacy and individuates each relevant piece of the text. Were these encyclopedic or news articles, the split would probably capture paragraph- or sentence-level breaks in the text:

raw_corpus = open('corpora/titularesdemx.txt','r', encoding='utf-8').read().split("\n")[1:]

The next step involves placing each entity with its frequency count as a value in a dictionary. This dictionary is the result of running our ‘sacalasentidades’ method over the raw corpus. The method extracts geopolitical entities, like a country, or PER-tagged entities, like a world leader.

import spacy
import spacy.attrs
nlp = spacy.load('es_core_news_sm')

import org_per
raw_corpus = open('corpora/titularesdemx.txt','r', encoding='utf-8').read().split("\n")[1:]
entities = org_per.sacalasentidades(raw_corpus)
 
# use list of entities that are ORG or GEO and count up each individual token.

tokensdictionary = org_per.map_entities(entities) 

The formatted output of tokensdictionary will look like this:

{'AMLO': 11,
 'Desempleo': 1,
 'Perú': 1,
 'América Latina': 3,
 'Banessa Gómez': 1,
 'Resistir': 2,
 'Hacienda': 1,
 'Denuncian': 7,
 'Madero': 1,
 'Subastarán': 1,
 'Sánchez Cordero': 4,
 'Codhem': 1,
 'Temen': 2,
 'Redes de Derechos Humanos': 1,
 'Gobernación': 1,
 'Sufren': 1,
 '¡Ni': 1,
 'Exigen': 2,
 'Defensoras': 1,
 'Medicina': 1,
 'Género': 1,
 'Gabriela Rodríguez': 1,
 'Beatriz Gasca Acevedo': 1,
 'Diego "N': 1,
 'Jessica González': 1,
 'Sheinbaum': 3,
 'Esfuerzo': 1,
 'Incendian Cecyt': 1,
 'Secretaria de Morelos': 1,
 'Astudillo': 1,
 'Llaman': 3,
 'Refuerzan': 1,
 'Mujer Rural': 1,
 'Inician': 1,
 'Violaciones': 1,
 'Llama Olga Sánchez Cordero': 1,
 'Fuentes': 1,
 'Refuerza Michoacán': 1,
 'Marchan': 4,
 'Ayelin Gutiérrez': 1,
 'Maternidades': 1,
 'Coloca FIRA': 1,
 'Coloquio Internacional': 1,
 'Ley Olimpia': 3,
 'Toallas': 1,
 'Exhorta Unicef': 1,
 'Condena CNDH': 1,
 'Policías de Cancún': 1,
 'Exposición': 1,
 'Nadia López': 1,
 'Aprueba la Cámara': 1,
 'Patriarcales': 1,
 'Sofía': 1,
 'Crean Defensoría Pública para Mujeres': 1,
 'Friedrich Katz': 1,
 'Historiadora': 1,
 'Soledad Jarquín Edgar': 1,
 'Insuficientes': 1,
 'Wikiclaves Violetas': 1,
 'Líder': 1,
 'Alcaldía Miguel Hidalgo': 1,
 'Ventana de Primer Contacto': 1,
 'Parteras': 1,
 'App': 1,
 'Consorcio Oaxaca': 2,
 'Comité': 1,
 'Verónica García de León': 1,
 'Discapacidad': 1,
 'Cuánto': 1,
 'Conasami': 1,
 'Amnistía': 1,
 'Policía de Género': 1,
 'Parteras de Chiapas': 1,
 'Obligan': 1,
 'Suspenden': 1,
 'Contexto': 1,
 'Clemencia Herrera': 1,
 'Fortalecerán': 1,
 'Reabrirá Fiscalía de Chihuahua': 1,
 'Corral': 1,
 'Refugio': 1,
 'Alicia De los Ríos': 1,
 'Evangelina Corona Cadena': 1,
 'Félix Salgado Macedonio': 5,
 'Gabriela Coutiño': 1,
 'Aída Mulato': 1,
 'Leydy Pech': 1,
 'Claman': 1,
 'Insiste Morena': 1,
 'Mariana': 2,
 'Marilyn Manson': 2,
 'Deberá Inmujeres': 1,
 'Marcos Zapotitla Becerro': 1,
 'Vázquez Mota': 1,
 'Dona Airbnb': 1,
 'Sergio Quezada Mendoza': 1,
 'Incluyan': 1,
 'Feminicidios': 1,
 'Contundente': 1,
 'Teófila': 1,
 'Félix Salgado': 1,
 'Policía de Xoxocotlán': 1,
 'Malú Micher': 1,
 'Andrés Roemer': 1,
 'Basilia Castañeda': 1,
 'Salgado Macedonio': 1,
 'Menstruación Digna': 1,
 'Detenidas': 1,
 'Sor Juana Inés de la Cruz': 1,
 'María Marcela Lagarde': 1,
 'Crean': 1,
 'Será Rita Plancarte': 1,
 'Valparaiso': 1,
 'México': 1,
 'Plataformas': 1,
 'Policías': 1,
 'Karen': 1,
 'Karla': 1,
 'Condena ONU Mujeres': 1,
 'Llaman México': 1,
 'Sara Lovera': 1,
 'Artemisa Montes': 1,
 'Victoria': 2,
 'Andrea': 1,
 'Irene Hernández': 1,
 'Amnistía Internacional': 1,
 'Ley de Amnistía': 1,
 'Nació Suriana': 1,
 'Rechaza Ss': 1,
 'Refugios': 1,
 'Niñas': 1,
 'Fiscalía': 1,
 'Alejandra Mora Mora': 1,
 'Claudia Uruchurtu': 1,
 'Encubren': 1,
 'Continúa': 1,
 'Dulce María Sauri Riancho': 1,
 'Aprueba Observatorio de Participación Política de la Mujer': 1,
 'Plantean': 1,
 'Graciela Casas': 1,
 'Carlos Morán': 1,
 'Secretaría de Comunicaciones': 1,
 'Diego Helguera': 1,
 'Hidalgo': 1,
 'LGBT+': 1,
 'Osorio Chong': 1,
 'Carla Humphrey Jordán': 1,
 'Lorenzo Córdova': 1,
 'Edomex': 1,
 'CEPAL': 1,
 'Delitos': 1,
 'Murat': 1,
 'Avanza México': 1,
 'Miguel Ángel Mancera Espinosa': 1,
 'Reconoce INMUJERES': 1,
 'Excluyen': 1,
 'Alejandro Murat': 1,
 'Gómez Cazarín': 1,
 'Prevenir': 1,
 'Softbol MX': 1,
 'Martha Sánchez Néstor': 1}

Errors in the Spacy Model

One of the interesting errors in the Spacy-powered NER process is the erroneous tagging of ‘Plantean’ as a named entity when, in fact, this string is a verb. Similarly, ‘Delitos’ and ‘Excluyen’ are given ORG or PER tags. Possibly, the morphological shape and orthographic tendencies of headlines throw off the small language model. Thus, even with this small test sample, we can see the limits of out-of-the-box open source solutions for NLP tasks. This shows the value that language analysts and data scientists add in organizations dealing with even more specific or specialized texts.

Handling Large Number of Entries On Matplotlib

One issue is that there will be more Named Entities recognized than is useful or even possible to graph.

Despite the fact that we have a valuable dictionary above, we still need to trim it down to figure out what is truly important. In this case, the next Python snippet is helpful in cutting out all dictionary entries with a frequency count of only 1. There are occasions in which a minimum value must be set.

For instance, suppose you have 1000 documents with 1000 headlines. Your NER analyzer must read through these headlines, which ultimately do not amount to a lot of text, so the minimum count you would want to eliminate is likely 1. If you were analyzing entire document bodies instead, you might raise the minimum threshold for a dictionary value’s frequency.

The following dictionary comprehension filters out every term whose frequency is 1, the most common frequency. This is appropriate for headlines.

    filter_ones = {term:frequency for term, frequency in data.items() if frequency > 1}

While this filtering threshold is suitable for headlines, a higher one is needed for body text: 10,000 or more words implies that the minimum frequency threshold should be higher, say 10.

    filter_ones = {term:frequency for term, frequency in data.items() if frequency > 10}

The function that renders the resulting dictionary as a matplotlib figure is shown below:

import numpy as np
import matplotlib.pyplot as plt

def plot_terms_body(topic, data):
    """
    Plot term frequencies as a horizontal bar chart. The np.arange calls
    determine how to best and programmatically plot the data; intervals
    are derived from the counts within the dictionary.

    Args: topic is the name of the plot/category. 'data' is a dict mapping
    terms to frequency counts.
    """
    # The bar plot should be optimized for the max and min size of
    # individual counts.
    filter_ones = {term: frequency for term, frequency in data.items() if frequency > 10}
    filtered = {term: frequency for term, frequency in data.items()
                if frequency > round(sum(filter_ones.values()) / len(filter_ones))}
    print(round(sum(filtered.values()) / len(filtered)),
          "Average count: total terms, minus once-identified terms, divided by all terms.")
    terms = filtered.keys()
    frequency = filtered.values()
    y_pos = np.arange(len(terms), step=1)
    # min dictionary value, max filtered value
    x_pos = np.arange(min(filtered.values()), max(filtered.values()),
                      step=round(sum(filtered.values()) / len(filtered)))
    plt.barh(y_pos, frequency, align='center', alpha=1)
    plt.yticks(y_pos, terms, fontsize=12)
    plt.xticks(x_pos)
    plt.xlabel('Frecuencia en encabezados')
    plt.title(str(topic), fontsize=14)
    plt.tight_layout()
    plt.show()
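A usage sketch follows; the topic label and body_counts, a term-to-frequency dictionary computed over full document bodies (matching the function’s frequency > 10 threshold), are assumptions.

plot_terms_body('Violencia en noticias MX', body_counts)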
Named Entities or Frequent Terms

We are able to extract the most common GEO- or PER-tagged Named Entities in a ‘Women’-tagged set of documents sourced from Mexican Spanish news text.

Surprise, surprise: the terms ‘Exigen‘, ‘Llaman‘ and ‘Marchan‘ cause problems due to their morphological and textual shape; the term ‘Victoria‘ is orthographically identical and homophonous to a proper name, but in this case it is not a Named Entity. These false positives in Spacy’s NER process reflect how language models should be trained over specific texts for better performance. Perhaps an NER model trained over headlines would fare better. The data was already cleaned during the collection process described earlier, so normalization and tokenization were handled beforehand.

N-Gram Analysis Over Sensitive Topics Corpus

I was recently able to do some analysis over the Sugar Bear AI violence corpus, a collection of documents classified by analysts at the SugarBear AI group. Over the past year, the group has been manually classifying thousands of documents of Mexican Spanish news that deal with the new topics of today – “Coronavirus”, “WFH” – as well as pressing social issues that have only recently come to the fore of mainstream news. Since the start of the pandemic, they’ve made some of their material available online, but the content is mostly for sale, as the Mexico City based group needs to eat. That aside, they produce super-clean and super-specific ‘Sensitive Topics‘ data.

There is a growing consensus that solid data and analysis are needed for NLP activities involving specific topics, like human rights or documents referencing cultural minorities in any region. For instance, we cannot reasonably expect webscraped data from Reddit to fulfill the NLP needs of Mexican Spanish speaking communities residing outside of official borders. As a whole, the worldview of Reddit, as encoded in its texts, will not adequately reflect the needs or key ideas held within that population. Basic NLP tasks, then, like mining for the best autocompleted answers, searches or key terms, cannot be sourced from such a demographically specific (White, 18-45, Male) corpus.

Stopwords

We will work over a couple of hundred documents and try to see what inferences can be drawn from this short sample. We definitely need a new stopword list. This can help filter out ‘noise’ that is specific to the domain we’re researching. The list can be expanded over time – NLTK’s default Spanish list is not enough – and, while there may be some incentive to work with an existing list, the bigrams below show why we need a domain-specific list, one in Spanish and developed by analysts who have looked at this data continually; a brief sketch follows.
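As a sketch, a domain list can start from NLTK’s default Spanish list and grow from there; the added words below are hypothetical examples, not the analysts’ actual list.

from nltk.corpus import stopwords

# Start from NLTK's default Spanish list and extend it with domain noise.
domain_stopwords = set(stopwords.words('spanish'))
domain_stopwords.update({'dijo', 'tras', 'señaló'})  # hypothetical additions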

N-Gram Analysis

As alluded to above, we will start with an N-gram analysis.

To put it simply, N-Grams are units of strings that appear with some frequency over a given text. The strings are called ‘tokens’; the specific guidelines for how strings are defined as tokens depend on the developers of a corpus, who may have specific applications in mind. In the case of the Sugar Bear AI, the most relevant application is the development of a news classifier for sensitive topics in Mexico.

We used NLTK to sample what a basic N-Gram analysis would yield. As is typical of news text, this corpus contained dates and non-alphanumeric content, but no emojis or content associated with social media. The analysis from the NLTK ngram module is sufficient for our purposes. That being said, some modest use of the tokenize module and regular expressions did the trick of ‘normalizing’ the text. Linguists should leverage basic notions of Regular Expressions.

        if not re.match("[^a-zA-Z]", i) and not re.match("[^a-zA-Z]", j) 

The violence corpus is fairly small, but it is not static. The Mexico City based Sugar Bear AI group continues to annotate content at the document level from alternative and mainstream Mexican news sources. In the set referenced here, there are about 400 documents from 8 different news sources in Mexico. The careful approach to its curation and collection ensures that any researcher working with the corpus will have a balanced and not commonly analyzed collection of texts on sensitive topics.

Code Description

The code here presents a manner in which a list (any list) can be processed by NLTK’s ngram module, normalized with some basic tokenization and stopword removal.

import re

import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams

def find_bigrams(documents):
    bigram_list = []
    stop_words = set(stopwords.words('spanish'))
    for doc in documents:
        tokens = nltk.word_tokenize(doc)
        filtered_sentence = [w for w in tokens if w not in stop_words]
        bigrams = ngrams(filtered_sentence, 2)
        for i, j in bigrams:
            # keep only purely alphabetic bigrams
            if not re.match("[^a-zA-Z]", i) and not re.match("[^a-zA-Z]", j):
                bigram_list.append("{0} {1}".format(i, j))
    return bigram_list

alpha = find_bigrams(raw_corpus)  # raw_corpus: the list of documents to analyze
...
frente múltiples
visité cerca
absoluto atrocidades
consejeras rendirán
ser derechos
Vazquez hijo
detenga amenazas
puntos pliego
odisea huyen
anunció separación
Covid-19 debe
víctima frente
Acapulco sentenciar
logró enviar
garantías acceso
documentaron atención
Atendiendo publicación
llamado Estados
interior Estaciones
encuentran mayoría
informativas pandemia
relación familiar
comunidad Agua
día finalizar
Jornada Ecatepec
lenta momento
guerras territoriales
Pese Facebook
pedido retiro
Rendón cuenta
A.C. Mujeres
error quizá
iniciativas enfocadas
consciente anticapitalista
afectar mujer
Justicia u
alertas tempranas
mediante expresión
Nahuatzen comunidad
garantizar repetición
alza indicadores
noche martes
creó Guardia
asegura feminicidio
Unión Fuerza
pronunciamiento Red
carbono equivalente
condiciones desarrollo
comparativo agresiones
recorte refugios
agregó pesar
Examples of n-grams from the violence corpus.

The n-grams above represent a small sampling of typical speech within the ‘violence’ corpus. The usual steps to normalize the content were taken, but, as already mentioned, a domain-specific stop-word list should be developed.

Interesting Patterns In Authorship, Anonymity

Another interesting insight about this small dataset is that the document set, or topic, has a usual set of authors: either a few authors from a small blog, or large institutions publishing without attribution. You can read the Spanish texts here. I share a graph of attribution and of which media groups cover the topic of ‘Violence’ in Mexico.

Most large institutions, like El Financiero or even La Jornada, are capable of finding insight on impactful events, like a major violation of human rights, but they do so anonymously. Larger institutions are better able to publish content without attribution, which translates into potentially less risk for their staff.

Authorship Attribution: Large Institutions Can Post Anonymously

Using Spacy in Python To Extract Named Entities in Spanish

The Spacy small language model has some difficulty with contemporary news text that is neither Eurocentric nor US-based. Likely, this lack of accuracy with contemporary figures owes in part to a less thorough scrape of Wikipedia and to relative changes that have taken place since 2018 in Mexico, Bolivia and other countries with highly variant dialects of Spanish in LATAM. Regardless, that dataset can and does garner some results for the purpose of this exercise. This means that we can toy around a bit with some publicly available data.

Entity Hash For Spanish Text

In this informal exercise, we will try to hack our way through some Spanish text. Specifically, we make use of NER capacities sourced from public data – no rule-based analysis – with some functions I find useful for visualizing Named Entities in Spanish text. We have prepared a Spanish news text on the topic of ‘violence’, or violent crime, sourced from publicly available Spanish news content in Mexico.

Using spacy, you can hash the entities extracted from a corpus. We will use the lighter Spanish language model from Spacy’s natural language toolkit. This language model is a statistical description of Wikipedia’s Spanish corpus, which is likely slanted towards White Hispanic speech, so beware its bias.

First, import the libraries:

import spacy
import spacy.attrs
nlp = spacy.load('es_core_news_sm')

With the libraries in place, we can import the module ‘org_per’. This module references this GitHub repo.

The work of identifying distinct entities is done in a function that filters for geographical entities and people. These are labeled ‘GEO’ and ‘PER’, respectively, in spacy’s data.

The variable ‘raw_corpus‘ is the argument you provide, which should be some Spanish text data. If you don’t have any, visit the repository and load that file object.

import org_per
raw_corpus = open('corpus_es_noticias_mx.txt','r', encoding='utf-8').read().split("\n")[1:]
entities = org_per.sacalasentidades(raw_corpus)
 
# use list of entities that are ORG or PER and count up
# each individual token.

tokensdictionary = org_per.map_entities(entities) 

As noted before, the model’s training text has its origins in Wikipedia. This means that newer, more contemporary types of text may not be sufficiently well covered – breadth doesn’t imply depth of analysis, because stochastic models rely on a passing resemblance to data seen in training, which newer text may lack.

Anecdotally, over a small corpus, we see performance below 80 percent accuracy for this language model. Presumably, a larger sampling of Wikipedia ES data would perform better, but certain trends in contemporary news text make it necessary to temper this expectation.

The output returned from running `org_per.map_entities(entities)` will look like this:

{"Bill Clinton": 123,
"Kenneth Starr" : 12,
}

The actual hashing is a simple enough method: place each NER text with its frequency count as a value in a dictionary. Within your dictionary, you may get parses of Named Entities that are incorrect. That is to say, they are not properly delimited, because the Named Entity language model does not have an example of your parse. For instance, Lopez Obrador – the current president of Mexico – is not easily recognized as ‘PER’.
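Since the real org_per module lives in the linked repo, here is a hedged sketch of what that extract-and-count pattern amounts to; the function bodies are assumptions that mirror the module’s names, nlp is the es_core_news_sm pipeline loaded above, and you can adjust the label set (PER/ORG here) to taste.

def sacalasentidades(lines):
    """Collect PER- and ORG-labeled entity strings from each line."""
    entities = []
    for doc in nlp.pipe(lines):
        for ent in doc.ents:
            if ent.label_ in ('PER', 'ORG'):
                entities.append(ent.text)
    return entities

def map_entities(entities):
    """Hash each entity string to its frequency count."""
    counts = {}
    for entity in entities:
        counts[entity] = counts.get(entity, 0) + 1
    return counts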

Accuracy

Accuracy is measured very simply by tabulating how much you agree with the returned Named Entities. The difference between expected and returned values is your error rate. More on accuracy metrics in the next post.
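A minimal sketch of that tabulation; the two lists are toy examples, and the judgment of which returned entities count as correct is yours.

def ner_accuracy(returned, correct):
    """Share of returned entities you accepted after review."""
    return len(correct) / len(returned)

print(ner_accuracy(['AMLO', 'Plantean'], ['AMLO']))  # 0.5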

Linguistics In The Enterprise

Why Linguistics (And Linguists) Are Always On The Back-Foot In An Enterprise Context

Linguistics is often questioned by practitioners of the Natural Sciences in informal and professional scenarios. Perhaps this is because the phenomena relevant to the Natural Sciences are more readily observable through instrumental means. The irony is that everyone can indeed perceive language, but, unfortunately, this does not make them experts in the matter.

For instance, a casual observer of a language (an individual who has casually become acquainted with some morsel of data about it) may define Linguistics as just adding an ‘-s’ suffix to the end of a word to make it plural – that’s not the discipline, at all.

Better Direction For Linguists In Industry

The scientific study of language tries to uncover sound generalizations about natural language. Over time, different technical devices (formalisms) have been developed to describe language. Some methodologies, like Lexical Functional Grammar, have been useful and easy to transfer into software technologies.

However, even with the aforementioned successes, the priorities for linguists in an enterprise are often misaligned. Annotation for sentiment, for instance, may be better handled by psychology experts. While archiving data is useful, that task is best left to librarians with a strong intuition about how language behaves in the real world vis-à-vis an Information Retrieval system. Linguists need to spell out their formidable intuitions in code to better exploit the above recommendation. Linguists need to manipulate and train the language models, not create the annotations for anything non-linguistic.

Trivia In The Office – Bad Perception

Figuring out when some rule from Oxford applies in a romance novel is fine trivia, but it is not Linguistics. To begin with, the scientific practice around explaining language behavior is broad and interdisciplinary. We should not permit the discipline to be reduced to literal descriptions of inscriptions.

When Language Descriptions Meet Computation

While data and rules about languages are important, memorizing data feels like a somewhat pointless exercise. It is a sign that, in some corners, the field is defined by linguistic trivia about English – or some other pet language – rather than by reproducible general principles that can be easily computed. This last point is what really drives progress in Computational Linguistics, and it can be mathematical or statistical in nature. For instance, spellcheck modules depend on anticipating the most likely candidate for a given sequence of neighboring N-Grams.

Linguistics Takes Time

Currently, my fear is that as Linguistics gains strength in the enterprise context, the finer points of rationale will be overtaken by boring data recitals. If so, we are in for a world of trivia rather than developments, largely as a result of the influence of non-linguist priorities on the discipline. The drive to subvert computational tools for linguistic ends does not yet exist.

Narrative is very important. Understanding why a linguistic analysis exists helps ground activities. The ability to contribute to NLP activities by adopting a proper narrative for linguistic work in the enterprise setting has not surfaced beyond ‘we need better data for this hungry machine learning algorithm’. That pays the bills, but it does not advance the field.

Wearables, Speech Recognition & Musk: How Intel’s Loss Could Be Tesla Gain

Despite its famously late arrival to mobile computing, Intel did make certain strides before many others in the space of wearables from mid-2013 onwards. Much of this may have to do with the company’s strategic diversification, which took place that year.

Hundreds of Millions Poured Into Research & Development

Intel invested at the very least 100 million dollars in capital expenditures and personnel for its now defunct ‘New Devices Group’, an experimental branch of Intel charged with creating speech- and AI-enabled devices.

While many high-profile people were hired, developments took place and acquisitions were made, investors were either unaware of or not too pleased with the slow roll to market for any of these expenditures.

These capital-intensive moves into different technology spaces were possibly a proactive measure against missing the ‘next big thing’, as the company had by not providing the chipset for the Apple iPhone. At the time, Brian Krzanich was newly appointed as Intel’s CEO to let the company transition from these failures, rightly or wrongly attributed to the prior CEO, Paul S. Otellini.

Why Did Intel Invest In Wearables?

Once Krzanich became CEO of Intel in May 2013, he quickly moved to diversify Intel’s capabilities into non-chip activities. Nonetheless, these efforts were still an attempt to amplify the relevance of the company’s chipsets through participation in the various places where computing would become more ubiquitous: home automation, wearables and mobile devices with specialized, speech-enabled features. The logic was that the computing demands would naturally lead to an increased appetite for powerful chipsets.

This uncharacteristic foray into the realm of ‘cognitive computing’ led to several research groups, academics and smaller start-ups being organized under the banner of the ‘New Devices Group’ (NDG). Personally, I was employed in this organization and find that the expertise and technology from NDG may regain relevance in today’s business climate.

Elon Musk’s Tweet: Indicative Of New Trends?

Elon Musk’s tweet on wearables.

For instance, Elon Musk recently tweeted a request for engineers experienced in wearable technologies to apply for his Neuralink company. On the surface, this may mean only researchers who have worked on Brain Machine Interfaces, but as Neuralink and competitors bore down on some of the core concepts surrounding wearables, subject matter experts in other fields may be required as well.

Human/AI Symbiosis

When we consider what Musk is discussing, it would be fair to ask what constitutes ‘Human’.

Without too pedantic an overview, I would assume that linguistics has something to do with describing humanity – specifically, the uniqueness of the human mind.

As corporate curiosity becomes better able to package more varied and sophisticated chunks of the human experience, the experiences yielded primarily through text and speech are best described by Computational Linguistics and are already fairly well understood from a consumer product perspective. It’s fair to say that finding the points of contact between neurons (literal ones, not the metaphors from Machine Learning) firing under some mental state and some UI is the appreciable high-level goal for any venture into ‘Human-AI’ symbiosis.

Thorough descriptions of illocutionary meaning, temporal chains of events, negation and various linguistic cues, both in text and speech, could have consistent neural representations that are captured routinely in brain imaging studies. Unclear, however, is how these semantic properties of language would surface in electrodes meant for consumer applications.

Radical Thinkers Needed

The need to either link existing technology or expand available products so that they exploit these very intrusive wearables (a separate moral point to consider) likely calls for many people to be employed in this exploratory phase. Since it’s exploratory, the best individuals may not be the usual checklist-based academics or industry researchers found in these corners. If the Pfizer-BioNTech development is any indication, sometimes the researchers who are not standard are the most innovative.

Use Open Source Speech Recognition

To start, where would you ideally run a quick and dirty Speech Recognition project? Likely, the best place on Windows 10 (and this could apply to a Mac as well) is an Anaconda environment. Assuming this is your case, I will proceed, since some complications are avoided by how Anaconda interacts with certain C++ dependencies.

Pocketsphinx

We will use CMU Sphinx, though I am aware that its development team has pivoted to VOSK. For now, let’s work through the older problems associated with CMU Sphinx, since many fundamental points can be addressed through that platform.

If on Windows 10, you first must install the Visual C++ Build Tools, which will take some time. Somewhat independently of this, you can also run the following command to get started:

 pip install SpeechRecognition

When you try installing SpeechRecognition, a set of Python bindings for CMU Sphinx, via pip, you may receive an error because the C++ build tools are not visible to Anaconda’s Python environment. If so, wait until the build tools are in place and then build SpeechRecognition (here, installing means building, because pip compiles the relevant dependencies).

Next, you may also need PyAudio, the Python bindings for the PortAudio library. This will definitely need Visual Studio installed.

pip install PyAudio

Finally, make sure to install Swig, because it will let you build the relevant pocketsphinx dependencies (and pocketsphinx itself) as if you were in a kind of Linux environment. conda install helps fix a lot of discrepancies associated with Windows.

conda install swig

pip install pocketsphinx
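With everything built, a minimal offline recognition pass looks roughly like this; the WAV filename is a placeholder, and PocketSphinx ships with US English only, so other language packs must be installed separately.

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile('sample.wav') as source:  # placeholder file
    audio = recognizer.record(source)

# recognize_sphinx runs fully offline through the pocketsphinx bindings
print(recognizer.recognize_sphinx(audio))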

Noam Chomsky Interview

SEPTEMBER 27, 2012

An Interview With Noam Chomsky

by RICARDO LEZAMA

Noam Chomsky’s latest books are Occupy (Zuccotti Park Press) and Making the Future: Occupations, Interventions, Empire and Resistance (City Lights Publishers).

RICARDO LEZAMA: Have you heard about the Stand With Us group/campaign?

NOAM CHOMSKY: No. Tell me about it.

LEZAMA: They are a group that spread favorable propaganda regarding the IDF on different campuses.

CHOMSKY: Never heard of them.

LEZAMA: Just trying to see how prominent their campaign was – must be a West Coast/Midwest thing. Moving on, what kind of repression do Palestinian Americans face in the U.S.?

CHOMSKY: In America, for one thing, all Muslims are subjected to a kind of Islamophobia. That is endemic to the United States, and it ranges from being detained at the airport to being followed by the FBI, problems at colleges, and elsewhere. Palestinians, of course, are a part of that, and there has been more in the past than today for Palestinian scholars in universities. For example, there have been efforts to defame them as anti-Israeli terrorists. However, it is the kind of repression that is familiar to ethnic groups out of favor with the U.S. government. I have plenty of Palestinian friends who make out fine.

LEZAMA: It is not off the charts?

CHOMSKY: It is not off the charts, it shouldn’t be there, but yeah, if you’re a Mexican American in Arizona and you get pulled over, the police can claim you’re doing anything, basically.

LEZAMA: Ok. Well, in March 2012, the Israeli Air Force bombed the Gaza strip. I thought this was a particularly harsh period for Palestinians. I was hoping you could give us a brief overview of what happened?

CHOMSKY: Well, just to go back a bit to June 2008, when a ceasefire was reached between Israel and Hamas, the dominant force in the Gaza strip. Right after the ceasefire there were no missiles at all fired by Hamas at Israel. The missiles don’t amount to much. They are kind of home-made missiles.

LEZAMA: They never even make it to Tel Aviv.

CHOMSKY: The missile launches from Hamas stopped altogether during that period, even though Israel didn’t observe the ceasefire. Part of the ceasefire was that Israel was supposed to stop the siege. Still, no Hamas missiles. You can read that on the official Israeli government website. In November 2008, the day of the presidential election, Israeli military forces invaded Gaza and killed half a dozen Hamas militants. Well, that was followed by a missile exchange for a couple of weeks in both directions. Like always, all the casualties were Palestinian but there were some Hamas missiles, followed by a much heavier, far bloodier response from Israel. This leads us to mid-December 2008. At that point, Hamas offered to renew the ceasefire. Israel considered the offer, rejected it and decided instead to invade and attack Gaza. That is Operation Cast Lead, which started on December 27, 2008. It was brutal and murderous.

There is a very good account of Operation Cast Lead by independent participants. For example, there were a couple of Norwegian doctors working at the Gaza hospital through the attack. I mean, they just called it infanticide. The IDF killed a lot of children, were attacking ambulances, committing all kinds of atrocities. [These doctors], they wrote a very graphic and dramatic account of what the invasion was like. The Israeli military must have killed 1500 people. There was a UN Security Council effort to call a ceasefire early in January, but the U.S. blocked it – it wouldn’t allow it. It was very carefully planned. It ended right before Obama’s inauguration. The point of that was to protect Obama from having to say anything critical about it. He was asked about it before he was elected and said ‘I can’t comment on that, I am not president’. It started a few days before the election, and ended before the inauguration. When he was asked about it after the election, Obama took the position that we shouldn’t look backwards but should move forwards. There was no punishment for those involved, and it was a really criminal assault on a completely defenseless population. It was one of the most brutal attacks in recent years – that’s Operation Cast Lead. There is no pretext for it. They claim it was to protect the population from Hamas missiles, but an easy way to do that would have been just to renew the ceasefire.

LEZAMA: That’s an interesting point regarding the timing of the attacks. Right now, we have to pick between one really bad candidate and Romney. It seems like the Israeli government is taking advantage of the Obama administration’s bid for re-election. Israel is talking a lot about attacking Iran, and trying to mobilize support for it in the U.S. These kinds of things tend to have consequences for Palestine; what will happen in Palestine? I think Israel is bluffing, and looking for something else.

CHOMSKY: Well, Israel is a pretty crazy state. My suspicion is that they are trying to create the circumstances under which the U.S. will attack Iran – they don’t want to do it themselves.

LEZAMA: They want to set up a rationale?

CHOMSKY: I would not be surprised if they staged some kind of an incident in the Persian Gulf, which would not be hard. You and I could do it. The Persian Gulf is lined with U.S. naval missiles, aircraft carriers, destroyers, and so on. Any small incident, a skiff or a boat bumping into an aircraft carrier, could lead to a vicious response.

Actually, we should bear in mind that the United States is already at war with Iran by Pentagon standards. The assassinations (which are terrorism), the cyberwar, and the economic warfare are all considered by the United States as acts of war if they are done to us, but not if we do it to them. So, by our standards, we are already attacking Iran. The question is how much further we will take it. An important aspect of this is never discussed in the United States. You never read about it. I write about it, maybe two or three other people, but you never read about it. There is a pretty straightforward solution to this, a diplomatic solution. Namely, move towards establishing a nuclear-weapons-free zone in the region. That is strongly supported by virtually the entire world. The U.S. has been blocking the solution for years. However, support for it is so strong that Obama was forced to agree to it in principle, but stated that Israel has to be excluded. Well, that is a joke. Israel has hundreds of nuclear weapons, carries out aggression, is a violent state, refuses to allow inspections, and so on. To say that Israel has to be exempted, then, kills the prospect of a nuclear-weapons-free zone in the Middle East. This situation is coming to a head in December. There is to be an international conference on a nuclear-weapons-free zone in the Middle East; Israel just announced that it is refusing to participate.

LEZAMA: Will the U.S. participate?

CHOMSKY: Everything always depends on what the U.S. is going to do. So far, there is nothing official. Up until now, Obama has said ‘yes, we are in favor of it, but Israel has to be excluded’. That exception essentially kills the possibility of a nuclear-weapons-free zone. If anybody believes Iran is a threat, which I think is pretty much fabricated, but if you believe it, this is the way to deal with it: impose a nuclear-weapons-free zone.

Of course, that would mean Israel has to join the Non-Proliferation Treaty. The U.S. would have to stop protecting the Israeli development of nuclear weapons. That is what is required to end whatever you think the threat of Iran is. There is a straightforward diplomatic approach. As usual, the media are suppressing this information. I don’t think they even reported the fact that Israel announced its withdrawal. It was announced in the Israeli press. They all know about it.

LEZAMA: Assuming that the U.S. does not go into all-out war, ground troops, airstrikes, and so on, assuming that doesn’t happen, which is what the Israelis want.

CHOMSKY: I don’t think they expect ground troops. They expect, or want…

LEZAMA: Airstrikes?

CHOMSKY: A major missile and aerial assault. Israel could do it too. Israel has submarines, which they received from…

LEZAMA: …Germany.

CHOMSKY: …which can carry nuclear-tipped missiles. I’m pretty sure they are deployed in the Gulf. So, if they want, they can carry out a missile attack.

LEZAMA: Why don’t they do it themselves?

CHOMSKY: They are afraid it would be too costly. For one thing, the world would be furious. Everybody is already furious at Israel. Even in Europe, it is regarded as the most dangerous state in the world, and it is becoming a pariah state. Of course, in the third world, in the Arab and Muslim world, it is very much feared and hated. An attack on Iran, though maybe they don’t care, could turn them into another South Africa. They would rather have the United States do it.

LEZAMA: Whether a larger-scale attack on Iran happens or not, there will still be consequences for the Palestinians.

CHOMSKY: The Palestinians are in a dire state now. There is a political settlement, which is agreed upon by the entire world, the UN Security Council, the International Court of Justice (the World Court), by everyone: namely, a two-state solution. An easy, straightforward solution.

LEZAMA: Just abide by the two-state solution, and the conflict is eliminated? What about the idea that Gaza and the West Bank be contiguous?

CHOMSKY: That’s required!

LEZAMA: Right.

CHOMSKY: That is part of the Oslo agreement. The Oslo agreement stipulates explicitly that the West Bank and the Gaza Strip are a single territory. Ever since they signed the Oslo agreement, the United States and Israel have been dedicated to undermining it. The U.S. can violate law freely, but it is never reported. Everybody else is too weak to do anything about it. The U.S. is just a rogue state.

LEZAMA: What should people in the U.S. be doing in response?

CHOMSKY: They should be breaking through the media and general doctrinal barriers to come to know what is going on. They should be helping people learn about this. I don’t have any secret sources of information. Everything I have said is public knowledge, but it is not known by anyone. The problem is self-censorship; the media rarely report anything about it. There is just a tremendous amount of propaganda and indoctrination, so people don’t know what is going on. This is not the only case, but it is an important one. Everything I have just mentioned is straight on the public record. What activists ought to be doing is bring this to the public’s attention.

LEZAMA: I think that has been done on college campuses in California, and elsewhere. It is a good way to circumvent the media, but then the move administrators make is to begin charging for use of these spaces. They essentially price out minority organizations. (For example, UC Davis now charges for usage of buildings.)

CHOMSKY: I know, and I’ve been following it. It is true, and I’ve spoken at universities in California. There is plenty of activism. Actually, it has changed a lot in the past four or five years. Just to illustrate: at UCLA back in 1985, I was invited to give philosophy lectures. I said ‘sure’, but the next day I got a call from campus police asking if they could have uniformed police accompany me everywhere I went. I said ‘no’. The next day I saw police following me everywhere I went. They are not hard to detect in a philosophy seminar… I could not walk across from the faculty club to other parts of campus. The reason is that they had just picked up a lot of death threats. They don’t want someone killed on campus. I gave the talk at Royce Hall, the big campus hall, but it was like airport security: one entry, and everybody’s bag had to be checked. The next day there was a huge attack in the Daily Bruin. First of all, it was a huge attack on me, but also on the professor who invited me. In fact, there was an effort to take away the tenure of the professor who invited me. It was beaten back, but they tried. Well, that was back in 1985. I was back at UCLA maybe a year ago. There was a huge crowd, very supportive; it was hard to hear a critical word about what I was saying. That is a huge change. It changed because of student activism. It’s the kind of thing you asked about, you know, ‘what should people do?’.

LEZAMA: Would you say that the state of the country is reflected on campuses? So, if you get negative responses at a campus, you’ll get the same sort of thing happening in libraries?

CHOMSKY: It’s the same thing. Yeah, I can give the talk in public meetings, libraries, etc. The general atmosphere has just changed enormously. Even at my own university, MIT, if I was giving a talk on Israel-Palestine, up until maybe 10 years ago, I had to have police protection. Now, that is unheard of. There is just a big change. The same is true in the town where I live, Lexington, MA.

LEZAMA: That is odd because you would expect the exact opposite response from the public. Just consider the enormous amount of September 11 related propaganda.

CHOMSKY: Yeah, the propaganda is not as effective as it used to be. That is exactly why this IDF group (Stand With Us) has to go around campuses, trying to counter the support for Palestine. It is trying to reverse the change in general attitudes.

LEZAMA: Seems like this IDF group was strong enough to get a favorable response from the Regents. Did you hear of Yudof’s statements regarding anti-Semitism? They were completely false, but he felt he could say them.

CHOMSKY: That’s the board of trustees, or whoever runs the place. But the actual mood on campus, I’m sure, is quite different.

LEZAMA: What do we make of those people? Even with this climate, and all the positive things going for students, there are tuition hikes, hostile police, etc. There are so many things happening on these campuses.

CHOMSKY: Yeah, but that’s true of anything. The same is true for the civil rights movement, the anti-war movement, and so on. You are not going to get support from the authorities!

LEZAMA: Would that then imply that the legislation in California requiring colleges to clamp down on anti-Semitic speech on campus is nullified by student activism?

CHOMSKY: Yeah, activism can change things.

LEZAMA: Ok, ok.

CHOMSKY: It has done it in plenty of cases. That is how activism works. Take the feminist movement: in the mid-1960s, feminists were being ridiculed; people called them ‘fem-nazis’, all sorts of things, but eventually they broke through in many respects.

LEZAMA: What do you think of the Caravan for Peace?

CHOMSKY: I think it’s important. I met Sicilia a couple of months ago; he’s an impressive guy. Everything depends on how many people the message reaches. You can’t count on the media, but others can. In fact, all through Latin America, there is a major effort to decriminalize marijuana, maybe more than that, but at least marijuana. In Uruguay, they are instituting state production of marijuana. In most of the hemisphere, there is a strong effort to decriminalize it. In fact, at the Cartagena meetings, the hemispheric meetings held a couple of months ago, the United States and Canada were totally isolated on that issue. Everyone wanted to move in that direction. The U.S. and Canada refused. In fact, my guess is that if there are ever hemispheric meetings again, the U.S. will not attend. The U.S. has lost Latin America on a lot of issues. The reason is pretty obvious: they are the victims! The U.S. is responsible for both the demand and the supply: the demand for drugs, and the supply of arms, since the arms are coming in from the U.S. What is tearing Mexico to shreds are the arms coming in from Texas and Arizona. They are getting it at both ends: the United States is creating the demand and providing the supply of arms. They are the ones getting massacred and smashed up. It’s the same all through the hemisphere: Colombia, Guatemala, Honduras and, of course, Mexico, where it is a disaster. Naturally, they want to get out of it, and the U.S. won’t do it. The Caravan could be a way of educating Americans about it.

RICARDO LEZAMA is a recent graduate of the University of California, Davis.