Kashaya Language Documentation

In the summer of 2009, I participated in a 6-week language documentation course as part of the Linguistic Society of America’s biennial Summer School, where the most engaged graduates (and undergraduates) meet, socialise and work on advanced topics in linguistic theory.

The linked work here refers to Berkeley’s Language Archive, where my field recordings, field transcripts and analysis of Kashaya are held. These notes were derived from working with Anita Silva, a native speaker of Kashaya, a language native to what is now, because of colonisation, called Santa Rosa, California.

She recently passed away. However, her work documenting her native language will have an impact for generations to come as efforts to revitalise languages through new native speakers continue.

ImageAI: Python Library For Recognizing Images

Ricardo Lezama: ImageAI is an excellent, easy-to-use machine learning wrapper that allows a Python script to identify the dominant concept describing an image. While this article covers a tiny use case, I would recommend that users be aware of the need to install the right C++ dependencies.

Facebook Image AI

The developers are a group from a Facebook-backed outfit based in Nigeria. One of the principal developers is Moses Olafenwa, a founder of DeepQuest AI. Aside from this excellent python library, Olafenwa’s group develops AI servers for business applications.

Code Summary: ImageAI Predictions

In this summary, we will review the code examples here: https://github.com/OlafenwaMoses/ImageAI/tree/master/imageai/Prediction

Model Dependencies: ResNet

Aside from the libraries called through import statements, the more important dependencies for our test script using ImageAI’s Python module are the different models that one can run a particular image against. In this particular example, we reference the ResNet model trained on ImageNet-1000 images. There is an annual competition in which various neural net models are compared against one another using the ImageNet dataset as a frame of reference.

ResNet is a model that uses ‘residual learning’ to make much deeper networks practical to train.

According to the authors, ResNet “explicitly reformulate[s] the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.” 
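The idea in the quote can be sketched in a few lines: instead of learning an output directly, the layers learn a residual F(x), and the block returns F(x) + x. The following is a minimal pure-Python illustration under simplifying assumptions: small dense weight matrices stand in for ResNet's convolutional layers, and batch normalisation and the ReLU applied after the addition are omitted. The function names are illustrative and do not come from ImageAI or the ResNet code.

```python
def relu(vector):
    # Element-wise max(0, v).
    return [max(0.0, v) for v in vector]


def linear(weights, vector):
    # weights is a list of rows; returns the matrix-vector product.
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(x, w1, w2):
    # F(x) = W2 * relu(W1 * x): the residual function the layers learn.
    fx = linear(w2, relu(linear(w1, x)))
    # y = F(x) + x: the shortcut connection adds the input back.
    return [f + xi for f, xi in zip(fx, x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

# With all-zero weights F(x) is 0, so the block passes x through unchanged.
print(residual_block(x, zeros, zeros))
# With identity weights F(x) = relu(x), so the output is relu(x) + x.
print(residual_block(x, identity, identity))
```

The zero-weight case shows why depth is cheaper here: a block that has nothing useful to add can fall back to the identity simply by driving its weights towards zero.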

Easy Ways To Interface

Instead of editing the hardcoded line referencing an image, we modified the sample script (posted originally here) to accept a simple command-line argument. The only addition is a reference to the built-in sys library, which passes the argument on to the prediction call.

Name the file “prediction.py”, then run the script (copy/paste) from wherever your image file is stored locally. The model file is resnet50_weights_tf_dim_ordering_tf_kernels.h5, a Microsoft-sponsored model developed by Kaiming He et al.

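from imageai.Prediction import ImagePrediction
import os
import sys

# The model file is expected in the directory the script is run from.
execution_path = os.getcwd()

prediction = ImagePrediction()
prediction.setModelTypeAsResNet()
prediction.setModelPath(os.path.join(execution_path, "resnet50_weights_tf_dim_ordering_tf_kernels.h5"))
prediction.loadModel()

# The image path is the first command-line argument.
predictions, percentage_probabilities = prediction.predictImage(sys.argv[1], result_count=5)
for label, probability in zip(predictions, percentage_probabilities):
    print(label, " : ", probability)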
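Reading sys.argv[1] directly raises an IndexError when the script is run without an image path. A friendlier interface could be sketched with the standard argparse module, as below. The build_parser name and the --count flag are illustrative assumptions and are not part of ImageAI; the parsed values would simply be handed to predictImage in place of sys.argv[1] and the fixed result_count.

```python
import argparse


def build_parser():
    # Hypothetical command-line interface for the prediction script.
    parser = argparse.ArgumentParser(description="Classify one image with ImageAI.")
    parser.add_argument("image", help="path to the image file to classify")
    parser.add_argument("--count", type=int, default=5,
                        help="number of predictions to print")
    return parser


# An explicit argument list is parsed here so the sketch runs anywhere.
# In the real script, call build_parser().parse_args() with no argument
# so that the values are read from the command line.
args = build_parser().parse_args(["1.jpg", "--count", "3"])
print(args.image, args.count)
```

With this in place, running the script without a path prints a usage message instead of a traceback.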

Intro To Linguistic Theory

Languages of Mexico.

Linguistics is the science of language as it relates to human cognition.

Metaphysical considerations on the properties of organic systems may seem far removed from the lower-level details of language data. The general idea, however, is that the language faculty is ‘perfect’: it has nearly exact, recurrent properties that, while not wholly describable in formal logic notation, are better described by such systems than by statistical methods that try to approximate language competence through prediction.

Language As A Discrete Object

Human language is quirky relative to other organic systems because it combines discrete properties with infinite, principled variety. Language differs from other capacities, like the ability to recognise emotion in facial expressions (where a computer can outperform a human): its expressive power is provably infinite, yet human competence (the store of representative information concerning a given language) easily outperforms any computer model.

The core elements of human language are discrete too; there is no ‘half a sentence’ since at some level the stored model assigns an interpretation based on one whole interpretation of that linguistic element.

Objects in language are discrete and deterministic when their true meanings are clear to a speaker, but interesting ambiguities signal varied interpretations. Speakers’ experiences are noisy and chaotic, but for the most part a restricted set of properties in the human mind defines what a possible steady state for a human language is.

Trends Within The Field: Stochastic vs Discrete Linguistics

Corpus Linguistics and Theoretical Linguistics have often been thought to be in opposition when investigating linguistic phenomena. They are often deemed distinct ways to view the same object: competence of a linguistic system. However, they are better seen as valid, complementary approaches that model distinct phenomena of performance and competence.

In theoretical linguistics, we’re concerned with the discrete study of linguistic competence. Several biological and psychological arguments motivate the related questions of “what are the general conditions that the human language faculty should be expected to satisfy in order to execute a language?” and “how do these conditions define the language faculty?” These two questions, sourced from Howard Lasnik’s foreword in the Minimalist Program, are the domain of theoretical linguistics.

Corpus linguistics is the broad characterisation of language in text: the sourcing, curation and stochastic analysis of domain-specific text. It is an analysis of language in the context of performance.

Computational Rules & The Lexicon

There is a firm partition between the functional and substantive parts of a language.

The substantive words are what are commonly called ‘nouns’, ‘verbs’ or ‘adjectives’. These elements describe the world, but the relations between them, a nuance that is less salient but necessary for interpretation, are relegated to the functional elements of language.

In computational linguistics, function words are often referred to as ‘stop words’, though the term can be given an application-specific definition covering high-frequency vocabulary that recurs within a corpus but does not signal a topic in the text. Put informally, a function word tells you how the substantive words relate.
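Both senses of ‘stop word’ can be illustrated with a short sketch: a fixed list of function words, and an application-specific list derived from the most frequent tokens in a corpus. The word list, sample sentence and function names below are made up for illustration, and whitespace splitting stands in for a real tokeniser.

```python
from collections import Counter

# Hypothetical, deliberately tiny list of English function words.
STOP_WORDS = {"the", "a", "of", "to", "in", "and", "is"}


def content_words(text, stop_words=STOP_WORDS):
    # Keep only the tokens that are not on the stop list.
    tokens = text.lower().split()
    return [token for token in tokens if token not in stop_words]


def frequent_words(text, top_n):
    # Application-specific alternative: treat the top_n most
    # frequent tokens in the corpus itself as stop words.
    counts = Counter(text.lower().split())
    return {word for word, _ in counts.most_common(top_n)}


sample = "the dog chased the cat to the edge of the garden"
print(content_words(sample))
print(frequent_words(sample, 1))
```

What remains after filtering is mostly substantive vocabulary, which is why this step is common before topic-oriented processing; the relational information carried by the function words is discarded in the process.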


A lexicon is a repository of word information.

The repository contains all the unprincipled details of a word that define it uniquely relative to all other words.

In a lexicon, these unique details are idiosyncratic: there is no in-depth explanation of why and how they emerge. Rather, the details are assumed a priori under any theory of language.
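As a rough illustration of this division of labour, a lexicon can be modelled as a lookup table that stores only what cannot be predicted, while a general rule handles everything else. The sketch below is a toy under stated assumptions: the LexicalEntry fields and the past_tense helper are invented for the example, and the ‘add -ed’ rule ignores spelling changes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexicalEntry:
    form: str                              # citation form of the word
    category: str                          # e.g. "verb", "noun"
    irregular_past: Optional[str] = None   # stored only when unpredictable


# A two-word toy lexicon: one regular verb, one irregular verb.
LEXICON = {
    "walk": LexicalEntry("walk", "verb"),
    "go": LexicalEntry("go", "verb", irregular_past="went"),
}


def past_tense(entry):
    # An idiosyncratic stored form wins; otherwise apply the general rule.
    if entry.irregular_past is not None:
        return entry.irregular_past
    return entry.form + "ed"


for word in ("walk", "go"):
    print(word, "->", past_tense(LEXICON[word]))
```

The entry for ‘go’ has to list ‘went’ because nothing in the rule system predicts it, whereas ‘walked’ never needs to be stored.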

Computational System

Any language can be represented as a set of principles instantiated with specific parameters. There are multiple modular components in such a system handling different cognitive tasks, like Semantic Interpretation or Grammatical Inference.

The semantic or interpretive module for a language is called its Logical Form (May 1977), while its Phonetic Form parallels this module in the spoken sense. Additionally, interfacing with both modules are grammatical principles, or Deep Structure, that essentially prove a language string obeys the principles of that language.