Information Retrieval Based On Linguistic Structure

Often, a specific linguistic structure can more clearly signal an article's relevance to a query. An IR query may specify metadata rules whose satisfaction would ordinarily imply high relevance, but the position of a term within a sentence can warrant a caveat to that overall judgement of relevance.

Currently, simply containing a word within a certain depth of an article is sufficient to yield a True Positive. Roughly, the rule can be characterised as follows:

If term A (the topic in mind) is contained within some metadata, then return all articles containing term A in that metadata.
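As a rough sketch, that containment rule might look like the following (the field names and article structure here are hypothetical, chosen only to illustrate the rule, not taken from the production system):

```python
def naive_match(articles, term, field):
    """Naive containment rule: return every article whose given
    metadata field mentions the term at all."""
    return [a for a in articles if term.lower() in a.get(field, "").lower()]


articles = [
    {"title": "Apple shares plummet on news of CEO death."},
    {"title": "SugarBearAI, a California outfit, now has Apple as minority shareholder."},
]

# Both articles match, even though only the first is really *about* Apple.
print(naive_match(articles, "Apple", "title"))
```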

However, it is prudent to consider that not all mentions of an entity will be equally important. Consider the difference between (1) and (2):

(1) Apple shares plummet on news of CEO death.

(2) SugarBearAI, a California outfit, now has Apple as minority shareholder.

While both mention Apple, there is a strong sense in which (1) is purely about the company, whereas (2) is certainly not about California, only partly about Apple, and mostly about SugarBearAI.

This suggests we could ground IR relevance in linguistic structure.

In sentence (1), the subject is unmistakably Apple; accordingly, (1) is the sentence most about Apple, not (2).

However, what is it about the noun ‘SugarBearAI’ occupying subject position that makes (2) so much more about ‘SugarBearAI’ than ‘Apple’? We assume it is because SugarBearAI is the subject of the matrix verb ‘has’.

In Dependency Grammar parlance, ‘SugarBearAI’ has a dependency with the ROOT head of the sentence:

SugarBearAI, a California outfit, now has Apple as minority shareholder.

Therefore, we can surmise that the noun serving as NSUBJ of the ROOT-head node is the actual topic in question and deserves the highest relevance.
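A minimal sketch of this rule, operating over a toy dependency parse (the token/dep/head triples below are hand-built in the style of a parser's output, not produced by a real parser, and the `Token` structure is purely illustrative):

```python
from collections import namedtuple

# head = index of the governing token; by convention the ROOT heads itself.
Token = namedtuple("Token", ["text", "dep", "head"])


def topic_of(parse):
    """Return the nominal subject (nsubj) of the ROOT verb, i.e. the
    proposed highest-relevance topic of the sentence."""
    root = next(i for i, t in enumerate(parse) if t.dep == "ROOT")
    for t in parse:
        if t.dep == "nsubj" and t.head == root:
            return t.text
    return None


# Hand-built (simplified) parse of:
# "SugarBearAI, a California outfit, now has Apple as minority shareholder."
parse = [
    Token("SugarBearAI", "nsubj", 3),
    Token("outfit", "appos", 0),
    Token("now", "advmod", 3),
    Token("has", "ROOT", 3),
    Token("Apple", "dobj", 3),
]

print(topic_of(parse))
```

Applied to sentence (2), the rule picks out ‘SugarBearAI’ rather than ‘Apple’, matching the intuition above.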

Kashaya Language Documentation

In the summer of 2009, I participated in a 6-week language documentation course as part of the Linguistic Society of America’s biennial Summer School, where the most engaged graduates (and undergraduates) meet and socialise to work on advanced topics in linguistic theory.

The linked work here refers to Berkeley’s Language Archive, where my field recordings, field transcripts, and analysis of Kashaya are held. These notes were derived from working with Anita Silva, a native speaker of Kashaya, a language indigenous to the region now known, after colonisation, as Santa Rosa, California.

She recently passed away. However, her work documenting her native language will have an impact for generations to come as efforts to revitalise languages through new native speakers continue.

Intro To Linguistic Theory

Linguistics is the science of language as it relates to human cognition.

Metaphysical considerations about the properties of organic systems may seem far removed from the low-level details of language data. The general idea, however, is that the language faculty is ‘perfect’: it has nearly exact, recurrent properties that, while not wholly describable in formal logic notation, are better described by such systems than by statistical methods that try to mimic the process of predicting language competence.

Language As A Discrete Object

Human language is quirky relative to other organic systems because of its discrete properties and its infinite yet principled variety. Language differs from other capacities, like the ability to recognise emotion in facial expressions (where a computer can outperform a human): its expressive power is provably infinite, yet human competence – the store of representative information concerning a given language – easily outperforms any computer model.

The core elements of human language are discrete too: there is no ‘half a sentence’, since at some level the stored model assigns an interpretation to a linguistic element only as a whole.

Objects in language are discrete and deterministic when their meanings are clear to a speaker, though interesting ambiguities signal varied interpretations. Speakers’ experiences are noisy and chaotic, but for the most part a restricted set of properties in the human mind defines what counts as a possible steady state for a human language.

Trends Within The Field: Stochastic vs Discrete Linguistics

Corpus Linguistics and Theoretical Linguistics have often been thought to stand in opposition when investigating linguistic phenomena, deemed distinct ways to view the same object: competence in a linguistic system. However, each is better seen as a valid, complementary approach: distinct ways to model the distinct phenomena of performance and competence.

In theoretical linguistics, we are concerned with the discrete study of linguistic competence. Several biological and psychological arguments underpin the related questions of “what are the general conditions that the human language faculty should be expected to satisfy in order to execute a language?” and “how do these conditions define the language faculty?” These two questions, sourced from Howard Lasnik’s foreword to the Minimalist Program, are the domain of theoretical linguistics.

Corpus linguistics, by contrast, is the broad characterisation of language in text: the sourcing, curation, and stochastic analysis of domain-specific text, an analysis of language in the context of performance.

Computational Rules & The Lexicon

There is a firm partition between the functional and substantive parts of a language.

The substantive words are what are commonly called ‘nouns’, ‘verbs’, and ‘adjectives’. These elements describe the world, while their relations, the less salient nuance necessary for interpretation, are relegated to the functional elements of language.

In computational linguistics, functional words are referred to as ‘stop words’, though the term is often given an application-specific definition covering high-frequency vocabulary that recurs within a corpus but does not signal a topic in text. Loosely defined, a function word tells you how the substantive words relate.


A lexicon is a repository of word information.

The repository contains all the unprincipled details of a word that define it uniquely relative to all other words.

In a lexicon, these unique details are idiosyncratic: there is no in-depth explanation of why and how they emerge. Rather, the details are assumed a priori under any theory of language.

Computational System

Any language can be represented as a set of principles instantiated with specific parameters. Such a system has multiple modular components handling different cognitive tasks, like Semantic Interpretation or Grammatical Inference.

The semantic or interpretive module of a language is called its Logical Form (May 1977), while its Phonetic Form parallels this module on the spoken side. Additionally, interfacing with both modules are grammatical principles, or Deep Structure, which essentially prove that a language string obeys the principles of that language.