Information Retrieval Based On Linguistic Structure

Often, specific linguistic structures can more clearly signal an article's relevance to a query. An IR query may specify content or metadata rules that would ordinarily imply high relevance, but the way a term is positioned within a sentence can require a caveat to the overall relevance judgment.

Currently, simply containing a word within a certain depth of an article is sufficient to yield a true positive. Roughly, the rule can be characterized as follows:

If term A (the topic in mind) is contained within some metadata, then return all articles containing term A in that metadata.
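As a minimal sketch of this containment rule (the record layout and field names here are illustrative, not the actual system's schema):

```python
def retrieve(articles, term):
    """Naive containment rule: return every article whose metadata
    mentions the term at all, regardless of the term's syntactic role."""
    term = term.lower()
    return [a for a in articles if term in a["metadata"].lower()]

articles = [
    {"id": 1, "metadata": "Apple shares plummet on news of CEO death."},
    {"id": 2, "metadata": "SugarBearAI, a California outfit, now has Apple as minority shareholder."},
]

# Both articles match the query "Apple", even though only the first
# is primarily *about* Apple.
print([a["id"] for a in retrieve(articles, "Apple")])  # [1, 2]
```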

However, it is prudent to consider that not all mentions of an entity are equally important. Consider the difference between (1) and (2):

(1) Apple shares plummet on news of CEO death.

(2) SugarBearAI, a California outfit, now has Apple as minority shareholder.

While both mention Apple, there is a strong sense in which (1) is purely about the company, whereas (2) is surely not about California, only partly about Apple, and mostly about SugarBearAI.

This suggests that we could ground IR relevance in linguistic structure.

In sentence (1), the subject is most definitely Apple, and so the story that is most about Apple is (1), not (2).

But what is it about the noun ‘SugarBearAI’ occupying subject position that makes (2) so much more about ‘SugarBearAI’ than ‘Apple’? We assume this is because SugarBearAI is the subject of the matrix verb: ‘has’.

In Dependency Grammar parlance, ‘SugarBearAI’ has a dependency with the ROOT head of the sentence:

SugarBearAI, a California outfit, now has Apple as minority shareholder.

Therefore, we can surmise that whatever noun is the NSUBJ of the ROOT head node is the actual topic in question and deserves the highest relevance.
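The NSUBJ-of-ROOT check can be sketched in pure Python over a hand-annotated parse of sentence (2). The `(token, dep_label, head_index)` triples below mimic the output a dependency parser such as spaCy would produce (in spaCy itself, the equivalent check is roughly `token.dep_ == "nsubj" and token.head.dep_ == "ROOT"`); the exact labels and attachments here are an illustrative simplification, not a verified parser output:

```python
# Hand-annotated dependency parse of sentence (2):
# (token, dependency label, index of head token).
# Index 5 ("has") is the ROOT; "SugarBearAI" attaches to it as nsubj,
# while "Apple" is merely its direct object (dobj).
parse = [
    ("SugarBearAI", "nsubj",  5),   # 0
    ("a",           "det",    3),   # 1
    ("California",  "amod",   3),   # 2
    ("outfit",      "appos",  0),   # 3
    ("now",         "advmod", 5),   # 4
    ("has",         "ROOT",   5),   # 5
    ("Apple",       "dobj",   5),   # 6
    ("as",          "prep",   5),   # 7
    ("minority",    "amod",   9),   # 8
    ("shareholder", "pobj",   7),   # 9
]

def root_subject(parse):
    """Return the token that is the nsubj of the ROOT head, if any."""
    root_i = next(i for i, (_, dep, _) in enumerate(parse) if dep == "ROOT")
    for tok, dep, head in parse:
        if dep == "nsubj" and head == root_i:
            return tok
    return None

print(root_subject(parse))  # SugarBearAI
```

Under the proposed rule, only the article whose root subject matches the query term would receive top relevance, so a query for ‘Apple’ would no longer rank (2) as highly as (1).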
