Categories and Constructions

Mark Steedman

Grammars of the size needed to ensure wide coverage are so large and syntactically ambiguous that the use of statistical models to limit the parser's search space is essential. The most important use for treebanks is to provide training data for such models. However, as Collins and Charniak realized in the 1990s, it is also tempting to induce the grammar itself from the same treebank, since that is the grammar whose derivations the frequency-based model actually models. Furthermore, the treebank grammar also covers very many constructions that have either been overlooked entirely by linguists, or ignored as too boring to bother about, but are nevertheless quite frequent, such as street addresses.

All this means that writing the annotator manual for a treebanking project is a lot like working as a construction grammarian. It is interesting in this respect that many of the sentences in corpora such as the Penn Wall Street Journal Treebank involve the same constructions that have received attention in CxG. I'll look at some of these, including Caused Motion and the related Resultative Construction, the Way construction, and the Discontinuous Dependents and Complex Predeterminers recently considered by Kay and Sag (2012).

I'll argue that many of these phenomena have already been analyzed in the treebank, and in particular in its incarnation as CCGbank (Hockenmaier and Steedman 2007). In some cases, I'll propose alternative analyses. More controversially, I shall argue that in all cases, the constructions are lexically headed, and that in all cases they can therefore be syntactically and semantically lexicalized (usually ambiguously, sometimes as multi-word items, and sometimes via lexical rules), obviating the need for any "constructicon" other than the (morpho)lexicon, and leaving syntactic and semantic projection from the lexicon construction-free, language-independent, and universally applicable.

Of course, many constructions in the long tail are missing from treebanks, and therefore from both the model and the treebank-derived grammars. (For example, many species of wh-question are entirely missing from the Penn Treebank.) If there is time, I'll say something about our work on what to do about this.