University of Cambridge > > NLIP Seminar Series > Learning hierarchical structure: strong learning of PCFGs

Learning hierarchical structure: strong learning of PCFGs

Add to your list(s) Download to your calendar using vCal

If you have a question about this talk, please contact Andrew Caines.

A central problem in language acquisition and NLP is the acquisition of hierarchically structured syntactic representations. The simplest model of this is strong learning of context-free grammars from strings: where the learner is required to recover isomorphic labeled parse trees while observing only unstructured strings of symbols. Recently developed algorithms for learning context-free grammars from strings typically use a rather unrealistic learning model which only guarantees asymptotic weak convergence using membership queries (the ability to test whether a given string is in the language or not). However the classes of languages that can be learned are quite large. Probabilistic variants of these algorithms have also been developed with PAC style guarantees for some smaller classes.

Here we convert the simplest of these algorithms—a primal learning algorithm called a 1-finite kernel learner—into a practical probabilistic learning algorithm that can learn the structure of probabilistic context-free grammars from strings. This involves converting a logical notion of the validity of a production in the grammar to its probabilistic equivalent. We combine this approach with some standard machine learning techniques: non-negative matrix factorisation, and the Inside-Outside algorithm.

We experiment using synthetic data restricting ourselves to CFGs that satisfy three reasonable conditions: being in Chomsky normal form; having many more terminals (words) than nonterminals (syntactic categories) and not being excessively ambiguous. We observe that on nearly all grammars that satisfy these conditions the learner approaches the theoretically optimal performance using standard parsing metrics. This work suggests that syntactic structure can be learned just from strings and more generally that the unrealistic learning models used in recent theoretical work can be converted quite easily into practical algorithms.

This talk is part of the NLIP Seminar Series series.

Tell a friend about this talk:

This talk is included in these lists:

Note that ex-directory lists are not shown.


© 2006-2024, University of Cambridge. Contact Us | Help and Documentation | Privacy and Publicity