Skip to content

Latest commit

 

History

History
218 lines (164 loc) · 6.89 KB

README.md

File metadata and controls

218 lines (164 loc) · 6.89 KB

Build Status

NatLang

Copyright (C) 2011-2017 mailto:[email protected]

About

NatLang is an English parser with an extensible grammar. It generates abstract syntax trees for all possible interpretations of an English sentence accepted by a grammar. The algorithm is completely deterministic. No training data is required.

This project is no longer being maintained. See new version here: parse-english

It works as follows:

  1. The user inputs a sentence.
  2. WordNet identifies one or more POS identities for each word in the sentence.
  3. All POS configurations of the sentence are evaluated using the Yacc-generated parser.
  4. A parse tree is generated for each successful parse.

For example:

The sentence "eats shoots and leaves" has four interpretations depending on the POS of "shoots" and "leaves".

eats  shoots and leaves
 |     |      |   |
(V)---(V)----(C)-(V)
   \        /   \
    *-(N)--*     (N)

Path #1: {V V C V}
Path #2: {V V C N}
Path #3: {V N C V}
Path #4: {V N C N}

The sentence "flying saucers are dangerous" has two interpretations depending on the POS of "flying".

flying saucers are   dangerous
 |      |       |     |
(Adj)--(N)-----(Aux)-(Adj)
      /
(V)--*

Path #1: {Adj N Aux Adj}
Path #2: {V   N Aux Adj}

TODO: Additional analyses passes can be applied to the generated trees for further processing.

A Motivating Example

input:

the quick brown fox jumps over the lazy dog

output:

picture alt

best solution (5th tree from left to right):

picture alt

Strategy for Eliminating Grammar Ambiguity

English is a non-context-free language. That means the same word used in different contexts can have different meanings. But Yacc cannot interpret the same lexer terminal differently if it is represented using the same lexer terminal ID. When this happens, it is necessary to split ambiguous terminals.

To do so:

  1. Identify lexer terminal with ambiguous meaning.
  2. Identify parser rules that use the lexer terminals with ambiguous meaning, and assign to each use case a different lexer terminal ID.
  3. Take advantage of stateful lexing to return a different lexer terminal ID for the same lexer terminal.

For example:

The sentence "She and I run and he jumps and shouts." has three conjugations "and".

A possible parse tree may look like this:

                (S)
                 |
         *-------*-------*
         |       |       |
        (S)      |      (S)
         |       |       |
     *---*---*   |   *---*----*
     |       |   |   |        |
    (NP)    (VP) |  (NP)     (VP)
     |       |   |   |        |
 *---*---*   |   |   |   *----*----*
 |   |   |   |   |   |   |     |   |
(N) (C) (N) (V) (C) (N) (V)   (C) (V)
 |   |   |   |   |   |   |     |   |
She and  I  run and  he jumps and shouts.

Yacc chokes on this input due to the ambiguity of "and".

The solution is to split "and" into multiple lexer terminals, each representing a different abstraction level in the grammar.

  • C_NP for noun-part level conjugations.
  • C_VP for verb-part level conjugations.
  • C_S for sentence level conjugations.

And to try all 27 permutations to see which ones work.

She and    I  run and    he jumps and   shouts.
 |   |     |   |   |     |   |     |     |
(N) (C#1) (N) (V) (C#2) (N) (V)   (C#3) (V)

           C#1  C#2  C#3
            |    |    |
Path #1:  {C_NP C_NP C_NP} -- fail!
Path #2:  {C_NP C_NP C_VP} -- fail!
Path #3:  {C_NP C_NP C_S}  -- fail!
Path #4:  {C_NP C_VP C_NP} -- fail!
Path #5:  {C_NP C_VP C_VP} -- fail!
Path #6:  {C_NP C_VP C_S}  -- fail!
Path #7:  {C_NP C_S  C_NP} -- fail!
Path #8:  {C_NP C_S  C_VP} -- success!
...
Path #27: {C_S  C_S  C_S}  -- fail!

Here, path #8's POS configuration results in a successful parse.

She and     I  run and    he jumps and    shouts.
 |   |      |   |   |     |   |     |      |
(N) (C_NP) (N) (V) (C_S) (N) (V)   (C_VP) (V)

Usage

./app/bin/NatLang -e "the quick brown fox jumps over the lazy dog" -d | dot -Tpng -oast_fox.png

Requirements

Unix tools and 3rd party components (accessible from $PATH):

gcc flex bison wordnet valgrind cppcheck doxygen graphviz ticpp

Environment variables:

  • $INCLUDE_PATH_EXTERN -- where "ticpp/ticpp.h" resides
  • $LIB_PATH_EXTERN -- where "libticppd.a" resides

Supported Language Features

  • Present tense
  • Progressive tense
  • Past tense
  • Past perfect tense

Limitations

  • Passive voice not supported.
  • Questions not supported.
  • Conditionals not supported.
  • WordNet does not provide POS look-up for inflected verb forms and mechanical words such as prepositions, leading to a reliance on hard-coded POS definitions in the lexer for some words.
  • A brute force algorithm tries all supported interpretations of a sentence. This is slow for long sentences.
  • BNF rules are suitable for specifying constituent-based phrase structure grammars, but are a poor fit for expressing non-local dependencies.

Make Targets

target action
all make binaries
test all + run tests
pure test + use valgrind to check for memory leaks
dot test + generate .png graph for tests
lint use cppcheck to perform static analysis on .cpp files
doc use doxygen to generate documentation
xml test + generate .xml for tests
import test + use ticpp to serialize-to/deserialize-from xml
clean remove all intermediate files

References

"Part-of-speech tagging"
http://en.wikipedia.org/wiki/Part-of-speech_tagging
"Princeton WordNet"
http://wordnet.princeton.edu/
"Syntactic Theory: A Unified Approach"
ISBN: 0340706104
"Enju - A fast, accurate, and deep parser for English"
http://www.nactem.ac.uk/enju/

Keywords

Lex, Yacc, Flex, Bison, NLP, Natural Language Processing, WordNet, Part-of-Speech Tagging, Yacc Grammar for English, English parser, parsing English, Linguistics, Phrase Structure Grammar, X-bar Theory, POS, PSG, BNF, CFG, GLR