Friday, October 19, 2012

Google Ngrams 2.0 and POS tagging

As Ben Zimmer blogged yesterday, there's a new and improved version of the Google Ngram viewer. The "improved" bit has a number of elements, but one is POS tagging. This is a wonderful thing, and I'm inordinately happy about it. Unfortunately, there are some very odd quirks to deal with.

The first is that the subordinators that, whether, and if (e.g., she asked me whether/if I'd be able to go; he told me that he'd be able to go) are tagged as _ADP_ (adposition, a more general term for prepositions). I've never seen such a classification, and it strikes me as deeply strange.

The second is their list of _DET_ (determinatives, or what they call "determiners"). I'm happy to report that they do not include the dependent possessive pronouns (my, your, his, her, our, etc.). These are tagged as _PRON_.

It seems that the following words are tagged as determiners at least part of the time:

Word        Tagged _DET_
a           100%
all         100%
an          100%
another     100%
any         100%
both        100%
each        100%
every       100%
some        100%
the         100%
these       100%
this        100%
those       100%
whatever    100%
which       100%
whichever    94%
no           88%
neither      56%
that         43%
either       42%
what          3%
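
For anyone wanting to reproduce figures like these, here is a minimal sketch of the arithmetic, assuming each percentage is the count of the tagged form (e.g., neither_DET) divided by the count of the bare word. The function name and the counts are placeholders for illustration, not values from the corpus.

```python
def det_shares(counts):
    """Return (word, percent tagged _DET_) pairs, highest share first.

    `counts` maps a word to (tokens tagged _DET_, all tokens of the word).
    """
    shares = []
    for word, (tagged, total) in counts.items():
        if total == 0:
            continue  # no tokens at all, so no share can be computed
        shares.append((word, round(100 * tagged / total)))
    # Sort by share descending, then alphabetically, like the list above.
    return sorted(shares, key=lambda pair: (-pair[1], pair[0]))


# Placeholder counts (assumed, NOT real corpus data).
example_counts = {
    "what": (3, 100),
    "neither": (56, 100),
    "whichever": (94, 100),
}

for word, percent in det_shares(example_counts):
    print(f"{word:<10} {percent:>3}%")
```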

Apart from these, many and much are usually tagged _ADJ_ (and _ADV_ for much), but less than 1% of the time they get tagged _DET_. I believe all cases tagged _ADJ_ should have been _DET_, so I have no idea what distinction is being made here.

I believe that which is generally a determinative, but in relative uses, it's usually a pronoun:
They could be late, which would be a problem. [pron]
They could be late, in which case we would have a problem. [det]

The following words are generally considered determiners (at least in some cases) and yet they are never tagged as such in the corpus:

few, fewer, fewest, last, least, less, little, more, most

Some other words which are determinatives in some accounts but not here are:
a few, a little, anyone, anything, certain, enough, everything, none, said, us, we, you

Finally, the list above doesn't appear to be exhaustive as I cannot get to 100% when dividing by _DET_, even when I include upper case first letters and all upper case. I wonder what I'm missing.
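
The coverage check described here can be sketched as follows, assuming you have per-form counts of tokens tagged _DET_ and a total count for _DET_ itself. The helper names and all the numbers are hypothetical; the point is only to show lower-case, capitalized, and all-caps forms being summed without double counting.

```python
def case_variants(word):
    """Lower-case, Capitalized, and UPPER-CASE forms of a word (as a set)."""
    # A set avoids double counting, e.g. "A" is both capitalized and upper-case.
    return {word.lower(), word.capitalize(), word.upper()}


def det_coverage(det_counts, words, total_det):
    """Fraction of all _DET_ tokens accounted for by `words`."""
    covered = sum(
        det_counts.get(form, 0)
        for word in words
        for form in case_variants(word)
    )
    return covered / total_det


# Made-up counts (assumed for illustration, NOT real corpus data).
det_counts = {"the": 600, "The": 100, "THE": 10, "a": 200, "A": 20, "such": 70}
coverage = det_coverage(det_counts, ["the", "a"], total_det=1000)
print(f"{coverage:.0%} of _DET_ tokens covered")
```

If the result stays below 100%, the shortfall is the share of _DET_ tokens belonging to words not yet on the list.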

[Update: It seems that these oddities are based mostly on the Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision, 2nd Printing) by Beatrice Santorini:
This category includes the articles a(n), every, no and the, the indefinite determiners another, any and some, each, either (as in either way), neither (as in neither decision), that, these, this and those, and instances of all and both when they do not precede a determiner or possessive pronoun (as in all roads or both times). (Instances of all or both that do precede a determiner or possessive pronoun are tagged as predeterminers (PDT).) Since any noun phrase can contain at most one determiner, the fact that such can occur together with a determiner (as in the only such case) means that it should be tagged as an adjective (JJ), unless it precedes a determiner, as in such a good time, in which case it is a predeterminer (PDT).
This explains the missing determiners, but it doesn't explain why a small subset of many and much are tagged as _DET_.]

[Update 2: The instances of many_DET are mostly of the form many a, for example many a day went by. Under the Penn system, this is a predeterminer. Thanks to Slav Petrov for solving this puzzle!]
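
To make the Penn rule quoted in the first update concrete, here is a rough sketch of it as a function of a word and the word that follows. It is a toy illustration, not the tagger Google used: the function name and word lists are my assumptions, the tags are Penn's (DT, PDT, JJ), and it ignores real ambiguity, such as that used as a subordinator.

```python
# Toy illustration of the quoted Penn Treebank rule; NOT the actual tagger.
DETERMINERS = {
    "a", "an", "the", "every", "no", "another", "any", "some", "each",
    "either", "neither", "that", "these", "this", "those",
}
POSSESSIVES = {"my", "your", "his", "her", "its", "our", "their"}


def penn_tag(word, next_word=None):
    """Return DT, PDT, or JJ for the words covered by the rule, else None."""
    w = word.lower()
    nxt = (next_word or "").lower()
    if w in {"all", "both"}:
        # Predeterminer only before a determiner or possessive pronoun.
        return "PDT" if nxt in DETERMINERS | POSSESSIVES else "DT"
    if w == "such":
        # "such a good time" vs. "the only such case"
        return "PDT" if nxt in DETERMINERS else "JJ"
    if w == "many":
        # "many a day" is a predeterminer; otherwise an adjective.
        return "PDT" if nxt in {"a", "an"} else "JJ"
    if w in DETERMINERS:
        return "DT"
    return None  # word not covered by this sketch


print(penn_tag("all", "roads"))  # DT
print(penn_tag("all", "the"))    # PDT
print(penn_tag("many", "a"))     # PDT
```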
