Useful examples for language learners

The odd choices of example sentences that sometimes show up in "teach yourself to speak..." books and in phrase books have been rightly mocked in the past. In fact, the subtext of this blog's title references just such a phrase book.

On meeting 'otiose' twice again

I asked Mark Liberman to have a look at what I wrote yesterday since I was struggling to get my head around the probabilities. He was kind enough to write the following guest post:

Maybe a better way of thinking about it is this:

Say the probability that word w_i will be selected at random from a collection of text is P(w_i). Then assuming independence, the probability that the next word will NOT be w_i is (1-P(w_i)), and the probability of failing to find w_i in N successive draws is

(1-P(w_i))^N

If P(w_i) is 1/10^7 (one in ten million), and N is 1000, then we get

(1-(1/10^7))^1000

which is 0.9999. So if we take notice of a rare-ish (P = 1/10000000) word, and draw 1,000 other words at random looking to see it again, then 9,999 times out of 10,000, we'll fail to find the moderately rare word we were waiting for. And if we draw 10,000 additional words instead of 1,000, the probability of failure is still

(1-(1/10^7))^10000 = 0.999

so we're still gonna fail 999 times out of a thousand.

But the thing is, Rare Words Are Common. That is, a large proportion of word tokens belong to relatively rare types. So suppose that there are 10,000 other words of approximately equal rareness, and every time we see one of them, we set a subconscious process to watch for recurrences of that word within the next thousand instances.

If we do this a thousand times, then the chances of failure (for a thousand instances of noting a rare word and looking for it to occur again) become 

((1-(1/10^7))^1000)^1000 = about 0.9
((1-(1/10^7))^10000)^1000 = about 0.368

So if you do enough reading for these conditions to be satisfied once a day, you should expect to have this experience several times a week.
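
As a rough check on the arithmetic above, here is a minimal Python sketch. It assumes independent draws, exactly as in the urn model just described, and the function names (p_no_repeat, p_all_fail) are made up for illustration.

```python
def p_no_repeat(p_word: float, window: int) -> float:
    """Chance that a word with per-draw probability p_word
    does NOT turn up in `window` independent draws."""
    return (1.0 - p_word) ** window


def p_all_fail(p_word: float, window: int, noticings: int) -> float:
    """Chance that every one of `noticings` separate watches fails,
    each watch covering `window` independent draws."""
    return p_no_repeat(p_word, window) ** noticings


P = 1e-7  # one in ten million, as in the text

for window in (1_000, 10_000):
    single = p_no_repeat(P, window)
    many = p_all_fail(P, window, 1_000)
    print(f"window={window}: one watch fails p={single:.4f}, "
          f"all 1,000 watches fail p={many:.3f}")

# If 1,000 noticings with a 10,000-word window happen once a day,
# the chance of at least one recurrence per day is 1 - p_all_fail.
daily_success = 1.0 - p_all_fail(P, 10_000, 1_000)
print(f"expected coincidences per week: {7 * daily_success:.1f}")
```

With the 10,000-word window, the chance of at least one recurrence comes out at a bit over 0.6 per day, which is where "several times a week" comes from.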

Now, none of this reasoning really applies, because you aren't picking words at random from a well-mixed urn, you're reading them in order in coherent text. And words in coherent text are far from independent Bernoulli trials -- when a rare word appears, the probability that it will appear again before long in the same text is massively increased by topic effects (and to a lesser extent style and priming effects).  But this just means that the experience should be more common rather than less common -- unless you insist that the texts be separate and on different topics, and so forth, in which case it gets complicated.

But still, I think that the real puzzle is not why you had this apparently odd experience, but why we only occasionally notice the kinds of coincidences that are in fact rather common.

This is not an unimportant question, since it has a lot to do with the genesis of superstition (and probably science, for that matter...)

The above is a guest post by Mark Liberman.

On meeting 'otiose' twice in a day

Well, not in the same day, but certainly within a 24-hour period. As I was lying in bed last night, reading Charles Mann's 1493, I came across the phrase the otiose Percy on p. 78.

As of this morning, I've read to p. 90, so that's about 4,500 words later. I also read a few NY Times articles, adding perhaps another 1,200 words. And then I set about to edit an article for Contact, the TESL Ontario magazine for which I'm the editor. Almost immediately, I came across a quote from David Crystal in which he wonders,
whether the presence of a global language will eliminate the demand for world translation services, or whether the economics of automatic translation will so undercut the cost of global language learning that the latter will become otiose.

Climbing the grammar tree

I've started a new blog called "Climbing the grammar tree". The idea is that I will respond to readings I'm doing for my doctoral studies, so check it out.

A title misparsed

This morning, I was reading this article at New Statesman, when I came across the following:
Yet surely, when night after night atrocities are served up to us as entertainment, it's worth some anxiety. We become clockwork oranges if we accept all this pop culture without asking what's in it.
The plural clockwork oranges suddenly threw into sharp relief the title of Burgess's book A clockwork orange. For some reason that I am unable to articulate now, if I ever was aware of it, I had always parsed that title like this:
That is to say, I took orange to be a postpositive modifier of clockwork (like proof positive, governor general, the city proper, etc.) instead of clockwork as an attributive modifier of orange, like this:

This was, I must admit, an odd and, even to me, puzzling title, but then it's an odd and puzzling book, so I just rolled with it. As I say, it was the plural oranges that made me see the light: adjectives don't do plurals.

I somehow overlooked the frequency of clockwork as a modifier, which should have tipped me off: in COCA, almost 40% of all instances of clockwork are attributive modifiers. Another thing that I was aware of, but which just seemed like more of the weirdness, is that clockwork is rarely--but sometimes--countable, so a clockwork is kinda weird, but not totally beyond the pale.

Perhaps one thing that pushed me to the first analysis was the stress pattern. Usually, in an NP with a noun as modifier, the modifier gets the main stress. It's a
  • FAculty office, not faculty OFfice
  • SOCcer ball, not soccer BALL, and  
  • poLICE officers, not police OFficers. 
My impression is that people tend to say a clockwork ORANGE, rather than a CLOCKwork orange. This is the same pattern you get with postpositive modifiers like proof POsitive.

Whatever the reason, what really impressed me is how decades of misapprehension can be overcome by a single choice example.

Antedating "determinative"

The OED gives:

b. Gram. determinative adjective, determinative pronoun, etc. (see quots.); determinative compound = tatpurusha n.

1921   E. Sapir Lang. vi. 135   The words of the typical suffixing languages (Turkish, Eskimo, Nootka) are ‘determinative’ formations, each added element determining the form of the whole anew.
1924   H. E. Palmer Gram. Spoken Eng. ii. 24   To group with the pronouns all determinative adjectives..shortening the term to determinatives.
1933   L. Bloomfield Language xiv. 235   One can..distinguish..determinative (attributive or subordinative) compounds (Sanskrit tatpurusha).
1961   R. B. Long Sentence & its Parts 486   The, a, and every are exceptional among the determinative pronouns in requiring stated heads.
Today, I was reading Kellner's Historical outlines of English syntax from 1892 and came across the following on pp. 113–114 (emphasis added):

In Old English the possessive pronoun, or, as the French say, "pronominal adjective," expresses only the conception of belonging and possession ; it is a real adjective, and does not convey, as at present, the idea of determination. If, therefore, Old English authors want to make such nouns determinative, they add the definite article : 
"hæleð min se leofa" (my dear warrior). —Elene, 511.
"ðu eart dohtor min seo dyreste" (thou art my dearest daughter). —Juliana, 193.
§179. In Middle English the possessive pronoun apparently has a determinative meaning (as in Modern English, Modern German, and Modern French) ; therefore its connection with the definite article is made superfluous, while the indefinite article is quite impossible. Hence arises a certain embarrassment with regard to one case which the language cannot do without.
Suppose we want to say "she is in a castle belonging to her," where it is of no importance whatever, either to the speaker or hearer, to know whether "she" has got more than one castle, how could the English of the Middle period put it? The French of the same age said still "un sien castel," but that was no longer possible in English.

§180. We should expect the genitive of the personal pronoun ("of me," &c., as in Modern German)—and there may have been a time when this use prevailed—but, so far as I know, the language decided in favour of the more complicated construction "of mine, of thine," &c.

This was, in all probability, brought about by the analogy of the very numerous cases in which the indeterminative noun connected with mine, &c., had a really partitive sense (cf. the examples below), and, further, by the remembrance of the old construction with the possessive pronoun.
And later:

Later on, the possessive pronoun apparently implies a determinative meaning (as in Modern German and Modern French) ; therefore its connection with the definite article is made superfluous, while the indefinite article is quite impossible. Instead of the old construction we find henceforth what may be termed the genitive pseudo-partitive. See above, 178–180.

Proscribing, narrowly

Over at the NYT, Alexander Nazaryan has a rather strident article about "The fallacy of balanced literacy." Therein, he writes, "balanced literacy is an especially irresponsible approach, given that New York State has adopted the federal Common Core standards, which skew toward a narrowly proscribed list of texts, many of them nonfiction." [Now changed to narrowly prescribed.]

These texts are prescribed. That is, they're imposed, not declared unacceptable or invalid. Nevertheless, the Google Books corpus suggests narrowly proscribed is a new and growing phrase.
So, I'm curious: was this simply a typo, or did he have in mind some metaphor of narrowing down by proscription? Or was it something else?

Thinking like a freak

I listen to the Freakonomics Radio podcast from time to time, and back in May they aired an episode called "the three hardest words...," which, purportedly, were I don't know. The premise was that people hate to admit ignorance and so they hardly ever say, "I don't know."

Except that in most corpus studies, the head-and-shoulders most common, number one, top-of-the-heap three-word string in English is I don't know. (It's a three-word string, not four, since -n't is an inflectional suffix, not just a contraction as is taught in elementary schools, but that's another issue.) For instance, in the 3-grams list from the Corpus of Contemporary American English, I don't know is by far the most frequent 3-gram with 199,110 instances (second is one of the at 167,785). In business meetings, we find the same results. Consider table 3.10 on p. 59 of this book, or table 5.8 on p. 183 of this paper.

Now, these are not mostly "I don't know (period)." Far more commonly, they're "I don't know if..." "I don't know what..." etc., which can often be used as a signal of disagreement rather than as an admission of ignorance. Nevertheless, the data stands in rather stark contradiction to the freaky claim. It looks pretty silly to be saying people should fess up to their ignorance, while basing the argument on a point on which you're so ignorant that you assert the most common phrase is the least (or at least the hardest).

(If you're interested in other freaky foolishness, see Joseph Heath's recent post on their simplistic view of the UK medical system.)

Audio and the OED

As I mentioned, Schwa Fire is now out, and I've been quite enjoying it. Arika Okrent (whose name I have inexplicably misread for years as Akira) has written an article called "Ghost voices" about preserving audio-tape recordings of our all-too-impermanent voices, dialects, and languages. As I was reading it, it occurred to me that the OED should include audio recordings of the quotations it uses. These should be in the dialect, and where possible the actual voice, of the original author.

Schwa Fire

Back in November, 2013, there was a proposal on Kickstarter for a new language magazine. I chipped in to sponsor it and ended up on the editorial panel as a result. The first issue is now out.

The golden age of language journalism begins now. In this inaugural issue, Arika Okrent tells the story of 5,700 hours of Yiddish recordings that were almost lost ("Ghost Voices"), and Russell Cobb writes about Americans' fondness for the Englishes we used to speak and what that fondness obscures ("The Way We Talked"). Michael Erard describes and defends "language journalism," and Robert Lane Greene provides a lesson on the languages of love ("Wooing in Danish"). Also included: an English homophone puzzle.

When "syndrome" is a final "s"

1982 gave us the acronym AIDS formed from acquired immune deficiency syndrome. This is pronounced /eɪdz/. The fact that the final S is pronounced /z/ is notable, since a final s is typically pronounced /s/ (e.g., bus) unless it is an inflectional morpheme (e.g., dogs). There are cases such as news and lens, in which a final s is pronounced /z/, but the -s in news was originally a plural morpheme. That leaves lens, which comes from the Latin word for lentil. Apparently, it was pronounced /leːns/ in Latin, so why it has a final /z/ in English is something of a mystery to me. I cannot find another example of an English noun with a final s pronounced /z/.

This brings us back to AIDS. Presumably, this final /z/ was influenced by the homographs aids, the noun, and aids, the verb. But then in 2003 we got SARS. There is no English word sar, so there is no preexisting homograph from which to analogously get /sɑɹz/, but that is the only pronunciation I've ever heard. I've never heard anyone say /sɑɹs/. So this seems to be an extension of the AIDS analogy to aids.

And now today we have MERS. On CBC's Metro Morning this morning, Matt Galloway started out pronouncing it /mɜɹs/, which initially threw me. I'd been mentally pronouncing it with a final /z/, and indeed Galloway finished up with /mɜɹz/. (I couldn't tell what the person he was interviewing was saying, but I suspect she was using the /z/ form, given his shift.)

So perhaps we have a new rule developing: acronym-final s for syndrome is pronounced /z/.

While looking around as I wrote this post, I found that at least one other person has taken note of the pronunciation of MERS.

[John Wells points out "Latin fifth-declension nouns in -es have final noninflectional /z/ in English, too: species, series... Why that should apply to MERS is a further question, which I cannot answer: but compare Mars."]

It's turtles round and round

Part I
I've been trying to understand categories better, and one of the books I've been reading in pursuit of this goal is George Lakoff's Women, Fire, and Dangerous Things. In fact, a few nights ago, I fell asleep reading it, and it must have stirred something in my mind because the next morning in the shower, it occurred to me that perhaps categories are just a distraction and it's really properties we should be looking at.
The category of red things is just a human convenience. But red is a property. Almost immediately, though, I realized that red is a category of electromagnetic waves and that electromagnetic waves are themselves a category. And from there, well, it's turtles all the way down. I set the idea aside as I dried myself and got ready to leave home.

Part II
When I got to work, our Blackboard system was down, so I couldn't do the grading I had intended to do. Distractedly, I opened the Simple English Wiktionary and saw that somebody had edited the entry for the preposition pace. The change was an improvement on what I thought was a rather odd previous definition. Going through the history, though, I noticed that the older definition was one I had provided. Curious about what I had been thinking, I went to the OED's entry for pace. There I found the following example:
1995   Computers & Humanities 29 404/1,   I do not believe, pace Peirce and Derrida, that it is signs all the way down.
This struck me as a huge coincidence. The expression shows up in the Corpus of Contemporary American English about once per 150 million words. I had encountered it, or a variation thereof, twice in a morning.

Part III
I looked up the expression and found that Wikipedia has an entry (linked above). One of the citations listed there is John (Haj) Ross's 1967 linguistics dissertation Constraints on Variables in Syntax. I followed the link and opened up his dissertation, which indeed contains the story with the line "It's turtles all the way down."
On the next page were the Acknowledgements, which list a number of linguists, but on page x, Ross writes,
This thesis is an integral part of a larger theory of grammar which George Lakoff and I have been collaborating on for the past several years.
This is, of course, the same Lakoff whose book I had been reading the night before.

Looking to the futurate

The verb look has been used to talk about the future for a long time. Perhaps the most common use is in the expression look forward to (something). This use may be based on the metaphor that time is a landscape we move through. As such, our future should be visible to us. This is probably the same metaphor that underlies the use of go for the future in expressions like we're going to get to that in a moment.

Despite its venerable history, futurate look began a significant upsurge in about 1980, particularly in the looking + to infinitive construction.
I noticed that this seems to be particularly common with are. Especially, you are. And even more specifically if you are. By this time we are looking at a small minority of the cases. But I wondered if it might tell us something about the meaning of the looking futurate as opposed to the going futurate. When I started looking at various corpus genres, though, things got a little too complex. Maybe you have some thoughts to add.

The opacity of etymology

The word disseminate is a familiar one. It appears hundreds of times on my hard drive and in well over 20 email messages I've read or written in the last five years. But until today, I had never seen the seeds in the word.

We often use a plant metaphor to talk about words. Morphology is a branch of linguistics just as plant morphology is a branch of biology. Both sciences talk of roots and stems, but in linguistics, seeds aren't part of the metaphor.

The root of disseminate, though, is semin or semen, from the Latin word meaning seed. Dissemination is the spreading of seeds. Semin is also the root of the word seminary, literally a seed plot, but now metaphorically used to mean a place to train priests. This is also where seminar comes from. I hadn't connected up these words either.

Disseminate appears in a passage that my level-8 class is studying. It's a passage that I've been over many times with other classes, and I have had to explain the word before. But never before have I made this connection.

How could this be?

On the flip side of this are cases of people seeing connections where there are none. Consider the regular flare-ups about the word niggardly based on a mistaken perception that it's based on a racial slur. There are so many opportunities for false positives. The string semen, for instance, shows up in a variety of words such as basement and horsemen, and nobody would ever think it referred to seeds.

How does this noticing thing work?