In "Nebuchadne__ar", I mentioned Vietnamese Nê-bu-cát-nết-sa 'Nebuchadnezzar' and asked,

was s intended to be [s] or [ʂ]?; how would 17th century missionaries have spelled it?

I still don't know the answer to the second question, but I can guess an answer to the first question. Google has 21,400 results for Nê-bu-cát-nết-xa and 101,000 results for Nê-bu-cát-nết-sa. The coexistence of spellings with x [s] and s [ʂ] ~ [s] (depending on dialect) leads me to think that [s] might have been intended. The 1926 Vietnamese Bible translated by William Cadman's team has s in the book of Đa-ni-ên 'Daniel'.

1.20.2:45: According to Wikipedia, Cadman and his wife "ran a printing shop in Hanoi from 1917 until 1942". Therefore it is likely that his team had the Hanoi pronunciation [s] of s in mind.

S in the 1926 Bible can correspond to Hebrew sh as well as ṣ, but the names as a whole look more like their Hellenized or Anglicized versions than the Hebrew originals. This is surprising since Cadman's wife knew Hebrew as well as Greek and English.

English Hebrew Greek 1926 Bible 2011 Bible New Bible
Joshua Yehoshua` Iesoũs Giô-s Giô-sua Giô-s
Samuel Shmu'el Samouḗl Sa-mu-ên
Ezra `Ezra Ésdras Ê-xơ-ra
Esther Ester Esthḗr Ê-xơ-tê
Isaiah Yeshayahu Ēsaḯas* Ê-sai I-sa-gia I-sa
Ezekiel Yeḥezqel Iezekiḗl Ê-xê-chi-ên Ê-xê-ki-ên Ê-xê-chi-ên
Hosea Hoshea Ōsēé Ô-sê Hô-sê-a Ô-sê
Zephaniah Tsfanya (< Ṣ-) Sophonías Sô-phô-ni Xê-pha-ni-a Sô-phô-ni
Zechariah Zekharya Zakharías Xa-cha-ri Xê-ca-ri-a Xa-cha-ri

Still, it seems that the choice of s or x in the 1926 Bible was determined on the basis of Hebrew, even if the other segments were influenced by Greek and English:

Hebrew Greek English 1926 and New Bible 2011 Bible
z z, s z x x
s s s
ṣ z s
sh s(h) s

Both x and s were [s] in the Hanoi dialect known to the Cadmans. Their team might have chosen s for Hebrew sh on the basis of the southern pronunciation [ʂ] of s, but I don't know why they Vietnamized Hebrew emphatic [ṣ] as s [s] ~ [ʂ].

The above table predicts correctly that the 2011 Bible - unlike the older or newer (!) Bibles - has Nê-bu-cát-nê-xa with -x- for 'Nebuchadnezzar'.

It's also surprising that Hebrew q and Greek k and kh were Vietnamized as the palatal ch [c] in the 1926 and New Bibles. Xa-cha-ri is reminiscent of English Zachary, but Ê-xê-chi-ên could not be predicted on the basis of English Ezekiel.

In theory, Hebrew and Greek z could have been Vietnamized as d, gi, or r which are all [z] in Hanoi. However, d and gi are [j] and r is [r] in the south. x does not match the voicing of z, but at least it is a sibilant [s] throughout Vietnam.

*Wikipedia has a long vowel ā before -s. How can the unwritten length of a be determined?

NEBUCHADNE__AR

David Boxenhorn suggested that the Bible could be used for crosslinguistic infodensity comparisons with the exception of names even in the original languages. He mentioned נבוכדנצר <nbwkdnṣr> 'Nebuchadnezzar' as an example of a non-Hebrew name that doesn't tell us much about Hebrew itself. I had never seen 'Nebuchadnezzar' in other languages before and was surprised by the variation at Wikipedia: e.g.,


Akkadian <na.bi.uv.ku.du.ur.ri.u.ṣu-ur> Nabū-kudurri-uṣur (transliteration of cuneiform from O'Conor 1885: 11); also <nabū.ku.dur.ur.u.ṣu-ur> (transliteration of cuneiform from O'Conor 1885: 42; see cuneiform in line 1 on page 17)

Aramaic ܢܵܒܘܼ ܟܘܼܕܘܼܪܝܼ ܐܘܼܨܘܼܪ ‎ <nåbū kūdūrī ʔūṣūr> (Is that right or at least close? I've never transliterated Syriac before.)

Arabic نبوخذ نصر <nbwxð nṣr>



Greek Ναβουχοδονόσωρ <Naboukhodonósōr>

Latin Nabuchodonosor (obviously based on Greek)

Russian Навуходоносор <Navuxodonosor> (ditto)

Georgian ნაბუქოდონოსორ <nabukodonosor> (ditto?)

Lithuanian Nabuchodonosaras

Swedish Nebukadnessar (> Finnish?; Danish and Norwegian have -s-)

Welsh Nebuchadnesar

Mandarin 尼布甲尼撒 Nibujianisa


Dutch (and hence Indonesian) Nebukadnezar

Polish Nabuchodonozor (unlike Czech and Slovak with -s-)

Slovenian Nebukadnezar (unlike other South Slavic languages with -s-; -z- an orthographic [but not phonetic] borrowing from German?)

Turkish Nebukadnezar

Hungarian Nabukodonozor

Mongolian Небухаднезар <Nebuxadnezar>

Japanese ネブカドネザル Nebukadonezaru


German Nebukadnez(z)ar (with z(z) = [ts])

Estonian Nebukanetsar (< German?)

Latvian Nebukanecars (< German?)

Esperanto Nebukadnecar (< German?)

Hungarian Nebukadneccár (< German?)

Vietnamese Nê-bu-cát-nết-sa (< ?; not from German; was s intended to be [s] or [ʂ]?; how would 17th century missionaries have spelled it?)


Korean 네부카드네자르 Nebukhadŭnejarŭ (Korean has no z)

Why does the English version have -zz- for <ṣ> instead of -s(s)- or -tz-? Are there other examples of this correspondence?

It would be interesting to see a family tree of translations of Biblical names or of the Bible itself.

1.19.3:19: The -zz- spelling in English is in the King James Bible. Does it go back any further in English?

1.19.4:39: The original Akkadian name has an -rr- corresponding to -n- in most other languages. This reminds me of liquid ~ nasal alternations in old transcriptions of Korean peninsular names such as

古陵 Middle Chinese *koʔ lɨŋ (> modern Korean Korŭng)

古寧 Middle Chinese *koʔ neŋ (> modern Korean Koryŏng with an irregular -r-!)

and even

古冬欖 Middle Chinese *koʔ touŋ lamʔ (> modern Korean Kodongnam)

which correspond to modern 咸昌 Hamchhang, capital of Koryŏng Kaya. Were */r (and/or l?) n t/ neutralized as [r] in intervocalic position in the language of that region? (It is more difficult to reconcile the vowels between that consonant and *ŋ: *ɨ, *e, and *ou.)

ʻEWALU-ATING INFODENSITY: REDEFINING IT IN ROUND THREE

As I wrote part one, I realized that David Boxenhorn and I had different definitions of 'infodensity', but I decided to stick with mine until I finished writing what was on my mind in part two. Now I'm going to try to paraphrase his definition so I can understand it.

I don't know anything about information theory, so I invented the term infodensity on my own, unaware that the term 'information density' already existed. David defined information density as

optimal compression / original text

with the difference between the two being redundancy (i.e., what can be removed without reducing information).
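David's ratio can be approximated with a general-purpose compressor. The sketch below is only a rough illustration: it assumes zlib's DEFLATE as a stand-in for 'optimal compression' (which it is not), measures length in UTF-8 bytes, and uses function names of my own invention.

```python
import zlib


def approx_information_density(text: str) -> float:
    """Approximate 'optimal compression / original text' in UTF-8 bytes.

    zlib is an assumption standing in for an optimal compressor, so the
    result is an upper bound on the true ratio, not the ratio itself.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(zlib.compress(raw, 9)) / len(raw)


def approx_redundancy_bytes(text: str) -> int:
    """Bytes that the compressor could remove without losing information."""
    raw = text.encode("utf-8")
    return len(raw) - len(zlib.compress(raw, 9))


# A highly repetitive string versus a string with no repeated characters.
redundant = "ab" * 500
varied = "".join(chr(code) for code in range(32, 127))
```

Very short texts can come out with a ratio above 1 because of zlib's fixed overhead, so comparisons only make sense for longer passages such as whole Bible chapters.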

While I defined infodensity as

phonemes / syllables (in part 1) or moras (in part 2) for some 'unit of meaning' (e.g., the concept 'eight' as in the last two parts)

David defined it as

actual morphemes / potential morphemes

Although in theory, morphemes can be of any length, in practice, they tend to be short. For example, I cannot think of any trisyllabic unanalyzable case markers*. However, David sees an 'information barrier': a limit to creating morphemes of a certain length because that size range is already heavily populated.

In a 'monosyllabic' language like Vietnamese, the default length of a morpheme is one syllable. Would the information barrier for Vietnamese be the number of possible syllables? Homophony allows syllables to be used for multiple purposes, but even that must have some limit.**

*1.18.1:49: I did, however, recall seeing some long case marker in Hodson's 1864 An Elementary Grammar of Kannada or Canarese Language.

Page 12 lists eight cases for Kannada nouns (obviously influenced by the eight cases of Sanskrit). The ablative suffix is listed as tetrasyllabic (!) -deseyinda, but I suspect it is a compound of instrumental -inda with -dese- '?'.

The trisyllabic plural suffix -arugaḷ on page 13 is clearly a pleonastic (i.e., redundant) combination of the plural suffixes -ar and -gaḷ. I am reminded of the Korean double plural -들들 -tŭl-dŭl /tɯltɯl/ (Martin 1992: 833). The double plural suffixes in English children and Dutch kinderen (cf. German Kinder with a single suffix) also come to mind.

There must be selective pressures that rule out long morphemes for more frequent functions. I predict no language has tetrasyllabic suffixes (monomorphemic or otherwise) for the nominative or absolutive case.

**1.18.3:39: Here is a frequency table of monosyllabic a-morphemes from Nguyễn (1966).

I have excluded unanalyzable polysyllabic morphemes like A-Căn-Đình 'Argentina'.

0 indicates the absence of a morpheme: e.g., there is no morpheme *ã.

1 indicates a syllable with only one entry (i.e., no homophones): e.g., ả 'lass'.

Numbers higher than 1 indicate syllables associated with more than one morpheme: e.g., 2 in the cell for anh refers to anh 'older brother' and Anh 'England'.

x indicates an impossible syllable: e.g., no Vietnamese syllable ending in a stop can have a tone other than sắc or nặng.

In theory no a-syllable should have a huyền, nặng, or ngã tone since those tones developed only after voiced consonants that never preceded a, but in reality there are a couple of exceptions: the polite particle ạ and sound-symbolic ào 'rushing, roaring, gushing'.

tone ngang huyền sắc nặng hỏi ngã
a 2 1 2 1 1 0
ac x 1

ach 1
ai 1 0 2 1
am 1 1 0
an 1 1
ang 0 1
anh 2 1 1
ao 1 1 2 1
ap x 1 x
at 1
ay 0 1 0
total 8 2 15 1 4 0

The empty (0/x) slots are of three types:

1. Chance gaps: e.g., there is no reason why there can't be a morpheme ang.

2. Historical gaps: the lack of huyền, nặng, or ngã syllables due to the lack of initial voiced consonants conditioning those tones in the ancestors of a-syllables

3. Phonotactic gaps: the lack of ngang, huyền, hỏi, and ngã syllables with final stops due to a constraint against such tone-coda combinations
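The three gap types can be stated as a small decision rule. This is a sketch under my own assumptions: the function name and the `onsetless` flag are hypothetical, the stop codas are just the ones in the table above (c, ch, p, t), and the function assumes the slot it is given is already known to be empty.

```python
import unicodedata

STOP_CODAS = {"c", "ch", "p", "t"}          # final stops in the table above
TONES_ALLOWED_WITH_STOPS = {"sắc", "nặng"}
VOICED_ONSET_TONES = {"huyền", "nặng", "ngã"}


def _nfc(s: str) -> str:
    # Normalize so that precomposed and decomposed diacritics compare equal.
    return unicodedata.normalize("NFC", s)


ALLOWED = {_nfc(t) for t in TONES_ALLOWED_WITH_STOPS}
VOICED = {_nfc(t) for t in VOICED_ONSET_TONES}


def classify_gap(coda: str, tone: str, onsetless: bool = True) -> str:
    """Classify an empty syllable slot as a type 1, 2, or 3 gap."""
    tone = _nfc(tone)
    if coda in STOP_CODAS and tone not in ALLOWED:
        return "phonotactic"   # type 3: tone-coda constraint
    if onsetless and tone in VOICED:
        return "historical"    # type 2: no voiced initial to condition the tone
    return "chance"            # type 1: nothing rules the syllable out
```

For example, the missing ngang-tone ang comes out as a chance gap, the missing ngã-tone a as a historical gap, and any huyền-tone syllable ending in -c as a phonotactic gap.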

Borrowings could fill type 1 or even type 2 gaps: e.g., there may not have been any native morpheme án until Chinese 案 was borrowed, though there has never been any constraint against án.

I have never seen a type 3 gap filled in Vietnamese. Can monolingual Vietnamese without linguistics training pronounce syllables like ac, àc, ảc, or ãc? Are such syllables possible in word games?

Do Vietnamese science fiction writers invent names containing syllables that fill type 2 and 3 gaps, or do they avoid Vietnamese syllable structure altogether in alien names, opting for long, un-Vietnamese strings of letters instead?

If Vietnamese speakers avoid all gaps and create more homophonous morphemes, what is to stop them from having a dozen homophonous morphemes? The limits are hard to quantify because homophony is tolerable as long as the morpheme is in a disambiguating context, and such contexts are difficult to model.

Vietnamese borrowed both Late Middle Chinese 明 *mɨeŋ 'bright' and 冥 *mieŋ 'dark' as minh. It's not surprising that minh 'bright' has, uh, eclipsed its homophone minh 'dark'. Only one homophone with the same part of speech could predominate in similar contexts.

ʻEWALU-ATING INFODENSITY: TAKE TWO

I wasn't happy with the simplistic formula for infodensity in my last post, and am even less happy now that I've been thinking about Sanskrit prosody. I was using syllables as a measurement of length, but it makes more sense to use moras instead.

If Classical Tibetan had presyllables and if presyllables counted as one mora, then brgyad might have had three moras, and its infodensity would be 2 = 6 phonemes / 3 moras (br, gya, d) - identical to Hawaiian ʻewalu which also has six phonemes and three moras.
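The mora-based count can be written out as follows. This is a minimal sketch: the function name is mine, the split of brgyad into (br, gya, d) follows the paragraph above, and the split of ʻewalu into three CV moras is my assumption about how its three moras line up with its phonemes.

```python
def mora_infodensity(moras: list[list[str]]) -> float:
    """Phonemes per mora for a word given as a list of moras,
    each mora being a list of phonemes."""
    if not moras:
        raise ValueError("a word needs at least one mora")
    phonemes = sum(len(mora) for mora in moras)
    return phonemes / len(moras)


# Classical Tibetan brgyad 'eight', assuming a moraic presyllable br-.
brgyad = [["b", "r"], ["g", "y", "a"], ["d"]]

# Hawaiian ʻewalu 'eight', assuming one mora per CV syllable.
ewalu = [["ʻ", "e"], ["w", "a"], ["l", "u"]]
```

Under these assumptions both words have six phonemes spread over three moras, so the two calls return the same value.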

Now let me add another factor which I'll call 'infopoints'. Infopoints are the number of possible phonemes in each slot of a syllable. Here are two hypothetical languages with very different numbers of infopoints:

Language A

Monosyllabic roots with the structure

First mora Second mora
29 onsets:
k-, kh-, g-, ŋ-
c-, ch-, j-, ɲ-
t-, th-, d-, n-
p-, ph-, b-, m-
y-, r-, l-, w-
x-, ɕ-, s-, f-
ɣ-, ʑ-, z-, v-

9 nuclei:
i, ɨ, u
e, ə, o
ɛ, a, ɔ

3 tones:
Level (unmarked)
Rising (ˊ)
Falling (ˋ)

6 codas including zero:
-ŋ, -n, -m, -y, -w, -Ø

Zero coda syllables are bimoraic with long vowels:
/CV/ = [CVː]

A root would have an infodensity of (29 x 9 x 3 x 6 infopoints) / 2 moras = 2349.

Language B

Disyllabic roots with monomoraic syllables:

14 onsets:
k-, ŋ-
c-, ɲ-
t-, n-
p-, m-
y-, l-, w-
h-, s-, ʔ-

5 nuclei:
i, u
e, o
a

A root would have an infodensity of ((14 x 5 infopoints) x 2) / 2 moras = 70.

Even nearly homophonic roots in the two languages have very different infodensity:

A ka [kaː ˧]: (29 x 9 x 3 x 6 infopoints) / 2 moras = 2349

B ka [ka]: (14 x 5 infopoints) / 1 mora = 70
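The arithmetic for the two hypothetical languages can be checked with a short sketch. The function name and its parameters are my own; the formula simply follows the one used above (product of slot sizes per syllable, times the number of syllables, divided by moras), which is only one of several ways the slots could be combined.

```python
from math import prod


def root_infodensity(slot_sizes: list[int], moras: int, syllables: int = 1) -> float:
    """Infopoints per mora: (product of slot sizes x syllables) / moras."""
    if moras <= 0 or syllables <= 0:
        raise ValueError("moras and syllables must be positive")
    return prod(slot_sizes) * syllables / moras


LANGUAGE_A_SLOTS = [29, 9, 3, 6]   # onsets, nuclei, tones, codas
LANGUAGE_B_SLOTS = [14, 5]         # onsets, nuclei

a_root = root_infodensity(LANGUAGE_A_SLOTS, moras=2)               # bimoraic monosyllable
b_root = root_infodensity(LANGUAGE_B_SLOTS, moras=2, syllables=2)  # disyllabic root
b_ka = root_infodensity(LANGUAGE_B_SLOTS, moras=1)                 # single syllable ka
```

With these inputs the language A root comes out at 2349 and both language B figures at 70.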

A k in language A has more infopoints (29) than its exact homophone in language B (14).

Similarly, an a in language A has more infopoints (9) than its near-homophone in language B (5).

Tonality counts for 3 infopoints and codas count for 6 infopoints in language A, but neither has any value in language B.

In short, language A makes more distinctions in roughly the same amount of space. I am still not pleased with my newest formula. Nonetheless, I think infodensity - however quantified - can be a useful tool for understanding

language acquisition: Do/should learners initially focus on high-infopoint phonemes?

language history: What effects do changes in infodensity have on language structure? Do languages become more analytic as their infodensity increases? Conversely, do they become more synthetic as their infodensity decreases? The answers seem obvious, but Written Tibetan is a counterexample with high infodensity plus synthesis: e.g., vowel ablaut in monosyllabic roots.

the structure of writing systems: All writing systems I have seen have little or even no consonant 'fudging', whereas a lot of vowel 'fudging' is possible, particularly in abjads:

Script type Consonant 'fudging' Vowel 'fudging'
Alphabet, abugida Limited Limited
Abjad Lots (but never total due to some vowels being written with matres lectionis)
Syllabary Limited
Sinography Looser than the above three but still limited

Vowels have fewer infopoints than consonants, so they convey less information, are more expendable, and hence are not completely represented in abjads and are reduced or lost in speech. There is no vowel-based equivalent of an abjad: a script which has primary graphemes for low-infopoint vowels but not high-infopoint consonants. If such a script existed, some consonants would be written with matres lectionis: e.g., New York would be <iu io> with <u> and <i> for w and y. Even such a compromise would not be sufficient to make a vowel-centered script viable.

ʻEWALU-ATING INFODENSITY: A FIRST APPROXIMATION

David Boxenhorn and I have discussed what I've called 'infodensity': the amount of information packed into a linguistic unit such as a word. How could this be compared across languages? Here's a crude method for measuring infodensity at the level of a single word. Given two words with the 'same'* meaning in two languages (i.e., words with the 'same' amount of information), divide the number of phonemes by the number of syllables: e.g.,

Written Tibetan brgyad 'eight': 6 phonemes / 1 syllable** = 6

Hawaiian ʻewalu*** 'eight': 6 phonemes / 3 syllables = 2
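The two figures above can be reproduced with a few lines of Python. This is a sketch of the crude method only; the function name is hypothetical and the phoneme lists are my own segmentation of the two words into six phonemes each.

```python
def syllable_infodensity(phonemes: list[str], syllables: float) -> float:
    """Crude infodensity: number of phonemes divided by number of syllables."""
    if syllables <= 0:
        raise ValueError("syllable count must be positive")
    return len(phonemes) / syllables


tibetan_eight = syllable_infodensity(["b", "r", "g", "y", "a", "d"], 1)
hawaiian_eight = syllable_infodensity(["ʻ", "e", "w", "a", "l", "u"], 3)
ratio = tibetan_eight / hawaiian_eight
```

The syllable count is a float so that a fractional count (for example 1.5 for a sesquisyllable) can be tried without changing the function.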

Written Tibetan brgyad has three times the infodensity of Hawaiian ʻewalu. Or does it?

Next: How Can Homophones Have Different Infodensities?

*1.16.5:10: Truly absolute translation equivalents are rare across languages.

**1.16.3:16: If brgyad was a sesquisyllable [br̩ˈgjat̚], then its infodensity would be 6 phonemes / 1.5 syllables = 4.

***1.16.3:02: Hawaiian also has disyllabic walu 'eight', but I chose the longer word because it has the same number of phonemes as brgyad.

1.16.5:20: Hawaiian (ʻe)walu may be cognate to Kra forms like Pubiao /rɯ/ A2 'eight' (Ostapirat 2000: 245) which has an infodensity of 3 (= 3 phonemes [A2 is a tone class] / 1 syllable). It is not certain whether the resemblance between Kra and Austronesian numerals is due to contact or common ancestry.

A MONOSYLLABIC CRITICAL MASS?

One might get the impression from Daniels' (1996: 585) quotation in my previous post that the 'syllabically organized' languages Sumerian, Old Chinese, and Classic Maya are all isolating* unlike highly inflected Sanskrit which I used as an example of an 'asyllabically organized' language.

However, David Boxenhorn reminded me that Sumerian is in fact agglutinative: e.g.,

The verbal root is almost always a monosyllable and, together with various affixes, forms a so-called verbal chain which is described as a sequence of about 15 slots, though the precise models differ.

Some of those affixes were subsyllabic: e.g.,

The pronominal prefixes are /-n-/ and /-b-/ for the 3rd person singular animate and inanimate respectively

Moreover, Classic Maya verbs also inflect, and its affixes are not necessarily syllabic: e.g., there is a passive infix -h-**.

And lastly, reconstructions of Old Chinese have been gradually becoming more sesquisyllabic: e.g., Karlgren's (1957: 266) 毒 *d'ôk 'poison' corresponds to my Old Chinese *Cʌ-duk with an unstressed minor syllable *Cʌ- needed to condition the partly lowered vowel in Middle Chinese:

*Cʌ-duk > *Cʌ-douk > *douk

Pulleyblank and Beckwith have even gone so far as to reconstruct disyllables corresponding to monosyllables in earlier Old Chinese reconstructions: e.g.,

馬 'horse'

Karlgren (1957: 29): *må

Pulleyblank (1999: 158): *mráka̯ʔ (arguably sesquisyllabic with a final minor syllable, a structure I have not seen in any Asian language)

Beckwith (2009: 402): *marka

This site: *mraʔ < ?*mʌraʔ < ??*moraʔ (cf. Mongolian morin, Middle Korean mʌr 'horse')

How can all of that be reconciled with Daniels' model of the invention of writing? Without knowing Sumerian and Mayan, here is my guess: all three languages had a 'critical mass' of picturable monosyllabic words*** whose pictograms could be used as rebuses to represent a large number of other (near-)homophones****. I assume all languages have monosyllabic words, but not all languages have that critical mass. Sanskrit certainly lacks it.

*1.15.00:50: Daniels did not actually use the term 'isolating', though "most morphemes and in particular independent words comprise single syllables" sounds to me as if

- "most [...] independent words" (excluding derivatives and compounds) are bare monosyllabic roots capable of standing alone without obligatory inflectional affixes: e.g., CVroot rather than CVroot-CVsuffix

(otherwise "most [...] independent words" would be longer than one syllable, unless most affixes were not monosyllabic, which would contradict "most morphemes [...] comprise single syllables": e.g., CVroot-Csuffix)

- most affixes are monosyllables: CVaffix rather than Caffix, CVCVaffix, etc.

**1.15.1:00: The examples of -h- correspond to zero in Wikipedia's transliterations (their 'transcriptions') of Classic Maya passives: e.g.,

<TZUTZ.tza.ja> tzuhtzaj 'it was finished'

This -h- disappeared in reconstructions of later written Mayan. How did Mayanists know it existed? Was it reconstructed in 'finished' by analogy with other passive verbs whose -h- was written? Could Classic Maya have had two kinds of passives, one with -h- and one without?

***1.15.2:33: These words would be nouns and perhaps verbs with zero affixes (if any).

A Sumerian monosyllabic noun root would still be monosyllabic in the absolutive case which had zero marking. Suffixes for other cases might have been clitics (Zólyomi 1993: 15) or postpositions (Johnson 2004: 26).

Old Chinese and Classic Maya nouns did not inflect.

****1.15.2:45: It has long been established that aspiration, voicing, and final *-ʔ and *-h (< *-s) were disregarded in Old Chinese phonetic series: e.g., 古 *kaʔ could be phonetic in





Some aspirates and voiced initials might be secondary: e.g., *kh- < **sk-, *g- < **Nk-.

Presyllables were also sometimes disregarded: e.g., 居 *-ka has the phonetic 古 *kaʔ.

WORDS > PICTOGRAMS > REBUSES > WRITING

Tonight I rediscovered this passage by Peter T. Daniels in The World's Writing Systems (1996: 585):

But why did writing emerge only for these three civilizations [Sumerian, Chinese, and Mayan]?


The answer seems to me (Daniels 1988) to lie in the syllable. In Sumerian, Chinese, and Mayan, most morphemes and in particular independent words comprise single syllables. A word is the shortest stretch of speech that can be uttered by someone without linguistic training (an Inuit-speaker who makes a mistake can't break off in the middle of a word and correct part of it, but after breaking off must begin to say it at the beginning). Thus in 'syllabically organized' languages like the three where writing was born, speakers can speak single syllables [without training]. So pictograms [initially] represent things with monosyllabic names. This in turn offers a means of representing those syllables that are not words for picturable objects - and that sort of representation is the defining characteristic of writing (section 1). Using a picture of some object to represent the sound of a homophonous word is known as rebus writing. While rebuses today are party games, at the dawn of history they were the foundation of writing.

Here's an example of such a rebus. In Old Chinese, 'to go' was *tə, written as a drawing of a foot: 之. This character was recycled to write the homophonous abstract words

*tə 'this'

*tə 'him, her, it, them' (third person object pronoun)

*tə (genitive particle)

which are difficult to visualize.

Languages with polysyllabic morphemes and independent words could be called 'asyllabically organized'.

Why would speakers of an asyllabically organized language be at a disadvantage when inventing a writing system without knowledge of any other writing system? Let's look at how a Sanskrit speaker might have tried to write the Sanskrit equivalents of Old Chinese *tə.

Unlike Old Chinese, Sanskrit is a highly inflected language. Monosyllabic words are rare because most words have affixes: e.g., one word for 'goes' is gacchati from Proto-Indo-European *gʷm̩-ske-ti, a root followed by two suffixes. It would be impossible to write gacchati with Sanskrit-based pictograms for monosyllabic words because Sanskrit had no words, picturable or otherwise, pronounced ga, gac, ccha, cha, chat, ti, or i.

One would encounter the same problem writing Sanskrit

idam 'this' (neuter nominative singular; no words i, id, da, dam)

tam 'him' (no picturable homophone tam)

tasya 'his/its' (masculine/neuter genitive singular; no words ta, tas, sya, ya)

Even picturable words pose difficulties. A Sanskrit speaker could draw a foot 之 and declare it to represent the word pāt 'foot', but how would he write the accusative singular pādam*, much less the rest of its paradigm, if there are no picturable Sanskrit words that are homophonous with its endings?

The only way out I can see is the acrophonic principle. Pādam could be written as


with two phonograms after 之 <pāt> 'foot'

目 <a>, a drawing of an eye (akṣi)

人 <m>, a drawing of a man (manuṣyaḥ)

but that is much more complicated than the recycling of monosyllabic homophones.

Next: Was Old Chinese really syllabically organized?

*1.14.0:19: Sanskrit pāt 'foot' is from Proto-Indo-European *pōd-s. The final -s was lost (though it was retained in Greek πούς <poús> and Latin pēs, which lost the *-d-). The remaining *-d- devoiced to *-t in final position but remained intact before vowel-initial endings: e.g., pād-am.

Tangut fonts by Mojikyo.org
Tangut radical and Khitan fonts by Andrew West
Jurchen font by Jason Glavy
All other content copyright © 2002-2013 Amritavision