I completely forgot to note the 10th anniversary of my blog on August 31st until a week ago, and I kept putting off this post to celebrate that until today.

I've been talking to SF author Robinson Mason about creating names in fictional languages. As a reader, I lose track of who's who (and lose interest in a story) when two or more factions have soundalike names and/or no discernible naming patterns. In a visual medium, you can rely on makeup, costumes, etc. to distinguish between different peoples. But in prose, it would be tedious to constantly remind the reader what those people look like. The reader sees character names many more times than character descriptions, so the names must not merely 'sound nice'; they must be meaningful in the sense that they immediately indicate the character's affiliation and gender. I consider the names of the following two peoples to be successful.

The Urdreh of Hadanum

Robinson's planet of Hadanum has several species of genetically engineered posthuman colonists. The Urdreh people's names stand out because they all end in -h. Males have -eh names and females have -ah names. (See this post for further details on Urdreh nouns.)


On Superman's homeworld of Krypton, men had names with the structure X-Y and women had names with the structure Z X-Y (often just Z): e.g., Superman is Kal-El, his father was Jor-El, and his mother was Lara (Lor-Van). Women had disyllabic first names (e.g., Lara and Superman's cousin Kara); all other names were monosyllabic.

Your names

You don't have to imitate those precedents. There are other ways to make sure your fictional peoples don't sound alike.

Before you come up with any names, think about the phonetic palettes for their languages. If you were costuming your peoples, you wouldn't let two or more groups wear the same colors at random. One group might wear red, the other blue, etc.

Similarly, when naming characters of multiple origins, you shouldn't let them have similar-sounding names. Remember, as a writer you are in full control of your universe. Use your power, and use it wisely. It's possible for characters to have similar or even identical names, but it's not desirable. You know who's who, but your reader doesn't, and your task is to help the reader construct a straightforward mental map of the cast.

Put yourself in your reader's shoes. Suppose you see that Qwerty and Uiop of Asdf are battling Ghjkl and Zxcv of Bnm. Who? Where? Huh? That's not the kind of reaction you want. Ideally, your reader should be able to figure out on his own who's who at a glance without thinking. Distinct languages facilitate identification. Here are ways to make your languages sound different from each other.

1. Word length

The House of Kamehameha (ten letters, five syllables) in Hawaii and the Lý Dynasty (two letters, one syllable) in Vietnam exemplify two extremes of word length. I don't have to tell you which of these names are Hawaiian and which ones are Vietnamese:

Công Uẩn





Thiên Hinh

Which word for 'house' belongs to which language?



2. Syllable complexity

Hawaiian has long words with many syllables, but each of those syllables has a simple structure:

(C)V(V)


(C stands for consonant and V stands for vowel. Parentheses surround optional elements.)

They can begin with no consonant or a maximum of one consonant. There must be at least one vowel which may be followed by another vowel. No final consonants are possible: e.g., Sảm could not be Hawaiian. Therefore Hawaiian syllables can have one to three letters each.

On the other hand, Vietnamese syllables have a more complex structure from a purely orthographic perspective*:



A Vietnamese syllable can begin and end with consonant letter sequences: e.g., ngoảnh 'to turn back one's head' begins and ends with such sequences (ng-, -nh). Vietnamese allows up to three vowel letters between consonants: e.g., -uyễ in the surname Nguyễn. (I count -y- as a vowel.) A Vietnamese syllable can either have three vowels or two final consonants but not both: e.g., -uyễnh is not possible.

It's possible to create long fictional words with complex syllables, but can your readers remember them? Kamehameha is easier to process - the duplication of meha helps - than Qwert-yups-dif-ghoj-kalz which also has five syllables. (Imagine how much less manageable Qwertyupsdifghojkalz would be without hyphens to indicate syllable boundaries!)

It's also possible to have huge consonant clusters: e.g., Georgian

გვფრცქვნი gvprckvni 'You peel us' (not the most useful word!)

მწვრთნელი mc'vrtneli 'trainer'

But can you remember those words after looking at them only once? Maybe if you're Georgian ...

Keep clusters short and only mildly unfamiliar at most for English speakers. Even a 'mere' four-consonant cluster like brgy- in Written Tibetan བརྒྱད་ brgyad 'eight' might be too much. However, English speakers seem to be able to handle Khmer. It helps that the -h- is decorative for them; they would pronounce a hypothetical Kmer the same way. Perhaps the limits are

- three consonants for familiar clusters found in English words: e.g., str- in strengths (but not the five-consonant sequence -ngths, even if it is familiar!)

- three consonants for unfamiliar clusters: e.g., Written Tibetan གཉའ་ཁྲི་བཙན་པོ་ gNya'-Khri bTsan-po

two of the three should be familiar: e.g., k-r- in khr- or Ts- in bTs-

Fill in these blanks for each language in your world:

Number of syllables in a typical word: _

Number of consonants at the beginning of a syllable: _

Number of vowels in a syllable: _

Number of consonants at the end of a syllable: _

Once you settle on these numbers, stick to them. Don't name three characters Ka, Si, and Tu and then name the fourth Njiop. All four syllables could exist in the same language - notice how Tibetan above has po with one consonant as well as brgyad with four - but breaking your own patterns implies you don't know what you're doing, undermining your reader's confidence in the realism of your world.
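The four blanks above can be turned into a quick consistency check. The following is only a rough sketch: the function name `fits_profile`, the extra `max_syllables` limit, and the shortcut of treating every letter other than a, e, i, o, u as a consonant are my assumptions for illustration, and it works on spellings, not sounds.

```python
import re

VOWELS = "aeiou"  # assumption: every other letter counts as a consonant


def fits_profile(name, max_onset, max_vowels, max_coda, max_syllables):
    """Return True if `name` can be split into syllables that respect the limits."""
    letters = name.lower()
    if not letters.isalpha():
        return False
    consonant = f"[^{VOWELS}]"
    vowel = f"[{VOWELS}]"
    # One syllable: optional onset consonants, at least one vowel, optional coda.
    syllable = (
        f"{consonant}{{0,{max_onset}}}"
        f"{vowel}{{1,{max_vowels}}}"
        f"{consonant}{{0,{max_coda}}}"
    )
    pattern = f"(?:{syllable}){{1,{max_syllables}}}"
    return re.fullmatch(pattern, letters) is not None


# A language of short, simple syllables: one onset consonant, one vowel, no coda.
profile = dict(max_onset=1, max_vowels=1, max_coda=0, max_syllables=2)
for name in ["Ka", "Si", "Tu", "Njiop"]:
    print(name, fits_profile(name, **profile))
```

With that profile, Ka, Si, and Tu pass and Njiop fails, which is the signal to either rename the character or loosen the numbers for the whole language.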

3. Consonant inventory

The number of consonants in human languages varies: e.g., Central Rotokas has six (b, d, g, k, p, t) and West !Xoon (DoBeS) may have 164 (no, I'm not going to list them). An alien language could have zero, but your reader might have a hard time with word after word containing only a, e, i, o, and/or u: e.g., Aeiou from Uoiea, etc. If one takes English consonants as a starting point, one can subtract or add groups of consonants: e.g.,

- drop all voiced obstruents: b, d, g, j, z (none of which are in Hawaiian)

- drop all nasals: m, n, ng (absent from Rotokas)

- add h

- to obstruents: e.g., bh, dh, gh, jh as in dharma

- to nasals: e.g., hm as in Hmong

- allow prenasalized consonants at the beginnings of syllables: e.g., Nkrumah

- allow double consonants at the beginnings of syllables: e.g., Korean 딸 ttal 'daughter' (which contrasts with 달 tal 'moon')

It's also possible to subtract or add individual consonants, but it's hard to do so in a realistic way. Consonant systems are nonrandom. For instance, standard Arabic has no p or g (two common gaps in languages), but I can't think of a language that lacks b and k. So you can't put the twenty-six letters on a dart board, throw a few darts and declare the struck letters to be absent from your language. Well, you could, but the result wouldn't be plausible.
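Subtracting and adding whole groups, rather than throwing darts at single letters, is easy to sketch with sets. The group contents below are copied from the list above; the names `build_inventory`, `lang_a`, and `lang_b` are hypothetical, and the starting point is the English consonant letters, not English sounds.

```python
# Starting point: the consonant letters of the English alphabet.
ENGLISH_CONSONANTS = set("bcdfghjklmnpqrstvwxyz")

# Groups as listed in the text. Digraphs like 'ng' are stored as whole strings.
VOICED_OBSTRUENTS = {"b", "d", "g", "j", "z"}
NASALS = {"m", "n", "ng"}
ASPIRATED_OBSTRUENTS = {"bh", "dh", "gh", "jh"}
ASPIRATED_NASALS = {"hm"}


def build_inventory(base, drop_groups=(), add_groups=()):
    """Return a new consonant inventory with whole groups removed or added."""
    inventory = set(base)
    for group in drop_groups:
        inventory -= set(group)
    for group in add_groups:
        inventory |= set(group)
    return inventory


# Language A: no voiced obstruents and no nasals.
lang_a = build_inventory(ENGLISH_CONSONANTS, drop_groups=[VOICED_OBSTRUENTS, NASALS])

# Language B: keeps everything and adds h-series consonants.
lang_b = build_inventory(
    ENGLISH_CONSONANTS, add_groups=[ASPIRATED_OBSTRUENTS, ASPIRATED_NASALS]
)

print(sorted(lang_a))
print(sorted(lang_b))
```

Because each change removes or adds a whole class, the two inventories stay systematic, and names drawn from them are less likely to look like the result of a dart board.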

4. Vowel inventory

English has five vowel letters - six including y.

One could subtract vowels. Proto-Indo-European is often reconstructed with only two vowels, e and o, though I am skeptical about that. The other three vowels (a, i, u) are the only vowels possible in Classical Arabic. (Hence the Classical Arabic equivalents of Mohammed and Koran are Muhammad and Qur'an without e or o.) An Arabic-like a-i-u system may be the best bare minimum since two-vowel systems are controversial.

One could add vowels

- by doubling them: aa, ee, ii, oo, uu

but doubling them entails having single vowels a, e, i, o, u; no point in doubling a vowel if it has no single counterpart

- by using w as well as y as a vowel symbol (as in Welsh and Hmong)

- by doing both: i.e., having ww and yy as well as w and y

- by creating graphic diphthongs: e.g., e-sequences like ae, oe, ue

- by allowing syllabic consonants like l, m, n, ng, r: e.g., Czech strč prst skrz krk 'stick your finger through your throat'

5. Augmented alphabet

I recommend sticking with the basic twenty-six letters which are easier for you to type and for the reader to remember. But for the sake of completeness, you could

- use lower and upper case letters to represent different sounds: e.g., in the Harvard-Kyoto romanization of Sanskrit, lower case a and upper case A are two different vowels

but this means you can't capitalize names anymore or you'll be using capitals for two different purposes

- use punctuation marks as letters: e.g., the ! stands for a click in !Xoon

- use numbers as letters: e.g., 3 for ع in the Arabic chat alphabet

- add diacritics: e.g., ă, â, à, á, ạ, ả, ã as in Vietnamese, etc.

- use letters outside the twenty-six like ŋ, ɔ, θ, etc.

The last two options are particularly troublesome because there is no guarantee that ebook readers or even computers will support all the characters that you can type into your word processor.

And even though apostrophes and the like will appear without any problems, what I call n'a'm'e's have become a cliche (e.g., Na'vi). It's not worth the trouble to hit the apostrophe key hundreds or even thousands of times in your stories when your readers will ignore them. (How many times has Na'vi been misspelled as Navi?)

Be sure you create spellings you can live with. Q́w̋ẹ̆řţŷ may look cool now, but do you really want to copy and paste it or search and replace it throughout your book even though people will just perceive it as Qwerty? If you really want a language full of diacritics, fine, but you can reserve that for an appendix with precise spellings for the hardcore fans and use a 'lay spelling' in your stories, just as Việt Cộng is normally written as Viet Cong in English. (The circumflexes and subscript dots are crucial in Vietnamese but are decorative at best in English.)
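If you do keep precise spellings in an appendix, the 'lay spelling' can be derived mechanically instead of retyped. This is a minimal sketch using Python's standard `unicodedata` module; the function name `lay_spelling` is mine, and it only removes combining marks, so letters that are not built from a base letter plus a mark (đ, ŋ, etc.) pass through unchanged.

```python
import unicodedata


def lay_spelling(text):
    """Strip combining diacritics, keeping the base letters."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers accents, dots, hooks, and the like.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


print(lay_spelling("Việt Cộng"))
```

Running this once over a manuscript is safer than search and replace by hand, since the precise spelling remains the single source.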

In real life, it's possible to have two names that differ only in terms of diacritics, e.g., the Chinese Jìn Dynasty (265–420 AD) and the Jurchen Jīn Dynasty (1115–1234), but your world shouldn't have such names. Even pairs of names that differ only by one letter might be dangerous: e.g., the Qin Dynasty (221–206 BC) and the Qing Dynasty (1644–1911).
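A cast list can be screened for such near-collisions automatically. The sketch below is one possible approach, not a standard tool: it strips diacritics, then flags any pair of names whose spellings are at most one insertion, deletion, or substitution apart. The helper names and the threshold of 1 are assumptions.

```python
import unicodedata
from itertools import combinations


def strip_marks(name):
    """Lowercase a name and remove combining diacritics."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    ).lower()


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def risky_pairs(names, max_distance=1):
    """Return pairs of names that readers are likely to confuse."""
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if edit_distance(strip_marks(a), strip_marks(b)) <= max_distance
    ]


for pair in risky_pairs(["Jìn", "Jīn", "Qin", "Qing", "Kamehameha"]):
    print(pair)
```

Note that this check is purely orthographic. It will also flag Jìn against Qin, which is arguably correct for readers who cannot hear the difference.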

Working backwards

What if you already have names and words that you want to use? You could do what I did with Robinson's world of Hadanum: collect the existing material, look for patterns, and amplify them. For instance, say your story has two peoples, the Exa and the Mpl, and you've named eight of them.

Exa names: Hypo, Theti, Cal, Jnris

- mostly disyllabic

Mpl names: Nda, Ti, Vr, Xuzm

- mostly monosyllabic

- often containing prenasalized consonants (Mp-, Nd-)

- often containing syllabic consonants (l, r, m)

Here's how they could be improved:

- Change Cal to something disyllabic to fit the pattern of the others

- Jnris looks more Mpl than Exa due to the initial consonant sequence; make it Genris

- Ti is the simple odd man out among the Mpl names; complicate it a bit by adding an N-: Nti

- Xuzm is too long for an Mpl name; shorten it to Xm or Zm

Now suppose there are two new characters, Neos and Sn. Guess who's an Exa and who's an Mpl.
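Once the patterns are amplified, even a crude rule can sort the names. As a toy illustration (the rule and the function name `guess_people` are my own invention, not a general method): count the vowel letters, including y, and call anything with two or more an Exa name.

```python
VOWEL_LETTERS = set("aeiouy")


def guess_people(name):
    """Guess Exa vs. Mpl from the number of vowel letters in the spelling."""
    vowel_count = sum(1 for ch in name.lower() if ch in VOWEL_LETTERS)
    # Exa names are disyllabic; Mpl names are short and lean on syllabic consonants.
    return "Exa" if vowel_count >= 2 else "Mpl"


revised = ["Hypo", "Theti", "Genris", "Nda", "Nti", "Vr", "Xm"]
for name in revised + ["Neos", "Sn"]:
    print(name, guess_people(name))
```

The same rule assigns the unrevised Jnris to the Mpl, which is exactly why that name needed fixing. If a reader could apply a rule this simple and get the right answer, the naming scheme is doing its job.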

Obviously the above example (Exa and Mpl!) was meant in jest. I hope your serious names are better. Have fun!

*In other words, ignoring the fact that two-letter sequences in Vietnamese can represent single sounds: e.g., ng represents a single consonant [ŋ]. Similarly, -nh represents a single consonant [ñ] (as in Spanish), not n followed by an h.

SINO-TANGUT RETROFLEX VOWELS

At first one might be surprised that there are retroflex vowels in Chinese loanwords in Tangut, since such vowels were conditioned by an *r that is absent from most Middle Chinese reconstructions (and do not correspond to the *r that is in Pulleyblank's 1984/1991 Late Middle Chinese reconstruction). One might conclude that if these are truly loanwords rather than native lookalikes, their retroflexion might have originated from Tangut *r-prefixes added to Chinese roots. However, that need not be the case. There are at least three potential Chinese sources for Sino-Tangut retroflex vowels. The first is very likely and the other two are much more shaky.

1. Northwestern Late Middle Chinese *-r

Tibetan transcriptions indicate that northwestern Late Middle Chinese (NWLMC) had an *-r corresponding to what is still *-t in the south today. This NWLMC *-r corresponds to vowel retroflexion in the following words identified as loans from Chinese by Gong (1981: 776-777):

1daʳ < NWLMC 達 *dar 'to reach' (Tibetan transcriptions: dar, Hdar)

1khaʳ < NWLMC 渴 *khar 'dry'

2saʳ < NWLMC 撒 *sar 'to spread, break up' (see section 3 below for an alternate etymology)

2dziəʳ < NWLMC 疾 *dzir 'rapid' (Tibetan transcription tshir dates after NWLMC *dz- devoiced to *tsh-)

1ʐɨəʳ < NWLMC 實 *ʐir 'solid' (Tibetan transcriptions: shir, shɨr, zhir)

According to Gong (2002: 374-377), the Chinese spoken in the Tangut Empire no longer had final *-r. Therefore this class of loans must originate from the preimperial period. Imperial period loans have no trace of *-r: e.g.,

1vɨə < Tangut period NW Chinese *fə < *vur < *but 'Buddha'

2. Northwestern Late Middle Chinese *-ɣ

Li Fanwen (2008: 802) derived

1ɣwəʳ 'crane'

from Chinese 鶴 which was read as *ɣak in the preimperial period. This final gamma is from an earlier *-k retained in the south today: e.g., Cantonese hok. Final gamma was lost in Tangut period northwestern Chinese.

Are there any other examples of this correspondence, or are the Tangut and Chinese words lookalikes? There is no Chinese basis for -w- (which may be from a labial-initial presyllable) or -ə- instead of -a-. If I did not know about Li's etymology, I would derive the word from *Pʌ-Kər. The *K may have been *x if this word is cognate to

1xwəʳ 'crane'

with a voiceless initial. *-x- lenited (i.e., voiced) to -ɣ- between vowels before the presyllable was lost:

*Pʌ-xər > *Pʌ-ɣər > *Pɣər > *ɣwər > 1ɣwəʳ

1xwəʳ may be from a variant which lost its presyllabic vowel:

*Pʌ-xər > *Pxər > *xwər > 1xwəʳ
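The two derivations differ only in rule order: lenition needs a vowel before *-x-, so it cannot apply if the presyllabic vowel is lost first. Here is a rough sketch of that ordering effect. The rule names and regular expressions are my simplifications for this one word (asterisks and tone numbers are left out), not a general model of pre-Tangut.

```python
import re

# Each rule is a (pattern, replacement) pair applied to the whole form.
RULES = {
    # *x > *ɣ between vowels (across the presyllable boundary)
    "lenition": (r"([ʌə])-x(?=[ʌə])", r"\1-ɣ"),
    # the presyllable loses its vowel
    "vowel_loss": (r"^Pʌ-", "P"),
    # the labial preinitial survives only as -w- after the initial
    "labial_shift": (r"^P([xɣ])", r"\1w"),
    # final *-r becomes vowel retroflexion
    "retroflexion": (r"r$", "ʳ"),
}


def derive(form, order):
    """Apply the named rules in order and return every intermediate stage."""
    stages = [form]
    for name in order:
        pattern, replacement = RULES[name]
        form = re.sub(pattern, replacement, form)
        stages.append(form)
    return stages


voiced = derive("Pʌ-xər", ["lenition", "vowel_loss", "labial_shift", "retroflexion"])
voiceless = derive("Pʌ-xər", ["vowel_loss", "lenition", "labial_shift", "retroflexion"])
print(" > ".join(voiced))
print(" > ".join(voiceless))
```

In the second ordering the lenition step finds nothing to change, so the initial stays voiceless all the way down to the attested form.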

3. Northwestern Late Middle Chinese *-n

Li Fanwen (2008: 485) derived

2saʳ 'to spread, break up' (mentioned in my last post)

from Chinese 散 which was read as *sàn in the preimperial period rather than from 撒 *sar (see section 1 above). Most -VN rhymes became nasal vowels in Tangut period northwestern Chinese. If Li's etymology is correct, it is odd that -n might have been borrowed as pre-Tangut *-r, though there may be parallels in the Japanese usage of Middle Chinese *-n graphs to write Old Japanese CVrV sequences*:

雲箇 MC *wun kah for OJ Uruka

雲潤 MC *wun for OJ Urumi

駿河 MC *tswinh ɣa for OJ Suruŋga

and soroban 'abacus' from a Song Chinese 算盤 *sonban long after *-r was thought to have disappeared from Old Chinese. (The modern -r of Mandarin is not related.) Could *-r have survived in colloquial Chinese pronunciation?

播磨 MC *pah ma (without -n!) for OJ Parima

does not make sense unless one knows that MC *pah is from OC *pals ~ *pars. Whoever devised that spelling for OJ Pari must have had a post-OC *par-like reading of 播 in mind.

駿 is from OC *tsurs, so perhaps whoever chose it for OJ suru had a post-OC *tsur-like reading in mind. However, I know of no reason to reconstruct 雲 with *-r in OC**, so its use for OJ uru is a mystery to me. Perhaps some of these uses reflect an -r-/-n-alternation in Paekche and have nothing to do with Chinese.

9.22.00:27: Li Fanwen (2008: 389) also derived

1sã 'scattered'

from Chinese 散 *sàn. Was 散 borrowed twice, once as 1sã and again as 2saʳ (with a different tone as well as a different vowel)? The most straightforward solution is to assume the Tangut vowel qualities reflect different Chinese rhymes:

1sã < 散 *sàn

2saʳ < *sar (following Gong 1981 rather than Li Fanwen 2008)

*Thanks to John Bentley for the examples.

**Baxter and Sagart (2011: 70) reconstruct 雲 as *ɢʷən {*[ɢ]ʷə[n]} with *-n but reconstruct its phonetic 云 as OC  *ɢʷər {*[ɢ]ʷə[r]} with *-r. Perhaps the OJ spellings support an *-r-reconstruction for 雲 as well as 云.

I reconstruct OC *-r as the source of MC *-n in phonetic series that have the MC alternations

*-n ~ *-Ø (disregarding final *-ʔ/-h): e.g.,

番 MC *phuan < OC *Cɯ-phar

播 MC *pah < *pals ~ *pars

*-n ~ *-j: e.g.,

軍 MC *kun < OC *kʷər

揮 MC *xuj < OC *xʷəl ~ *xʷər

Neither alternation is in the 云 series.

The word 云 'to say' which I would reconstruct as OC *wən is probably somehow related to OC 曰 *Cɯ-wat 'to say' and perhaps 話 'to speak' which could be reconstructed as OC *wrat-s < *r-wat-s < *T-wat-s.

9.22.10:18: Another cognate is 謂 OC *wət-s 'to say'. All of these words can be derived from a root √w-t:

Sinograph Prefix First root consonant Root vowel Second root consonant Suffix
*Cɯ- w- a -t
*T- -s


The significance of the root vowel alternations and the affixes is unclear. Sagart (1999: 113) regarded the *-r- of *wrat-s as durative.

THE ORIGINS OF RETROFLEX VOWELS IN TANGUT AND KALASHA

Retroflex vowels have been in Tangut reconstructions at least since Nishida (1964). (I would like to know if such vowels appeared in earlier publications.) They reflect the fact that certain Tangut rhymes (most of Sofronov 1968's 'second small cycle') were transcribed in Tibetan with

- preinitial r-: e.g., rkyi(H) for

1kiʳ 'strong'

- final -r: e.g., zar for

2saʳ 'to spread, break up'

(9.21.2:10: The Tibetan transcription implies that z had merged with s in the writer's Tibetan dialect.)

Initial r- is almost exclusively before those rhymes. I reconstructed three pre-Tangut sources of retroflex vowels:

- preinitial *r-: e.g.,

*r-ləə > 1lɨəəʳ 'four'

- initial *r-: e.g.,

*Cɯ-re > 1rieʳ 'horse'

- final *-r: e.g.,

*kaar > 1kaaʳ 'scale' (left); 'to measure' (right); cf. gDong-brgyad rGyalrong kɤ-skɤr

The Tibetan transcriptions may reflect Tangut dialect(s) with partial retention of these sources (in careful pronunciation?).

Oddly medial *-r- did not condition vowel retroflexion; instead it conditioned vowel lowering (as in Chinese) or fused with dentals into retroflex affricates: e.g.,

*brə > *1bʌ 'willow' (cf. gDong-brgyad rGyalrong qa-ʑmbri)

*k-truk > 1hɨiw 'six' (cf. Written Tibetan drug)

Kalasha is a living language with retroflex vowels. I think I first mentioned it on this blog in 2008. Back then, I wrote:

Note that Kalasha nasal retroflexion originates from an earlier retroflex nasal ɳ, whereas Tangut nasal retroflexion presumably originated from nasal and rhotic segments in the same syllable: e.g.,

kõʳ R97 2.82 'tooth'

may be from pre-Tangut *rkoNH.

I would still reconstruct 'tooth' the same way*.

Tonight I discovered that an earlier retroflex stop ɖ was the source of Kalasha oral retroflex vowels in Arsenault & Kochetov (2009: 18):

ʂeʳa 'blind' < *śrēɖa- 'squinting' (unattested in Skt but reconstructible)

kuʳaʳ˞k 'little child' < *kuɖa- 'boy, son' (unattested in Skt but reconstructible)

pẽʳ 'palm of hand' < Skt pāɳí- 'hand'

bõʳ 'arrowhead, bullet' < Skt bāɳá- 'arrow'


*Vɖ > Vʳ

is not unlike

*Vr > Vʳ

in Tangut. Did Kalasha medial *-r- also condition retroflex vowels, or were *ɳ and *ɖ the only sources? What happened to medial *-ʈ-, *-ʈh-, and *-ɖh- in Kalasha?

*9.21.1:02: Is Tangut 2kõʳ < *rkoNH 'tooth' cognate to 雅江却域 Yajiang Queyu ku 'id.'? The STEDT database lists both the Tangut and Queyu words as descendants of Proto-Tibeto-Burman *s-ŋa 'tooth'. I think this etymology is unlikely unless there are other cases of Tangut and Queyu k- corresponding to ŋ in other languages. Could the root of the Tangut and Queyu words be Proto-Tibeto-Burman *gam 'jaw' (though the semantic fit is loose and the initial voicing doesn't match)? (I am not convinced Proto-Tibeto-Burman exists; if it doesn't, the 'PTB' roots may actually be Proto-Sino-Tibetan.)

'ITALIAN WONTON'

Cappuccino, an Italian restaurant in Ho Chi Minh City, calls tortellini hoành thánh Ý 'Italian wonton'. Ý is 'Italian' (from Chinese 意 'id.'), but why is 'wonton' hoành thánh? None of the Chinese characters for words for 'wonton' are pronounced hoành thánh in Vietnamese: e.g., 雲吞 is vân thôn and 餛飩 is hồn đồn.

My guess is that hoành thánh is an ear borrowing (no characters involved) from some Chinese language with a word like [hwan than] for 'wonton' into a southern Vietnamese dialect with [n] for -nh ([ɲ] in the north). Cantonese 雲吞 wan than is close to [hwan than], but it is unlikely that 雲 wan ever had an h-. Sino-Vietnamese 雲 vân without h- was probably borrowed from early Cantonese or a dialect closely related to it. Could the h- of hoành be due to confusion with an h-word for 'wonton' in some other Chinese language? The rising sắc tone of thánh does not correspond to the level or falling tone of Cantonese than.

Vằn thắn 'won ton' is clearly an ear borrowing from Cantonese 雲吞 wan than. It has the same tones as hoành thánh, and the final -n of both syllables indicates that its spelling is of northern origin. If a southerner had coined the spelling vằn thắn, it would be pronounced [vaŋ thaŋ] with two [ŋ] absent from Cantonese wan than.

9.19.2:27: Other Vietnamese restaurants use the term hoành thánh Ý to refer to ravioli. This practice probably originated in Chinese:

ravioli and tortellini are sometimes collectively referred to as "Italian jiaozi" (意大利饺子) or "Italian wonton" (意大利馄饨).

GREAT *KO-VIỆT?

It took almost two decades for me to realize that the mysterious second syllable of 大瞿越 Đại Cồ Việt 'Great ... Viet' (968-980 AD) might represent a presyllable *ko- rather than the second of three words. There are at least two problems with this interpretation.

First, I would not expect an *o in presyllables which typically have a limited range of vowels (e.g., a, i, u in Pacoh [Watson 1964: 144] and Jeh [Cohen 1966] or just à, "a total neutralization of all points of vowel articulation"*, in Halăng [Cooper 1966: 98]). None of those Mon-Khmer languages are Vietic. Are there any Vietic languages with o in presyllables? Kri has a, i, u like Pacoh (Enfield & Diffloth 2009: 35).

Perhaps 瞿 was chosen because it can mean 'to fear' (like 懼), so 大瞿越 implied 'Great Fearsome Viet'. (I can't remember where I first saw this idea - was it in DeFrancis' Colonialism and Language Policy in Viet Nam or an article by Nguyễn Đình Hoà - possibly in Language in Vietnamese Society?). If 瞿 represented a presyllable like *kə-, 瞿 might have been more meaningful than better phonetic matches like 基 [kəː] 'basic'.

Second, I do not know what the function of the presyllable *kV- was. It is impossible to guess solely on the basis of one example without a gloss.

My Old Chinese reconstruction requires a presyllable in 越 *Cɯ-wat to condition the breaking of *a:

*Cɯ-wat > *Cɯ-wɨat > *wɨat > *wiet (> borrowed into Vietnamese > Việt)

Could 瞿 be an attempt to write that Chinese presyllable? I doubt it, unless the presyllable survived in the colloquial southern Late Middle Chinese word corresponding to literary *wiet as late as the 10th century. Furthermore, the vowels do not match: *ɯ is a cover vowel for an unknown high vowel, whereas the *o implied by 瞿 Cồ (if taken at face value) is nonhigh. Finally, it is not clear what the initial of the Chinese presyllable was. Other members of the phonetic series of 越 may have had a variety of presyllabic initials. (Underlining indicates 'emphasis': i.e., pharyngealization.)

| Sinographs | Early Old Chinese | Middle Old Chinese | Late Old Chinese |
|---|---|---|---|
| 戉鉞越 (as in 越南 'Vietnam') | *Cɯ-wat | *Cɯ-wɨat | *wɨat |
| 越 'plait' | *wat | *wat | *ɣwat |
| | *sɯ-wat | *sɯ-wɨat > *swɨat | *xwɨat |
| | *s(ʌ-)wat | *hwat | *xwat |
| | *sɯ-wats | *sɯ-wɨats > *swɨas | *xwɨaɕ |
| | *sɯ-wats | *sɯ-wɨats > *sɯ-wɨas | *swɨaɕ |
| | *ʔɯ-wats | *ʔɯ-wɨats > *ʔ(ɯ-)wɨas | wɨaɕ |
| | *kɯ-wats | *kɯ-wɨats > *kɯ-wɨas | *kwɨaɕ |

越 may or may not have had a *k-presyllable like 劌.

*However, "[w]hen reduplication is present, any short vowel (i, e, a, u, o) may occur" in Halăng presyllables.

TRA-SỢI-NG A THREAD

When I first encountered the Vietnamese word sợi [ʂəːj] 'thread' almost twenty years ago, I immediately thought of Middle Chinese (MC) 絲 *sɨ 'id.' MC *-ɨ corresponds to -ơi [əːj] in

Viet thời : MC 時 *dʑɨ 'time'

However, Vietnamese s [ʂ] corresponds to MC retroflex *ʂ, not alveolar *s, in loans from Chinese. I know of no MC *ʂ-initial word for 'thread'. And s- is from *Cr-clusters in native words. See my post on 's-implification' for details. Moreover, the nặng tone indicates an earlier voiced initial and a glottal stop coda absent from MC *sɨ. So sợi must be from an earlier *C [+voice] rəːjʔ.

There are four likely candidates for the first consonant: *b, *d, *ɟ, and *g. The ideal nom spelling of sợi for a historical phonologist would have a double-phonetic structure: one phonetic for the initial consonant *C- combined with an l-phonetic for *-rợi*. However, the only spellings I found at nomna.org are semantophonetic compounds with s-phonetics:

糸 'thread' + 士 sĩ

糸 'thread' + 仕 sĩ

If I did not have any further evidence, I might guess that the first consonant was *ɟ-, since *ɟr- is closer to s- [ʂ] than *br, *dr, or *gr. However, the comparative data in the SEAlang Mon-Khmer Comparative Dictionary points toward *g-:

Cuoi Cham Tho kʰrəːj⁴

Hoa Binh Muong kʰəːj⁴

Bi and Son La Muong kʰɨəj⁴⁶

I assume the even tone numbers (4, 46) point to an earlier *voiced initial. 46 may refer to a tone category that is a merger of 4 and 6 (cf. a similar merger in Chinese between yangshang = 4 and yangqu = 6).
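The tone reasoning in this post amounts to a small lookup table. The sketch below assumes the usual tonogenesis model for Vietic, in which the voicing of the earlier initial picks the register (odd versus even tone number) and the earlier coda picks the contour. The nặng, sắc, and ngang cells follow what this post says; the huyền cell is filled in from that model as an assumption, and the function names are hypothetical.

```python
# (earlier initial voiced?, earlier glottal stop coda?) -> (tone number, Vietnamese tone)
TONES = {
    (False, False): (1, "ngang"),
    (True, False): (2, "huyền"),  # assumption from the standard model
    (False, True): (3, "sắc"),
    (True, True): (4, "nặng"),
}


def expected_tone(voiced_initial, glottal_coda):
    """Return the (number, name) of the tone predicted for an earlier syllable type."""
    return TONES[(voiced_initial, glottal_coda)]


def implies_voiced_initial(tone_label):
    """True if every digit of a tone label such as 4 or 46 is even."""
    return all(int(digit) % 2 == 0 for digit in str(tone_label))


print(expected_tone(voiced_initial=True, glottal_coda=True))
print(implies_voiced_initial(46))
```

Merged categories like 46 are handled digit by digit, so a label counts as evidence for a voiced initial only if both of its source categories are even.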

I was surprised that Ferlus reconstructed Proto-Vietic *k-rəːjʔ  with a voiceless initial. Wouldn't *k-rəːjʔ have developed into Vietnamese sới with a sắc tone (= tone 3; *voiceless-initial syllables developed odd-numbered tones)? Ferlus' PV *k-roːŋ 'river' with the same initial cluster became Viet sông with a ngang tone (= tone 1) as expected. Perhaps the different tones reflect different patterns of presyllabic collapse:

| Gloss | Proto-Vietic: Stage 1 | Stage 2 | Stage 3 | Modern languages: Stage 4 |
|---|---|---|---|---|
| thread | *kə-rəːjʔ | *k-rəːjʔ (presyllabic vowel loss) | *grəːjʔ (preinitial assimilation to voiced *r) | r-retention, aspiration**: Tho kʰrəːj⁴; -r- > -ʰ-: Muong kʰəːj⁴, kʰɨəj⁴⁶; fusion: Vietnamese sợi |
| river | *kə-roːŋ | *kroːŋ (presyllabic vowel loss) | | r-retention, aspiration**: Tho rɔŋ¹; -r- > -ʰ-: Muong kʰoːŋ¹; fusion: Vietnamese sông |

The nom spellings of sợi with s-phonetics must date after fusion (i.e., 's-implification') in Vietnamese. Are there any pre-fusion spellings of that word?

*The glottal stop was probably gone by the time the nom script was developed in the centuries after the end of the first Chinese domination. Vietnamese probably had a true tonal system by the end of colonial rule.

**If Tho has no kr- contrasting with kʰr-, then the aspiration is nonphonemic: /kr/ = [kʰr]. Could *-r- have devoiced to assimilate with the preceding *k-: *kr- > [kr̥], perceived as [kʰr]?

*S-IMPLIFICATION IN VIETNAMESE

The Vietnamese initial s- [ʂ] is from earlier *Cr-clusters, not a simple *s-. These clusters are implied by double-phonetic nom spellings such as

巨 atop 郎 (cự + lang) for *krang, now sang 'noble'

In some cases, the initial consonant is unclear in single-phonetic nom spellings: e.g.,

瀧 (< 氵'water' + 龍 long)  for *Crong, now sông 'river'

I know of no *k-spelling for 'river'; the *k- has to be inferred from other Vietic forms like

Cuoi Cham Tho rɔŋ (but a Vietnamese-like ʂɔːŋ in Lang Lo Tho!)

Pong lɔːŋ

Muong oːŋ

from the SEAlang Mon-Khmer Comparative Dictionary. Ferlus reconstructed their ancestor as Proto-Vietic *k-roːŋ ~ *k-rɔːŋ.

Not all modern Vietnamese s-words have l-phonetic spellings in nom: e.g., sẽ 'will' and its homophones are only written with 仕 sĩ (which may be accompanied by 口 'mouth' or 后 'later'). Does this mean that sẽ [ʂɛ] goes back to *srɛ̃* or was the word an innovation postdating the 's-implification' of *Cr-clusters in Vietnamese? In any case, there is no nom evidence supporting the velar initial of Thompson's (1976: 116) reconstruction of Proto-Viet-Muong *ɛC.

Was Sino-Vietnamese s- [ʂ] originally *sr-, an approximation of Chinese retroflex *ʂ-: e.g., was  仕 sĩ [ʂi] once *srĩ?

If so, one might similarly expect Sino-Vietnamese tr- to be from earlier *tr-, an approximation of Chinese retroflex *ʈ-, but native Vietnamese tr- is from *Cl-** as indicated by nom and comparative evidence: e.g.,

巴 atop 賴 (ba + lại) for *plái, later Middle Vietnamese blái and now trái 'fruit'

cf. Muong plaːj 'id.'

Was Sino-Vietnamese tr- once *tl-? But why would Chinese *ʂ- be borrowed as an *r-cluster while Chinese *ʈ- was borrowed as an *l-cluster? Moreover, de Rhodes' Middle Vietnamese dictionary has tr-, not tl-, for modern Sino-Vietnamese tr-. This implies that modern Sino-Vietnamese tr- and modern native tr- have different origins.

At the moment I think the Chinese retroflexes might have been borrowed as retroflexes that later merged with native *r- and *l-clusters:

Proto-Vietic before Chinese borrowings Old Vietnamese with Chinese borrowings Middle Vietnamese Modern Vietnamese
(no retroflexes) *ʂ- s- [ʂ]
*Cr- *Cr-
(no retroflexes) *ʈ- tr- [ʈ]
*Cl- *Cl- bl-, tl-, tr- tr- ~ gi-

The development of Vietnamese retroflexes under Chinese influence is superficially reminiscent of the development of Sanskrit retroflexes under Dravidian influence, but in fact the 's-implification' and the *Cl- to tr-shift both occurred after the end of Chinese rule, whereas Indic and Dravidian speakers have lived side by side for millennia.

*The tone of *srɛ̃ indicates an even earlier voiced initial: e.g., *zrɛ̃ with a *z- that may not have ever existed in Vietnamese (and was probably not in native words) or perhaps *ɟrɛ̃ with a palatal stop *ɟ-. The Middle Chinese initial of 仕 was a retroflex affricate *dʐ-.

9.18.7:27: There is no nom evidence for reconstructing a cluster with a grave initial (*br-, *gr-) in the ancestor of sẽ.

**Excluding *ml- which became l- or nh-, not tr-: e.g.,

理 Old Chinese *mʌ-rəʔ > *mrɛʔ > Middle Vietnamese mlẽ > Modern Vietnamese lẽ ~ nhẽ 'reason'

A variant of Old Chinese *mʌ-rəʔ without the presyllable became Middle Chinese *liʔ, the source of Sino-Vietnamese lý 'reason'.

*SII[ʔ]-COND HAND STORE

Old Vietnamese voiceless *t- regularly became a Middle and Modern Vietnamese voiced implosive đ- [ɗ]. (This was part of a larger set of consonant shifts.) Sino-Vietnamese morphemes were borrowed during the Old Vietnamese period and were therefore subject to this change: e.g.,

點 Late Middle Chinese *tiém > Old Vietnamese *tiểm* > Modern Vietnamese điểm 'dot'

I would expect

店 Late Middle Chinese *tièm 'store'

to correspond to Modern Vietnamese điếm, and it does, but 'store' is also tiệm, which by regular sound change should go back to an Old Vietnamese *ziệm. I don't think *ziệm ever existed. I suspect tiệm was borrowed from a southern Chinese *tiem (cf. modern Cantonese tim) after the Vietnamese *t-to-đ-shift was complete sometime between c. 900 (when most Sino-Vietnamese morphemes were borrowed) and 1600 (i.e., Middle Vietnamese). The irregular tone (nặng instead of sắc) of tiệm is an attempt to imitate the tone of that later Chinese word which was phonetically (but not phonemically) different** from the tone of the source of the earlier borrowing điếm; it is not an indicator of an earlier voiced initial which usually conditions the nặng tone.

An unexpected t- instead of đ- is also in the native Vietnamese word tay 'hand' which should come from an earlier *say. I would have expected *đay < *tay on the basis of other Mon-Khmer forms with t-: e.g., Mon တဲ <tai> toa. I've never seen  any Vietic forms with initials from *t-, so I'm not surprised that Ferlus reconstructed Proto-Vietic *siː. However, I was surprised that Shorto (2006 #244, 66) reconstructed *s- ~ *t- variation in 'hand, arm' all the way back to Proto-Austroasiatic:

*sii[ʔ] ~ *t{1}iiʔ

(What's the *{1} for?)

The only non-Vietic forms with s- that he lists are Munda: Sora sʔiː-n and Pareng sʔiː. Did an *s-variant of 'hand' really only survive on the western and eastern edges of the Austroasiatic-speaking world: i.e., India and Vietnam?

*For ease of comparison, I use Middle and Modern Vietnamese tone diacritics for Old Vietnamese even though the latter only had three phonemic tones.

**店 belongs to the 'departing tone' category throughout Chinese languages, but the phonetic realization of this tone varied and continues to vary across space and time. There is no guarantee that the source of the older borrowing điếm is ancestral to the source of the later borrowing tiệm.

ADDENDUM: Pan Wuyun reconstructed the Old Chinese readings of 店 as

*k-liims (eastling.org)

*k-leems (Thesaurus Linguae Sericae)

Leaving aside the question of whether the word even existed in Old Chinese***, what is the reasoning behind *k-l- instead of a simple *t- like Zhengzhang's *tiims or my own *tems? I suppose the *l- is meant to accommodate 阽 Middle Chinese *jiem which probably had a liquid initial in Old Chinese:

Pan *[g]lem (eastling), *k-lem (TLS)

Zhengzhang, Baxter and Sagart *lem

My *Cɯ-lem (with a high vowel presyllable to condition the breaking of *e to *ie)

But why *k- as opposed to a generic *C-? And should a mostly *t-phonetic series (GSR 618 占) be interpreted as a liquid series even though *t- is not a typical liquid series initial? If I had to reconstruct a liquid throughout this series, its archetypal initial would be *tɯ-l-

Middle Chinese | Old Chinese | Example sinographs
*t- | *tl- (early fusion of *tɯ-l-? partly from *tʌ-l-?) (but *t- for 店?) |
*ʈ- | *tɯ-r- or *Tɯ-tl- > *rtl- |
*ʈh- | *tɯ-hr- or *Tɯ-t-hl- > *rthl- | 佔覘
*tɕ- | *Cɯ-tl- or *tɯ-l- |
*ɕ- | *sɯ-tl- or *s-tɯ-l- > *stl- or *sɯ-l- > *hl- |
*j- | *tɯ-l- |
*n- | *N-tl- (but *n- for 鮎?) | 拈(鮎?)
*ɳ- | *nɯ-r- or *Tɯ-n-? | (粘黏?)

Sinographs in parentheses are not attested in Early or Middle Old Chinese; their phonetics were chosen long after the initial of 占 had simplified to *t-, so there is no need to reconstruct *tɯ-l-type initials for them.

The tone of Mandarin 拈 niān implies an earlier *hn- which might be from *s-N-tl-.

Jiyun (1037) listed 溓 as a variant of 粘/黏. 溓 was normally read with *l- in the 11th century, so this might be evidence of *ɳ- ~ *l-confusion. (I wonder if such confusion also underlies the choice of an *l-phonetic for *ɳ-initial 娘 which I can't find in any source predating the Yupian dictionary (6th century AD).)

兼, the phonetic of 溓, can represent velar-initial as well as *l-initial syllables, but the use of 溓 for 粘/黏 is not evidence for Pan's velar-initial reconstructions of 占 because there is no indication that *kC-type clusters still existed in any Chinese language as late as the 11th century.

***Karlgren (1957: 164-165) did not include 店 as a pre-Han character. However, this doesn't necessarily mean the word didn't exist before its inclusion in the Yupian dictionary (6th century AD); perhaps the word simply wasn't written until then. Innumerable words in modern Chinese languages remain 'characterless' even today.

Given the late attestation of 店, there is no way anyone who created the character would have known that 占-graphs originally had *tɯ-l- as their archetypal initial (see above), so it is likely that Middle Chinese 店 *temh went back to an unwritten Old Chinese *tems with a simple *t- rather than *tlems with a *tl-cluster.

OLD SPELLINGS FOR A NEW CHANGE

(On Friday I happened to write this post containing the phrase năm mới 'new year', and I thought it would be appropriate to complete and post it at the beginning of Rosh Hashanah.)

I wondered what the name of Vietnam's Đổi Mới ('Innovation'; lit. 'New Change') reforms would look like if it were written in an older fashion: i.e., in nom.

nomna.org lists eight different nom characters for đổi 'change'*. Each is a semantophonetic compound:

Semantic element Phonetic element (all three are read as đối 'pair' in Sino-Vietnamese)
忄 'heart' (left-hand version)
心 'heart' (bottom version)
扌 'hand'
易 'change'
昜 'open' (error for 易 'change'?)

(I tried combining the first three 対 cells into one, but couldn't. Is that possible in HTML?)

No contexts are given for the first two. The context for the other six is

đổi chác, trao đổi; thay đổi

'exchange, exchange; change'

I wonder if the three types of semantic elements ('heart', 'hand', 'change') correlate with different uses of đổi. Or are they truly interchangeable?

There is some significance in the different semantic elements used to write mới 'new':

Semantic element Phonetic element Context
none 某 SV mỗ; use as a phonetic for mới implies an extinct Old Sino-Vietnamese *mợi from Late Old Chinese *məjʔ < Early Old Chinese *Cʌ-məʔ; SV mỗ is based on a different Chinese dialect in which *Cʌ-məʔ became Late Old Chinese *moʔ mới cũ 'new and old', còn mới 'still new', mới đến 'newly arrived'
買 SV mãi 'buy' mới làm 'just done'
< with diacritic added (not in nomna.org) (not at nomna.org)
氵 'water' same as 某
始 'begin' (on top, left, or right)
亲 < abbreviation of 新 'new', not to be confused with the simplified Chinese character 亲 from 親 'intimate'** năm mới 'new year'

貝, an abbreviation of 買 SV mãi 'buy', represents the preposition mới 'with'. Both 貝 and 買 can also represent its variant với 'with'. Could the two go back to a shared root *ʔbəəjʔ? *ʔb- regularly becomes m-, and v- could be from a lenited *ʔb- after a presyllable. If I am correct, the Middle Vietnamese spelling for với should be (bới [βəəj] but I can't find (bới, mới, or với in de Rhodes' dictionary.

Nom is often treated as one giant pool of characters. Would studying texts categorized by period and region reveal more patterns of usage among the many 'variant' spellings?

*Vietic cognates for đổi at the SEAlang Mon-Khmer dictionary are

Muong tổi 'to change' (Barker 1966)

Chứt (Rục) tò̰ːl 'to change' (Phu 1998)

I would tentatively reconstruct their common Proto-Vietic ancestor as *toolh.

Shorto's (2006) Mon-Khmer Comparative Dictionary lists their non-Vietic cognates:

1615 *turh to change, exchange. A: (Khmer, Katuic, South Bahnaric, Nicobaric, Viet-Mương)

Khmer ដូរ doː ṭūr to barter, to give change

Kuy toːr to buy

Stieng tuːr to change [places]

Nancowry tóh to change

Mương tổi (Barker 1966 20)

Vietnamese đổi to change (→ Sre ɗuih). = following? (Shafer 1965 406.)

Other possible cognates are

Middle Mon gətah 'to change' (Shorto 1971)

Phong tʰaːj (Bui 2000)

Pacoh toːj (Watson 2009)

**nomna.org lists 亲 and many other characters that are identical to simplified Chinese characters, but I am not certain whether they are true nom characters or were automatically supplied because actual nom characters were in some simplified-traditional Chinese character conversion table. I am particularly skeptical of simplified characters whose structure only makes sense in Mandarin: e.g., I doubt 亿 was a nom simplification of 億 ức 'hundred million' since its phonetic is 乙 which is nearly homophonous with 億 in Mandarin but not Vietnamese:

Sinograph | Mandarin | Vietnamese
乙 | yǐ | SV ất, nom ắc, ậc, át, ắt, hắt, lớt***
亿 | yì | (no 亿 in Vietnamese?)
億 | yì | SV ức

乙 does have other nom readings listed at nomna.org, and some of them are vaguely like ức (ắc, ậc), so it's remotely possible that the Vietnamese devised 亿 on their own as a simplification of 億, but I doubt it.

I also doubt that 艺 was a nom simplification of 藝 nghệ 'art'; no reading of 乙 in Vietnamese sounds like nghệ, though 乙 and 藝 are near-homophones in Mandarin:

Sinograph | Mandarin | Vietnamese
乙 | yǐ | SV ất, nom ắc, ậc, át, ắt, hắt, lớt***
艺 | yì | nom ớt 'pepper'
藝 | yì | SV nghệ, nom nghề, nghế

ớt is probably a true nom character that is not a simplification of 藝; it is a semantophonetic compound of 艹 'grass' and 乙 ất.

Variants of 艺 ớt have 木 'tree' and/or 辛 'spicy':




木+遏 (with 遏 SV át, nom ớt, ợt, ượt as phonetic)

***lớt [ləət] is reminiscent of Old Chinese *ʔrət; its tone indicates an earlier voiceless initial.

<PA.AR.ΓA.NAI>?

Farghāna was part of the Second Khitan Empire (i.e., the Qara Khitai). I wonder how its name would have been written in the Khitan large script (KLS). Would it have been something like <pa.ar.ɣa.nai>?


The first character is clearly related to Chinese 伐 which was pronounced as *faʔ in Liao Chinese. Khitan had no f in native words, so <pa> was the closest available approximation of foreign fa-. <pa> also transcribed Liao Chinese 發 *faʔ and appeared in the name


transcribed in Liao Chinese as 杷八 *pha paʔ (= *pa baʔ in a Khitan transcription-compatible notation).

I wrote about the second character <ar> in "<Xua>-t Sinographic Sources?".

The third character <ɣa> looks exactly like Chinese 何 and its reading matches the Late Old Chinese and Early Middle Chinese reading *ɣa for 何 rather than Late Middle Chinese *xa or Liao Chinese *xo. I doubt that the creator(s) of the KLS happened to reinvent a reading *ɣa that was long obsolete by the 10th century. It is more likely that *ɣa was a carryover from an earlier writing tradition devised when 何 was read as *ɣa in Chinese: i.e., Koguryo or Parhae.

I don't know of any KLS character for <na>*, so I chose <nai> (derivation unknown**) as the fourth character, hoping that the genitive of 'Farghāna' might end in <i>. Kane (2009: 132-136) lists eight types of Khitan genitives:

<-n> type <-ń> type nonnasal type


<-in> <-iń> <-i>



There is a strong but not absolute correlation between stem vowel classes and suffix vowels. <-i> appears after non-<i>-stems:

<qid.un.i> 'Khitan-GEN' (instead of <qid.un.un>)

<bo.qo.i> 'child-GEN' (instead of <bo.qo.on>)

I do not know whether <i> can also appear after <a>-stems. I also do not know if a word ending in -CV followed by a vowel suffix could be spelled with <...CV.V> (e.g., <... na.i>) or <CVV> (e.g., <... nai>) or both in the KLS. Suppose, for instance, that there were KLS characters for <bo>, <qo>, <i>, and <qoi>. Would 'child-GEN' be written as <bo.qo.i> or <bo.qoi> or both?
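The spelling question can be framed as enumerating every way to cut a transcription into available graphs. The sketch below is only an illustration of that framing: the function name is hypothetical, and the inventory is the supposed set of characters from the paragraph above, not an attested KLS list.

```python
def spellings(word, graphs):
    """Return every way to write `word` as a dot-separated sequence of graphs.

    `word` is a plain transcription string and `graphs` is a set of
    transcription values for which a character is assumed to exist.
    """
    if not word:
        return [""]
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in graphs:
            # Combine this graph with every spelling of the remainder.
            for rest in spellings(word[end:], graphs):
                results.append(head if rest == "" else head + "." + rest)
    return results


# Hypothetical inventory: characters for <bo>, <qo>, <i>, and <qoi>
inventory = {"bo", "qo", "i", "qoi"}
print(spellings("boqoi", inventory))
```

With that inventory the function yields exactly the two candidates, bo.qo.i and bo.qoi. It says nothing about which one scribes actually preferred; that can only come from attested texts.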

I suspect that <i>-genitives were only in a small class of native words - remnants of an earlier, larger declension class? Obviously 'Farghāna' would not belong to this class.

*Here is a table of all native Khitan consonants (including zero) based on Kane's (2009) transcription followed by <a>. Numbers are from Kane's (2009: 177-182) list of KLS characters.


5.037, 5.126


5.052 (for Chinese loans only?)
(no <ga> in native words) (no <ŋa> in native words)
<ja> <ńa> <śa>
5.030, 5.133
5.021, 5.025, also
<na> <sa> <la>
<ba> <ma> (no <fa> in native words) (no <wa> in native words)

I assume that Khitan was like Mongolian and Manchu: velars could not precede a in native words. In strict notation, <ɣa> should be <ʁa> with uvular <ʁ>. I do not know whether <xa> could represent both uvular-initial [χa] and velar-initial [xa].

I also assume KLS graphs existed for <ja>, <ńa>, <ra>, <na>, <sa>, <ba>, and <ma>.

**The KLS character


does not resemble any Chinese character pronounced *nai or representing words for 'head' (Khitan nai meant 'head').

Tangut fonts by Mojikyo.org
Tangut radical and Khitan fonts by Andrew West
Jurchen font by Jason Glavy
All other content copyright © 2002-2012 Amritavision