David Boxenhorn just gave me a tool to search for tangraphs (Tangut characters) based on Andrew West's work.

David assigned a pronounceable three-letter alphacode to each of 840 Tangut components. For example,


is 'dex'. Such alphacodes are easier for me to learn than numbers (e.g., dex is equivalent to Nishida radical 204).

I can use regular expressions to find how many tangraphs contain a component in different positions: e.g., PERSON (dex) appears

- in 1,187 tangraphs regardless of position (roughly one out of five tangraphs; cf. its Chinese equivalent which only appears in about 3-4% of sinographs)

- once by itself

- in initial position in 555 tangraphs

- as the second of three components in 196 tangraphs

- as the second of four components in 49 tangraphs

- as the second of five components in 4 tangraphs

- as the third of four components in 63 tangraphs

- as the third of five components in 5 tangraphs

- as the fourth of five components in 7 tangraphs

- in final position in 370 tangraphs
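The positional searches listed above can be sketched with regular expressions, assuming every alphacode is exactly three lowercase letters so that a tangraph's code splits cleanly into three-letter chunks. The six-entry sample is made up of alphacodes quoted in this post; it is not David's data, whose file format is not assumed here, and the function names `at_position`, `anywhere`, and `matches` are hypothetical.

```python
import re

# Toy sample: alphacodes quoted in this post, not the real search database.
SAMPLE = [
    "dexdex", "dexbeldex", "boxdexbeldex",
    "boxdexbixbiadex", "dexbelpax", "duudexcok",
]

CHUNK = "(?:[a-z]{3})"  # one three-letter component code (assumed fixed width)


def at_position(component, position, total):
    """Pattern for `component` as the position-th (1-based) of `total` components."""
    return re.compile(
        "^%s{%d}%s%s{%d}$"
        % (CHUNK, position - 1, re.escape(component), CHUNK, total - position)
    )


def anywhere(component):
    """Pattern for `component` in any position, aligned to chunk boundaries."""
    return re.compile("^%s*%s%s*$" % (CHUNK, re.escape(component), CHUNK))


def matches(pattern, codes=SAMPLE):
    """Return the alphacodes in `codes` that the compiled pattern matches."""
    return [code for code in codes if pattern.match(code)]


initial = re.compile("^dex")   # PERSON first, however many components follow
final = re.compile("dex$")     # PERSON last (aligned if length is a multiple of 3)

print(matches(at_position("dex", 2, 3)))  # second of three components
print(matches(at_position("dex", 2, 4)))  # second of four components
print(len(matches(anywhere("dex"))))      # regardless of position
```

Anchoring the patterns to whole three-letter chunks keeps a search from matching letters that straddle two neighboring components.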

Note that tangraphs can contain more than one of the same component: e.g.,

have more than one PERSON each. This is clear from their alphacodes:

dexdex (PERSON next to PERSON)

dexbeldex (PERSONs flanking bel)

boxdexbeldex (PERSON in second and fourth position)

boxdexbixbiadex (PERSON in second and fifth position)

I can also search for component sequences: e.g., 'dexbel' for PERSON + SURROUND 干:

Such sequences can be abbreviations of full characters according to the analyses in Tangraphic Sea:

e.g., PERSON + SURROUND 干 (dexbel) in

na R17 1.17 'night, darkness' (dexbeldex)

is from the left two-thirds of

na R17 1.17 (dexbelpax), first half of na-raʳ 'tomorrow'

which is in turn from

na R17 1.17 'night, darkness' (dexbeldex)

bringing us full circle. One might guess that their shared component


is a phonetic na, but it is read as pi R11 1.11 by itself and means 'majestic, glorious' which has no semantic connection to 'night, darkness' or 'tomorrow' (or any obvious connection to PERSON or SURROUND!).
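A component-sequence search such as the 'dexbel' one used in this section can be sketched as follows, again assuming fixed three-letter alphacodes. The helper names are hypothetical, and the string "adexbelxx" is an invented code (not a real tangraph) included only to show why a plain substring test is not enough.

```python
def chunks(code, size=3):
    """Split an alphacode string into fixed-width component codes."""
    return [code[i:i + size] for i in range(0, len(code), size)]


def has_sequence(code, sequence):
    """True if the components of `sequence` occur consecutively in `code`."""
    parts, wanted = chunks(code), chunks(sequence)
    n = len(wanted)
    return any(parts[i:i + n] == wanted for i in range(len(parts) - n + 1))


# Alphacodes quoted in this post.
for code in ["dexbeldex", "dexbelpax", "duudexcok"]:
    print(code, has_sequence(code, "dexbel"))

# Invented code: the letters 'dexbel' appear, but across chunk boundaries
# (ade + xbe + lxx), so this is not a PERSON + SURROUND sequence.
print("dexbel" in "adexbelxx", has_sequence("adexbelxx", "dexbel"))
```

Comparing lists of chunks rather than raw substrings means a hit always corresponds to whole components.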

na 'night, darkness' and perhaps na-raʳ 'tomorrow' are probably related to

nɨaa R21 1.21 'black' (duudexcok)

cf. Classical Tibetan nag-po, gDong-brgyad rGyalrong kɯ-ɲaʁ, Mawo Qiang ɲiq

which shares dex, but not dexbel.

HOW MANY GEMINATES DOES NHA HEUN HAVE? (PART 2)

I found three more in Jacq (2006) that I have added in bold to the following table. Hypothetical geminates are in parentheses. I predict that I could find them in a larger Nha Heun sample.

(kk-?) cc- (tt-?) pp-
gg- ɟɟ- (dd-?) (bb-?)
ŋŋ- ɲɲ- nn- mm-
ll- ww-

I assume they all or mostly come from original clusters, though some might originate from simple initials: cf. how Korean 꽃 kkot 'flower' is from Middle Korean 곶 /koc/ [kos] rather than the expected *(p)skoch.

Not all Nha Heun initial clusters became geminates. I previously mentioned these Nha Heun obstruent-sonorant clusters from Sagart (1999):

ŋr- mr-

After looking at Davis (1973) and Jacq (2006), I can expand that table to include exotica like ʔŋk-, hʔ-, hb-, ppr-, and (h)ɲr-. I find the last of these very hard to pronounce.

First element/second element -ʔ- -ŋk- -j- -ɲr- -r- -l- -n- -b- -m- -w-
ʔ- ʔŋk- ʔl- ʔn- ʔm- ʔw-
h- hʔ- hj- hɲr- hl- hb- hm-
k- kr- kl- kw-
kh- khj-
g- gj-
ŋ- ŋr-
c- cr-
ɲ- ɲr-
s- sr-
t- tr-
d- dr-
p- pr- pl-
pp- ppr-
b- br- bw-
m- mr-

Nha Heun hʔ- reminds me of similar clusters in Khmer:

kʔ- cʔ- sʔ- lʔ- tʔ- pʔ- mʔ-

(4.24.0:32: SEAlang's Khmer dictionary has no entries with ច្អ- cʔ-, though Huffman listed it on p. 8 of Cambodian System of Writing, which was my Khmer script textbook. I wish I had my print copy of Judith M. Jacob's A Concise Cambodian-English Dictionary with me because the pages that might include ច្អ- cʔ- entries aren't accessible on Google Books.)

I am not sure if Nha Heun h-sonorant sequences are phonetic clusters or phonemic clusters pronounced as voiceless sonorants.

Although Nha Heun is not related to either Tangut or Chinese, earlier stages of those two languages may have contained similar clusters. All three languages are at different points of the phonological 'collapse' spectrum with long polysyllabic words at one end and monosyllables with tones at the other.


Davis and Jacq use different transcription systems for Nha Heun.

Davis uses a vowel symbol resembling 乀 that goes below the line. I've never seen it before. It is distinct from his ɨ. He translates 'lɔɔng kar (in his transcription) as 'kiar wood', implying that 乀 is i-like. What does 乀 represent? Did it merge into ɨ in the variety of Nha Heun studied by Jacq?

Davis' transcription has both c and č. I assume c is a typo for č since I see no similar distinction in Jacq's transcription which only contains c. Davis' c appears in wan can 'Monday', a loanword from Lao ວັນຈັນ wan can (lit. 'day moon').

Davis transcribed 'bird' as čeem with a long vowel but Jacq transcribed it as cem with a short vowel. Did the vowel shorten, or are these forms from different dialects?

Jacq's example 16 includes the term ʔindiaŋ dɛŋ 'Native Americans' (lit. 'red Indians'). Why was her informant talking about them?

HOW MANY GEMINATES DOES NHA HEUN HAVE?

I don't know, but I thought it would be fun to find as many as I could in John J. Davis' (1973) "Notes on Nyaheun grammar".

Sagart's (1999: 15-17) brief discussion of Nha Heun [ɲahəɲ] gave me the impression that it only had sonorant geminates:

ŋŋ- < *tŋ-, *pŋ-

nn- < *kn-, *pn-

mm- < *km-, *tm-

ll- < *kəl-, *cəl-, *təl-

(*CN- clusters also have alternate reflexes*.)

This contrasts with Korean which only has tense obstruents:

kk- cc- ss- tt- pp-

Tense vowels follow all Tangut consonants with the exceptions of j- and r- in my reconstruction. If Tangut tense vowels were conditioned by earlier tense consonants, then pre-Tangut must have had a nearly full set of tense consonants or geminates:

*ʔʔ- *kk- *ttʃ- *tt- *pp-
*kkh- *ttʃh- *tth- *pph-
*gg- *dʒ- *dd- *bb-
*ŋŋ- *nn- *mm-
*xx- *ʃʃ- *ss-
*ɣɣ- *ʒʒ- *zz-
(no *jj-) *ll- (but no *rr-) *vv-

Ferlus' reconstruction of Old Chinese would also require a large set of tense consonants.

Does any attested language allow that many tense consonants or geminates in initial position?

The record-holder for number of tense consonants in UPSID is Shuswap which only has fourteen. (UPSID uses the term 'laryngealized' for Korean tense consonants, so I assume other laryngealized consonants are similar.) No language in UPSID has aspirated laryngealized or 'long' (= geminate) consonants.

If I am interpreting Davis' (1973) transcription of Nya Heun correctly, there are at least three obstruent geminates

čč- [cc]?
gg- jj- [ɟɟ]?

in addition to ŋŋ-, nn-, mm-, ll-. The presence of gg- and a voiceless/voiced palatal pair čč- and jj- implies the existence of more obstruent geminates (in bold):

kk-? čč- [cc]? tt-? pp-?
gg- jj- [ɟɟ]? dd-? bb-?

Could Nya Heun set a precedent for reconstructing a large number of tense or geminate initials in Tangut?

*Not all *CN- clusters became geminate nasals. Some became nasal + nonnasal sonorant sequences:

*km-, *tm-, *tŋ- > nw-

(*pŋ- has no known alternate reflex.)

*kn- > ŋr-


Last night, I asked,

Does any language in the world have geminate or tense initial ll-?

I forgot about Nha Heun which has ll- from *Cəl- (but Cl- from *Cl-; Sagart 1999: 17).

I think pre-Tangut once had *ll- and other initial geminates via compression:

*CVlV > *ClV > *llV

Tension spread from geminates into following vowels:

*llV > *llṾ

The subscript dot represents vocalic tension.

Later, tension was lost in the initial but became phonemic in the vowel:

*llṾ /CCV/ > lṾ /CṾ/

Hence Tangut

lạ 'hand, arm'

may go back to *llak < *Clak < *CV-lak. (4.22.0:11: Other languages preserve a final velar: e.g., Old Chinese 翼 *lək 'wing', Written Tibetan lag-pa 'hand, arm'.)
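The compression chain above (*CVlV > *ClV > *llV > *llṾ > lṾ) can be sketched as ordered rewrite rules applied to a schematic syllable. This is only an illustration of rule ordering: C and V are placeholder symbols rather than real segments, the stage labels and function names are hypothetical, and the sketch does not model the loss of the final velar in *llak.

```python
import re
import unicodedata

DOT_BELOW = "\u0323"  # combining dot below, used here to mark vocalic tension


def tense(vowel):
    """Return `vowel` with a subscript dot, in composed (NFC) form."""
    return unicodedata.normalize("NFC", vowel + DOT_BELOW)


# (label, pattern, replacement), applied in this order.
STAGES = [
    ("presyllable vowel lost", r"^CV(?=l)", "C"),
    ("cluster becomes geminate", r"^C(?=l)", "l"),
    ("tension spreads to vowel", r"^ll(V)", lambda m: "ll" + tense(m.group(1))),
    ("initial loses tension", r"^ll", "l"),
]


def derive(form):
    """Apply every stage in order and return the list of successive forms."""
    history = [form]
    for _label, pattern, replacement in STAGES:
        form = re.sub(pattern, replacement, form)
        history.append(form)
    return history


print(" > ".join(derive("CVlV")))
```

The order matters: tension has to spread to the vowel before the geminate simplifies, or nothing would be left to condition it.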

Did Chinese also once have *ll-? Michel Ferlus (2004) has proposed that the Chinese Great Split* that I account for with 'emphasis' (pharyngealization, indicated here by underlining) actually involved tense-lax contrasts: e.g.,

Sinograph | Gloss | My Old Chinese | Ferlus' Old Chinese | Late Old Chinese | Middle Chinese | Mandarin
余 | I | *Cɯ-la > *la (nonemphatic) | *la (lax) | *jɨa | *jɨə | yu [jy]
途 | road | *(Cʌ-)la > *la (emphatic) | *lla (tense) | *da | *do | tu [thu]

In Ferlus' OC *lax syllables, initials lenited and vowels raised.

In Ferlus' OC *tense syllables, initials hardened and vowels lowered. (OC *a was already low and couldn't get any lower.)

I hope to explore Ferlus' hypothesis further in the future.

*The 'Great Split' refers to how Old Chinese near-homophones later became completely different syllables: e.g., 途 'road' sounded like 余 'I' in Old Chinese and was therefore written with 余 'I' as a phonetic, but the two had nothing in common in Middle Chinese. There is no consensus on what caused the Great Split.

ICELANDIC AND OLD CHINESE LL

According to Language Log, (native) Icelandic ll is [tɬ] at the end of Eyjafjallajökull and [tl] in its middle. (Thanks to Andrew West for the link. I can't find a description of ll in the pronunciation section of my copy of Teach Yourself Icelandic.)

I presume that ll was once *[ll], or else it would have been spelled as tl. This *[ll] then might have become *[dl] before its first half devoiced to [t] (because it was a coda?).

This got me thinking about how emphatic *l- in Old Chinese hardened to Middle Chinese *d-. Not too long ago, some linguists wrote their equivalent of my *l- [lˁ] as *ll-. If that geminate notation were taken at face value, could the following change have occurred?

OC *ll- > *dl- > MC *d-

(Other OC *Cl- were also simplified to MC *C-.)

Does any language in the world have geminate or tense initial ll-?

(4.21.2:06: The only language in UPSID with /ll/ is Wolof. I don't know whether initial /ll/ is possible in Wolof. The Wikipedia entry on Wolof doesn't mention this sound at all.)

And has the change of pharyngealized [lˁ]- to d- been attested in any language? Finding such changes might be hard since pharyngealized l is so rare. It's only in two languages in UPSID (Kurdish and Shilha), and I know of a third: Modern Standard Arabic (in ʔAllaah) and some Arabic dialects (Kaye 1987: 669).

4.21.2:08: According to Wikipedia, Kurdish has velarized ɫ instead of pharyngealized l, and this sound cannot occur initially (unlike Polish velarized ł- which shifted to [w]). This distinction cannot be very old. Avestan only has r, and Kurdish l and ɫ correspond to Sanskrit and Avestan r(V)d:

'heart': Kurdish dil : Sanskrit hṛd

'year': Kurmanji sal, Sorani saɫ : Avestan sarәd, Skt śarad 'summer'

My guess is that Proto-Iranian *rd became */l/ with conditioned allophones *[l] and *[lˁ] (or *[ɫ]). When this conditioning factor was lost, the distribution of the two l-sounds was no longer predictable, and they became phonemes: cf. the development of plain and palatalized l in Slavic languages.

ROOTLESS RADICALS

Last night, I mentioned that Tangut characters were ordered by 'radicals' in Unicode. One might assume that the creator of the Tangut script consciously built characters out of the 513 Unicode radicals, but in fact some of those radicals are simply arbitrary indexing devices: e.g., the first five radical 001 (一) characters have nothing in common but a horizontal line on top:

Tangraph My reconstructed reading Tone.rhyme Gloss

do 1.49 poison

sa (unknown) suck

tʃhiaa 1.21 strong

ʃɔ 2.43 to mate; to copulate

ŋwəu 2.1 collar; neckband; territory (cf. semantic range of the unrelated Chn word 領)

Similarly, 'radical 1' (一) characters in traditional Chinese radical-based indexes only share a (nearly) horizontal line:

Sinograph Mandarin reading Gloss
一 yi one
丁 ding fourth Heavenly Stem; a surname
七 qi seven
丈 zhang unit of length; elder person; to measure land
三 san three

One might get the impression that Chinese radical 1 correlates with numerals, but not all characters for numerals contain it: e.g., 八 ba 'eight' and 九 jiu 'nine'.

There is no consensus on how to index Tangut characters. Every Tangut index has its own system. Andrew West compares thirteen different radical systems in this PDF file. Some indexing systems include 'compound radicals' which in many cases are almost certainly not units of graphic structure that the creator of tangraphy had in mind.

Modern indexers might use a Tangut period radical system if they knew of one, but none has been found yet. Tangut lexicographers grouped characters by pronunciation, not by graphic structure. Thus the first five Tangut graphs above are widely separated in Tangut period dictionaries. However, nongraphic indexing is difficult for modern Tangutologists who need to rapidly find characters without knowing how they are pronounced. Modern radical-based indexing is not authentic, but it is convenient.

4.20.00:14: You can read more about modern Tangut radicals and see various lists of such radicals in N3495.

THE TANGUT "SCRIPT ITSELF IS NOT COMPLEX"?

I had an exciting afternoon yesterday. I took Routledge's Sino-Tibetan volume with me as I ran my errands and stumbled upon topics I hope to blog about soon. Then I came home to find that Andrew West uploaded three files that lifted my spirits even higher.

The first of these files was his list of "Documents relating to the encoding of the Tangut, Jurchen and Khitan scripts" in Unicode. I have been interested in all three scripts since 1996 and can't wait to be able to type them in Unicode without the proprietary fonts or homemade graphics that I've been using until now. I'll focus on Tangut here and deal with Jurchen and Khitan in my next two posts.

The title quote is from document N3797 which is a great introduction to the script I've studied for so many years (emphasis mine):

The Tangut script is composed of about 6,000 individual characters that superficially resemble Chinese ideographs [...] Although the individual characters are quite complex, with most characters comprising 8-15 strokes, the script is not itself complex.

How can a script be simple if its characters are complex?

There are no combining characters, and characters do not interact with each other typographically.

Tangut characters can simply be placed in sequence, one after another. On this site, I use GIFs for Tangut characters and I don't have to worry about combining GIFs or choosing a different GIF depending on which other GIFs are next to it. In this sense, Tangut is simpler than, for example,

- Thai which requires character stacking: compare the position of the may eek diacritic ( ่) in

สู่ suu 'get to' ( ่ may eek is directly atop ส s which in turn is atop ู uu)

สี่ sii 'four' (< Chn 四 'id.'; ่ may eek is atop ี ii which in turn is atop ส s)

- Devanagari which requires ligatures: e.g.,

क् + ष = क्ष

k + ṣa = kṣa (which looks completely different from its parts)

ज् + ञ = ज्ञ

j + ña = jña (which resembles j but not ña)

- Hangul letter combinations: e.g.,

ㄱ + ㅏ + ㄱ = 각

k + a + k = kak

Notice how ㄱ has a variant フ in top left position.

- The Khitan small script (more on this in an upcoming post): e.g.,


'prince' + -od (plural) + -en (genitive) = oŋoden 'of the princes'
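The kind of interaction that makes these scripts "complex" in the N3797 sense shows up in how Unicode text behaves for two of the examples above. This is a small sketch using only Python's standard `unicodedata` module. The arithmetic in `compose` is the standard Unicode Hangul syllable composition formula, but the function name and its index arguments are my own.

```python
import unicodedata

HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable


def compose(lead, vowel, tail=0):
    """Precomposed Hangul syllable from 0-based jamo indices.

    There are 19 leading consonants, 21 vowels, and 28 finals
    (index 0 meaning "no final").
    """
    return chr(HANGUL_BASE + (lead * 21 + vowel) * 28 + tail)


# k + a + k: lead index 0, vowel index 0, final index 1.
kak = compose(0, 0, 1)

# The same syllable written as three conjoining jamo, then normalized.
JAMO = "\u1100\u1161\u11a8"
assert unicodedata.normalize("NFC", JAMO) == kak

# Devanagari kṣa stays a three-code-point sequence (ka + virama + ṣa);
# the ligature is produced by the font, not by a single code point.
KSSA = "\u0915\u094d\u0937"
print(kak, len(JAMO), len(unicodedata.normalize("NFC", KSSA)))
```

In both cases what the reader sees as one unit depends on how several encoded elements combine, which is the interaction that simple one-after-another placement avoids.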

Perhaps there is generally an inverse correlation between the number of characters and the complexity of a script (in the sense used in N3797). Such complexity is only viable if the number of characters is limited.

Tangut characters, like their Chinese inspirations, are themselves combinations of elements. These elements generally cannot function as standalone characters: e.g.,


consists of three nonindependent parts


'wood' + 'fire' + 'fire trigram'

N3797-A is a list of 513 Tangut character components ranging from one to sixteen strokes.

N3797-B is a nearly complete list of Tangut characters ordered by component: e.g., the first 46 characters (other than the iteration mark) contain the single-stroke horizontal line component

(radical 001; meaning, if any, unknown)

and the last three characters contain the sixteen-stroke component

'iron' (radical 513)

The "IDS" (ideographic description sequence) column lists a breakdown of each character into its components and the "Stroke Count & Stroke Order" column lists a breakdown of each character into individual strokes: e.g., the first Tangut character (17001)


has six strokes signified by AEABEM:

stroke 1: A = ㅡ (on the top; radical 001 in this position)

stroke 2: E = ㄱ (top 2/3 of コ)

stroke 3: A = ㅡ (bottom 1/3 of コ)

stroke 4: B = ㅣ (left of ㄇ)

stroke 5: E = ㄱ (top and right of ㄇ)

stroke 6: M = 乚

Characters are ordered by radical (top and/or left-hand component), number of strokes, and stroke types (in alphabetical order according to stroke codes): e.g.,

17000: only six-stroke character with radical 001

17001: only seven-stroke character with radical 001

17002: first eight-stroke character with radical 001: AAABCCQH

17003: second eight-stroke character with radical 001: AABFFQBB (AAB ㅡㅡㅣ ... follows AAA ㅡㅡㅡ ...)

17004: third eight-stroke character with radical 001: ABAEAAMC (ABA ㅡㅣㅡ ... follows AAB ㅡㅡㅣ ...)
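The ordering rule described above (radical, then number of strokes, then stroke codes in alphabetical order) amounts to sorting on a three-part key. Here is a minimal sketch using the stroke-code strings quoted in this post. The `Entry` type and its field names are hypothetical, and the six-stroke AEABEM string is labeled by its stroke count rather than by a code point.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    label: str     # identifier used in this post
    radical: int   # radical number
    strokes: str   # one letter per stroke, e.g. A = horizontal line


def sort_key(entry):
    """Radical first, then stroke count, then stroke codes alphabetically."""
    return (entry.radical, len(entry.strokes), entry.strokes)


# Deliberately shuffled input.
ENTRIES = [
    Entry("17004", 1, "ABAEAAMC"),
    Entry("17002", 1, "AAABCCQH"),
    Entry("six-stroke example", 1, "AEABEM"),
    Entry("17003", 1, "AABFFQBB"),
]

for entry in sorted(ENTRIES, key=sort_key):
    print(entry.label, len(entry.strokes), entry.strokes)
```

Note that AEABEM sorts first even though AE follows AA alphabetically, because stroke count outranks stroke type in the key.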


Tangut fonts by Mojikyo.org
Tangut radical font by Andrew West
All other content copyright © 2002-2010 Amritavision