
19.7.11.23:50: BATAVO-TAIWANESE ACRES

While looking in Endymion Wilkinson's Chinese History: A Manual (2000 edition - I'm three editions behind) for an English equivalent of the Chinese (and Khitan) title 開府儀同三司 for my last entry, I stumbled upon the Taiwanese word 甲 kah [kaʔ˧˨], a measurement of land, on p. 243. I was surprised to learn that it was a borrowing of Dutch akker (cognate to English acre - though a kah is actually about 2.1 acres).

I had always assumed that the Dutch had never left any linguistic traces in Taiwan. Wrong!

How many other Batavo-Taiwanese words are there? The Wikipedia entry on Taiwanese doesn't mention the existence of Dutch loans.

I just found that Wiktionary has a long English entry for 開府儀同三司. Nice.


19.7.10.23:59: VEXED BY FIREFOX (PART 1)

I've almost always been able to use Firefox when Chrome failed me. Until now.

What would Firefox be called in Tangut? How about

𗜐𗗱

4408 1870 1my'1 1jy2 'fire fox'?

The first half (4408) has bothered me for years for two reasons.

First, why does 4408 contain what looks like a <WOOD> radical (𘡩) atop what have been thought to be two <FIRE> radicals (𘠠 and 𘧦)? Compare 13+-stroke 4408 to the 4-stroke simplicity of Chinese 火 'fire'. The Tangraphic Sea analysis is improbable: would the graph for the basic word for 'fire' really be derived from the graph for half of an apparently nonbasic word for 'fire'?

𗜐=𗚜+𘍽

4408 1my'1 'fire' =

top and left of 4413 2pu4 'to burn, ignite' (semantic) +

right of 5082 1vi1 (second syllable of 𘓼𘍽 4555 5082 1py1 1vi1 'fire', only attested in dictionaries; could the first syllable, attested as the name of the trigram for 'fire', be cognate to 4413?; semantic)

The derivation for 4413 is unknown.

The derivation for 5082 is circular:

𘍽=𘍵+𗜐

5082 1vi1 (second syllable of 1py1 1vi1 'fire') =

left of 5286 (second syllable of 𘄦𘍵 1772 5286 1ten4 1vi1 'intelligent'; phonetic) +

right of 4408 1my'1 'fire' (semantic)

Surely 4408 was devised before 5286.

Second, 𗜐 4408 1my'1 < *miX 'fire' has the mysterious phonetic characteristic that I call 'prime' and represent as an apostrophe which is easier to type than a true prime symbol. I represent its pre-Tangut source as *X (though I could just as easily carry over the prime notation, since I have no idea what *X was). A mi-word for 'fire' is widespread in Sino-Tibetan, but none of the cognates of pre-Tangut *miX contain any obvious segment or tone that plausibly correlates with *X. Suppose, for instance, that I proposed that pre-Tangut *X corresponds to Written Burmese -ḥ. That correspondence works for 'fire' and 'nine' but not for 'two' and 'five'. Written Burmese 'two' lacks -ḥ, and (pre-)Tangut 'five' lacks *-X/-'.

7.11.21:55: A table of the above words and more:

| gloss | tangraph | Li Fanwen number | Tangut | pre-Tangut | Proto-Southern Qiang | Written Burmese | Pyu | Written Tibetan | Sinograph | Old Chinese |
|---|---|---|---|---|---|---|---|---|---|---|
| fire | 𗜐 | 4408 | 1my'1 | *miX | *mu/i (H) | mīḥ | ? | me | | *m̥əjʔ |
| two | 𗍫 | 4027 | 1ny'4 | *niX | *(χ)nə (L) | nhac < *n̥ik | kni | gnyis | | *nis |
| three | 𘕕 | 5865 | 1soq1 | *Ksum | *khsi - | suṁḥ | nhomh | gsum | | *səm |
| four | 𗥃 | 2205 | 1lyr'3 | *RliX | *grə L | leḥ | plä | bzhi | | *slis |
| five | 𗏁 | 1999 | 1ngwy1 | *pŋi < *pŋa? | *ʁuɑ L | ṅāḥ | pïnga | lnga | | *ŋaʔ |
| nine | 𗢭 | 3113 | 1gy'4 | *ŋgiX < *ŋgiwX? | *χguə | kuiḥ | tko | dgu | | *kuʔ |

Proto-Southern Qiang reconstructions are from Evans (2001). Key to his tone symbols:

Evans did not reconstruct a tone for 'nine'. Using his notation, I would reconstruct *(H): Longxi and Mianchi have high tones, but Taoping has a mid tone which normally points to *L.

My near-total ignorance of Pyu basic vocabulary (e.g., 'fire') does raise the troubling possibility that Pyu is a non-Sino-Tibetan language with loans from Sino-Tibetan. Tai has borrowed nearly all its lower numerals (with the exception of 'one') from Chinese.

My reconstruction of pre-Tangut 'nine' implies a chain shift:

*-k > *-w > Ø

Pre-Tangut *-w was lost, and Tangut gained a new -w from the lenition of pre-Tangut *-k: e.g., in

𘈩 0100 *kʌtik > *lew 'one'
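
The ordering matters: if the lenition of *-k had come first, the new -w would have been lost along with the old one. Here is a rough sketch of that point in Python. The input forms "giw" and "tik" are made-up toy syllables, not reconstructions, and the two regex rules are my simplification of the changes described above.

```python
import re

# Two toy sound changes acting on word-final segments only.
LOSE_W = (re.compile(r"w$"), "")      # older *-w > zero
LENITE_K = (re.compile(r"k$"), "w")   # *-k > new -w

def apply_rules(form, rules):
    """Apply each (pattern, replacement) rule once, in the given order."""
    for pattern, replacement in rules:
        form = pattern.sub(replacement, form)
    return form

# Chain shift: *-w is lost before *-k lenites, so the new -w survives.
CHAIN_ORDER = [LOSE_W, LENITE_K]
# Reverse order: the new -w feeds the loss rule and everything merges.
MERGER_ORDER = [LENITE_K, LOSE_W]

for toy_form in ("giw", "tik"):  # hypothetical inputs, not attested forms
    print(toy_form,
          apply_rules(toy_form, CHAIN_ORDER),
          apply_rules(toy_form, MERGER_ORDER))
```

Only the first ordering keeps old *-w and old *-k distinct, which is what the Tangut reflexes require.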

'Three', 'four', and 'five' all had the same tone (or, more likely, segmental source of a tone) in Proto-Lolo-Burmese, and I suspect that tone source spread from one numeral to the others. (Cf. how *-i spread from 'four' to 'five' in pre-Tangut.) If that tone corresponded to Pyu -h, then that tone source spread from 'three' to 'four' and 'five' in Proto-Lolo-Burmese (or some ancestor of PLB). But that scenario assumes Pyu is conservative, which I don't think it is.

A huge problem is that the final segments (or quasi-segments in the case of [pre-]Tangut *-X/-') line up poorly. Ideally I'd like to see a pattern like

Tangut tone 2 : Written Burmese -ḥ : Pyu -h : Written Tibetan -s : Old Chinese *-s

among the oldest languages (Proto-Southern Qiang tones are of recent origin), but there are no instances of that above. And the thought of languages adding a final *-s or *-h to some random numerals but not others bothers me.

Also disturbing is the possibility that (pre-)Tangut *-X/-' corresponds to nothing in any other language because it is a reflex of a Proto-Sino-Tibetan phonetic feature completely lost elsewhere. I'd like to think that maybe some Qiangic language (i.e., a relatively close living relative of Tangut) has something corresponding to (pre-)Tangut *-X/-'. Proto-Southern Qiang apparently isn't that language.

One more possibility is that *-X/-' is unique to Tangut because it reflects a substratum language which had it. But that hypothesis cannot be tested since we know nothing about such a substratum language (unless its traces are in the so-called 'ritual' language [see Andrew West's skeptical take], and -' does not seem to be any more prominent in that subset of the Tangut vocabulary; in fact, -' is even less frequent in the 'ritual' numerals than in the regular ones!). And if a substratum language had -', why would its speakers impose that feature onto a language that didn't have it? I don't know anything about the English of Hmong native speakers, but I imagine that their English does not have any uvular phonemes (that is, a feature present in Hmong but absent in English).

However, I can imagine a situation in which a speaker of a continental Altaic-type language would introduce uvulars into English because uvulars and velars are in complementary distribution in their own language (i.e., nonphonemic): e.g.,

native, English /ki/ > [ki]

but

native, English /ka/ > [qɑ]

(For convenience I use the symbol /k/ to represent the Altaic back consonant. One might argue that ideally I should use a symbol other than /k/ or /q/ to avoid implying that one allophone is more like the Platonic form of the phoneme than the other.)
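
The complementary distribution above amounts to a single context-sensitive rule. A minimal sketch in Python follows. The vowel set and the backing of /a/ to [ɑ] after the uvular are my assumptions, chosen only to reproduce the two examples above. They are not a description of any particular language.

```python
# Assumed set of back vowels that trigger the uvular allophone.
BACK_VOWELS = set("aouɑ")

def realize(phonemic):
    """Map a phonemic string to a phonetic one: /k/ > [q] before a back vowel."""
    phones = []
    for i, segment in enumerate(phonemic):
        following = phonemic[i + 1] if i + 1 < len(phonemic) else ""
        if segment == "k" and following in BACK_VOWELS:
            phones.append("q")   # uvular allophone
        elif segment == "a" and phones and phones[-1] == "q":
            phones.append("ɑ")   # assumed backing of /a/ after the uvular
        else:
            phones.append(segment)
    return "".join(phones)

print(realize("ki"))  # velar stays before a front vowel
print(realize("ka"))  # uvular appears before a back vowel
```

A speaker applying such a rule to English would produce uvulars without having a uvular phoneme, since the rule is entirely predictable from the following vowel.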

But ... when Khitan and Manchu actually did encounter [ka]-type combinations violating their phonotactics in Chinese, they borrowed them as /ka/: e.g.,

As a result, the uvular-velar distinction became phonemic as well as phonetic: e.g., these new imported /ka/ contrasted with native /qa/.

Then again, I am citing written Khitan and Manchu which may have reflected an elite, idealized pronunciation. Some Khitan and Manchu speakers learning Chinese might have pronounced uvulars before /a/. If they did, at least they had a phonotactic motivation for doing so. The phonotactic motivation, if any, for pronouncing whatever -' was in Tangut is unknown. Minimal pairs such as

𗹦:𗜐

3513 1my1 'sky' : 4408 1my'1 'fire'

seem to rule out a phonotactic motivation.

Could the fact that 'sky' and 'fire' had different vowels in pre-Tangut be relevant? Could I abandon *X and instead propose that

No, because there are cases of -y' from pre-Tangut *-uX and -y from pre-Tangut *-i: e.g.,

𗡡

0320 1vy'1 < *NApuX or *CANpuX 'soft, weak' (cf. Japhug mpɯ < *-u 'soft')

𘗊

4880 2ryr1 < *riH 'copper' (cf. Written Tibetan gri 'knife'?)

(The -r of 2ryr1 is vowel retroflexion conditioned by *r-. As 1lyr'3 'four' above demonstrates, there is no phonotactic constraint against ' coexisting with retroflexion, so I cannot claim that 2ryr1 would have ended in -y' if not for retroflexion.)

There is even a doublet for 'worm'

𘟥𗯏

1888 2by1 < *mbuH and 5270 1by'1 < *mbuX

which is cognate to Written Tibetan Hbu [mbu] 'id.' See Gong Hwang-cherng's "A Hypothesis of Three Grades and Vowel Length Distinction in Tangut" (1995) for more examples. (Gong's 'long vowels' correspond to my V' 'vowel-prime' sequences.) The correct explanation for -' would have to account for such doublets.


19.7.9.15:59: TWENTY BLADES OF CHINESE GRASS

At the end of "An-derused", I wrote,

I was surprised to see 漢 <CHINESE> 한 Han with 艹 instead of 廿 on the top right on the cover of 最新版常用學習三千漢字 Chhoeshinphan sangyong haksŭp samchŏn hancha (Three Thousand Hanja for Everyday Study: New Edition).

I was even more surprised to look inside and see the entry for 漢 <CHINESE> 한 Han on p. 47. Each of the three thousand hanja in the book has a large entry character atop a chart showing how to write it in seven steps and one or more example words containing it. The large entry character is 漢 with 艹 (resembling the character component <GRASS> though actually having nothing to do with grass) on the top right. However,

That must be confusing to someone who does not know how to write the character. Wiktionary shows both ways to write <CHINESE>. If I were to write a book on hanja, I'd bring up the variation of <CHINESE>.

I can describe that variation in terms of Unicode:

So why don't I just type U+FA47 and U+FA9A instead of resorting to phrases like "漢 with 廿 on the top right"? Because I don't think most people have fonts that support the distinction between the two forms. The <CHINESE> hanja that you see here in fact has a third Unicode codepoint: U+6F22. Why are there three codepoints for two forms of <CHINESE>¹?

This table is my attempt to show the relationships between a few encodings and forms of <CHINESE>:

| Unicode codepoint | Unicode glyph | Japanese equivalent | North Korean equivalent | South Korean equivalent in KS C 5601-1987 |
|---|---|---|---|---|
| U+6F22 | font-dependent | 艹-version | 廿-version (?) | 廿-version |
| U+FA47 | 廿-version | 廿-version | none | none |
| U+FA9A | 艹-version | none | 艹-version | none |

The duplicate codepoints in Unicode are a byproduct of the different versions of <CHINESE> corresponding to U+6F22 in Japanese and North Korean encodings. In an 'ideal' Unicode without regard for non-Unicode encodings, there would either be two codepoints for the two versions (following a maximalist philosophy of one codepoint per form) or just one (following a minimalist philosophy of one codepoint per platonic character) but not three.
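
Unicode itself treats U+FA47 and U+FA9A as compatibility ideographs whose canonical decomposition is U+6F22, so any normalized text loses the distinction. Here is a small check using Python's standard unicodedata module. The helper name nfc_codepoint is my own, and the output depends on the Unicode version bundled with the interpreter, which I assume to be 4.1 or later.

```python
import unicodedata

def nfc_codepoint(codepoint):
    """Return the codepoint that a single character becomes under NFC."""
    normalized = unicodedata.normalize("NFC", chr(codepoint))
    return ord(normalized)

for cp in (0x6F22, 0xFA47, 0xFA9A):
    print(f"U+{cp:04X}",
          unicodedata.name(chr(cp)),
          "-> NFC",
          f"U+{nfc_codepoint(cp):04X}")
```

That is one more practical reason not to rely on U+FA47 and U+FA9A in running text: software that normalizes input will silently replace both with U+6F22.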

¹There are in fact at least 31 forms of <CHINESE>, but the 廿~艹 variants (and the simplified Chinese form 汉) are all that are needed for everyday purposes.


19.7.8.23:59: AN-DERUSED

Of course right after I finished my previous post on 顏 U+984F~顔 U+9854 for Sino-Korean 안 an <FACE>, I realized I should have checked the very first Sino-Korean dictionary, 東國正韻 Tongguk chŏngun (1447), which is also one of the earliest hangul texts. Its entry for ᅌᅡᆫ ngan (the prescriptive 15th century reading of <FACE>) has the form 顏 U+984F. Needless to say, it is absurd to draw direct lines across three vastly different periods, but I'll do so anyway:

Tongguk (1447): 顏 — Gale (1897): 顏 — Sae chajŏn (1961): 顏 (all U+984F)

Those are the three earliest texts in my survey so far. I am certain of what I have seen in them. I am less certain about these search results from titles and authors in the National Library of Korea's database (via Cambridge's list of Korean studies resources), since it's possible someone typed one form instead of the other:

Sorting the results by date reveals some obvious typos: e.g., modern items like a book with 2018 in the title dated "201" instead of "201X". And the difference between "201" and "201X" is more obvious than that between 顏 U+984F and 顔 U+9854.

It is certainly not true that 顔 only appears in post-1961 books. The earliest result for 顔 is 史鉞 Sawŏl (The Axe of History, 1506). Although I don't have time to go through the online scan of the book (there is no search function), I can believe 顔 was in it, since the earliest attestation of that form that I can find is in the Chinese rhyme dictionary Guangyun (1008).

Conversely, it is also not true that 顏 U+984F is absent from recent publications, as ... oh no. The results include anything with the syllable an in the title or author's name regardless of whether it's spelled 顏, 顔, 晏 (the surname of the author of Sawŏl), in hangul as 안, etc. I suppose that makes sense in a time when few people may know what the proper hanja is. But why does searching for 顏 U+984F and 顔 U+9854 generate different results if all that matters is the presence of a syllable an regardless of written form? I don't know. I wonder if the site developer will ever address that question.

I'm going to look at the question of 顏 U+984F~顔 U+9854 from one last angle. Here is a list of the frequency of the two forms in South Korean national newspapers according to Google. I have arranged the papers in order of circulation whenever I could find figures. The figures were partly undated, so this table cannot be interpreted as a true ranking. I just wanted a rough idea of the popularity of the various papers.

| Title | Circulation | 顏 U+984F | 顔 U+9854 | Notes |
|---|---|---|---|---|
| Chosun Ilbo | 1.8 million | 7 | (1441) | 顔 U+9854 figure includes instances in the paper's Japanese edition. |
| JoongAng Ilbo | 1.3 million | 0 | 0 | 0 in spite of the fact that the paper does not have a no-hanja policy like Hankyoreh (see below). |
| Dong-A Ilbo | 1.2 million | 3 | (3100) | 顏 U+984F figure excludes instances in the paper's Chinese edition. 顔 U+9854 figure includes instances in the paper's Japanese edition. |
| Seoul Shinmun | 780,000 | 5 | 198 | |
| Kyunghyang Shinmun | 350,000 | 0 | 183 | |
| Hankook Ilbo | 213,200 | 1 | 70 | |
| Hankyoreh | ? | (1) | (1880) | The paper has a no-hanja policy in its Korean edition, so the figures are for 顔 U+9854 in the paper's Japanese edition. The one instance of 顏 U+984F is in a comment in the Japanese edition and is probably a character selection error, as the rest of the comment is in postwar characters; the writer is not someone like me who insists on prewar orthography. |
| Kookmin Ilbo | ? | 0 | 138 | |
| Munhwa Ilbo | ? | 0 | 120 | |

For comparison, in Asahi shinbun, 顏 U+984F appears 30 times and 顔 U+9854 appears 165,000 times. (Those figures include <FACE> for native kao as well as Sino-Japanese gan, whereas <FACE> in Korean only represents Sino-Korean an.) Kanji are alive and well in Japanese, whereas hanja are in decline in Korean. I took hanja seriously when I first started learning Korean in 1987. The newspapers were full of them then. But now Hankyoreh has zero in its Korean edition.  An all-kana Japanese newspaper is unthinkable today, even though Japanese TV news reporters demonstrate it is possible to present the news orally without any kanji (not counting onscreen text).

What is missing from the figures above is a sense of proportion and the time dimension. What is the frequency of each form of <FACE> per million characters (counting hangul letter blocks as single characters) per year per publication? My guess is that the Japanese usage of both forms of <FACE> has remained constant after the postwar writing reform, whereas <FACE> in either form has become increasingly infrequent in Korean, though 顔 U+9854 has taken the lead due to
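
Here is a rough sketch of the arithmetic I have in mind. The raw counts are the unparenthesized Google figures from the table above. The corpus size in the last line is a made-up placeholder, since I do not have character totals for any of these papers.

```python
# (顏 U+984F hits, 顔 U+9854 hits) from the table above
COUNTS = {
    "Seoul Shinmun": (5, 198),
    "Kyunghyang Shinmun": (0, 183),
    "Hankook Ilbo": (1, 70),
}

def share_of_984f(hits_984f, hits_9854):
    """Fraction of all <FACE> hits that use U+984F; None if there are no hits."""
    total = hits_984f + hits_9854
    return hits_984f / total if total else None

def per_million(hits, total_characters):
    """Normalize a raw hit count to occurrences per million characters."""
    return hits * 1_000_000 / total_characters

for title, (old_form, new_form) in COUNTS.items():
    print(title, share_of_984f(old_form, new_form))

# Hypothetical corpus size, for illustration only.
print(per_million(198, 50_000_000))
```

The share is computable today. The per-million figure would need a character count per publication per year, which is exactly the data I lack.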

It would be interesting to see frequency figures for various types of hanja in publications over time. I suspect, for instance, that the usage of 日 il 'Japan, day' has declined but not to the same degree as <FACE> because newspapers continue to use 日 as an eye-catching abbreviation of 日本 Ilbon 'Japan', particularly in headlines. Even JoongAng Ilbo which has zero instances of <FACE> has 1,650 Google instances of 日, presumably mostly for Il 'Japan'. However, the example words for 日 in Grant's (1982: 43) A Guide to Korean Characters have the following frequencies in JoongAng Ilbo according to Google:

Those are 日常單語 ilsang tanŏ 'everyday words', so the low frequency of their spellings does not reflect the frequency of the words that those spellings represent. Here are the hangul spelling frequencies in JoongAng Ilbo according to Google:

JoongAng Ilbo has been around since 1965. I suspect Grant's example words were sometimes written in hanja in 1965 issues which are of course not Google-searchable. Here's a low-resolution image of the top of the front page of the debut issue. The largest characters - the only ones I can make out - are all hanja:

<NUMBER> 호 ho appears on the front page as 號 U+865F and as 号 U+53F7, the simplified form also used in postwar Japan. I get the impression that official standards aside - 号 U+53F7 isn't supported by KSC encoding or in the 1,800 hanja taught in secondary schools - Korean typographers and even reference book writers are not purists when it comes to hanja forms. I was surprised to see 漢 <CHINESE> 한 Han with 艹 instead of 廿 on the top right on the cover of 最新版常用學習三千漢字 Chhoeshinphan sangyong haksŭp samchŏn hancha (Three Thousand Hanja for Everyday Study: New Edition).


19.7.7.23:59: J-AN-US: THE TWO <FACE>S OF NAVER

Why do I care so much about minute variations like 顏~顔 for Sino-Korean 안 an <FACE>?

In TJK¹ studies, subtly different graphs are often regarded by modern scholars as separate entities. Whether such differences also reflect linguistic differences requires study.

No such study is needed to know that 顏 and 顔 are the 'same' character in one sense. But in Unicode, they are not: 顏 is U+984F and 顔 is U+9854. Unicode is not consistent about assigning variants to different codepoints. That is not necessarily a flaw. Should the VS17 and VS18 forms of 喩 U+55A9 have different codepoints? I couldn't tell them apart without laying VS18 over VS17. On the other hand, why doesn't VS19 of 囀󠄀 U+56C0 have its own codepoint? Andrew West has much more on this issue.
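
Both ways of encoding variants can be inspected in Python. In this sketch the helper name codepoints is mine. It shows that the two <FACE> forms are distinct codepoints that normalization does not merge, and that a VS17 form is a base character followed by the variation selector U+E0100.

```python
import unicodedata

def codepoints(text):
    """List the codepoints in a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

FACE_OLD = "\u984F"   # 顏
FACE_NEW = "\u9854"   # 顔
VS17 = "\U000E0100"   # VARIATION SELECTOR-17

# Two separately encoded ideographs: even NFKC leaves them apart.
print(codepoints(FACE_OLD), codepoints(FACE_NEW))
print(unicodedata.normalize("NFKC", FACE_OLD) == FACE_NEW)

# A variation sequence: one base codepoint plus a selector.
print(codepoints("\u55A9" + VS17))
```

Whether a font draws the selected variant is a separate matter. A font without that variation sequence will simply fall back to its default glyph for U+55A9.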

Back to Korean: my interest lies in determining what the de facto standard form of 안 an <FACE> is or was at different points of time.

Last night I forgot to check Gale (1897), one of the first Korean dictionaries I ever used. Page 943 has 顏 U+984F.

Today I use naver.com. Its hanja dictionary treats 顏 U+984F as the 本字 ponja 'original character' of 顔 U+9854, but its entry for 顔 U+9854 is lengthier, including lists of 18 words and 5 phrases containing 顔 U+9854 without equivalents in the entry for 顏 U+984F. Clearly the dictionary regards 顔 as the principal form. Yet if I run a search on those characters throughout the entire dictionary (i.e., if I have 전체 chŏnchhe 'entire body' selected), I get

One might conclude that the 31 words containing 顏 U+984F can never be written with 顔 U+9854, that there are no phrases that can be written with 顏 U+984F, etc. But that isn't true: anything that can be written with one can be written with the other. And yet there may not be any overlap between those lists: e.g.,

If someone runs into 顏面 anmyŏn 'face' with U+984F in a text and looks it up in Naver, one will find the words
but not 顏面 anmyŏn 'face' itself!

My impression is that in South Korea 顔 U+9854 has become dominant but has not yet fully eclipsed 顏 U+984F. Otherwise I would expect Naver to be like Japanese dictionaries which have a single main entry for 顔 U+9854 and list 顏 U+984F as a variant.

I predict that the domination of 顔 U+9854 will increase over time as Koreans type fewer hanja and only use hanja that their IMEs provide for them: e.g., 顔 U+9854 but not 顏 U+984F in the case of Windows.

¹Andrew West's term for Tangut/Jurchen/Khitan, a play on CJK for Chinese/Japanese/Korean.

Tangut Yinchuan font copyright © Prof. 景永时 Jing Yongshi
Tangut character image fonts by Mojikyo.org
Tangut radical and Khitan fonts by Andrew West
Jurchen font by Jason Glavy
All other content copyright © 2002-2019 Amritavision