While looking in Endymion Wilkinson's
Chinese History: A Manual (2000 edition - I'm three editions behind)
for an English equivalent of the Chinese (and Khitan) title 開府儀同三司 for my last entry, I stumbled upon the
Taiwanese word 甲 kah [kaʔ˧˨], a measurement of land, on p. 243.
I was surprised to learn that it was a borrowing of Dutch akker (cognate
to English acre - though a kah is actually about 2.1 acres).
I had always assumed that the Dutch had never left any linguistic traces in Taiwan. Wrong!
How many other Batavo-Taiwanese
words are there? The Wikipedia
entry on Taiwanese doesn't mention the existence of Dutch loans.
I just found that Wiktionary has a
long English entry for 開府儀同三司. Nice.
VEXED BY FIREFOX (PART 1)
I've almost always been able to use Firefox when Chrome failed me. Until now.
What would Firefox be called in Tangut? How about
4408 1870 1my'1 1jy2 'fire fox'?
The first half (4408) has bothered me for years for two reasons.
First, why does 4408 contain what looks like a <WOOD> radical
(𘡩) atop what have been thought to be two
<FIRE> radicals (𘠠 and 𘧦)? Compare 13+-stroke 4408 to the 4-stroke
simplicity of Chinese 火 'fire'. The Tangraphic Sea analysis is
improbable: would the graph for the basic word for 'fire' really be
derived from the graph for half of an apparently nonbasic word for 'fire'?
4408 1my'1 'fire' =
top and left of 4413 2pu4 'to burn, ignite' (semantic) +
right of 5082 1vi1 (second syllable of 𘓼𘍽 4555 5082 1py1 1vi1 'fire', only attested in dictionaries; could the first syllable, attested as the name of the trigram for 'fire', be cognate to 4413?; semantic)
The derivation for 4413 is unknown.
The derivation for 5082 is circular:
5082 1vi1 (second syllable of 1py1 1vi1 'fire') =
left of 5286 (second syllable of 𘄦𘍵 1772 5286 1ten4 1vi1 'intelligent'; phonetic) +
right of 4408 1my'1 'fire' (semantic)
Surely 4408 was devised before 5286.
Second, 𗜐 4408 1my'1 < *miX 'fire' has the mysterious phonetic characteristic that I call 'prime' and represent as an apostrophe, which is easier to type than a true prime symbol. I represent its pre-Tangut source as *X (though I could just as easily carry over the prime notation, since I have no idea what *X was). A mi-word for 'fire' is widespread in Sino-Tibetan, but none of the cognates of pre-Tangut *miX contain any obvious segment or tone that plausibly correlates with *X. Suppose, for instance, that I proposed that pre-Tangut *X corresponds to Written Burmese -ḥ. That correspondence works for 'fire' and 'nine' but not for 'two' and 'five'. Written Burmese 'two' lacks -ḥ, and (pre-)Tangut 'five' lacks *-X/-'.
7.11.21:55: A table of the above words and more:
||Li Fanwen number
||nhac < *n̥ik||kni
||*pŋi < *pŋa?||*ʁuɑ L
||*ŋgiX < *ŋgiwX?||*χguə
Proto-Southern Qiang reconstructions are from Evans (2001). Key to his tone symbols:
parentheses: one counterexample exists in the data
dash: data are equivocal
Evans did not reconstruct a tone for 'nine'. Using his notation, I
would reconstruct *(H): Longxi and Mianchi have high tones, but Taoping
has a mid tone which normally points to *L.
My near-total ignorance of Pyu basic vocabulary (e.g., 'fire') does
raise the troubling possibility that Pyu is a non-Sino-Tibetan language
with loans from Sino-Tibetan. Tai has borrowed nearly all its lower
numerals (with the exception of 'one') from Chinese.
My reconstruction of pre-Tangut 'nine' implies a chain shift:
*-k > *-w > Ø
Pre-Tangut *-w was lost, and Tangut gained a new -w from the lenition of pre-Tangut *-k: e.g., in
𘈩 0100 *kʌtik > *lew 'one'
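The order of the two stages matters: if *-k had lenited first, its new -w would have been swept away together with the old *-w. Here is a minimal Python sketch of that ordering. The helper name `chain_shift` and the input syllables (`tiw`, `tik`, `ti`) are made-up placeholders for illustration, not reconstructions.

```python
def chain_shift(form: str) -> str:
    """Apply the two stages of the proposed chain shift in order."""
    # Stage 1: an original final -w is lost.
    if form.endswith("w"):
        form = form[:-1]
    # Stage 2: final -k lenites to a new -w.
    # Stage 1 has already run, so this new -w survives.
    if form.endswith("k"):
        form = form[:-1] + "w"
    return form


# Hypothetical placeholder syllables, not attested or reconstructed forms.
for pre in ["tiw", "tik", "ti"]:
    print(pre, ">", chain_shift(pre))
```

Reversing the two `if` blocks would merge the outcomes of the -k and -w inputs, which is exactly what a chain shift avoids.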
'Three', 'four', and 'five' all had the same tone (or, more likely, segmental source of a tone) in Proto-Lolo-Burmese, and I suspect that tone source spread from one numeral to the others. (Cf. how *-i spread from 'four' to 'five' in pre-Tangut.) If that tone corresponded to Pyu -h, then that tone source spread from 'three' to 'four' and 'five' in Proto-Lolo-Burmese (or some ancestor of PLB). But that scenario assumes Pyu is conservative, which I don't think it is.
A huge problem is that the final segments (or quasi-segments in the case of [pre-]Tangut *-X/-') line up poorly. Ideally I'd like to see a pattern like
Tangut tone 2 : Written Burmese -ḥ : Pyu -h : Written Tibetan -s : Old Chinese *-s
among the oldest languages (Proto-Southern Qiang tones are of recent origin), but there are no instances of that above. And the thought of languages adding a final *-s or *-h to some random numerals but not others bothers me.
Also disturbing is the possibility that (pre-)Tangut *-X/-' corresponds to nothing in any other language because it is a reflex of a Proto-Sino-Tibetan phonetic feature completely lost elsewhere. I'd like to think that maybe some Qiangic language (i.e., a relatively close living relative of Tangut) has something corresponding to (pre-)Tangut *-X/-'. Proto-Southern Qiang apparently isn't that language.
One more possibility is that *-X/-' is unique to Tangut because it reflects a substratum language which had it. But that hypothesis cannot be tested since we know nothing about such a substratum language (unless its traces are in the so-called 'ritual' language [see Andrew West's skeptical take], and -' does not seem to be any more prominent in that subset of the Tangut vocabulary - in fact, -' is even less frequent in the 'ritual' numerals than in the regular ones!). And if a substratum language had -', why would its speakers impose that feature onto a language that didn't have it? I don't know anything about the English of Hmong native speakers, but I imagine that their English does not have any uvular phonemes (that is, a feature present in Hmong but absent in English).
However, I can imagine a situation in which a speaker of a continental Altaic-type language would introduce uvulars into English because uvulars and velars are in complementary distribution in their own language (i.e., nonphonemic): e.g.,
native, English /ki/ > [ki]
native, English /ka/ > [qɑ]
(For convenience I use the symbol /k/ to represent the Altaic back
consonant. One might argue that ideally I should use a symbol other
than /k/ or /q/ to avoid implying that one allophone is more like the
Platonic form of the phoneme than the other.)
But ... when Khitan and Manchu actually did encounter [ka]-type combinations violating their phonotactics in Chinese, they borrowed them as /ka/: e.g.,
Liao Chinese 開 *kʰaj 'to open' > Khitan small script <k.ai> (not <q.ai>; element in the title <k.ai fu ng.i t.ung s.a.am sï> < 開府儀同三司 *kʰaj fu ŋi tʰuŋ sam sz̩, lit. 'open government ceremony same three official' and not a monosyllabic verb 'to open')
Mandarin gang [kaŋ] 'steel' > Manchu g'an [kaɴ] (not [qaɴ])
As a result, the uvular-velar distinction became phonemic as well
as phonetic: e.g., these new imported /ka/ contrasted with native /qa/.
Then again, I am citing written Khitan and Manchu which may have reflected an elite, idealized pronunciation. Some Khitan and Manchu speakers learning Chinese might have pronounced uvulars before /a/. If they did, at least they had a phonotactic motivation for doing so. The phonotactic motivation, if any, for pronouncing whatever -' was in Tangut is unknown. Minimal pairs such as
3513 1my1 'sky' : 4408 1my'1 'fire'
seem to rule out a phonotactic motivation.
Could the fact that 'sky' and 'fire' had different vowels in
pre-Tangut be relevant? Could I abandon *X and instead propose
pre-Tangut *-u > -y (e.g., 'sky'; cf. Written
Burmese muiḥ < *məwh 'sky')
but pre-Tangut *-i > -y'?
No, because there are cases of -y' from pre-Tangut *-uX and -y from pre-Tangut *-i: e.g.,
𗡡 0320 1vy'1 < *NApuX or *CANpuX 'soft, weak' (cf. Japhug mpɯ < *-u 'soft')
4880 2ryr1 < *riH 'copper' (cf. Written Tibetan gri 'knife'?)
(The -r of 2ryr1 is vowel retroflexion conditioned by *r-. As 1lyr'3 'four' above demonstrates, there is no phonotactic constraint against ' coexisting with retroflexion, so I cannot claim that 2ryr1 would have ended in -y' if not for retroflexion.)
There is even a doublet for 'worm'
1888 2by1 < *mbuH and 5270 1by'1 < *mbuX
which is cognate to Written Tibetan Hbu [mbu] 'id.' See Gong
Hwang-cherng's "A Hypothesis of Three Grades and Vowel Length
Distinction in Tangut" (1995) for more examples. (Gong's 'long vowels'
correspond to my V' 'vowel-prime' sequences.) The correct
explanation for -' would have to account for such doublets.
220.127.116.11:59: TWENTY BLADES OF CHINESE GRASS
At the end of "An-derused",
I was surprised to see 漢 <CHINESE> 한 Han with 艹 instead of 廿 on the top right on the cover of 最新版常用學習三千漢字 Chhoeshinphan sangyong haksŭp samchhŏn hancha (Three Thousand Hanja for Everyday Study: New Edition).
I was even more surprised to look inside and see the entry for 漢 <CHINESE> 한 Han on p. 47. Each of the three thousand hanja in the book has a large entry character atop a chart showing how to write it in seven steps and one or more example words containing it. The large entry character is 漢 with 艹 (resembling the character component <GRASS> though actually having nothing to do with grass) on the top right. However,
the hanja is listed as "水 radical [i.e., 氵] 11 strokes" (the 11-stroke figure only makes sense if the hanja has 4-stroke 廿 [resembling the character <TWENTY> though having nothing to do with twenty] rather than 3-stroke 艹)
the seven-step writing diagram has 艹 in step 2 and 廿 in step 3
the example words 怪漢 koehan 'suspiciously behaving man' and 漢方 Hanbang 'Chinese medicine' have 漢 with 廿 on the top right.
That must be confusing to someone who does not know how to write the
hanja, since the book shows both ways to write <CHINESE>. If I were to write a book
on hanja, I'd bring up the variation of <CHINESE>.
I can describe that variation in terms of Unicode:
the 廿-version is U+FA47
the 艹-version is U+FA9A
So why don't I just type U+FA47 and U+FA9A instead of resorting to phrases like "漢 with 廿 on the top right"? Because I don't think most people have fonts that support the distinction between the two forms. The <CHINESE> hanja that you see here in fact has a third Unicode codepoint: U+6F22. Why are there three codepoints for two forms of <CHINESE>¹?
The 廿-version (U+FA47)
was added later in Unicode
for compatibility with the Japanese non-Unicode JIS standard which has
a separate codepoint for the 廿-version; U+6F22 corresponds to the
艹-version in Japanese fonts. (The Unihan database gives "J3" as a
source, but J3
is a code for JIS X 0213:2004 level-3 which I presume didn't exist
The 艹-version (U+FA9A)
was added even later in Unicode
for compatibility with the North Korean non-Unicode KPS 10721-2000
standard which has a separate codepoint for the 艹-version; U+6F22
presumably corresponds to the 廿-version in North Korean fonts. (I can't
find a copy of the North Korean standard online to confirm my guess.)
This table is my attempt to show the relationships between a few encodings and forms of <CHINESE>:
||North Korean equivalent
||South Korean equivalent in KS
The duplicate codepoints in Unicode are a byproduct of the different
versions of <CHINESE> corresponding to U+6F22 in Japanese and
North Korean encodings. In an 'ideal' Unicode without regard for
non-Unicode encodings, there would either be two codepoints for the two
versions (following a maximalist philosophy of one codepoint per form)
or just one (following a minimalist philosophy of one codepoint per platonic
character) but not three.
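One practical consequence of the three-codepoint situation can be checked with Python's standard `unicodedata` module. This is a minimal sketch, assuming the Unicode database bundled with Python lists U+FA47 and U+FA9A as compatibility ideographs whose canonical decomposition is U+6F22. The constant and function names are my own.

```python
import unicodedata

# Codepoints named in the text above.
UNIFIED = "\u6f22"     # U+6F22
JIS_COMPAT = "\ufa47"  # U+FA47, added for JIS compatibility
KPS_COMPAT = "\ufa9a"  # U+FA9A, added for KPS 10721-2000 compatibility


def codepoint(ch: str) -> str:
    """Format a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"


def summarize(ch: str) -> tuple:
    """Return (codepoint, canonical decomposition, codepoint after NFC)."""
    decomposition = unicodedata.decomposition(ch) or "(none)"
    nfc = unicodedata.normalize("NFC", ch)
    return codepoint(ch), decomposition, codepoint(nfc)


for ch in (UNIFIED, JIS_COMPAT, KPS_COMPAT):
    print(*summarize(ch))
```

If that assumption holds, both compatibility codepoints fold into U+6F22 under normalization, so any software that normalizes text erases the 廿~艹 distinction no matter which codepoint was typed.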
¹There are in fact at least 31 forms of <CHINESE>, but the 廿~艹 variants (and the simplified Chinese form 汉) are all that are needed for everyday purposes.
Of course right after I finished my previous post on 顏 U+984F~顔 U+9854 for Sino-Korean 안 an <FACE>, I realized I should have checked the very first Sino-Korean dictionary, 東國正韻 Tongguk chŏngun (1447), which is also one of the earliest hangul texts. Its entry for ᅌᅡᆫ ngan (the prescriptive 15th century reading of <FACE>) has the form 顏 U+984F. Needless to say, it is absurd to draw direct lines across three vastly different periods, but I'll do so anyway:
Tongguk (1447): 顏 — Gale (1897): 顏 — Sae chajŏn (1961): 顏 (all U+984F)
Those are the three earliest texts in my survey so far. I am certain of what I have seen in them. I am less certain about these search results from titles and authors in the National Library of Korea's database (via Cambridge's list of Korean studies resources), since it's possible someone typed one form instead of the other:
Sorting the results by date reveals some obvious typos: e.g., modern items like a book with 2018 in the title dated "201" instead of "201X". And the difference between "201" and "201X" is more obvious than that between 顏 U+984F and 顔 U+9854.
It is certainly not true that 顔 only appears in post-1961 books. The earliest result for 顔 is 史鉞 Sawŏl (The Axe of History, 1506). Although I don't have time to go through the online scan of the book (there is no search function), I can believe 顔 was in it, since the earliest attestation of that form that I can find is in the Chinese rhyme dictionary Guangyun (1008).
Conversely, it is also not true that 顏 U+984F is absent from recent publications, as ... oh no. The results include anything with the syllable an in the title or author's name regardless of whether it's spelled 顏, 顔, 晏 (the surname of the author of Sawŏl), in hangul as 안, etc. I suppose that makes sense in a time when few people may know what the proper hanja is. But why does searching for 顏 U+984F and 顔 U+9854 generate different results if all that matters is the presence of a syllable an regardless of written form? I don't know. I wonder if the site developer will ever address that question.
I'm going to look at the question of 顏 U+984F~顔 U+9854 from one last
angle. Here is a list of the frequency of the two forms in South Korean
national newspapers according
to Google. I have arranged the papers in order of circulation whenever
I could find figures. The figures were partly undated, so this table
cannot be interpreted as a true ranking. I just wanted a rough idea of
the popularity of the various papers.
||顏 U+984F||顔 U+9854||Notes
||顔 U+9854 figure includes instances in the paper's Japanese edition.|
||0 in spite of the fact that the paper does not
have a no-hanja policy like Hankyoreh (see below).
||顏 U+984F figure excludes instances in the
paper's Chinese edition.
顔 U+9854 figure includes instances in the paper's Japanese edition.
||The paper has a no-hanja policy in its Korean
edition, so the figures are for 顔 U+9854 in the paper's Japanese
The one instance of 顏 U+984F is in a comment in the Japanese edition and is probably a character selection error, as the rest of the comment is in postwar characters; the writer is not someone like me who insists on prewar orthography.
For comparison, in Asahi shinbun, 顏 U+984F appears 30 times and 顔 U+9854 appears 165,000 times. (Those figures include <FACE> for native kao as well as Sino-Japanese gan, whereas <FACE> in Korean only represents Sino-Korean an.) Kanji are alive and well in Japanese, whereas hanja are in decline in Korean. I took hanja seriously when I first started learning Korean in 1987. The newspapers were full of them then. But now Hankyoreh has zero in its Korean edition. An all-kana Japanese newspaper is unthinkable today, even though Japanese TV news reporters demonstrate it is possible to present the news orally without any kanji (not counting onscreen text).
What is missing from the figures above are a sense of proportion and the time dimension. What is the frequency of each form of <FACE> per million characters (counting hangul letter blocks as single characters) per year per publication? My guess is that the Japanese usage of both forms of <FACE> has remained constant after the postwar writing reform, whereas <FACE> in either form has become increasingly infrequent in Korean, though 顔 U+9854 has taken the lead due to
the inability to type 顏 U+984F in Windows' Korean IME
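The per-million measure I have in mind could be computed along these lines. This is only a sketch: the function name `per_million` and the sample string are hypothetical, and I assume the text is first normalized to NFC so that each hangul syllable block is one codepoint (NFC leaves 顏 U+984F and 顔 U+9854 distinct).

```python
import unicodedata


def per_million(text: str, target: str) -> float:
    """Occurrences of `target` per million non-whitespace characters."""
    # NFC composes hangul jamo into syllable blocks, so each block
    # counts as a single character.
    normalized = unicodedata.normalize("NFC", text)
    chars = [c for c in normalized if not c.isspace()]
    if not chars:
        return 0.0
    return chars.count(target) * 1_000_000 / len(chars)


# Hypothetical sample, not taken from any newspaper.
sample = "\u9854\u9762 \uc548\uba74 \u984f"  # 顔面 안면 顏
print(per_million(sample, "\u9854"))
print(per_million(sample, "\u984f"))
```

Running the same function over one year of one paper at a time would supply the missing time dimension.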
日常 ilsang 'everyday, common': 0
日曜日 iryoil 'Sunday': 0
日記 ilgi 'diary': 0
일상 ilsang 'everyday, common': 194
일요일 iryoil 'Sunday': 425
일기 ilgi 'diary': 593
中央日報 JoongAng Ilbo 'Central Daily'
創刊号 chhangganho 'first issue'
創刊辭 chhanggansa 'first issue editorial' (i.e., something
written to introduce the first issue)
1965年9月2日 chhŏn'gubaengnyukshibo-nyŏn kuwŏl iil 'September 2, 1965'
第1號 che ir ho 'issue number 1'
日刊 ilgan 'daily'
<NUMBER> 호 ho appears on the front page as 號 U+865F and as 号 U+53F7, the simplified form also used in postwar Japan. I get the impression that official standards aside - 号 U+53F7 isn't supported by KSC encoding or in the 1,800 hanja taught in secondary schools - Korean typographers and even reference book writers are not purists when it comes to hanja forms. I was surprised to see 漢 <CHINESE> 한 Han with 艹 instead of 廿 on the top right on the cover of 最新版常用學習三千漢字 Chhoeshinphan sangyong haksŭp samchhŏn hancha (Three Thousand Hanja for Everyday Study: New Edition).
18.104.22.168:59: J-AN-US: THE TWO <FACE>S OF NAVER
Why do I care so much about minute variations like 顏~顔 for Sino-Korean 안 an <FACE>?
In TJK¹ studies, subtly different graphs are often regarded by modern scholars as separate entities. Whether such differences also reflect linguistic differences requires study.
No such study is needed to know that 顏 and 顔 are the 'same'
character in one sense. But in Unicode, they are not: 顏 is U+984F and 顔
is U+9854. Unicode is not consistent about assigning variants to
different codepoints. That is not necessarily a flaw. Should the
VS17 and VS18 forms of 喩 U+55A9 have different codepoints? I
couldn't tell them apart without laying VS18 over VS17. On the other
hand, why doesn't VS19
of 囀󠄀 U+56C0 have its own codepoint? Andrew
West has much more on this issue.
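The two mechanisms can be contrasted in a few lines of Python. This is a sketch using only the standard library. The helper `variation_selector` is my own, and it relies on VS17 through VS256 occupying U+E0100 through U+E01EF. It only builds the sequences and says nothing about which glyphs a given font draws for them.

```python
import unicodedata

FACE_A = "\u984f"  # 顏
FACE_B = "\u9854"  # 顔

# Separately encoded variants stay distinct under every normalization form.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, FACE_A) != unicodedata.normalize(form, FACE_B)


def variation_selector(n: int) -> str:
    """Return the character VSn for 17 <= n <= 256."""
    if not 17 <= n <= 256:
        raise ValueError("only VS17-VS256 are handled in this sketch")
    return chr(0xE0100 + (n - 17))


# Variation sequences: one base codepoint plus an invisible selector.
base = "\u55a9"  # 喩
vs17_form = base + variation_selector(17)
vs18_form = base + variation_selector(18)

print(len(vs17_form), vs17_form == vs18_form, vs17_form[0] == vs18_form[0])
```

So 顏 and 顔 differ in the base codepoint itself, whereas the VS17 and VS18 forms of 喩 share a base and differ only in the trailing selector, which software is free to ignore.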
Back to Korean: my interest lies in determining what the de facto standard form of 안 an <FACE> is or was at different points of time.
Last night I forgot to check Gale (1897), one of the first Korean dictionaries I ever used. Page 943 has 顏 U+984F.
Today I use naver.com. Its hanja dictionary treats 顏 U+984F as the 本字 ponja 'original character' of 顔 U+9854, but its entry for 顔 U+9854 is lengthier, including lists of 18 words and 5 phrases containing 顔 U+9854 without equivalents in the entry for 顏 U+984F. Clearly the dictionary regards 顔 as the principal form. Yet if I run a search on those characters throughout the entire dictionary (i.e., if I have 전체 chŏnchhe 'entire body' selected), I get
One might conclude that the 31 words containing 顏 U+984F can never be written with 顔 U+9854, that there are no phrases that can be written with 顏 U+984F, etc. But that isn't true: anything that can be written with one can be written with the other. And yet there may not be any overlap between those lists: e.g.,
伯顏 paegan 'Jurchen word for a rich man' has no alternate spelling 伯顔 with U+9854 listed
顔面 anmyŏn 'face' has no alternate spelling 顏面 with U+984F listed
有顏面 yuanmyŏn 'having a face'
隔歲顏面 kyŏkseanmyŏn 'a face one meets for the first time
in a year'
My impression is that in South Korea 顔 U+9854 has become dominant but has not yet fully eclipsed 顏 U+984F. Otherwise I would expect Naver to be like Japanese dictionaries which have a single main entry for 顔 U+9854 and list 顏 U+984F as a variant.
I predict that the domination of 顔 U+9854 will increase over time as
Koreans type fewer hanja and only use hanja that their IMEs provide for
them: e.g., 顔 U+9854 but not 顏 U+984F in the case of Windows.