As you may know, Unicode is a character set that aims to cover every single script on the planet (and beyond). Characters on the main plane of Unicode (U+0000 to U+FFFF), which almost certainly include everything you will ever need, can be accessed in HTML with the escape sequence &#xcode;, where code is the character's hexadecimal code point. There are several distinct advantages to this approach:
Using Unicode characters in node titles is also a bit of an iffy business, since they're usually pretty tough to enter and also because EDB doesn't realize that &#xhex;, &#x0hex;, &#dec; and &#0dec; are all the same character. Then again, for "non-transcriptable" languages like Hebrew and Arabic, entering the words in Unicode is pretty much the only way to get a unique and identifiable name. But until the search code gets tweaked for better support for non-Latin1 characters, I would have to recommend keeping Unicode out of titles.
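To illustrate why different spellings of one numeric reference (hexadecimal or decimal, with or without leading zeros) really are the same character, here is a minimal sketch that rewrites every numeric reference in one canonical form. The function name canonicalizeReferences is made up for this example, and nothing here is taken from the actual E2 search code.

```javascript
// Rewrite every numeric character reference in one canonical form:
// lowercase hexadecimal, no leading zeros. Named entities such as
// &eacute; are left untouched. (Hypothetical helper, not E2 code.)
function canonicalizeReferences(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    return '&#x' + codePoint.toString(16) + ';';
  });
}

// Four spellings of the Hebrew letter aleph (U+05D0).
const variants = ['&#x5d0;', '&#x05D0;', '&#1488;', '&#01488;'];
const canonical = variants.map(canonicalizeReferences);
console.log(canonical); // all four entries come out as '&#x5d0;'
```

A title comparison that canonicalized both sides like this first would treat all four spellings as equal.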
As these computations are left to the user's display engine, it is possible that the browser does not know the proper rendering method or that there are bugs in the rendering code -- for example, Mozilla (at time of writing) still has some difficulties with bidirectional scripts. There is nothing you can do about this, but again, browsers that dig Unicode will usually get these right and the issue is irrelevant for systems that don't support Unicode at all.
This method is, however, intensely painful for anything more complex than a single name. Also, while OK for alphabetic or syllabic scripts, converting Japanese kanji or Chinese hanzi (漢字) by browsing through 5000 characters is not fun.
A better option is Java, which includes a remarkable set of tools that can convert almost any encoding into Unicode and back. Once the text is Unicode, it's a simple matter to extract the hex code and pad it, and that's what my little utility J2U does. You'll need a working Java environment to run J2U; writing an applet interface to the tool is on my TODO list.
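The "extract the hex code and pad it" step can be sketched as follows. This is not J2U itself (which is a Java tool and also handles encoding conversion); it is a rough JavaScript illustration that assumes the text is already a Unicode string. The name toHexReferences and the choice to pad to four digits are assumptions made for the example.

```javascript
// Replace every non-ASCII character with a hexadecimal reference padded
// to at least four hex digits. for...of walks the string by code point,
// so a character outside the main plane becomes one reference rather
// than two surrogate halves. ASCII (including '&') passes through as-is.
function toHexReferences(text) {
  let out = '';
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    if (codePoint < 128) {
      out += ch;
    } else {
      out += '&#x' + codePoint.toString(16).toUpperCase().padStart(4, '0') + ';';
    }
  }
  return out;
}

console.log(toHexReferences('\u6F22\u5B57')); // &#x6F22;&#x5B57;
```

The two escapes in the usage line are the characters 漢字, written as escapes only so the example does not depend on the file's own encoding.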
For Japanese, you can cut and paste strings in any encoding into XJDIC or WWWJDIC (at http://www.csse.monash.edu.au/~jwb/wwwjdic.html), after which performing an "Examine Kanji" on the word gives the Unicode as Uxxxx. unicode.org's Unihan database search provides similar facilities for all languages that use 漢字.
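Turning a code reported as Uxxxx into something you can paste into a writeup is a one-liner. The sketch below assumes the code is four to six hex digits after the "U"; accepting an optional "+" (as in U+6F22) is an extra convenience, and the function name is hypothetical.

```javascript
// Convert a code written as "Uxxxx" (or "U+xxxx") into an HTML
// hexadecimal character reference. Throws if the input does not look
// like such a code.
function uNotationToReference(code) {
  const m = /^U\+?([0-9a-fA-F]{4,6})$/.exec(code.trim());
  if (!m) {
    throw new Error('Not a Uxxxx code: ' + code);
  }
  return '&#x' + m[1].toUpperCase() + ';';
}

console.log(uNotationToReference('U6F22')); // &#x6F22;
```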
A few more tools and tips sent in by kind noders:
Cheers to Gorgonzola, lj, Oolong, tres equis and WWWWolf for corrections and additions.
javascript: p=(document.all)? document.selection.createRange().text: ((window.getSelection)? window:document).getSelection().toString(); if(!p) void(p=prompt('Text...','')); while(p){ q=''; for(i=0; i<p.length; i++) { j=p.charCodeAt(i); q+=(j==38)?'&amp;':(j<128)?p.charAt(i):'&#'+j+';'; } void(p=prompt(p,q)); }
For Western languages, see HTML symbol reference. They have HTML entity codes beginning with ampersand and ending with semicolon, around a name, for example &eacute; for é. Most of these should also be creatable on your keyboard using a combination with Alt, Ctrl, or Option keys: see Special Alt key characters & accents. The Western European character set covers English, French, Spanish, Italian, Portuguese, German, Danish, Swedish, Norwegian, Finnish, and in theory Icelandic though in practice the letters thorn and edh often come out wrong. Blame your browser. Greek letters can also be represented by HTML entities such as &alpha; for α.
For brevity I am not repeating those letters that are found in the Western set, with acute, grave, circumflex, umlaut, and so on. See Accent marks used with the Latin alphabet for a list of Western and Eastern accented letters arranged by accent.
In general, do not use accented letters in node titles or in hard links. Even if you think they're better that way. They're not. What's better is if other noders can find them. The E2 Search facility is limited in what it can find: it cannot find ü if you search for u, nor vice versa. Acutes and graves are okay, but umlauts won't work. It is better to leave other accents off. E2 is written in English, not Hungarian, and in English we usually leave all accents off. Please do not put in title edit requests asking for them to be added. If you want the accents to appear in your text, pipelink them, e.g. [Lowenbrau|Löwenbräu]. See E2 FAQ: Using Special HTML Characters for more detail on this.
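If you have a lot of accented names to link, the pipelink can be generated mechanically. The sketch below strips accents by decomposing each letter (Unicode normalization form NFD) and dropping the combining marks, then builds the [plain|accented] form. The function names are invented for the example, and letters that have no decomposition (Ł, ø, đ and the like) are left as they are, so check the result by eye.

```javascript
// Remove accents by decomposing each character and deleting the
// combining diacritical marks (U+0300 to U+036F).
function stripAccents(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Build an E2 pipelink whose target is the unaccented title and whose
// visible text keeps the accents. Plain names get an ordinary hard link.
function makePipelink(name) {
  const plain = stripAccents(name);
  return plain === name ? '[' + name + ']' : '[' + plain + '|' + name + ']';
}

console.log(makePipelink('L\u00F6wenbr\u00E4u')); // [Lowenbrau|Löwenbräu]
```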
Never use HTML entities or Unicode in names in node titles. Don't be pedantic about names. Pedantry is bad. Usefulness is good.
In the following tables capital letters come before lowercase. If you can't see them properly, this won't be of use to you. That's a limitation of your browser. A lot of browsers won't be able to show them, and they'll just appear as rectangles or question marks. And I use proper human numbers, not hexadecimal, which means there's no "x" in the code, just &#nnn;.
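If a chart or tool gives you a hexadecimal code and you want the decimal form, or you have the character itself and want its number, the conversion is simple arithmetic. A small sketch, with made-up function names:

```javascript
// Decimal reference for the first character of a string.
function decimalReference(ch) {
  return '&#' + ch.codePointAt(0) + ';';
}

// Rewrite hexadecimal references (&#xhhh;) as decimal ones (&#nnn;).
// Decimal references and named entities are left alone.
function toDecimalReferences(text) {
  return text.replace(/&#[xX]([0-9a-fA-F]+);/g, (match, hex) => '&#' + parseInt(hex, 16) + ';');
}

console.log(decimalReference('\u0100'));             // &#256;  (A-macron)
console.log(toDecimalReferences('&#x100; &#x5D0;')); // &#256; &#1488;
```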
Large scripts like Chinese and Devanagari are beyond the scope of this write-up, as are extras like the vowel pointing of Hebrew and Arabic. Go to www.unicode.org/charts for all the rest, like Mongolian, Tamil, Ogham -- the lot.
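&#1575; ا alif
&#1576; ب ba
&#1577; ة ta marbuta
&#1578; ت ta
&#1579; ث tha
&#1580; ج jim
&#1581; ح ha emphatic
&#1582; خ kha
&#1583; د dal
&#1584; ذ dhal
&#1585; ر ra
&#1586; ز za
&#1587; س sin
&#1588; ش shin
&#1589; ص sad
&#1590; ض dad
&#1591; ط ta emphatic
&#1592; ظ za emphatic
&#1593; ع ain
&#1594; غ ghain
(a gap in numbers)
&#1601; ف fa
&#1602; ق qaf
&#1603; ك kaf
&#1604; ل lam
&#1605; م mim
&#1606; ن nun
&#1607; ه ha
&#1608; و waw
&#1609; ى ya undotted
&#1610; ي ya dotted
Letters with hamza:
&#1569; ء no bearer
&#1571; أ alif hamza above
&#1572; ؤ waw hamza
&#1573; إ alif hamza below
&#1574; ئ ya hamza
Other diacritics:
&#1570; آ alif maddah
&#1611; ً fathah with nunation
&#1612; ٌ dammah with nunation
&#1613; ٍ kasrah with nunation
&#1614; َ fathah
&#1615; ُ dammah
&#1616; ِ kasrah
&#1617; ّ shaddah
&#1618; ْ sukun
Numerals:
&#1632; ٠ 0
&#1633; ١ 1
&#1634; ٢ 2
&#1635; ٣ 3
&#1636; ٤ 4
&#1637; ٥ 5
&#1638; ٦ 6
&#1639; ٧ 7
&#1640; ٨ 8
&#1641; ٩ 9

&#256; Ā &#257; ā A-macron
&#7692; Ḍ &#7693; ḍ D-dot-below
&#7716; Ḥ &#7717; ḥ H-dot-below
&#298; Ī &#299; ī I-macron
&#7778; Ṣ &#7779; ṣ S-dot-below
&#7788; Ṭ &#7789; ṭ T-dot-below
&#362; Ū &#363; ū U-macron

&#399; Ə &#601; ə schwa
&#286; Ğ &#287; ğ G-breve (yumuşak-G)
&#304; İ I dotted capital
&#305; ı I undotted lowercase
&#350; Ş &#351; ş S-cedilla

&#1168; Ґ &#1169; ґ G-hook
&#1030; І &#1110; і I
&#1038; Ў &#1118; ў U-breve

&#319; Ŀ &#320; ŀ L-mid-dot

&#262; Ć &#263; ć C-acute
&#268; Č &#269; č C-hacek
&#272; Đ &#273; đ D-bar
&#352; Š &#353; š S-hacek
&#381; Ž &#382; ž Z-hacek

&#268; Č &#269; č C-hacek
&#270; Ď &#271; ď D-hook
&#282; Ě &#283; ě E-hacek
&#327; Ň &#328; ň N-hacek
&#344; Ř &#345; ř R-hacek
&#352; Š &#353; š S-hacek
&#356; Ť &#357; ť T-hook
&#366; Ů &#367; ů U-circle
&#381; Ž &#382; ž Z-hacek

&#264; Ĉ &#265; ĉ C-circumflex
&#284; Ĝ &#285; ĝ G-circumflex
&#292; Ĥ &#293; ĥ H-circumflex
&#308; Ĵ &#309; ĵ J-circumflex
&#348; Ŝ &#349; ŝ S-circumflex
&#364; Ŭ &#365; ŭ U-breve

&#699; ʻ 'okina
&#256; Ā &#257; ā A-macron
&#274; Ē &#275; ē E-macron
&#298; Ī &#299; ī I-macron
&#332; Ō &#333; ō O-macron
&#362; Ū &#363; ū U-macron

&#1488; א aleph
&#1489; ב beth
&#1490; ג gimel
&#1491; ד daleth
&#1492; ה he
&#1493; ו waw
&#1494; ז zayin
&#1495; ח heth
&#1496; ט teth
&#1497; י yod
&#1498; ך kaph final
&#1499; כ kaph
&#1500; ל lamedh
&#1501; ם mem final
&#1502; מ mem
&#1503; ן nun final
&#1504; נ nun
&#1505; ס samekh
&#1506; ע ayin
&#1507; ף pe final
&#1508; פ pe
&#1509; ץ sadhe final
&#1510; צ sadhe
&#1511; ק qoph
&#1512; ר resh
&#1513; ש shin/sin
&#1514; ת taw

&#336; Ő &#337; ő O-double-acute
&#368; Ű &#369; ű U-double-acute

&#256; Ā &#257; ā A-macron
&#274; Ē &#275; ē E-macron
&#332; Ō &#333; ō O-macron
&#362; Ū &#363; ū U-macron

&#334; Ŏ &#335; ŏ O-breve
&#364; Ŭ &#365; ŭ U-breve

&#256; Ā &#257; ā A-macron
&#258; Ă &#259; ă A-breve
&#274; Ē &#275; ē E-macron
&#276; Ĕ &#277; ĕ E-breve
&#298; Ī &#299; ī I-macron
&#300; Ĭ &#301; ĭ I-breve
&#332; Ō &#333; ō O-macron
&#334; Ŏ &#335; ŏ O-breve
&#362; Ū &#363; ū U-macron
&#364; Ŭ &#365; ŭ U-breve

&#256; Ā &#257; ā A-macron
&#268; Č &#269; č C-hacek
&#274; Ē &#275; ē E-macron
&#290; Ģ &#291; ģ G-cedilla
&#298; Ī &#299; ī I-macron
&#310; Ķ &#311; ķ K-cedilla
&#315; Ļ &#316; ļ L-cedilla
&#325; Ņ &#326; ņ N-cedilla
&#332; Ō &#333; ō O-macron
&#342; Ŗ &#343; ŗ R-cedilla
&#352; Š &#353; š S-hacek
&#362; Ū &#363; ū U-macron
&#381; Ž &#382; ž Z-hacek

&#260; Ą &#261; ą A-ogonek
&#268; Č &#269; č C-hacek
&#280; Ę &#281; ę E-ogonek
&#278; Ė &#279; ė E-dot-above
&#302; Į &#303; į I-ogonek
&#352; Š &#353; š S-hacek
&#362; Ū &#363; ū U-macron
&#370; Ų &#371; ų U-ogonek
&#381; Ž &#382; ž Z-hacek

&#1027; Ѓ &#1107; ѓ GJ (G-acute)
&#1029; Ѕ &#1109; ѕ DZ
&#1032; Ј &#1112; ј J
&#1033; Љ &#1113; љ LJ
&#1034; Њ &#1114; њ NJ
&#1036; Ќ &#1116; ќ KJ (K-acute)
&#1039; Џ &#1119; џ DZ-hacek

&#266; Ċ &#267; ċ C-dot-above
&#288; Ġ &#289; ġ G-dot-above
&#294; Ħ &#295; ħ H-bar
&#379; Ż &#380; ż Z-dot-above

&#256; Ā &#257; ā A-macron
&#274; Ē &#275; ē E-macron
&#298; Ī &#299; ī I-macron
&#332; Ō &#333; ō O-macron
&#362; Ū &#363; ū U-macron

&#1662; پ p
&#1670; چ ch
&#1688; ژ zh
&#1711; گ g

&#260; Ą &#261; ą A-ogonek
&#262; Ć &#263; ć C-acute
&#280; Ę &#281; ę E-ogonek
&#321; Ł &#322; ł L-slash
&#323; Ń &#324; ń N-acute
&#346; Ś &#347; ś S-acute
&#377; Ź &#378; ź Z-acute
&#379; Ż &#380; ż Z-dot-above

&#258; Ă &#259; ă A-breve
&#350; Ş &#351; ş S-cedilla
&#354; Ţ &#355; ţ T-cedilla

&#536; Ș &#537; ș S-comma
&#538; Ț &#539; ț T-comma

&#1040; А &#1072; а a
&#1041; Б &#1073; б b
&#1042; В &#1074; в v
&#1043; Г &#1075; г g
&#1044; Д &#1076; д d
&#1045; Е &#1077; е ye
&#1025; Ё &#1105; ё yo (N.B. out of order!)
&#1046; Ж &#1078; ж zh
&#1047; З &#1079; з z
&#1048; И &#1080; и i
&#1049; Й &#1081; й y
&#1050; К &#1082; к k
&#1051; Л &#1083; л l
&#1052; М &#1084; м m
&#1053; Н &#1085; н n
&#1054; О &#1086; о o
&#1055; П &#1087; п p
&#1056; Р &#1088; р r
&#1057; С &#1089; с s
&#1058; Т &#1090; т t
&#1059; У &#1091; у u
&#1060; Ф &#1092; ф f
&#1061; Х &#1093; х kh
&#1062; Ц &#1094; ц ts
&#1063; Ч &#1095; ч ch
&#1064; Ш &#1096; ш sh
&#1065; Щ &#1097; щ shch
&#1066; Ъ &#1098; ъ hard sign
&#1067; Ы &#1099; ы y
&#1068; Ь &#1100; ь soft sign
&#1069; Э &#1101; э e
&#1070; Ю &#1102; ю yu
&#1071; Я &#1103; я ya

&#256; Ā &#257; ā A-macron
&#7692; Ḍ &#7693; ḍ D-dot-below
&#7716; Ḥ &#7717; ḥ H-dot-below
&#298; Ī &#299; ī I-macron
&#7734; Ḷ &#7735; ḷ L-dot-below
&#7746; Ṃ &#7747; ṃ M-dot-below
&#7748; Ṅ &#7749; ṅ N-dot-above
&#7750; Ṇ &#7751; ṇ N-dot-below
&#7770; Ṛ &#7771; ṛ R-dot-below
&#7772; Ṝ &#7773; ṝ R-dot-and-macron
&#346; Ś &#347; ś S-acute
&#7778; Ṣ &#7779; ṣ S-dot-below
&#7788; Ṭ &#7789; ṭ T-dot-below
&#362; Ū &#363; ū U-macron

&#1026; Ђ &#1106; ђ D-bar
&#1032; Ј &#1112; ј J
&#1033; Љ &#1113; љ LJ
&#1034; Њ &#1114; њ NJ
&#1035; Ћ &#1115; ћ C-acute
&#1039; Џ &#1119; џ DZ-hacek

&#268; Č &#269; č C-hacek
&#270; Ď &#271; ď D-hook
&#313; Ĺ &#314; ĺ L-acute
&#317; Ľ &#318; ľ L-apostrophe
&#327; Ň &#328; ň N-hacek
&#340; Ŕ &#341; ŕ R-acute
&#352; Š &#353; š S-hacek
&#356; Ť &#357; ť T-hook
&#381; Ž &#382; ž Z-hacek

&#286; Ğ &#287; ğ G-breve (yumuşak-G)
&#304; İ I dotted capital
&#305; ı I undotted lowercase
&#350; Ş &#351; ş S-cedilla

&#327; Ň &#328; ň N-hacek
&#350; Ş &#351; ş S-cedilla
&#381; Ž &#382; ž Z-hacek

&#1028; Є &#1108; є curved-E
&#1030; І &#1110; і I
&#1031; Ї &#1111; ї I-umlaut
&#1168; Ґ &#1169; ґ G-hook

&#258; Ă &#259; ă A-breve
&#272; Đ &#273; đ D-bar
&#416; Ơ &#417; ơ O-hook
&#431; Ư &#432; ư U-hook

&#372; Ŵ &#373; ŵ W-circumflex
&#374; Ŷ &#375; ŷ Y-circumflex

&#7864; Ẹ &#7865; ẹ E-dot-below
&#7884; Ọ &#7885; ọ O-dot-below
&#7778; Ṣ &#7779; ṣ S-dot-below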
Chess symbols in Unicode:
&#9812; - ♔
&#9813; - ♕
&#9814; - ♖
&#9815; - ♗
&#9816; - ♘
&#9817; - ♙
&#9818; - ♚
&#9819; - ♛
&#9820; - ♜
&#9821; - ♝
&#9822; - ♞
&#9823; - ♟
By now, you will have noticed, O astute reader, that these all come on simple, white backgrounds. Apparently, the good people defining the Unicode standard don't play chess enough to understand how clueless this is.
Bottom line: if you want to node a chessboard with black squares, you're in trouble. Still, I'm sure you're creative enough to find a workaround.
If your browser doesn't display the characters, you'll need to get yourself a font that contains the upper range of Unicode characters - say, Arial Unicode MS, or some such.
I love gn0sis's idea above for adding the native name, in Unicode, to your writeups about foreign terms, concepts, people, etc.
Here are many important terms I have gathered in various languages, all of them already noded or certainly nodeable. They are listed first in alphabetical order of language, then in English alphabetical order of the English word, English spelling, or English transliteration. There are no doubt errors here, since I don't speak any of these languages, so please /msg me if you find an error, have an addition, or use one of these in a writeup of yours. To use these, you can try just copying and pasting into your writeup; this works for some browsers, depending on the configuration. Otherwise, use your browser's "View Source" menu and copy and paste the HTML entities.
Amharic (አማርኛ):
Arabic (العربية):
Armenian (Հայ, Հայերեն):
Assamese (অসমিয়া):
Azerbaijani, Azeri (Азәрбайжан):
Belarusian (Беларуская):
Bengali (বাঙালী):
Bulgarian (Български):
Cherokee (ᏣᎳᎩ):
Chinese (汉语, 中文):
Czech (Česky):
Dhivehi (ދިވެހިބަސް):
Dzongkha (༄༅ཇོ༹ང་ཁ):
Farsi, Persian (فارسی):
Georgian (ქართულად):
Greek (Ελληνικά):
Gujarati (ગુજરાતી):
Hebrew (עברית):
Hindi (हिन्दी):
Inuktitut (ᐃᓄᒃᑎᑐᑦ):
Japanese (日本語):
Kannada, Kanarese (ಕನ್ನಡ):
Kashmiri (कश्मीरी, كشميري):
Kazakh (Қазақ):
Khmer, Cambodian (ខ្មែរ):
Klingon, tlhIngan Hol ( ):
Konkani (कोंकणी):
Korean (한국어):
Kyrgyz (кыргыз, кыргызча):
Lao (ລາວ):
Latvian (latviešu):
Macedonian (Македонски):
Malayalam (മലയാളം):
Manipuri (?):
Maori (Māori):
Marathi (मराठी):
Mongolian (монгол хэл):
Myanmar, Burmese (မ္ရန္မာ):
Nepali (नेपाली):
Oriya (ଓଡ଼ିଆ):
Pashto (پښتو):
Pitjantjatjara:
Anangu (Aṉangu)
Kata Tjuta, The Olgas (Kata Tjuṯa)
Uluru, Ayers Rock (Uluṟu)
Polish (Polski):
Punjabi (ਪੰਜਾਬੀ, پنجابي, पंजाबी):
Romanian (Română):
Russian (Русский):
Sanskrit (संस्कृत):
Serbian (Српски, Srpski):
Sinhala, Sinhalese (සිංහල):
Slovenian (Slovenščina):
Tajik (тоҷикӣ):