These days I am working on many languages, such as Vietnamese, Dutch and Thai, apart from Indian languages. Now it is the turn of Chinese. I have always been interested in knowing more about China, for several reasons: China is a neighboring country of mine, there have been recent political disturbances between China and India, and there is more and more news about China saving the world from recession, being a super emerging power, blah blah. I always wonder why there is such an economic gap between India and China. Is it because of the political systems adopted by India (democracy) and China (communism), is it because of the nature of the people, or is it something else? Anyway, we have many things to learn from China and there is a long way to go. I am not against India or China. I feel that China and India are closely related in terms of culture, relations, religious traditions etc., but there is a huge gap in technology. One good thing is that IT in India is on par with (or above) China's. Rather than fighting, I think we should help each other more so that two huge global powers rise.

Ahh.. As far as I am concerned (as an NLP researcher), my goal is to learn a bit about the structure of the Chinese language and to collect the biggest Chinese corpus ever made, so that my name is inscribed on the golden walls of Chinese history ;)

Let me highlight one thing first: an educated Chinese speaker uses about three to four thousand characters (I was amazed to learn this. OMG!!). OK, let me tell you more about Chinese (in formal words).

Brief Introduction to Chinese

About one-fifth of the world’s population, or over one billion people, speak some form of Chinese as their native language. There are between seven and thirteen main regional groups of Chinese (depending on the classification scheme), of which the most spoken, by far, is Mandarin (about 850 million), followed by Wu (90 million), Cantonese (Yue) (70 million) and Min (70 million). Some people call Chinese a language and its subdivisions dialects, while others call Chinese a language family and its subdivisions languages. The identification of the varieties of Chinese as "dialects" instead of "languages" is considered inappropriate by some linguists and Sinologists. This is mainly because most of these groups are mutually unintelligible (not understood without learning or much effort), although some, like Xiang and the Southwest Mandarin dialects, may share common terms and some degree of intelligibility.

Chinese Writing System

All the above varieties of Chinese share a common writing system. Traditional Chinese and Simplified Chinese characters are the two standard sets of printed Chinese characters (also called hanzi). Simplified Chinese character forms were created by decreasing the number of strokes and simplifying the forms of a sizable proportion of traditional Chinese characters. Simplified Chinese is mostly used in China (the People's Republic of China), Singapore and the United Nations, whereas Traditional Chinese is used in Taiwan (the Republic of China), Hong Kong and Macau. Debate on Simplified Chinese vs. Traditional Chinese can be found here.

The web and electronic presence of Simplified Chinese is estimated to be larger than that of Traditional Chinese.

Pinyin is currently the most commonly used romanization system for Chinese. It is also used extensively on the web (mostly by foreign language learners and the younger generations). Currently, Standard Mandarin is taught in schools using the Pinyin system.

Chinese Morphology

The number of Chinese characters contained in the Kangxi dictionary is approximately 47,035, although a large number of these are rarely used variants accumulated throughout history. An educated Chinese speaker uses about three to four thousand characters (I was amazed to learn this. OMG!!).

Chinese characters are morphosyllabic, each usually corresponding to a spoken syllable with a basic meaning, which is why Chinese is often described as monosyllabic. However, a majority of words in Mandarin Chinese require two or more characters to write (and are thus polysyllabic), and their meaning is distinct from that of the characters they are made from. So if we collect a corpus of two billion characters, it might roughly correspond to a one-billion-word corpus, assuming an average of about two characters per word.
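To make the characters-to-words estimate concrete, here is a minimal sketch. The hand-segmented sample sentence and the helper names are my own illustrative assumptions; a real estimate would need a large segmented sample.

```python
def average_word_length(segmented_words):
    """Average number of characters per word in a segmented sample."""
    total_chars = sum(len(word) for word in segmented_words)
    return total_chars / len(segmented_words)


def estimate_word_count(char_count, chars_per_word):
    """Rough word count of a corpus from its character count."""
    return round(char_count / chars_per_word)


# Hypothetical hand-segmented sample: "I like studying Chinese".
sample = ["我", "喜欢", "学习", "中文"]

print(average_word_length(sample))
# With the rough figure of two characters per word used in the text:
print(estimate_word_count(2_000_000_000, 2.0))
```

The ratio obviously depends on the genre and on the segmentation standard, so treat the result as an order-of-magnitude guess.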

Chinese has few grammatical inflections: it possesses no tenses, no voices, no numbers (singular, plural; though there are plural markers, for example for personal pronouns), and hardly any articles (i.e., equivalents to "the, a, an" in English). There is, however, a gender difference in the written language (for example, 他 for "he" and 她 for "she").

Chinese features Subject Verb Object (SVO) word order. Words are not separated by spaces, which makes tokenization (word segmentation) an important problem.
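To see why the missing spaces matter, here is a small sketch of greedy forward maximum matching, one of the simplest dictionary-based segmentation approaches. The tiny lexicon and the function name are assumptions made up for illustration; real segmenters use large dictionaries and statistical models.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Segment text greedily, always taking the longest lexicon match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon (hypothetical, for illustration only).
LEXICON = {"我", "喜欢", "学习", "中文", "研究", "研究生", "生命", "起源"}

print(forward_max_match("我喜欢学习中文", LEXICON))
print(forward_max_match("研究生命起源", LEXICON))
```

The first sentence comes out as 我 / 喜欢 / 学习 / 中文, which is fine. The second one comes out as 研究生 / 命 / 起源 ("graduate student / fate / origin") instead of the intended 研究 / 生命 / 起源 ("study / life / origin"), because the greedy match grabs the longer word first. That kind of ambiguity is what makes tokenization hard.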

Chinese Encodings

Chinese character encodings can be used to represent text written in the CJK languages: Chinese, Japanese and Korean. The most common encodings are:

Guobiao (prefixed by GB), mainly used for Simplified Chinese (the PRC's official encoding)
Big5, mainly used for Traditional Chinese (the de facto standard in Taiwan, i.e. the ROC)
Unicode, for all (accepted by every community and increasingly popular)
CP936 (GBK) and CP950 (Big5) [Microsoft's code pages]
Other encodings also exist but are not as popular.
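As a quick illustration, the snippet below encodes the same two-character string with three of these encodings using Python's built-in codecs (`gbk`, `big5`, `utf-8`) and checks that each one round-trips. The sample string and the helper name are my own choices.

```python
def encoded_sizes(text, codecs=("gbk", "big5", "utf-8")):
    """Return {codec: number of bytes} for text, verifying the round trip."""
    sizes = {}
    for codec in codecs:
        data = text.encode(codec)
        # Decoding with the same codec must give back the original string.
        assert data.decode(codec) == text
        sizes[codec] = len(data)
    return sizes


sizes = encoded_sizes("中文")  # "Chinese (written language)"
for codec, size in sizes.items():
    print(codec, size, "bytes")
```

The same text takes two bytes per character in GBK and Big5 but three bytes per character in UTF-8, and the byte values differ in every case, so a file must always be decoded with the encoding it was written in.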

The problem is the large character set that has to be reserved for these languages: around 40,000 characters are needed to represent the complete language. I am not going into those details.

Conversion from Traditional Chinese to Simplified Chinese is easy (many to one), but the reverse is quite difficult (one to many) and depends on the context. Most of these ambiguities can be resolved by existing tools now. Converting from Unicode to the other encodings is very easy.
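Here is a toy sketch of why one direction is easy and the other is not. The four-entry mapping table is a hand-picked assumption for illustration; real converters use complete tables plus word-level context.

```python
from collections import defaultdict

# Tiny hand-made Traditional -> Simplified table (illustration only).
T2S = {
    "發": "发",  # "to emit, to develop"
    "髮": "发",  # "hair"
    "頭": "头",  # "head"
    "後": "后",  # "after, behind"
}


def to_simplified(text):
    """Many-to-one: every character has at most one target, so a lookup is enough."""
    return "".join(T2S.get(ch, ch) for ch in text)


# Inverting the table shows the one-to-many problem.
S2T = defaultdict(list)
for trad, simp in T2S.items():
    S2T[simp].append(trad)

print(to_simplified("頭髮"))  # "hair"
print(to_simplified("發展"))  # "development"
print(S2T["发"])              # candidates for converting 发 back
```

Both 頭髮 and 發展 end up containing 发, and going back from 发 gives two candidates (發 and 髮), so the right choice depends on the surrounding word.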

I hope this is helpful to some of you. And lemme know your views.

