This brief article was authored for my International E-Discovery session at the Georgetown Advanced E-Discovery Institute in November 2010.
Many e-discovery vendors promote Unicode as the solution for processing, analysing and reviewing non-English text. However, as at January 2010, Google estimated that Unicode-encoded text accounted for less than 50 per cent of all the web-based text it had indexed.
Human communication relies on verbal and written language. As a modern writing system, the English alphabet of 26 characters can trace its roots through Latin and Greek to the ancient Phoenician alphabet. The Cyrillic alphabet also has Latin and Greek ancestry, and the Arabic alphabet descended from the Aramaic alphabet, which can likewise be traced to the ancient Phoenician alphabet. Today, it is estimated that 328 million people natively speak English. In contrast, over 1.4 billion people natively speak Chinese, Japanese or Korean (CJK).
"Han" characters are the foundation of the CJK writing systems and date back to 1200 B.C. Unlike the English alphabet, Han characters were not designed for the purpose of phonetics, but are commonly considered ideographic in nature. That is, each character serves as a visual representation of an idea or meaning. Further, the separate evolution of Han characters in the People's Republic of China (PRC), Japan and Korea means that there is no single agreed "set" of CJK characters.
In Japan, for example, Han characters are called "Kanji" and are one of four writing systems used in Japan. The other writing systems are "Hiragana", "Katakana" and the Latin alphabet, "Romaji". Both Hiragana and Katakana have phonetic features. In the PRC, under Mao's governance in the 1950s, significant efforts were made to simplify written Chinese. This led to the introduction of Simplified Chinese, which replaced several thousand Han characters (known as "Hanzi" in the PRC) with simplified forms, and of "Pinyin", a phonetic romanisation system based on the Latin alphabet. In Korea, Han characters, known as "Hanja", are now used less frequently, having been largely replaced by the more phonetic Hangul alphabet. A feature of CJK writing systems, particularly Japanese, is that multiple writing systems can be used in one sentence, and even in one word.
As computer technology has evolved, writing systems have been divided into characters and individually encoded to a unique number or "code point", to create "character sets". Encoded character sets are also commonly referred to as "code pages". Encoded character sets allow characters to be entered, stored and transferred, as data, between computers.
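As a minimal illustration of the "code point" concept in Python (whose built-in `ord` function returns a character's Unicode code point), each character maps to a unique number:

```python
# Each character in an encoded character set maps to a unique number,
# its "code point". ord() returns the Unicode code point of a character.
for ch in ["A", "é", "中"]:
    print(ch, "->", ord(ch))  # A -> 65, é -> 233, 中 -> 20013
```

The same character may be assigned a different number in a different encoded character set; the numbers above are the Unicode code points.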
The American Standard Code for Information Interchange (ASCII), a 7-bit encoded character set first released in 1963, is still widely used to encode American English text. ASCII was widely adopted, and later extended as an 8-bit (or 1-byte) encoded character set to include accented characters and so cater for European languages, including German and French. From the mid-1960s to the late 1990s, many proprietary encoded character sets were introduced, including IBM Corporation's Extended Binary Coded Decimal Interchange Code (EBCDIC), the Extended Unix Code (EUC), Apple's Mac OS Roman and multiple Microsoft Windows code pages. Many countries also introduced or sponsored standardised encoded character sets for their writing systems, including GB18030 in the PRC (Simplified Chinese), Shift-JIS in Japan (Japanese), KS X 1001 in South Korea (Korean), HKSCS in Hong Kong and Big5 in Taiwan (Traditional Chinese).
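A short Python sketch shows how the same Han character maps to entirely different byte sequences under these national encodings (codec names as registered in Python's standard library):

```python
# The same Han character is stored as different bytes depending on the
# encoded character set used - one reason cross-border data sets are hard
# to process with a single assumed encoding.
ch = "中"  # the Han character for "middle"
for codec in ["gb18030", "big5", "shift_jis", "utf-8"]:
    print(codec, "->", ch.encode(codec).hex())
```

A tool reading these bytes must know (or correctly detect) which encoded character set produced them; the bytes alone do not say.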
Unicode is promoted as the industry standard encoded character set for all of the world's writing systems. The Unicode Consortium released the first version of Unicode in October 1991. The most recent version, Unicode version 6.0, was released in October 2010. Unicode can presently reference over 109,000 characters across 93 writing systems. Unicode is also synchronised as an international standard under ISO 10646. The 8-bit Unicode Transformation Format (UTF-8) is presently the most common Unicode character encoding system. UTF-8 can represent every character in the Unicode character set as a variable byte (between one and four bytes per character), and is also backwards compatible with ASCII.
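The variable-width nature of UTF-8 can be illustrated in Python (the four-byte example character is drawn from the CJK Extension B block):

```python
# UTF-8 encodes each code point in one to four bytes. ASCII characters
# keep their original single-byte values, which is why UTF-8 is
# backwards compatible with ASCII.
for ch in ["A", "é", "中", "𠀋"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), "byte(s):", encoded.hex())
```

Note that many CJK characters require three or four bytes in UTF-8, so storage and byte-offset assumptions valid for English text do not carry over.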
Significant effort has been made to integrate the CJK writing systems into Unicode, with over 70,000 CJK characters presently supported. In addition, Unicode has been involved in a culturally controversial "Han unification" process designed to harmonise similar Han characters across CJK writing systems. Pre-existing encoded character sets, such as Shift-JIS and GB18030, remain popular in their native countries while software continues to implement Unicode support and Han unification issues are resolved.
Even with these technology advances and the increasing adoption of Unicode, the CJK writing systems have the potential to cause headaches for e-discovery lawyers and vendors alike. Four common challenges include:
Tokenisation and Keyword Searching
Text indexing and search software creates an index of each word in a data set using a "token" concept. For example, the sentence "the cat sat on the mat" - subject to "stop words" used by the indexing software - may be indexed as up to six tokens. Tokenisation is relatively straightforward for English text. However, CJK writing systems generally do not use spaces between words, and your indexing software may be unable to determine what constitutes a token - and ultimately a word - for the purposes of indexing and subsequent keyword searching. Individuals fluent in the specific CJK writing system(s) should be consulted to assist with testing the accuracy of keyword searching using a specific indexing software tool.
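The tokenisation problem can be seen with a naive whitespace split in Python (the Japanese sentence below is an illustrative translation of the English example):

```python
# English text splits into word tokens on whitespace, but CJK text
# typically contains no spaces, so a naive tokeniser sees one "token".
english = "the cat sat on the mat"
japanese = "猫がマットの上に座った"  # illustrative Japanese rendering

print(english.split())   # six tokens
print(japanese.split())  # a single token - word boundaries are unmarked
```

Production CJK indexing instead relies on dictionary- or statistics-based word segmentation (or n-gram indexing), which is exactly why its accuracy needs testing by fluent speakers.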
Encoded Character Set Support
Some e-discovery software tools support Unicode, and specifically UTF-8, but cannot process data encoded in Shift-JIS or similar CJK encoded character sets. Failure to correctly process an encoded character set results in "mojibake" - that is, a computer displaying unreadable characters because it cannot render the text correctly. To proactively identify such issues, language identification software such as Basis Technology's Rosette Language Identifier can assist with identifying the origin of a CJK document.
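Mojibake can be reproduced in miniature in Python: Japanese text encoded in Shift-JIS becomes unreadable when a tool decodes it with the wrong encoded character set (cp1252, a Western Windows code page, is used here as the mistaken assumption):

```python
# Japanese text encoded in Shift-JIS, then decoded with the wrong
# encoded character set, yields mojibake - unreadable characters.
text = "東京"  # "Tokyo"
raw = text.encode("shift_jis")

print(raw.decode("shift_jis"))  # 東京 - decoded correctly
print(raw.decode("cp1252"))     # mojibake - plausible-looking garbage
```

The bytes themselves are unchanged in both cases; only the decoder's assumption differs, which is why automated encoding/language identification is valuable before processing.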
Character Layout and Spacing
Processing data in "vertical text layout" can also cause issues, as can the use of "half-width" and "full-width" CJK characters. Such issues can generally only be identified as part of a quality assurance exercise after processing a data set.
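The half-width/full-width issue can be sketched in Python using Unicode NFKC normalisation, one common way to fold such variants to a single form before indexing or searching:

```python
import unicodedata

# "Full-width" Latin characters and "half-width" Katakana have their own
# Unicode code points, so "ABC" and "ＡＢＣ" will not match in a naive
# keyword search. NFKC normalisation folds them to their usual forms.
full_width = "ＡＢＣ１２３"   # full-width Latin letters and digits
half_width_kana = "ｶﾀｶﾅ"      # half-width Katakana

print(unicodedata.normalize("NFKC", full_width))       # ABC123
print(unicodedata.normalize("NFKC", half_width_kana))  # カタカナ
```

Whether a given e-discovery tool applies such normalisation (and when) is a point worth verifying during testing, since it directly affects search recall.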
Translation
You may also need to consider translation. Widely available machine translation tools, such as Google Translate, are fast and increasingly capable. That said, for contracts and similar critical documents, it may be essential to engage certified human translators to translate CJK text into English.
In summary, e-discovery software used to handle multilingual data sets must be carefully tested to ensure that it can accurately process and analyse not only Unicode but also the other encoded character sets used in the documents, spreadsheets, emails, web pages and other electronic files that may be subject to an order for discovery.