Character Sets

The character set you choose determines which languages you can store in the database.

The following CHAR and VARCHAR characters are representable in all Oracle database character sets and are transportable to any platform.

Upper and lower case English characters A-Z and a-z
Arabic digits 0-9
The following punctuation marks: % ` ' ( ) * + - , . / \ : ; < > = ! _ & ~ { } | ^ ? $ # @ " [ ]
The following control characters: space, horizontal tab, vertical tab, form feed

Single-Byte Encoding Schemes
Single-byte encoding schemes are the most efficient: they take the least space to represent characters, and they are easy to process and program with because each character is represented in one byte.
7-bit Encoding Schemes
Single-byte 7-bit encoding schemes can define up to 128 characters. US7ASCII is one example.
8-bit Encoding Schemes
Single-byte 8-bit encoding schemes can define up to 256 characters and often support a group of related languages. One example is ISO 8859-1, which supports many Western European languages.

Multibyte Encoding Schemes
Multibyte encoding schemes are needed to support the ideographic scripts used in Asian languages such as Chinese and Japanese, because these languages use thousands of characters. These schemes use either a fixed number of bytes per character or a variable number of bytes per character.
Fixed-width Encoding Schemes
In a fixed-width multibyte encoding scheme, each character is represented by a fixed number of n bytes, where n is greater than or equal to two.
Variable-width Encoding Schemes
A variable-width encoding scheme uses one or more bytes to represent a single character. Some multibyte encoding schemes use certain bits to indicate the number of bytes that will represent a character. For example, if two bytes is the maximum number of bytes used to represent a character, the most significant bit can be toggled to indicate whether that byte is part of a single-byte character or the first byte of a double-byte character. In other schemes, control codes differentiate single-byte from double-byte characters. Another possibility is that a shift-out code will be used to indicate that the subsequent bytes are double-byte characters until a shift-in code is encountered.
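The lead-byte idea described above can be sketched in Python. This is only an illustration: it uses Python's built-in shift_jis codec as a stand-in for a variable-width character set, and the helper names and the simplified lead-byte ranges are assumptions made for the example, not Oracle's implementation.

```python
def char_byte_lengths(text, encoding):
    """Return how many bytes each character occupies in the given encoding."""
    return [len(ch.encode(encoding)) for ch in text]


def count_sjis_characters(data):
    """Count characters in Shift-JIS bytes by inspecting each lead byte.

    Simplified rule (an assumption for this sketch): a byte in the range
    0x81-0x9F or 0xE0-0xFC starts a double-byte character; any other byte
    is a complete single-byte character.
    """
    count = 0
    i = 0
    while i < len(data):
        lead = data[i]
        is_double = 0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC
        i += 2 if is_double else 1
        count += 1
    return count


sample = "A\u65e5\u672cB"  # "A", two kanji, "B"
encoded = sample.encode("shift_jis")

print(char_byte_lengths(sample, "shift_jis"))  # [1, 2, 2, 1]
print(len(encoded))                            # 6 bytes in total
print(count_sjis_characters(encoded))          # 4 characters
```

The byte count (6) and the character count (4) differ, which is why string processing on variable-width data has to scan lead bytes rather than index directly.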

Oracle's naming convention for character set names

<language_or_region><#_of_bits_representing_a_char><standard_name>[S] [C] [FIXED]

Note that UTF8 and UTFE are exceptions to this naming convention.

Examples:

US7ASCII is the U.S. 7-bit ASCII character set
WE8ISO8859P1 is the Western European 8-bit ISO 8859 Part 1 character set
JA16SJIS is the Japanese 16-bit Shifted Japanese Industrial Standard character set

The optional "S" or "C" at the end of the character set name is sometimes used to help differentiate character sets that can only be used on the server (S) or client (C).
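The three examples above share a pattern: a language or region prefix, a bit count, and a standard name. A minimal Python sketch that splits a name along those lines might look like the following. The function and field names are hypothetical, and the optional S, C, and FIXED suffixes are deliberately left inside the standard name, because a trailing "S" (as in SJIS) cannot be told apart from a server marker by pattern alone.

```python
import re
from typing import NamedTuple, Optional

# Names the text lists as exceptions to the convention.
EXCEPTIONS = {"UTF8", "UTFE"}

# Letters (region), digits (bits per character), then the remaining name.
NAME_PATTERN = re.compile(r"^([A-Z]+)(\d+)([A-Z0-9]+)$")


class CharsetName(NamedTuple):
    region: str
    bits: int
    standard: str


def parse_charset_name(name: str) -> Optional[CharsetName]:
    """Split a character set name into its parts, or return None if it
    is a known exception or does not fit the pattern."""
    if name in EXCEPTIONS:
        return None
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    region, bits, standard = match.groups()
    return CharsetName(region, int(bits), standard)


for name in ("US7ASCII", "WE8ISO8859P1", "JA16SJIS", "UTF8"):
    print(name, parse_charset_name(name))
```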

Oracle uses the database character set for:
- data stored in CHAR, VARCHAR2, CLOB, and LONG columns
- identifiers such as table names, column names, and PL/SQL variables
- entering and storing SQL and PL/SQL program source

Consider the following four points when choosing an Oracle character set for the database:

- Which languages the database needs to support
- Interoperability with system resources and applications
- Performance implications
- Restrictions

The character datatypes CHAR and VARCHAR2 are specified in bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.
This works out well if the database character set uses a single-byte character encoding scheme because the number of characters will be the same as the number of bytes. If the database character set uses a multibyte character encoding scheme, there is no such correspondence. That is, the number of bytes no longer equals the number of characters since a character can consist of one or more bytes. Thus, column widths must be chosen with care to allow for the maximum possible number of bytes for a given number of characters.
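To make the byte-versus-character distinction concrete, the sketch below checks whether a value fits in a 20-byte column under a single-byte and a multibyte encoding. Python's latin-1 and utf-8 codecs stand in for the database character sets, and the helper names are hypothetical.

```python
def bytes_needed(text, encoding):
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


def fits_in_column(text, width_bytes, encoding):
    """True if the encoded text fits in a column declared with a byte width."""
    return bytes_needed(text, encoding) <= width_bytes


def safe_column_width(max_chars, max_bytes_per_char):
    """Worst-case byte width for a given number of characters."""
    return max_chars * max_bytes_per_char


name = "M\u00fcller-L\u00fcdenscheidt"  # 19 characters, two of them u-umlaut

print(len(name))                            # 19 characters
print(bytes_needed(name, "latin-1"))        # 19 bytes: one byte per character
print(bytes_needed(name, "utf-8"))          # 21 bytes: each u-umlaut takes two
print(fits_in_column(name, 20, "latin-1"))  # True
print(fits_in_column(name, 20, "utf-8"))    # False
print(safe_column_width(20, 3))             # 60 bytes for 20 three-byte characters
```

The same 19-character value fits a CHAR(20) column in the single-byte case but not in the multibyte case, so the declared width has to be sized for the worst-case bytes per character.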

Alternative Character Sets
In some cases, you may want to use an alternate character set in addition to the database character set, because a different character encoding scheme may be better suited to extensive character processing or may be easier to program with. The following datatypes can be used with an alternate character set:
NCHAR, NVARCHAR2, NCLOB
Specifying an NCHAR character set allows you to specify an alternate character set from the database character set for use in NCHAR, NVARCHAR2, and NCLOB columns. This can be particularly useful for customers using a variable-width multibyte database character set because NCHAR has the capability to support fixed-width multibyte encoding schemes, whereas the database character set cannot. The benefits in using a fixed-width multibyte encoding over a variable-width one are:
- optimized string processing performance on NCHAR, NVARCHAR2, and NCLOB columns
- ease of programming with a fixed-width multibyte character set as opposed to a variable-width multibyte character set
When choosing an NCHAR character set, you must ensure that the NCHAR character repertoire is equivalent to or a subset of the database character set repertoire.
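The subset requirement can be illustrated with a small Python sketch. It compares the repertoires of two single-byte encodings using Python's codecs as stand-ins; it does not read Oracle's character set definitions, and the function names are hypothetical.

```python
def single_byte_repertoire(encoding):
    """Collect every character that a single-byte encoding can represent."""
    chars = set()
    for value in range(256):
        try:
            chars.add(bytes([value]).decode(encoding))
        except UnicodeDecodeError:
            pass  # this byte value is unassigned in the encoding
    return chars


def is_repertoire_subset(candidate, reference):
    """True if every character of `candidate` also exists in `reference`."""
    return single_byte_repertoire(candidate) <= single_byte_repertoire(reference)


print(len(single_byte_repertoire("ascii")))      # 128
print(is_repertoire_subset("ascii", "latin-1"))  # True
print(is_repertoire_subset("latin-1", "ascii"))  # False
```

A check of this kind extends to multibyte character sets only with a different way of enumerating the repertoire; the sketch covers the single-byte case alone.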

Note: All SQL commands use the database character set, not the NCHAR character set. Therefore, literals can only be specified in the database character set.

Considerations for Different Encoding Schemes
Be Careful when Mixing Fixed-Width and Varying-Width Character Sets
Because fixed-width multi-byte character sets are measured in characters, and varying-width character sets are measured in bytes, be careful if you use a fixed-width multi-byte character set as your national character set on one platform and a varying-width character set on another platform.
As an example, if you use %TYPE or a named type to declare an item on one platform using the declaration information of an item from the other platform, you might receive a constraint limit too small to support the data. So, for example, "NCHAR (10)" on the platform using the fixed-width multi-byte set allocates enough space for 10 characters, but if %TYPE or the use of a named type creates a correspondingly typed item on the other platform, it allocates only 10 bytes. Usually, this is not enough for 10 characters. To be safe:
- Do not mix fixed-width multi-byte and varying-width character sets as the national character set on different platforms.
- If you do mix fixed-width multi-byte and varying-width character sets as the national character set on different platforms, use varying-length type declarations with relatively large constraint values.
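The mismatch described above can be sketched as follows. The example assumes a limit of 10 that is read as characters on the fixed-width platform and as bytes on the varying-width platform; Python's shift_jis codec stands in for the varying-width national character set, and the function names are hypothetical.

```python
def fits_fixed_width(text, limit_chars):
    """Fixed-width platform: the limit counts characters."""
    return len(text) <= limit_chars


def fits_varying_width(text, limit_bytes, encoding):
    """Varying-width platform: the same limit counts bytes."""
    return len(text.encode(encoding)) <= limit_bytes


# Ten Japanese characters, each two bytes long in Shift-JIS.
value = "\u65e5\u672c\u8a9e\u30c7\u30fc\u30bf\u30d9\u30fc\u30b9\u3067"

print(fits_fixed_width(value, 10))                 # True: 10 characters
print(fits_varying_width(value, 10, "shift_jis"))  # False: 20 bytes needed
print(fits_varying_width(value, 20, "shift_jis"))  # True with a larger limit
```

This is why the second recommendation asks for relatively large constraint values: the limit has to cover the byte length on the varying-width side.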

Storing Data in Multi-Byte Character Sets
Width specifications of the character datatypes CHAR and VARCHAR2 refer to bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.
If the database character set is single-byte and includes only composite characters, the number of characters and the number of bytes are the same. If the database character set is multi-byte, there is in general no such correspondence. A character can consist of one or more bytes, depending on the specific multi-byte encoding scheme and on whether shift-in/shift-out control codes are present. Hence, column widths must be chosen with care to allow for the maximum possible number of bytes for a given number of characters.
A typical situation is when character elements are combined to form a single character. For example, o and an umlaut can be combined to form ö. In the Thai language, up to three separate character elements can be combined to form one character, and one Thai character would require up to 3 bytes when TH8TISASCII or another single-byte Thai character set is used. One Thai character would require up to 9 bytes when the UTF8 character set is used.
One Thai character consists of up to three separate character elements, as shown in Figure 3-2, where two of the characters consist of three character elements each.
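The byte counts quoted above can be reproduced with a short Python sketch. It assumes Python's tis_620 codec is a reasonable stand-in for a single-byte Thai character set such as TH8TISASCII, and the sample Thai character (a consonant, a vowel sign, and a tone mark) is chosen only for illustration.

```python
import unicodedata

# One displayed Thai character built from three elements:
# consonant U+0E17, vowel sign U+0E35, tone mark U+0E48.
thai = "\u0e17\u0e35\u0e48"

print(len(thai))                    # 3 character elements
print(len(thai.encode("tis_620")))  # 3 bytes in a single-byte Thai set
print(len(thai.encode("utf-8")))    # 9 bytes in UTF-8

# "o" followed by a combining umlaut (U+0308) composes to one character.
decomposed = "o\u0308"
composed = unicodedata.normalize("NFC", decomposed)

print(composed == "\u00f6")              # True
print(len(decomposed.encode("utf-8")))   # 3 bytes as two elements
print(len(composed.encode("utf-8")))     # 2 bytes as one composite character
print(len(composed.encode("latin-1")))   # 1 byte in a single-byte set
```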

Unrestricted Multilingual Support - UNICODE
When unrestricted multilingual support is needed, a universal character set such as Unicode is required as the server database character set. Unicode has two major encoding schemes: UCS2 and UTF8. UCS2 is a two-byte fixed-width format; UTF8 is a multi-byte format with a variable width. Oracle8i provides support for the UTF8 format. This enhancement is transparent to clients that already support multi-byte character sets.
Character set conversion between a UTF8 database and any single-byte character set introduces very little overhead. Conversion between UTF8 and any multi-byte character set has some overhead, but there is no conversion loss, except that some multi-byte character sets do not support user-defined characters during conversion to and from UTF8. See Appendix A, "Locale Data", for further information.
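A round-trip check is one way to picture what "no conversion loss" means. The sketch below uses Python's codecs as stand-ins for Oracle's conversion routines, which it does not call; the function names are hypothetical, and it does not cover user-defined characters.

```python
def convert(data, source, target):
    """Re-encode bytes from one character set to another."""
    return data.decode(source).encode(target)


def round_trips(data, charset):
    """True if converting to UTF-8 and back reproduces the original bytes."""
    as_utf8 = convert(data, charset, "utf-8")
    return convert(as_utf8, "utf-8", charset) == data


single_byte = "caf\u00e9".encode("latin-1")      # single-byte source data
multi_byte = "\u65e5\u672c".encode("shift_jis")  # multi-byte source data

print(round_trips(single_byte, "latin-1"))   # True
print(round_trips(multi_byte, "shift_jis"))  # True
```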