5
Linguistic Sorting and String Searching

This chapter explains sorting and searching for strings in an Oracle environment. It contains the following topics:

Overview of Oracle's Sorting Capabilities

Different languages have different sort orders. In addition, different cultures or countries that use the same alphabets may sort words differently. For example, in Danish, Æ is after Z, while Y and Ü are considered to be variants of the same letter.

Sort order can be case-sensitive or case-insensitive. Case refers to the condition of being uppercase or lowercase. For example, in a Latin alphabet, A is the uppercase glyph for a, the lowercase glyph.

Sort order can ignore or consider diacritics. A diacritic is a mark near or through a character or combination of characters that indicates a different sound than the sound of the character without the diacritic. For example, the cedilla (,) in façade is a diacritic. It changes the sound of c.

Sort order can be phonetic or it can be based on the appearance of the character. For example, sort order can be based on the number of strokes in East Asian ideographs. Another common sorting issue is combining letters into a single character. For example, in traditional Spanish, ch is a distinct character that comes after c, which means that the correct order is: cerveza, colorado, cheremoya. This means that the letter c cannot be sorted until Oracle has checked whether the next letter is an h.

Oracle provides the following types of sorts:

Binary sort
Monolingual linguistic sort
Multilingual linguistic sort

These sorts achieve a linguistically correct order for a single language as well as a sort based on the multilingual ISO standard (ISO 14651), which is designed to handle many languages at the same time.

Using Binary Sorts

One way to sort character data is based on the numeric values of the characters defined by the character encoding scheme. This is called a binary sort. Binary sorts are the fastest type of sort. They produce reasonable results for the English alphabet because the ASCII and EBCDIC standards define the letters A to Z in ascending numeric value.

Note:

In the ASCII standard, all uppercase letters appear before any lowercase letters. In the EBCDIC standard, the opposite is true: all lowercase letters appear before any uppercase letters.

When characters used in other languages are present, a binary sort usually does not produce reasonable results. For example, an ascending ORDER BY query returns the character strings ABC, ABZ, BCD, ÄBC, when Ä has a higher numeric value than B in the character encoding scheme. A binary sort is not usually linguistically meaningful for Asian languages that use ideographic characters.

Using Linguistic Sorts

To produce a sort sequence that matches the alphabetic sequence of characters, another sort technique must be used that sorts characters independently of their numeric values in the character encoding scheme. This technique is called a linguistic sort. A linguistic sort operates by replacing characters with numeric values that reflect each character's proper linguistic order.

Oracle offers two kinds of linguistic sorts: monolingual and multilingual.

This section includes the following topics:

Monolingual Linguistic Sorts

Oracle compares character strings in two steps for monolingual sorts. The first step compares the major value of the entire string from a table of major values. Usually, letters with the same appearance have the same major value. The second step compares the minor value from a table of minor values. The major and minor values are defined by Oracle. Oracle defines letters with diacritic and case differences as having the same major value but different minor values.

Each major table entry contains the Unicode code point and major value for a character. The Unicode code point is a 16-bit binary value that represents a character.

Table 5-1 illustrates sample values for sorting a, A, ä, Ä, and b.

Table 5-1 Sample Glyphs and Their Major and Minor Sort Values

Glyph	Major Value	Minor Value
a	15	5
A	15	10
ä	15	15
Ä	15	20
b	20	5

See Also:

"Overview of Unicode"

Multilingual Linguistic Sorts

Oracle provides multilingual linguistic sorts so that you can sort data in more than one language in one sort. This is useful for regions or languages that have complex sorting rules and for multilingual databases. Oracle Database 10g supports all of the sort orders defined by previous releases.

For Asian language data or multilingual data, Oracle provides a sorting mechanism based on the ISO 14651 standard and the Unicode 3.2 standard. Chinese characters are ordered by the number of strokes, PinYin, or radicals.

In addition, multilingual sorts can handle canonical equivalence and supplementary characters. Canonical equivalence is a basic equivalence between characters or sequences of characters. For example, ç is equivalent to the combination of c and ,. Supplementary characters are user-defined characters or predefined characters in Unicode 3.2 that require two code points within a specific code range. You can define up to 1.1 million code points in one multilingual sort.

For example, Oracle supports a monolingual French sort (FRENCH), but you can specify a multilingual French sort (FRENCH_M). _M represents the ISO 14651 standard for multilingual sorting. The sorting order is based on the GENERIC_M sorting order and can sort diacritical marks from right to left. Oracle Corporation recommends using a multilingual linguistic sort if the tables contain multilingual data. If the tables contain only French, then a monolingual French sort may have better performance because it uses less memory. It uses less memory because fewer characters are defined in a monolingual French sort than in a multilingual French sort. There is a tradeoff between the scope and the performance of a sort.

See Also:

Multilingual Sorting Levels

Oracle evaluates multilingual sorts at three levels of precision:

Primary Level Sorts

A primary level sort distinguishes between base letters, such as the difference between characters a and b. It is up to individual locales to define whether a is before b, b is before a, or if they are equal. The binary representation of the characters is completely irrelevant. If a character is an ignorable character, then it is assigned a primary level order (or weight) of zero, which means it is ignored at the primary level. Characters that are ignorable on other levels are given an order of zero at those levels.

For example, at the primary level, all variations of bat come before all variations of bet. The variations of bat can appear in any order, and the variations of bet can appear in any order:

Bat
bat
BAT
BET
Bet
bet

See Also:

"Ignorable Characters"

Secondary Level Sorts

A secondary level sort distinguishes between base letters (the primary level sort) before distinguishing between diacritics on a given base letter. For example, the character Ä differs from the character A only because it has a diacritic. Thus, Ä and A are the same on the primary level because they have the same base letter (A) but differ on the secondary level.

The following list has been sorted on the primary level (resume comes before resumes) and on the secondary level (strings without diacritics come before strings with diacritics):

resume
résumé
Résumé
Resumes
resumes
résumés

Tertiary Level Sorts

A tertiary level sort distinguishes between base letters (primary level sort), diacritics (secondary level sort), and case (upper case and lower case). It can also include special characters such as +, -, and *.

The following are examples of tertiary level sorts:

Characters a and A are equal on the primary and secondary levels but different on the tertiary level because they have different cases.
Characters ä and A are equal on the primary level and different on the secondary and tertiary levels.
The primary and secondary level orders for the dash character - is 0. That is, it is ignored on the primary and secondary levels. If a dash is compared with another character whose primary level order is nonzero, for example, u, then no result for the primary level is available because u is not compared with anything. In this case, Oracle finds a difference between - and u only at the tertiary level.

The following list has been sorted on the primary level (resume comes before resumes) and on the secondary level (strings without diacritics come before strings with diacritics) and on the tertiary level (lower case comes before upper case):

resume
Resume
résumé
Résumé
resumes
Resumes
résumés
Résumés

Linguistic Sort Features

This section contains information about different features that a linguistic sort can have:

You can customize linguistic sorts to include the desired characteristics.

See Also:

Chapter 13, "Customizing Locale"

Base Letters

Base letters are defined in a base letter table, which maps each letter to its base letter. For example, a, A, ä, and Ä all map to a, which is the base letter. This concept is particularly relevant for working with Oracle Text.

See Also:

Oracle Text Reference

Ignorable Characters

Some characters can be ignored in a linguistic sort. These characters are called ignorable characters. There are two kinds of ignorable characters: diacritics and punctuation.

Examples of ignorable diacritics are:

^, so that rôle is treated the same as role
The umlaut, so that naïve is treated the same as naive

An example of an ignorable punctuation character is the dash character -. If it is ignored, then multi-lingual can be treated that same as multilingual and e-mail can be treated the same as email.

Contracting Characters

Sorting elements usually consist of a single character, but in some locales, two or more characters in a character string must be considered as a single sorting element during sorting. For example, in traditional Spanish, the string ch is composed of two characters. These characters are called contracting characters in multilingual linguistic sorting and special combination letters in monolingual linguistic sorting.

Do not confuse a composed character with a contracting character. A composed character like á can be decomposed into a and ', each with their own encoding. The difference between a composed character and a contracting character is that a composed character can be displayed as a single character on a terminal, while a contracting character is used only for sorting, and its component characters must be rendered separately.

Expanding Characters

In some locales, certain characters must be sorted as if they were character strings. An example is the German character ß (sharp s). It is sorted exactly the same as the string SS. Another example is that ö sorts as if it were oe, after od and before of. These characters are known as expanding characters in multilingual linguistic sorting and special letters in monolingual linguistic sorting. Just as with contracting characters, the replacement string for an expanding character is meaningful only for sorting.

Context-Sensitive Characters

In Japanese, a prolonged sound mark that resembles an em dash -- represents a length mark that lengthens the vowel of the preceding character. The sort order depends on the vowel that precedes the length mark. This is called context-sensitive sorting. For example, after the character ka, the -- length mark indicates a long a and is treated the same as a, while after the character ki, the -- length mark indicates a long i and is treated the same as i. Transliterating this to Latin characters, a sort might look like this:

kaa   
ka--   -- kaa and ka-- are the same
kai   -- kai follows ka- because i is after a
kia   -- kia follows kai because i is after a
kii   -- kii follows kia because i is after a
ki--   -- kii and ki-- are the same

Canonical Equivalence

Canonical equivalence is an attribute of a multilingual sort and describes how equivalent code point sequences are sorted. If canonical equivalence is applied in a particular linguistic sort, then canonically equivalent strings are treated as equal.

One Unicode code point can be equivalent to a sequence of base letter code points plus diacritic code points. This is called the Unicode canonical equivalence. For example, ä equals its base letter a and an umlaut. A linguistic flag, CANONICAL_EQUIVALENCE=TRUE, indicates that all canonical equivalence rules defined in Unicode 3.2 need to be applied in a specific linguistic sort. Oracle-defined linguistic sorts include the appropriate setting for the canonical equivalence flag.You can set the flag to FALSE to speed up the comparison and ordering functions if all the data is in its composed form.

For example, consider the following strings:

äa (a umlaut followed by a)
a¨b (a followed by umlaut followed by b)
äc (a umlaut followed by c)

If CANONICAL_EQUIVALENCE=FALSE, then the sort order of the strings is:

a¨b
äa
äc

This occurs because a comes before ä if canonical equivalence is not applied.

If CANONICAL_EQUIVALENCE=TRUE, then the sort order of the strings is:

äa
a¨b
äc

This occurs because ä and a¨ are treated as canonically equivalent.

You can use Oracle Locale Builder to view the setting of the canonical equivalence flag in existing multilingual sorts. When you create a customized multilingual sort with Oracle Locale Builder, you can set the canonical equivalence flag as desired.

See Also:

"Creating a New Linguistic Sort with the Oracle Locale Builder" for more information about setting the canonical equivalence flag

Reverse Secondary Sorting

In French, sorting strings of characters with diacritics first compares base letters from left to right, but compares characters with diacritics from right to left. For example, by default, a character with a diacritic is placed after its unmarked variant. Thus Èdit comes before Edít in a French sort. They are equal on the primary level, and the secondary order is determined by examining characters with diacritics from right to left. Individual locales can request that the characters with diacritics be sorted with the right-to-left rule. Set the REVERSE_SECONDARY linguistic flag to TRUE to enable reverse secondary sorting.

See Also:

"Creating a New Linguistic Sort with the Oracle Locale Builder" for more information about setting the reverse secondary flag

Character Rearrangement for Thai and Laotian Characters

In Thai and Lao, some characters must first change places with the following character before sorting. Normally, these types of character are symbols representing vowel sounds, and the next character is a consonant. Consonants and vowels must change places before sorting. Set the SWAP_WITH_NEXT linguistic flag for all characters that must change places before sorting.

See Also:

"Creating a New Linguistic Sort with the Oracle Locale Builder" for more information about setting the SWAP_WITH_NEXT flag

Special Letters

Special letters is a term used in monolingual sorts. They are called expanding characters in multilingual sorts.

See Also:

"Expanding Characters"

Special Combination Letters

Special combination letters is the term used in monolingual sorts. They are called contracting letters in multilingual sorts.

See Also:

"Contracting Characters"

Special Uppercase Letters

One lowercase letter may map to multiple uppercase letters. For example, in traditional German, the uppercase letters for ß are SS.

These case conversions are handled by the NLS_UPPER, NLS_LOWER, and NLS_INITCAP SQL functions, according to the conventions established by the linguistic sort sequence. The UPPER, LOWER, and INITCAP SQL functions cannot handle these special characters.

The NLS_UPPER SQL function returns all uppercase characters from the same character set as the lowercase string. The following example shows the result of the NLS_UPPER function when NLS_SORT is set to XGERMAN:

SELECT NLS_UPPER ('große') "Uppercase" FROM DUAL;

Upper
-----
GROSSE

See Also:

Oracle Database SQL Reference

Special Lowercase Letters

Oracle supports special lowercase letters. One uppercase letter may map to multiple lowercase letters. An example is the Turkish uppercase I becoming a small, dotless i: .

Case-Insensitive and Accent-Insensitive Linguistic Sorts

Operation inside an Oracle database is always sensitive to the case and the accents (diacritics) of the characters. Sometimes you may need to perform case-insensitive or accent-insensitive comparisons and sorts.

In previous versions of the database, case-insensitive queries could be achieved by using the NLS_UPPER and NLS_LOWER SQL functions. The functions change the case of strings based on a specific linguistic sort definition. This enables you to perform case-insensitive searches regardless of the language being used. For example, create a table called test4 as follows:

SQL> CREATE TABLE test4(word VARCHAR2(12));
SQL> INSERT INTO test4 VALUES('GROSSE');
SQL> INSERT INTO test4 VALUES('Große');
SQL> INSERT INTO test4 VALUES('große');
SQL> SELECT * FROM test4;

WORD
------------
GROSSE
Große
große

Perform a case-sensitive search for GROSSE as follows:

SQL> SELECT word FROM test4 WHERE word='GROSSE';

WORD
------------
GROSSE

Perform a case-insensitive search for GROSSE using the NLS_UPPER function:

SELECT word FROM test4
WHERE NLS_UPPER(word, 'NLS_SORT = XGERMAN') = 'GROSSE';

WORD
------------
GROSSE
Große
große

Using NLS_UPPER and NLS_LOWER functions can be cumbersome because they need to be hard-coded into the application logic. A partial solution was introduced in Oracle9i Release 2 (9.2). It uses the GENERIC_BASELETTER linguistic sort. The GENERIC_BASELETTER sort groups all characters together based on their base letter values. This is achieved by ignoring their case and diacritic differences.

The following example shows a GENERIC_BASELETTER query. First create a table called test5:

CREATE TABLE test5(product VARCHAR2(20));
INSERT INTO test5 VALUES('DATABASE');
INSERT INTO test5 VALUES('dätäbase');
INSERT INTO test5 VALUES('database');
INSERT INTO test5 VALUES('Database');
SELECT product FROM test5;

PRODUCT
--------------------
DATABASE
dätäbase
database
Database

Select database from test5, using the default binary sort:

SELECT product FROM test5 WHERE product='database';

PRODUCT
--------------------
database

Set NLS_COMP to ANSI to perform a linguistic sort based on the value of NLS_SORT:

ALTER SESSION SET NLS_COMP=ANSI;

Set NLS_SORT to GENERIC_BASELETTER:

ALTER SESSION SET NLS_SORT=GENERIC_BASELETTER;

Again select database from test5:

SELECT * FROM test5 WHERE product='database';

PRODUCT
--------------------
DATABASE
dätäbase
database
Database

Note that all of the rows of test5 are selected.

The GENERIC_BASELETTER sort defines the base letters of the underlying characters. Hence it simulates the behavior of a case-insensitive and accent-insensitive linguistic sort. However, the GENERIC_BASELETTER search is not a linguistically sensitive search because it is not based on any specific language.

In Oracle Database 10g, Oracle provides case-insensitive and accent-insensitive options for linguistic sorts.

Oracle provides the following types of monolingual and multilingual linguistic sorts:

Linguistic sorts that use information about base letters, diacritics, punctuation, and case. These are the standard monolingual and multilingual linguistic sorts that are described in "Using Linguistic Sorts".
Linguistic sorts that use information about base letters, diacritics, and punctuation. This type of sort is called case-insensitive.
Linguistic sorts that use information about base letters only. This type of sort is called accent-insensitive. (Accent is another word for diacritic.) An accent-insensitive sort is always case-insensitive as well.

The rest of this section contains the following topics:

Examples of Case-Insensitive and Accent-Insensitive Sorts

The following examples show:

A sort that uses information about base letters, diacritics, punctuation, and case
A case-insensitive sort
An accent-insensitive sort

Example 5-1 Linguistic Sort Using Base Letters, Diacritics, Punctuation, and Case Information

The following list has been sorted using information about base letters, diacritics, punctuation, and case:

blackbird
Blackbird
black bird
black-bird
Black-bird
bläckbird
blackbîrd

Example 5-2 Case-Insensitive Linguistic Sort

The following list has been sorted using information about base letters, diacritics, and punctuation, ignoring case:

Blackbird
blackbird
bläckbird
blackbîrd
black bird
black-bird
Black-bird

black-bird and Black-bird have the same value in the sort, because the only different between them is case. They could appear interchanged in the list. Blackbird and blackbird also have the same value in the sort and could appear interchanged in the list.

Example 5-3 Accent-Insensitive Linguistic Sort

The following list has been sorted using information about base letters only. No information about diacritics, punctuation, or case has been used.

blackbird
bläckbird
blackbîrd
Blackbird
BlackBird
Black-bird
Black bird

Specifying a Case-Insensitive or Accent-Insensitive Sort

Use the NLS_SORT session parameter to specify a case-insensitive or accent-insensitive sort:

Append _CI to an Oracle sort name for a case-insensitive sort.
Append _AI to an Oracle sort name for an accent-insensitive and case-insensitive sort.

For example, you can set NLS_SORT to the following types of values:

FRENCH_M_AI
XGERMAN_CI

Binary sorts can also be case-insensitive or accent-insensitive. When you specify BINARY_CI as a value for NLS_SORT, it designates a sort that is accent-sensitive and case-insensitive. BINARY_AI designates an accent-insensitive and case-insensitive binary sort. You may want to use a binary sort if the binary sort order of the character set is appropriate for the character set you are using.

For example, with the NLS_LANG environment variable set to AMERICAN_AMERICA.WE8ISO8859P1, create a table called test1 and populate it as follows:

SQL> CREATE TABLE test1 (letter VARCHAR2(10));
SQL> INSERT INTO test1 VALUES('ä');
SQL> INSERT INTO test1 VALUES('a');
SQL> INSERT INTO test1 VALUES('A');
SQL> INSERT INTO test1 VALUES('Z');
SQL> SELECT * FROM test1;

LETTER
-----------
ä 
a
A
Z

The default value of NLS_SORT is BINARY. Use the following statement to do a binary sort of the characters in table test1:

SELECT * FROM test1 ORDER BY letter;

To change the value of NLS_SORT, enter a statement similar to the following:

ALTER SESSION SET NLS_SORT=BINARY_CI;

The following table shows the sort orders that result from setting NLS_SORT to BINARY, BINARY_CI, and BINARY_AI.

BINARY	BINARY_CI	BINARY_AI
`A`	`a`	`ä`
`Z`	`A`	`a`
`a`	`Z`	`A`
`ä`	`ä`	`Z`

When NLS_SORT=BINARY, uppercase letters come before lowercase letters. Letters with diacritics appear last.

When the sort considers diacritics but ignores case (BINARY_CI), the letters with diacritics appear last.

When both case and diacritics are ignored (BINARY_AI), ä is sorted with the other characters whose base letter is a. All the characters whose base letter is a occur before z.

You can use binary sorts for better performance when the character set is US7ASCII or another character set that has the same sort order as the binary sorts.

The following table shows the sort orders that result from German sorts for the table.

GERMAN	GERMAN_CI	GERMAN_AI
`a`	`a`	`ä`
`A`	`A`	`a`
`ä`	`ä`	`A`
`Z`	`Z`	`Z`

A German sort places lowercase letters before uppercase letters, and ä occurs before Z. When the sort ignores both case and diacritics (GERMAN_AI), ä appears with the other characters whose base letter is a.

Linguistic Sort Examples

The examples in this section demonstrate a binary sort, a monolingual sort, and a multilingual sort. To prepare for the examples, create and populate a table called test2. Enter the following statements:

SQL> CREATE TABLE test2 (name VARCHAR2(20));
SQL> INSERT INTO test2 VALUES('Diet');
SQL> INSERT INTO test2 VALUES('À voir');
SQL> INSERT INTO test2 VALUES('Freizeit');

Example 5-4 Binary Sort

The ORDER BY clause uses a binary sort.

SQL> SELECT * FROM test2 ORDER BY name;

You should see the following output:

Diet
Freizeit
À voir

Note that a binary sort results in À voir being at the end of the list.

Example 5-5 Monolingual German Sort

Use the NLSSORT function with the NLS_SORT parameter set to german to obtain a German sort.

SQL> SELECT * FROM test2 ORDER BY NLSSORT(name, 'NLS_SORT=german');

You should see the following output:

À voir
Diet
Freizeit

Note that À voir is at the beginning of the list in a German sort.

Example 5-6 Comparing a Monolingual German Sort to a Multilingual Sort

Insert the character string shown in Figure 5-1 into test. It is a D with a crossbar followed by ñ.

Figure 5-1 Character String

Text description of the illustration dn14.gif

Perform a monolingual German sort by using the NLSSORT function with the NLS_SORT parameter set to german.

SQL> SELECT * FROM test2 ORDER BY NLSSORT(name, 'NLS_SORT=german');

The output from the German sort shows the new character string last in the list of entries because the characters are not recognized in a German sort.

Perform a multilingual sort by entering the following statement:

SQL> SELECT * FROM test2 ORDER BY NLSSORT(name, 'NLS_SORT=generic_m');

The output shows the new character string after Diet, following ISO sorting rules.

See Also:

"The NLSSORT Function"
"NLS_SORT" for more information about setting and changing the NLS_SORT parameter

Using Linguistic Indexes

Linguistic sorting is language-specific and requires more data processing than binary sorting. Using a binary sort for ASCII is accurate and fast because the binary codes for ASCII characters reflect their linguistic order. When data in multiple languages is stored in the database, you may want applications to sort the data returned from a SELECT...ORDER BY statement according to different sort sequences depending on the language. You can accomplish this without sacrificing performance by using linguistic indexes. Although a linguistic index for a column slows down inserts and updates, it greatly improves the performance of linguistic sorting with the ORDER BY clause.

You can create a function-based index that uses languages other than English. The index does not change the linguistic sort order determined by NLS_SORT. The index simply improves the performance. The following statement creates an index based on a German sort:

CREATE TABLE my_table(name VARCHAR(20) NOT NULL) 
/*NOT NULL ensures that the index is used */
CREATE INDEX nls_index ON my_table (NLSSORT(name, 'NLS_SORT = German'));

After the index has been created, enter a SELECT statement similar to the following:

SELECT * FROM my_table ORDER BY name;

It returns the result much faster than the same SELECT statement without an index.

The rest of this section contains the following topics:

Linguistic Indexes for Multiple Languages
Requirements for Using Linguistic Indexes
See Also:

Oracle Database Concepts

Oracle Database SQL Reference for more information about function-based indexes

Linguistic Indexes for Multiple Languages

There are three ways to build linguistic indexes for data in multiple languages:

Build a linguistic index for each language that the application supports. This approach offers simplicity but requires more disk space. For each index, the rows in the language other than the one on which the index is built are collated together at the end of the sequence. The following example builds linguistic indexes for French and German.
```
CREATE INDEX french_index ON employees (NLSSORT(employee_id, 'NLS_
SORT=FRENCH'));
CREATE INDEX german_index ON employees (NLSSORT(employee_id, 'NLS_
SORT=GERMAN'));
```
Oracle chooses the index based on the NLS_SORT session parameter or the arguments of the NLSSORT function specified in the ORDER BY clause. For example, if the NLS_SORT session parameter is set to FRENCH, then Oracle uses french_index. When it is set to GERMAN, Oracle uses german_index.
Build a single linguistic index for all languages. This requires a language column (LANG_COL in "Example: Setting Up a French Linguistic Index") to be used as a parameter of the NLSSORT function. The language column contains NLS_LANGUAGE values for the data in the column on which the index is built. The following example builds a single linguistic index for multiple languages. With this index, the rows with the same values for NLS_LANGUAGE are sorted together.
```
CREATE INDEX i ON t (NLSSORT(col, 'NLS_SORT=' || LANG_COL));
```
Queries choose an index based on the argument of the NLSSORT function specified in the ORDER BY clause.
Build a single linguistic index for all languages using one of the multilingual linguistic sorts such as GENERIC_M or FRENCH_M. These indexes sort characters according to the rules defined in ISO 14651. For example:
```
CREATE INDEX i on t (NLSSORT(col, 'NLS_SORT=GENERIC_M');
```
See Also:
"Multilingual Linguistic Sorts" for more information about Unicode sorts

Requirements for Using Linguistic Indexes

The following are requirements for using linguistic indexes:

This section also includes:

Example: Setting Up a French Linguistic Index

Set NLS_SORT Appropriately

The NLS_SORT parameter should indicate the linguistic definition you want to use for the linguistic sort. If you want a French linguistic sort order, then NLS_SORT should be set to FRENCH. If you want a German linguistic sort order, then NLS_SORT should be set to GERMAN.

There are several ways to set NLS_SORT. You should set NLS_SORT as a client environment variable so that you can use the same SQL statements for all languages. Different linguistic indexes can be used when NLS_SORT is set in the client environment.

See Also:

"NLS_SORT"

Specify NOT NULL in a WHERE Clause If the Column Was Not Declared NOT NULL

When you want to use the ORDER BY column_name clause with a column that has a linguistic index, include a WHERE clause like the following:

WHERE NLSSORT(column_name) IS NOT NULL

This WHERE clause is not necessary if the column has already been defined as a NOT NULL column in the schema.

Example: Setting Up a French Linguistic Index

The following example shows how to set up a French linguistic index. You may want to set NLS_SORT as a client environment variable instead of using the ALTER SESSION statement.

ALTER SESSION SET NLS_SORT='FRENCH';
ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS;
CREATE INDEX test_idx ON test3(NLSSORT(col, 'NLS_SORT=FRENCH'));
SELECT * FROM test3 ORDER BY col;
ALTER SESSION SET NLS_COMP=ANSI;
SELECT * FROM test3 WHERE col > 'JJJ';

Searching Linguistic Strings

Searching and sorting are related tasks. Organizing data and processing it in a linguistically meaningful order is necessary for proper business processing. Searching and matching data in a linguistically meaningful way depends on what sort order is applied. For example, searching for all strings greater than c and less than f produces different results depending on the value of NLS_SORT. In a ASCII binary sort the search finds any strings that start with d or e but excludes entries that begin with upper case D or E or accented e with a diacritic, such as ê. Applying an accent-insensitive binary sort returns all strings that start with d, D, and accented e, such as Ê or ê. Applying the same search with NLS_SORT set to XSPANISH also returns strings that start with ch, because ch is treated as a composite character that sorts between c and d in traditional Spanish. This chapter discusses the kinds of sorts that Oracle offers and how they affect string searches by SQL and SQL regular expressions.

See Also:

SQL Regular Expressions in a Multilingual Environment

Regular expressions provide a powerful method of identifying patterns of strings within a body of text. Usage ranges from a simple search for a string such as San Francisco to the more complex task of extracting all URLs to a task like finding all words whose every second character is a vowel. SQL and PL/SQL support regular expressions in Oracle Database 10g.

Traditional regular expression engines were designed to address only English text. However, regular expression implementations can encompass a wide variety of languages with characteristics that are very different from western European text. Oracle's implementation of regular expressions is based on the Unicode Regular Expression Guidelines. The REGEXP SQL functions work with all character sets that are supported as database character sets and national character sets. Moreover, Oracle enhances the matching capabilities of the POSIX regular expression constructs to handle the unique linguistic requirements of matching multilingual data.

Oracle enhancements of the linguistic-sensitive operators are described in the following sections:

Character Range '[x-y]' in Regular Expressions
Collation Element Delimiter '[. .]' in Regular Expressions
Character Class '[: :]' in Regular Expressions
Equivalence Class '[= =]' in Regular Expressions

Examples: Regular Expressions

See Also:

Oracle Database Application Developer's Guide - Fundamentals for more information about regular expression syntax
Oracle Database SQL Reference for more information about REGEX SQL functions

Character Range '[x-y]' in Regular Expressions

According to the POSIX standard, a range in a regular expression includes all collation elements between the start point and the end point of the range in the linguistic definition of the current locale. Therefore, ranges in regular expressions are meant to be linguistic ranges, not byte value ranges, because byte value ranges depend on the platform, and the end user should not be expected to know the ordering of the byte values of the characters. The semantics of the range expression must be independent of the character set. This implies that a range such as [a-d] includes all the letters between a and d plus all of those letters with diacritics, plus any special case collation element such as ch in Traditional Spanish that is sorted as one character.

Oracle interprets range expressions as specified by the NLS_SORT parameter to determine the collation elements covered by a given range. For example:

Expression:     [a-d]e
NLS_SORT:       BINARY
Does not match: cheremoya
NLS_SORT:       XSPANISH
Matches:        >>che<<remoya

Collation Element Delimiter '[. .]' in Regular Expressions

This construct is introduced by the POSIX standard to separate collating elements. A collating element is a unit of collation and is equal to one character in most cases. However, the collation sequence in some languages may define two or more characters as a collating element. The historical regular expression syntax does not allow the user to define ranges involving multicharacter collation elements. For example, there was no way to define a range from a to ch because ch was interpreted as two separate characters.

By using the collating element delimiter [. .], you can separate a multicharacter collation element from other elements. For example, the range from a to ch can be written as [a-[.ch.]]. It can also be used to separate single-character collating elements. If you use [. .] to enclose a multicharacter sequence that is not a defined collating element, then it is considered as a semantic error in the regular expression. For example, [.ab.] is considered invalid if ab is not a defined multicharacter collating element.

Character Class '[: :]' in Regular Expressions

In English regular expressions, the range expression can be used to indicate a character class. For example, [a-z] can be used to indicate any lowercase letter. However, in non-English regular expressions, this approach is not accurate unless a is the first lowercase letter and z is the last lowercase letter in the collation sequence of the language.

The POSIX standard introduces a new syntactical element to enable specifying explicit character classes in a portable way. The [: :] syntax denotes the set of characters belonging to a certain character class. The character class definition is based on the character set classification data.

Equivalence Class '[= =]' in Regular Expressions

Oracle also supports equivalence classes through the [= =] syntax as recommended by the POSIX standard. A base letter and all of the accented versions of the base constitute an equivalence class. For example, the equivalence class [=a=] matches ä as well as â. The current implementation does not support matching of Unicode composed and decomposed forms for performance reasons. For example, ä (a umlaut) does not match 'a followed by umlaut'.

Examples: Regular Expressions

The following examples show regular expression matches.

Example 5-7 Case-Insensitive Match Using the NLS_SORT Value

Case sensitivity in an Oracle regular expression match is determined at two levels: the NLS_SORT initialization parameter and the runtime match option. The REGEXP functions inherit the case-sensitivity behavior from the value of NLS_SORT by default. The value can also be explicitly overridden by the runtime match option 'c' (case sensitive) or 'i' (case insensitive).

Expression: catalog(ue)?
NLS_SORT: GENERIC_M_CI
Matches: 

>>Catalog<<
>>catalogue<<
>>CATALOG<<

Oracle SQL syntax:

SQL> ALTER SESSION SET NLS_SORT='GENERIC_M_CI';
SQL> SELECT col FROM test WHERE REGEXP_LIKE(col,'catalog(ue)?'):

Example 5-8 Case Insensitivity Overridden by the Runtime Match Option

Expression: catalog(ue)?
NLS_SORT: GENERIC_M_CI
Match option: 'c'
Matches:

>>catalogue<<

Does not match:

Catalog
CATALOG

Oracle SQL syntax:

SQL> ALTER SESSION SET NLS_SORT='GENERIC_M_CI';
SQL> SELECT col FROM test WHERE REGEXP_LIKE(col,'catalog(ue)?,'c');

Example 5-9 Matching with the Collation Element Operator [..]

Expression: [^-a-[.ch.]]+
Matches: 

>>driver<<

Does not match:

cab

Oracle SQL syntax:

SQL> SELECT col FROM test WHERE REGEXP_LIKE(col,'[^-a-[.ch.]]+');

Example 5-10 Matching with the Character Class Operator [::]

This expression looks for 6-character strings with lowercase characters. Note that accented characters are matched as lowercase characters.

Expression: [[:lower:]]{6}
Database character set: WE8ISO1559P1
Matches:

>>maître<<
>>mòbile<<
>>pájaro<<
>>zurück<<

Oracle SQL syntax:

SQL> SELECT col FROM test WHERE REGEXP_LIKE(col,'[[:lower:]]{6}');

Example 5-11 Matching with the Base Letter Operator [==]

Expression: r[[=e=]]sum[[=e=]]
Matches:

>>resume<<
>>résumé<<
>>résume<<
>>resumé<<

Oracle SQL syntax:

SQL> SELECT col FROM test WHERE REGEXP_LIKE(col,r[[=e=]]sum[[e]]');

See Also:

Oracle Database Application Developer's Guide - Fundamentals for more information about regular expression syntax
Oracle Database SQL Reference for more information about REGEX SQL functions

5 Linguistic Sorting and String Searching

Overview of Oracle's Sorting Capabilities

Using Binary Sorts

Using Linguistic Sorts

Monolingual Linguistic Sorts

Table 5-1 Sample Glyphs and Their Major and Minor Sort Values

Multilingual Linguistic Sorts

Multilingual Sorting Levels

Primary Level Sorts

Secondary Level Sorts

Tertiary Level Sorts

Linguistic Sort Features

Base Letters

Ignorable Characters

Contracting Characters

Expanding Characters

Context-Sensitive Characters

Canonical Equivalence

Reverse Secondary Sorting

Character Rearrangement for Thai and Laotian Characters

Special Letters

Special Combination Letters

Special Uppercase Letters

Special Lowercase Letters

Case-Insensitive and Accent-Insensitive Linguistic Sorts

Examples of Case-Insensitive and Accent-Insensitive Sorts

Example 5-1 Linguistic Sort Using Base Letters, Diacritics, Punctuation, and Case Information

Example 5-2 Case-Insensitive Linguistic Sort

Example 5-3 Accent-Insensitive Linguistic Sort

Specifying a Case-Insensitive or Accent-Insensitive Sort

Linguistic Sort Examples

Example 5-4 Binary Sort

Example 5-5 Monolingual German Sort

Example 5-6 Comparing a Monolingual German Sort to a Multilingual Sort

Figure 5-1 Character String

Using Linguistic Indexes

Linguistic Indexes for Multiple Languages

Requirements for Using Linguistic Indexes

Set NLS_SORT Appropriately

Specify NOT NULL in a WHERE Clause If the Column Was Not Declared NOT NULL

Example: Setting Up a French Linguistic Index

Searching Linguistic Strings

SQL Regular Expressions in a Multilingual Environment

Character Range '[x-y]' in Regular Expressions

Collation Element Delimiter '[. .]' in Regular Expressions

Character Class '[: :]' in Regular Expressions

Equivalence Class '[= =]' in Regular Expressions

Examples: Regular Expressions

Example 5-7 Case-Insensitive Match Using the NLS_SORT Value

Example 5-8 Case Insensitivity Overridden by the Runtime Match Option

Example 5-9 Matching with the Collation Element Operator [..]

Example 5-10 Matching with the Character Class Operator [::]

Example 5-11 Matching with the Base Letter Operator [==]

5
Linguistic Sorting and String Searching