Collation is the assembly of written information into a standard order. Many systems of collation are based on numerical or alphabetical order, or extensions and combinations thereof. Collation is a fundamental element of most office filing systems, library catalogs, and reference books.
Collation differs from classification in that classification is concerned with arranging information into logical categories, while collation is concerned with the ordering of items of information, usually based on the form of their identifiers. Formally speaking, a collation method typically defines a total order on a set of possible identifiers, called sort keys, which consequently produces a total preorder on the set of items of information (items with the same identifier are not placed in any defined order).
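As a minimal sketch of this idea, the Python fragment below orders a few records by a key derived from each record. The record layout and the name `sort_key` are illustrative assumptions. Python's `sorted` happens to keep records with equal keys in their input order, which is a property of that sorting routine and not something the collation itself defines.

```python
# Hypothetical records: (identifier, description).
records = [
    ("Smith", "invoice"),
    ("jones", "letter"),
    ("Smith", "memo"),
]

def sort_key(record):
    """Derive the key that is compared: the identifier, lower-cased."""
    return record[0].lower()

# The keys are totally ordered; the records are only preordered,
# because two different records can share the same key.
collated = sorted(records, key=sort_key)
identifiers = [record[0] for record in collated]
```

Here the two "Smith" records compare as equal under the key, so the collation alone does not say which of them comes first.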
A collation algorithm such as the Unicode Collation Algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other. When an order has been defined in this way, a sorting algorithm can be used to put a list of any number of items into that order.
The main advantage of collation is that it makes it fast and easy for a user to find an element in the list, or to confirm that it is absent from the list. In automatic systems this can be done using a binary search algorithm or interpolation search; manual searching may be performed using a roughly similar procedure, though this will often be done unconsciously. Other advantages are that one can easily find the first or last elements on the list (most likely to be useful in the case of numerically sorted data), or elements in a given range (useful again in the case of numerical data, and also with alphabetically ordered data when one may be sure of only the first few letters of the sought item or items).
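The following sketch illustrates both advantages on a small sorted list, using binary search from Python's standard `bisect` module. The function names `contains` and `with_prefix` are illustrative, and the list is assumed to be sorted by plain string comparison.

```python
from bisect import bisect_left

def contains(sorted_items, target):
    """Binary search: report whether target occurs in the sorted list."""
    i = bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target

def with_prefix(sorted_words, prefix):
    """Return the contiguous run of words that begin with prefix."""
    start = bisect_left(sorted_words, prefix)
    end = start
    while end < len(sorted_words) and sorted_words[end].startswith(prefix):
        end += 1
    return sorted_words[start:end]

words = sorted(["cart", "carthorse", "apple", "care", "dog", "cat"])
found = contains(words, "cat")
missing = contains(words, "cab")
car_words = with_prefix(words, "car")
first, last = words[0], words[-1]
```

Because the list is ordered, all words sharing a prefix sit next to one another, so a range query only needs to locate the start of the run.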
A similar approach may be taken with strings representing calendar dates or other items that can be ordered chronologically or in some other natural fashion.
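For instance, date strings can be ordered by the dates they represent instead of by their characters. The sketch below assumes, purely for illustration, that every string uses the day/month/year layout `DD/MM/YYYY`.

```python
from datetime import datetime

dates = ["03/03/2021", "12/01/2021", "25/12/2020"]

def date_key(text):
    """Parse a DD/MM/YYYY string into a date that compares chronologically."""
    return datetime.strptime(text, "%d/%m/%Y")

as_strings = sorted(dates)                    # character by character
chronological = sorted(dates, key=date_key)   # by the date represented
```

With this layout, plain string comparison looks at the day first and so does not give chronological order, whereas the parsed key does.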
To decide which of two strings comes first in alphabetical order, initially their first letters are compared. The string whose first letter appears earlier in the alphabet comes first in alphabetical order. If the first letters are the same, then the second letters are compared, and so on, until the order is decided. (If one string runs out of letters to compare, then it is deemed to come first; for example, "cart" comes before "carthorse".) The result of arranging a set of strings in alphabetical order is that words with the same first letter are grouped together, and within such a group words with the same first two letters are grouped together, and so on.
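The comparison procedure described above can be sketched as follows. The function name `compare_alphabetical` is illustrative, and the sketch assumes the strings contain only the lower-case letters a to z, for which Python's character comparison agrees with alphabet position.

```python
from functools import cmp_to_key

def compare_alphabetical(a, b):
    """Return -1 if a comes first, 1 if b comes first, 0 if they are equal."""
    # Compare letters pairwise until a difference is found.
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # One string ran out of letters: the shorter one comes first.
    return (len(a) > len(b)) - (len(a) < len(b))

words = ["carthorse", "cat", "cart", "apple"]
ordered = sorted(words, key=cmp_to_key(compare_alphabetical))
```

In the result, the words beginning with "c" are grouped together, and within that group the words beginning with "car" are grouped together.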
Certain limitations, complications, and special conventions may apply when alphabetical order is used:
In several languages the rules have changed over time, and so older dictionaries may use a different order than modern ones. Furthermore, collation may depend on use. For example, German dictionaries and telephone directories use different approaches.
The radical-and-stroke system is cumbersome compared to an alphabetical system in which there are a few characters, all unambiguous. The choice of which components of a logograph comprise separate radicals and which radical is primary is not clear-cut. As a result, logographic languages often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs. For example, the kanji word Tokyo (東京) can be sorted as if it were spelled out in the Japanese characters of the hiragana syllabary as "to-u-ki-yo-u" (とうきょう), using the conventional sorting order for these characters.
In addition, in Greater China, surname stroke ordering is a convention in some official documents where people's names are listed without hierarchy.
The radical-and-stroke system, or some similar pattern-matching and stroke-counting method, was traditionally the only practical method for constructing dictionaries that someone could use to look up a logograph whose pronunciation was unknown. With the advent of computers, dictionary programs are now available that allow one to handwrite a character using a mouse or stylus.
The simplest kind of automated collation is based on the numerical codes of the symbols in a character set, such as ASCII coding (or any of its supersets such as Unicode), with the symbols being ordered in increasing numerical order of their codes, and this ordering being extended to strings in accordance with the basic principles of alphabetical ordering (mathematically speaking, lexicographical ordering). So a computer program might treat the characters a, b, C, d, and $ as being ordered $, C, a, b, d (the corresponding ASCII codes are $ = 36, C = 67, a = 97, b = 98, and d = 100). Therefore, strings beginning with C, M, or Z would be sorted before strings beginning with lower-case a, b, etc. This is sometimes called ASCIIbetical order. It deviates from the standard alphabetical order, particularly due to the ordering of capital letters before all lower-case ones (and possibly the treatment of spaces and other non-letter characters). It is therefore often applied with certain alterations, the most obvious being case conversion before comparison of ASCII values (often to uppercase, for historical reasons: early computers handled text only in uppercase, a convention that dates back to telegraphy).
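A short Python sketch of this behaviour follows. Python compares strings by code point, which coincides with ASCII order for these characters; the sample words are arbitrary.

```python
symbols = ["a", "b", "C", "d", "$"]
by_code = sorted(symbols)              # ordered by numerical code
codes = [ord(s) for s in by_code]      # the codes themselves

words = ["banana", "Cherry", "apple", "Zebra"]
asciibetical = sorted(words)                  # capitals before lower case
case_folded = sorted(words, key=str.upper)    # compare upper-cased copies
```

Converting to a single case only for the comparison leaves the original spelling of each word intact in the output.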
In many collation algorithms, the comparison is based not on the numerical codes of the characters, but on a collating sequence (a sequence in which the characters are assumed to come for the purpose of collation), as well as other ordering rules appropriate to the given application. This can serve to apply the correct conventions used for alphabetical ordering in the language in question, dealing properly with differently cased letters, modified letters, digraphs, particular abbreviations, and so on, as mentioned above under Alphabetical order, and in detail in the Alphabetical order article. Such algorithms are potentially quite complex, possibly requiring several passes through the text.
Problems are nonetheless still common when the algorithm has to encompass more than one language. For example, in German dictionaries the word ökonomisch comes between offenbar and olfaktorisch, while Turkish dictionaries treat o and ö as different letters, placing oyun before öbür.
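The two conventions in this example can be sketched with language-specific sort keys. This is a deliberately simplified illustration, not a full implementation of either language's rules: it assumes lower-case input in precomposed form, the names `german_dictionary_key` and `turkish_key` are hypothetical, and tie-breaking between words that differ only in an umlaut is ignored.

```python
# German dictionary convention (simplified): umlauts sort as their base vowel.
GERMAN_FOLD = str.maketrans("äöü", "aou")

def german_dictionary_key(word):
    return word.translate(GERMAN_FOLD).replace("ß", "ss")

# Turkish convention: each letter has its own place in the collating sequence.
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"

def turkish_key(word):
    # Map every letter to its position in the Turkish alphabet.
    return [TURKISH_ALPHABET.index(ch) for ch in word]

german_words = ["olfaktorisch", "ökonomisch", "offenbar"]
german_order = sorted(german_words, key=german_dictionary_key)
code_point_order = sorted(german_words)   # for contrast: ö lands after z

turkish_order = sorted(["öbür", "oyun"], key=turkish_key)
```

A single key function cannot satisfy both conventions at once, which is why the choice of collating sequence has to be tied to the language of the data.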
A standard algorithm for collating any collection of strings composed of any standard Unicode symbols is the Unicode Collation Algorithm. This can be adapted to use the appropriate collation sequence for a given language by tailoring its default collation table. Several such tailorings are collected in the Common Locale Data Repository.
Sorting decimals properly is a bit more difficult, because different locales use different symbols for the decimal mark, and sometimes the same character used as a decimal mark is also used as a separator, as in "Section 3.2.5". There is no universal answer for how to sort such strings; any rules are application-dependent.
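As one example of an application-dependent rule, the sketch below treats the dots in section labels as separators and compares the parts as integers. The name `section_key` and the sample labels are illustrative, and the rule would be wrong for data in which the dot is a decimal mark.

```python
def section_key(label):
    """Split '3.2.5' into (3, 2, 5) so that parts compare numerically."""
    return tuple(int(part) for part in label.split("."))

sections = ["3.10", "3.2.5", "3.2", "10.1", "3.9"]
as_strings = sorted(sections)                    # character by character
as_sections = sorted(sections, key=section_key)  # part by part, numerically
```

Under the section rule "3.10" follows "3.9", whereas read as decimal numbers 3.10 would be smaller than 3.9, so the intended interpretation has to be decided by the application.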
Also, −13 comes alphabetically after −12 although it is smaller. With negative numbers, to make ascending numerical order correspond with alphabetical sorting, more drastic measures are needed, such as adding a constant to all numbers to make them all positive.
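A minimal sketch of the constant-offset approach is shown below. The values of `OFFSET` and `WIDTH` are assumptions chosen for this example: they suppose that every number lies between -1000 and 8999, so that the shifted value is non-negative and fits in four digits.

```python
OFFSET = 1000   # assumption: no value is smaller than -1000
WIDTH = 4       # assumption: value + OFFSET never exceeds 9999

def sortable(n):
    """Encode n as a fixed-width string whose text order is numeric order."""
    return str(n + OFFSET).zfill(WIDTH)

def original(encoded):
    """Recover the number from its encoded form."""
    return int(encoded) - OFFSET

values = [-13, -12, 5, 0, 100]
naive = sorted(str(v) for v in values)           # plain text comparison
encoded = sorted(sortable(v) for v in values)    # shifted and zero-padded
decoded = [original(s) for s in encoded]
```

Zero-padding to a fixed width is needed alongside the offset, because text comparison would otherwise place "1100" before "987".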
When letters of an alphabet are used for enumeration, there are certain language-specific conventions as to which letters are used. For example, the Russian letters Ъ and Ь (which in writing are used only to modify the preceding consonant), and usually also Ы, Й, and Ё, are omitted. Also, in many languages that use an extended Latin script, the modified letters are often not used in enumeration.