Text normalization
Text normalization is the process of transforming text into a single, consistent form that it may not have had before. It is often performed before the text is processed further, for example before generating synthesized speech, performing automated language translation, or storing the text in a database.
Examples of text normalization:
- Unicode NFC (Normalization Form C, canonical composition), where characters are canonically decomposed and then recomposed, so that a base character and its combining accents become a single precomposed code point where one exists.
- Unicode NFD (Normalization Form D, canonical decomposition), where precomposed characters are canonically decomposed, usually into a base character followed by separate combining code points.
- converting all letters to lower or upper case
- removing punctuation
- removing accent marks and other diacritics from letters
- expanding abbreviations
While this may be done manually, and usually is for ad hoc and personal documents, many programming languages provide mechanisms that support text normalization.
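As an illustration, the following is a minimal sketch in Python of the normalizations listed above, using the standard library's unicodedata and string modules. The function names, the order of the steps, and the small ABBREVIATIONS table are assumptions made for this example only; a real application would choose its own steps and a much larger abbreviation list.

```python
import string
import unicodedata

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_abbreviations(text: str) -> str:
    """Replace whitespace-separated words found in ABBREVIATIONS."""
    return " ".join(ABBREVIATIONS.get(w.lower(), w) for w in text.split())


def remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation only (string.punctuation is ASCII)."""
    return text.translate(str.maketrans("", "", string.punctuation))


def normalize(text: str) -> str:
    """Apply the example steps in one assumed order."""
    text = unicodedata.normalize("NFC", text)  # one encoding per character
    text = expand_abbreviations(text)          # before punctuation is removed
    text = text.lower()
    text = remove_punctuation(text)
    return strip_diacritics(text)


# NFD splits U+00E9 into "e" plus U+0301; NFC recombines them.
print(len(unicodedata.normalize("NFD", "\u00e9")))         # 2
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")  # True
print(normalize("Dr. Café, Élan!"))                         # doctor cafe elan
```

The order of the steps matters in this sketch: abbreviations are expanded before punctuation is removed, because the keys in the table end with a period.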
