Hyphenation

AH Formatter V7.1 can hyphenate over 40 languages. You do not need to prepare a dictionary.

Languages

AH Formatter V7.1 supports hyphenation for the following languages.

| 2-Letter Code | 3-Letter Code | Language | Hyphenation Limited To |
|---|---|---|---|
| af | afr | Afrikaans | Latin characters and Apostrophe |
| bg | bul | Bulgarian | Cyrillic characters |
| ca | cat | Catalan | Latin characters and Apostrophe and Decimal point (Full stop or Middle dot) |
| cs | ces | Czech | Latin characters |
| cy | cym | Welsh | Latin characters and Apostrophe |
| da | dan | Danish | Latin characters and Apostrophe |
| de | deu | German / Swiss German | Latin characters and Apostrophe |
| el | ell | Greek | Greek characters |
| en | eng | English | Latin characters and Apostrophe |
| en-US | eng-US | American | Latin characters and Apostrophe |
| eo | epo | Esperanto | Latin characters |
| es | spa | Spanish | Latin characters |
| et | est | Estonian | Latin characters |
| eu | eus | Basque | Latin characters |
| fi | fin | Finnish | Latin characters |
| fr | fra | French / Canadian French | Latin characters and Apostrophe |
| ga | gle | Irish (Erse or Gaelic) | Latin characters and Apostrophe |
| hr | hrv | Croatian | Cyrillic characters or Latin characters |
| hu | hun | Hungarian | Latin characters |
| id | ind | Indonesian | Latin characters and Apostrophe and Digit 2 |
| is | isl | Icelandic | Latin characters |
| it | ita | Italian | Latin characters and Apostrophe |
| la | lat | Latin | Latin characters |
| lt | lit | Lithuanian | Latin characters |
| lv | lav | Latvian | Latin characters |
| ms | msa | Bahasa Malay | Latin characters and Apostrophe and Digit 2 |
| mt | mlt | Maltese | Latin characters and Apostrophe |
| nb | nob | Norwegian (Bokmål) | Latin characters and Apostrophe |
| nl | nld | Dutch / Flemish | Latin characters and Apostrophe |
| nn | nno | Norwegian (Nynorsk) | Latin characters and Apostrophe |
| no | nor | Norwegian | Latin characters and Apostrophe |
| pl | pol | Polish | Latin characters |
| pt | por | Portuguese / Brazilian | Latin characters |
| ro | ron | Romanian / Moldavian | Latin characters and Apostrophe |
| ru | rus | Russian | Cyrillic characters |
| sk | slk | Slovak | Latin characters and Apostrophe |
| sl | slv | Slovenian | Latin characters and Apostrophe |
| sr | srp | Serbian | Cyrillic characters or Latin characters |
| sv | swe | Swedish | Latin characters and Apostrophe |
| sw | swa | Swahili | Latin characters and Apostrophe |
| th | tha | Thai | Thai characters |
| tr | tur | Turkish | Latin characters |
| uk | ukr | Ukrainian | Cyrillic characters |

AH Formatter V7.1 treats a character string as a word only if it consists entirely of the characters listed for that language in the table above. A string that contains any other character is not treated as a word and is not hyphenated. To hyphenate words that contain unsupported characters, use a TeX dictionary.

Example

To use Czech hyphenation, place the following in the FO file:

<fo:block hyphenate="true" language="ces">
Všichni lidé rodí se svobodní a sobě rovní co do důstojnosti a práv. Jsou nadáni rozumem a svědomím a mají spolu jednat v duchu bratrství.
</fo:block>

Exception Dictionary

You do not need to prepare a dictionary for AH Formatter V7.1. However, you may want to treat words that are hyphenated in an unexpected way as exceptions. In that case, you can register those words in the exception dictionary. If you edit the exception dictionary while working in the GUI, you can reload the hyphenation dictionary and reformat the document by choosing [Format]-[Reload Hyphenation Dictionary] from the menu.

The exception dictionary is stored in the hyphenation folder under the AH Formatter V7.1 installation folder, or in the folder specified by the AHF71_HYPDIC_PATH (AHF71_64_HYPDIC_PATH for the 64-bit version) environment variable. The dictionary file name follows these rules.

  • The file name is the Language Tag defined in RFC1766 with the “.xml” extension appended. The Language Tag joins the language code of ISO 639-2 and the country code of ISO 3166 with a hyphen; it may also consist of the language code alone. In the file name, an underscore can be used instead of the hyphen.
  • Use the 2-letter language code in the file name when one exists; otherwise, use the Terminology code. Likewise, use the 2-letter country code when one exists.

Examples are de.xml and en_US.xml. When xml:lang="nl-BE" is specified, dictionaries are searched for in the following order. The same order applies when xml:lang="nld-BEL" is specified.

  1. nl-BE.xml
  2. nl_BE.xml
  3. nl.xml
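
As an illustration of the search order above, the following FO fragment requests Belgian Dutch hyphenation. It is a minimal sketch: the sentence is only sample text, and it assumes that an exception dictionary named nl-BE.xml, nl_BE.xml or nl.xml exists in the hyphenation folder.

```xml
<fo:block hyphenate="true" xml:lang="nl-BE">
Alle mensen worden vrij en gelijk in waardigheid en rechten geboren.
</fo:block>
```

With only nl.xml present, the first two candidate names are not found and nl.xml is the dictionary that gets used.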

The exception dictionary consists of the following elements.

| Element | Location | Description |
|---|---|---|
| <hyphenation-info> | root element | |
| <hyphen-char> | child of <hyphenation-info> | Specifies the character that can be used in place of <hyphen/> within the <exceptions> element. The character is given by the value attribute. The initial value is “-” (U+002D). |
| <hyphen-min> | child of <hyphenation-info> | Gives, with the before and after attributes, the minimum number of characters required before and after the break position when a word is broken by hyphenation. The before attribute corresponds to the hyphenation-remain-character-count property in the XSL specification, and the after attribute corresponds to the hyphenation-push-character-count property. AH Formatter V7.1 uses <hyphen-min> as the initial value of these properties. See hyphen-min in the Option Setting File. |
| <exceptions> | child of <hyphenation-info> | The exception dictionary data. The text of the <exceptions> element is a list of hyphenated words separated by white space. A hyphenation point is indicated by the <hyphen> element; the character specified by the <hyphen-char> element can be used instead. |
| <hyphen> | child of <exceptions> | A full-featured hyphen, equivalent to that of the TeX dictionary. The <hyphen> element has the pre, post and no attributes. The pre attribute gives the string inserted before the hyphenation character when a hyphenation break occurs, the post attribute gives the string inserted after the hyphenation character when a hyphenation break occurs, and the no attribute gives the string that appears when no hyphenation break occurs. Use the <hyphen> element when the spelling of a word changes at a hyphenation break. |
| <non-eol-words> | child of <hyphenation-info> | Specifies non-end-of-line words, separated by white space. A word listed here is kept away from the end of a line where possible, although this cannot always be avoided. This processing is always in effect, regardless of the hyphenate property in the FO. |
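
The elements described above can be combined as in the following sketch of an exception dictionary. The attribute values and the listed words are illustrative assumptions, not recommended settings.

```xml
<hyphenation-info>
  <!-- Character that marks break points in <exceptions>; "-" is the initial value -->
  <hyphen-char value="-"/>
  <!-- At least 2 characters before and 3 after a break (illustrative values) -->
  <hyphen-min before="2" after="3"/>
  <exceptions>
    ta-ble
    present
  </exceptions>
  <!-- Words to keep away from the end of a line (illustrative list) -->
  <non-eol-words>a an the</non-eol-words>
</hyphenation-info>
```

Because <hyphen-min> only supplies initial values, a hyphenation-remain-character-count or hyphenation-push-character-count property specified in the FO should take precedence where it applies.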

In brief, the DTD for the exception dictionary is as follows:

<!ELEMENT hyphenation-info (hyphen-char?, hyphen-min?, exceptions?, non-eol-words?)>

<!ELEMENT hyphen-char EMPTY>
<!ATTLIST hyphen-char value CDATA #REQUIRED>

<!ELEMENT hyphen-min EMPTY>
<!ATTLIST hyphen-min before NMTOKEN #IMPLIED><!-- digits -->
<!ATTLIST hyphen-min after  NMTOKEN #IMPLIED><!-- digits -->

<!ELEMENT exceptions (#PCDATA|hyphen)*>

<!ELEMENT hyphen EMPTY>
<!ATTLIST hyphen pre  CDATA #IMPLIED>
<!ATTLIST hyphen no   CDATA #IMPLIED>
<!ATTLIST hyphen post CDATA #IMPLIED>

<!ELEMENT non-eol-words (#PCDATA)>

The <hyphen> element can change the spelling of a word when it is hyphenated.

| Exception Dictionary | Word | Hyphenated Word |
|---|---|---|
| ab<hyphen/>def | abdef | ab-def |
| ab<hyphen no="c"/>def | abcdef | ab-def |
| ab<hyphen pre="x"/>def | abdef | abx-def |
| ab<hyphen pre="x" no="c"/>def | abcdef | abx-def |
| ab<hyphen post="z"/>def | abdef | ab-zdef |
| ab<hyphen no="c" post="z"/>def | abcdef | ab-zdef |
| ab<hyphen pre="x" post="z"/>def | abdef | abx-zdef |
| ab<hyphen pre="x" no="c" post="z"/>def | abcdef | abx-zdef |

Suppose the following exception dictionary is prepared:

<hyphenation-info>
<exceptions>
ta-ble
present
ba<hyphen pre="k" no="c"/>ken
</exceptions>
</hyphenation-info>

The word “table” will be hyphenated only as “ta-ble”; the word “present” will never be hyphenated; and the word “backen” will be hyphenated as “bak-ken”. Also, “ta<hyphen/>ble” is equivalent to “ta-ble” in this example.
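
The same mechanism covers other spelling changes. The fragments below are a sketch based on well-known cases from traditional orthographies; they are illustrative only and are not taken from the dictionaries shipped with AH Formatter V7.1. Each <exceptions> element would sit inside the <hyphenation-info> root of the dictionary for its own language.

```xml
<!-- de.xml, pre-1996 German spelling: "Schiffahrt" breaks as "Schiff-fahrt" -->
<exceptions>Schif<hyphen pre="f"/>fahrt</exceptions>

<!-- hu.xml, Hungarian: "asszony" breaks as "asz-szony" -->
<exceptions>as<hyphen pre="z"/>szony</exceptions>

<!-- nl.xml, Dutch: "omaatje" breaks as "oma-tje" -->
<exceptions>oma<hyphen no="a"/>tje</exceptions>
```

The first two entries add a letter only when the word is broken (pre), while the third drops a letter at the break by supplying it only for the unbroken word (no).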

The Dutch exception dictionary nl.xml is included with AH Formatter V7.1 (V7.1 MR1 or later).

The exception dictionary is also available for the following languages:

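| 2-Letter Code | 3-Letter Code | Language | Hyphenation Limited To |
|---|---|---|---|
| km | khm | Khmer | Khmer characters |
| lo | lao | Lao | Lao characters |
| my | mya | Burmese (Myanmar) | Burmese characters |
| th | tha | Thai | Thai characters |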

For these languages, the exception dictionary is not used for hyphenation; instead, it specifies words that must not be broken. Each entry can contain only the characters that make up the word. Neither hyphen characters nor <hyphen> can be used in <exceptions>.
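
For example, a Thai exception dictionary that keeps two words unbroken could look like the sketch below. The file name th.xml is assumed from the naming rules given earlier, and the two words (the Thai names of Thailand and Bangkok) are arbitrary examples.

```xml
<hyphenation-info>
<exceptions>
ประเทศไทย
กรุงเทพมหานคร
</exceptions>
</hyphenation-info>
```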

TeX Dictionary

AH Formatter V7.1 can also hyphenate using a TeX dictionary. To do so, specify HyphenationOption="false" in the Option Setting File. A dictionary is required for each language that is to be hyphenated. Dictionaries are XML files in the same format as those used by FOP; see also the Apache website. Only the English hyphenation dictionary (en.xml) is provided with AH Formatter V7.1.
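
A minimal Option Setting File for this might look like the sketch below. Only the HyphenationOption="false" setting comes from the text above; the enclosing <formatter-config> and <formatter-settings> element names are assumptions here and should be checked against the Option Setting File reference.

```xml
<!-- Element names are assumed; verify them in the Option Setting File reference -->
<formatter-config>
  <formatter-settings HyphenationOption="false"/>
</formatter-config>
```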

To hyphenate with a TeX dictionary only for particular languages, specify those languages with hyphenation-TeX in the Option Setting File.

For the file name and location of a TeX dictionary, see Exception Dictionary.

The content of a TeX hyphenation dictionary is defined in hyphenation.dtd, which is included in the FOP distribution. With AH Formatter V7.1, it is installed in the hyphenation folder under the installation folder. Below is a brief explanation of the DTD. For more details, see hyphenation.dtd.

| Element | Location | Description |
|---|---|---|
| <hyphenation-info> | root element | |
| <hyphen-char> | child of <hyphenation-info> | Specifies the hyphenation character used in the exception dictionary data. The character is given by the value attribute. The initial value is “-” (U+002D). Note that the hyphenation character in the actual formatted result is given by the hyphenation-character property in the XSL specification. |
| <hyphen-min> | child of <hyphenation-info> | Gives, with the before and after attributes, the minimum number of characters required before and after the hyphenation character when a hyphenation break occurs in a word. The before attribute corresponds to the hyphenation-remain-character-count property in the XSL specification, and the after attribute corresponds to the hyphenation-push-character-count property. AH Formatter V7.1 uses the setting of <hyphen-min> as the initial value of these properties. |
| <classes> | child of <hyphenation-info> | Defines character equivalence classes. The text of the <classes> element is a white-space-separated list of character groups; all characters in a group are treated as equivalent. In practice, each group consists of a lowercase character and its uppercase counterpart. The following is a sample from the English dictionary (en.xml): aA bB cC dD eE fF gG hH iI jJ kK lL mM nN oO pP qQ rR sS tT uU vV wW xX yY zZ |
| <patterns> | child of <hyphenation-info> | The hyphenation patterns, separated by spaces. A pattern consists of characters and digits. Each character is the first character of a group in <classes> (normally the lowercase one). The digits between characters indicate the strength of the hyphenation potential (hyphenation value). |
| <exceptions> | child of <hyphenation-info> | The hyphenation exception data. The text of the <exceptions> element is a space-separated list of hyphenated words. A hyphenation point is indicated by the <hyphen> element, but the character defined in the <hyphen-char> element can be used instead. Use the <exceptions> element when the hyphenation points determined by the hyphenation patterns are not appropriate, or when you want to apply special hyphenation of your own. |
| <hyphen> | child of <exceptions> | A full-featured hyphen. The <hyphen> element has the pre, post and no attributes. The pre attribute gives the string inserted before the hyphenation character when a hyphenation break occurs, the post attribute gives the string inserted after the hyphenation character when a hyphenation break occurs, and the no attribute gives the string that appears when no hyphenation break occurs. Use the <hyphen> element when the spelling of a word changes at a hyphenation break. |
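
Putting these elements together, a heavily abbreviated TeX dictionary might look like the sketch below. It assumes the element order and the element name patterns used by FOP's hyphenation.dtd. The hyphen-min values are illustrative, and the five patterns are only a tiny excerpt in the style of the standard English TeX patterns, far too few for real use.

```xml
<hyphenation-info>
  <hyphen-char value="-"/>
  <!-- Illustrative minimums: 2 characters before a break, 3 after -->
  <hyphen-min before="2" after="3"/>
  <classes>
    aA bB cC dD eE fF gG hH iI jJ kK lL mM nN oO pP qQ rR sS tT uU vV wW xX yY zZ
  </classes>
  <!-- Words whose break points override the patterns -->
  <exceptions>
    ta-ble
    present
  </exceptions>
  <!-- A leading "." anchors a pattern to the start of a word -->
  <patterns>
    .ach4 .ad4der .af1t .al3t .am5at
  </patterns>
</hyphenation-info>
```

In TeX-style patterns, an odd digit marks a position where a break is allowed and an even digit marks a position where a break is inhibited; when several patterns match the same position, the highest digit wins.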

Restrictions

  • If text is placed in a narrow region and a single word is hyphenated more than once, the result sometimes does not follow the exception dictionary. See also Hyphenation in Technical Notes.

  • When a line break occurs at a position where U+200B or U+200C has been inserted, hyphenation is not performed at U+200B but is performed at U+200C.