Technical Notes

Formatting HTML

Antenna House Formatter V7.4 can format HTML designed for the Web (except for HTML that uses a frame). However, there may be few HTML documents that achieve a good result without needing adjustment after formatting. The reasons are as follows:

  • The HTML document was designed especially for the browser and paginated media was not taken into consideration.
  • The HTML document does not follow the HTML specification.
  • The CSS used in the HTML document may not be used exactly according to the CSS specification.

For example, if the HTML can be printed from a Web browser without overflowing the right-hand side of the page, then formatting with Antenna House Formatter V7.4 will produce a reasonable result. However, in order to achieve a better result, the HTML must be designed both for the browser and for printing. The CSS for printing may be precisely defined using rules such as:

@media print { ... }
@page { ... }

Moreover, the CSS implementations of current Web browsers differ considerably. If the HTML contains markup errors because it was designed for a particular browser, or if it uses CSS incorrectly, a good result is unlikely.

Many (X)HTML documents on the Web use only generic fonts. (This is desirable considering the characteristics of the Web.) Since the font settings for every script in the Option Setting File always apply in Antenna House Formatter V7.4 GUI on Windows, suitable fonts will be used. However, this applies only to Antenna House Formatter V7.4 GUI and only on Windows. When using the Command-line Interface, set appropriate <script-font> values in the Option Setting File and specify the Option Setting File when formatting.

Cascading Order of CSS

The CSS3 specifications define the origins of CSS declarations as follows. The list is in order of priority, and CSS is applied in descending order of priority.

  1. Transition declarations (not supported)
  2. Important user agent declarations
  3. Important user declarations
  4. Important author declarations
  5. Animation declarations (not supported)
  6. Normal author declarations
  7. Normal user declarations
  8. Normal user agent declarations
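
The priority list above can be sketched as a simple lookup. This is only an illustration of the ordering, not the formatter's implementation: the function name and the (origin, value) tuples are hypothetical, and specificity is ignored, so declarations of the same origin are resolved by order of appearance alone.

```python
# Origins in the priority order listed above: index 0 has the highest priority.
ORIGIN_PRIORITY = [
    "transition",
    "important user agent",
    "important user",
    "important author",
    "animation",
    "normal author",
    "normal user",
    "normal user agent",
]


def winning_declaration(declarations):
    """Return the (origin, value) pair whose origin has the highest priority.

    `declarations` is a list of (origin, value) tuples for one property on
    one element. Assumption: for equal origins the later declaration wins.
    """
    best = None
    for origin, value in declarations:
        rank = ORIGIN_PRIORITY.index(origin)
        if best is None or rank <= best[0]:
            best = (rank, origin, value)
    return best[1], best[2]
```

For example, a normal author declaration overrides a normal user declaration, while an important user declaration overrides an important author declaration.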

Antenna House Formatter V7.4 handles these origins as follows.

  • user agent declarations

    It is html.css. See also Default CSS for HTML.

  • user declarations

    These can be specified by <usercss> in the Option Setting File and by -css or -s on the command line. (The .NET interface, Java interface, etc. provide equivalents of these command-line options.) They are applied in the following order.

    1. CSS specified in the Option Setting File and by -css is applied in order of appearance.
    2. CSS specified by -s is applied.

    In the GUI, only the Option Setting File applies. Settings made on the CSS page of the Format Option Setting dialog are reflected in the Option Setting File.

  • author declarations

    These can be specified by <link>, <style>, or the style attribute inside HTML, or by the <?xml-stylesheet ...?> processing instruction. They are applied in the following order.

    1. Applies the processing instruction <?xml-stylesheet?> in XML in order of appearance. (XML or XHTML)
    2. Applies <link> or <style> inside HTML in order of appearance. (XML or XHTML)
    3. Applies the style attribute in HTML, SVG, and MathML.

    For <link>, CSS with rel="stylesheet" or rel="alternate stylesheet" specified such as the following will be applied:

    <link rel="stylesheet" href="main.css" />
    

    <?xml-stylesheet?>, <link> and <style> can have the title attribute. A stylesheet that is not “alternate” and has no title attribute is always applied. A stylesheet with a non-empty title attribute is not necessarily applied: only those that are not “alternate” and have the same title as the first titled stylesheet to appear are applied.

    For example, if the following are specified, B.css and <style> will not be applied. Note that “alternate” requires “title”: a <link> with “alternate” but without “title” is not applied.
    <link rel="stylesheet" href="main.css" />
    <link rel="stylesheet" href="A.css" title="A" />
    <link rel="stylesheet" href="B.css" title="B" />
    <link rel="alternate stylesheet" href="C.css" title="A" />
    <style title="S">...</style>
    

Default CSS for HTML

Default CSS for HTML is used as the first stylesheet (user agent declarations) when formatting (X)HTML. This is html.css, which is placed in the directory indicated by the environment variable AHF74_32_DEFAULT_HTML_CSS or AHF74_64_DEFAULT_HTML_CSS. (If html.css does not exist, all elements are formatted as inline.)

This stylesheet is based on how web browsers display HTML, the styles specified in the CSS specifications, and so on. However, some of its settings may not display well in a particular environment, and preferences also differ. Users should adjust the default CSS to suit their own environment. Some examples are shown below.

  • <q>

    It is specified as follows by default CSS.

    q::before { content: open-quote }
    q::after  { content: close-quote }
    

    In Antenna House Formatter V7.4, the default values of quotes are "\201C" "\201D" "\2018" "\2019". The following specification may be preferable.

    q:lang(en) { quotes: '"' '"' "'" "'" }
    q:lang(no) { quotes: "«" "»" '"' '"' }
    

  • footnote

    A footnote number is specified to be placed in the margin of the left page. If you don't want it to overflow into the margin, specify padding-left or list-style-position:inside on @footnote. decimal is specified for numbering. To use super-decimal instead, change the CSS as follows.

    ::footnote-call {
      content: counter(footnote, super-decimal);
    }
    ::footnote-marker {
      content: counter(footnote, super-decimal);
      -ah-margin-end: 0.5em;
      text-indent: 0;
    }
    

  • ::marker

    The symbol used for the list marker is specified by <list-style-type> in the Option Setting File. Because the glyph is font dependent, different fonts show different markers. Since ::marker has no font setting of its own, the font used depends on the context. Specify a particular font for ::marker if necessary.

Detection of Formatting Type

When formatting starts with the formatting type set to automatic detection, the formatting type is determined by the following procedure.

  1. When MIME is specified, Antenna House Formatter V7.4 will follow its settings. That is, if text/html is specified, it will be detected as HTML. When application/xhtml+xml is specified, it will be detected as XHTML.
  2. When auto-formatter-type="html" is specified in the Option Setting File and the extension of the input document is known, Antenna House Formatter V7.4 will follow its setting. That is, when the extension is for HTML such as .htm or .html, it will be detected as HTML. If the extension is for XHTML, such as .xht or .xhtml, it will be detected as XHTML.
  3. When there is no XML declaration and DOCTYPE is for HTML, it will be detected as HTML. If it is for XHTML, it will be detected as XHTML.
  4. When auto-formatter-type="xhtml" is specified in the Option Setting File and the namespace is for XHTML, it will be detected as XHTML.
  5. When there is no XML declaration, no namespace exists, and the root element is <HTML> (case-insensitive), it will be detected as HTML.
  6. When a stylesheet that is CSS, not XSLT, is specified (inside the document or as an external document), it will be detected as XML+CSS.
  7. When the namespace is for XSL-FO, it will be detected as XSL-FO.
  8. Anything else will be detected as XML+CSS.
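
The procedure above can be sketched as a chain of checks. This is a rough illustration under stated assumptions, not the formatter's code: DocumentInfo and detect_formatting_type are hypothetical names, the DOCTYPE is assumed to be pre-classified as "html" or "xhtml", only the two MIME types named in step 1 are handled, and the "no XML declaration" condition of step 3 is assumed to apply to both the HTML and the XHTML case.

```python
from dataclasses import dataclass
from typing import Optional

XHTML_NS = "http://www.w3.org/1999/xhtml"
XSL_FO_NS = "http://www.w3.org/1999/XSL/Format"


@dataclass
class DocumentInfo:
    """Hypothetical summary of the facts the procedure looks at."""
    mime: Optional[str] = None        # MIME type, if one was specified
    extension: Optional[str] = None   # lower-case extension, e.g. ".html"
    xml_declaration: bool = False     # document starts with <?xml ...?>
    doctype: Optional[str] = None     # "html" or "xhtml" if a DOCTYPE matches
    namespace: Optional[str] = None   # namespace URI of the root element
    root_name: Optional[str] = None   # name of the root element
    has_css: bool = False             # a CSS (non-XSLT) stylesheet is specified


def detect_formatting_type(doc, auto_formatter_type=None):
    # Step 1: an explicitly specified MIME type is followed.
    if doc.mime == "text/html":
        return "HTML"
    if doc.mime == "application/xhtml+xml":
        return "XHTML"
    # Step 2: auto-formatter-type="html" and a known extension.
    if auto_formatter_type == "html":
        if doc.extension in (".htm", ".html"):
            return "HTML"
        if doc.extension in (".xht", ".xhtml"):
            return "XHTML"
    # Step 3: no XML declaration and an (X)HTML DOCTYPE (assumption: both cases).
    if not doc.xml_declaration and doc.doctype in ("html", "xhtml"):
        return doc.doctype.upper()
    # Step 4: auto-formatter-type="xhtml" and the XHTML namespace.
    if auto_formatter_type == "xhtml" and doc.namespace == XHTML_NS:
        return "XHTML"
    # Step 5: no XML declaration, no namespace, root element named html.
    if (not doc.xml_declaration and doc.namespace is None
            and doc.root_name is not None and doc.root_name.lower() == "html"):
        return "HTML"
    # Step 6: a CSS stylesheet is specified.
    if doc.has_css:
        return "XML+CSS"
    # Step 7: the XSL-FO namespace.
    if doc.namespace == XSL_FO_NS:
        return "XSL-FO"
    # Step 8: everything else.
    return "XML+CSS"
```

Because step 6 comes before step 7, a document in the XSL-FO namespace that also specifies a CSS stylesheet is classified as XML+CSS in this sketch.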

For HTML formatting, the document does not need to be XML. For every other formatting type, the document must be well-formed XML.

Antenna House Formatter V7.4 can format graphics files directly. The page size then becomes the size of the graphics. However, anything smaller than 1mm will not be the exact size.

Changes from XSL 1.0 to XSL 1.1

XSL 1.1 makes some changes that are incompatible with XSL 1.0.

  • from-page-master-region()

    In XSL 1.1, writing-mode and reference-orientation specified on <fo:region-*> are ignored and have no effect. To make these specifications effective in XSL 1.1, specify the following on <fo:page-sequence>.

    writing-mode="from-page-master-region()"
    reference-orientation="from-page-master-region()"
    

    To have them evaluated as in XSL 1.0 without changing the FO, specify default-from-page-master-region="true" in the Option Setting File.

  • fo:table

    In XSL 1.0, fo:table is supposed to generate a reference area (see 5.6 in XSL 1.0). XSL 1.1 corrected this as an error. The difference appears mainly when margin-* specified on fo:table is converted to start-indent and end-indent. For example:

    <fo:block margin-left="10pt">
      <fo:table margin-left="0pt">
      ...
    

    In a table like the one above, the left margin may differ between XSL 1.0 and XSL 1.1. If start-indent etc. are used instead of margin-*, this incompatibility does not arise.

    To have the table evaluated as in XSL 1.0 without changing the FO, specify table-is-reference-area="true" in the Option Setting File.

Shorthand

Since shorthand properties in XSL inherit their definitions from CSS, their values are evaluated as in CSS. That is,

margin="0pt -10pt"

is evaluated as two values instead of one formula. However, when it's not a shorthand, this is evaluated as one formula. For example, the following is one formula.

margin-left="0pt -10pt"

Antenna House Formatter V7.4 processes such ambiguous expressions in a shorthand as follows:

  • If the expression cannot be one formula, like "0pt 10pt", it is counted as two values.
  • If the sign is attached to the numerical value, like "0pt -10pt", it is counted as two values.
  • If there is white space between the sign and the numerical value, like "0pt - 10pt", it is counted as one formula.
  • "0pt-10pt" is an error. (See 5.9.5 Numerics in the XSL specification)

In FO, a formula used in a shorthand can be made unambiguous by, for example, enclosing it in parentheses.
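
The four cases for an ambiguous shorthand can be sketched as a small classifier. This is an illustration only: classify_shorthand is a hypothetical helper, it handles just two operands with simple numeric tokens, and treating "+" the same way as "-" is an assumption.

```python
import re

# A simple numeric token with an optional unit, e.g. "0pt", "10pt", "1.5em".
_NUM = r"\d+(?:\.\d+)?[a-z%]*"


def classify_shorthand(value):
    """Classify a two-operand shorthand value following the rules above."""
    value = value.strip()
    if re.fullmatch(rf"{_NUM}\s+{_NUM}", value):
        return "two values"    # "0pt 10pt": cannot be one formula
    if re.fullmatch(rf"{_NUM}\s+[+-]{_NUM}", value):
        return "two values"    # "0pt -10pt": sign attached to the number
    if re.fullmatch(rf"{_NUM}\s+[+-]\s+{_NUM}", value):
        return "one formula"   # "0pt - 10pt": white space around the sign
    if re.fullmatch(rf"{_NUM}[+-]{_NUM}", value):
        return "error"         # "0pt-10pt"
    raise ValueError(f"unsupported value: {value!r}")
```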

With CSS, when the calc() function is written as calc(10pt-5pt), “-” is evaluated as an operator.

Property Value Syntax

This section briefly explains part of the property value syntax used in the XSL/CSS Extensions. The notation conforms to that of CSS. For more details, see also Value Definition Syntax.

  • Component value combinators
    • Values that are simply juxtaposed must all appear, in the given order.
    • Values separated by a double ampersand “&&” must all appear, in any order.
    • One or more of the values separated by a double bar “||” must appear, in any order.
    • Exactly one of the values separated by a bar “|” must appear.
    • Brackets “[ ]” group the content.
    • For “[ ]!”, at least one of the values in the group must appear.
  • Component value multipliers
    • An asterisk “*” indicates that the content appears zero or more times.
    • A plus “+” indicates that the content appears one or more times.
    • A question mark “?” indicates that the content appears zero or one time.
    • “{N}” indicates that the content appears N times.
    • “{N,}” indicates that the content appears N or more times.
    • “{N,M}” indicates that the content appears at least N and at most M times.
    • A hash mark “#” indicates that the content appears one or more times, separated by commas.
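
The multipliers above can be read as repetition bounds. The following sketch maps each multiplier to a (minimum, maximum) pair; multiplier_bounds is a hypothetical helper, and the comma separation implied by “#” is not modeled.

```python
import re


def multiplier_bounds(multiplier):
    """Return (minimum, maximum) repetitions; a maximum of None means unbounded."""
    fixed = {"*": (0, None), "+": (1, None), "?": (0, 1), "#": (1, None)}
    if multiplier in fixed:
        return fixed[multiplier]
    match = re.fullmatch(r"\{(\d+)(,(\d*))?\}", multiplier)
    if not match:
        raise ValueError(f"unknown multiplier: {multiplier!r}")
    low = int(match.group(1))
    if match.group(2) is None:      # {N}
        return low, low
    if match.group(3) == "":        # {N,}
        return low, None
    return low, int(match.group(3))  # {N,M}
```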

Unicode

Antenna House Formatter V7.4 supports Unicode 13.0. Newly added characters may not be treated correctly. In addition, characters of unsupported scripts cannot be treated correctly (see Scripts and Languages). See unicode-bidi-rev in the Option Setting File for the BIDI control characters.

BIDI Algorithm Implementation Restrictions

When an algorithm that is not compatible with V6.6 is selected in unicode-bidi-rev, the BIDI level may not be resolved as specified.

In some cases the text of multiple elements must be combined to determine the BIDI level, as in the following example. This happens when the character before the break is one, such as a space, whose BIDI level depends on the subsequent text, or when the text contains characters whose BIDI level depends on the presence of a corresponding character, such as parentheses or an “isolate”. In these cases, the BIDI level is not evaluated correctly if the range contains an element whose text is converted according to the evaluation result of a property, because the BIDI level is obtained from the text before the property is applied.

<fo:block>aaaa <fo:inline property="xxxx">bbbbb</fo:inline> ...</fo:block>
<fo:block>(aaaa<fo:inline property="xxxx">bbbbb</fo:inline> ...)</fo:block>

As shown in the following example, when an element is backward referenced, the text obtained as the evaluation result is first assumed to be “Neutral” and the BIDI level is obtained once. After the text is obtained, BIDI processing is performed on that text alone.

<p>xxxx<span ref="#yyy" style="content:target-text(attr(ref, url))"></span>zzzz</p>
<p><span id="yyy">ref</span></p>

To place a page number at the left edge of the page by changing the writing direction of an index leader, we recommend placing a space in front of the leader and changing the writing direction with unicode-bidi, as in the following example, rather than adding control characters to “content”.

a::before {
	content: leader(dotted) " " target-counter(attr(ref, url),page);
	unicode-bidi: embed;
	direction: rtl;
}

<toc>حول xxxx <a ref="#yyy"></a></toc>

When line breaks are prohibited by the white-space property, etc. in CSS, line break prohibition works only between the contents of the same BIDI Level in the element, and it is judged that line breaks are possible where the BIDI Level is different.

Unicode Range

To express the Unicode Range as a property value in the Font Configuration File, Option Setting File, etc., use the following format:

[ <urange> | <string> ]# | all

<urange> is a hexadecimal number preceded by U+, in one of the following forms. Hexadecimal digits are case insensitive. (The Unicode specification requires a code point to have 4 to 6 digits, but fewer than 4 digits are allowed in this notation.)

  • a single code point (e.g. U+416)
  • an interval value range (e.g. U+400-4FF)
  • a range where trailing “?” character implies “any digit value” (e.g. U+4??)

U+4?? is equivalent to U+400-4FF. U+??? is equivalent to U+000-FFF. Code points up to U+10FFFF are valid. A range beyond U+10FFFF is disregarded even if it is specified.

<string> is any string enclosed with quotation marks. For example, U+0028-0029 can be written as '()'.

all is treated as if U+0-10FFFF were specified.
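
The format above can be sketched as a small parser that turns a specification into a list of (first, last) code point pairs. This is an illustration, not the formatter's parser: parse_urange and parse_unicode_range are hypothetical names, and clamping a range that only partly exceeds U+10FFFF is an assumption.

```python
import re

MAX_CODE_POINT = 0x10FFFF


def parse_urange(token):
    """Parse one <urange>; return (first, last), or None if it lies beyond U+10FFFF."""
    m = re.fullmatch(r"[Uu]\+([0-9A-Fa-f]*)(\?*)(?:-([0-9A-Fa-f]+))?", token)
    if not m or not (m.group(1) or m.group(2)):
        raise ValueError(f"not a <urange>: {token!r}")
    digits, wildcards, end = m.groups()
    if wildcards and end:
        raise ValueError("wildcards and an interval cannot be combined")
    if wildcards:                    # U+4?? means U+400-4FF
        first = int(digits + "0" * len(wildcards), 16)
        last = int(digits + "F" * len(wildcards), 16)
    elif end:                        # U+400-4FF
        first, last = int(digits, 16), int(end, 16)
    else:                            # U+416
        first = last = int(digits, 16)
    if first > last:
        raise ValueError(f"descending range: {token!r}")
    if first > MAX_CODE_POINT:
        return None                  # disregarded
    return first, min(last, MAX_CODE_POINT)  # assumption: clamp partial overflow


def parse_unicode_range(spec):
    """Parse '[ <urange> | <string> ]# | all' into a list of (first, last) pairs."""
    spec = spec.strip()
    if spec == "all":
        return [(0, MAX_CODE_POINT)]
    ranges = []
    for token in re.findall(r"'[^']*'|\"[^\"]*\"|[^,\s]+", spec):
        if token[0] in "'\"":        # quoted string: one entry per character
            ranges.extend((ord(ch), ord(ch)) for ch in token[1:-1])
        else:
            parsed = parse_urange(token)
            if parsed is not None:
                ranges.append(parsed)
    return ranges
```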

URI

In the XSL specification, <uri-specification> is a character string that conforms to the IRI specification (RFC 3987), specified in url(). For convenience, IRI is called URI in this document.

Schemes which can actually be specified in Antenna House Formatter V7.4 are as follows:

  • http:
  • https: (Websites cannot be accessed if they have any problem with their certificates)
  • file:
  • data:
  • jar:

It's possible to specify a correct absolute URI that includes the scheme name without using url(). For example, the following two are the same.

<fo:external-graphic src="url('http://localhost/image.png')"/>
<fo:external-graphic src="http://localhost/image.png"/>

Moreover, it's possible to specify a relative URI without specifying the scheme name.

<fo:external-graphic src="url('image.png')"/>
<fo:external-graphic src="image.png"/>

For the user's convenience, Antenna House Formatter V7.4 allows a file name on the local file system to be specified instead of a URI. In general, however, URIs and local file names are not compatible. For example, white space is not allowed in a URI but may be allowed in a local file name. Moreover, since a local file name may contain a literal %, the character string foo%20bar.png points to different resources depending on whether it is evaluated as a URI or as a local file name.

Antenna House Formatter V7.4 solves this problem as follows:

  • When the scheme is specified, it is adopted as is.
  • When the scheme is not specified and surrounded by url(), it is processed as follows:
    1. If URI is correct, it will be adopted as is.
    2. If URI is incorrect, % escape processing is done.
  • When the scheme is not specified explicitly and not surrounded by url(), it is processed as follows:
    1. In the Windows environment, “\” is changed into “/”.
    2. % escape processing is done.

The relative URI is combined with base-uri and transformed into the absolute URI. All local file names are transformed into a file scheme at this time. For example, in the Windows environment, when base-uri is C:\home\, it is transformed as follows:

foobar.png               | file:///C:/home/foobar.png
url('foobar.png')        | file:///C:/home/foobar.png
url('url(foobar.png)')   | file:///C:/home/url(foobar.png)
subdir\foobar.png        | file:///C:/home/subdir/foobar.png
url('subdir\foobar.png') | file:///C:/home/subdir%5Cfoobar.png
url('subdir/foobar.png') | file:///C:/home/subdir/foobar.png
foo bar.png              | file:///C:/home/foo%20bar.png
url('foo bar.png')       | file:///C:/home/foo%20bar.png
foo%20bar.png            | file:///C:/home/foo%2520bar.png
url('foo%20bar.png')     | file:///C:/home/foo%20bar.png
foo%%20bar.png           | file:///C:/home/foo%25%2520bar.png
url('foo%%20bar.png')    | file:///C:/home/foo%25%2520bar.png
foo#bar.png              | file:///C:/home/foo#bar.png
url('foo#bar.png')       | file:///C:/home/foo#bar.png
foo%23bar.png            | file:///C:/home/foo%2523bar.png
url('foo%23bar.png')     | file:///C:/home/foo%23bar.png
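
The rules and the transformations listed above can be approximated with a short sketch. It is not the formatter's implementation: normalize and resolve are hypothetical names, the validity check accepts ASCII only (real IRIs also allow non-ASCII characters), a scheme is assumed to be at least two characters long so that a drive letter such as C: is not mistaken for one, and the base URI is joined by plain concatenation, which is enough for the simple relative names shown.

```python
import re
from urllib.parse import quote

_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]+:")
_VALID_URI = re.compile(r"^(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*$")
_URL_FUNCTION = re.compile(r"^url\(\s*(['\"]?)(.*)\1\s*\)$")
_KEEP = "/#?:@!$&'()*+,;=[]"   # reserved characters are left unescaped


def escape(text):
    """% escape processing: every "%" and every disallowed character is escaped."""
    return quote(text, safe=_KEEP)


def normalize(spec, windows=True):
    match = _URL_FUNCTION.match(spec)
    value = match.group(2) if match else spec
    if _SCHEME.match(value):
        return value                      # scheme specified: adopted as is
    if match:                             # no scheme, surrounded by url()
        return value if _VALID_URI.match(value) else escape(value)
    if windows:                           # no scheme, not surrounded by url()
        value = value.replace("\\", "/")
    return escape(value)


def resolve(spec, base="file:///C:/home/", windows=True):
    value = normalize(spec, windows)
    return value if _SCHEME.match(value) else base + value
```

Under these assumptions, resolve() reproduces each row of the list above, for example resolve("foo%20bar.png") and resolve("url('foo%20bar.png')").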

A local file name cannot be written directly into url(). For example:

url('C:\My Document\foobar.png')

The string above will not work as expected. Specify a local file name without enclosing it in url().

“#” is a fragment separator. In file:///C:/home/foo#bar.png, the resource actually accessed is file:///C:/home/foo. Specify url('foo%23bar.png') to access a resource called foo#bar.png.

A Windows UNC (Universal Naming Convention) path such as \\host\My Document\foobar.png is transformed into file://host/My%20Document/foobar.png. Also, //host/My Document/foobar.png is transformed into http://host/My%20Document/foobar.png when base-uri is http:. (The same applies to https:.) In non-Windows environments, file://host/... is not supported.

The format of the data scheme defined in RFC2397 is:

"data:" [ mediatype ] [ ";base64" ] "," data

For example, specify as follows:

<fo:external-graphic
src="
3RJTUUH1AIFCDIuN9BfzQAAAAlw ... ="/>

The media type (content-type) is optional in the data scheme. If it is specified, the data is assumed to be of that type. Note that a semicolon “;” is required when specifying base64, and that the data delimiter is a comma “,”.
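
As a minimal sketch of the format above, a data URI with base64 content can be assembled as follows. make_data_uri is a hypothetical helper and the sample bytes are placeholders, not real image data.

```python
import base64


def make_data_uri(data, mediatype=""):
    """Build "data:" [ mediatype ] ";base64" "," data for the given bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mediatype};base64,{encoded}"
```

The result can then be used as the value of an attribute such as src on <fo:external-graphic>.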

The jar scheme defined by JarURLConnection can also be specified. It works with JAR and ZIP archives and can specify an entry inside the archive.

jar:http://www.foo.com/bar/baz.jar!/COM/foo/Quux.png

Everything after the first “!/” separator is treated as the entry specification. Nested JAR or ZIP archives are not supported.
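
The split at the first “!/” separator can be sketched as follows. split_jar_uri is a hypothetical helper for illustration only.

```python
def split_jar_uri(uri):
    """Split a jar: URI into (archive URI, entry path) at the first "!/"."""
    if not uri.startswith("jar:"):
        raise ValueError("not a jar: URI")
    archive, separator, entry = uri[len("jar:"):].partition("!/")
    if not separator:
        raise ValueError("missing '!/' separator")
    return archive, entry
```

For the example above this gives the archive http://www.foo.com/bar/baz.jar and the entry COM/foo/Quux.png.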

When accessing HTTP or HTTPS via a proxy in non-Windows environments, it's necessary to specify the proxy address with the HTTP_PROXY or HTTPS_PROXY environment variable.

When the root certificate is necessary in non-Windows environments, it's necessary to specify the file of the root certificate with the SSL_CERT environment variable.

Multi-domain certificates are supported.

Table Auto Layout

A table (<fo:table>) can have the attribute table-layout="fixed" or table-layout="auto". The former specifies a fixed layout with fixed column widths, and the latter specifies an automatic layout in which the column widths are calculated automatically. When the attribute is omitted, the default is table-layout="auto". In the XSL specification, the automatic layout is implementation-dependent. This section explains the implementation in Antenna House Formatter V7.4.

An automatic layout can take a lot of time for calculating the width of columns. Specify table-layout="fixed" if high-speed formatting is desired.

In Antenna House Formatter V7.4, how a table is processed depends on both the table-layout value and whether a width is specified on <fo:table>. When the widths of all columns are specified, the table is treated as table-layout="fixed" even if table-layout="auto" is specified. Moreover, according to the XSL specification, proportional-column-width() may be specified only with table-layout="fixed". In Antenna House Formatter V7.4, when columns with proportional-column-width() are mixed with columns that have no width specification, the columns without a width are treated as having column-width="proportional-column-width(1)", and the table is processed as if table-layout="fixed" were specified. That is, in this case all columns have a width specification.
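
The rule in the preceding paragraph can be sketched as a small function. effective_layout is a hypothetical helper; column widths are represented as strings, with None for a column that has no width specification.

```python
def effective_layout(table_layout, column_widths):
    """Return (layout, widths) after applying the rules described above."""
    widths = list(column_widths)
    uses_proportional = any(
        width is not None and width.startswith("proportional-column-width(")
        for width in widths
    )
    if uses_proportional:
        # Columns without a width are treated as proportional-column-width(1).
        widths = ["proportional-column-width(1)" if width is None else width
                  for width in widths]
    if widths and all(width is not None for width in widths):
        # Every column has a width: processed as table-layout="fixed".
        return "fixed", widths
    return table_layout, widths
```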

table-layout | Width of fo:table | Processing Method
fixed        | Yes               | The width is divided equally among the columns whose width is not specified. When the content exceeds the width, it will overflow.
fixed        | No                | The table width becomes 100%. The width is divided equally among the columns whose width is not specified. When the content exceeds the width, it will overflow.
auto         | Yes               | The content of the columns is measured and a width is assigned to each column whose width is not specified. When the table exceeds its specified width even if the minimum width of each column is adopted, the table width expands to the exceeded width.
auto         | No                | The content of the columns is measured and a width is assigned to each column whose width is not specified. When the table does not fill 100% even if the maximum width of each column is adopted, that width becomes the table width. When the table exceeds 100% even if the minimum width of each column is adopted, that width becomes the table width. Otherwise, the table width becomes 100%.

When table-layout="auto" is specified, the content of the columns whose width is not specified is examined. A better column width can be determined if all rows are examined, but that takes too much time for a big table. Antenna House Formatter V7.4 usually examines the content of at most 100 rows to determine the column widths. This number of rows can be changed with table-auto-layout-limit in the Option Setting File.

Line Breaking

Antenna House Formatter V7.4 supports two types of line breaking. One breaks each line at an appropriate point as the line reaches the line width. The other follows the line breaking algorithm of Knuth and Plass's “Breaking Paragraphs into Lines” (hereinafter referred to as BPIL). BPIL determines the break positions by considering the balance of the whole block.

Candidates for line breaking positions are determined by the processing of UAX#14: Line Breaking Properties. The UAX#14 processing is somewhat different from the Unicode specification as follows:

  • Nonstarter Japanese characters defined in JIS X 4051:2004 can be controlled by axf:line-break.

  • Although LB30 in UAX#14 prohibits line breaks before an open parenthesis and after a close parenthesis, Antenna House Formatter V7.4 permits line breaks for full-width parentheses. The targets are the full-width open parentheses, full-width close parentheses, and full-width punctuation indicated in axf:punctuation-trim.

  • The line breaking class AI in a CJK script is processed as ID. However, U+2015 (HORIZONTAL BAR) is processed as IN since it is a non-breaking character in JIS X 4051:2004.

  • The line breaking class of half-width kana is AL, so, as with alphabetic text, no line break occurs unless there is a space between words. Antenna House Formatter V7.4 treats half-width kana like full-width kana for line breaking.

  • UAX#14 allows a line break immediately after U+002F (SOLIDUS), so a line break can occur within abbreviations such as km/h and w/o. UAX#14 states clearly that such breaks are undesirable. Antenna House Formatter V7.4 makes it possible to control line breaking within such abbreviations with axf:abbreviation-character-count.
  • The ideographic space (U+3000) is treated as a non-starter character. If you don't want to treat it as a non-starter character, specify non-starter-ideographic-space="false" in the Option Setting File.

  • U+200C and U+200D are processed as follows:
    1. Line breaking will not be done before and after U+200D.
    2. Line breaking will be considered to be available before and after U+200C.

BPIL is applied to blocks whose language is specified by xml:lang or the language property, or by default-lang in the Option Setting File. However, BPIL is not applied in every situation. In the following cases, BPIL is not applied and lines are broken at the end of each line instead.

  • Blocks that contain leaders (such as <fo:leader>)
  • Blocks that contain floats or blocks with complicated adjustment of the spacing (BPIL may be applied for simple adjustment of the spacing)
  • Blocks that contain form field
  • Blocks that contain ruby that is longer than the ruby base character
  • Blocks that require BIDI processing
  • Blocks that contain axf:indent-here
  • Blocks that contain <axf:tab> or tab characters with axf:tab-treatment="preserve"
  • Blocks that contain overflowed lines where the line breaking is not restricted by wrap-option="no-wrap", etc.
  • Blocks whose font-size is 0 or line-height is 0 or less
  • Narrow area (The minimum line width is specified by bpil-minimum-line-width in the Option Setting File)
  • Large blocks (Limited number of characters is specified by bpil-limit-chars in the Option Setting File)
  • Blocks that have page masters changing to different page sizes

The following are restrictions.

  • For blocks that contain <fo:initial-property-set> or ::first-line, BPIL is applied to the second and subsequent lines.
  • If the block spans three or more pages (columns), BPIL may not be applied to the second and subsequent pages (columns).

Quotation Mark

Quotation marks are characters that belong to the character class QU in UAX#14: Line Breaking Properties. Quotation marks generally have an open and close direction, but QU does not. Therefore, if nothing is done, it will have undesired results when breaking lines. Unicode, on the other hand, says that if language information is available, it can be used to determine which character is used as the open or close quotation marks and treat them as OP or CL.

Antenna House Formatter V7.4 treats the following characters as quotation marks (including some non-QU characters in UAX#14.) QU/OP/CL shown here indicates in which direction Antenna House Formatter V7.4 treats the character (not the character class in UAX#14.)

U+0022 | QU | QUOTATION MARK
U+0027 | QU | APOSTROPHE
U+00AB | OP | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+00BB | CL | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
U+2018 | OP | LEFT SINGLE QUOTATION MARK
U+2019 | CL | RIGHT SINGLE QUOTATION MARK
U+201A | OP | SINGLE LOW-9 QUOTATION MARK
U+201B | OP | SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+201C | OP | LEFT DOUBLE QUOTATION MARK
U+201D | CL | RIGHT DOUBLE QUOTATION MARK
U+201E | OP | DOUBLE LOW-9 QUOTATION MARK
U+201F | OP | DOUBLE HIGH-REVERSED-9 QUOTATION MARK
U+2039 | OP | SINGLE LEFT-POINTING ANGLE QUOTATION MARK
U+203A | CL | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
U+275B | QU | HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
U+275C | QU | HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT
U+275D | QU | HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
U+275E | QU | HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
U+275F | QU | HEAVY LOW SINGLE COMMA QUOTATION MARK ORNAMENT
U+2760 | QU | HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
U+2E00 | QU | RIGHT ANGLE SUBSTITUTION MARKER
U+2E01 | QU | RIGHT ANGLE DOTTED SUBSTITUTION MARKER
U+2E02 | OP | LEFT SUBSTITUTION BRACKET
U+2E03 | CL | RIGHT SUBSTITUTION BRACKET
U+2E04 | OP | LEFT DOTTED SUBSTITUTION BRACKET
U+2E05 | CL | RIGHT DOTTED SUBSTITUTION BRACKET
U+2E06 | QU | RAISED INTERPOLATION MARKER
U+2E07 | QU | RAISED DOTTED INTERPOLATION MARKER
U+2E08 | QU | DOTTED TRANSPOSITION MARKER
U+2E09 | OP | LEFT TRANSPOSITION BRACKET
U+2E0A | CL | RIGHT TRANSPOSITION BRACKET
U+2E0B | QU | RAISED SQUARE
U+2E0C | OP | LEFT RAISED OMISSION BRACKET
U+2E0D | CL | RIGHT RAISED OMISSION BRACKET
U+2E1C | OP | LEFT LOW PARAPHRASE BRACKET
U+2E1D | CL | RIGHT LOW PARAPHRASE BRACKET
U+2E20 | OP | LEFT VERTICAL BAR WITH QUILL
U+2E21 | CL | RIGHT VERTICAL BAR WITH QUILL
U+301D | OP | REVERSED DOUBLE PRIME QUOTATION MARK
U+301E | CL | DOUBLE PRIME QUOTATION MARK
U+301F | CL | LOW DOUBLE PRIME QUOTATION MARK

Quotation marks have different directions, mainly in European languages. For example, in French it will be «Guillemets» and in German it will be »Guillemets«. Antenna House Formatter V7.4 uses the above settings by default, but for some languages they are overridden as follows. Blank cells and characters not listed here are the same as the default.

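language code  | language             | U+00AB | U+00BB | U+2018 | U+2019 | U+201C | U+201D | U+2039 | U+203A
default        |                      | OP     | CL     | OP     | CL     | OP     | CL     | OP     | CL
az, aze        | Azerbaijani          |        |        | CL     |        | CL     |        |        |
bs, bos        | Bosnian              | CL     | OP     |        | QU     |        | QU     | CL     | OP
bg, bul        | Bulgarian            |        |        | CL     |        | CL     |        |        |
cs, ces        | Czech                | CL     | OP     | CL     |        | CL     |        | CL     | OP
da, dan        | Danish               | CL     | OP     | CL     |        | CL     |        | CL     | OP
de, deu        | German               | CL     | OP     | CL     |        | CL     |        | CL     | OP
de-CH, deu-CH  | German (Switzerland) |        |        | CL     |        | CL     |        |        |
et, est        | Estonian             |        |        | CL     |        | CL     |        |        |
fi, fin        | Finnish              |        | QU     |        | QU     |        | QU     |        | QU
hr, hrv        | Croatian             | CL     | OP     |        |        |        |        | CL     | OP
hu, hun        | Hungarian            | CL     | OP     |        |        |        |        | CL     | OP
is, isl        | Icelandic            |        |        | CL     |        | CL     |        |        |
ka, kat        | Georgian             |        |        | CL     |        | CL     |        |        |
lt, lit        | Lithuanian           |        |        | CL     |        | CL     |        |        |
mk, mkd        | Macedonian           |        |        | CL     |        | CL     |        |        |
no, nor        | Norwegian            |        | QU     |        | QU     |        | QU     |        | QU
pl, pol        | Polish               | CL     | OP     |        |        |        |        | CL     | OP
ru, rus        | Russian              |        |        | CL     |        | CL     |        |        |
sk, slk        | Slovak               | CL     | OP     | CL     |        | CL     |        | CL     | OP
sl, slv        | Slovene              | CL     | OP     | CL     |        | CL     |        | CL     | OP
sq, sqi        | Albanian             |        |        | CL     |        | CL     |        |        |
sv, swe        | Swedish              |        | QU     |        | QU     |        | QU     |        | QU
uz, uzb        | Uzbek                |        |        | CL     |        | CL     |        |        |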

You can change the direction of the quotation marks with quotationmark in the Option Setting File. You can also specify the direction of the quotation marks with axf:quotetype.

Characters in quotes used in CSS open-quote/close-quote are forced to OP or CL regardless of these settings.

OP is a quotation mark that is treated like an open parenthesis and CL is a quotation mark that is treated like a close parenthesis. QU is a non-directional quotation mark. For characters that are QU, Antenna House Formatter V7.4 processes them as follows:

  • QU at the beginning of the string is considered OP.
  • QU at the end of the string is considered CL.
  • QU within the string is considered OP if there is no white space immediately after it and there is a white space immediately before it.
  • QU within the string is considered CL if there is no white space immediately before it and there is a white space immediately after it.
  • In all other cases, it remains QU.
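
The rules above for resolving a QU character can be sketched as follows. resolve_qu is a hypothetical helper that works on a plain string and an index; checking the beginning of the string before the end is an assumption that only matters for a one-character string.

```python
def resolve_qu(text, index):
    """Return "OP", "CL" or "QU" for the QU character at text[index]."""
    if index == 0:
        return "OP"                       # at the beginning of the string
    if index == len(text) - 1:
        return "CL"                       # at the end of the string
    space_before = text[index - 1].isspace()
    space_after = text[index + 1].isspace()
    if space_before and not space_after:
        return "OP"                       # white space before, none after
    if space_after and not space_before:
        return "CL"                       # white space after, none before
    return "QU"                           # otherwise unchanged
```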

Hyphenation

This section explains the behavior of the page (or column) break when hyphenation-keep="page" (or "column") is specified. Suppose there is the following sentence with hyphenation-keep="page" specified.

xxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
xxxxxxxxxxx abc-
def xxxxxxx ghi-
jkl mnopqr.

When the page break occurs at the last line, ghi is pushed to the next page, resulting in the following:

xxxxxxxxxxxxxxxx
xxxxxxxxxxx abc-
def xxxxxxx
---------------- page break
ghijkl mnopqr.

When widows="2" is specified, one more line is pushed to the next page, resulting in the following:

xxxxxxxxxxxxxxxx
xxxxxxxxxxx abc-
---------------- page break
def xxxxxxx ghi-
jkl mnopqr.

However, this conflicts with hyphenation-keep="page", because the page now ends with the hyphenated abc-. Antenna House Formatter V7.4 cannot push abc alone, so one more line is pushed to the next page.

xxxxxxxxxxxxxxxx
---------------- page break
xxxxxxxxxxx abc-
def xxxxxxx ghi-
jkl mnopqr.

Whenever the preceding line also ends with a hyphen, lines keep being pushed one after another. It is therefore advisable to use hyphenation-keep together with hyphenation-ladder-count.

In a slightly different case, pushing ghi may increase the number of lines, as follows:

xxxxxxxxxxxxxxxx
xxxxxxxxxxx xxxx
xxx xxxxxxx
---------------- page break
ghijkl xxxx mno-
pqr.

When widows="3" is specified, one more line is pushed. The number of lines may then decrease, as follows:

xxxxxxxxxxxxxxxx
xxxxxxxxxxx xxxx
---------------- page break
xxx xxxxxxx ghi-
jkl xxxx mnopqr.

Only two lines now follow the page break, so widows="3" is violated as a side effect, and Antenna House Formatter V7.4 cannot resolve this. This is a limitation of Antenna House Formatter V7.4. widows="2" never causes this situation.

Variation Sequence

Antenna House Formatter V7.4 supports Unicode Variation Sequences. When an OpenType font supports Variation Sequences (cmap format 14), they are processed appropriately. For example, Variation Sequences can be expressed as follows:

葛&#xE0100;城市
葛城市
葛&#xE0101;飾区
葛飾区

Even when a Variation Sequence is applied to a CID font that does not support Variation Sequences, the CID is selected according to the following IVD (UTS#37: Ideographic Variation Database).

  • /ivd/data/2007-12-14 Combined registration of the Adobe-Japan1 collection and of sequences in that collection

&#xE0100;, etc. are ignored when the font does not support Variation Sequences, when there is no corresponding variation character, or when the specified Variation Sequence is out of range. This means that even with the same settings, the displayed glyph may differ depending on which Variation Sequences the font supports.
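As a small illustration of how such a sequence is put together, the following Python sketch appends a variation selector from the Variation Selectors Supplement (U+E0100 to U+E01EF) to a base character and writes it as a numeric character reference, as in the example above. The function names are hypothetical and the sketch is independent of Antenna House Formatter itself.

```python
VS_FIRST = 0xE0100  # first selector used for ideographic variation sequences
VS_LAST = 0xE01EF   # last selector in the Variation Selectors Supplement


def with_variation_selector(base, index):
    """Return base followed by the selector U+E0100 + index."""
    code = VS_FIRST + index
    if len(base) != 1 or not VS_FIRST <= code <= VS_LAST:
        raise ValueError("need one base character and index 0..239")
    return base + chr(code)


def to_ncr(text):
    """Write variation selectors as numeric character references."""
    return "".join(
        f"&#x{ord(ch):X};" if VS_FIRST <= ord(ch) <= VS_LAST else ch
        for ch in text
    )


sequence = with_variation_selector("葛", 1)
markup = to_ncr(sequence + "飾区")
```

Here markup holds the same text as the third example line above.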

Font Selection

Fonts in FO or CSS are specified by the font-family property. Several candidate fonts may be enumerated, as in font-family="'Courier New', serif", or font-family may not be specified at all. Antenna House Formatter V7.4 determines which font should be applied to a character string as follows:

  1. The character strings in the region are divided into runs of the same script, using the script information that Unicode defines for each character, the language specified in FO or CSS, other script information, and so on, and the script of each run is determined. This determination is complicated, because Unicode contains characters for which it is ambiguous whether they are full width, and because the language cannot be determined from a string that consists only of kanji.

  2. When font-selection-mode="6" is specified in the Option Setting File, each character of the string is checked, in order, for whether the fonts specified by font-family in FO or CSS have its glyph, and the first font found to have the glyph is adopted. When it is not specified, the fonts specified by font-family are checked, in order, for whether they have the glyphs and support the Unicode range or script, and the first supporting font found is adopted. When no font-family is specified, the generic font family set as the default font family is assumed.

In XSL or CSS, the following five can be used as the generic font family.

  • serif
  • sans-serif
  • cursive
  • fantasy
  • monospace

Antenna House Formatter V7.4 holds the information about which actual font corresponds to each of these for every script. A generic font that does not belong to any script can also be defined. These can be specified on the Font Setting page of the Option Setting dialog in the Graphical User Interface, or with <script-font> in the Option Setting File.

  1. When a script-specific generic font is specified for the script of the target character string, it is checked for whether it supports the character string.

  2. When no script-specific generic font is specified for that script, the generic font that does not belong to any script is checked.

  3. When auto-fallback-font="true" is specified in the Option Setting File and none of the fonts specified in font-family supports the target character string, the following fallback processing is performed.

    1. The font specified as the fallback for the corresponding script is checked.
    2. The font specified as the fallback of the generic font is checked.
    3. If still no font supports the target character string, the following fonts are checked in order.
      • Windows versions
        1. Lucida Sans Unicode
        2. Microsoft Sans Serif
        3. IPAGothic
        4. Code2000
        5. MS PGothic
        6. Arial Unicode MS
      • Non-Windows versions
        1. Helvetica
        2. IPAGothic
        3. Code2000

  4. If no font that supports the target character string is found even then, it is an error.
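The lookup order described above can be summarized as a simplified Python sketch. It is not the formatter's actual algorithm: the function select_font, its parameters, and the coverage table (font name to supported characters) are hypothetical, the whole string is tested at once rather than per glyph, "serif" as the default family is an assumption, and only the non-Windows built-in fallback list from the text is used.

```python
GENERIC = {"serif", "sans-serif", "cursive", "fantasy", "monospace"}
# Built-in fallback order for non-Windows versions, as listed in the text.
BUILTIN_FALLBACK = ["Helvetica", "IPAGothic", "Code2000"]


def select_font(text, families, script, script_fonts, coverage, *,
                default_family="serif",  # assumption: the default family
                auto_fallback=True,
                script_fallback=(), generic_fallback=()):
    """Return the first font that supports every character of text."""

    def supports(font):
        return font is not None and all(
            ch in coverage.get(font, set()) for ch in text)

    for family in (families or [default_family]):
        if family in GENERIC:
            # Script-specific generic font first, else the generic
            # font that does not belong to any script.
            font = (script_fonts.get(script, {}).get(family)
                    or script_fonts.get("default", {}).get(family))
        else:
            font = family
        if supports(font):
            return font

    if auto_fallback:
        for font in [*script_fallback, *generic_fallback,
                     *BUILTIN_FALLBACK]:
            if supports(font):
                return font

    raise LookupError("no font supports the target character string")


# Hypothetical data for illustration only.
coverage = {"Courier New": set("abc"), "SimSun": set("汉字"),
            "Code2000": set("☃")}
script_fonts = {"Hans": {"serif": "SimSun"},
                "default": {"serif": "Times New Roman"}}
chosen = select_font("汉字", ["Courier New", "serif"], "Hans",
                     script_fonts, coverage)
```

With this data, chosen is "SimSun": Courier New lacks the glyphs, so the serif entry for Hans is used.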

The settings in the Option Setting dialog are reflected in the Option Setting File. For example, they are written as follows:

<script-font script="Hans" serif="SimSun" sans-serif="SimHei" monospace="SimSun"/>

Since cursive is not specified here, the cursive of the generic font is used for Hans. When <script-font script="Hans"/> itself is not specified, as is the case immediately after installation, the default group is assumed. The following default group is set up in the Windows versions. Only the scripts listed here are set up, and a font is not set up when it does not actually exist.

Script | serif | sans-serif | cursive | fantasy | monospace
default | Times New Roman | Arial | Segoe Script or Comic Sans MS or Monotype Corsiva | Impact | Courier New
Jpan | MS Mincho | MS Gothic | MS Mincho or MS Gothic | MS Mincho or MS Gothic | MS Gothic or MS Mincho
Hans | SimSun or MS Song | SimHei or MS Hei or MS Song | SimSun or MS Song | SimSun or MS Song | SimHei or MS Hei or MS Song
Hant | MingLiU
Hang | Batang or BatangChe | Gulim or BatangChe | Batang or BatangChe | Batang or BatangChe | BatangChe
Armn no-LT | Arian AMU Serif or Arian AMU | Arian AMU | Arian AMU | Arian AMU | Arian AMU Mono or Arian AMU
Geor no-LT | Sylfaen
Ethi no-LT | Nyala
Arab | Arabic Typesetting
Syrc no-LT | Estrangelo Edessa
Hebr | FrankRuehl
Deva | Mangal
Beng no-LT | Vrinda
Guru no-LT | Raavi
Gujr no-LT | Shruti
Taml no-LT | Latha
Telu no-LT | Gautami
Knda no-LT | Tunga
Mlym no-LT | Kartika
Sinh no-LT | Iskoola Pota
Thai | Angsana New
Khmr no-LT | DaunPenh
Laoo no-LT | DokChampa
Mymr no-LT | Myanmar Text
Zsye no-LT | Segoe UI Emoji
Zsym no-LT | Segoe UI Symbol

The following default group is set up with the Macintosh versions.

Script | serif | sans-serif | cursive | fantasy | monospace
default | Times or Times New Roman | Helvetica or Arial | Monaco or Chalkboard | Monaco or Chalkboard | Courier
Jpan | HiraMinPro W3 | HiraKakuPro W3 | HiraMaruPro W3 or HiraKakuPro W3 | HiraMaruPro W3 or HiraKakuPro W3 | HiraKakuPro W3
Hans | STXihei or STSong | STSong | STXihei or STSong | STXihei or STSong | STSong
Hant | LiHeiPro or LiSongPro | LiSongPro | LiHeiPro or LiSongPro | LiHeiPro or LiSongPro | LiSongPro
Hang | AppleMyungjo | AppleGothic | AppleMyungjo | AppleMyungjo | AppleGothic
Arab | Geeza Pro
Hebr | NewPeninimMT
Deva | DevanagariMT
Thai | Thonburi
Zsye no-LT | Apple Color Emoji

The following default group is set up with the Linux versions.

Script | serif | sans-serif | cursive | fantasy | monospace
default | Times | Helvetica | Times | Times | Courier

Glyphs in Vertical Text

There are basically three types of text orientation in Japanese or Chinese documents, as follows:

In horizontal writing | In vertical writing (SVO) | In vertical writing (MVO)

The orientation of text in vertical writing mode is expressed with U or R. U is a character displayed upright on the paper. R is a character rotated 90 degrees clockwise on the paper. The text orientation in vertical writing mode is then as follows:

  • Japanese characters like "漢字" are U.
  • Brackets are R.
  • Punctuation marks are U once the glyph for vertical writing is used.
  • European characters like "Abc" are U in SVO, R in MVO.

UAX#50: Unicode Vertical Text Layout discusses which characters should be upright and which should be rotated 90 degrees. It currently describes only MVO (Mixed Vertical Orientation), but a description of SVO (Stacked Vertical Orientation) was also included in the past (tr50-6.html). Antenna House Formatter V7.4 implements axf:text-orientation="mixed" in compliance with MVO and axf:text-orientation="upright" in compliance with SVO. However, Antenna House Formatter V7.4 uses the data with some modifications (tr50-x.Orientation.txt). This data can be modified arbitrarily in the Option Setting File. See also UAX50.

Usually, a font that supports vertical writing mode has glyphs for vertical writing for some characters, because some characters cannot be used in vertical writing simply by rotating the glyph for horizontal writing mode. Examples are small kana, punctuation marks, and the long vowel mark. In vertical writing mode, if a character has a glyph for vertical writing, that glyph is used.

The orientation of text (U or R) is determined and expressed relative to the orientation of the glyph for horizontal writing mode. However, some glyphs for vertical writing mode differ from those for horizontal writing mode. The example below shows the glyphs of U+3083, U+FF08, and U+2190. U+FF08 and U+2190 have different orientations in vertical and horizontal writing modes.

Glyph for horizontal writing | Glyph for vertical writing

Although “brackets are R” as mentioned above, in practice they have to be displayed as U using the glyph for vertical writing mode. That is, there is a tacit assumption that the glyph for vertical writing mode is designed with a different orientation from the glyph for horizontal writing mode. Whether a font has a glyph for vertical writing mode, and whether its orientation is the same as that of the glyph for horizontal writing mode, depends on the font. The differences between fonts are particularly noticeable in the orientation of symbols such as arrows. Since there is no way to know which orientation a glyph was designed for, this problem cannot be solved in general. Therefore, Antenna House Formatter V7.4 controls the orientation of characters in line with the major implementations.

Emoji

Antenna House Formatter V7.4 partially supports Emoji characters (UTS#51: Unicode Emoji). The following types of OpenType Emoji fonts are supported.

  • Emoji fonts with COLR/CPAL table

    Segoe UI Emoji font on Windows, etc. (Gradient is not supported)

  • Emoji fonts with sbix table

    Apple Color Emoji font, etc.

  • Emoji fonts with CBDT/CBLC table

    Noto Color Emoji fonts, etc.

Antenna House Formatter V7.4 tries to display the following strings as <Emoji>.

<Emoji> ::= <EmojiUnit> [U+200D [<EmojiUnit>|<EmojiB>]]*
          | <EmojiKeycapBase> U+FE0F U+20E3
          | <EmojiRegionalIndicator> <EmojiRegionalIndicator>
<EmojiUnit> ::= <EmojiA>
              | [<EmojiA>|<EmojiB>] [U+FE0F
                                   | <EmojiModifier>
                                   | <EmojiTagModifier>* U+E007F]
<EmojiA> ::= U+1F000..U+1FFFD
<EmojiB> ::= Characters classified as Emoji other than <EmojiA>
<EmojiModifier> ::= U+1F3FB..U+1F3FF
<EmojiTagModifier> ::= U+E0020..U+E007E
<EmojiKeycapBase> ::= '#' | '*' | '0'..'9'
<EmojiRegionalIndicator> ::= U+1F1E6..U+1F1FF

You can use U+FE0E to display an Emoji in text style. In that case, you cannot use U+200D to connect it with other characters.

<EmojiText> ::= [<EmojiA>|<EmojiB>] U+FE0E

<Emoji> is considered the script Zsye and <EmojiText> is considered Zsym.
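The grammar above can be approximated with regular expressions, as in the following Python sketch. Several things here are assumptions rather than formatter behavior: the square brackets in the grammar are read as grouping, <EmojiB> is replaced by a small illustrative subset of characters, and the function name emoji_script is hypothetical.

```python
import re

EMOJI_A = "[\U0001F000-\U0001FFFD]"
# Illustrative subset only; the real <EmojiB> is every other character
# classified as Emoji.
EMOJI_B = "[\u263A\u2640\u2642\u2764]"
MODIFIER = "[\U0001F3FB-\U0001F3FF]"
TAGS = "[\U000E0020-\U000E007E]*\U000E007F"

BASE = f"(?:{EMOJI_A}|{EMOJI_B})"
# <EmojiUnit>: a base with a required suffix, or a bare <EmojiA>.
UNIT = f"(?:{BASE}(?:\uFE0F|{MODIFIER}|{TAGS})|{EMOJI_A})"

EMOJI = re.compile(
    f"(?:{UNIT}(?:\u200D(?:{UNIT}|{EMOJI_B}))*"   # ZWJ sequences
    f"|[#*0-9]\uFE0F\u20E3"                       # keycap sequences
    f"|[\U0001F1E6-\U0001F1FF]{{2}})"             # regional indicator pairs
)
EMOJI_TEXT = re.compile(f"{BASE}\uFE0E")


def emoji_script(s):
    """Return "Zsye", "Zsym", or None for a candidate string."""
    if EMOJI.fullmatch(s):
        return "Zsye"
    if EMOJI_TEXT.fullmatch(s):
        return "Zsym"
    return None
```

For example, emoji_script("\u2764\uFE0F") gives "Zsye", emoji_script("\u2764\uFE0E") gives "Zsym", and a bare "\u2764" matches neither rule in this sketch.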

CSS Debug Tree V7.4 no-LT

The CSS debug tree is XML that contains information about which elements the CSS declarations apply to and in what order. To use this feature in the GUI, css-debug="true" must be specified in the Option Setting File. [File]-[Save CSS Debug Tree] is then added to the menu, and you can output the CSS debug tree by selecting it. On the Command-line Interface, specify @CSSDebugTree for the printer name; there is no need to specify css-debug="true". You cannot specify a page range to output. The feature has no effect when the formatting is not CSS formatting.

The information is added to the normal FO Tree with the namespace xmlns:_cssd_="http://www.antennahouse.com/names/CSSDebug". For example, <_cssd_:selector> elements are listed in cascading order directly under the element as follows:

<p id="bar" ... _ah_:fo-id="16">
 <_cssd_:selector selector="p" specificity="0.0.1" order="2" origin="8" css-id="1">
  <_cssd_:declaration property="display" value="block" position="22:23"/>
 </_cssd_:selector>
 <_cssd_:selector selector="p" specificity="0.0.1" order="24" origin="8" css-id="1">
  <_cssd_:declaration property="-ah-margin-before" value="1.12em" position="50:30"/>
  <_cssd_:declaration property="-ah-margin-after" value="1.12em" position="50:56"/>
 </_cssd_:selector>
 <_cssd_:selector selector=":where(.foo, #bar, ol &gt; li:first-child)" specificity="0.0.0" order="202" origin="6" css-id="0">
  <_cssd_:declaration property="color" value="red" position="9:7"/>
 </_cssd_:selector>
 <_cssd_:selector selector=":is(#bar)" specificity="1.0.0" order="201" origin="6" css-id="0">
  <_cssd_:declaration property="color" value="blue" position="6:7"/>
 </_cssd_:selector>

Under the root element <html>, it also contains information about the URL of the CSS source.

<html ...>
 <_cssd_:css css-id="1" src="C:\....\html.css"/>
 <_cssd_:css css-id="2" src="C:\....\ahfmain7.css"/>

The details of the XML are as follows:

The debug information is added with (css*, selector*) as the first child of an element. css is only added to the root element (<html>).

<!ELEMENT css EMPTY>
<!ATTLIST css css-id  CDATA #REQUIRED>
<!ATTLIST css src     CDATA #REQUIRED>
<!ELEMENT selector (declaration+)>
<!ATTLIST selector at               CDATA #IMPLIED>
<!ATTLIST selector selector         CDATA #IMPLIED>
<!ATTLIST selector specificity      CDATA #IMPLIED>
<!ATTLIST selector page-specificity CDATA #IMPLIED>
<!ATTLIST selector order            CDATA #IMPLIED>
<!ATTLIST selector origin           CDATA #REQUIRED>
<!ATTLIST selector css-id           CDATA #IMPLIED>
<!ELEMENT declaration EMPTY>
<!ATTLIST declaration property      CDATA #REQUIRED>
<!ATTLIST declaration value         CDATA #REQUIRED>
<!ATTLIST declaration expr          CDATA #IMPLIED>
<!ATTLIST declaration important     (important) #IMPLIED>
<!ATTLIST declaration position      CDATA #IMPLIED>
Element | Attribute | Default | Description
<css> | css-id | | Indicates the CSS file number. This number corresponds to <selector css-id>.
<css> | src | | Indicates the URL of the CSS file.
<selector> | selector | | Selector.
<selector> | specificity | | The specificity of the selector, consisting of three numbers. See specificity for this value.
<selector> | at | | Indicates that it is an @ rule, such as @page.
<selector> | page-specificity | | The corresponding specificity for @page, consisting of seven numbers. See CSS @page for this value.
<selector> | order | | Indicates the order in which this selector appears across all CSS.
<selector> | origin | | Indicates the origin in the CSS Cascading Order.
<selector> | css-id | | Corresponds to <css css-id> and indicates the CSS that contains this selector. css-id="0" indicates <style> in HTML. A <selector> without css-id comes from the element’s style attribute; in that case there is no selector, specificity, etc.
<declaration> (child of <selector>) | property | | Property.
<declaration> | value | | The value after variable expansion.
<declaration> | expr | | The original expression, shown when variables are included.
<declaration> | important | | A declaration with !important has important="important".
<declaration> | position | | Indicates the line and column where this declaration appears in the file. The value has the format 12:34, where the line position is before the colon and the column position is after it.

It’s not available with Antenna House Formatter V7.4 Lite.
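As an illustration of how the debug tree could be inspected, the following Python sketch lists the declarations recorded for one element using the standard xml.etree.ElementTree module. SAMPLE is a cut-down, hypothetical document modeled on the excerpts above (a real tree carries more namespaces and attributes), and the function name declarations_for is also hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"d": "http://www.antennahouse.com/names/CSSDebug"}

# Hypothetical, simplified debug tree for illustration only.
SAMPLE = """<html xmlns:_cssd_="http://www.antennahouse.com/names/CSSDebug">
 <_cssd_:css css-id="1" src="html.css"/>
 <body>
  <p id="bar">
   <_cssd_:selector selector="p" specificity="0.0.1" order="2" origin="8" css-id="1">
    <_cssd_:declaration property="display" value="block" position="22:23"/>
   </_cssd_:selector>
   <_cssd_:selector selector=":is(#bar)" specificity="1.0.0" order="201" origin="6" css-id="0">
    <_cssd_:declaration property="color" value="blue" position="6:7"/>
   </_cssd_:selector>
   text
  </p>
 </body>
</html>"""


def declarations_for(root, element_id):
    """List (selector, property, value, source) for one element."""
    sources = {c.get("css-id"): c.get("src")
               for c in root.findall("d:css", NS)}
    for el in root.iter():
        if el.get("id") != element_id:
            continue
        rows = []
        for sel in el.findall("d:selector", NS):
            css_id = sel.get("css-id")
            if css_id is None:
                source = "style attribute"
            elif css_id == "0":
                source = "<style>"
            else:
                source = sources.get(css_id, "unknown")
            for decl in sel.findall("d:declaration", NS):
                rows.append((sel.get("selector"), decl.get("property"),
                             decl.get("value"), source))
        return rows
    return []


rows = declarations_for(ET.fromstring(SAMPLE), "bar")
```

The rows keep the order in which the <_cssd_:selector> elements appear under the element, so they can be read in the same cascading order as the tree itself.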

Formatting Large Document

When outputting PDF, Antenna House Formatter V7.4 discards pages that have already been formatted. For a simple FO without <fo:page-number-citation>, for example, it therefore consumes only the memory required for one page, no matter how large the document is (except when formatting from the GUI). However, if a page contains an <fo:page-number-citation> that refers to a following page, the page number of the referenced page cannot be known until that page is actually formatted. For that reason, if a page containing an unresolved <fo:page-number-citation> appears, Antenna House Formatter V7.4 suspends its output and stores the result in memory while continuing formatting. When a document has a table of contents at the start, the table of contents is not output until all the page numbers appearing in it are resolved. Because of the high memory consumption, there is a limit to the number of formatted pages, so it is not possible to format extremely large documents.

To solve this problem, Antenna House Formatter V7.4 makes it possible to process the document in two formatting passes. In the first pass, the formatting is processed only for resolving <fo:page-number-citation>, and all the required page number information is collected. In the second pass, formatting starts again from the first page. Since all <fo:page-number-citation> are already resolved, Antenna House Formatter V7.4 can discard formatted pages when outputting the document. Although the formatting processing time is increased, the formatting consumes less memory and it is possible to format extremely large documents. But this has no effect on the memory consumption needed for the output.
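The difference between the two approaches can be illustrated with a toy model in Python. This is not how Antenna House Formatter is implemented: pages are reduced to a hypothetical pair of lists (ids defined on the page, ids cited on the page), and the function names are made up for the sketch.

```python
def single_pass_peak_buffer(pages):
    """Peak number of pages held in memory in a single pass."""
    seen = set()
    buffered = []  # unresolved ids for each page not yet output
    peak = 0
    for defined, cited in pages:
        seen.update(defined)
        buffered.append(set(cited))
        buffered = [unresolved - seen for unresolved in buffered]
        peak = max(peak, len(buffered))
        # Pages are output in order, so only a resolved front page
        # can be discarded.
        while buffered and not buffered[0]:
            buffered.pop(0)
    return peak


def two_pass(pages):
    """Yield (page number, resolved citations), one page at a time."""
    # Pass 1: collect the page number of every id.
    page_of = {}
    for number, (defined, _) in enumerate(pages, start=1):
        for name in defined:
            page_of[name] = number
    # Pass 2: every citation is already resolved, so each page can
    # be output and discarded immediately.
    for number, (_, cited) in enumerate(pages, start=1):
        yield number, {name: page_of[name] for name in cited}


# A table of contents on page 1 that cites two later chapters.
pages = [([], ["ch1", "ch2"]), (["ch1"], []), (["ch2"], [])]
peak = single_pass_peak_buffer(pages)
resolved = list(two_pass(pages))
```

In this model the single pass has to hold all three pages until the last chapter is formatted, whereas the second pass of the two-pass version never needs more than the current page.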

The following shows how to perform 2-pass formatting:

Temporary File

Antenna House Formatter V7.4 does not create a temporary working file if it can be avoided. Antenna House Formatter V7.4 creates a temporary working file in the following cases.

  • With the COM Interface, PDF of a formatted result is saved to a temporary file when outputting PDF to a Web browser directly.

  • An XML document passed by using DOM with the COM Interface is processed using a temporary file. However, when FO is specified as the formatting type, the temporary file is not generated because DOM is processed directly.

  • When outputting a file while printing, a temporary file is generated.

  • When a file interface is required in the XSLT transformation using external XSLT, a temporary file is generated.

  • When the transformation from XML+XSL is required in the render method of a Java Interface, the result FO is generated as a temporary file.

  • In Windows versions, when embedding an image that is not embeddable in PDF, a temporary file is generated in the conversion process.

  • When converting EPS to PDF using Distiller or Ghostscript, a temporary file is generated.

  • When processing EPS using Distiller, if joboptions is not specified, a default joboption will be generated as a temporary file.

  • When outputting to an XPS file, a temporary file is generated.

  • When operating OSDC Conversion, a temporary file is generated. V7.4 no-LT

  • In the GUI of Windows versions, temporary files are generated by the Windows system as needed.