geniconvtbl(4)
NAME
geniconvtbl - geniconvtbl input file format
DESCRIPTION
An input file to geniconvtbl is an ASCII text file that
contains an iconv code conversion definition from one
codeset to another codeset.
The geniconvtbl utility accepts the code conversion
definition file(s) and writes code conversion binary table
file(s) that can be used in iconv(1) and iconv(3C) to
support user-defined code conversions. See iconv(1) and
iconv(3C) for more detail on the iconv code conversion and
geniconvtbl(1) for more detail on the utility.
The Lexical Conventions
The following lexical conventions are used in the iconv code
conversion definition:
CONVERSION_NAME
A string of characters representing the name of the
iconv code conversion. The name should start with one
or more printable ASCII characters, followed by a
percent character, '%', followed by one or more
printable ASCII characters. Examples: ISO8859-1%ASCII,
646%eucJP, CP_939%ASCII.
NAME
A string of characters that starts with an ASCII
alphabetic character or the underscore character, '_',
followed by one or more ASCII alphanumeric or
underscore characters. Examples: _a1, ABC_codeset, K1.
HEXADECIMAL
A hexadecimal number. The hexadecimal representation
consists of an escape character, '0' followed by the
constant 'x' or 'X' and one or more hexadecimal
digits. Examples: 0x0, 0x1, 0x1a, 0X1A, 0x1B3.
DECIMAL
A decimal number, represented by one or more decimal
digits. Examples: 0, 123, 2165.
Each comment starts with '//' and ends at the end of the line.
The following keywords are reserved:
automatic between binary
break condition default
dense direction discard
else error escapeseq
false if index
init input inputsize
map maptype no_change_copy
operation output output_byte_length
outputsize printchr printhd
printint reset return
true
Additionally, the following symbols are also reserved as
tokens:
{ } [ ] ( ) ; , ...
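The token classes above can be approximated with regular expressions. The following Python sketch is only an interpretation of the prose definitions, not the lexer used by the geniconvtbl utility. It assumes that "printable ASCII" means the graphic characters 0x21 to 0x7e, and the function name classify is hypothetical:

```python
import re

# Reserved keywords, copied from the list above.
KEYWORDS = frozenset("""
    automatic between binary break condition default dense direction
    discard else error escapeseq false if index init input inputsize
    map maptype no_change_copy operation output output_byte_length
    outputsize printchr printhd printint reset return true
""".split())

# Assumption: "printable ASCII" is taken as 0x21..0x7e (no space).
CONVERSION_NAME = re.compile(r"[\x21-\x7e]+%[\x21-\x7e]+")
# A letter or underscore followed by one or more letters, digits, underscores.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]+")
HEXADECIMAL = re.compile(r"0[xX][0-9A-Fa-f]+")
DECIMAL = re.compile(r"[0-9]+")


def classify(token):
    """Return the lexical class of token, or None if it fits no class."""
    if token in KEYWORDS:
        return "KEYWORD"
    for label, pattern in (("HEXADECIMAL", HEXADECIMAL),
                           ("DECIMAL", DECIMAL),
                           ("NAME", NAME),
                           ("CONVERSION_NAME", CONVERSION_NAME)):
        if pattern.fullmatch(token):
            return label
    return None
```

Under these assumptions, classify("646%eucJP") yields CONVERSION_NAME and classify("K1") yields NAME, while a string that contains a hyphen but no '%' fits none of the classes.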
The precedence and associativity
The following table lists the operators allowed in the
iconv code conversion definition with their associativity,
ordered from lowest precedence at the top to highest
precedence at the bottom:
Operator (Symbol)                   Associativity
--------------------------------------------------
Assignment (=)                      Right
--------------------------------------------------
Logical OR (||)                     Left
--------------------------------------------------
Logical AND (&&)                    Left
--------------------------------------------------
Bitwise OR (|)                      Left
--------------------------------------------------
Exclusive OR (^)                    Left
--------------------------------------------------
Bitwise AND (&)                     Left
--------------------------------------------------
Equal-to (==),                      Left
Inequality (!=)
--------------------------------------------------
Less-than (<),                      Left
Less-than-or-equal-to (<=),
Greater-than (>),
Greater-than-or-equal-to (>=)
--------------------------------------------------
Left-shift (<<),                    Left
Right-shift (>>)
--------------------------------------------------
Addition (+),                       Left
Subtraction (-)
--------------------------------------------------
Multiplication (*),                 Left
Division (/),
Remainder (%)
--------------------------------------------------
Logical negation (!),               Right
Bitwise complement (~),
Unary minus (-)
--------------------------------------------------
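One consequence of this ordering, which matches the relative order used in C, is that the bitwise operators bind more loosely than the comparison operators. The following Python sketch is a small precedence-climbing evaluator built from the table. It is an illustration only: it omits assignment, variables, and the input keyword, the function name evaluate is hypothetical, and treating '/' as integer division is an assumption:

```python
import re

# Binary operators grouped by table row, lowest precedence first.
LEVELS = [["||"], ["&&"], ["|"], ["^"], ["&"], ["==", "!="],
          ["<", "<=", ">", ">="], ["<<", ">>"], ["+", "-"], ["*", "/", "%"]]
PREC = {op: level for level, ops in enumerate(LEVELS) for op in ops}

OPS = {
    "||": lambda a, b: int(bool(a) or bool(b)),
    "&&": lambda a, b: int(bool(a) and bool(b)),
    "|": lambda a, b: a | b, "^": lambda a, b: a ^ b, "&": lambda a, b: a & b,
    "==": lambda a, b: int(a == b), "!=": lambda a, b: int(a != b),
    "<": lambda a, b: int(a < b), "<=": lambda a, b: int(a <= b),
    ">": lambda a, b: int(a > b), ">=": lambda a, b: int(a >= b),
    "<<": lambda a, b: a << b, ">>": lambda a, b: a >> b,
    "+": lambda a, b: a + b, "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b,  # assumption: integer division
    "%": lambda a, b: a % b,
}

TOKEN = re.compile(r"\s*(0[xX][0-9A-Fa-f]+|[0-9]+|true|false"
                   r"|\|\||&&|==|!=|<=|>=|<<|>>|[-+*/%|^&<>!~()])")


def tokenize(text):
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError("unexpected character at offset %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text):
    tokens, pos = tokenize(text), 0

    def primary():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = binary(0)
            pos += 1  # skip the closing ")"
            return value
        if tok in ("!", "~", "-"):
            operand = primary()  # unary operators bind tightest
            return {"!": int(not operand), "~": ~operand, "-": -operand}[tok]
        if tok in ("true", "false"):
            return int(tok == "true")
        return int(tok, 16) if tok[:2].lower() == "0x" else int(tok, 10)

    def binary(min_level):
        nonlocal pos
        left = primary()
        while (pos < len(tokens) and tokens[pos] in PREC
               and PREC[tokens[pos]] >= min_level):
            op = tokens[pos]
            pos += 1
            # Parsing the right side one level higher gives left associativity.
            left = OPS[op](left, binary(PREC[op] + 1))
        return left

    return binary(0)
```

With this model, 0x21 & 0x80 == 0 groups as 0x21 & (0x80 == 0) and evaluates to 0, whereas (0x21 & 0x80) == 0 evaluates to 1. Parentheses are therefore advisable whenever a masked value is compared.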
The Syntax
Each iconv code conversion definition starts with
CONVERSION_NAME followed by one or more semi-colon separated
code conversion definition elements:
// a US-ASCII to ISO8859-1 iconv code conversion example:
US-ASCII%ISO8859-1 {
// one or more code conversion definition elements here.
:
:
}
Each code conversion definition element can be any one of
the following elements:
direction
condition
operation
map
To have a meaningful code conversion, there should be at
least one direction, operation, or map element in the iconv
code conversion definition.
The direction element contains one or more semi-colon
separated condition-action pairs that direct the code
conversion:
direction For_US-ASCII_2_ISO8859-1 {
// one or more condition-action pairs here.
:
:
}
Each condition-action pair contains a conditional code
conversion that consists of a condition element and an
action element.
condition action
If the condition is met, the corresponding action is
executed. If no condition is met, iconv(3C) returns -1 with
errno set to EILSEQ. The condition can be a condition
element, the name of a pre-defined condition element, or the
condition literal value, true. The 'true' condition literal
always succeeds, so the corresponding action is always
executed. The action can likewise be an action element or
the name of a pre-defined action element.
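The selection rule for condition-action pairs can be modeled in a few lines of Python. This is a sketch under the assumption that the pairs are tried from top to bottom and that the first pair whose condition is met decides the action. The name run_direction and the use of callables for conditions and actions are hypothetical:

```python
import errno


def run_direction(pairs, data):
    """Run the action of the first pair whose condition is met.

    pairs is a list of (condition, action) callables. The Python value
    True stands in for the 'true' condition literal.
    """
    for condition, action in pairs:
        if condition is True or condition(data):
            return action(data)
    # No condition was met: iconv(3C) would return -1 with errno EILSEQ.
    raise OSError(errno.EILSEQ, "no condition met")
```

A final pair with the 'true' condition acts as a catch-all, because every input that reaches it is accepted.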
The condition element specifies one or more condition
expression elements. A condition element can be given a name
or can stand alone. A named condition element can be
referenced by its name in any later condition-action pair,
provided that the condition element is defined beforehand:
condition For_US-ASCII_2_ISO8859-1 {
// one or more condition expression elements here.
:
:
}
The name of the condition element in the above example is
For_US-ASCII_2_ISO8859-1. Each condition element can have
one or more condition expression elements. If there is more
than one condition expression element, the elements are
checked from top to bottom to see whether any one of them
yields true. Any one of the following can be a condition
expression element:
between
escapeseq
expression
The between condition expression element defines one or more
comma-separated ranges:
between 0x0...0x1f, 0x7f...0x9f ;
between 0xa1a1...0xfefe ;
In the first expression in the example above, the covered
ranges are 0x0 to 0x1f and 0x7f to 0x9f, inclusive. In the
second expression, the covered range is the set of sequences
whose first byte is from 0xa1 to 0xfe and whose second byte
is from 0xa1 to 0xfe. That is, the range is defined byte by
byte. The sequence 0xa280, for example, is not within the
range because its second byte, 0x80, is below 0xa1.
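The difference between a per-byte range and a plain numeric range can be checked with a short Python sketch. The helper names are hypothetical, and deriving the number of bytes from the magnitude of each bound is an assumption made for this illustration:

```python
def split_bytes(value):
    """Split an integer such as 0xa1a1 into bytes, most significant first."""
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big")


def in_between(sequence, low, high):
    """Return True if each leading byte of sequence lies within the
    corresponding byte range of low...high."""
    lo, hi = split_bytes(low), split_bytes(high)
    if len(lo) != len(hi):
        raise ValueError("range bounds must have the same byte length")
    if len(sequence) < len(lo):
        return False
    return all(l <= b <= h for b, l, h in zip(sequence, lo, hi))
```

Here in_between(b"\xa2\x80", 0xa1a1, 0xfefe) is false although 0xa280 lies numerically between 0xa1a1 and 0xfefe, because the second byte is tested on its own.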
The escapeseq condition expression element defines an
equal-to condition for one or more comma-separated escape
sequence designators:
// ESC $ ) C sequence:
escapeseq 0x1b242943;
// ESC $ ) C sequence or ShiftOut (SO) control character code, 0x0e:
escapeseq 0x1b242943, 0x0e;
The expression can be any one of the following and can be
surrounded by a pair of parentheses, '(' and ')':
// HEXADECIMAL:
0xa1a1
// DECIMAL
12
// A boolean value, true:
true
// A boolean value, false:
false
// Addition expression:
1 + 2
// Subtraction expression:
10 - 3
// Multiplication expression:
0x20 * 10
// Division expression:
20 / 10
// Remainder expression:
17 % 3
// Left-shift expression:
1 << 4
// Right-shift expression:
0xa1 >> 2
// Bitwise OR expression:
0x2121 | 0x8080
// Exclusive OR expression:
0xa1a1 ^ 0x8080
// Bitwise AND expression:
0xa1 & 0x80
// Equal-to expression:
0x10 == 16
// Inequality expression:
0x10 != 10
// Less-than expression:
0x20 < 25
// Less-than-or-equal-to expression:
10 <= 0x10
// Greater-than expression:
0x10 > 12
// Greater-than-or-equal-to expression:
0x10 >= 0xa
// Logical OR expression:
0x10 || false
// Logical AND expression:
0x10 && false
// Logical negation expression:
! false
// Bitwise complement expression:
~0
// Unary minus expression:
-123
There is a single type available in this expression:
integer. The boolean values are two special cases of integer
values: the integer value of 'true' is 1 and the integer
value of 'false' is 0. Any integer value other than 0 is
treated as true, and the integer value 0 is treated as
false. Any boolean expression yields the integer value 1 for
true and 0 for false as the result.
Any literal value shown as an operand in the above
expression examples, that is, a DECIMAL, HEXADECIMAL, or
boolean value, can be replaced with another expression.
There are a few other special operands that you can use in
the expressions: 'input', 'inputsize', 'outputsize', and
variables. input is a keyword referring to the current input
buffer. inputsize is a keyword referring to the current
input buffer size in bytes. outputsize is a keyword
referring to the current output buffer size in bytes. The
NAME lexical convention is used to name a variable. The
initial value of a variable is 0. The following expressions
are allowed with the special operands:
// Pointer to the third byte value of the current input buffer:
input[2]
// Equal-to expression with the 'input':
input == 0x8020
// Alternative way to write the above expression:
0x8020 == input
// The size of the current input buffer:
inputsize
// The size of the current output buffer:
outputsize
// A variable:
saved_second_byte
// Assignment expression with the variable:
saved_second_byte = input[1]
The input keyword without an index value can be used only
with the equal-to operator, '=='. When used in that way, the
current input buffer is compared with the other operand byte
by byte. The other operand can be any expression. If the
input keyword is used with an index value n, it refers to
the (n+1)th byte from the beginning of the current input
buffer. The index can be any expression. Only a variable can
be placed on the left hand side of an assignment expression.
The action element specifies an action for a condition and
can be any one of the following elements:
direction
operation
map
The operation element specifies one or more operation
expression elements:
operation For_US-ASCII_2_ISO8859-1 {
// one or more operation expression element definitions here.
:
:
}
If the name of the operation element (in the above example,
For_US-ASCII_2_ISO8859-1) is either init or reset, the
element defines the initial operation or the reset
operation, respectively, of the iconv code conversion:
// The initial operation element:
operation init {
// one or more operation expression element definitions here.
:
:
}
// The reset operation element:
operation reset {
// one or more operation expression element definitions here.
:
:
}
The initial operation element defines the operations that
need to be performed at the beginning of the iconv code
conversion. The reset operation element defines the
operations that need to be performed when a user of the
iconv(3C) function requests a state reset of the iconv code
conversion. For more detail on the state reset, refer to
iconv(3C).
The operation expression can be any one of the following
three different expressions and each operation expression
should be separated by an ending semicolon:
if-else operation expression
output operation expression
control operation expression
The if-else operation expression makes a selection depending
on the boolean expression result. If the boolean expression
result is true, the true task that follows the 'if' is
executed. If the boolean expression yields false and a false
task is supplied, the false task that follows the 'else' is
executed. There are three different kinds of if-else
operation expressions:
// The if-else operation expression with only true task:
if (expression) {
// one or more operation expression element definitions here.
:
:
}
// The if-else operation expression with both true and false
// tasks:
if (expression) {
// one or more operation expression element definitions here.
:
:
} else {
// one or more operation expression element definitions here.
:
:
}
// The if-else operation expression with true task and
// another if-else operation expression as the false task:
if (expression) {
// one or more operation expression element definitions here.
:
:
} else if (expression) {
// one or more operation expression element definitions here.
:
:
} else {
// one or more operation expression element definitions here.
:
:
}
The last if-else operation expression can have another
if-else operation expression as the false task. The other
if-else operation expression can be any one of the above
three if-else operation expressions.
The output operation expression saves the right hand side
expression result to the output buffer:
// Save 0x8080 at the output buffer:
output = 0x8080;
If the size of the output buffer left is smaller than the
necessary output buffer size resulting from the right hand
side expression, the iconv code conversion will stop with
E2BIG errno and (size_t)-1 return value to indicate that the
code conversion needs more output buffer to complete. Any
expression can be used for the right hand side expression.
The output buffer pointer will automatically move forward
appropriately once the operation is executed.
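The bookkeeping described above can be modeled as follows. This Python sketch is not the utility's implementation. The class name OutputBuffer is hypothetical, only non-negative values are handled, and the rule that a value occupies as many bytes as its magnitude requires (at least one) is an assumption, chosen because it agrees with Example 2, where 0x1b284a accounts for three of the four emitted bytes:

```python
import errno


class OutputBuffer:
    """A toy model of the output buffer used by 'output = expression;'."""

    def __init__(self, size):
        self.data = bytearray()
        self.left = size  # plays the role of 'outputsize'

    def output(self, value):
        # Assumption: the value needs as many bytes as its magnitude requires.
        needed = max(1, (value.bit_length() + 7) // 8)
        if self.left < needed:
            # The conversion stops; iconv(3C) reports E2BIG.
            raise OSError(errno.E2BIG, "output buffer too small")
        self.data += value.to_bytes(needed, "big")
        self.left -= needed  # the output pointer moves forward
```

For a four-byte buffer, output(0x1b284a) followed by output(0x41) fills the buffer exactly, and any further output raises the E2BIG condition.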
The control operation expression can be any one of the fol-
lowing expressions:
// Return (size_t)-1 as the return value with an EINVAL errno:
error;
// Return (size_t)-1 as the return value with an EBADF errno:
error 9;
// Discard input buffer byte operation. This discards a byte from
// the current input buffer and moves the input buffer pointer to
// the second byte of the input buffer:
discard;
// Discard input buffer byte operation. This discards
// 10 bytes from the current input buffer and moves the input
// buffer pointer to the 11th byte of the input buffer:
discard 10;
// Return operation. This stops the execution of the current
// operation:
return;
// Operation execution operation. This executes the init
// operation defined and sets all variables to zero:
operation init;
// Operation execution operation. This executes the reset
// operation defined and sets all variables to zero:
operation reset;
// Operation execution operation. This executes an operation
// defined and named 'ISO8859_1_to_ISO8859_2':
operation ISO8859_1_to_ISO8859_2;
// Direction operation. This executes a direction defined and
// named 'ISO8859_1_to_KOI8_R':
direction ISO8859_1_to_KOI8_R;
// Map execution operation. This executes a mapping defined
// and named 'Map_ISO8859_1_to_US_ASCII':
map Map_ISO8859_1_to_US_ASCII;
// Map execution operation. This executes a mapping defined
// and named 'Map_ISO8859_1_to_US_ASCII' after discarding
// 10 input buffer bytes:
map Map_ISO8859_1_to_US_ASCII 10;
If no init or reset operation is defined in the iconv code
conversion, only the system-defined internal init and reset
operations are executed. The execution of the system-defined
internal init and reset operations clears the
system-maintained internal state.
There are three special operators that can be used in the
operation:
printchr expression;
printhd expression;
printint expression;
The above three operators print the given expression as a
character, a hexadecimal number, and a decimal number,
respectively, to the standard error stream. These three
operators are for debugging purposes only and should be
removed from the final version of the iconv code conversion
definition file.
In addition to the above operations, any valid expression
terminated by a semicolon can be an operation, including an
empty operation, denoted by a semicolon alone.
The map element specifies a direct code conversion mapping
by using one or more map pairs. In practice, a map usually
contains many map pairs to represent an iconv code
conversion definition:
map For_US-ASCII_2_ISO8859-1 {
// one or more map pairs here
:
:
}
Each map element also can have one or two comma-separated
map attribute elements like the following examples:
// Map with densely encoded mapping table map type:
map maptype = dense {
// one or more map pairs here
:
:
}
// Map with hash mapping table map type with hash factor 10.
// Only hash mapping table map type can have hash factor. If
// the hash factor is specified with other map types, it will be
// ignored.
map maptype = hash : 10 {
// one or more map pairs here.
:
:
}
// Map with binary search tree based mapping table map type:
map maptype = binary {
// one or more map pairs here.
:
:
}
// Map with index table based mapping table map type:
map maptype = index {
// one or more map pairs here.
:
:
}
// Map with automatic mapping table map type. If defined,
// system will assign the best possible map type.
map maptype = automatic {
// one or more map pairs here.
:
:
}
// Map with output_byte_length limit set to 2.
map output_byte_length = 2 {
// one or more map pairs here.
:
:
}
// Map with densely encoded mapping table map type and
// output_byte_length limit set to 2:
map maptype = dense, output_byte_length = 2 {
// one or more map pairs here.
:
:
}
If no maptype is defined, automatic is assumed. If no
output_byte_length is defined, the system determines the
maximum possible output byte length for the mapping by
scanning all the possible output values in the mappings. If
the actual output byte length scanned is bigger than the
defined output_byte_length, the geniconvtbl utility issues
an error and stops generating the code conversion binary
table(s).
The following are allowed map pairs:
// Single mapping. This maps an input character denoted by
// the code value 0x20 to an output character value 0x21:
0x20 0x21
// Multiple mapping. This maps 128 input characters to 128
// output characters. In this mapping, 0x0 maps to 0x10, 0x1 maps
// to 0x11, 0x2 maps to 0x12, ..., and, 0x7f maps to 0x8f:
0x0...0x7f 0x10
// Default mapping. If specified, every undefined input character
// in this mapping will be converted to a specified character
// (in the following case, a character with code value of 0x3f):
default 0x3f;
// Default mapping. If specified, every undefined input character
// in this mapping will not be converted but directly copied to
// the output buffer:
default no_change_copy;
// Error mapping. If specified, during the code conversion,
// if the input buffer contains the byte value, in this case, 0x80,
// iconv(3C) will stop and return (size_t)-1 as the return
// value with errno set to EILSEQ:
0x80 error;
If no default mapping is specified, every undefined input
character in the mapping is treated as an error mapping:
iconv(3C) stops the code conversion and returns (size_t)-1
with errno set to EILSEQ.
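The effect of the map pairs can be modeled with a small lookup table. The following Python sketch is an illustration under stated assumptions, not the table format that geniconvtbl generates. The name build_map and the tuple representation of map pairs are hypothetical:

```python
import errno


def build_map(pairs, default=None):
    """Build a converter from (first, last, out) map pairs.

    out is an integer or the string 'error'. default is None (undefined
    input is an error), an integer, or the string 'no_change_copy'.
    """
    table = {}
    for first, last, out in pairs:
        # A range maps its n-th input code to the n-th code after 'out'.
        for offset, code in enumerate(range(first, last + 1)):
            table[code] = out if out == "error" else out + offset

    def convert(code):
        out = table.get(code, default)
        if out is None or out == "error":
            raise OSError(errno.EILSEQ, "no mapping for 0x%x" % code)
        return code if out == "no_change_copy" else out

    return convert
```

For example, build_map([(0x0, 0x7f, 0x10), (0x80, 0x80, "error")], default=0x3f) converts 0x7f to 0x8f and any undefined code such as 0x90 to 0x3f, and rejects 0x80.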
The syntax of the iconv code conversion definition in
extended BNF is illustrated below:
iconv_conversion_definition
: CONVERSION_NAME '{' definition_element_list '}'
;
definition_element_list
: definition_element ';'
| definition_element_list definition_element ';'
;
definition_element
: direction
| condition
| operation
| map
;
direction
: 'direction' NAME '{' direction_unit_list '}'
| 'direction' '{' direction_unit_list '}'
;
direction_unit_list
: direction_unit
| direction_unit_list direction_unit
;
direction_unit
: condition action ';'
| condition NAME ';'
| NAME action ';'
| NAME NAME ';'
| 'true' action ';'
| 'true' NAME ';'
;
action
: direction
| map
| operation
;
condition
: 'condition' NAME '{' condition_list '}'
| 'condition' '{' condition_list '}'
;
condition_list
: condition_expr ';'
| condition_list condition_expr ';'
;
condition_expr
: 'between' range_list
| expr
| 'escapeseq' escseq_list ';'
;
range_list
: range_pair
| range_list ',' range_pair
;
range_pair
: HEXADECIMAL '...' HEXADECIMAL
;
escseq_list
: escseq
| escseq_list ',' escseq
;
escseq : HEXADECIMAL
;
map : 'map' NAME '{' map_list '}'
| 'map' '{' map_list '}'
| 'map' NAME map_attribute '{' map_list '}'
| 'map' map_attribute '{' map_list '}'
;
map_attribute
: map_type ',' 'output_byte_length' '=' DECIMAL
| map_type
| 'output_byte_length' '=' DECIMAL ',' map_type
| 'output_byte_length' '=' DECIMAL
;
map_type: 'maptype' '=' map_type_name : DECIMAL
| 'maptype' '=' map_type_name
;
map_type_name
: 'automatic'
| 'index'
| 'hash'
| 'binary'
| 'dense'
;
map_list
: map_pair
| map_list map_pair
;
map_pair
: HEXADECIMAL HEXADECIMAL
| HEXADECIMAL '...' HEXADECIMAL HEXADECIMAL
| 'default' HEXADECIMAL
| 'default' 'no_change_copy'
| HEXADECIMAL 'error'
;
operation
: 'operation' NAME '{' op_list '}'
| 'operation' '{' op_list '}'
| 'operation' 'init' '{' op_list '}'
| 'operation' 'reset' '{' op_list '}'
;
op_list : op_unit
| op_list op_unit
;
op_unit : ';'
| expr ';'
| 'error' ';'
| 'error' expr ';'
| 'discard' ';'
| 'discard' expr ';'
| 'output' '=' expr ';'
| 'direction' NAME ';'
| 'operation' NAME ';'
| 'operation' 'init' ';'
| 'operation' 'reset' ';'
| 'map' NAME ';'
| 'map' NAME expr ';'
| op_if_else
| 'return' ';'
| 'printchr' expr ';'
| 'printhd' expr ';'
| 'printint' expr ';'
;
op_if_else
: 'if' '(' expr ')' '{' op_list '}'
| 'if' '(' expr ')' '{' op_list '}' 'else' op_if_else
| 'if' '(' expr ')' '{' op_list '}' 'else' '{' op_list '}'
;
expr : '(' expr ')'
| NAME
| HEXADECIMAL
| DECIMAL
| 'input' '[' expr ']'
| 'outputsize'
| 'inputsize'
| 'true'
| 'false'
| 'input' '==' expr
| expr '==' 'input'
| '!' expr
| '~' expr
| '-' expr
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| expr '%' expr
| expr '<<' expr
| expr '>>' expr
| expr '|' expr
| expr '^' expr
| expr '&' expr
| expr '==' expr
| expr '!=' expr
| expr '>' expr
| expr '>=' expr
| expr '<' expr
| expr '<=' expr
| NAME '=' expr
| expr '||' expr
| expr '&&' expr
;
EXAMPLES
Example 1: Code conversion from ISO8859-1 to ISO646
ISO8859-1%ISO646 {
    // Use dense-encoded internal data structure.
    map maptype = dense {
        default 0x3f
        0x0...0x7f 0x0
    };
}
Example 2: Code conversion from eucJP to ISO-2022-JP
// Iconv code conversion from eucJP to ISO-2022-JP
#include <sys/errno.h>
eucJP%ISO-2022-JP {
    operation init {
        codesetnum = 0;
    };
    operation reset {
        if (codesetnum != 0) {
            // Emit state reset sequence, ESC ( J, for
            // ISO-2022-JP.
            output = 0x1b284a;
        }
        operation init;
    };
    direction {
        condition { // JIS X 0201 Latin (ASCII)
            between 0x00...0x7f;
        } operation {
            if (codesetnum != 0) {
                // We will emit four bytes.
                if (outputsize <= 3) {
                    error E2BIG;
                }
                // Emit state reset sequence, ESC ( J.
                output = 0x1b284a;
                codesetnum = 0;
            } else {
                if (outputsize <= 0) {
                    error E2BIG;
                }
            }
            output = input[0];
            // Move input buffer pointer one byte.
            discard;
        };
        condition { // JIS X 0208
            between 0xa1a1...0xfefe;
        } operation {
            if (codesetnum != 1) {
                if (outputsize <= 4) {
                    error E2BIG;
                }
                // Emit JIS X 0208 sequence, ESC $ B.
                output = 0x1b2442;
                codesetnum = 1;
            } else {
                if (outputsize <= 1) {
                    error E2BIG;
                }
            }
            output = (input[0] & 0x7f);
            output = (input[1] & 0x7f);
            // Move input buffer pointer two bytes.
            discard 2;
        };
        condition { // JIS X 0201 Kana
            between 0x8ea1...0x8edf;
        } operation {
            if (codesetnum != 2) {
                if (outputsize <= 3) {
                    error E2BIG;
                }
                // Emit JIS X 0201 Kana sequence,
                // ESC ( I.
                output = 0x1b2849;
                codesetnum = 2;
            } else {
                if (outputsize <= 0) {
                    error E2BIG;
                }
            }
            output = (input[1] & 127);
            // Move input buffer pointer two bytes.
            discard 2;
        };
        condition { // JIS X 0212
            between 0x8fa1a1...0x8ffefe;
        } operation {
            if (codesetnum != 3) {
                if (outputsize <= 5) {
                    error E2BIG;
                }
                // Emit JIS X 0212 sequence, ESC $ ( D.
                output = 0x1b242844;
                codesetnum = 3;
            } else {
                if (outputsize <= 1) {
                    error E2BIG;
                }
            }
            output = (input[1] & 127);
            output = (input[2] & 127);
            discard 3;
        };
        true operation { // error
            error EILSEQ;
        };
    };
}
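To see what byte stream Example 2 is meant to produce, its state machine can be traced in Python. This is a hand-written model of the definition above, not output from geniconvtbl. The function name convert and the reset parameter are hypothetical, the output buffer is treated as unbounded (so the E2BIG checks are not modeled), and truncated input is simply reported as an illegal sequence:

```python
import errno

ESC_LATIN, ESC_X0208 = b"\x1b(J", b"\x1b$B"
ESC_KANA, ESC_X0212 = b"\x1b(I", b"\x1b$(D"


def high(byte):
    return 0xa1 <= byte <= 0xfe


def convert(data, reset=True):
    out = bytearray()
    codesetnum = 0  # same role as the variable in Example 2
    pos = 0
    while pos < len(data):
        c = data[pos:pos + 3]
        # Conditions are tried in the order they appear in the direction.
        if c[0] <= 0x7f:                                    # JIS X 0201 Latin
            number, esc, body, used = 0, ESC_LATIN, c[:1], 1
        elif len(c) >= 2 and high(c[0]) and high(c[1]):     # JIS X 0208
            body = bytes(b & 0x7f for b in c[:2])
            number, esc, used = 1, ESC_X0208, 2
        elif len(c) >= 2 and c[0] == 0x8e and 0xa1 <= c[1] <= 0xdf:  # Kana
            number, esc, body, used = 2, ESC_KANA, bytes([c[1] & 0x7f]), 2
        elif len(c) == 3 and c[0] == 0x8f and high(c[1]) and high(c[2]):
            body = bytes(b & 0x7f for b in c[1:])           # JIS X 0212
            number, esc, used = 3, ESC_X0212, 3
        else:
            raise OSError(errno.EILSEQ, "illegal sequence at offset %d" % pos)
        if codesetnum != number:
            out += esc  # announce the codeset switch
            codesetnum = number
        out += body
        pos += used
    if reset and codesetnum != 0:
        out += ESC_LATIN  # what 'operation reset' emits
    return bytes(out)
```

In this model the eucJP input 0x41 0xa4 0xa2 0x42 becomes 0x41, ESC $ B, 0x24 0x22, ESC ( J, 0x42.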
FILES
/usr/bin/geniconvtbl
the utility geniconvtbl
/usr/lib/iconv/geniconvtbl/binarytables/*.bt
conversion binary tables
/usr/lib/iconv/geniconvtbl/srcs/*
conversion source files for user reference
SEE ALSO
cpp(1), geniconvtbl(1), iconv(1), iconv(3C),
iconv_close(3C), iconv_open(3C), attributes(5), environ(5)
International Language Environments Guide
NOTES
The maximum number of digits in a HEXADECIMAL or DECIMAL
value is 128. The maximum length of a variable name is 255.
The maximum nesting level is 16.