GEDCOM/CHAR-Tag

From GenWiki

This page is an English extract of the German page GEDCOM/CHAR-Tag [1]. For full details, see the German page.

Name and Meaning

Tag

CHAR

Meaning

CHARACTER

Usage

The CHAR tag defines the character encoding used in the GEDCOM file.

Formal Description of Permissible Values

Base: GEDCOM Standard Draft 5.5.1

The standard allows the use of the following character sets:

- 8-Bit ANSEL

- ASCII (USA Version)

- UNICODE

- UTF-8

Standard Case

Each GEDCOM file should state the character encoding it uses in its header. The format is as follows:

1 CHAR <CHARACTER_SET>

Example:

1 CHAR UTF-8

Warning: Only the following encodings are allowed:

<CHARACTER_SET> := ANSEL | UTF-8 | UNICODE | ASCII
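As an illustration of the grammar above, the following minimal Python sketch reads the CHAR value from a list of header lines and checks it against the four permitted values. The names `read_char_value` and `is_standard_character_set` are hypothetical and are not part of the GEDCOM standard or of any particular library.

```python
import re

# The four values permitted by the <CHARACTER_SET> grammar above.
ALLOWED_CHARACTER_SETS = {"ANSEL", "UTF-8", "UNICODE", "ASCII"}

# Matches a level-1 CHAR line such as "1 CHAR UTF-8".
CHAR_LINE = re.compile(r"^\s*1\s+CHAR\s+(\S+)\s*$")


def read_char_value(header_lines):
    """Return the value of the first '1 CHAR' line, or None if there is none."""
    for line in header_lines:
        match = CHAR_LINE.match(line)
        if match:
            return match.group(1)
    return None


def is_standard_character_set(value):
    """Return True if the value is one of the four sets in the grammar."""
    return value in ALLOWED_CHARACTER_SETS
```

For a header containing `1 CHAR UTF-8`, `read_char_value` returns the string `UTF-8`, which `is_standard_character_set` accepts. A value such as `ANSI` is rejected by this strict check.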

Optional Data

The version of the declared character set can be specified with a tag directly subordinate to CHAR:

2 VERS <VERSION_NUMBER>
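A reader that wants both values could look for a level-2 VERS line directly below CHAR. The sketch below is one possible way to do this in Python. The function name `read_char_and_version` is hypothetical, and the sketch assumes that header lines are already decoded strings.

```python
def read_char_and_version(header_lines):
    """Return (character_set, version); version is None if no VERS follows."""
    for index, line in enumerate(header_lines):
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] == "1" and parts[1] == "CHAR":
            version = None
            # Scan the lines below CHAR until the next level-0 or level-1 line.
            for sub in header_lines[index + 1:]:
                sub_parts = sub.split(None, 2)
                if not sub_parts:
                    continue  # skip empty lines
                if sub_parts[0] in ("0", "1"):
                    break  # left the CHAR substructure
                if len(sub_parts) == 3 and sub_parts[:2] == ["2", "VERS"]:
                    version = sub_parts[2].strip()
                    break
            return parts[2].strip(), version
    return None, None
```

When no VERS line is present, the version element of the result is `None`, which reflects that the tag is optional.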

Agreements for CHAR

The agreements for CHAR derive from discussions on the Gedcom-L list. They were adopted by a vote of the program authors on that list.

CHAR at Export

E1 Specifying the Character Set

The character set used in the file must be explicitly specified in the header of the file in the form:

1 CHAR <CHARACTER_SET>

E2 Standard Character Set

Export in UTF-8 must be supported, and UTF-8 should be the default export encoding.

E3 Optional Character Sets

The character sets ANSEL, ASCII and UNICODE (which includes UCS-2 BE, UCS-2 LE, UTF-16 BE and UTF-16 LE) may optionally be supported. Use of the UNICODE character set is not recommended.

E4 Toleration of ANSI

Support for the ANSI character set is tolerated; that is, it must be supported as an option, although ANSI is not permitted by the GEDCOM 5.5.1 standard.

E6 Position of CHAR within the Header

For files without a Byte Order Mark (BOM), CHAR should be placed far enough toward the top of the header that no characters outside US-ASCII occur before the CHAR tag.
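An exporter could verify this rule on the raw bytes it is about to write. The following Python sketch is one possible check, under the assumption that the file has no BOM and uses an ASCII-compatible encoding (so not UTF-16). The function name `ascii_only_before_char` is hypothetical.

```python
def ascii_only_before_char(raw_bytes):
    """Return True if every byte before the '1 CHAR' line is 7-bit ASCII."""
    for line in raw_bytes.splitlines():
        parts = line.split()
        if parts[:2] == [b"1", b"CHAR"]:
            return True  # reached CHAR without seeing a non-ASCII byte
        if any(byte > 0x7F for byte in line):
            return False  # non-ASCII content appears before CHAR
    # No CHAR line was found, so the rule cannot be satisfied.
    return False
```

A header that carries a name with an umlaut in a NOTE line above CHAR fails this check, while the same line placed below CHAR passes.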

E7 Exclusion of Further Character Sets in Export

Character sets other than those defined above in E2, E3 and E4 must not be used for export.
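The export rules E2, E3, E4 and E7 amount to a small lookup. The Python sketch below shows one way to express it. The function name `export_status` and the returned labels are hypothetical and chosen only for illustration.

```python
REQUIRED = {"UTF-8"}                       # E2: must be supported, default
OPTIONAL = {"ANSEL", "ASCII", "UNICODE"}   # E3: may be supported
TOLERATED = {"ANSI"}                       # E4: tolerated, not in GEDCOM 5.5.1


def export_status(character_set):
    """Classify a CHAR value for export according to E2, E3, E4 and E7."""
    if character_set in REQUIRED:
        return "required"
    if character_set in OPTIONAL:
        return "optional"
    if character_set in TOLERATED:
        return "tolerated"
    return "not allowed"  # E7: everything else is excluded
```

An export dialog could use such a function to decide which encodings to offer and which to reject outright.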

CHAR at Import

I1 Support for Character Sets

UTF-8 and ANSEL must be supported for import.

I2 Support of further Character Sets

Other character sets may be supported on import without limitation. Supporting the character sets listed in E3 is recommended.

I3 Toleration and Support of ANSI

The ANSI character set may be supported. For a transitional period, the import of ANSI-encoded files must be supported.

I4 Identification of Encoding

The encoding must be automatically recognized by the program when it is correctly specified according to E1 or B1.
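One way to implement such recognition is to look for a BOM first and fall back to the CHAR line. The Python sketch below does this. Giving the BOM priority over CHAR is an assumption of this sketch and is not stated in the agreements, and the function name `identify_encoding` and the returned labels are hypothetical.

```python
import codecs


def identify_encoding(raw_bytes):
    """Return a label for the file's encoding, or None if it is not declared."""
    # Assumption of this sketch: a BOM, if present, takes priority over CHAR.
    if raw_bytes.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if raw_bytes.startswith(codecs.BOM_UTF16_LE):
        return "UNICODE (UTF-16 LE)"
    if raw_bytes.startswith(codecs.BOM_UTF16_BE):
        return "UNICODE (UTF-16 BE)"
    # Without a BOM, the lines up to CHAR should be plain ASCII (see E6),
    # so a lossy ASCII decode is enough to find the declaration.
    head = raw_bytes.decode("ascii", errors="replace")
    for line in head.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "1" and parts[1] == "CHAR":
            return parts[2]
    return None
```

The result is only a label. Mapping it to an actual decoder, for example for ANSEL, is left to the importing program.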

I5 Warning of incomplete Import

If an imported file cannot be fully processed because of its encoding, the user should be warned.
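A minimal Python sketch of such a warning follows. It decodes with replacement characters and warns when any byte could not be converted or when no decoder is available. The function name `decode_with_warning` and the codec table are hypothetical. In particular, mapping ANSI to `cp1252` is an assumption (ANSI files may use other Windows code pages), and the Python standard library has no ANSEL codec, so ANSEL is treated as unsupported here.

```python
import warnings

# Assumed mapping from CHAR values to Python codec names.
# "cp1252" for ANSI is an assumption; ANSEL has no standard-library codec.
_PYTHON_CODECS = {
    "UTF-8": "utf-8-sig",   # accepts input with or without a BOM
    "ASCII": "ascii",
    "UNICODE": "utf-16",    # uses the BOM to pick the byte order
    "ANSI": "cp1252",
}


def decode_with_warning(raw_bytes, char_value):
    """Decode the file content and warn if it cannot be fully processed."""
    codec = _PYTHON_CODECS.get(char_value)
    if codec is None:
        warnings.warn(f"No decoder available for character set {char_value!r}")
        codec = "ascii"  # fall back so the ASCII part stays readable
    text = raw_bytes.decode(codec, errors="replace")
    if "\ufffd" in text:
        warnings.warn("Some characters could not be decoded and were replaced")
    return text
```

In a real program, the two `warnings.warn` calls would be replaced by whatever mechanism the program uses to notify the user, such as a dialog or an import log.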

Qualification by Byte Order Mark (BOM)

B1 BOM for Export

The Byte Order Mark (BOM) must be written for the character sets UTF-8 and UNICODE (UCS-2 and UTF-16).
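In Python, the BOM constants from the standard `codecs` module can be prepended explicitly on export. The sketch below shows this for UTF-8 and UTF-16. The function names are hypothetical, and the choice of little-endian as the default byte order for UTF-16 is an assumption of this sketch, not a requirement of the agreements.

```python
import codecs


def encode_utf8_with_bom(text):
    """Encode text as UTF-8 and prepend the UTF-8 Byte Order Mark."""
    return codecs.BOM_UTF8 + text.encode("utf-8")


def encode_utf16_with_bom(text, little_endian=True):
    """Encode text as UTF-16 with an explicit byte order and matching BOM."""
    if little_endian:
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
```

Writing the BOM explicitly keeps the byte order under the exporter's control, whereas the plain `utf-16` codec picks the platform's native order.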

B2 BOM for Import

On import, detection routines must automatically determine whether the file to be read contains a BOM and, if so, which one.
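Such a detection routine can be sketched in a few lines of Python by comparing the first bytes of the file with the BOM constants from the standard `codecs` module. The function name `detect_bom` and the returned labels are hypothetical. UCS-2 and UTF-16 share the same BOMs, so the sketch cannot tell them apart.

```python
import codecs

# Known BOMs and a label for each. UCS-2 uses the same marks as UTF-16.
_BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def detect_bom(raw_bytes):
    """Return (label, bom_length), or (None, 0) if the data has no known BOM."""
    for bom, label in _BOMS:
        if raw_bytes.startswith(bom):
            return label, len(bom)
    return None, 0
```

The returned length tells the importer how many leading bytes to skip before parsing the first GEDCOM line.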
