GEDCOM/Syntax GEDCOM-line

From GenWiki

This page is an English extract of the German page GEDCOM/Syntax GEDCOM-Zeile [1]; for full details, see the German page.

Structure Element "GEDCOM Line"

A GEDCOM file is composed of individual lines with a precisely defined structure ( syntax ).

Formal Description of Permissible Values

Base: GEDCOM Standard Draft 5.5.1

GEDCOM lines are part of the record structure within a GEDCOM file. The structure of these records will be covered in a separate article.

The Standard defines

A GEDCOM line has the following syntax:

gedcom_line := level + delim + [optional_XREF_ID] + tag + [optional_line_value] + terminator

Example:

  • 1 NAME Will /Rogers/

The components used in the pattern above are defined below, in the order in which they appear in that definition. Some of the components are defined in terms of other primitive patterns. The spaces used in the patterns below only set the components apart and are not part of the resulting pattern. Character constants are specified in hex form, for example (0x20), which is the ASCII hex value of a space character. Character constants separated by a dash (-) represent any character within the range from the first constant shown up to and including the second constant shown.


level := [digit | digit + digit ]

From one line to the next, the level number may increase by at most 1. Level numbers must not contain leading zeroes; for example, level one must be (1), not (01). Level numbers must start with zero (0).
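
The level rules can be sketched as a small check over the level tokens of consecutive lines. This is only an illustration: the names LEVEL_RE and check_levels are hypothetical, and the sketch assumes that "must start with zero" means the first line of a file carries level 0.

```python
import re

# One or two digits, no leading zero (so "0" and "12" pass, "01" does not).
LEVEL_RE = re.compile(r"0|[1-9][0-9]?")


def check_levels(level_tokens):
    """Return (index, message) pairs for level tokens that break the rules."""
    errors = []
    previous = None
    for index, token in enumerate(level_tokens):
        if not LEVEL_RE.fullmatch(token):
            errors.append((index, "malformed level"))
            continue
        level = int(token)
        if previous is None and level != 0:
            # Assumption: the first line of a file opens a record at level 0.
            errors.append((index, "first level is not 0"))
        elif previous is not None and level > previous + 1:
            errors.append((index, "level increased by more than 1"))
        previous = level
    return errors
```

A sequence such as 0, 1, 2, 1, 0 passes this check, while 0 followed directly by 2 is reported.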


delim := [(0x20) ]

where: (0x20)=space_character

The delim ( delimiter ), a single space character, terminates both the variable-length level number and the variable-length tag. Note that space characters may also be present in a value.


optional_XREF_ID := XREF_ID + delim

for XREF_ID see article GEDCOM/XREF_ID.


tag := [alphanum | tag + alphanum ]

A tag consists of a variable-length sequence of alphanum characters. All user-defined tags that have not been defined in the GEDCOM standard must begin with an underscore character (0x5F).

The tag represents the meaning of the line_value within the context of the enclosing tags, and contributes to the meaning of the enclosed subordinate lines. Specific tags are defined in Appendix A. The presence of a tag together with a value represents an assertion which the submitter wishes to communicate to a receiver. A tag without a value does not represent an assertion. If a tag is absent, no assertion is made. Information of a negative nature ( such as knowing positively an event did not occur ) is handled through the semantic definition of a different tag and its accompanying value that assert the information explicitly.

Although formally defined tags are only three or four characters long, systems should prepare to handle user tags of greater length. Tags will be unique within the first 15 characters.

Valid combinations of specific tags, line_values, xref_IDs, and pointers are constrained by the GEDCOM form defined for representing a given kind of information. ( See Chapter 2 of the Standard ).


optional_line_value := delim + line_value


line_value := [ pointer | line_item ]

The line_value identifies an object within the domain of possible values allowed in the context of the tag. The combination of the tag, the line_value, and the hierarchical context of the supporting gedcom_lines provides the understanding of the enclosed values. This domain is defined by a specific grammar for representing a given GEDCOM form. ( See Chapter 2 of the Standard ). Values whose source information contains illegible parts of the value should be indicated by replacing the illegible part with an ellipsis (...). Values are generally not encoded in binary or other abbreviation schemes for reducing space requirements, and they are generally constrained to be understandable by a typical user without decoding. This is intended to reduce the decoding burden on the receiving software. A GEDCOM-optimized data compression standard will be defined in the future to reduce space requirements. Meanwhile, users may agree to compress and decompress GEDCOM files using any compression system available to both sender and receiver. The line_value within the context of a tag hierarchy of gedcom_lines represents one piece of information and corresponds to one field in traditional database or file terminology.


terminator := [carriage_return | line_feed | carriage_return + line_feed | line_feed + carriage_return ]

The terminator delimits the variable-length line_value and signals the end of the GEDCOM line. The valid terminator characters are described above.
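
The grammar above can be sketched as a regular expression. This is a minimal illustration under stated assumptions, not a complete GEDCOM parser: the names GEDCOM_LINE_RE, GedcomLine and parse_gedcom_line are hypothetical, the XREF_ID pattern is simplified (see article GEDCOM/XREF_ID for the real definition), and alphanum is assumed to mean ASCII letters, digits and the underscore.

```python
import re
from typing import NamedTuple, Optional

GEDCOM_LINE_RE = re.compile(
    r"(?P<level>0|[1-9][0-9]?)"     # level: one or two digits, no leading zero
    r" (?:(?P<xref_id>@[^@]+@) )?"  # delim + optional_XREF_ID (simplified)
    r"(?P<tag>[A-Za-z0-9_]+)"       # tag (assumed alphanum character set)
    r"(?: (?P<value>.*))?"          # optional_line_value: delim + line_value
)


class GedcomLine(NamedTuple):
    level: int
    xref_id: Optional[str]
    tag: str
    value: Optional[str]


def parse_gedcom_line(text: str) -> GedcomLine:
    """Split one strictly formed GEDCOM line into its components."""
    # Remove the terminator (CR, LF or a combination) before matching.
    match = GEDCOM_LINE_RE.fullmatch(text.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a valid GEDCOM line: {text!r}")
    return GedcomLine(
        int(match["level"]), match["xref_id"], match["tag"], match["value"]
    )
```

For the example line 1 NAME Will /Rogers/ this yields level 1, no XREF_ID, the tag NAME and the value Will /Rogers/. The sketch does not distinguish a pointer from a line_item; both end up in the value field.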


Minimum Content of a GEDCOM Line

All GEDCOM lines have either a value or a pointer unless the line contains subordinate GEDCOM lines. In other words, the presence of a level number and a tag alone should not be used to assert data ( for example, 1 DEAT Y should be used to imply a death known to have happened but with date and place unknown, not 1 DEAT ). The Lineage-linked form does not allow a GEDCOM line with both a value and a pointer on the same line.


Illegal Indentations of Lines in a GEDCOM File

Some systems output indented GEDCOM data for better readability by putting space or tab characters between the terminator and the level number of the next line to visibly show the hierarchy. Also, some people have suggested allowing extra blank lines to visibly separate physical records. GEDCOM files produced with these features are not to be used when transmitting GEDCOM to other systems.

Leading white space ( tabs, spaces, and extra line terminators ) preceding a GEDCOM line should be ignored by the reading system. Systems generating GEDCOM should not place any white space in front of the GEDCOM line.


Examples from the Standard

The following lines are independent examples of valid GEDCOM lines.

0 @1234@ INDI
… 
1 AGE 13y 
… 
1 CHIL @1234@ 
… 
1 NOTE This is a NOTE field that is
2 CONT continued on the next line.

The first line has a level number of 0, an XREF_ID of @1234@, an INDI tag and no value.

The second line has a level number of 1, no XREF_ID, an AGE tag, and a value of 13y.

The third line has a level number of 1, no XREF_ID, a CHIL tag, and a value that is a pointer to the XREF_ID @1234@.

0 @I12@ INDI
1 SEX M


Agreements for Syntax of GEDCOM lines

The agreements for "Syntax of GEDCOM lines" are derived from the discussion on the Gedcom-L list. They were decided by a vote of the program authors on the list.

Agreements for Export

E1 Rules for Syntax

The following requirements of the standard must be complied with at export:

  • No character is allowed in front of the level number ( Exception: BOM in the very first line of the file ).
  • The level number must be a number between 0 and 99 and may not contain a leading zero (0) in front of a second digit.
  • The delimiter between the other parts of the GEDCOM line consists of exactly one space (0x20).
  • Empty lines ( containing only a terminator ) are not permitted.
  • Every GEDCOM line must have either a value or a pointer unless the line contains subordinate GEDCOM lines.


E2 Composition of the GEDCOM Line

The GEDCOM line must comply with the definitions of the Standard when exporting:

  • gedcom_line := level + delim + [optional_XREF_ID] + tag + [optional_line_value] + terminator

with

  • optional_XREF_ID := XREF_ID + delim
  • optional_line_value := delim + line_value

For the components the provisions defined by the Standard must be fulfilled.
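
The composition in E2, together with the delimiter and level rules from E1, can be sketched as a small formatting helper. The function name format_gedcom_line and its parameters are hypothetical; the sketch assumes the caller passes an already valid tag, XREF_ID and value, and it does not check the minimum-content rule.

```python
def format_gedcom_line(level, tag, value=None, xref_id=None, terminator="\r\n"):
    """Build one GEDCOM line: level, optional XREF_ID, tag, optional value."""
    if not 0 <= level <= 99:
        raise ValueError(f"level out of range: {level}")
    # str() of an int never produces a leading zero, as E1 requires.
    parts = [str(level)]
    if xref_id is not None:
        parts.append(xref_id)
    parts.append(tag)
    if value is not None:
        parts.append(value)
    # Exactly one space (0x20) between the components, then the terminator.
    return " ".join(parts) + terminator
```

Called as format_gedcom_line(0, "INDI", xref_id="@I12@"), this produces the line 0 @I12@ INDI followed by CR/LF.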


E3 Terminator

The terminator signals the end of the GEDCOM line and must be exported according to one of the following alternatives:

  • CR/LF | LF | CR | LF/CR

For standard export the use of CR/LF or LF is recommended. LF/CR should not be used.
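
A reader has to accept all four terminators, while a writer can restrict itself to the recommended ones. The following sketch shows one way to do this; the names TERMINATOR_RE, split_gedcom_lines and join_gedcom_lines are hypothetical, and dropping empty pieces after the split is an assumption of this sketch.

```python
import re

# Two-character terminators are listed first so that CR/LF and LF/CR
# are consumed as one terminator rather than as two.
TERMINATOR_RE = re.compile(r"\r\n|\n\r|\r|\n")


def split_gedcom_lines(data):
    """Split file content on any permitted terminator, dropping empty pieces."""
    return [line for line in TERMINATOR_RE.split(data) if line]


def join_gedcom_lines(lines, terminator="\r\n"):
    """Join lines for export, allowing only the recommended terminators."""
    if terminator not in ("\r\n", "\n"):
        raise ValueError("use CR/LF or LF for export")
    return "".join(line + terminator for line in lines)
```

Files that mix different terminators cannot always be split unambiguously; the sketch simply takes the first alternative that matches at each position.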


E4 Deviation from Standard to represent Blank Lines in Notes

A blank line in notes is represented by the GEDCOM code as:

  • n+1 CONT

This line has neither a line_value nor a pointer/XREF_ID, and it cannot have any subordinate lines. According to the Standard it is therefore not permitted. As a deviation from E1, this form is permitted for exporting blank lines in notes.
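
As an illustration of E4, a note text with blank lines could be exported as follows. The function name note_to_gedcom is hypothetical, and the sketch deliberately leaves out other export concerns such as doubling of @ and splitting long lines with CONC.

```python
def note_to_gedcom(text, level=1):
    """Turn a multi-line note into NOTE/CONT lines (without terminators)."""
    first, *rest = text.split("\n")
    lines = [f"{level} NOTE {first}" if first else f"{level} NOTE"]
    for part in rest:
        # A blank line in the note becomes a CONT line without a value (E4).
        lines.append(f"{level + 1} CONT {part}" if part else f"{level + 1} CONT")
    return lines
```

A note consisting of two paragraphs separated by a blank line therefore produces a NOTE line, an empty CONT line and a CONT line with the second paragraph.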


E5 Use of UNICODE-Characters

Formally, the Standard permits only characters from the ANSEL character set, because the use of UNICODE has not yet been incorporated into the grammar of the GEDCOM line. To allow UNICODE characters, it is agreed that the definition of otherchar is extended by the UNICODE characters >= U+00A0. These characters are therefore allowed within line_values.


E6 Handling of @ in Line_Values at Export

A single @ must not appear in a line_value of the exported GEDCOM file. So that a single @ entered by the user can be transferred unambiguously and in conformance with the Standard, the approach of GEDCOM 5.5 is used as the default for export: each @ in the line_value is doubled to @@.

An export option to waive the doubling of the @ ( e.g. for target programs that cannot convert the @@ back at import ) may be offered to the user. This option may be offered in particular for e-mail addresses in the tag EMAIL.
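
The export rule for @ amounts to a simple string replacement. In the sketch below, the function name escape_at and the flag double_at are hypothetical. It is assumed that the function is applied only to plain user text, not to pointers or to other constructs in which a single @ has a meaning of its own.

```python
def escape_at(value, double_at=True):
    """Double every @ in a line_value; double_at=False models the export option."""
    return value.replace("@", "@@") if double_at else value
```

With the default setting, user@example.org is exported as user@@example.org; with double_at=False it is exported unchanged.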

Agreements for Import

I1 Extent of Import

GEDCOM lines formed according to rules E1 to E3 must be supported in general. For tags, cross-reference pointers and line values, the separate agreements apply.


I2 Procedure for Deviations from the Standard in the Import File

  • All characters in front of the level number, especially tabs and spaces, will be ignored ( Exception: BOM in the very first line of the file ).
  • Multiple spaces instead of a delimiter may be treated as a delimiter ( Special rules for CONC: see agreements CONC ).
  • Blank lines ( containing only the terminator ) are ignored.
  • Lines that have neither a line_value nor a pointer/XREF_ID and also have no subordinate line are ignored. The exception to this is defined in E4 to represent blank lines in notes.
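
The tolerant import behaviour listed above can be sketched as a two-pass reader. All names (LENIENT_RE, read_lenient) are hypothetical. The sketch keeps extra spaces after the tag as part of the value, uses a simplified XREF_ID pattern, and keeps an empty TRLR line; that last exception is an assumption of this sketch and is not stated in the agreement.

```python
import re

LENIENT_RE = re.compile(
    r"[^0-9]*"                       # ignore anything in front of the level
    r"(?P<level>[0-9]{1,2}) +"       # level, then one or more spaces
    r"(?:(?P<xref_id>@[^@]+@) +)?"   # optional XREF_ID (simplified pattern)
    r"(?P<tag>[A-Za-z0-9_]+)"        # tag
    r"(?: (?P<value>.*))?"           # only the first space is the delimiter
)


def read_lenient(raw_lines):
    """Return (level, xref_id, tag, value) tuples, applying the I2 tolerances."""
    parsed = []
    for raw in raw_lines:
        text = raw.rstrip("\r\n")
        if not text.strip():
            continue  # blank lines are ignored
        match = LENIENT_RE.fullmatch(text)
        if match is None:
            raise ValueError(f"cannot read GEDCOM line: {raw!r}")
        parsed.append(
            (int(match["level"]), match["xref_id"], match["tag"], match["value"])
        )

    kept = []
    for index, (level, xref_id, tag, value) in enumerate(parsed):
        has_child = index + 1 < len(parsed) and parsed[index + 1][0] > level
        is_empty = not value and xref_id is None
        # Empty CONT lines are kept (E4); keeping TRLR is this sketch's assumption.
        if is_empty and not has_child and tag not in ("CONT", "TRLR"):
            continue
        kept.append((level, xref_id, tag, value))
    return kept
```

The second pass looks only at the immediately following line to decide whether a line has subordinates, which is sufficient because a subordinate line always follows its parent directly.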


I3 Handling of the Character @ at Import

For standard import of line_values, each @@ must be converted back into a single @. A non-standard single @ in a line_value is imported unchanged.

An import option that does not replace @@ by @ may be offered to users, especially for files from programs that did not double the @ on export according to E6.
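
The import side of the @ rule can be sketched in the same way. The function name unescape_at and the flag undouble_at are hypothetical; the flag models the optional import setting described above.

```python
def unescape_at(value, undouble_at=True):
    """Convert each @@ back into @; a single @ is left as it is."""
    return value.replace("@@", "@") if undouble_at else value
```

A value such as user@@example.org is imported as user@example.org, and a non-standard user@example.org passes through unchanged.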


Note:

The agreements governing the permissible length of the GEDCOM line and field sizes within the line are documented under GEDCOM/Field Sizes.
