GEDCOM/Field Sizes

From GenWiki

This page is an English extract of the German page GEDCOM/Feldlängen [1]. For full details, see the German page.

Name and Meaning

Field Sizes

This article covers the field sizes of components in a GEDCOM file.

Usage

This article covers the minimum and maximum lengths of components, the requirements of the standard, and recommendations.

Formal Description of Permissible Values

Base: GEDCOM Standard Draft 5.5.1

The standard defines:

Concepts

A GEDCOM transmission represents a database in the form of a sequential stream of related records. A record is represented as a sequence of tagged, variable-length lines, arranged in a hierarchy.


Grammar Rules

Long values can be broken into shorter GEDCOM lines by using a subordinate CONC or CONT tag. The CONC tag assumes that the accompanying subordinate value is concatenated to the previous line value without saving the carriage return prior to the line terminator. If a concatenated line is broken at a space, then the space must be carried over to the next line. The CONT assumes that the subordinate line value is concatenated to the previous line, after inserting a carriage return.

Logical GEDCOM record sizes should be constrained so that they will fit in a memory buffer of less than 32K. GEDCOM files with record sizes greater than 32K run the risk of not being loadable in some programs. Use of pointers to records, particularly NOTE records, should ensure that this limit will be sufficient. Note from author: This limitation is historical and no longer has any practical meaning.

Any length constraints are given in characters, not bytes. When wide characters (characters wider than 8 bits) are used, byte buffer lengths should be adjusted accordingly.
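The difference between character and byte counts can be illustrated with a minimal Python sketch. The sample name and the choice of encodings are assumptions made for illustration only:

```python
# Length limits count characters; the byte count depends on the encoding.
value = "Müller, Jürgen"  # hypothetical sample value

char_count = len(value)                        # number of characters
utf8_bytes = len(value.encode("utf-8"))        # each "ü" takes 2 bytes in UTF-8
utf16_bytes = len(value.encode("utf-16-le"))   # 2 bytes per character for this text

print(char_count, utf8_bytes, utf16_bytes)
```

A buffer sized in bytes therefore needs to be larger than the character limit whenever non-ASCII characters can occur.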

The cross-reference ID has a maximum of 22 characters, including the enclosing 'at' signs (@), and it must be unique within the GEDCOM transmission.

The length of the GEDCOM TAG is a maximum of 31 characters, with the first 15 characters being unique.

The total length of a GEDCOM line, including level number, cross-reference number, tag, value, delimiters, and terminator, must not exceed 255 (wide) characters.
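The three limits above (cross-reference ID, tag, and total line length) could be checked as in the following sketch. The function name `check_line`, the message strings, and the simplified line layout `level [xref] tag [value]` are assumptions for illustration, not part of the standard:

```python
MAX_XREF = 22   # cross-reference ID, including both '@' signs
MAX_TAG = 31    # tag length
MAX_LINE = 255  # whole line, including the terminator


def check_line(line: str, terminator: str = "\n") -> list[str]:
    """Return the limit violations found in one GEDCOM line (given without terminator)."""
    problems = []
    if len(line) + len(terminator) > MAX_LINE:
        problems.append("line too long")

    # Simplified layout: level [xref] tag [value]
    parts = line.split(" ", 3)
    has_xref = len(parts) > 1 and parts[1].startswith("@")
    if has_xref:
        if len(parts[1]) > MAX_XREF:
            problems.append("xref too long")
        tag = parts[2] if len(parts) > 2 else ""
    else:
        tag = parts[1] if len(parts) > 1 else ""

    if len(tag) > MAX_TAG:
        problems.append("tag too long")
    return problems
```

For example, `check_line("0 @I1@ INDI")` returns an empty list. Lengths are counted with `len()`, that is, in characters rather than bytes.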


Syntax of the Grammar

Although formally defined tags are only three or four characters long, systems should prepare to handle user tags of greater length. Tags will be unique within the first 15 characters.


Primitive Elements

The field sizes show the minimum recommended field length within a database that is constrained to fixed-length fields. The field sizes are in addition to the GEDCOM level and tag overhead. GEDCOM lines are limited to 255 characters. However, the CONCatenation or CONTinuation tags can be used to expand a field beyond this limit. A CONT line implies that a new line should appear to preserve formatting. CONC implies concatenation to the previous line without a new line. This is used so that a text note or description can be processed (word wrapped) in a text window without fixed carriage returns. The CONT and CONC tags are used to extend specified textual values.


Agreements for Field Sizes

The agreements for "Field sizes" are derived from the discussion on the GEDCOM-L list. They were decided by a vote of the program authors on the list.

Preface

The discussion on the GEDCOM-L list has clearly shown that programs take very different approaches to the possible length of individual data fields. Some programs limit the length of fields such as names, occupations, places etc. (the respective maximum length varies greatly between programs), while others can handle any number of characters in these fields. The GEDCOM-L list has set itself the task of improving data transfer between programs; it is not its task to regulate program internals.

With respect to the length of data fields in the programs, the agreements therefore only refer to the GEDCOM standard and RECOMMEND a minimum length of the data fields for programs that use length-limited fields. Whether this recommendation is followed in each case is decided solely by the program authors and is not subject to the agreements reached here.

This means that only those points are taken from the discussion on the list that make data transfer between programs as smooth as possible for the user. However, it must be noted at the outset that data transfer between programs with different field lengths cannot work 1:1. The more a program limits the field lengths, the more the user has to expect that data from other programs cannot be imported or cannot be fully assigned to the proper fields. And the longer the fields a program allows, the more the user must be aware that data using this field length may not be transferred to other programs completely, or not to the correct data fields.

The proposed rules will at least ensure that data transfer works between programs with the same field lengths. Further recommendations are formulated to let the user work with data that could not be imported.

Agreements with Respect to Export

E1 Requirements of the GEDCOM Standard 5.5.1

The requirements of the standard must be met. This applies in particular to the following points:

  • The cross-reference identifier has a maximum length of 22 characters, including the enclosing 'at' signs (@).
  • The length of a GEDCOM tag is limited to 31 characters, and the first 15 must be unique.
  • The overall length of a GEDCOM line, including the level number, cross-reference ID, tag, value, delimiters and line terminator, must not exceed 255 (wide) characters.

All length limits are given here in characters rather than bytes. If "wide characters" (characters that are wider than 8 bits) are used, the number of bytes is correspondingly higher.


E2 Extent of Export

The complete contents of the data fields must be exported. Where this would exceed the line length of 255 characters, the lines must be wrapped (for all tags) with CONC or CONT. The agreements for splitting lines with CONC must be observed. CONT should not be used except in the cases explicitly defined by the GEDCOM standard:

  • in the HEADER after COPR and NOTE
  • in NOTE_RECORD after NOTE
  • in SOURCE_RECORD after AUTH, TITL, PUBL, TEXT
  • in INDIVIDUAL_ATTRIBUTE_STRUCTURE after DSCR
  • in NOTE_STRUCTURE after NOTE
  • in SOUR_CITATION after TEXT and SOUR

In addition, CONT is allowed {0:3}:

  • in ADDRESS_STRUCTURE after ADDR

Lines with a length of less than 255 characters may be wrapped with CONC (or CONT). This is not recommended because of the risk of data loss in third-party programs, especially in cases where the use of CONC and CONT is not explicitly mentioned in the standard.
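Wrapping a long value with CONC on export might look like the following sketch. The function name `split_with_conc` is hypothetical, the value is assumed to contain no line breaks, and the rule of breaking inside a word (so that no chunk starts or ends with a space) is an assumption about a safe splitting strategy rather than a quotation of the agreements:

```python
MAX_LINE = 255  # whole line, including the terminator


def split_with_conc(level: int, tag: str, value: str, terminator: str = "\n") -> list[str]:
    """Split a long single-line value into a first line plus subordinate CONC lines."""

    def room(lvl: int, t: str) -> int:
        # Characters left for the value after level, tag, delimiters and terminator.
        return MAX_LINE - len(f"{lvl} {t} ") - len(terminator)

    lines = []
    lvl, t, rest = level, tag, value
    while True:
        cut = room(lvl, t)
        if len(rest) <= cut:
            lines.append(f"{lvl} {t} {rest}")
            return lines
        # Assumption: move the break point back until it falls inside a word.
        while cut > 1 and (rest[cut - 1] == " " or rest[cut] == " "):
            cut -= 1
        lines.append(f"{lvl} {t} {rest[:cut]}")
        lvl, t, rest = level + 1, "CONC", rest[cut:]
```

A value that already fits is returned as a single line, so lines shorter than the limit are not wrapped unnecessarily.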


E3 Use of NOTE for very long content

With regard to destination programs with length-limited data fields, the data content may be transferred into notes (NOTE, if necessary with CONC | CONT) during export. It is recommended to prefix the content with a reference name, e.g.

  • n NOTE Occupation: .... here follows a very long text of the occupation input field ....
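A minimal sketch of this export option follows. The function name `export_field`, its parameters, and the idea of a known `target_limit` for the destination program are assumptions for illustration; wrapping of over-long lines with CONC is left out for brevity:

```python
def export_field(level: int, tag: str, label: str, content: str, target_limit: int) -> str:
    """Export content under its own tag, or as a labelled NOTE if it exceeds target_limit."""
    if len(content) <= target_limit:
        return f"{level} {tag} {content}"
    # Prefix the reference name so the user can tell which field the note came from.
    return f"{level} NOTE {label}: {content}"
```

With a limit of 10 characters, `export_field(1, "OCCU", "Occupation", "Master blacksmith and farrier", 10)` yields `1 NOTE Occupation: Master blacksmith and farrier`.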

Agreements with Respect to Import

I1 Handling of Longer Data Content during Import

Wherever possible, the data content to be imported should be transferred completely to the appropriate data fields. If this is not possible because the data content (possibly extended by CONC | CONT) does not fit into the associated data field of the program, or because the data field is not part of the program's functionality, the user must be informed. It is recommended to give the user access to the part of the data field that was not imported. This can be done either by storing it as a NOTE or by writing it to a separate error file.
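For a program with length-limited fields, the import rule could be sketched as follows. The function name `import_field` and the returned pair are hypothetical; a real program would also have to inform the user, which is not shown here:

```python
from typing import Optional


def import_field(label: str, content: str, max_len: int) -> tuple[str, Optional[str]]:
    """Return (stored value, note text for the part that did not fit, or None)."""
    if len(content) <= max_len:
        return content, None
    # Keep what fits in the field; preserve the remainder as a labelled note.
    return content[:max_len], f"{label}: {content[max_len:]}"
```

The second element is `None` when nothing was cut off, so the caller can decide whether a note and a user message are needed.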


I2 Handling of Extended Text by CONC | CONT

Content from CONC lines should be appended to the previous content without adding a space or line break. Content from CONT lines should be appended with a line break. In all cases except those listed in E2 or explicitly defined in the GEDCOM standard, this line break may be replaced by a space (especially when the target is a single-line data field). If the program does not support extending data content with CONC and/or CONT, the user must receive appropriate error messages.
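The joining rules can be sketched in a few lines. The function name `join_value`, the representation of continuation lines as `(tag, text)` pairs, and the `single_line_target` switch are assumptions for illustration:

```python
def join_value(first: str, continuation: list[tuple[str, str]],
               single_line_target: bool = False) -> str:
    """Rebuild a value from its first line and the following CONC/CONT lines."""
    text = first
    for tag, chunk in continuation:
        if tag == "CONC":
            text += chunk  # no space, no line break
        elif tag == "CONT":
            # Line break, or a space when the target field holds a single line.
            text += (" " if single_line_target else "\n") + chunk
        else:
            raise ValueError(f"unexpected continuation tag: {tag}")
    return text
```

For example, `join_value("Black", [("CONC", "smith"), ("CONT", "in Bonn")])` gives `"Blacksmith\nin Bonn"`, and the same call with `single_line_target=True` gives `"Blacksmith in Bonn"`.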


I3 Allocation of Not Imported Parts

For parts not imported into the associated data fields according to I1 and I2, it is recommended to provide the user with appropriate notes and references to the corresponding data fields. For notes, this can be done as suggested in E3. When storing in error files, information should be included that allows the content to be assigned to the record and data field concerned.
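One possible shape for an error-file entry is sketched below. The function name `error_entry` and the entry format are purely hypothetical; the only point is that the record (via its cross-reference ID) and the data field (via its tag) remain identifiable:

```python
def error_entry(xref: str, tag: str, remainder: str) -> str:
    """Format one error-file line that names the record and data field concerned."""
    return f"record {xref}, field {tag}: not imported: {remainder!r}"
```

Calling `error_entry("@I12@", "OCCU", "efghij")` produces `record @I12@, field OCCU: not imported: 'efghij'`.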
