From GenWiki


Gedbas4all -- New Data Model for Genealogy

by Jesper Zedlitz

Results-oriented research and documentation with GEDCOM constantly causes problems for genealogists. These problems could be avoided with a source-oriented data model. This paper first presents a typical problem case and then describes a possible solution using a new data model: Gedbas4all.


Good information management is essential, both for ambitious amateur family historians and for genealogists with scientific requirements. Even the small mistakes that frequently happen during research often lead to confusion and wrong conclusions later. One reason is the results-oriented way of working, which leads to difficulties even when the research is carefully documented.

An Example

As an example, I want to show a case from my own genealogical research: While reading the church book of the Protestant Church Seitendorf, district of Schönau, Silesia, I found a baptismal record for "Christiane Caroline Zedlitz, daughter of Christiana Beata Zedlitz, born 1843" (see Figure 1a).

Image:Gedbas4all article figure 1.svg

Some years later, in 1853, there is a marriage entry for Christiane Beata Zedlitz with a Mr. Herrmann. My first guess was that both women were one and the same person. The age of the bride matched, and the interchange of a/e at the end of first names is typical for the relevant church records (fig. 1b). Almost a year later, a stillborn daughter of the couple Herrmann/Zedlitz is noted. Based on that information I reconstructed the family relationship illustrated in figure 1c.

During further research, I came upon the burial register of the parish. To my great surprise, I discovered a burial entry from the year 1846 for one Beate Christiane Zedlitz (fig. 2) -- my previous theory had proven to be false.

Now I was faced with the complicated task of disentangling the wrong results -- I opted to re-enter everything from scratch.

View of Research Area

What happened? An abstract view of our research area (Figure 3) shows the difficulties we face during our research.

Image:Gedbas4all article figure 3.svg

A delight for any family historian is the part of the past that is supported by sources (A). However, there are sources that contain false information, due to random error or intentional falsification (B). Another part of the past is not covered by written documentation (C). Here, family researchers are called upon to draw conclusions from existing information based on their experience. For example, if you do not have a birth record but do have a marriage certificate stating the age of the spouses, you can reconstruct their approximate date of birth. In addition to the cases in which these conclusions are correct (C1), it can also happen that you are wrong (C2).

The documentation of genealogical research makes the "journey" through these parts of our research field traceable. However, for every recorded result it must remain recognizable which part of the above model it belongs to.

The error in the example above -- almost provoked by the familiar, results-oriented data model of GEDCOM -- arose from mixing data from sources with my own conclusions. Mixing these two types of information makes later review and possible correction virtually impossible.
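The age-based inference mentioned above can be written down explicitly. The following is only a minimal sketch, assuming the record states the age in completed years at the time of the event; the function name `birth_year_range` and the sample numbers are hypothetical. Because the birthday may fall before or after the event within the year, the result is a range of two years, and it is a conclusion (part C of the model), not information taken from a source.

```python
def birth_year_range(event_year: int, age_in_years: int) -> tuple[int, int]:
    """Return the earliest and latest possible year of birth.

    Assumes the age is given in completed years at the time of the event.
    """
    latest = event_year - age_in_years   # birthday already passed in the event year
    earliest = latest - 1                # birthday still to come in the event year
    return earliest, latest


# Hypothetical example: a spouse recorded as 25 years old at a wedding in 1853
print(birth_year_range(1853, 25))  # (1827, 1828)
```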

Correcting errors properly

Errors in genealogical work are inevitable [Stoyan09][1]. Handling errors openly is essential for serious research. On the one hand, it helps with your own work. If you re-encounter some vague or incorrect information -- perhaps years later -- it can save you a lot of (possibly fruitless) thought about how this conflicting data fits your own, already corrected, records. How convenient if you have noted that you already thought this case through in the past.

On the other hand, an open approach avoids casting other researchers and publications in a bad light. As a fictitious example, take an online publication that lists a birth record dated "18 June 1857". A second publication cites this entry in accordance with standard scientific citation rules. Later, the author of the online publication finds out that he made a typing error in the birth entry and silently changes the year to "1875". Eventually someone checks the second publication, discovers the discrepancy and wrongly accuses the second author of inaccurate work -- although both the second author and the reviewer have worked carefully. Under the method of assessing the reliability of genealogical literature proposed in the aforementioned talk, the second author would get a bad rating, although the actual error was made in the online publication. So it is almost careless, and unfair to future researchers, to correct errors silently and simply drop the old entries.

Source-oriented data model

The solution to these problems lies in a source-oriented data model that keeps information from sources and conclusions clearly separated from each other. Such a model is presented in [GDM][2]. It serves as the basis for the data model presented here. The central component is the ASSERTION, which links two other elements (SUBJECTs) (fig. 4).

Image:Gedbas4all Artikel Abbildung 4.svg

These SUBJECTs include the occurrence of a person (PERSONA), a GROUP, a CHARACTERISTIC, an EVENT, or an object (THING). Additionally, an ASSERTION records which SOURCE it is based on, who drew the conclusion, and possibly which project and genealogical society it belongs to. Usually ASSERTIONs are positive, but they can also be negative, e.g. stating that a person was not part of a group. Not all combinations of two SUBJECTs are allowed, since some do not make sense or there are better ways to model the facts. But first, let me give an overview of the various components:

PERSONA In a source-oriented data model, it is important that a new person is created for each source. Only in a later step -- namely the reasoning -- are people from multiple sources connected. Regardless of how clear the match might appear, always create a new PERSONA.
GROUP Groups are useful to model various facts. The most obvious case is a group of people, e.g. the children of a person, the residents of a house, the members of a military regiment. But things can form a group, too -- such as houses in a street.
CHARACTERISTIC Individuals, groups, events and things can have properties. For a human, this could for example be the name or the color of hair. For a ship (a THING) it could be its name.
EVENT Events take place at a single point in time (birth, marriage, death) or over an extended period of time (ship journey, residence, occupation).
THING Things can be used in a variety of cases: a house, a ship, a company. Like PERSONAs, things can be linked to CHARACTERISTICs. You can also connect them with the events they appear in, e.g. the work of a person in a company.
SOURCE In a source-oriented data model, sources naturally play a major role. In this context a source refers to the abstract source itself, that is the church register, the tombstone, the list, etc., not a digital image or a transcription. The latter are REPRESENTATIONs of a source. For each source a variety of REPRESENTATIONs can exist. Sources are hierarchical in structure, i.e. a source can consist of several sub-sources. Using the example of a book this hierarchy could be: paper → page → entry. While processing secondary literature you will find references to sources. To map this, there is an element SOURCE_REFERENCE that describes information like "Source 1 says that source 2 says that ..."
REPRESENTATION Representations are digital versions of sources. For example, this can be the text transcript of a church record (fig. 5).

The photo of a gravestone is also a representation, while the photographed gravestone itself is the source. If you have taken several photos of the same gravestone, it is simply one source with multiple representations. Audio-visual data can be a representation, too -- think of the recording of an interview with older relatives.
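The relationship between sources, sub-sources and representations can be sketched roughly as follows. This is an illustration under assumptions, not the Gedbas4all schema: the class layout, the field names (`kind`, `location`, `parent`) and all sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Representation:
    """A digital version of a source, e.g. a photo or a transcript."""
    kind: str       # hypothetical field: "photo", "transcript", "audio", ...
    location: str   # hypothetical field: file name or URL


@dataclass
class Source:
    """The abstract source itself (church register, gravestone, list, ...)."""
    title: str
    parent: Optional["Source"] = None
    representations: list[Representation] = field(default_factory=list)

    def path(self) -> str:
        """Return the titles from the top-level source down to this one."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.title)
            node = node.parent
        return " > ".join(reversed(parts))


# One source, several representations (invented file names)
gravestone = Source("Gravestone")
gravestone.representations.append(Representation("photo", "front.jpg"))
gravestone.representations.append(Representation("photo", "detail.jpg"))

# A hierarchical source: register > page > entry (invented titles)
register = Source("Burial register")
page = Source("Page 12", parent=register)
entry = Source("Entry 3", parent=page)
print(entry.path())  # Burial register > Page 12 > Entry 3
```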

Image:Gedbas4all Artikel Abbildung 5.svg
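To make the central element more concrete, here is a minimal sketch of an ASSERTION linking two SUBJECTs. It only mirrors the properties named in the text (source, researcher, optional project and society, positive or negative); the class and field names are assumptions and not taken from Gedbas4all or [GDM], and the researcher and regiment values are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SubjectType(Enum):
    PERSONA = "PERSONA"
    GROUP = "GROUP"
    CHARACTERISTIC = "CHARACTERISTIC"
    EVENT = "EVENT"
    THING = "THING"


@dataclass(frozen=True)
class Subject:
    type: SubjectType
    label: str


@dataclass(frozen=True)
class Assertion:
    subject1: Subject
    subject2: Subject
    source: str                    # which SOURCE the assertion is based on
    researcher: str                # who made the conclusion
    negative: bool = False         # True: "subject1 is NOT linked to subject2"
    project: Optional[str] = None
    society: Optional[str] = None


# The mother as she occurs in the baptismal record of the example above
mother = Subject(SubjectType.PERSONA, "Christiana Beata Zedlitz (baptismal record)")
baptism = Subject(SubjectType.EVENT, "Baptism of Christiane Caroline Zedlitz")
link = Assertion(mother, baptism,
                 source="Church book Seitendorf, baptisms",
                 researcher="researcher A")  # placeholder name

# A negative assertion: the persona was not a member of some group
regiment = Subject(SubjectType.GROUP, "Some regiment (placeholder)")
not_member = Assertion(mother, regiment,
                       source="Church book Seitendorf, baptisms",
                       researcher="researcher A",
                       negative=True)
```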

Now that the ingredients have been introduced, let's take a look at the allowed SUBJECT combinations mentioned above (fig. 6).

Image:Gedbas4all Artikel Abbildung 6.svg

PERSONA-GROUP The person was a member of the 1st Holstein Dragoon Regiment.
PERSONA-CHARACTERISTIC The person had blond hair.
PERSONA-EVENT The person participated as the bride at the wedding.
THING-GROUP The house is located on Main Street. Easier to remember: "The house is part of the houses on Main Street".
THING-CHARACTERISTIC The name of the vessel was "Unsinkable II".
THING-EVENT The ship was involved in the voyage from Bremen to New York.
GROUP-CHARACTERISTIC The street name is "Main Street".
GROUP-EVENT The regiment took part in the Battle of Waterloo.
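A check for allowed combinations could look roughly like the sketch below. It encodes only the pairs listed in the text above (the regiment example being a GROUP-EVENT link); figure 6 may contain further combinations, and the names `ALLOWED` and `is_allowed` are hypothetical. Pairs are treated as unordered, which is an assumption.

```python
# Unordered pairs of SUBJECT types that may be linked by an ASSERTION.
ALLOWED = {
    frozenset({"PERSONA", "GROUP"}),
    frozenset({"PERSONA", "CHARACTERISTIC"}),
    frozenset({"PERSONA", "EVENT"}),
    frozenset({"THING", "GROUP"}),
    frozenset({"THING", "CHARACTERISTIC"}),
    frozenset({"THING", "EVENT"}),
    frozenset({"GROUP", "CHARACTERISTIC"}),
    frozenset({"GROUP", "EVENT"}),
}


def is_allowed(type1: str, type2: str) -> bool:
    """Return True if the two SUBJECT types may be combined."""
    # A pair of identical types collapses to a one-element set,
    # which is never in ALLOWED.
    return frozenset({type1, type2}) in ALLOWED


print(is_allowed("PERSONA", "EVENT"))    # True
print(is_allowed("GROUP", "PERSONA"))    # True (order does not matter)
print(is_allowed("PERSONA", "PERSONA"))  # False
```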

At first glance the combinations PERSONA-PERSONA and THING-THING seem useful to mark equality. However, [GDM] has already shown that it is smarter to collect all probably identical individuals in a GROUP and then to generate a new PERSONA from this GROUP using a GROUP-PERSONA conclusion.
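The following sketch illustrates the idea with the two occurrences from the church book example. It is an assumption-laden illustration: the classes, the function `conclude_persona` and the researcher name are hypothetical. The point it tries to show is that the source-level PERSONAs stay untouched, so a wrong identification can later be withdrawn without re-entering anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    origin: str   # the source entry, or a note that this is a conclusion


@dataclass(frozen=True)
class Group:
    label: str
    members: tuple[Persona, ...]


def conclude_persona(group: Group, name: str, researcher: str) -> Persona:
    """Create a new PERSONA that stands for all members of the group."""
    return Persona(name=name,
                   origin=f"conclusion by {researcher} from group '{group.label}'")


# One PERSONA per source entry, never shared between sources
in_baptism = Persona("Christiana Beata Zedlitz", "baptismal record")
in_marriage = Persona("Christiane Beata Zedlitz", "marriage entry 1853")

# The (later disproved) identification is recorded as a GROUP ...
same_woman = Group("probably identical", (in_baptism, in_marriage))
# ... and the merged person is a separate, clearly marked conclusion.
merged = conclude_persona(same_woman, "Christiane Beata Zedlitz", "researcher A")

print(merged.origin)
```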

Display of data

If we create new entries for all persons occurring in each source -- that will make the presentation confusing, won't it? The key to greater clarity is to aggregate pieces of information that do not contradict each other. A large number of individual occurrences can be combined into a single claim that carries a number of references (to the individual source entries). Only when contradictions occur (e.g. two different years of birth) are they highlighted separately in the result. This way these unclear points become clearly visible and you can start to clarify them. As an example, some entries from Berlin address books are used (fig. 7).

Image:Gedbas4all article figure 7.svg

The table below shows how these sources can be summarized into information about the persons mentioned above.

Information   Value                            Sources
Name          Ernst Alexander                  [a] [b] [c] [d]
Occupation    Hof-Tapezierer                   [a] [b] [c] [d] [e]
Death         between 1824 and 1825            [d] [e]
Residence     Berlin, Französischestraße 67    [a] [b] [c] [d] [e]
Wife          D. Friedrich                     [e]
  Death       after 1825                       [e]
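The aggregation described in this section can be sketched as follows: identical values are merged into one claim with a list of source references, and a piece of information with more than one distinct value is flagged as a contradiction. The function `summarize` and the data layout are assumptions. The name and occupation values are taken from the table above, while the two "Year of birth" entries and their sources [x] and [y] are invented purely to demonstrate a conflict.

```python
from collections import defaultdict


def summarize(claims):
    """Aggregate (information, value, source) triples.

    Returns a dict: information -> {"values": {value: [sources]}, "conflict": bool}.
    """
    table = defaultdict(lambda: defaultdict(list))
    for info, value, source in claims:
        table[info][value].append(source)
    return {
        info: {
            "values": {value: sorted(sources) for value, sources in values.items()},
            "conflict": len(values) > 1,   # more than one distinct value
        }
        for info, values in table.items()
    }


claims = [
    ("Name", "Ernst Alexander", "a"),
    ("Name", "Ernst Alexander", "b"),
    ("Occupation", "Hof-Tapezierer", "a"),
    ("Occupation", "Hof-Tapezierer", "b"),
    ("Year of birth", "1780", "x"),   # invented, for demonstration only
    ("Year of birth", "1781", "y"),   # invented, for demonstration only
]

summary = summarize(claims)
for info, entry in summary.items():
    marker = "CONFLICT" if entry["conflict"] else "ok"
    for value, sources in entry["values"].items():
        refs = " ".join(f"[{s}]" for s in sources)
        print(f"{info:15} {value:20} {refs:10} {marker}")
```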

Merging sources

A large part of the genealogical information for German-speaking countries is stored in the databases of the Verein für Computergenealogie and of regional genealogical societies. However, there is virtually no connection between the individual collections. A person appears in multiple address books, and there are personal columns, entries in parish registers, account books, passenger lists, etc. If we could combine these primary sources, it would be possible to create a comprehensive picture of a person's circumstances. If you do this systematically for one place, you get a heritage book with a comprehensive database behind it. The close connection to primary sources ensures a high quality of data. Of course, it is essential not to mix data from different primary sources and to clearly identify conclusions. This ensures that the results are easy to review and that incorrect conclusions can be marked and replaced with new conclusions without any impact on the primary sources.

A prerequisite for linking information is central data storage. Different web applications can access this central store and read their data. Targeted access is possible because, in addition to the actual genealogical information, metadata is stored that describes the sources, the researcher, the project and possibly the society that collected the data. That way it is possible to show information from old address books without mixing it with data from, for example, family columns.
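One way to replace a conclusion without silently dropping it is to keep the old entry and point it to its successor. The sketch below is only an illustration of that idea; the class names, the `superseded_by` field and the researcher name are assumptions, and the statements paraphrase the church book example from the beginning of the article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conclusion:
    id: int
    statement: str
    researcher: str
    superseded_by: Optional[int] = None   # id of the replacing conclusion
    reason: Optional[str] = None          # why it was replaced


class ConclusionLog:
    """Keeps every conclusion; old ones are marked, never deleted."""

    def __init__(self) -> None:
        self._items: dict[int, Conclusion] = {}

    def add(self, statement: str, researcher: str) -> Conclusion:
        conclusion = Conclusion(len(self._items) + 1, statement, researcher)
        self._items[conclusion.id] = conclusion
        return conclusion

    def supersede(self, old_id: int, statement: str,
                  researcher: str, reason: str) -> Conclusion:
        new = self.add(statement, researcher)
        old = self._items[old_id]
        old.superseded_by = new.id
        old.reason = reason
        return new

    def current(self) -> list[Conclusion]:
        return [c for c in self._items.values() if c.superseded_by is None]

    def history(self) -> list[Conclusion]:
        return list(self._items.values())


log = ConclusionLog()
first = log.add("The mother of 1843 and the bride of 1853 are the same woman",
                "researcher A")
second = log.supersede(first.id,
                       "The mother of 1843 and the bride of 1853 are different women",
                       "researcher A",
                       reason="burial entry from 1846")

print(len(log.current()), len(log.history()))  # 1 2
```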
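How such metadata enables targeted access can be illustrated with a small filter over stored records. This is a sketch under assumptions, not the interface of the central storage: the `Record` fields, the `select` helper and all sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    text: str
    source_type: str   # hypothetical metadata: kind of source
    project: str       # hypothetical metadata: collecting project
    researcher: str    # hypothetical metadata: who entered the data


def select(records, **criteria):
    """Return the records whose metadata matches all given criteria."""
    return [r for r in records
            if all(getattr(r, key) == value for key, value in criteria.items())]


# Invented sample data
store = [
    Record("entry 1", "address book", "project X", "researcher A"),
    Record("entry 2", "family column", "project Y", "researcher B"),
    Record("entry 3", "address book", "project X", "researcher B"),
]

# An address book application reads only address book data.
address_entries = select(store, source_type="address book")
print([r.text for r in address_entries])  # ['entry 1', 'entry 3']
```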

  1. Herbert Stoyan, Der Fehler in genealogischen Systemen, 61. Deutscher Genealogentag, Bielefeld, 2009
  2. GENTECH, Genealogical Data Model, Phase 1, May 2000