SlownikGeo

From GenWiki

Project

Here you will find the Polish Description of the Project.

Introduction

The project aims to produce a searchable text and a German translation of the series (16 volumes and 14,785 pages) Słownik Geograficzny Królestwa Polskiego i innych krajów słowiańskich (Warsaw, 1880-1902) [Geographical Dictionary of the Kingdom of Poland and other Slavic countries], and to transfer the data into the historical genealogical gazetteer Das Genealogische Ortsverzeichnis [GOV], GenWiki and Hic Leones.

To get an idea of the work achieved by the editor Filip Sulimierski and his nearly 700 collaborators during the years 1880 - 1902, imagine the entire text of the Słownik Geograficzny set one line after the other as a single text string: it would have a length of about 130000 km (i.e. more than 3 times around the equator ...).

History

In 2003, the Polish Genealogical Society of America (PGSA) published the 16 volumes of the Słownik Geograficzny Królestwa Polskiego on CD-ROM (DJVU format). This project was carried out by Rafał T. Prinke (Editor-in-Chief), Poznań, Poland (Digitization: Michał and Stanisław Prinke), with additional material from William F. Hoffman, and was financed by the PGSA (Project Manager: Kenneth Czerwinski; Project Committee: Marcia Bergman, Jim Czuchra, Virginia Hill, Rosalie Lindberg, Annmarie Utroska and Stanley Schmidt). Selected entries were translated into English by PGSA members.


Idea

The volumes of the Słownik Geograficzny contain descriptions of all regions, towns, villages and other settlements, mountains, rivers and lakes of the Kingdom of Poland (Congress Poland), as well as of the Baltic (non-Slavic) and the Western and Southern gubernias of the Russian Empire, West and East Prussia, the Grand Duchy of Poznań and Prussian Silesia, Galicia, Austrian Silesia, Moravia, the Slovak parts of Hungary and Bukovina, and additionally the more important places in the remaining gubernias of European Russia (such as county seats, parishes, railroad stations, etc.), with an estimated 315,000 to 350,000 entries.

For researchers, however, the CD-ROM edition has two drawbacks:

  • it consists of GRAPHIC files, which cannot be searched for textual terms, and
  • it is Polish text, which often poses a language problem for non-Polish researchers.

Consequently, it would be highly desirable to make this wealth of information accessible to historians and genealogists

  • as a TEXT file (with a global search function) and
  • in the German language.


Preliminary Work

OCR (optical character recognition) tests on the graphic files of the Słownik-Geograficzny-CD-ROM, using FineReader, Ver. 6.0, yielded very good results (> 95 % correct text), including the specific Polish characters.


Legal Aspects

H.V.J. Kolbe (Hic Leones) contacted the PGSA (Public Relations: Mrs. Cynthia Piech, Chicago) with examples of the OCR results, and proposed the following collaboration:

  • The PGSA officially authorizes Hic Leones to digitize the 14,785 graphic files of the Słownik-Geograficzny-CD-ROM, at no cost, for its own use, and to create the corresponding text files.
  • In exchange, the PGSA will receive these fully formatted text files from Hic Leones for its own purposes (e.g. translation into English, additional CDs [text + graphics] etc.), at no cost and with no strings attached.

The PGSA considered this offer, and in Sept. 2005 H.V.J. Kolbe received a written agreement from the PGSA Board of Directors. This agreement created the legal basis for the use of the data and, in addition, established an interesting new contact for family research in Poland.

This result was announced by Hic Leones in 2005 at the 57th Genealogentag in Hannover/Germany. The PGSA announced this collaboration simultaneously at their 27th Annual Conference in Schaumburg/IL.

At the end of Sept. 2005, all 14,785 pages of the Słownik Geograficzny were digitized, and in Oct. 2005 the Polish text files were sent to Cynthia Piech on CD-ROM.

The Project

Type

Scientific cooperation.


Project Partners and Contributions

PGSA (Polish Genealogical Society of America)

http://www.pgsa.org

  • Scanning of the 14,785 pages of the Słownik Geograficzny and storage in graphic format
  • Publication of the Słownik Geograficzny Królestwa Polskiego on CD-ROM

Hic Leones

http://www.hicleones.com

  • Concept and coordination of the project
  • Text recognition (OCR) of the 14,785 pages of the Słownik Geograficzny and storage in text format
  • Preparation of a PL => D dictionary (from 1879; 998 pages), free for collaborators

CompGen

(Verein für Computergenealogie e. V.)

  • Supply of the database import/export modules
  • Co-coordination of the project

FGG (Forschungsgruppe Grafschaft Glatz)

http://www.genealogienetz.de/vereine/AGoFF/fst/fgr_glat.htm

  • Contribution of a Polish => German translation software for pre-translations


Requirements

For an efficient and rapid translation of the enormous volume of data into German (about 250 MB formatted text; non-formatted about 70-80 MB), the text will be transferred into a temporary database.

Pages from this temporary database can then be sent, together with a short work instruction, to interested researchers proficient in Polish and German (see below: Example of collaboration with translators), who then can translate (at leisure, i.e. offline) pages of their own interest. [1]

According to an estimate, after just one month's work on word-by-word replacement (Polish word by German word; about 200 abbreviations and about 300-400 standard terms, e.g. railway station, post office, church, inhabitants etc.) within the temporary database, about 40 % of the text becomes understandable to a German reader who has no knowledge of Polish.
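
The word-by-word replacement described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions, not the project's actual software: the function name `replace_terms`, the four-entry `SAMPLE_GLOSSARY` and the sample entry are hypothetical, and the real replacement lists of the project are not reproduced here. The sketch replaces whole terms only and returns a count per term, so that each replacement round stays documented.

```python
import re

# Hypothetical mini glossary (Polish => German), for illustration only.
SAMPLE_GLOSSARY = {
    "wieś": "Dorf",
    "kościół": "Kirche",
    "poczta": "Post",
    "pow.": "Kr.",
}

def replace_terms(text, glossary):
    """Replace whole Polish terms by their German equivalents.

    Returns the new text and a log {Polish term: number of replacements}.
    """
    log = {}
    # Longest terms first, so a longer phrase wins over a term it contains.
    for polish in sorted(glossary, key=len, reverse=True):
        german = glossary[polish]
        # Lookarounds instead of \b, so abbreviations ending in "." also match
        # and a term inside a longer word (e.g. "wieśniak") is left alone.
        pattern = re.compile(r"(?<!\w)" + re.escape(polish) + r"(?!\w)")
        text, count = pattern.subn(lambda m, g=german: g, text)
        if count:
            log[polish] = count
    return text, log

# Invented sample entry, not a quotation from the Słownik Geograficzny.
sample = "Wola, wieś, pow. łódzki; kościół, poczta."
translated, log = replace_terms(sample, SAMPLE_GLOSSARY)
print(translated)
print(log)
```

The matching is case-sensitive and the terms are applied one after the other, so a real list would have to be checked for Polish terms that also occur in already inserted German text.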


Schedule

The project comprises the following steps [ green = achieved or 'under construction', red = planned ]

  • Scanning of the 14,785 pages of the Słownik Geograficzny and publication of the CD-ROM (completed by the PGSA in 2003).
  • OCR (optical character recognition) of the 14,785 graphic files of the Słownik Geograficzny and storage of the text files in CP 1250 format (completed by Hic Leones in Sept. 2005).
  • Scanning and publication of a Polish => German dictionary (1879; 998 pages) as a PDF file, annotated by bookmarks for fast word access (completed by Hic Leones in Dec. 2006; included in the CompGen-CD 2006/2007).
  • Programming of the import/export modules of the temporary database (completion planned for middle of Feb. 2007).
  • Import of the 14,785 text files of the Słownik Geograficzny into the temporary database, including the specific Polish characters (middle of Feb. 2007).
  • Compilation of a Polish-German word-by-word list for translation, to be expanded continuously (since the beginning of Jan. 2007).
  • Summary of the project in a review article for the German journal 'Computergenealogie' (beginning of Feb. 2007), which was published at the end of March 2007.
  • Internally: Removal of systematic OCR errors, unnecessary hyphens and superfluous spaces; insertion of an empty line after every geographical entry (continuously, starting in the middle of Feb. 2007).
  • Internally: Documented Search/Replace of Polish abbreviations by corresponding German abbreviations; removal of systematic OCR errors; removal of residual hyphens etc. (continuously, starting from the middle/end of Feb. 2007).
  • Internally: Documented Search/Replace of additional (ca. 200 - 300) standard terms (continuously, starting in the middle/end of Feb. 2007).
  • Internally: Preparation of lists of geographical terms which follow defined terms of administrative districts ('Gub.' and 'Pow.'). This is done by using a parser program. These geographical terms are then sorted and replaced by the corresponding geographical standard term throughout the database (continuously since the middle of May 2007).
  • Internally: Documented Search/Replace of additional (ca. 800) standard terms (i.e. the basic vocabulary) of the Polish language (continuously since the middle of June 2007).
  • Internally: All geographical names following the term 'Kom.' (i.e. Komitat, an administrative unit in Hungary; found on about 1100 pages) have been standardized from the Polish to the Hungarian name (June/July 2007).
  • Internally: Formatting (one empty line between single entries) was completed after two rounds of parsing the whole text and manually correcting the potentially missing empty lines that were flagged (Sept. 2007).
  • We shall need many volunteers for the work on the translation of SlownikGeo. We plan: Announcement of the project in genealogical mailing lists (beginning of 2008) to attract external collaborators, and repetition of these announcements, with short progress reports, about every 2 months.
  • Compilation of a glossary of Polish terms that will not be translated, but will be explained in a separate note (collaboration with Prof. Eichler, Leipzig) (end of May 2007).
  • Corresponding announcements in genealogical journals (continuously, starting at the end of May 2007).
  • Incorporation of translated and corrected pages into GOV and Hic Leones: As soon as a page has been completed, it will be labelled (and protected against additional modifications) and transferred to GOV collaborators, who will enter the geographical descriptions into the GOV database (continuously, starting in Mar. 2007).
  • 'Mission accomplished' for the SlownikGeo Project (cautious estimate): A.D. 2011.
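
As an illustration of the parser step in the schedule above (listing the geographical terms that follow 'Gub.' and 'Pow.'), here is a minimal Python sketch. It is not the project's parser program: the name `collect_district_terms`, the pattern `DISTRICT` and the sample text are hypothetical, and the sketch assumes that the district name is the single word directly after the marker, which will not hold for every entry.

```python
import re
from collections import Counter

# Assumption: the district name is the single word directly after the marker.
DISTRICT = re.compile(r"(?<!\w)(gub|pow)\.\s+(\w+)", re.IGNORECASE)

def collect_district_terms(text):
    """Count every (marker, name) pair, e.g. ('pow', 'łódzki'), and sort the result."""
    counts = Counter()
    for marker, name in DISTRICT.findall(text):
        counts[(marker.lower(), name)] += 1
    return sorted(counts.items())

# Invented sample text, not a quotation from the Słownik Geograficzny.
sample = "Wola, pow. łódzki, gub. piotrkowska. Ruda, pow. łódzki."
for (marker, name), n in collect_district_terms(sample):
    print(marker, name, n)
```

A sorted list of this kind makes spelling variants of the same district name appear next to each other, which is the precondition for replacing them by one standard term.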

It took the Polish authors 22 years to research, document, correct, and publish this encyclopaedic series. Using modern technology, it should be possible to translate and incorporate the data in a quarter of this time, i.e. 5 - 6 years ...

Example of Collaboration with Translators

  • Who can participate?
    • For the translation, we are looking for motivated collaborators who are proficient in Polish and German. If we are informed about preferences for specific regions or capitals, we intend to honor these choices (if possible) in the selection of the pages to be sent.
    • For the incorporation of data into GOV, proficiency in Polish will not be required: here, anybody who is interested can contribute!
  • Collaborators (proficient in Polish and German) signal their interest to participate (contact addresses, see below). Collaborators will receive (by e-mail):
    • the TEXT pages of their choice (let us say, 10) in simple text format (including a special font for the foreign characters), and
    • the corresponding original 10 GRAPHIC pages of the Słownik Geograficzny (JPG files) as controls,
    • a short work instruction (in German),
    • an Excel data file documenting the status of the word-by-word replacements so far established,
    • the PDF-file of the Polish => German dictionary of 1879, with bookmarks for fast navigation (either from the CompGen-CD 2006/2007 or by download).
  • The collaborator translates the pages sent. If certain words (standard terms) appear to be suitable for global replacement in the temporary database, these can be entered into the Excel data file (the Polish word and its German equivalent) and submitted to the project coordinators.[2] Translators are asked to send back the 10 corrected text files and the Excel data file to the coordinator, who will re-import the pages into the temporary database and mark them for further processing.


Do you have any questions? Are you interested in participating?

Please contact:

Dr. Hanno V. J. Kolbe (Coordinator)
6, rue des Tuiliers
67204 Achenheim/France
E-Mail: kolbe@hicleones.com

Peter Lingnau (Co-Coordinator, GOV)
Spicherer Str. 43
86157 Augsburg/Germany
E-Mail: PeterLingnau(a)yahoo.de

Notes

  1. As a contemporary tool for the translation, a very rich Polish => German dictionary (1879; 998 pages) was scanned and thoroughly bookmarked alphabetically. Every interested collaborator will receive it free of charge (a PDF file is included on the CompGen-CD 2006/2007). In addition, the FGG (Forschungsgruppe Grafschaft Glatz) contributed a Polish <=> German translation program to the project coordinator, to evaluate the use of rapid pre-translations (the limitations of translation programs are well known, but the Słownik Geograficzny contains listings of statistical data and is not a philosophical thesis or a compilation of poetry ...).
  2. In this way, collaborators can transfer their experience in a controlled way (via the Excel data file and the project coordinators) to all pages; as a result, the initial quality (= percentage of German text) of the not yet completely translated text will increase and the translation time per page will diminish (i.e. translation speed increases). Completed pages will be made available for incorporation into GOV, GenWiki and Hic Leones.