This is another retrodigitization of the second edition of Linde's dictionary, in the form of a two-layer (text and scans) DjVu corpus and the Poliqarp search engine with two-level regular expressions. It is supplemented by a preliminary index which allows entries to be found using *a fronte* and *a tergo* lists.
This version of the corpus has been available since September 2016. It contains about 7 million tokens representing about 5000 pages of the dictionary.
The recommended way to acknowledge the use of this retrodigitization and the search engine in scientific publications is to cite the paper Efficient search in hidden text of large DjVu documents or Skanowane teksty jako korpusy. There are also some relevant presentations, such as An incremental approach to retrodigitization and Przyrostowa metoda dygitalizacji słowników.
The recommended way to acknowledge the use of the index is to cite the presentation Elektroniczny indeks do słownika Lindego.
The recommended way to use the search engine and the index is the djview4poliqarp remote client, available on all major platforms, preferably together with the djview4 viewer, which supports the outlines included in the dictionary files.
The scans were made with a Fujitsu fi-6130 scanner and the scanhelper software by Joanna A. Bilińska, who also improved the scans using Scan Tailor. They were converted to the DjVu format by Janusz S. Bień with the didjvu program. OCR was prepared by Janusz S. Bień with ocrodjvu using the Tesseract engine. Various adjustments of the OCR results have been made by Janusz S. Bień and Michał Rudolf. The corpus was created and published online by Michał Rudolf.
Search can be limited to a specific volume using metadata, e.g. meta volume=6. The meta clause can also be used to limit the search using other metadata fields, but it's not practical.
Every token has the following attributes:
orth, the text segment according to Unicode® Standard Annex #29 Unicode Text Segmentation;
base, the same as orth;
lang, possible values: de (rather reliable), fr (rather reliable), pl (all other languages including Polish), ru (not yet used), en (not used);
script, possible values: latn (normal Latin script and misrecognized other script), latf (German Fraktur), cyrl (not yet used);
series, possible values, rather unreliable: medium, bold;
shape, possible values, very unreliable: upright, italic;
wconf, word recognition confidence; the value is a digit, and you can query it with regular expressions such as [8-9] (meaning highly confident recognition); this representation was proposed by Jakub Wilk.
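
The attributes and the meta clause described above can be combined into a single query. The following is a minimal sketch in Python that only assembles a query string; it does not contact the search engine. The helper name token_query is hypothetical, and the bracketed attr="value" syntax joined with & is assumed from general Poliqarp usage rather than confirmed for this corpus. The allowed values are those listed above.

```python
# Values copied from the attribute list; wconf is left unchecked because
# it is queried with a regular expression such as [8-9].
ALLOWED = {
    "lang": {"de", "fr", "pl", "ru", "en"},
    "script": {"latn", "latf", "cyrl"},
    "series": {"medium", "bold"},
    "shape": {"upright", "italic"},
}


def token_query(volume=None, **attrs):
    """Build a one-token query string (hypothetical helper).

    Assumed syntax: [attr="value" & attr="value"] optionally followed
    by " meta volume=N" to restrict the search to one volume.
    """
    conditions = []
    for name, value in attrs.items():
        # Reject values that the attribute list does not mention.
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(f"unexpected value {value!r} for {name}")
        conditions.append(f'{name}="{value}"')
    query = "[" + " & ".join(conditions) + "]"
    if volume is not None:
        query += f" meta volume={volume}"
    return query


# Confidently recognized German tokens in Fraktur, volume 6 only.
print(token_query(lang="de", script="latf", wconf="[8-9]", volume=6))
```

Since series and shape are described as unreliable, a query of this kind is probably best restricted to lang, script and wconf.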