Corpus of the Catholic Bible

The corpus of the Catholic Bible, bible-rkc-1.0, was prepared in collaboration with the Slovak Bishops’ Conference. It was made available on 28 June 2023 and contains 796 704 tokens. It comprises the 73 books of the Old and New Testaments, with specific external annotation.

The corpus is lemmatized and morphologically annotated by the MorphoDiTa tagger, trained and tuned on the tagset developed by the SNK.

The following metadata have been added to all texts:

  • doc.name: book title, e.g. Genezis
  • doc.bogo: book title abbreviation, e.g. Gn
  • doc.bibliography: Book title. In: Sväté Písmo Starého i Nového zákona. Trnava: Spolok svätého Vojtecha 2003. 2623 p.
  • doc.genre: genre of the book, one of:
    • glett – general letters (Jak, 1 Pt, 2 Pt, 1 Jn, 2 Jn, 3 Jn, Júd, Zjv)
    • gospel – gospels and Acts of the Apostles (Mt, Mk, Lk, Jn, Sk)
    • hist – historical books (Joz, Sdc, Rút, 1 Sam, 2 Sam, 1 Kr, 2 Kr, 1 Krn, 2 Krn, Ezd, Neh, Tob, Jdt, Est)
    • pentateuch – 5 books of Moses (Gn, Ex, Lv, Nm, Dt)
    • plett – letters of Paul (Rim, 1 Kor, 2 Kor, Gal, Ef, Flp, Kol, 1 Sol, 2 Sol, 1 Tim, 2 Tim, Tit, Flm, Hebr)
    • proph – prophetic books (Iz, Jer, Nár, Bar, Ez, Dan, Oz, Joel, Am, Abd, Jon, Mich, Nah, Hab, Sof, Ag, Zach, Mal, 1 Mach, 2 Mach)
    • sapient – wisdom books (Jób, Ž, Prís, Kaz, Pies, Múd, Sir)
  • doc.type: testament the book belongs to, one of:
    • old – Old Testament books
    • new – New Testament books
  • verse.coord: verse coordinates, e.g. Gn_1_1 (Genesis, chapter 1, verse 1).
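
The metadata scheme above can be illustrated with a small lookup. This is only a minimal sketch, not part of the corpus tooling: the names GENRE_BOOKS, BOOK_TO_GENRE, doc_type and parse_verse_coord are hypothetical. The assignment of genres to old/new follows the standard division of the canon and is not stated explicitly in the list above. The sketch also assumes that verse.coord always ends in _chapter_verse, as in the single example Gn_1_1; how abbreviations containing a space (e.g. 1 Sam) are written in verse.coord is not documented here.

```python
# Genre labels and book abbreviations copied from the doc.genre list above.
GENRE_BOOKS = {
    "glett": ["Jak", "1 Pt", "2 Pt", "1 Jn", "2 Jn", "3 Jn", "Júd", "Zjv"],
    "gospel": ["Mt", "Mk", "Lk", "Jn", "Sk"],
    "hist": ["Joz", "Sdc", "Rút", "1 Sam", "2 Sam", "1 Kr", "2 Kr",
             "1 Krn", "2 Krn", "Ezd", "Neh", "Tob", "Jdt", "Est"],
    "pentateuch": ["Gn", "Ex", "Lv", "Nm", "Dt"],
    "plett": ["Rim", "1 Kor", "2 Kor", "Gal", "Ef", "Flp", "Kol",
              "1 Sol", "2 Sol", "1 Tim", "2 Tim", "Tit", "Flm", "Hebr"],
    "proph": ["Iz", "Jer", "Nár", "Bar", "Ez", "Dan", "Oz", "Joel", "Am",
              "Abd", "Jon", "Mich", "Nah", "Hab", "Sof", "Ag", "Zach", "Mal",
              "1 Mach", "2 Mach"],
    "sapient": ["Jób", "Ž", "Prís", "Kaz", "Pies", "Múd", "Sir"],
}

# Assumption: these three genres cover the New Testament books;
# every other genre is treated as Old Testament.
NEW_TESTAMENT_GENRES = {"glett", "gospel", "plett"}

# Reverse index: book abbreviation -> genre label.
BOOK_TO_GENRE = {
    book: genre for genre, books in GENRE_BOOKS.items() for book in books
}


def doc_type(genre: str) -> str:
    """Return the doc.type value ("old" or "new") implied by a genre label."""
    return "new" if genre in NEW_TESTAMENT_GENRES else "old"


def parse_verse_coord(coord: str) -> tuple[str, str, str]:
    """Split a verse.coord value into (book, chapter, verse).

    Assumes the last two underscore-separated fields are chapter and verse;
    everything before them is taken as the book abbreviation. Chapter and
    verse are kept as strings, since their exact format is not documented.
    """
    book, chapter, verse = coord.rsplit("_", 2)
    return book, chapter, verse


if __name__ == "__main__":
    book, chapter, verse = parse_verse_coord("Gn_1_1")
    genre = BOOK_TO_GENRE[book]
    print(book, chapter, verse, genre, doc_type(genre))
```

Splitting from the right means the book part is whatever precedes the final two fields, so the sketch does not need to know how multi-part abbreviations are encoded.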