Historical newspaper archive research at OMILab

OMILab's project on Historical Newspaper Archive Research is run in collaboration with the National Library of Israel, which provided access to selected image and OCR output files at the back end of JPRESS, the Historical Jewish Press collection of Tel Aviv University and the National Library of Israel. As a pilot study we chose HaZefirah, which started in 1862 in Warsaw as a weekly publication and in 1886 became the first daily newspaper in the Hebrew language. Over the decades of its publication it changed editorship, places of publication, formats, genres and ideologies, and it therefore makes a fascinating case for ‘distant reading’ approaches.
Worldwide, large historical newspaper digitization projects provide oceans of valuable historical periodical data. Much of this data, however, is available only on platforms that provide mediocre OCR, which hinders the application of text-analytical methods such as distant reading, NLP, NER and various forms of semantic processing.
In the first phase we used machine-learning-based tools on the Transkribus platform to train a model for historical Hebrew print, significantly improving both line detection and character recognition. Our next challenge was to improve the OCR without losing the valuable work that had already been done to analyze the layout and content structure of the newspapers. For this we created an open workflow that migrates the legacy segmentation data into the open PAGE format, on which the improved text recognition technologies can run, and then outputs the data as a TEI-XML encoded and enriched corpus.
Creating an OCR Ground Truth and training Hebrew OCR with Transkribus
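The migration step described above, moving legacy layout segmentation into PAGE so that new text recognition can run on the existing regions, can be illustrated with a minimal sketch. This is not the project's code (that is in the repository linked on this page). The function name `legacy_to_page`, the shape of the `regions` input and the choice of the 2013-07-15 PAGE schema namespace are assumptions made for illustration only, and a schema-valid PAGE file needs further elements, such as Metadata, that are omitted here.

```python
import xml.etree.ElementTree as ET

# PAGE content namespace; the 2013-07-15 schema version is an assumption.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"


def legacy_to_page(image_filename, width, height, regions):
    """Build a minimal PAGE XML tree from legacy layout regions.

    `regions` is a hypothetical input format: a list of dicts with the keys
    "id" (str) and "box" (left, top, right, bottom) in image pixels.
    """
    ET.register_namespace("", PAGE_NS)
    root = ET.Element(f"{{{PAGE_NS}}}PcGts")
    page = ET.SubElement(
        root,
        f"{{{PAGE_NS}}}Page",
        {
            "imageFilename": image_filename,
            "imageWidth": str(width),
            "imageHeight": str(height),
        },
    )
    for region in regions:
        left, top, right, bottom = region["box"]
        text_region = ET.SubElement(
            page, f"{{{PAGE_NS}}}TextRegion", {"id": region["id"]}
        )
        # PAGE stores a region outline as a polygon of "x,y" pairs;
        # a rectangle becomes its four corners, clockwise from top-left.
        points = f"{left},{top} {right},{top} {right},{bottom} {left},{bottom}"
        ET.SubElement(text_region, f"{{{PAGE_NS}}}Coords", {"points": points})
    return ET.ElementTree(root)


if __name__ == "__main__":
    # Toy input, not real HaZefirah data.
    tree = legacy_to_page(
        "page_0001.jpg", 2000, 3000,
        [{"id": "r1", "box": (100, 150, 900, 700)}],
    )
    print(ET.tostring(tree.getroot(), encoding="unicode"))
```

Keeping the legacy region identifiers as PAGE `id` attributes is one way to let the re-recognized text be matched back to the original article structure later on.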
The project will not only enable thorough digital research of the fascinating corpus of HaZefirah, but will also serve as a proof-of-concept study for any future endeavor to salvage the 315 other titles, comprising over two million pages, in the Historical Jewish Press collection of JPRESS. The workflow can also be adopted by the hundreds of digital collections that are in a similar condition.

The open code of the workflow is available here: https://github.com/omilab/historical_press

The resulting texts are now available through the He-Story search engine.

We are currently experimenting with several lines of analysis: named entity extraction, geo-temporal examination, keyword analysis over time and topic modelling.
Visualizing topics and keywords in HaZefirah
Comparative mapping of geographic references over time: 1882 (left map) and 1883 (right map).
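As an illustration of what keyword analysis over time involves, the following sketch computes how often a keyword occurs per 1,000 tokens in each year. It is a toy example under stated assumptions rather than the method used in the project: the function name `keyword_frequency_by_year` and the `(year, text)` input format are hypothetical, and tokens are split on whitespace only, which ignores the prefixes and inflections that a real analysis of Hebrew text would have to handle.

```python
from collections import defaultdict


def keyword_frequency_by_year(documents, keyword):
    """Return {year: occurrences of `keyword` per 1,000 tokens}.

    `documents` is a hypothetical input: an iterable of (year, text) pairs.
    Tokenization is plain whitespace splitting (a simplifying assumption).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in documents:
        tokens = text.split()
        totals[year] += len(tokens)
        hits[year] += sum(1 for token in tokens if token == keyword)
    # Normalize by yearly token count so that years with more
    # surviving pages do not dominate the comparison.
    return {
        year: 1000 * hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }


if __name__ == "__main__":
    # Made-up placeholder texts, not taken from the corpus.
    toy_documents = [
        (1882, "warsaw news from warsaw today"),
        (1883, "news from paris and warsaw and london"),
    ]
    print(keyword_frequency_by_year(toy_documents, "warsaw"))
```

Normalizing by the number of tokens per year matters for a newspaper that moved from weekly to daily publication, since raw counts would mostly reflect the growing volume of text.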

Team members

Presentations and publications

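זף סגל, "דבר אל סופרינו: בעיית חדשות הכזב בדיווחים של כתבי העיתון הצפירה, 1874", קשר 52 (2019). [In Hebrew. English translation of the entry: Zef Segal, "A Word to Our Writers: The Problem of Fake News in the Reports of HaZefirah's Correspondents, 1874," Kesher 52 (2019).]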
Soffer O., Segal Z., Greidinger N., Rusinek S., Silber-Varod V. (2019). Computational Analysis of Historical Hebrew Newspapers: Proof of Concept. Zutot.
Zef Segal, "'From one end of the Earth […] unto the other end of the Earth': Changing perceptions of the world in late-nineteenth-century Jewish journalism," in AJS 50, Boston, December 2018.
Zef Segal, Vered Silber-Varod, Nurit Greidingher, Oren Soffer, "Computational Analysis of Historical Hebrew Newspapers: Affordances and Restrictions," in JJCHC2019, New York, March 2019.
Rusinek, S. & Greidingher, N. (2019). No Tabula Rasa - Digitizing Historical Newspapers here and now. DATECH2019 - Digital Access to Textual Cultural Heritage, Brussels, April 2019.
Zef Segal, "The Periodical as a geographical space," Teldan 34, Tel Aviv, May 2019.
Zef Segal, "The periodical as a geographical space: the 19th century Hebrew “HaZefirah” as a case study," in the framework of a special panel organized by Cliff Wulfmann and Sinai Rusinek: "Complexities in the Use, Analysis, and Representation of Historical Digital Periodicals," DH 2019, Utrecht, July 2019.

Zef Segal, "Putting the World on Paper: maps and enlightenment in early Hebrew journals," in ICHC 2019, Amsterdam, July 2019.