Wikisource Loves Manuscripts
Background
In 2020–21, the Wikimedia Foundation funded two projects that helped create a new Wikisource in the Balinese language of Indonesia. One of the projects focused on creating the technology to support the transcription of hand-written palm-leaf manuscripts on Wikisource while the other focused on scanning and digitizing more manuscripts from archives and individual collectors. We believe this is a replicable strategy for engaging with culture and heritage.
Wikisource Loves Manuscripts pilot in Indonesia
Pusat Pengkajian Islam dan Masyarakat (PPIM), a research institute based in Jakarta, will lead the Wikisource Loves Manuscripts pilot in Indonesia in collaboration with Wikimedia Indonesia and the community-led WikiLontar project, with the support of the Wikimedia Foundation.
Regions
The project will focus on manuscript rescue on three islands: Bali, Java and Sumatra. Manuscripts from these islands are richly diverse in language, script, writing support, and text content.
Timeline
- October to December 2022 – Project planning and announcement
- January to March 2023 – First preservation mission & proofread-a-thon
- April to June 2023 – Second preservation mission & proofread-a-thon
- July to September 2023 – Third preservation mission & proofread-a-thon
- November to December 2023 – Program extension & reports
Primary Activities
Manuscript digitization
The core activity of this project is to digitize manuscript collections that are in danger of being damaged, whether they belong to individuals or to institutions (libraries, museums, etc.). Every page of each manuscript will be photographed or scanned, and the digital copy will be uploaded to Wikimedia Commons under a suitable Creative Commons license. Each manuscript bundle will be described with adequate metadata via Wikidata.
Wikisource proofread-a-thon
Manuscripts that have been uploaded to Wikimedia Commons and given metadata will then be transcribed on Wikisource. Volunteers will type each manuscript in the same script that the manuscript uses. For this reason, there will be an introduction to how Wikisource handles typing in non-Latin scripts. In the next stage, a competition will be held to transcribe the digitized manuscripts.
Transkribus pilot
Texts on Wikisource are transcribed through a mix of automated text recognition and community corrections. Good-quality Optical Character Recognition (OCR) allows contributors to focus on improving the quality of content through proofreading, rather than doing the full transcription manually. It is a prerequisite for scaling Wikisource projects. The Foundation’s CommTech team improved Wikisource by integrating two OCR engines, Google OCR and Tesseract. But many languages and documents are still not supported by high-quality on-wiki OCR, including the Balinese- and Javanese-language Wikisources that launched in 2021.
- Transkribus is an AI-powered text and handwriting recognition tool that can be used to create OCR models based on transcriptions on Wikisource. Our initial research found no other text and handwriting recognition tool that can be trained to support any language. There is also existing community demand (West Bengal Wikimedians) and partner engagement (British Library) with Transkribus.
- A team from IIIT Hyderabad with expertise in computer vision and applied machine learning will test the viability of Transkribus for the under-supported languages of South-East Asia. In the first phase of the pilot, we will use Balinese-language documents already transcribed by volunteers on Wikisource to build a new OCR model.
Updates
- January 2023
- February 2023
- March 2023
- April 2023
- May 2023
- June 2023
- July 2023
- August 2023
- September 2023
- October 2023
- November 2023
- December 2023
Team
PPIM
Wikimedia Indonesia
IIIT Hyderabad
Wikimedia Foundation