Project Leads: Laurie Allen (formerly Penn Libraries), Samantha Blickhan (Adler Planetarium/Zooniverse), Laura Newman Eckstein (formerly Penn Libraries), Emily Esten (Penn Libraries), Arthur Kiron (Penn Libraries), Marina Rustow (Princeton University). Our full team is listed on the About page of the project website and in the Acknowledgements of this article.
providing functionality for the custom transcription interface (front-end) and classification process;
designing and creating the infrastructure for transcription;
implementing needed interface changes based on beta review feedback;
ensuring the site functions across various browsers;
sharing the project with their community base; and
maintaining the project long-term.
Penn Libraries was responsible for the data preparation, providing project content, research preparation (identifying researchers for content review, examining the data, and using results), volunteer preparation, and ongoing social engagement with the project community. Penn Libraries continues to maintain the day-to-day engagement with the site, exporting and processing the crowdsourced data for regular reviews by scholars and for future public use.
Scholars from the Princeton Geniza Lab at Princeton University; the e-Lijah Lab and the Centre for Interdisciplinary Research of the Cairo Genizah at the University of Haifa; the École Pratique des Hautes Études; and the University of California, Los Angeles contributed to the content development for the project’s classification structure. They suggested the necessary inputs and outputs for the classification and transcription phases, devising a workflow that is efficient for data production. They also produced exemplar documents, lists of tags, help text, and alphabets for Hebrew script. This helped to ensure that the data collected about each fragment was accurate and helpful for the transcription process. These scholars continue to play an active role in the project through online and in-person community engagement, reviewing crowdsourced transcription data for accuracy, and making use of crowdsourced data for related projects at their respective institutions.
Image partners provide the image files and metadata for Cairo Geniza fragments and have first access to the data. We worked closely with digital and/or Judaic Studies departments at each of the following institutions to facilitate image and metadata transfer for use in Zooniverse: Library of the Jewish Theological Seminary, the Genizah Research Unit at Cambridge University Library, the University of Manchester Library, the Bodleian Libraries at the University of Oxford, the National Library of Israel, and Columbia University Libraries.
Scribes of the Cairo Geniza is an international partnership led by the University of Pennsylvania Libraries and the Zooniverse, the world’s largest platform for online crowdsourced research. The project invites the public to help classify and transcribe fragments from the Cairo Geniza, a corpus of 350,000 fragments primarily from the 10th-13th centuries, found in a storeroom (or ‘geniza’) of the Ben Ezra synagogue in Fustat. This project has three primary goals:
Provide our community of citizen scientists opportunities to view and decipher Cairo Geniza fragments;
Contribute to the classification of fragments by script-type and content;
Produce classifications and transcriptions of the material to be available as open datasets for historians, linguists, and other scholars to reuse, republish, and communicate research findings back to the crowdsourcing community.
No proficiency in either Hebrew or Arabic is required to participate. This project operates on two primary workflows: Sorting and Transcription. In the Sorting workflow, volunteers answer a series of questions about a fragment regarding the text on the page and any visual characteristics. Once the fragment has been reviewed independently by five volunteers, it moves into a corresponding Transcription workflow based on a consensus regarding its language and difficulty. In the Transcription workflows, volunteers transcribe lines of a fragment independently of one another using a clickable on-screen keyboard. The results are aggregated to produce a single, “best” version.
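As a rough illustration (not the Zooniverse implementation), the routing step that moves a fragment from Sorting into a Transcription workflow can be sketched as a plurality vote over five independent responses; the function name and labels below are hypothetical:

```python
from collections import Counter

def route_subject(classifications, required=5):
    """Route a sorted fragment into a Transcription workflow.

    `classifications` is a list of (language, difficulty) tuples, one per
    independent volunteer. The subject is routed only once `required`
    responses have been collected; a simple plurality vote (ties broken
    arbitrarily in this sketch) picks the consensus language and difficulty.
    """
    if len(classifications) < required:
        return None  # stay in the Sorting workflow until consensus is possible
    language = Counter(lang for lang, _ in classifications).most_common(1)[0][0]
    difficulty = Counter(diff for _, diff in classifications).most_common(1)[0][0]
    return f"{difficulty} {language}"  # e.g. "Easy Hebrew"

votes = [("Hebrew", "Easy"), ("Hebrew", "Easy"), ("Hebrew", "Challenging"),
         ("Arabic", "Easy"), ("Hebrew", "Easy")]
print(route_subject(votes))  # "Easy Hebrew"
```

The real platform applies similar consensus logic through its retirement rules before a subject advances.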
This project and platform do not intend to recreate spaces for the preservation of and access to digital assets like the Geniza fragments. Each image partner has worked towards making, or has already made, this content accessible to the public through their respective institutional platforms. Similarly, projects specifically facilitating access to materials, metadata, and research related to Geniza fragments already exist. Instead, the strength of the Zooniverse platform is in its task-specific approach to enabling a collaborative and public research process. Using the Zooniverse platform, this project builds a community space for analysis and discussion around these fragments as digital objects.
As opposed to an archive or data repository, Zooniverse allows us to aggregate several distinct collections and ask a standard set of questions of those images. Volunteers can broadly and quickly identify features or similarities across a collection. Acknowledging that there will be a wide variety of responses, this provides our researchers with a starting point for further research. Volunteers interrogate individual images to a specific end, and are then given the freedom to develop their thoughts and responses further via discussions with other volunteers and researchers in the message board space (Talk). This in-depth engagement with content and metadata paves the way for discoveries that can then be documented in these other repositories.
Research on the Geniza has proceeded slowly and disconnectedly: most of the manuscripts are fragments; they are housed in sixty collections on four continents; and highly specialized skills are needed to make sense of them. But the use of digital technologies and a critical mass of specialists has brought us to the brink of a paradigm shift. Scribes of the Cairo Geniza has the potential to rewrite the history of the pre-modern Middle East, Mediterranean and Indian Ocean trade, and the Jewish diaspora. We hope that this project not only furthers scholarly research, but also raises awareness of the Geniza and the region’s history for a wider public audience.
Scribes of the Cairo Geniza is unusual not only as a non-English digital humanities project (though the number of such projects is steadily growing in the field at large), but also as a multilingual undertaking in the crowdsourcing community. This project is one of the first custom projects on the Zooniverse platform to address right-to-left languages, incorporating Hebrew and Arabic translations within the project interface and data output.
This project presented a unique set of challenges for both Penn Libraries and Zooniverse, related to the project’s content as well as its scale: How do you design and develop a project with basic outcomes that can support a wide variety of research goals? How do you design a transcription interface for a project that features multiple scripts, languages, genres, and time periods? Furthermore, how do you ensure such a project is accessible to a variety of users, from beginners to enthusiasts to experts? By creating a project with multiple pathways to entry, we can ensure that volunteers feel confident in their ability to participate both in the task and in the ongoing Talk discussions.
We hope that Scribes of the Cairo Geniza can serve as a resource for enabling digital open access research projects. Our collaboration provides one model for teams interested in approaching open data initiatives when working in multi-institution partnerships, exploring the ethics of crowdsourcing as a research method, building and managing a multilingual user community, and using large amounts of text-based data to further research.
When did you begin this project? When did you complete this project?
Time Span: October 1, 2016 - present
Length: 4+ years
What is the outcome of the project?
As of September 2020, over 9,500 citizen scientists from 27 different countries have participated in the project. We have hosted two transcribe-a-thons, given several community presentations, and piloted educational programming in schools in Haifa, Israel, exposing students to the historical story behind the Cairo Geniza while introducing them to crowdsourcing as a method for historical research.
Through these activities, we aim to expose the younger generation to the tools and possibilities of historical research, making use of digital and technological capabilities developed in recent years. As the style of historical research changes, students encounter both these innovative tools and the historical world documented in the project.
68,434 pages (an estimated 20 percent of the total Geniza) are visible within the project interface, bringing together collections from six different institutions. We have completed classification for the initial sorting phase, which began on 8 August 2017 and was completed on 8 February 2019. In this phase, citizen scientists sorted 40,109 subjects from the University of Pennsylvania, the Library of the Jewish Theological Seminary, and the Taylor-Schechter Genizah Research Unit at Cambridge University. This data is publicly available for download, along with a summary analysis.
Since 6 March 2019, we have been actively transcribing fragments. As of September 2020, 247 Easy Arabic fragments and 139 Easy Hebrew fragments have been transcribed. Early transcription results of the project are being used by partners at the University of Haifa as part of their ongoing work to apply handwritten text recognition to medieval Hebrew texts.
What tools, resources, programs, or equipment did you use for this project?
Zooniverse, the world’s largest and most popular platform for people-powered research, used for hosting the project. (http://www.zooniverse.org)
Zooniverse Project Builder, Zooniverse’s free software for hosting project data and workflows. (http://www.zooniverse.org/lab)
Custom Transcription Interface, built by Zooniverse developers for annotating Cairo Geniza fragments using custom Hebrew and Arabic keyboards. (https://github.com/zooniverse/scribes-of-the-cairo-geniza)
Currently, this is not a full-time project for any one individual on the core team, although the co-authors devote 30% of their time to the project.
Our core team involves eight members, including a project manager, three content experts with various Geniza interests, one Zooniverse lead, two associate university librarians, and a curator. The project development and launch periods, in particular, were more time- and labor-intensive, and involved many more people, including two Zooniverse web developers and a designer, all of whom contributed full-time effort to the project during these periods.
Throughout the project’s lifecycle, we have relied on support from several undergraduate students, graduate students, and research assistants for translation, software development, outreach, and engagement. We have also reached out to scholars, librarians, developers, and digital humanists within and outside of our respective institutions for task-specific support related to data collection and management, outreach, and translation. Finally, as part of our community engagement efforts, we have supported volunteer moderators, who play a key role in monitoring discussion on Talk, identifying where the project team’s attention is needed, and fostering user engagement.
Please describe any costs incurred for this project, and (if relevant) how you secured funding for these costs.
A National Leadership Grant from the Institute of Museum and Library Services was awarded to the Adler Planetarium to fund Zooniverse researcher and developer time for the custom transcription interface. Translations for the trilingual interface were produced by graduate student employees at Penn, Princeton, and the University of Haifa, as part of respective departmental budgets.
Please give an overview of the workflow or process you followed to execute this project, including time estimates where possible.
OCTOBER 2016: In October 2016, Penn Libraries partnered with the Princeton Geniza Lab to develop a gamified transcription tool for Cairo Geniza fragments, in order to produce openly available transcriptions for broad reuse.
DECEMBER 2016: In December 2016, the Zooniverse received a National Leadership Grant from the Institute of Museum and Library Services, to support the development of library and archive Zooniverse projects that explore improvements to full text transcription and image annotation crowdsourcing tools. The team put out a call for project proposals in early 2017, and the Penn Libraries team submitted a proposal to develop a crowdsourced transcription project based around the Cairo Geniza. On 29 March 2017, the Zooniverse team reached out to confirm that the Penn proposal had been selected for development.
MAY 2017: In May 2017, we held an all-day meeting (both in-person and virtual) with our project team to introduce one another and define the project layout. In this meeting, we realized that the type of data being transcribed was so diverse that we needed to implement some manner of sorting, allowing transcribers a certain amount of choice as to which types of fragments they wanted to transcribe (or, more importantly, felt comfortable transcribing). After discussion, we ultimately chose to crowdsource the Sorting process, adding a ‘first pass’ workflow, built using the Zooniverse Project Builder, that volunteers could work on while the custom transcription interface was being built. The Sorting tasks would ultimately be based around how we wanted to separate the fragments for transcription.
Additionally, we brainstormed potential goals for sorting and transcription, discussed how to manage logistics with existing and future projects, defined project scope, designed workflows based on research questions, addressed post-launch marketing efforts, and developed a timeline.
Our initial plan was to have the project content approved by the end of June 2017, to conduct beta testing and revision in July 2017, and to roll out Phase 1 (Sorting) in August 2017.
JUNE 2017: We had another meeting in June to finalize the research goals and make final decisions about which additional transcription resources we needed to include in the custom transcription interface. We specifically looked at other projects on the Zooniverse platform, including AnnoTate, Shakespeare’s World, and Galaxy Zoo, to review interface differences and educational uses of the project and content.
We decided that the best way to engage volunteers was to separate the effort into major workflows for volunteers to classify and transcribe. In the classification workflow, volunteers identify the script (Hebrew or Arabic) and the script type (called Formal or Informal, though these are not authoritative paleographic distinctions), as well as various visual characteristics of interest to scholars. By breaking down the fragments into Formal or Informal script types, we can further allow volunteers to identify fragments that are difficult to transcribe, creating an Easy/Challenging workflow distinction.
Additionally, we set up tasks for our four sub-teams:
a statement of work from Zooniverse for the custom transcription interface;
our content experts finding Arabic and Hebrew translators for social media and marketing materials, hashtags for the Talk boards, and engaging graduate students as part of the crowdsourcing community;
our tech team preparing images for upload into Zooniverse for sorting;
and the research team thinking of promotional spaces for the project.
JULY 2017: In July, an email was sent to approximately 30,000 Zooniverse beta reviewers, soliciting feedback on the beta version of the Sorting workflow. We specifically looked for feedback regarding instructional content. We learned that restructuring the help and tutorial texts, providing more information about the overall goals of the project, and including some translated examples helped ease volunteers into participation. Following the beta review and implementation of feedback, we planned to roll out the Sorting phase by August 1.
AUGUST 2017: We officially launched our project on August 8, 2017. In the first month, we had 62,915 sorting classifications from 3,062 volunteers, and had sorted/retired 128 subjects. After the project launch, we re-grouped in late August to discuss some initial results. We reviewed frequently asked questions by users about fragments, putting together some basic responses for our team on the Talk boards. We brainstormed how to engage students in the humanities as well as people other than academics. Finally, we planned to send monthly data exports to the project team.
SEPTEMBER 2017: Separate from the Scribes of the Cairo Geniza effort, Zooniverse developers had been working on a translation interface for the Project Builder, which we decided to implement for the Sorting workflow. We also met in September to plan the Transcription workflows (Easy Hebrew, Challenging Hebrew, Easy Arabic, and Challenging Arabic), in which volunteers transcribe a fragment line by line using an on-screen keyboard.
In this meeting, we specifically discussed some of the issues for transcription (reference alphabets, keyboards, and how to transcribe other transcriptions), the content that we needed to prepare for transcription, and the research goals/use cases for transcription data.
JANUARY 2018: By early 2018, we had completed the help and interface text for the transcription interface, the custom keyboards for Hebrew and Arabic transcription, and the phrase finder characteristics. We also discussed the different language “pathways” or workflows in which users could participate. As part of these pathways, we identified the phrase finder’s purpose, the data we would export, and the questions we could ask of the data.
For example, in planning a “phrase finder” workflow, we wanted to identify keywords to serve as a laboratory for experimentation with our data. The data we exported would be in the form of keywords attached to particular subjects to help in sorting and testing. These keywords might tell us that a fragment is a documentary text (important for researchers) and, in certain cases, could tell us the sub-genre of the documentary text. This workflow would also answer questions like, “did the identification of diagonal/perpendicular writing in the margins in phase 1 really help sort out documentary from non-documentary texts?”
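To make the idea concrete, a keyword-to-genre lookup of this kind might look like the following sketch; the keyword lists, genre labels, and function name are invented for illustration and are not the project’s actual vocabulary:

```python
# Hypothetical keyword lists -- placeholders, not the project's real phrase sets.
GENRE_KEYWORDS = {
    "legal document": {"witness", "court", "deed"},
    "letter": {"greetings", "peace", "your servant"},
}

def genre_hints(found_phrases):
    """Map phrases volunteers marked on a fragment to candidate sub-genres."""
    found = {p.lower() for p in found_phrases}
    # A genre is suggested if any of its keywords were found on the fragment.
    return sorted(genre for genre, kws in GENRE_KEYWORDS.items() if kws & found)

print(genre_hints(["Witness", "deed"]))  # ['legal document']
```

Keywords attached to subjects in this way could then be checked against the Phase 1 sorting data, for instance to test whether marginal writing really predicts documentary texts.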
MAY 2018: By May 2018, the bulk of the fragments we started with had been sorted. We made minor changes to the sorting workflow regarding additional tagging questions for visual characteristics.
AUGUST 2018: In August 2018, we performed beta testing on the easy transcription workflows, receiving about 30 responses. We asked reviewers about the support text, project goals, and how users moved through the workflow.
Due to staffing changes on our project team, we held off on launching Phase 2 until we had a designated project manager. During this pause, we continued to review data coming out of the sorting workflow and to communicate with volunteers on the Talk boards.
JANUARY 2019: In January 2019, we were ready as a team to move forward with the Phase 2 launch. Similar to the pre-launch meetings for the sorting workflow, we assessed our ability to engage new volunteers, participate on the Talk boards, regularly update volunteers on development progress, and promote the project to scholarly and public audiences.
MARCH 2019: We launched the transcription workflows in March 2019, starting with Easy Hebrew and Easy Arabic. This process involves communicating with the content specialists to review early transcription results for data quality, identifying whether transcription behaviors reflect the instructions, and making data-driven decisions about the metrics we’re using for retirement.
Since launching the transcription workflows in March 2019, we have been working as a project team on a substantial amount of review and restructuring, including:
reviewing the Phase 1 data to understand what we’ve collected;
receiving feedback on the Easy transcription workflows;
engaging with our community, conducting surveys, and hosting events;
and beta testing our phrase-finding workflow.
We still have three workflows to launch (Hebrew Phrase Finder, Challenging Hebrew Transcription, and Challenging Arabic Transcription). Our current work structure is organized in the following way:
Zooniverse hosts the project, updating the custom front-end display and data exports as necessary.
Penn handles the day-to-day management of the project: monitoring Talk boards, responding to volunteer and partner questions, uploading new materials to the site, and reviewing and processing data.
On a monthly basis, data is exported from Zooniverse and processed for content review.
Project partners review project data and transcription data for accuracy, and make use of data produced for related projects at their respective institutions. They also respond to questions on the Talk boards when tagged.
All project partners host regular events related to the project, whether in-class engagement, transcribe-a-thons, or educational programs.
The project team in full meets quarterly to discuss updates, data review, and outstanding issues.
Additional partners (image partners, researchers, volunteers) receive periodic updates about the project’s progress.
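As one illustration of the monthly export step above: Zooniverse classification exports are CSV files whose `annotations` column contains JSON, so a first-pass processing script might look like the following sketch (the column names follow the standard export format; the per-workflow summary computed here is illustrative, not our actual review pipeline):

```python
import csv
import json
from collections import defaultdict

def summarize_export(path):
    """Count non-empty classifications per workflow in a Zooniverse export.

    Each row of a classification export is one volunteer response; the
    `annotations` column holds a JSON list of task/value pairs, and
    `workflow_name` identifies the workflow the response belongs to.
    """
    counts = defaultdict(int)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if json.loads(row["annotations"]):  # skip empty classifications
                counts[row["workflow_name"]] += 1
    return dict(counts)
```

A summary like this gives content reviewers a quick check on activity levels before they dig into the transcription data itself.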
What, if anything, changed between beginning your project and its current/final form?
While we initially proposed a gamified transcription tool for the fragments, working within an established platform alongside an experienced technical team provided strong expertise in crowdsourcing project management. It also helped bring together a variety of researchers and experts (both content and technical) in a way that we may not have been able to do on another platform.
Much of our pathway development towards transcription, as well as the current content available for sorting, transcription, and phrase finding, was iterative. We knew we wanted transcription data as a project outcome, but the breakdown into smaller goals and pathways came out of much discussion among our project team.
Our user testing and feedback has been crucial to the project’s development. This has included beta testing, testing out transcription in classrooms, transcribe-a-thons, and responding to volunteer questions.
Is there anything specific you wish you had known when beginning your project that might help other people to know?
When we first started working on Scribes, we weren’t aware of how many image partners we would ultimately end up working with. Had we known then that we would have more than 65,000 fragments in the project, we would have broken up the data set into smaller pieces. Luckily, we’re able to do this retroactively using some of Zooniverse’s tools for automation, but our advice to others would be to work on data in groups, if possible — it allows you to move data through the entire pipeline more quickly, meaning research team members can start examining results while the transcription process is ongoing. You can then give feedback to your volunteer transcribers based on the results, creating a feedback loop that benefits the project as a whole.
Do you have any plans to follow up on this project or work on something similar in the future?
At Penn Libraries, we hope to continue partnerships on large-scale projects that make use of digitized, high-resolution images of library collections. We also plan to consider how crowdsourcing can complement efforts by the library in other projects. The project has been a key example for exploring how our collections root history in people’s everyday lives. Bringing medieval manuscripts, modern technology, and people together for the advancement of learning at Penn and around the world continues to be a key goal of our library, and we can’t wait to see what discoveries can be made with this data or through other initiatives.
For Zooniverse, this project was one of our first custom projects to be translated into Hebrew and Arabic, which gave us a lot of insight into the ways that we were unintentionally prioritizing LTR languages, even with the availability of translations.zooniverse.org. The project has helped us to engage with more project builders from Hebrew- and Arabic-speaking communities who are now building projects of their own. This is helping us to grow our international community and to offer more project opportunities for volunteers who read Arabic and Hebrew, as well as to examine our own practices for supporting non-English languages on the platform.
The Princeton Geniza Lab at Princeton University will make use of the consensus transcriptions towards their goal of a technological infrastructure that links images, transcriptions, translations, and previous research materials in mapping the entire documentary Geniza corpus.
The e-Lijah Lab at the University of Haifa is using initial transcription data as part of their ongoing work to apply handwritten text recognition to medieval Hebrew texts. During a 2019 hack-a-thon at the university, participants developed a prototype for automatic identification and cataloging of Geniza fragments. This prototype uses consensus transcriptions of literary Geniza fragments to quickly align with matches in Sefaria, an open-source library of Jewish texts and their interconnections, in Hebrew and in translation.
At Penn Libraries, the project team hopes that all the data produced through Scribes of the Cairo Geniza will be the basis for a future database of the project. We also hope to use this data for text analysis projects and advancing linked data technologies for manuscript researchers.
Eckstein, Laura Newman. “Of Scribes and Scripts: Citizen Science and the Cairo Geniza.” Manuscript Studies: A Journal of the Schoenberg Institute for Manuscript Studies 3, no. 1 (2018): 208–214.
Humanities for All. “A Typology of the Publicly Engaged Humanities.” National Humanities Alliance, 2018. Web.
Esten, Emily. “Cultivating Community with the Cairo Geniza”. Museum Computer Network, San Diego, California, Nov. 2019.
We maintain a collection on ScholarlyCommons, the University of Pennsylvania’s open access institutional repository, dedicated to supplementary resources and reports generated from the project:
Esten, Emily. “Reviewing Sorting Phase Data”. Scribes of the Cairo Geniza, Scholarly Commons, 2019. Web.
Esten, Emily. “Who are the #GenizaScribes: 2019-2020 Community Survey Report”. Scribes of the Cairo Geniza, Scholarly Commons, 2020. Web.
Esten, Emily. “Dataset: Scribes of the Cairo Geniza, Sorting Phase, August 2017 - February 2019”. Scribes of the Cairo Geniza, Scholarly Commons, 2020. Web.
We periodically post about the project on the JudaicaDH blog. A full list of press, publications, and presentations can be found on our website at https://judaicadh.library.upenn.edu/work/cairo-geniza/.
This project would not be possible without the contribution of Zooniverse volunteers.
The following people participated in the conception, design, development, launch, and/or running of Scribes of the Cairo Geniza:
Nicky Agate (Penn Libraries)
Laurie Allen (formerly Penn Libraries)
Samantha Blickhan (Adler Planetarium/Zooniverse)
Laura Newman Eckstein (formerly Penn Libraries)
Doug Emery (Penn Libraries)
Emily Esten (Penn Libraries)
Shimon Fogel (University of Haifa)
Mitch Fraas (Penn Libraries)
Will Granger (Adler Planetarium/Zooniverse)
Arthur Kiron (Penn Libraries)
Coleman Krawczyk (University of Portsmouth/Zooniverse)
Moshe Lavee (University of Haifa)
Vered Raziel Kretzmer (University of Haifa)
Eve Krakowski (Princeton University)
Will Noel (formerly Penn Libraries)
Shaun A. Noordin (Oxford University/Zooniverse)
Becky Rother (Adler Planetarium/Zooniverse)
Marina Rustow (Princeton University)
Zach Wolfenbarger (Adler Planetarium/Zooniverse)
Special thanks and credit to Ahmed Y. Almaazmi, Elizabeth Bates, Dr. Jean Bauer, Hal Blackburn, Taieb Cherif, Michelle Chesner, Jessica Dummer, Timothy Dungate, Scott Enderle, Dr. Jessica Goldberg, Yonatan Gutenmacher, Rebecca Hill, Amey Hutchins, Dr. L. Clifton Johnson, Dr. David Kraemer, Dr. Nita Krevans, Kate Lynch, Gayatri B. Oruganti, Dr. Ben Outhwaite, Dr. Craig Perry, Raha Rafii, Besan Radwan, Dr. Sinai Rusinek, Dr. Judith Olszowy-Schlanger, Jasmine Shinohara, Dr. Smadar Shtuhl, Emma Stanford, Einat Tamir, Dr. Kelly Tuttle, Mostafa Younesie, and Dr. Oded Zinger for their contributions.