What is Open Syllabus?
Open Syllabus is a non-profit research organization that collects and analyzes millions of syllabi to support novel teaching and learning applications. Open Syllabus helps instructors develop classes, libraries manage collections, and presses develop books. It supports students and lifelong learners in their exploration of topics and fields. It creates incentives for faculty to improve teaching materials and to use open textbooks. It supports work on aligning higher education with job market needs and on making student mobility easier. It also challenges faculty and universities to work together to steward this important data resource.
Open Syllabus currently has a corpus of eighteen million English-language syllabi from 140 countries. It uses machine learning and other techniques to extract citations, dates, fields, and other metadata from these documents. The resulting data is made available through four online tools:
- The Syllabus Explorer provides free/open access to citation data from the archive. It shows how often titles are assigned on different subjects, what they're assigned with, and how that has changed over time.
- Open Syllabus Analytics is the 'pro' version of the Syllabus Explorer and is intended for institutional subscription by staff at colleges, universities, publishers, and high schools. It has more ways to navigate and analyze curricula and provides more information about the contexts of assignment.
- The Co-Assignment Galaxy is a massive plot of the top million titles in the Open Syllabus dataset, grouped by how often they are assigned together. It is the closest thing available to a single representation of the project of higher education.
- The Coursematcher predicts 'course equivalence' across the catalogs of hundreds of schools as a way to support the course transfer process for students and college staff.
Open Syllabus was founded at The American Assembly, a public policy institute associated with Columbia University. It has been independent since 2019.
Tell me more. What is a syllabus?
For our purposes, a syllabus is a document that provides a detailed description of a class or class section beyond what one would find in a course catalog (we have also begun to collect catalog data). We collect a wide variety of documents that meet this criterion, including reading lists and descriptions produced by curricular portals. The resulting archive is very heterogeneous. Not all syllabi contain all the major elements. Around 50% of syllabi contain assigned readings. A bit over 50% list learning outcomes.
All of the syllabi in the current collection are English-language documents, including those from universities where English is not the primary teaching language. Eventually we will create workflows for the large but currently hidden non-English portions of the collection.
We have no means at present of mapping syllabi back to class size or enrollment. A MOOC and a seminar are treated identically. Nor do we know how much coverage the collection provides overall or how representative it is (although it is now large enough to permit the construction of representative subsets using various criteria). Our best estimates point to around 6% of the US curricular universe over the past several years.
How does Open Syllabus get its syllabi?
Primarily by crawling publicly accessible university websites. We currently update the syllabus dataset twice per year. For a variety of reasons, coverage of the most recent years always lags.
Faculty contributions make up a small but significant portion of the collection. They also form the basis of an emerging ‘full permission’ collection that we will index, display, and make available for download in a later version of the Explorer. If you would like to contribute syllabi to either the public or non-public collections, please read more here.
The first step in the Open Syllabus data pipeline is to separate syllabi from other documents collected in OS’s internet crawls. The second step is to deduplicate these documents. As of May 2023, around 18 million documents passed these tests. The Explorer currently displays the roughly 7.2 million syllabi collected through 2018. Analytics shows the full collection.
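The two pipeline steps described above can be sketched roughly as follows. This is a minimal illustration, not the actual Open Syllabus pipeline: `looks_like_syllabus` is a hypothetical keyword check standing in for a real classifier, and the hash-based `fingerprint` only catches exact duplicates after whitespace and case normalization.

```python
import hashlib

def looks_like_syllabus(text):
    """Hypothetical stand-in for a real classifier: a crude keyword check."""
    keywords = ("syllabus", "office hours", "required texts", "learning outcomes")
    lowered = text.lower()
    return any(k in lowered for k in keywords)

def fingerprint(text):
    """Hash of lowercased, whitespace-normalized text (exact duplicates only)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def filter_and_deduplicate(documents):
    """Step 1: keep documents that look like syllabi. Step 2: drop duplicates."""
    seen = set()
    kept = []
    for doc in documents:
        if not looks_like_syllabus(doc):
            continue
        fp = fingerprint(doc)
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(doc)
    return kept

docs = [
    "HIST 101 Syllabus\nOffice hours: Mon",
    "hist 101  syllabus office hours: mon",  # same text, different spacing and case
    "Campus parking map",                    # not a syllabus
]
kept = filter_and_deduplicate(docs)
```

Here `kept` holds only the first document: the second is dropped as a duplicate and the third fails the syllabus check.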
What are ranks, counts, and scores?
Citation counts (how often titles are assigned across the collection) appear throughout the Syllabus Explorer.
If a title appears on a syllabus, it ‘counts.’ If it appears 10 times on a syllabus, it counts only once. If it appears in ‘suggested reading’ or some other secondary list, it still counts. We don’t distinguish primary from secondary reading (yet).
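The counting rule above amounts to counting syllabi, not mentions. A minimal sketch, assuming each syllabus has already been reduced to a list of matched titles (the function name and sample data are illustrative):

```python
from collections import Counter

def count_assignments(syllabi):
    """Count, for each title, the number of syllabi on which it appears."""
    counts = Counter()
    for titles in syllabi:
        # set() collapses repeat mentions within a single syllabus
        counts.update(set(titles))
    return counts

syllabi = [
    ["The Prince", "The Prince", "Leviathan"],  # "The Prince" cited twice here
    ["The Prince"],
    ["Leviathan", "Republic"],
]
counts = count_assignments(syllabi)
```

In this sample, "The Prince" counts twice (two syllabi), even though it is mentioned three times in total.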
A title’s ‘Rank’ is calculated with respect to the 7.9 million unique titles identified in the collection. The most frequently taught text in version 2.9 of the dataset (A Manual for Writers of Term Papers, Theses, and Dissertations) is ranked #1.
A title’s ‘Score’ is another representation of rank converted to a 1-100 scale (using a dense ranking of appearance counts, converting to a percentile, and shaving off the decimal places). At present, the top twenty-four titles have a score of 100, whereas low-count titles in the long tail of the distribution have scores of 1 or 2. Our goal is to provide a number that is easier to interpret than the raw rankings and counts.
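The exact formula is not given here, so the sketch below is only one plausible reading of "dense ranking, percentile, truncation" and should not be expected to reproduce published scores. The clamp to a minimum of 1 is an assumption made to keep results on the stated 1-100 scale, and the function name and sample counts are illustrative.

```python
def scores_from_counts(counts):
    """Map each title to a 1-100 score derived from its appearance count."""
    # Dense ranking: titles with equal counts share a rank, with no gaps.
    distinct = sorted(set(counts.values()))
    rank_of = {c: i + 1 for i, c in enumerate(distinct)}  # 1 = lowest count
    n_ranks = len(distinct)
    scores = {}
    for title, c in counts.items():
        # Integer division drops the decimal places of the percentile.
        percentile = (100 * rank_of[c]) // n_ranks
        scores[title] = max(1, percentile)  # assumed floor of 1
    return scores

counts = {"A": 500, "B": 120, "C": 120, "D": 3, "E": 1}
scores = scores_from_counts(counts)
```

With four distinct counts in the sample, "A" scores 100, "B" and "C" tie at 75, "D" scores 50, and "E" scores 25.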
These numbers contribute to an evolving discussion in the academy about publication metrics. We think syllabus counts are a useful addition because they privilege types of work that are commonly underrepresented in metrics derived from journal citation, including the more synthetic, accessible, and public-facing forms of work that often represent a large part of faculty labor. In short, they give faculty a way to gain recognition for scholarship with classroom applications and so create incentives to improve teaching materials.
How are counts calculated?
The Syllabus Explorer and Analytics rely on a master catalog to identify titles within the syllabus collection. Currently, this catalog is a combination of The Library of Congress, Open Library, OpenAlex, and open access databases such as the Directory of Open Access Books and Open Textbook Library.
The Syllabus Explorer identifies citations by looking for citation-like patterns in the raw text of the syllabi. The title-author components are then compared against the catalog. This process is accurate a bit over 90% of the time compared to human labeling.
Some of the remaining 8-9% are cleaned through rule-based and hand-composed blacklists. The rest show up in the data.
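As a rough illustration of matching title-author pairs against a catalog, the sketch below uses a single regular expression for one assumed citation shape ("Last, First. Title.") and a tiny hypothetical catalog. Real syllabi use many more citation formats, and nothing here is meant to describe the actual extraction model.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation so small formatting differences don't block a match."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()

# Hypothetical catalog keyed on (normalized title, normalized author last name).
CATALOG = {
    (normalize("The Prince"), normalize("Machiavelli")): "catalog-id-1",
    (normalize("Leviathan"), normalize("Hobbes")): "catalog-id-2",
}

# Assumed citation shape: "Last, First. Title." (one of many shapes in practice).
CITATION = re.compile(r"^(?P<last>[^,]+),\s*[^.]+\.\s*(?P<title>[^.]+)\.")

def match_citations(raw_text):
    """Return catalog ids for lines that look like citations and match the catalog."""
    matches = []
    for line in raw_text.splitlines():
        m = CITATION.match(line.strip())
        if not m:
            continue
        key = (normalize(m.group("title")), normalize(m.group("last")))
        if key in CATALOG:
            matches.append(CATALOG[key])
    return matches

text = """Week 1 readings
Machiavelli, Niccolo. The Prince. 1532.
Hobbes, Thomas. Leviathan. 1651.
Smith, Jane. An Unlisted Book. 2001.
"""
found = match_citations(text)
```

The third citation parses correctly but is absent from the catalog, so it produces no match, which mirrors the "X is not in the master catalog" case.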
Why don’t we show results for X?
There are many possible reasons but here are the most likely:
- X is not assigned on the syllabi currently in the collection.
- Citations of X don't conform to our matching model. Because we rely on 'Title - Author Last Name' as the identifier, we struggle with certain kinds of citations and categories of work. We won't find, for example, movies cited by title and date rather than title and director. Textbooks for which authors change across editions can be a problem.
- X is not in the master bibliographical catalogs that we use to identify titles. Our catalog is weak, especially, for non-English titles, which appear frequently in the European portion of the collection.
- X was improperly merged with another title in building the master catalog. With nearly 200 million total records, some title/author combinations appear hundreds of times. The process of collapsing large numbers of variants and potential duplicates into single records is imperfect.
- The original catalog data for X is ambiguous or incorrect. This is common. Records sometimes fail to list all of the authors for a title, list editors or translators in the author field, or have other erroneous information.
We’ve worked to minimize these problems but if you spend time with the Explorer, you will see them.
What about date, location, and field information?
The dates of classes are obtained by analyzing the date strings that appear in the syllabus text or the source URL for the document. This process is around 90% accurate, which means that erroneous dates will appear with some frequency. Some schools, too, use date formats that we have difficulty parsing accurately.
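A simplified sketch of the idea: look for a term-and-year string in the text first, then fall back to a four-digit year in the URL. The patterns, the order of preference, and the function name are assumptions for illustration; real date formats are far more varied, which is part of why errors occur.

```python
import re

# Assumed pattern: an academic term followed by a four-digit year.
TERM_YEAR = re.compile(
    r"\b(Spring|Summer|Fall|Autumn|Winter)\s+((?:19|20)\d{2})\b", re.IGNORECASE
)
# A standalone four-digit year (19xx or 20xx) anywhere in the URL.
URL_YEAR = re.compile(r"(?<!\d)((?:19|20)\d{2})(?!\d)")

def guess_year(text, url):
    """Return a year from the syllabus text, else from the URL, else None."""
    m = TERM_YEAR.search(text)
    if m:
        return int(m.group(2))
    m = URL_YEAR.search(url)
    if m:
        return int(m.group(1))
    return None

from_text = guess_year("HIST 101, Fall 2016", "https://example.edu/syllabi/hist101.pdf")
from_url = guess_year("Introduction to Logic", "https://example.edu/courses/2014/phil100.html")
unknown = guess_year("No dates here", "https://example.edu/phil")
```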
Fields are challenging because there is a great deal of variation in how different educational systems and intellectual traditions divide human knowledge. Our classifiers identify 62 fields derived from the Department of Education’s Classification of Instructional Programs (CIP 2015). This process is not perfect. The attribution of texts to unusual fields in the Explorer (for example, The Prince to Public Safety) is most often a problem with the field classifiers and is more common with syllabi obtained from non-Anglophone universities.
Institutional attribution is based on a mixed analysis of URLs and e-mail strings found in the documents, which are then mapped to a combination of the Research Organization Registry (ROR) and IPEDS data. These methods resolve institutional location for around 94% of the syllabi in the collection. What does it tend to miss? Vocational schools outside the US.
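The URL and e-mail approach can be sketched as a domain lookup. The `DOMAIN_TO_INSTITUTION` table below is a hypothetical stand-in for data derived from ROR and IPEDS, and the suffix-matching strategy is an assumption for illustration only.

```python
import re
from urllib.parse import urlparse

# Hypothetical lookup table; real data would be derived from ROR and IPEDS.
DOMAIN_TO_INSTITUTION = {
    "example.edu": "Example University",
}

EMAIL = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")

def candidate_domains(url, text):
    """Collect the URL host plus the domains of any e-mail addresses in the text."""
    domains = []
    host = urlparse(url).hostname
    if host:
        domains.append(host.lower())
    domains.extend(d.lower() for d in EMAIL.findall(text))
    return domains

def attribute_institution(url, text):
    """Return the first institution whose domain is a suffix of a candidate domain."""
    for domain in candidate_domains(url, text):
        parts = domain.split(".")
        # Try progressively shorter suffixes: cs.example.edu, then example.edu
        for i in range(len(parts) - 1):
            suffix = ".".join(parts[i:])
            if suffix in DOMAIN_TO_INSTITUTION:
                return DOMAIN_TO_INSTITUTION[suffix]
    return None

by_url = attribute_institution("https://cs.example.edu/syllabus.pdf", "")
by_email = attribute_institution(
    "https://files.hosting.test/doc.pdf", "Contact: prof@mail.example.edu"
)
unresolved = attribute_institution("https://unknown.test/x", "no contact details")
```

The second call shows why e-mail strings matter: the document is hosted on an unrelated domain, but the instructor's address still resolves to a school.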
What about people and publishers?
Unlike titles, schools, fields, and countries, authors and publishers do not have unique records in the database. An author search simply returns hits on a particular name. These results can be nearly unique for people with rare names, but are less reliable for common names. Additionally, our source catalogs often contain multiple versions of the same person’s name, and often duplicates based on the different citation conventions around initials. Our efforts to reconcile these variations are imperfect. Stable author identities remain one of the major challenges of library science. We can’t solve that problem, but we will try to adopt emerging solutions, such as ORCID (go on, get an ORCID ID).
Publisher data is structurally similar to author data but has some unique features. The quality of publisher data in the source catalogs is generally terrible, with no consistent representation of publisher names, ownership structures, or roles in publication. However, the finite number of publishers makes the data easier to clean. We have aggressively cleaned much of it, making relatively complete publisher records possible in many cases (at the expense of some of the complexity embedded in the records).
We also list academic journals in the publisher section.
In practice, a title can be composed of dozens of underlying records, with different publishers, publication years, and even authors. Where we have multiple records for a single title, we show (1) the most frequent (i.e., modal) title and subtitle; (2) the modal author or authors; (3) the dominant publisher (i.e., representing 70% or more of the records); and (4) the earliest publication date from among all the records. The result is a useful indicative record, rather than a ‘true’ representation of the title and edition assigned.
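The four display rules can be sketched as follows. The record layout (dicts with `title`, `authors`, `publisher`, and `year` keys), the publisher names, and the choice to return `None` when no publisher reaches 70% are assumptions for illustration; subtitles are omitted for brevity.

```python
from collections import Counter

def display_record(records, dominance=0.70):
    """Collapse several catalog records for one title into a single indicative record."""
    titles = Counter(r["title"] for r in records)
    authors = Counter(r["authors"] for r in records)  # authors stored as tuples
    publishers = Counter(r["publisher"] for r in records)
    top_publisher, n = publishers.most_common(1)[0]
    return {
        "title": titles.most_common(1)[0][0],      # (1) modal title
        "authors": authors.most_common(1)[0][0],   # (2) modal author(s)
        # (3) publisher shown only if it accounts for 70% or more of records
        "publisher": top_publisher if n / len(records) >= dominance else None,
        "year": min(r["year"] for r in records),   # (4) earliest publication date
    }

records = [
    {"title": "The Prince", "authors": ("Machiavelli",), "publisher": "Press A", "year": 1995},
    {"title": "The Prince", "authors": ("Machiavelli",), "publisher": "Press A", "year": 2003},
    {"title": "Prince", "authors": ("Machiavelli", "Bull"), "publisher": "Press A", "year": 1961},
    {"title": "The Prince", "authors": ("Machiavelli",), "publisher": "Press B", "year": 2003},
]
merged = display_record(records)
```

In the sample, "Press A" accounts for 3 of 4 records (75%), so it is shown as the publisher, and the year is 1961 even though most records carry later dates.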
Does Open Syllabus show syllabi?
Not at all in the Explorer and in anonymized, limited ways in Analytics.
The free, public-facing Syllabus Explorer provides no representation of the underlying documents and limits exploration to statistical aggregates and metadata extracted from the syllabi. We also set a minimum number of syllabi required to search fields within schools or countries (250). As a result, individual teaching choices are submerged in larger aggregates.
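The minimum-size rule can be pictured as a simple filter. This is a sketch under the assumption that the threshold is inclusive (250 syllabi is enough); the function name and sample figures are illustrative.

```python
MIN_SYLLABI = 250  # threshold stated in the text

def searchable_fields(field_counts, minimum=MIN_SYLLABI):
    """Keep only the fields with enough syllabi to be searched within a school or country."""
    return {field: n for field, n in field_counts.items() if n >= minimum}

school_fields = {"History": 900, "Classics": 40, "Biology": 250}
visible = searchable_fields(school_fields)
```

Here "Classics" is hidden for this school because 40 syllabi are too few to keep individual teaching choices submerged in the aggregate.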
Open Syllabus Analytics has different rules. It does show simplified, anonymized representations of underlying syllabi, focusing on the description, learning outcomes, and assigned titles. However, Analytics is not a public service and requires a verified account to access.
Finally, OS does not display syllabi from countries or territories where we think the disclosure of teaching choices could put faculty at risk or diminish academic freedom. The list of excluded countries and territories is based, in part, on the Scholars at Risk Network’s Academic Freedom Monitoring Project and related Free to Think reports, as well as Freedom House country reports. This list currently includes:
Afghanistan, Algeria, Angola, Azerbaijan, Bahrain, Bangladesh, Benin, Burundi, Cameroon, Chad, China, Colombia, Congo, Côte d’Ivoire, Cuba, Ecuador, Egypt, Ethiopia, Guatemala, Honduras, Hong Kong, Hungary, Indonesia, Iran, Iraq, Jordan, Kenya, Kuwait, Lebanon, Liberia, Libya, Macao, Madagascar, Malawi, Malaysia, Mali, Mauritania, Mexico, Morocco, Mozambique, Myanmar, Nepal, Niger, Nigeria, Oman, Pakistan, Palestine (OPT), Papua New Guinea, Paraguay, The Philippines, Russia, Saudi Arabia, Senegal, Sierra Leone, Sri Lanka, Sudan, Swaziland, Syria, Tajikistan, Tanzania, Thailand, Togo, Tunisia, Turkey, Uganda, United Arab Emirates, Venezuela, Viet Nam, Yemen, Zambia, and Zimbabwe.
We have decided not to separate STEM fields from humanities and social sciences in this context. We will revisit this list as needed.
How is Open Syllabus funded?
Open Syllabus has been supported by The Arcadia Fund, The ECMC Foundation, The Sloan Foundation, The Hewlett Foundation, and The Templeton Foundation. The project also received a Catalyst Grant from Digital Science in 2018.
The release of Open Syllabus Analytics marks a shift toward a subscription-based model for 'pro' versions of OS services designed for professional educators. We also license book recommendations and other anonymized slices of the dataset to publishers.
Can I access the underlying data?
We provide limited, anonymized versions of the OS dataset under some circumstances for academic research. Research leads must be based at a college or university and be able to secure the support of their schools for a 'research use agreement.' If you'd like to inquire about access, write us at email@example.com. If we are slow to follow up, apologies. We receive more of these requests than we can manage.