What is it?

Open Syllabus is a non-profit research organization that collects and analyzes millions of syllabi to support novel teaching and learning applications.  Open Syllabus helps instructors develop classes, libraries manage collections, and presses develop books.  It supports students and lifelong learners in their exploration of topics and fields.  It creates incentives for faculty to improve teaching materials and to use open licenses.  It supports work on aligning higher education with job market needs and on making student mobility easier.  It also challenges faculty and universities to work together to steward this important data resource.

Open Syllabus currently has a corpus of nine million English-language syllabi from 140 countries.  It uses machine learning and other techniques to extract citations, dates, fields, and other metadata from these documents.  The resulting data is made freely available via the Syllabus Explorer and for academic research. 

The project was founded at The American Assembly, a public policy institute associated with Columbia University.

Tell me more. What is a syllabus?

For our purposes, a syllabus is any detailed account of a class or class section.  Not every country represents classes in the same way.  In the UK and Australia, we also collect reading lists.  In India, a syllabus is a bundle of class descriptions associated with a degree.  We collect those too.

All of the syllabi in the current collection are English-language documents – including from universities where English is not the primary teaching language.  Eventually we will create workflows for the large non-English portions of the collection.

We have no means at present of mapping syllabi back to class size or enrollment.  A MOOC and a seminar are treated identically.  Nor do we know how much coverage the collection provides overall or how representative it is (although it is now large enough to permit the construction of representative subsets using various criteria).  Our best estimates point to around 5-6% of the US curricular universe over the past several years.

Because the Syllabus Explorer focuses on assigned texts, it privileges syllabi that provide detail about class readings.  These represent a bit over 50% of the collection.  

The other 50% come either from classes that assign no texts (such as many studio, lab, and internship classes) or from curricular management systems that do not or only inconsistently list assigned readings.  These contribute nothing to the citation rankings but are useful for other kinds of analysis, including work on learning outcomes and curricular trends.

How does OS get its syllabi?

Primarily by crawling publicly-accessible university websites. 

Faculty contributions make up a small but significant portion of the collection.  They also form the basis of an emerging ‘full permission’ collection that we will index, display, and make available for download in a later version of the Explorer.  If you would like to contribute syllabi to either the public or non-public collections, please read more here.

The first step in the OS data pipeline is to separate syllabi from other documents collected in OS’s internet crawls.  The second step is to deduplicate these documents.  As of September 2020, around 9 million documents passed these tests.  The Explorer currently displays the roughly 6 million syllabi collected through 2017.
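As a rough illustration of the deduplication step – a minimal sketch assuming documents are compared on a normalized text fingerprint, which is far simpler than the real pipeline:

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash a normalized version of the document text."""
    # Lowercase and collapse whitespace so trivial formatting differences
    # don't produce distinct fingerprints.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct document."""
    seen, unique = set(), []
    for doc in documents:
        fp = fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            unique.append(doc)
    return unique

docs = [
    "ENG 101  Intro to Writing",
    "eng 101 intro  to writing",   # same document, different formatting
    "BIO 210 Genetics",
]
print(len(deduplicate(docs)))  # 2
```

A production system would also need to handle near-duplicates (different file formats, reposted semesters), which exact hashing does not catch.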

What are ranks, counts, and scores?

Citation counts–how often titles are assigned across the collection–appear throughout the Syllabus Explorer.

If a title appears on a syllabus, it ‘counts.’  If it appears 10 times on a syllabus, it counts only once.  If it appears in ‘suggested reading’ or some other secondary list, it still counts. We don’t distinguish primary from secondary reading (yet).
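The counting rule above can be sketched in a few lines (illustrative only; the strings here stand in for matched catalog records):

```python
from collections import Counter

def citation_counts(syllabi: list[list[str]]) -> Counter:
    """Count each title at most once per syllabus, however often it appears."""
    counts = Counter()
    for titles in syllabi:
        # set() collapses repeat mentions within a single syllabus
        counts.update(set(titles))
    return counts

syllabi = [
    ["The Elements of Style", "The Elements of Style", "The Republic"],
    ["The Elements of Style"],
]
print(citation_counts(syllabi)["The Elements of Style"])  # 2, not 3
```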

A title’s ‘Rank’ is calculated with respect to the 1.7 million unique titles identified in the collection. The most frequently taught text–The Elements of Style–is ranked #1.

A title’s ‘Score’ is another representation of rank converted to a 1-100 scale (we compute a dense ranking of appearance counts, convert it to a percentile, and shave off the decimal places). At present, the top sixteen works have a score of 100, whereas low-count works in the long tail of titles have scores of 1 or 2. Our goal is to provide a number that is easier to interpret than the raw rankings and counts.
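That conversion can be sketched as follows – our interpretation of the description above, not the production code:

```python
def scores(counts: dict[str, int]) -> dict[str, int]:
    """Map appearance counts to a 1-100 score via dense ranking and percentiles."""
    distinct = sorted(set(counts.values()))            # dense rank: ties share a rank
    rank = {c: i + 1 for i, c in enumerate(distinct)}  # 1 = lowest count
    n = len(distinct)
    # Convert each title's dense rank to a percentile and drop the decimals,
    # flooring at 1 so every title gets a nonzero score.
    return {t: max(1, int(100 * rank[c] / n)) for t, c in counts.items()}

counts = {"A": 5000, "B": 5000, "C": 120, "D": 3}
print(scores(counts))  # A and B tie, so both get a score of 100
```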

These numbers contribute to an evolving discussion in the academy about publication metrics. We think syllabus counts are a useful addition because they privilege types of work that are commonly underrepresented in metrics derived from journal citation, including the more synthetic, accessible, and public-facing forms of work that often represent a large part of faculty labor. In short, they give faculty a way to gain recognition for scholarship with classroom applications and so create incentives to improve teaching materials.

How are counts calculated?

The Syllabus Explorer has a master catalog of titles that it can identify within the syllabus collection.  In Explorer Version 2, this catalog is primarily a combination of The Library of Congress, Open Library, and Crossref, the last a scholarly publishing catalog with records for around 80 million articles.  We also incorporate the Open Textbook Library, which allows us to track the adoption of openly-licensed titles.

The Syllabus Explorer identifies citations by looking for title/author name combinations in the raw text of the syllabi.  The resulting matches are then run through a neural network trained to distinguish citations from other occurrences of the same words.  This process is accurate a bit over 90% of the time when compared to human labeling.
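A highly simplified sketch of the candidate-matching step: the catalog entries and the 200-character context window below are made up for illustration, and the neural-network filter that follows this step is omitted entirely.

```python
import re

# Hypothetical catalog entries for illustration
CATALOG = [
    {"title": "The Elements of Style", "author": "Strunk"},
    {"title": "The Republic", "author": "Plato"},
]

def candidate_citations(syllabus_text: str, window: int = 200) -> list[str]:
    """Flag a title as a candidate citation when the author's name
    appears within `window` characters of the title match."""
    text = syllabus_text.lower()
    hits = []
    for entry in CATALOG:
        for m in re.finditer(re.escape(entry["title"].lower()), text):
            start = max(0, m.start() - window)
            context = text[start:m.end() + window]
            if entry["author"].lower() in context:
                hits.append(entry["title"])
                break  # one hit per title is enough
    return hits

text = "Week 1: Strunk and White, The Elements of Style. Week 2: lab safety."
print(candidate_citations(text))  # ['The Elements of Style']
```

In the real pipeline, candidates like these are then scored by a trained classifier to reject coincidental word overlaps (e.g. the phrase "the republic" in running prose).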

Some of the remaining 8-9% are filtered out through rule-based checks and hand-composed blacklists.  The rest show up in the Explorer.

Why doesn’t the Syllabus Explorer show results for X?

There are many possible reasons but here are the most likely:

  1. X is not assigned on the syllabi currently in the collection. 
  2. X is not in the master bibliographical catalogs that we use to identify titles (such as The Library of Congress and Crossref).
  3. X was improperly merged with another title in building the master catalog.  With nearly 150 million total records, some title/author combinations appear hundreds of times.  The process of collapsing large numbers of variants and potential duplicates into single records is imperfect.
  4. The original catalog data for X is ambiguous or incorrect.  This is common.  Records sometimes fail to list all of the authors for a title, list editors or translators in the author field, or have other erroneous information.

We’ve worked to minimize these problems but if you spend time with the Explorer, you will see them.

What about date, location, and field information?

The dates of classes are obtained by analyzing the date strings that appear in the syllabus text or the source URL for the document.  This process is around 90% accurate, which means that erroneous dates will appear with some frequency.  Some schools, too, use date formats that we have difficulty parsing accurately.

Fields are challenging because there is a great deal of variation in how different educational systems and intellectual traditions divide human knowledge.  In addition, these boundaries are constantly changing.  Our classifiers identify 62 fields derived from the Department of Education’s Classification of Instructional Programs (CIP 2015).  This process is not perfect. The attribution of texts to unusual fields in the Explorer (for example, The Prince to Public Safety) is most often a problem with the field classifiers and is more common with syllabi obtained from non-Anglophone universities.

Institutional attribution is based on a mixed analysis of URLs and e-mail strings found in the documents, which are then mapped to Digital Science’s GRID database of research and teaching institutions as well as IPEDS data.  These methods resolve institutional location for around 94% of the syllabi in the collection.

What about people and publishers?

Unlike titles, schools, fields, and countries, authors and publishers do not have unique records in the database.  An author search simply returns people who share a particular name.  These results can be nearly unique for people with rare names, but are less reliable for common names.  Additionally, our source catalogs often contain multiple versions of the same person’s name, including duplicates created by differing citation conventions around initials.  Our efforts to reconcile these variations are imperfect.  Stable author identities remain one of the major challenges of library science. We can’t solve that problem, but we will try to adopt emerging solutions, such as ORCID (go on, get an ORCID ID).

Publisher data is structurally similar to author data but has some unique features. The quality of publisher data in the source catalogs is generally terrible, with no consistent representation of publisher names, ownership structures, or roles in publication.  However, the finite number of publishers makes the data easier to clean.  We have aggressively cleaned much of it, making relatively complete publisher records possible in many cases (at the expense of some of the complexity embedded in the records).

We also list academic journals in the publisher section. 

In practice, a title can be composed of dozens of underlying records, with different publishers, publication years, and even authors. Where we have multiple records for a single title, we show (1) the most frequent (i.e., modal) title and subtitle; (2) the modal author or authors; (3) the dominant publisher (i.e., representing 70% or more of the records); and (4) the earliest publication date from among all the records. The result is a useful indicative record, rather than a ‘true’ representation of the title and edition assigned.
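The merging rules described above can be sketched like this (field names and records are illustrative):

```python
from collections import Counter

def merge_records(records: list[dict]) -> dict:
    """Collapse multiple catalog records for one title into an indicative record."""
    def modal(field: str):
        # Most frequent value for the field across all records
        return Counter(r[field] for r in records).most_common(1)[0][0]

    publishers = Counter(r["publisher"] for r in records)
    top_publisher, top_count = publishers.most_common(1)[0]
    return {
        "title": modal("title"),
        "author": modal("author"),
        # Show a publisher only when one accounts for 70%+ of the records
        "publisher": top_publisher if top_count / len(records) >= 0.7 else None,
        # Earliest publication date among all records
        "year": min(r["year"] for r in records),
    }

records = [
    {"title": "The Republic", "author": "Plato", "publisher": "Hackett", "year": 1992},
    {"title": "Republic",     "author": "Plato", "publisher": "Hackett", "year": 2004},
    {"title": "The Republic", "author": "Plato", "publisher": "Penguin", "year": 1955},
]
# Hackett covers only 2/3 of the records, below the 70% threshold,
# so no dominant publisher is shown.
print(merge_records(records))
```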

Are you exposing individual syllabi or teaching choices to the world?

No.  We provide no access to underlying documents in the Explorer.  All of the tools available through the Explorer limit exploration to statistical aggregates and metadata extracted from the syllabi.  These practices do not expose personally identifying information.

We also set a minimum number of syllabi required to search fields within schools or countries (250).  As a result, individual teaching choices are submerged in larger aggregates. 

Finally, OS does not display syllabi from countries or territories where we think the disclosure of teaching choices could put faculty at risk or otherwise impede academic freedom. The list of excluded countries and territories is based, in part, on the Scholars at Risk Network’s Academic Freedom Monitoring Project and related Free to Think reports, as well as Freedom House country reports. This list currently includes:

Afghanistan, Algeria, Angola, Azerbaijan, Bahrain, Bangladesh, Benin, Brazil, Burundi, Cameroon, Chad, China, Colombia, Congo, Côte d’Ivoire, Cuba, Ecuador, Egypt, Ethiopia, Guatemala, Honduras, Hong Kong, Hungary, Indonesia, Iran, Iraq, Jordan, Kenya, Kuwait, Lebanon, Liberia, Libya, Macao, Madagascar, Malawi, Malaysia, Mali, Mauritania, Mexico, Morocco, Mozambique, Myanmar, Nepal, Niger, Nigeria, Oman, Pakistan, Palestine (OPT), Papua New Guinea, Paraguay, The Philippines, Russia, Saudi Arabia, Senegal, Sierra Leone, Singapore, Sri Lanka, Sudan, Swaziland, Syria, Tajikistan, Tanzania, Thailand, Togo, Tunisia, Turkey, Uganda, United Arab Emirates, Venezuela, Viet Nam, Yemen, Zambia, and Zimbabwe.

We have decided not to separate STEM fields from humanities and social sciences in this context.  We will revisit this list as needed.

How is OS funded?

The OSP has been supported by The Arcadia Fund, The Sloan Foundation, The Hewlett Foundation, and The Templeton Foundation.  The project also received a Catalyst Grant from Digital Science in 2018. 

We are exploring models of longer-term sustainability for the project.  The answer is likely to be a combination of grants and commercial revenues–with the latter potentially including licensing of the OS books recommendation data to publishers, book vendors, and curricular development services. We also use Amazon affiliate links in the Explorer. Any purchase made after clicking through an Amazon link in the Explorer will return a small percentage to OS. None of these potential models will provide access to the underlying documents or personally identifying information to third parties.

In the medium to long term, our goal is to build a Syllabus Commons, in which universities pool syllabi and support the work of the OSP.  We will try to encourage this pooling by developing deeper curricular analytics and other university-facing services for partnering schools.

Academic researchers working through their universities can request access to the data.