ABOUT THE OPEN SYLLABUS PROJECT

What is it?

The Open Syllabus Project (OSP) collects and analyzes millions of syllabi to support educational research and novel teaching and learning applications.  The OSP helps instructors develop classes, libraries manage collections, and presses develop books.  It supports students and lifelong learners in their exploration of topics and fields.  It creates incentives for faculty to improve teaching materials and to use open licenses.  It supports work on aligning higher education with job market needs and on making student mobility easier.  It also challenges faculty and universities to work together to steward this important data resource.

The OSP currently has a corpus of seven million English-language syllabi from over 80 countries.  It uses machine learning and other techniques to extract citations, dates, fields, and other metadata from these documents.  The resulting data is made freely available via the Syllabus Explorer and in bulk for academic research. 

The OSP is based at Open Syllabus, a non-profit research organization. The project was founded at The American Assembly, a public policy institute associated with Columbia University.

Tell me more. What is a syllabus?

For our purposes, a syllabus is any detailed account of a class or class section.  Not every country represents classes in the same way.  In the UK and Australia, we also collect reading lists.  In India, a syllabus is a bundle of classes associated with a degree.  We collect those too.

All of the syllabi in the current collection are English-language documents – including from universities where English is not the primary teaching language.  Eventually we will create workflows for the large non-English portions of the collection that we cannot currently analyze.

We have no means at present of mapping syllabi back to class size or enrollment.  A MOOC and a seminar are treated identically.  Nor do we know how representative the collection is of the larger universe of syllabi (although the collection is now large enough to permit the construction of representative subsets using various criteria).  Our rough guess is 5-10% of the Anglophone curricular universe over the last 10 years.

Because the Syllabus Explorer focuses on assigned texts, it privileges syllabi that provide detail about class readings.  These represent a bit over 50% of the collection.  

The remaining syllabi, a bit under half of the collection, come either from classes that assign no texts (such as many studio, lab, and internship classes) or from curricular management systems that do not or only inconsistently list assigned readings.  These contribute nothing to the citation rankings but are useful for other kinds of analysis, including work on skills and trends.

How does the OSP get its syllabi?

Primarily by crawling and scraping publicly-accessible university websites. 

Faculty contributions make up a small but significant portion of the collection.  They also form the basis of an emerging ‘full permission’ collection that we will display and make available for download in a later version of the Explorer.  If you would like to contribute syllabi to either the public or non-public collections, please read more here.

We also have institutional contributors that we hope will form the basis of a larger Syllabus Commons. 

The first step in the OSP data pipeline is to separate syllabi from other documents collected in the OSP’s internet crawls.  The second step is to deduplicate these documents.  As of May 2019, around 7 million documents passed these tests.  The Explorer currently displays the roughly 6 million syllabi collected through 2017.

What are ranks, counts, and scores?

Citation counts–how often titles are assigned across the collection–appear throughout the Syllabus Explorer.

If a title appears on a syllabus, it ‘counts.’  If it appears 10 times on a syllabus, it counts only once.  If it appears in ‘suggested reading’ or some other secondary list, it still counts. Our methods can’t distinguish primary from secondary reading (yet).
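As a rough illustration of this counting rule (a minimal sketch, not the OSP's actual code), each syllabus can be treated as a list of matched titles and reduced to a set before counting.  The function name and the sample data are invented for the example.

```python
from collections import Counter

def citation_counts(syllabi):
    """Count, for each title, the number of syllabi on which it appears at least once."""
    counts = Counter()
    for titles in syllabi:
        # A set collapses repeated mentions within a single syllabus,
        # so a title listed 10 times still counts once.
        counts.update(set(titles))
    return counts

# Hypothetical input: one list of matched titles per syllabus.
syllabi = [
    ["The Elements of Style", "The Elements of Style", "The Prince"],
    ["The Prince"],
    ["The Elements of Style"],
]
counts = citation_counts(syllabi)
```

Here both titles end up with a count of 2, even though one of them is mentioned twice on the first syllabus.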

A title’s ‘Rank’ is calculated with respect to all 1.7 million unique titles identified in the collection. The most frequently taught text–The Elements of Style–is ranked #1.

A title’s ‘Score’ is another method of ranking titles, on a 1-100 scale.  We use a dense ranking of appearance counts and convert to a percentile, shaving off the decimal places. In practice, the top sixteen works have a score of 100, whereas low-count works in the long tail of titles have scores of 1 or 2. Our intention is to provide a number that is easier to interpret than the raw rankings and counts.
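The exact formula is not spelled out here, so the following is only one plausible reading of "dense ranking converted to a percentile": titles with the same count share a rank, ranks are not skipped, and the rank is expressed as a truncated percentage of the number of distinct ranks.  The function name, the clamp to a minimum of 1, and the sample counts are assumptions made for the sketch; it will not reproduce the published scores exactly.

```python
def scores(counts):
    """Map each title to a 1-100 score from a dense ranking of its count."""
    # Dense ranking: tied counts share a rank and no ranks are skipped.
    distinct = sorted(set(counts.values()))
    rank_of = {count: rank for rank, count in enumerate(distinct, start=1)}
    n_ranks = len(distinct)
    result = {}
    for title, count in counts.items():
        percentile = 100 * rank_of[count] / n_ranks
        # int() drops the decimal places; max() keeps the bottom of the scale at 1.
        result[title] = max(1, int(percentile))
    return result

# Hypothetical appearance counts for five titles.
example = {"A": 500, "B": 500, "C": 40, "D": 3, "E": 1}
example_scores = scores(example)
```

In this toy example the two tied titles at the top both receive 100, which mirrors how several works can share the top score in the Explorer.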

We think that providing these numbers gives faculty a way to gain recognition for scholarship with classroom applications.  Syllabus counts privilege types of work that are underrepresented in metrics derived from journal citation, including more synthetic, accessible, and public-facing work that often represents a large part of faculty labor. We think that counting classroom use also creates a positive feedback loop for teaching-directed and public-facing scholarship, potentially leading to better teaching materials.

How are counts calculated?

The Syllabus Explorer has a master catalog of titles that it can identify within the syllabus collection.  In OSP Version 2, this catalog is primarily a combination of The Library of Congress and Crossref—the latter a scholarly publishing catalog with records for around 80 million articles.  We also incorporate the Open Textbook Library, which allows for tracking the adoption of openly-licensed titles.

The Syllabus Explorer identifies citations by looking for title/author name combinations in the raw text of the syllabi.  The resulting matches are then run through a neural network trained to distinguish citations from other occurrences of the same words.  This process is accurate a bit over 90% of the time compared to a hand-labeled dataset.

Some of the remaining 8-9% are cleaned through rule-based and hand-composed blacklists.  The rest show up in the Explorer.
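The first, candidate-finding step can be pictured with a very simplified sketch.  It assumes a catalog of (title, author surname) pairs and uses plain case-insensitive substring matching; the real matcher, the neural network that filters its output, and the blacklists are not shown, and the names below are invented for the example.

```python
def candidate_matches(text, catalog):
    """Return catalog entries whose title and author surname both occur in the text."""
    lowered = text.lower()
    return [
        (title, surname)
        for title, surname in catalog
        if title.lower() in lowered and surname.lower() in lowered
    ]

# Hypothetical two-entry catalog and one line of syllabus text.
catalog = [("The Elements of Style", "Strunk"), ("The Prince", "Machiavelli")]
text = "Week 1: Strunk and White, The Elements of Style, chapters 1-2."
matches = candidate_matches(text, catalog)
```

A matcher this naive would also fire on sentences that merely happen to contain a title and a surname, which is why a second step to separate real citations from coincidental matches is needed.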

Why doesn’t the Syllabus Explorer show results for X?

There are many possible reasons but here are the most likely:

  1. X is not assigned on the syllabi currently in the collection. 
  2. X is not in the master bibliographical catalogs that we use to identify titles (The Library of Congress and Crossref).
  3. X was improperly merged with another title in building the master catalog.  With nearly 100 million total records, some title/author combinations appear hundreds of times.  The process of collapsing large numbers of variants and potential duplicates into single records is imperfect.
  4. The original catalog data for X is ambiguous or incorrect.  This is common.  Records sometimes fail to list all of the authors for a title, list editors or translators in the author field, or have other erroneous information.

We’ve worked to minimize these problems but if you spend time with the Explorer, you will see them.

What about date, location, and field information?

The dates of classes are obtained by analyzing the date strings that appear in the syllabus text or the source URL for the document.  This process is around 90% accurate, which means that erroneous dates will appear with some frequency.  Some schools, too, use date formats that we have difficulty parsing accurately.

Fields are challenging because there is a great deal of variation in how different educational systems and intellectual traditions divide human knowledge.  Also, these boundaries are constantly changing.  We trained classifiers to identify—initially—117 fields found in the Department of Education’s Classification of Instructional Programs (CIP 2015).  For this round of OSP work, 62 were accurate enough to use (all can be mapped back to CIP codes).  The attribution of texts to unusual fields in the Explorer (for example, The Prince to Public Safety) is most often a problem with the field classifiers and is common, especially, with syllabi obtained from non-Anglophone universities.

Institutional attribution is based on a mixed analysis of URLs and e-mail strings found in the documents, which are then mapped to Digital Science’s GRID database of research and teaching institutions as well as IPEDS data.  These methods resolve institutional location for around 94% of the syllabi in the collection.

What about people and publishers?

Unlike titles, schools, fields, and countries, authors and publishers do not have unique records in the database.  An author search simply returns people who share a particular name.  These results can be nearly unique for people with rare names, but are less reliable for common names.  Additionally, our source catalogs often contain multiple versions of the same person’s name, and often duplicates based on the different citation conventions around initials.  Our efforts to reconcile these variations are imperfect.  Stable author identities remain one of the major challenges of library science. We can’t solve that problem, but we will try to adopt emerging solutions, such as ORCID (go on, get an ORCID ID).

Publisher data is structurally similar to author data but has some unique features. The quality of publisher data in the source catalogs is generally terrible, with no consistent representation of publisher names, ownership structures, or roles in publication.  However, the finite number of publishers makes the data easier to clean.  We have aggressively cleaned much of it, making relatively complete publisher records possible in many cases (at the expense of some of the complexity embedded in the records).

We also list academic journals in the publisher section. 

In practice, a title can be composed of dozens of underlying records, with different publishers, publication years, and even authors. Where we have multiple records for a single title, we show (1) the most frequent (i.e., modal) title and subtitle; (2) the modal author or authors; (3) the dominant publisher (i.e., representing 70% or more of the records); and (4) the earliest publication date from among all the records. The result is a useful indicative record, rather than a ‘true’ representation of the title and edition assigned.
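The four display rules can be sketched as follows.  This is an illustration under assumed field names (title, authors, publisher, year), with placeholder publishers and years; it is not the OSP's actual schema or code, and ties between equally frequent values are broken arbitrarily by first appearance.

```python
from collections import Counter

def indicative_record(records, dominance=0.7):
    """Collapse several catalog records for one title into a single display record."""
    def modal(field):
        # Most frequent value of a field across the records.
        return Counter(r[field] for r in records).most_common(1)[0][0]

    publisher, publisher_n = Counter(r["publisher"] for r in records).most_common(1)[0]
    return {
        "title": modal("title"),
        "authors": modal("authors"),
        # Show a publisher only if it accounts for at least 70% of the records.
        "publisher": publisher if publisher_n / len(records) >= dominance else None,
        # Earliest publication year among all the records.
        "year": min(r["year"] for r in records),
    }

# Hypothetical underlying records; publishers and years are placeholders.
records = [
    {"title": "The Prince", "authors": ("Niccolo Machiavelli",), "publisher": "Publisher A", "year": 1961},
    {"title": "The Prince", "authors": ("Niccolo Machiavelli",), "publisher": "Publisher A", "year": 2003},
    {"title": "Prince", "authors": ("N. Machiavelli",), "publisher": "Publisher A", "year": 1999},
    {"title": "The Prince", "authors": ("Niccolo Machiavelli",), "publisher": "Publisher B", "year": 2005},
]
record = indicative_record(records)
```

Because the earliest year and the modal publisher can come from different underlying records, the combined record may not correspond to any single real edition.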

Are you exposing individual syllabi or teaching choices to the world?

No.  We provide no access to underlying documents in the Explorer.  All of the tools available through the Explorer limit exploration to statistical aggregates and metadata extracted from the syllabi.  These practices do not expose personally identifying information.

We also set a minimum number of syllabi required to search fields within schools or countries (250).  As a result, individual teaching choices are submerged in larger aggregates. 
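A minimal sketch of such a threshold, assuming we already have the number of syllabi for each (school, field) pair.  The constant name, function name, and sample figures are hypothetical.

```python
MIN_SYLLABI = 250  # minimum group size stated above

def searchable_groups(syllabus_counts, minimum=MIN_SYLLABI):
    """Return the (school, field) pairs with enough syllabi to be searchable."""
    return {group for group, n in syllabus_counts.items() if n >= minimum}

# Hypothetical syllabus counts per (school, field) pair.
syllabus_counts = {
    ("School A", "History"): 1200,
    ("School A", "Classics"): 40,
    ("School B", "History"): 250,
}
allowed = searchable_groups(syllabus_counts)
```

In this example the 40-syllabus group is excluded, so its teaching choices would only ever appear inside larger school-wide or field-wide aggregates.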

Finally, the OSP does not display syllabi from countries or territories where we think the disclosure of teaching choices could put faculty at risk or otherwise impede academic freedom. The list of excluded countries and territories is based, in part, on the Scholars at Risk Network’s Academic Freedom Monitoring Project and related Free to Think reports, as well as Freedom House country reports. This list currently includes:

Afghanistan, Algeria, Angola, Azerbaijan, Bahrain, Bangladesh, Benin, Brazil, Burundi, Cameroon, Chad, China, Colombia, Congo, Côte d’Ivoire, Cuba, Ecuador, Egypt, Ethiopia, Guatemala, Honduras, Hong Kong, Hungary, Indonesia, Iran, Iraq, Jordan, Kenya, Kuwait, Lebanon, Liberia, Libya, Macao, Madagascar, Malawi, Malaysia, Mali, Mauritania, Mexico, Morocco, Mozambique, Myanmar, Nepal, Niger, Nigeria, Oman, Pakistan, Palestine (OPT), Papua New Guinea, Paraguay, The Philippines, Russia, Saudi Arabia, Senegal, Sierra Leone, Singapore, Sri Lanka, Sudan, Swaziland, Syria, Tajikistan, Tanzania, Thailand, Togo, Tunisia, Turkey, Uganda, United Arab Emirates, Venezuela, Viet Nam, Yemen, Zambia, and Zimbabwe.

We have decided not to separate STEM fields from humanities and social sciences in this context.  We will revisit this list as needed.

How is the OSP funded?

The OSP has been supported by The Sloan Foundation, The Hewlett Foundation, The Arcadia Fund, and The Templeton Foundation.  The project also received a Catalyst Grant from Digital Science in 2018.

We are exploring models of longer-term sustainability for the project.  The answer is likely to be a combination of grants and commercial revenues–with the latter potentially including licensing of the OSP book recommendation data to publishers, book vendors, and curricular development services. We also use Amazon affiliate links in the Explorer. Any purchase made after clicking through an Amazon link in the Explorer will return a small percentage to the OSP. None of these potential models will provide access to the underlying documents or personally identifying information to third parties.

In the medium to long term, our goal is to build a Syllabus Commons, in which universities pool syllabi and support the work of the OSP.  We will try to encourage this pooling by developing deeper curricular analytics and other university-facing services for partnering schools.

Academic researchers working through their universities can request access to the full dataset.