Austen Said:

Patterns of Diction in Jane Austen's Major Novels


Project Team

Laura White, Principal Investigator, John E. Weaver Professor of English

Carmen Smith, Associate Investigator, Ph.D. candidate, English

Brian Pytlik Zillig, Professor, Libraries and Fellow, CDRH

Laura Weakly, Metadata Encoding Specialist, CDRH

Karin Dalziel, Digital Design/Development Specialist, CDRH

Stephen Ramsay, Susan J. Rosowski Associate University Professor, English, and Fellow, CDRH (on project 2011-13)

Matthew Jockers, Assoc. Professor, English, and Fellow, CDRH (on project 2013-present)

Jessica Dussault, Programmer, CDRH

Encoding by

Download the Data


The TEI below contains the markup that was used to power the visualizations.

Pride and Prejudice

Persuasion

Northanger Abbey

Sense and Sensibility

Emma

Mansfield Park

Technical Information




Data Creation

First, all documents were encoded following a sample encoding developed by Laura Weakly. More on this process can be found in the background section. XSLT was then used to transform the documents into HTML that preserved many of the TEI elements, including speaker and FID (free indirect discourse) information. For the visualizations, CSS and JavaScript were written to highlight various aspects of the markup. For the document search, an XSLT script was written to convert the TEI XML into the Solr XML ingest format, with each <said> element representing one document in the Solr index. We automatically numbered the <said> elements so that results can be viewed in document order.
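
The conversion from TEI to the Solr ingest format can be illustrated with a minimal sketch. The project itself used XSLT for this step; the version below uses Python instead, purely for illustration. The function name, the Solr field names (id, novel, position, speaker, text), the speaker identifiers, and the sample markup are all assumptions, not the project's actual schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI P5 namespace

# Hypothetical sample; the speaker ids are invented for illustration.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><said who="#narrator">It is a truth universally acknowledged.</said></p>
<p><said who="#mrs_bennet">My dear Mr. Bennet,</said>
<said who="#narrator">said his lady to him one day.</said></p>
</body></text></TEI>"""


def tei_to_solr_add(tei_xml: str, novel_id: str) -> str:
    """Build a Solr <add> message with one <doc> per <said> element."""
    root = ET.fromstring(tei_xml)
    add = ET.Element("add")
    # iter() walks the tree in document order, so the running number
    # lets search results be sorted back into reading order.
    for position, said in enumerate(root.iter(f"{{{TEI_NS}}}said"), start=1):
        doc = ET.SubElement(add, "doc")
        fields = {
            "id": f"{novel_id}_{position}",      # assumed id scheme
            "novel": novel_id,
            "position": str(position),
            "speaker": said.get("who", ""),
            # Flatten nested markup and collapse whitespace.
            "text": " ".join("".join(said.itertext()).split()),
        }
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


solr_xml = tei_to_solr_add(SAMPLE_TEI, "pride_and_prejudice")
print(solr_xml)
```

Numbering at conversion time, rather than at query time, means the Solr index can sort on a plain integer field to restore document order.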

For the word frequencies, XSLT scripts were written to create a unique word list for each character and trait.
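
As a rough sketch of how a per-character word list might be derived from the same markup, the Python example below counts words inside each <said> element, grouped by speaker. The project used XSLT for this; the function name, the tokenizing rule, and the sample markup here are assumptions made for illustration only.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI P5 namespace
# Assumed tokenizer: lowercase letters with an optional internal apostrophe.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

# Hypothetical sample; speaker ids and text are invented for illustration.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><said who="#speaker_a">Indeed, indeed I do not know.</said></p>
<p><said who="#speaker_b">I know.</said></p>
</body></text></TEI>"""


def word_counts_by_speaker(tei_xml: str) -> dict:
    """Return a mapping of speaker id to a Counter of the words they say."""
    root = ET.fromstring(tei_xml)
    counts = defaultdict(Counter)
    for said in root.iter(f"{{{TEI_NS}}}said"):
        speaker = said.get("who", "unknown")
        # itertext() includes text of nested elements, so a <said> nested
        # inside another is counted for both speakers in this sketch.
        text = "".join(said.itertext()).lower()
        counts[speaker].update(WORD_RE.findall(text))
    return counts


counts = word_counts_by_speaker(SAMPLE_TEI)
for speaker, counter in counts.items():
    # sorted(counter) gives the unique word list; counter holds frequencies.
    print(speaker, sorted(counter), counter.most_common(1))
```

The keys of each Counter form the unique word list for that speaker, and the values give the frequencies.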

Website Creation

The first proof-of-concept version of Austen Said was developed in Apache Cocoon; the current iteration was built in Ruby on Rails. Cocoon allowed for dynamic XSLT transformations, while the Ruby on Rails version preprocesses the files whenever they change.