Categories
DH Project Update Research Projects Undergraduate Fellows

Mapping the Scottish Reformation: Transatlantic Adventures in the Digital Humanities

[Please enjoy this guest post by Michelle D. Brock, Associate Professor of History at Washington and Lee University. Professor Brock has been a fabulous supporter of DH at W&L through the years and we’re thrilled to see this project take off.]

In the spring of 2020 (before the world seemed to change overnight), I spent just over two wonderful months as a Digital Scholarship Fellow at the Institute for Advanced Studies in the Humanities at the University of Edinburgh during my sabbatical from W&L. During this time, I pursued work on a project called Mapping the Scottish Reformation (MSR), directed by myself and Chris Langley of Newman University and featuring Mackenzie Brooks on our project team and Paul Youngman on our advisory board.

Mapping the Scottish Reformation (MSR) is a digital prosopography of ministers who served in the Church of Scotland between the Reformation Parliament of 1560 and the Revolution of 1689. By extracting data from thousands of pages of ecclesiastical court records held by the National Records of Scotland (NRS), Mapping the Scottish Reformation tracks clerical careers, showing where ministers were educated, how they moved between parishes, and what their personal and disciplinary histories looked like. This early modern data drives a powerful mapping engine that will allow users to build their own searches to track clerical careers over time and space.

The need for such a project was born of the fact that, despite a few excellent academic studies of individual ministers written in recent years, we still know remarkably little about this massive and diverse group. Many questions remain unanswered: How many ministers were moving from one area of Scotland to another? What was the influence of key presbyteries—the regional governing bodies of the Scottish kirk—or universities in this process? What was the average period of tenure for a minister? As of now, there is no way to answer such questions comprehensively, efficiently, and accurately. The voluminous ecclesiastical court records that contain the most detail about the careers of the clergy are not indexed, are cumbersome to search, and are completely inaccessible to the public and to scholars less familiar with the challenges of Scottish handwriting. The multi-volume print source with much of this biographical data on ministers, Hew Scott’s invaluable Fasti Ecclesiae Scoticanae, is not searchable across volumes and contains numerous errors and omissions. A new resource is thus necessary to both search and visualize clerical data, and we intend Mapping the Scottish Reformation to be that resource.

Our project began in earnest in 2017, when, thanks to funding from a W&L Mellon grant, Caroline Nowlin ’19 and Damien Hansford (a postgraduate at Newman University) began working with the Project Directors to pull initial data from the Fasti that could be used to test the feasibility of the project. Three years and a National Endowment for the Humanities HCRR grant later, we are in the pilot “proof of concept” phase of MSR, centered on gathering data on the clergy in the Synod of Lothian and Tweeddale—a large and complex region that includes modern-day Edinburgh. As such, my time at IASH was spent almost exclusively going through the presbytery records from this synod region to collect data on ministers at all levels in their clerical careers. I have often referred to this as the “unsexy” part of our work—dealing with the nitty-gritty of navigating often challenging and inconsistent records in order to gather the data that will power Mapping the Scottish Reformation. There was, of course, no better setting to do this work in than IASH, an institute in the heart of the very university where many of the ministers in the Synod of Lothian and Tweeddale were educated and near to the parishes where many of the most prominent of them served.

Throughout my fellowship period, two questions were at the forefront of my mind: Are there patterns, chronological or regional, that account for the great variance in ministerial lives and trajectories? Was there any such thing as a “typical” clerical career at all? What Dr. Langley and I have learned over the previous months is that the answers to these questions are significantly more complicated than previously understood by both historians and the wider public.

As we discussed during a presentation given in January at the Centre for Data, Culture and Society, the clerical career path was far less standardized than scholars usually assume. The terminology generally applied by historians and drawn from Hew Scott’s work—“admitting,” “instituting,” and “transferring” ministers—was that of a distinct profession. Unfortunately, by applying such terms to the early modern ministry, we may be transposing a system and language of formality that just wasn’t there or wasn’t yet fully developed. Thus, one of our central goals is to shed light on the complexity of clerical experiences and the development of the ministerial profession by capturing messy data from manuscripts and turning it into something machine readable and suited to a database and visualization layer. In short, we hope to make the qualitative quantitative, and to do so in a way that can also serve as a supplementary finding aid to the rich church court records held at NRS.

To date, my co-Director and I have gone through approximately 3,000 pages of presbytery minutes and collected information on over 300 clerics across more than twenty categories using Google Sheets. Dr. Langley has begun the process of uploading this data to Wikidata and running initial queries using SPARQL to generate basic data-driven maps. The benefit of using Wikidata at this phase in our project is that it is a linked open data platform and is already used as a data repository for the Survey of Scottish Witchcraft, which captured information on most of the parishes and a number of the ministers in our project. We are deeply grateful to the University of Edinburgh’s “Wikimedian in Residence” Ewan McAndrew, who met with us early in my fellowship period to explore opportunities for using Wikidata, which is now a critical part of the technical infrastructure of our project. Thanks to a recently awarded grant from the Strathmartine Trust, in the coming months we hope to collaborate with an academic technologist to build our own Mapping the Scottish Reformation interface, driven by our entries in Wikidata.

Though I sadly had to cut my fellowship period two weeks short due to the COVID-19 crisis, I had a wonderful and productive two months as a Digital Scholarship Fellow at IASH, thanks in no small part to the general sabbatical support from Washington and Lee. In this time, Mapping the Scottish Reformation progressed by leaps and bounds, thanks to the generosity and support of the Scottish history and digital humanities communities at the University of Edinburgh, as well as our colleagues at NRS. Our talk at Edinburgh’s Centre for Data, Culture and Society, which drew an audience not only of academics but also genealogists and local residents, was a real highlight, allowing us to make connections with a wide range of people interested in the history of Scotland, family history, the Reformation, and the digital humanities. These connections, and the ability to make access to data widely available, are more important than ever on both sides of the Atlantic, and I am looking forward to continuing this work at home in Virginia.

Categories
DH Event on campus Project Update Research Projects Speaker Series

Report on “Pray for Us: The Tombs of Santa Croce and Santa Maria Novella”

In her public talk on January 16, 2019, Dr. Anne Leader discussed her DH project Digital Sepoltuario, which will offer students, scholars and the general public an online resource for the study of commemorative culture in medieval and Renaissance Florence. Supported by the Institute for Advanced Technology in the Humanities (IATH) team at the University of Virginia, Digital Sepoltuario will chart the locations, designs and epitaphs of tombs made for Florentine families in sacred spaces across the city from about 1200 to about 1500, and then use archival data to analyze social networks, patterns of patronage and markers of status in the late Middle Ages and Early Modern period.

While the project is not yet complete, it will include transcriptions, translations, photographs and analysis of fragile manuscripts, like registers that kept track of where different people were buried and records that indicate which tombs have been moved or destroyed. These documents demonstrate that tombs were frequently recycled from one family to another when lineages died out or when a family could no longer afford the upkeep. Because the records sometimes lost track of a tomb’s owners, and because decorations faded or disintegrated over time, the ownership of some tombs remains impossible for historians to determine today.

From these documents, scholars like Leader gain insight into why people chose certain tombs or churches as their final resting places. The tombstones are embedded in the floors of churches in Florence, carpeting the churches with stone slabs that mark people’s final resting places and serve as reminders of everyone’s ultimate death. People would look down at the floor and contemplate what lay beneath the beautiful paintings and frescoes on the tombstones and within the churches, encouraging them to prepare for the final judgment and consider: am I ready for what’s to come?

By examining these records and incorporating them in a DH project, scholars can begin to answer questions about Florentines’ burial practices and ultimately about Florentines’ lives. Leader is interested in questions such as: How did Florentines decide on their final resting places, and how did they decide on the tombstones’ designs? So far, Leader has found that most people chose to be buried in their own parishes and close to their homes. However, she finds it interesting that increasing numbers of citizens requested burial elsewhere. This trend transformed the topography of Florence, causing tension within churches that relied on money from burying their dead and enriching some parishes while impoverishing others. Burial placement was one of the most important decisions Florentines would make, so considering why people wanted to be buried elsewhere and understanding the implications these decisions had on social status help scholars today decipher how early modern Europeans thought about burial and death. Digital Sepoltuario will make all of this possible.

This event was sponsored by Washington and Lee University’s Art History Department, the Digital Humanities Cohort and the Digital Humanities Mellon Grant.

-Jenny Bagger ’19, DH Undergraduate Fellow

Categories
DH Event on campus Project Update Research Projects

DH Research Talk with Stephen P. McCormick

DH Research Talk with Stephen P. McCormick
Wednesday, February 6th, 2019
12:15 PM – 1:15 PM
IQ Center
Lunch is provided. Please register here!


Join Stephen P. McCormick to learn more about his Huon d’Auvergne project and his work with DH students!

McCormick will speak on his research and work with the digital and facsimile edition of Huon d’Auvergne, a pre-modern Franco-Italian epic. Linking institutions and disciplines, the Huon d’Auvergne Digital Archive is a collaborative scholarly project that presents for the first time to a modern reading audience the Franco-Italian Huon d’Auvergne romance epic.

This talk is sponsored by the Medieval and Renaissance Studies Program and the Digital Humanities Cohort.

Categories
Announcement Event on campus Pedagogy People Research Projects

Winter Academy 2018 — rescheduled!

[FYI this event has been rescheduled for January 16, 2019 from 12:15pm-1:15pm. Join us for the same great lineup! Please register on Event Manager.]

With a fresh snow and impending finals, it is certainly time to look toward Winter Academy offerings. The entire line-up looks great this year, but we invite you to join us for the following DH event:

Monday, December 10th, 2018
12:15-1:15pm
Hillel 101
Lunch provided

Digital Humanities Summer Research Panel
Curious about how “digital humanities”–whatever that means–can fit into your research? What it’s like to work collaboratively with undergraduates working on humanistic questions? What impact the research can have on your pedagogy? Then, you should hear from Mellon Summer Digital Humanities Faculty Research awardees and a Special Collections project.

Presenters: George Bent, Professor of Art History; Sydney Bufkin, Mellon Digital Humanities Fellow; Megan Hess, Assistant Professor of Accounting

Don’t forget to register at http://go.wlu.edu/winteracademy!


Looking to fill out the rest of your week? We recommend the following:

  • Leveraging Technology to Cultivate an Inclusive Classroom – Kelly Hogan and Viji Sathy (UNC Chapel Hill), Monday at 9:15-10:45pm
  • Imaging in the IQ Center – Dave Pfaff, Monday at 2:15pm
  • How is Technology Affecting Your Mojo? Finding Mindfulness – Marsha Mays-Bernard (JMU), Wednesday at 2:30-4pm
Categories
Announcement DH Research Projects Tools

New Resource – Ripper Press Reports Dataset

[Crossposted on my personal blog.]

Update: since posting this, Laura McGrath reached out about finding an error in the CSV version of the data. The version linked to here should be cleaned up now. In addition, you will want to follow steps at the end of this post if using the CSV file in Excel. And thanks to Mackenzie Brooks for her advice on working with CSV files in Excel.

This semester I have been co-teaching a course on “Scandal, Crime, and Spectacle in the Nineteenth Century” with Professor Sarah Horowitz in the history department at W&L. We’ve been experimenting with ways to make the work we did for the course available for others beyond our students this term, which led to an open coursebook on text analysis that we used to teach some basic digital humanities methods.

I’m happy to make available today another resource that has grown out of the course. For their final projects, our students conducted analyses of a variety of historical materials. One of our student groups was particularly interested in Casebook: Jack the Ripper, a site that gathers transcriptions of primary and secondary materials related to the Whitechapel murders. The group wanted to analyze the site’s materials, but they only had time to copy and paste a few things from the archive for use in Voyant. I found myself wishing that we could offer a version of the site’s materials better formatted for text analysis.

So we made one! With the permission of the editors at the Casebook, we have scraped and repackaged one portion of their site, the collection of press reports related to the murders, in a variety of forms for digital researchers. More details about the dataset are below, and we’ve drawn from the descriptive template for datasets used by Michigan State University while putting it together. Just write to us if you’re interested in using the dataset – we’ll be happy to give you access under the terms described below. And also feel free to get in touch if you have thoughts about how to make datasets like this more usable for this kind of work. We’re planning on using this dataset and others like it in future courses here at W&L, so stay tuned for more resources in the future.


Title

Jack the Ripper Press Reports Dataset

Download

The dataset can be downloaded here. Write to walshb@wlu.edu if you have any problems accessing the dataset. This work falls under a CC BY-NC license. Anyone can use this data under these terms, but they must acknowledge, both in name and through hyperlink, Casebook: Jack the Ripper as the original source of the data.

Description

This dataset features the full texts of 2677 newspaper articles between the years of 1844 and 1988 that reference the Whitechapel murders by Jack the Ripper. While the bulk of the texts are, in fact, contemporary to the murders, a handful of them skew closer to the present as press reports for contemporary crimes look back to the infamous case. The wide variety of sources available here gives a sense of how the coverage of the case differed by region, date, and publication.

Preferred Citation

Jack the Ripper Press Reports Dataset, Washington and Lee University Library.

Background

The Jack the Ripper Press Reports Dataset was scraped from Casebook: Jack the Ripper and republished with the permission of their editorial team in November 2016. The Washington and Lee University Digital Humanities group repackaged the reports here so that the collected dataset may be more easily used by interested researchers for text analysis.

Format

The same dataset exists here organized in three formats: two folders, ‘by_journal’ and ‘index’, and a CSV file.

  • by_journal: organizes all the press reports by journal title.
  • index: all files in a single folder.
  • casebook.csv: a CSV file containing all the texts and metadata.

Each folder has related but slightly different file naming conventions:

  • by_journal:
    • journal_title/YearMonthDayPublished.txt
    • eg. augusta_chronicle/18890731.txt
  • index:
    • journal_title_YearMonthDayPublished.txt
    • eg. augusta_chronicle_18890731.txt
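For anyone scripting against the index folder, the naming convention above can be unpacked in a few lines of Python (the function name here is my own, not part of the dataset):

```python
import os
import datetime

def parse_report_filename(path):
    """Split a filename like 'augusta_chronicle_18890731.txt' into its
    journal title and publication date, per the index naming convention."""
    stem = os.path.splitext(os.path.basename(path))[0]
    journal, datestr = stem[:-9], stem[-8:]  # strip the '_YYYYMMDD' suffix
    return journal, datetime.date(int(datestr[:4]), int(datestr[4:6]), int(datestr[6:8]))
```

Called on `index/augusta_chronicle_18890731.txt`, this returns the journal title `augusta_chronicle` and the date 1889-07-31.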

The CSV file is organized according to the following column conventions:

  • id of text, full filename from within the index folder, journal title, publication date, text of article
  • eg. 1, index/augusta_chronicle_18890731.txt, augusta_chronicle, 1889-07-31, “lorem ipsum…”

Size

The zip file contains two smaller folders and a CSV file. Each of these contains the same dataset organized in slightly different ways.

  • by_journal – 24.9 MB
  • index of all articles – 24.8 MB
  • casebook.csv – 18.4 MB
  • Total: 68.1 MB uncompressed

Data Quality

The text quality here is high, as the Casebook contributors transcribed them by hand.

Acknowledgements

Data collected and prepared by Brandon Walsh. Original dataset scraped from Casebook: Jack the Ripper and republished with their permission.


If working with the CSV data in Excel, you have a few extra steps to import the data. Excel has character limits on cells and other configurations that will make things go sideways unless you take precautions. Here are the steps to import the CSV file:

  1. Open Excel.
  2. Make a blank spreadsheet.
  3. Go to the Data menu.
  4. Click “Get External Data”.
  5. Select “Import Text File”.
  6. Navigate to your CSV file and select it.
  7. Select “Delimited” and hit next.
  8. In the next section, uncheck “Tab” and check “Comma”, click next.
  9. In the next section, click on the fifth column (the column one to the right of the date column).
  10. At the top of the window, select “Text” as the column data format.
  11. It will take a little bit to process.
  12. Click ‘OK’ for any popups that come up.
  13. It will still take a bit to process.
  14. Your spreadsheet should now be populated with the Press Reports data.
Categories
DH Research Projects Undergraduate Fellows

Markdown and Manuscripts

What I’m Currently Working On:


The Commissione of Gerolimo Morosini is a late 17th century manuscript that
serves simultaneously as a letter of appointment, a professional code of
conduct, and a list of legal actions and precedents. It was issued to Morosini,
a Venetian noble from a prominent family, by Doge Marcantonio Giustinian, whose
short reign of four years helps to date the work accurately. Currently housed
in Washington and Lee’s own Special Collections, the manuscript offers me a
rare opportunity in several ways:

  • Commissioni are unique, as no more than two copies of each were ever made,
    meaning little, if any, research has already examined this piece
  • Transcribing and translating the text allows me to apply the Italian I am
    learning in class this year
  • Working with an item such as this might normally happen in graduate school,
    but I’m beginning this project as a sophomore

Needless to say, I’m extremely excited to have this chance to make some really
cool discoveries. But the problem with transcribing a manuscript, regardless
of its age, became apparent the very moment I began. How closely should I
replicate the appearance and context of the original text? In a poetic work,
elements such as enjambment and line breaks have an impact on the reader’s
perception of the work, and therefore ought to be preserved. Prose lacks these
constraints and may be rendered more freely, but chapters, titles, page
numbers, and the like can still pose problems. No two
projects are identical, and it is up to the researcher to decide how to
approach the text.

Because the manuscript I am working on was written in prose and not verse, and
the text lacks page numbers and other identifying features, I have decided to
transcribe it in a manner as close to plaintext as possible. To this end I
have made use of Markdown, a simple way to format text without all the
complexity of a markup language like HTML or XML/TEI. It’s also very easy to
export to these formats later on, so Markdown presents the best option for me
to start transcribing as quickly and painlessly as possible.

While plaintext is great for its simplicity, it still helps to have a few more
features that Notepad lacks. Using the text editor Atom allows me to keep
track of particular elements by highlighting Markdown syntax, as seen here:

[Screenshot: the Atom editor highlighting Markdown syntax]

Items in orange are bold and indicate important sections of text, such as
passage headers/titles. Items in purple have been italicized by me, indicating
a word whose spelling or transcription I am not 100% certain of. I’ve used
dashes to indicate a hard-to-read letter and the pound symbol (not a hashtag!)
for headers to mark the recto and verso of folio pages for my own ease of use.
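To make these conventions concrete, here is an invented sample of what a transcribed passage might look like in Markdown (the wording is placeholder text, not a transcription of the actual manuscript):

```markdown
# 2r

**Commissione del Serenissimo Principe**

Noi Marcantonio Giustinian, per gratia di Dio Doge di Venetia, al nobile
*Gerolimo* Morosini: la presente co--issione...
```

The `#` header marks the recto of folio 2, bold marks a passage header, italics flag an uncertain reading, and the dashes stand in for two illegible letters.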

As I spend more time working with the manuscript and studying Italian, I’ll go
back and edit the text appropriately. The ultimate goal for this project is to
have the entire commissione transcribed and then translated into English,
with both the Italian and English versions encoded using TEI/XML and made
publicly available.

– Aidan Valente

Categories
DH Project Update Research Projects

Reading Speech: Virginia Woolf, Machine Learning, and the Quotation Mark

[Cross-posted on my personal blog as well as the Scholars’ Blog. What follows is a slightly more fleshed out version of what I presented this past week at HASTAC 2016 (complete with my memory-inflected transcript of the Q&A). I gave a bit more context for the project at the event than I do here, so it might be helpful to read my past two posts on the project here and here before going forward. This talk continues that conversation.]

This year in the Scholars’ Lab I have been working with Eric on a machine learning project that studies speech in Virginia Woolf’s fiction. I have written elsewhere about the background for the project and initial thoughts towards its implications. For the purposes of this blog post, I will just present a single example to provide context. Consider the famous first line of Mrs. Dalloway:

Mrs Dalloway said, “I will buy the flowers myself.”

Nothing to remark on here, except for the fact that this is not how the sentence actually comes down to us. I have modified it from the original:

Mrs Dalloway said she would buy the flowers herself.

My project concerns moments like these, where Woolf implies the presence of speech without marking it as such with punctuation. I have been working with Eric to lift such moments to the surface using computational methods so that I can study them more closely.

I came to the project by first tagging such moments myself as I read through the text, but I quickly found myself approaching upwards of a hundred instances in a single novel, far too many for me to keep track of in any systematic way. What’s more, the practice made me aware of just how subjective my interpretation could be. Some moments, like this one, parse fairly well as speech. Others complicate distinctions between speech, narrative, and thought and are more difficult to identify. I became interested in the features of such moments. What is it about speech in a text that helps us to recognize it as such, if not for the quotation marks themselves? What could we learn about sound in a text from the ways in which it structures such sound moments?

These interests led me towards a particular kind of machine learning, supervised classification, as an alternate means of discovering similar moments. For those unfamiliar with the concept, an analogy might be helpful. As I am writing this post on a flight to HASTAC and just finished watching a romantic comedy, these are the tools that I will work with. Think about the genre of the romantic comedy. I only know what this genre is by virtue of having seen my fair share of them over the course of my life. Over time I picked up a sense of the features associated with these films: a serendipitous meeting leads to infatuation, things often seem resolved before they really are, and the films often focus on romantic entanglements more than any other details. You might have other features in mind, and not all romantic comedies will conform to this list. That’s fine: no one’s assumptions about genre hold all of the time. But we can reasonably say that, the more romantic comedies I watch, the better my sense of what a romantic comedy is. My chances of being able to watch a movie and successfully identify it as conforming to this genre will improve with further viewing. Over time, I might also be able to develop a sense of how little or how much a film departs from these conventions.

Supervised classification works on a similar principle. By using the proper tools, we can feed a computer program examples of something in order to have it later identify similar objects. For this project, this process means training the computer to recognize and read for speech by giving it examples to work from. By providing examples of speech occurring within quotation marks, we can teach the program when quotation marks are likely to occur. By giving it examples of what I am calling ‘implied speech,’ it can learn how to identify those as well.

For this machine learning project, I analyzed Woolf texts downloaded from Project Gutenberg. Eric and I put together scripts in Python 3 that used a package known as the Natural Language Toolkit (NLTK) for classifying. All of this work can be found at the project’s GitHub repository.
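Since the project uses NLTK, the training step can be sketched roughly as follows. The feature extractor and the tiny hand-labeled training set here are invented stand-ins for the project’s actual code, kept deliberately small:

```python
import nltk

def speech_features(sentence):
    """Toy feature extractor: a few of the cues discussed in this post."""
    words = [w.strip('".,').lower() for w in sentence.split()]
    return {
        "has_speech_verb": any(w in ("said", "exclaimed", "whispered") for w in words),
        "has_first_person": "i" in words,
    }

# Hand-labeled examples (invented for illustration).
train_set = [
    ('"I will buy the flowers myself," she said.', "speech"),
    ("Mrs Dalloway said she would buy the flowers herself.", "implied"),
    ("The morning was fresh and still.", "narration"),
]

classifier = nltk.NaiveBayesClassifier.train(
    [(speech_features(text), label) for text, label in train_set]
)
```

With a realistic amount of training data, `classifier.classify(speech_features(sentence))` then assigns one of the labels to an unseen sentence.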

The project is still ongoing, and we are still working out some difficulties in our Python scripts. But I find the complications of the process to be compelling in their own right. For one, when working in this way we have to tell the computer what features we want it to pay attention to: a computer does not intuitively know how to make sense of the examples that we want to train it on. In the example of romantic comedies, I might say something along the lines of “while watching these films, watch out for the scenes and dialogue that use the word ‘love.'” We break down the larger genre into concrete features that can be pulled out so that the program knows what to watch out for.

To return to Woolf, punctuation marks are an obvious feature of interest: the author suggests that we have shifted into the realm of speech by inserting these grammatical markings. Find a quotation mark, and you are likely to be looking at speech. But I am interested in just those moments where we lose those marks, so it helps to develop a sense of how they might work. We can then begin to extrapolate those same features to places where the punctuation marks might be missing. We have developed two models for understanding speech in this way: an external and an internal model. To illustrate, I have taken a single sentence and bolded what the model takes to be meaningful features according to each model. Each represents a different way of thinking about how we recognize something as speech.

External Model for Speech:

“I love walking in London,” said Mrs. Dalloway.  “Really it’s better than walking in the country.”

The external model was our initial attempt to model speech. In it, we take an interest in the narrative context around quotation marks. In any text, we can say that there exist a certain range of keywords that signal a shift into speech: said, recalled, exclaimed, shouted, whispered, etc. Words like these help the narrative attribute speech to a character and are good indicators that speech is taking place. Given a list of words like this, we could reasonably build a sense of the locations around which speech is likely to be happening. So when training the program on this model, we had the classifier first identify locations of quotation marks. Around each quotation mark, the program took note of the diction and parts of speech that occurred within a given distance from the marking. We build up a sense of the context around speech.
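A rough sketch of the external model’s feature collection might look like this (my own reconstruction, not the project’s code): for every quotation mark, record the tokens that fall within a fixed window on either side.

```python
import re

def external_contexts(text, window=5):
    """For each quotation mark, collect up to `window` tokens of
    surrounding narrative context on each side."""
    tokens = re.findall(r'"|[\w\']+', text)
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == '"':
            contexts.append({
                "before": tokens[max(0, i - window):i],
                "after": tokens[i + 1:i + 1 + window],
            })
    return contexts
```

Run on the example sentence, the context after the first closing mark includes “said”, exactly the kind of attribution cue the model learns to associate with speech.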

Internal Model for Speech:

“I love walking in London,” said Mrs. Dalloway. “Really it’s better than walking in the country.”

The second model we have been working with works in an inverse direction: instead of taking an interest in the surrounding context of speech, an internal model assumes that there are meaningful characteristics within the quotation itself. In this example, we might notice that the shift to the first-person ‘I’ is a notable feature in a text that is otherwise largely written in the third person. This word suggests a shift in register. Each time this model encounters a quotation mark it continues until it finds a second quotation mark. The model then records the diction and parts of speech inside the pair of markings.
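The internal model can be sketched in a similarly rough way (again my reconstruction): walk the tokens, gather everything between each pair of quotation marks, and flag features such as first-person pronouns inside the span.

```python
import re

def internal_spans(text):
    """Collect the words inside each pair of quotation marks and flag
    the first-person shift described above."""
    spans, current, inside = [], [], False
    for tok in re.findall(r'"|[\w\']+', text):
        if tok == '"':
            if inside:
                spans.append({
                    "words": current,
                    "first_person": any(w.lower() in ("i", "me", "my") for w in current),
                })
                current = []
            inside = not inside
        elif inside:
            current.append(tok)
    return spans
```

Note that this assumes quotation marks come in pairs, an assumption that, as the Q&A below discusses, does not always hold in real texts.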

Each model suggests a distinct but related understanding for how sound works in the text. When I set out on this project, I had aimed to use the scripts to give me quantifiable evidence for moments of implied speech in Woolf’s work. The final step in this process, after all, is to actually use these models to identify speech: looking at texts they haven’t seen before, the scripts insert a caret marker every time they believe that a quotation mark should occur. But it quickly became apparent that the construction of the algorithms to describe such moments would be at least as interesting as any results that the project could produce. In the course of constructing them, I have had to think about the relationships among sound, text, and narrative in new ways.

The algorithms are each interpretative in the sense that they reflect my own assumptions about my object of study. The models also reflect assumptions about the process of reading, how it takes place, and about how a reader converts graphic markers into representations of sound. In this sense, the process of preparing for and executing text analysis reflects a certain phenomenology of reading as much as it does a methodology of digital study. The scripting itself is an object of inquiry in its own right and reflects my own interpretation of what speech can be. These assumptions are worked and reworked as I craft algorithms and python scripts, all of which are as shot through with humanistic inquiry and interpretive assumptions as any close readings.

For me, such revelations are the real reasons for pursuing digital study: attempting to describe complex humanities concepts computationally helps me to rethink basic assumptions about them that I had taken for granted. In the end, the pursuit of an algorithm to describe textual speech is nothing more or less than the pursuit of deeper and enriched theories of text and speech themselves.

Postscript

I managed to take note of the questions I got when I presented this work at HASTAC, so what follows are paraphrases of my memory of them as well as some brief remarks that roughly reflect what I said in the moment. There may have been one other that I cannot quite recall, but alas such is the fallibility of the human condition.

Q: You distinguish between speech and implied speech, but do you account at all for the other types of speech in Woolf’s novels? What about speech that is remembered speech that happened in earlier timelines not reflected in the present tense of the narrative’s events?

A: I definitely encountered this during my first pass at tagging speech and implied speech in the text by hand. Instead of binaries like quoted speech/implied speech, I found myself wanting to mark for a range of speech types: present, actual; remembered, might not have happened; remembered incorrectly; remembered, implied; etc. I decided that a binary was more feasible for the machine learning problems I was interested in, but the whole process reinforced how subjective any reading process is: another reader might mark things differently. If these processes shape the construction of the theories that inform the project, then they necessarily also affect the algorithms themselves as well as the results they can produce. And it quickly becomes apparent that these decisions reflect a kind of phenomenology of reading as much as anything: they illustrate my understanding of how a complicated set of markers and linguistic phenomena contributes to our sense that a passage is speech or not.

Q: Did you encounter any variations in the particular markings that Woolf was using to punctuate speech? Single quotes, etc., and how did you account for them?

A: Yes – the version of Orlando that I am working with used single quotes to notate speech, so I was forced to account for such edge cases. But the question points at two larger issues: one authorial and one bibliographical. As I worked on Woolf I was drawn to the idea of being able to run such a script against a wider corpus. Since the project seemed to impinge on how we understand psychologized speech more broadly, it would be fascinating to be able to search for implied speech in other authors. But, if you are familiar with, say, Joyce, you might remember that he hated quotation marks and used dashes to denote speech. The question is how far you can account for such edge cases; if you cannot, the study becomes one of a single author's idiosyncrasies (which still has value). But from there the question spirals outwards. At least one of my models (the internal one) relies on quotation marks themselves as boundary markers. The model assumes that quotation marks will come in pairs, and this is not always the case. Sometimes authors, intentionally or accidentally, omit a closing quotation mark. I had to massage the data in at least half a dozen places where a quotation mark was missing from the text and where its absence was causing my program to fail entirely. As textual criticism has taught us, punctuation marks are the single most likely things to be modified over time during the process of textual transmission by scribes, typesetters, editors, and authors. So in that sense, I am not doing a study of Woolf's punctuation so much as a study of Woolf's punctuation in these particular versions of the texts. One can imagine an exhaustive study of all versions of all of Woolf's texts that might approach some semblance of a correct and thorough reading. For this project, however, I elected the lesser of two evils that would still allow me to work through the material: I worked with the texts that I had.
I take all of this as proof that you have to know your corpus and your own shortcomings in order to responsibly work on the materials – such knowledge helps you to validate your responses, question your results, and reframe your approaches.
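The pairing assumption can be made concrete with a small sketch. This is not the project's actual code, just an illustration of why an unpaired quotation mark breaks a model that treats quotes as boundary markers:

```python
# Treat quotation marks as boundary markers that arrive in open/close
# pairs, and flag texts where that assumption fails. The quote character
# is a parameter, since editions vary (double quotes, single quotes).

def quotes_balanced(text, quote_char='"'):
    """Return True if every quotation mark in `text` has a partner."""
    return text.count(quote_char) % 2 == 0

def quoted_spans(text, quote_char='"'):
    """Return (open, close) index pairs for each quoted span."""
    positions = [i for i, ch in enumerate(text) if ch == quote_char]
    if len(positions) % 2 != 0:
        # The failure mode described above: an omitted closing quote
        # leaves the final span unbounded.
        raise ValueError('unbalanced quotation marks')
    return list(zip(positions[::2], positions[1::2]))

sample = '"Yes," she said, "quite so."'
print(quotes_balanced(sample))  # True
print(quoted_spans(sample))     # [(0, 5), (17, 27)]
```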

Q: You talked a lot about text approaching sound, but what about the other way around – how do things like implied speech get reflected in audiobooks, for example? Is there anything in recordings of Woolf that imply a kind of punctuation that you can hear?

A: I wrote about this extensively in my dissertation, but for here I will just say that I think the textual phenomenon the questioner is referencing occurs on a continuum. Some graphic markings, like pictures, shapes, and punctuation marks, do not clearly translate to sound. And the reverse is true: the sounded quality of a recording can only ever be remediated by a print text. There are no perfect analogues between different media forms. Audiobook performers might attempt to convey things like punctuation or implied speech (in the audiobook of *Ulysses*, for example, Jim Norton throws his voice and lowers his volume to suggest free indirect discourse). In the end, I think such moments play with an idea my dissertation calls audiotextuality: the idea that all texts and recordings of texts, to varying degrees, contain both sound and print elements. The two spheres may work in harmony or against each other in a kind of productive friction. The idea is a slippery one, but I think it speaks to moments like the implied punctuation mark that comes through in a particularly powerful audiobook recording.

Categories
Announcement DH Project Update Publication Research Projects

In Case You Missed the News

We’re caught up in the craziness of our four week spring term here at W&L, but we wanted to make sure you were caught up on some recent news from our DH community.

Ancient Graffiti Project wins NEH Digital Humanities Start-Up Grant

Heralded as the “epitome of liberal arts,” the Ancient Graffiti Project was recently awarded $75,000 to continue work on their database for textual and figural graffiti. Learn more from the W&L press release or the Atlantic Monthly article. Congrats to Sara Sprenkle, Rebecca Benefiel, and the rest of their team!


Stephen P. McCormick wins Mednick Fellowship from the Virginia Foundation for Independent Colleges

Stephen P. McCormick, Assistant Professor of French, has been awarded the 2016 Mednick Fellowship by the VFIC for his work on the Huon d'Auvergne project. Learn more about McCormick's work on one of the last unpublished Franco-Italian romance epics from this article or dig into the digital edition yourself.


Joel Blecher publishes chapter on Digital Humanities pedagogy

Joel Blecher, Assistant Professor of Religion, won a DH Incentive Grant in fall of 2014 for incorporating data visualization into a History of Islamic Civilization course. You can now read about this experience in a new title from De Gruyter, The Digital Humanities and Islamic & Middle East Studies. Blecher's chapter is titled "Pedagogy and the Digital Humanities: Undergraduate Exploration into the Transmitters of Early Islamic Law," which you can read in print or electronic form through Leyburn Library.


Look forward to reports on our summer activities coming soon. We have teams going to DHSI, ILiADS, the Oberlin Digital Scholarship Conference, and more!

Categories
DH Pedagogy Research Projects

Reflections on a Year of DH Mentoring

[Cross-posted on the Scholars’ Lab blog]

This year I am working with Eric Rochester in the Scholars' Lab on a fellowship project that has me learning natural language processing (NLP), the application of computational methods to human languages. We're adapting these techniques to study quotation marks in the novels of Virginia Woolf (read more about the project here). We actually started several months before this academic year began, and, as we close out another semester, I have been spending time thinking about just what has made it such an effective learning experience for me. I already had a technical background from my time in the Scholars' Lab at the beginning of the process, but I had no experience with Python or NLP. Now I feel more comfortable with Python than with any other programming language, and familiar enough with NLP to experiment with it in my own work.

The general mode of proceeding has been this: depending on schedules and deadlines, we meet once or twice every two weeks. Between meetings I work as far and as much as I can, and the sessions offer a space for Eric and me to talk about what I have done. The following are a handful of things we have done that, I think, have helped to create such an effective environment for learning new technical skills. Though they are particular to this study, I think they can be usefully extrapolated to apply to many other project-based courses of study in digital humanities. They are primarily written from the perspective of a student but with an eye to how and why the methods Eric used proved so effective for me.

Let the Wheel Be Reinvented Before Sharing Shortcuts

I came to Eric with a very small program adapted from Matt Jockers's book Text Analysis with R for Students of Literature that did little beyond count quotation marks and give some basic statistics. I was learning as I built the thing, so I was unaware that I was reinventing the wheel in many cases, rebuilding many protocols for dealing with commonly recognized problems that come from working with natural language. After I had worked on my program and my approach to a degree of satisfaction, Eric pulled back the curtain to reveal that a commonly used Python module, the Natural Language Toolkit (NLTK), could address many of my issues and more. NLTK came as something of a revelation, and working inductively in this way gave me a great sense of the underlying problems the tools could address. By inventing my own way to read in a text, clean it to make it uniformly readable by the computer, and break the whole piece into a series of words that could be analyzed, I understood the magic behind the couple of lines of NLTK code that could do all of that for me. The experience also helped me to recognize ways in which we would have to adapt NLTK for our own purposes as I worked through the book.
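As an illustration (not my actual early program), the hand-rolled pipeline might have looked something like the sketch below, where reading, cleaning, and tokenizing happen step by step; NLTK's `word_tokenize` collapses most of this into a single call.

```python
import re

# A hand-rolled version of the read/clean/tokenize pipeline: lowercase
# the text, strip out punctuation and digits, and split on whitespace.
# NLTK replaces all of this with something like:
#     from nltk import word_tokenize
#     tokens = word_tokenize(raw_text)

def load_and_tokenize(raw_text):
    """Normalize a text and break it into a list of analyzable words."""
    cleaned = raw_text.lower()                   # uniform case
    cleaned = re.sub(r'[^a-z\s]', ' ', cleaned)  # drop punctuation/digits
    return cleaned.split()                       # split into word tokens

print(load_and_tokenize('Orlando -- a Biography. By Virginia Woolf, 1928.'))
# ['orlando', 'a', 'biography', 'by', 'virginia', 'woolf']
```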

Have a Plan, but Be Flexible

After discussing NLTK and how it offered an easier way of doing the things that I wanted, Eric had me systematically work through the NLTK book for a few months. Our meetings took on the character of an independent study: the book set the syllabus, and I went through the first seven chapters at my own pace. Working from a book gave our meetings structure, but we were careful not to hew too closely to the material. Not all chapters were relevant to the project, and we cut sections of the book accordingly. We shaped the course of study to the intellectual questions rather than the other way around.

Move from Theory to Practice / Textbook to Project

As I worked through the book, I was able to recognize certain sections that felt most relevant to the Woolf work. Once I felt as though I had reached a critical mass, we switched from the book to the project itself and started working. I tend to learn best by doing, so the shift from theory to execution was a natural one. The quick and satisfying transition helped the work to feel productive right away: I was applying my new skills while I was still learning to feel comfortable with them. Where the initial months had more the feel of a traditional student-teacher interaction, the project-based approach we took up at this point felt more like a true collaboration. Eric and I would develop to-do items together, we would work alongside each other, and we would talk over the project together.

Document Everything

Between our meetings I would work as far and as much as I could, carefully noting places at which I encountered problems. In some cases, these were conceptual problems that needed clarifying, and these larger questions frequently found their way into separate notes. But my questions were just as often about what a particular line of code, a particular command or function, might be doing. In those cases, I made comments directly in the code describing my confusion. I quickly found that these notes were as much for me as for Eric: I needed to get back into the frame of mind that led to the confusion in the first place, and copious notes helped remind me what the problem was. These notes also offered a point of departure for our meetings: we always had a place to start, and we did so based on the work that I had done.

Communicate in as Many Ways as Possible

We met in person as much as possible, but we also used a variety of other platforms to keep things moving. Eric and I had all of our code on GitHub so that we could share everything that we had each been working on and discuss things from a distance if necessary. Email, obviously, can do a lot, but I found the chat capabilities of the Scholars' Lab's IRC channel to be far better for this sort of work. If I hit a particular snag that would only require a couple minutes for Eric to answer, we could quickly work things out through a web chat. With Skype and Google Hangouts we could share the code on the other person's screen from hundreds of miles away. All of these things meant that we could keep working around whatever life events happened to call us away.

Recognize Spinning Wheels

These multiple avenues of communication are especially important when teaching technical skills. Not all questions or problems are the same: students can work through some on their own, but others can take them days to troubleshoot. Some amount of frustration is a necessary part of learning, and I do think it’s necessary that students learn to confront technical problems on their own. But not all frustration is pedagogically productive. There comes a point when you have tried a dozen potential solutions and you feel as though you have hit a wall. An extra set of eyes can (and should) help. Eric and I talked constantly about how to recognize when it was time for me to ask for help, and low-impact channels of communication like IRC could allow him to give me quick fixes to what, to me at least, seemed like impossible problems. Software development is a collaborative process, and asking for help is an important skill for humanists to develop.

In-person Meetings Can Take Many Forms

When we met, Eric and I did a lot of different things. First, we would talk through my questions from the previous week. If I felt a particular section of code was clunky or poorly done, he would talk and walk me through rewriting the same piece in a more elegant form. We would often pair program, where Eric would write code while I watched, carefully stopping him each time I had a question about something he was doing. And we often took time to reflect on where the collaboration was going – what my end goal was as well as what my tasks before the next meeting would be. Any project has many pieces that could be dealt with at any time, and Eric was careful to give me solo tasks that he felt I could handle on my own, reserving more difficult tasks for times in which we would be able to work together. All of this is to say that any single hour we spent together was very different from the last. We constantly reinvented what the meetings looked like, which kept them fresh and pedagogically effective.

This is my best attempt to recreate my experience of working in such a close mentoring relationship with Eric. Obviously, the collaboration relies on an extremely low student-to-teacher ratio: I can imagine this same approach working very well for a handful of students, but this work required a lot of individual attention that would be hard to sustain for larger classes. One idea for scaling the process up might be to divide a course into groups, begin by training one, and then have students further along in the process mentor those who are just beginning. Doing so would preserve what I see as the main advantage of this approach: it helps to collapse the hierarchy between student and teacher and engage both in a common project. Learning takes place, but it does so in the context of common effort. I'd have to think more about how this mentorship model could be adapted to fit different scenarios. The work with Eric is ongoing, but it's already been one of the most valuable learning experiences I have had.

Categories
DH Incentive Grants Pedagogy Research Projects Tools

Raw Density & early Islamic law

Professor Joel Blecher received a DH Incentive grant from W&L for the course History of Islamic Civilization I: Origins to 1500. A pedagogical DH component of that course is for students to produce a set of visualizations of data that they have collected about the transmission of early Islamic law. The students will be using two tools for the visualizations: Palladio and Raw Density.

In this post we’ll examine the use of Raw Density. Separate posts will explore the use of Palladio and the data collection process. This post will provide one example of a data visualization of early Islamic law.

Raw Density

Raw Density is a Web app offering a simple way to generate visualizations from tabular data, e.g., spreadsheets or delimiter-separated values. Getting started with Raw is deceptively simple: just upload your data.

The complicated part is deciding which of the sixteen visuals is best for your data. While an entire course could be taught on data visualizations, the purpose within this course is for the students to develop familiarity with visualizing historical data. Not all types of charts are appropriate for every type of data.

Our sample diagram uses the first option in Raw Density, which is what the creators behind Raw Density call an “Alluvial diagram (Fineo-like)”. (Fineo was a former research project by Density Design, the developers of Raw Density.) We’re using this type of diagram to show relationships among different types of categories.

Transmitters of early Islamic law

This diagram is based on 452 transmitters of early Islamic law. A transmitter is classified either as a companion or a follower. A companion is one who encountered Muhammad in his lifetime. A follower is one who lived in the generation after Muhammad’s death.

[Alluvial diagram: transmitters by gender, transmitterStatus, and conversion status]

The data collected consists of 17 fields, but for the purposes of this diagram we used only 4 categories: gender, transmitterStatus, Converted (Yes/No), and priorReligion. When the transmitterStatus was unknown, the transmitter was grouped into either other or undetermined.
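A hypothetical sketch of that data preparation step: the four field names come from this post, but the function, the sample record, and the grouping rule below are illustrative, not the course's actual workflow.

```python
# Reduce a 17-field transmitter record to the four categories used in
# the alluvial diagram, folding unknown status values into 'undetermined'.

KEEP = ['gender', 'transmitterStatus', 'Converted', 'priorReligion']
KNOWN_STATUSES = {'companion', 'follower', 'other'}

def prepare_row(record):
    """Keep only the diagram's four fields; normalize unknown statuses."""
    row = {field: record.get(field, '') for field in KEEP}
    if row['transmitterStatus'] not in KNOWN_STATUSES:
        row['transmitterStatus'] = 'undetermined'
    return row

record = {'name': 'Example', 'gender': 'female', 'transmitterStatus': '',
          'Converted': 'Yes', 'priorReligion': 'polytheism'}
print(prepare_row(record))
# {'gender': 'female', 'transmitterStatus': 'undetermined',
#  'Converted': 'Yes', 'priorReligion': 'polytheism'}
```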

In the diagram you can see how the colored ribbons visualize the flow of data from the general category of gender to the more specific categories. The right side of the diagram divides the transmitters into those who had converted from a prior religion (marked as 'Yes') and those who had not (marked as 'No').

Visualization allows for a clearer understanding of the data than is possible through a simple examination of tabular content in a spreadsheet. Visualization makes it easy to spot data collection errors. For example, is there a real distinction in the transmitterStatus field between Other and Undetermined, or could we have collapsed them into a single value in our data collection form? The visualization also identifies where further research is needed; e.g., other data sources should provide details about whether the transmitters with undetermined/other status were companions or followers.

The students in this course will produce various visualizations using Raw Density.