Categories
DH Project Update Research Projects

Reading Speech: Virginia Woolf, Machine Learning, and the Quotation Mark

[Cross-posted on my personal blog as well as the Scholars’ Lab blog. What follows is a slightly more fleshed-out version of what I presented this past week at HASTAC 2016 (complete with my memory-inflected transcript of the Q&A). I gave a bit more context for the project at the event than I do here, so it might be helpful to read my past two posts on the project here and here before going forward. This talk continues that conversation.]

This year in the Scholars’ Lab I have been working with Eric on a machine learning project that studies speech in Virginia Woolf’s fiction. I have written elsewhere about the background for the project and initial thoughts towards its implications. For the purposes of this blog post, I will just present a single example to provide context. Consider the famous first line of Mrs. Dalloway:

Mrs Dalloway said, “I will buy the flowers myself.”

Nothing to remark on here, except for the fact that this is not how the sentence actually comes down to us. I have modified it from the original:

Mrs Dalloway said she would buy the flowers herself.

My project concerns moments like these, where Woolf implies the presence of speech without marking it as such with punctuation. I have been working with Eric to lift such moments to the surface using computational methods so that I can study them more closely.

I came to the project by first tagging such moments myself as I read through the text, but I quickly found myself approaching upwards of a hundred instances in a single novel – far too many for me to keep track of in any systematic way. What’s more, the practice made me aware of just how subjective my interpretation could be. Some moments, like this one, parse fairly well as speech. Others complicate distinctions between speech, narrative, and thought and are more difficult to identify. I became interested in the features of such moments. What is it about speech in a text that helps us to recognize it as such, if not for the quotation marks themselves? What could we learn about sound in a text from the ways in which it structures such sound moments?

These interests led me towards a particular kind of machine learning, supervised classification, as an alternate means of discovering similar moments. For those unfamiliar with the concept, an analogy might be helpful. As I am writing this post on a flight to HASTAC and just finished watching a romantic comedy, these are the tools that I will work with. Think about the genre of the romantic comedy. I only know what this genre is by virtue of having seen my fair share of them over the course of my life. Over time I picked up a sense of the features associated with these films: a serendipitous meeting leads to infatuation, things often seem resolved before they really are, and the films often focus on romantic entanglements more than any other details. You might have other features in mind, and not all romantic comedies will conform to this list. That’s fine: no one’s assumptions about genre hold all of the time. But we can reasonably say that, the more romantic comedies I watch, the better my sense of what a romantic comedy is. My chances of being able to watch a movie and successfully identify it as conforming to this genre will improve with further viewing. Over time, I might also be able to develop a sense of how little or how much a film departs from these conventions.

Supervised classification works on a similar principle. By using the proper tools, we can feed a computer program examples of something in order to have it later identify similar objects. For this project, this process means training the computer to recognize and read for speech by giving it examples to work from. By providing examples of speech occurring within quotation marks, we can teach the program when quotation marks are likely to occur. By giving it examples of what I am calling ‘implied speech,’ it can learn how to identify those as well.
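To make the principle concrete, here is a minimal sketch of supervised classification using only the Python standard library. It is not the project's code (which relies on NLTK): the tiny naive Bayes classifier, the `train` and `classify` helper names, and the hand-made training examples are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count how often each feature appears under each label.

    examples: a list of (set_of_features, label) pairs.
    """
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        label_counts[label] += 1
        for feature in features:
            feature_counts[label][feature] += 1
            vocabulary.add(feature)
    return label_counts, feature_counts, vocabulary

def classify(model, features):
    """Return the label with the highest smoothed log-probability."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)
        # Add-one smoothing so unseen feature/label pairs never score zero
        denominator = sum(feature_counts[label].values()) + len(vocabulary)
        for feature in features:
            if feature in vocabulary:
                score += math.log((feature_counts[label][feature] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical, hand-labelled examples: each is a set of words plus a label
training = [
    ({"said", "i"}, "speech"),
    ({"exclaimed", "you"}, "speech"),
    ({"whispered", "i"}, "speech"),
    ({"street", "morning", "the"}, "narration"),
    ({"clock", "struck", "the"}, "narration"),
    ({"flowers", "window", "the"}, "narration"),
]
model = train(training)
print(classify(model, {"said", "i"}))
print(classify(model, {"the", "clock"}))
```

The more labelled examples the program sees, the better its counts reflect the category, which is the computational version of watching more romantic comedies.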

For this machine learning project, I analyzed Woolf texts downloaded from Project Gutenberg. Eric and I put together scripts in Python 3 that used a package known as the Natural Language Toolkit for classifying. All of this work can be found at the project’s GitHub repository.

The project is still ongoing, and we are still working out some difficulties in our Python scripts. But I find the complications of the process to be compelling in their own right. For one, when working in this way we have to tell the computer what features we want it to pay attention to: a computer does not intuitively know how to make sense of the examples that we want to train it on. In the example of romantic comedies, I might say something along the lines of “while watching these films, watch out for the scenes and dialogue that use the word ‘love.’” We break down the larger genre into concrete features that can be pulled out so that the program knows what to watch out for.

To return to Woolf, punctuation marks are an obvious feature of interest: the author suggests that we have shifted into the realm of speech by inserting these grammatical markings. Find a quotation mark, and you are likely to be looking at speech. But I am interested in just those moments where we lose those marks, so it helps to develop a sense of how they might work. We can then begin to extrapolate those same features to places where the punctuation marks might be missing. We have developed two models for understanding speech in this way: an external and an internal model. To illustrate, I have taken a single sentence and bolded what the model takes to be meaningful features according to each model. Each represents a different way of thinking about how we recognize something as speech.

External Model for Speech:

“I love walking in London,” said Mrs. Dalloway.  “Really it’s better than walking in the country.”

The external model was our initial attempt to model speech. In it, we take an interest in the narrative context around quotation marks. In any text, we can say that there exists a certain range of keywords that signal a shift into speech: said, recalled, exclaimed, shouted, whispered, etc. Words like these help the narrative attribute speech to a character and are good indicators that speech is taking place. Given a list of words like this, we could reasonably build a sense of the locations around which speech is likely to be happening. So when training the program on this model, we had the classifier first identify locations of quotation marks. Around each quotation mark, the program took note of the diction and parts of speech that occurred within a given distance from the marking. We build up a sense of the context around speech.
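The external model's feature gathering can be sketched roughly as follows. This is an illustration rather than the project's script: the function names, the regular-expression tokenizer, and the window size of three tokens are assumptions, and the sketch records only diction, leaving out the part-of-speech tags that the actual scripts also note.

```python
import re

# Curly and straight double quotation marks
QUOTE_MARKS = {"\u201c", "\u201d", '"'}

def tokenize(text):
    """Split text into words (keeping internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:[\u2019']\w+)*|[^\w\s]", text)

def external_contexts(text, window=3):
    """For each quotation mark, collect the lowercased words within `window` tokens of it."""
    tokens = tokenize(text)
    contexts = []
    for i, token in enumerate(tokens):
        if token in QUOTE_MARKS:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            # Keep words only; drop neighbouring punctuation tokens
            words = [t.lower() for t in before + after if t[0].isalnum()]
            contexts.append(words)
    return contexts

sentence = "\u201cI love walking in London,\u201d said Mrs. Dalloway."
for context in external_contexts(sentence):
    print(context)
```

Run on the example sentence, the context recorded around the closing quotation mark includes the attributing verb "said", which is exactly the kind of signal this model is meant to pick up.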

Internal Model for Speech:

“I love walking in London,” said Mrs. Dalloway. “Really it’s better than walking in the country.”

The second model we have been working with works in an inverse direction: instead of taking an interest in the surrounding context of speech, an internal model assumes that there are meaningful characteristics within the quotation itself. In this example, we might notice that the shift to the first-person ‘I’ is a notable feature in a text that is otherwise largely written in the third person. This word suggests a shift in register. Each time this model encounters a quotation mark it continues until it finds a second quotation mark. The model then records the diction and parts of speech inside the pair of markings.
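A rough sketch of the internal model's first step, pulling out what sits between a pair of quotation marks and noting features of it, might look like this. The function names, the regular expressions, and the short list of first-person words are assumptions for illustration, and part-of-speech tagging is again omitted.

```python
import re

# Hypothetical feature list: words that suggest a shift into the first person
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our"}

def quoted_spans(text):
    """Return the text found between each opening and closing curly quotation mark."""
    return re.findall(r"\u201c([^\u201c\u201d]*)\u201d", text)

def internal_features(span):
    """Record the diction inside a quotation and whether it uses the first person."""
    words = re.findall(r"\w+(?:[\u2019']\w+)*", span.lower())
    return {
        "words": words,
        "has_first_person": any(word in FIRST_PERSON for word in words),
    }

text = (
    "\u201cI love walking in London,\u201d said Mrs. Dalloway. "
    "\u201cReally it\u2019s better than walking in the country.\u201d"
)
for span in quoted_spans(text):
    print(internal_features(span))
```

Note that this approach silently skips any opening quotation mark that never finds its partner, so it depends on the marks arriving in pairs.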

Each model suggests a distinct but related understanding for how sound works in the text. When I set out on this project, I had aimed to use the scripts to give me quantifiable evidence for moments of implied speech in Woolf’s work. The final step in this process, after all, is to actually use these models to identify speech: looking at texts they haven’t seen before, the scripts insert a caret marker every time they believe that a quotation mark should occur. But it quickly became apparent that the construction of the algorithms to describe such moments would be at least as interesting as any results that the project could produce. In the course of constructing them, I have had to think about the relationships among sound, text, and narrative in new ways.

The algorithms are each interpretative in the sense that they reflect my own assumptions about my object of study. The models also reflect assumptions about the process of reading, how it takes place, and about how a reader converts graphic markers into representations of sound. In this sense, the process of preparing for and executing text analysis reflects a certain phenomenology of reading as much as it does a methodology of digital study. The scripting itself is an object of inquiry in its own right and reflects my own interpretation of what speech can be. These assumptions are worked and reworked as I craft algorithms and Python scripts, all of which are as shot through with humanistic inquiry and interpretive assumptions as any close readings.

For me, such revelations are the real reasons for pursuing digital study: attempting to describe complex humanities concepts computationally helps me to rethink basic assumptions about them that I had taken for granted. In the end, the pursuit of an algorithm to describe textual speech is nothing more or less than the pursuit of deeper and enriched theories of text and speech themselves.

Postscript

I managed to take note of the questions I got when I presented this work at HASTAC, so what follows are paraphrases of my memory of them as well as some brief remarks that roughly reflect what I said in the moment. There may have been one other that I cannot quite recall, but alas such is the fallibility of the human condition.

Q: You distinguish between speech and implied speech, but do you account at all for the other types of speech in Woolf’s novels? What about speech that is remembered speech that happened in earlier timelines not reflected in the present tense of the narrative’s events?

A: I definitely encountered this during my first pass at tagging speech and implied speech in the text by hand. Instead of binaries like quoted speech/implied speech, I found myself wanting to mark for a range of speech types: present, actual; remembered, might not have happened; remembered incorrectly; remembered, implied; etc. I decided that a binary was more feasible for the machine learning problems that I was interested in, but the whole process just reinforced how subjective any reading process is: another reader might mark things differently. If these processes shape the construction of the theories that inform the project, then they necessarily also affect the algorithms themselves as well as the results they can produce. And it quickly becomes apparent that these decisions reflect a kind of phenomenology of reading as much as anything: they illustrate my understanding of how a complicated set of markers and linguistic phenomena contribute to our understanding that a passage is speech or not.

Q: Did you encounter any variations in the particular markings that Woolf was using to punctuate speech? Single quotes, etc., and how did you account for them?

A: Yes – the version of Orlando that I am working with used single quotes to notate speech. So I was forced to account for such edge cases. But the question points at two larger issues: one authorial and one bibliographical. As I worked on Woolf I was drawn to the idea of being able to run such a script against a wider corpus. Since the project seemed to be impinging on how we also understand psychologized speech, it would be fascinating to be able to search for implied speech in other authors. But, if you are familiar with, say, Joyce, you might remember that he hated quotation marks and used dashes to denote speech. The question is how far you can account for such edge cases; if you cannot, the study becomes only one of a single author’s idiosyncrasies (which still has value). But from there the question spirals outwards. At least one of my models (the internal one) relies on quotation marks themselves as boundary markers. The model assumes that quotation marks will come in pairs, and this is not always the case. Sometimes authors, intentionally or accidentally, omit a closing quotation mark. I had to massage the data in at least half a dozen places where there was no quotation mark in the text and where its lack was causing my program to fail entirely. As textual criticism has taught us, punctuation marks are the single most likely things to be modified over time during the process of textual transmission by scribes, typesetters, editors, and authors. So in that sense, I am not doing a study of Woolf’s punctuation so much as a study of Woolf’s punctuation in these particular versions of the texts. One can imagine an exhaustive study that works on all versions of all Woolf’s texts as a study that might approach some semblance of a correct and thorough reading. For this project, however, I elected to take the lesser of two evils that would still allow me to work through the material. I worked with the texts that I had.
I take all of this as proof that you have to know your corpus and your own shortcomings in order to responsibly work on the materials – such knowledge helps you to validate your responses, question your results, and reframe your approaches.
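One way to get to know a corpus before running a pair-based model over it is to look for paragraphs whose quotation marks do not balance. The following is a minimal sketch under stated assumptions: the function name is hypothetical, paragraphs are assumed to be separated by blank lines, and the default marks are curly double quotes (for a text like the Orlando edition mentioned above, different marks would need to be passed in, and single quotes would also collide with apostrophes).

```python
def unbalanced_paragraphs(text, open_mark="\u201c", close_mark="\u201d"):
    """Return (paragraph_number, opens, closes) for paragraphs with mismatched marks."""
    problems = []
    for number, paragraph in enumerate(text.split("\n\n"), start=1):
        opens = paragraph.count(open_mark)
        closes = paragraph.count(close_mark)
        if opens != closes:
            problems.append((number, opens, closes))
    return problems

sample = (
    "\u201cHello,\u201d she said.\n\n"
    "\u201cI never finished this sentence.\n\n"
    "No speech here."
)
for number, opens, closes in unbalanced_paragraphs(sample):
    print(f"Paragraph {number}: {opens} opening, {closes} closing")
```

A flagged paragraph is a prompt for a human look rather than proof of an error: a speech that runs across several paragraphs conventionally leaves the closing mark off every paragraph but the last.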

Q: You talked a lot about text approaching sound, but what about the other way around – how do things like implied speech get reflected in audiobooks, for example? Is there anything in recordings of Woolf that imply a kind of punctuation that you can hear?

A: I wrote about this extensively in my dissertation, but for here I will just say that I think the textual phenomenon the questioner is referencing occurs on a continuum. Some graphic markings, like pictures, shapes, punctuation marks, do not clearly translate to sound. And the reverse is true: the sounded quality of a recording can only ever be remediated by a print text. There are no perfect analogues between different media forms. Audiobook performers might attempt to convey things like punctuation or implied speech (in the audiobook of *Ulysses*, for example, Jim Norton throws his voice and lowers his volume to suggest free indirect discourse). In the end, I think such moments are playing with an idea of what my dissertation calls audiotextuality, the idea that all texts and recordings of texts, to varying degrees, contain both sound and print elements. The two spheres may work in harmony or against each other as a kind of productive friction. The idea is a slippery one, but I think it speaks to moments like the implied punctuation mark that come through in a particularly powerful audiobook recording.

Categories
Conference DH Event off campus

Apps, Maps, & Models: A New View

[Crossposted on my personal blog.]

Last Monday several of us here at WLUDH traveled down to Duke University for their symposium on Apps, Maps & Models: Digital Pedagogy in Art History, Archaeology & Visual Studies. I found the trip to be enlightening and invigorating. If you are interested in the event, you can find videos of the talks here and here as well as a storify of the Twitter action here. That the event was so well documented is a testimony to how well organized it was by the Wired! Lab.

Many speakers at the event considered how the tools they were using might relate to more “traditional” modes for carrying out their research. They considered and responded to tough questions with and about their work. Are digital methods for tracing the topography of a surface, for example, fundamentally different in kind from analog means of doing so? If so, are they meant to displace those old tools? Why should we spend the time to learn the new technologies? A related question that comes up at almost every digital humanities presentation (though not at any of these): can digital humanities methods show us anything that we do not already know?

Such questions can be particularly troubling when we are investing such time and energy on the work they directly critique, but we nonetheless need to have answers for them that demonstrate the value of digital humanities work, in and out of the classroom. Numerous well-known scholars have offered justifications of digital work in a variety of venues, and, to my mind, the symposium offered many answers of its own, in part by showcasing amazing work that spanned a variety of fields related to preservation, public humanities, and academic scholarship. Presenters were using digital technology to rebuild the past, using digital modeling to piece together the fragments of a ruined church that have since been incorporated into other structures. They were using these tools to engage the present, to draw the attention of museum patrons to overlooked artifacts. The work on display at the symposium struck me, at its core, as engaging with questions and values that cut across disciplines, digital or otherwise.

Most compelling to me, the symposium drew attention to how the tools we use to examine the objects of our study change our relationship to them. The presenters acknowledged that such an idea does hold dangers – after all, we want museum-goers to consider the objects in a collection, not just spend time perusing an iPad application meant to enrich them. But just as new tools offer new complications, changes in medium also offer changes in perspective. As was illustrated repeatedly at the symposium, drone photography, for all its deeply problematic political and personal valences, can offer you a new way of seeing the world, a new way of looking that is more comprehensive than the one we see from the ground. Even as we hold new methodologies and tools up to critique we can still consider how they might cause us to consider an object, a project, or a classroom differently.

Seeing from a different angle allows us to ask new questions and re-evaluate old ones, an idea that speaks directly to my experience at the symposium. I work at the intersections of digital humanities, literary studies, and sound studies. So my participation in the symposium was as something of an outsider, someone ready to learn about an adjacent and overlapping field but, ultimately, not a home discipline. Thinking through my work from an outsider perspective made me want to ask many questions of my own work. The presenters here were deeply engaged in preserving and increasing access to the cultural record. How might I do the same through text analysis or through my work with audio artifacts? What questions and goals are common to all academic disciplines? How might I more thoroughly engage students in public humanities work?

Obviously, the event left me with more questions than answers, but I think that is ultimately the sign of a successful symposium. I would encourage you to check out the videos of the conference, as this short note is necessarily reductive of such a productive event. The talks will offer you new thoughts on old questions and new ways of thinking about digital scholarship no matter your discipline.

 

 

Categories
DH Project Update Tools

Embedding COinS Metadata on a Page Using the Zotero API

[Cross-posted on my personal blog]

This year I am working with Mackenzie, Steve McCormick, and his students on the Huon d’Auvergne project, a digital edition of a Franco-Italian romance epic. Last term we finished TEI-encoding of two of the manuscripts and put them online, and there is still much left to do. Putting the digital editions of each manuscript online is a valuable scholarly endeavor in its own right, but we’ve also been spending a lot of time considering other ways in which we can enrich this scholarly production using the digital environment.

All of which brings me to the bibliography for our site. At first, our bibliography page was just a transcription of a text file that Steve would send along with regular updates. This collection of materials is great to have in its own right, but a better solution would be to leverage the many digital humanities approaches to citation management to produce something a bit more dynamic.

Steve already had everything in Zotero, so my first step was to integrate the site’s bibliography with the Zotero collection that Steve was using to populate the list. I found a Python 2 library called zot_bib_web that could do all this quite nicely with a bit of modification. Now, by running the script from my computer, the site’s bibliography will automatically pull in an updated Zotero collection for the project. Not only is it now easier to update our site (no more copying and pasting from a Word document), but now others can contribute new resources to the same bibliography on Zotero by requesting to join the group and uploading citations. The project’s bibliography can continue to grow beyond us, and we will capture these additions as well.

Mackenzie suggested that we take things a bit further by including COinS metadata in the bibliography so that someone coming to our bibliography could export our information into the citation manager of their choosing. Zotero’s API can also do this, and I used a piece of the pyzotero Python library to do so. The first step was to add this piece to the zot_bib_web code:

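from pyzotero import zotero

# library_id, library_type, api_key, and collection_id come from your zot_bib_web settings
zot = zotero.Zotero(library_id, library_type, api_key)
# Ask the Zotero API for the collection's items as COinS spans
coins = zot.collection_items(collection_id, content='coins')
coin_strings = [str(coin) for coin in coins]
# Attach each COinS span to the HTML for the bibliography
for coin in coin_strings:
    fullhtml += coin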

Now, before the program outputs html for the bibliography, it goes out to the Zotero API and gets COinS metadata for all the citations, converts them into a format that will work for the embedding, and then attaches each returned span to the HTML for the bibliography.
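For readers who have not seen COinS before, each returned span is an empty HTML element with the class Z3988 whose title attribute carries the citation as URL-encoded key/value pairs. The sketch below builds one by hand with the standard library so the shape is visible. It is not what pyzotero or the Zotero API does internally: the `coins_span` helper and the example title, author, and date are hypothetical placeholders, and only a few book fields are included.

```python
from html import escape
from urllib.parse import urlencode

def coins_span(title, author, year):
    """Build a COinS span for a book from a few basic fields."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),
        ("rft.btitle", title),
        ("rft.au", author),
        ("rft.date", year),
    ]
    # URL-encode the pairs, then escape the ampersands for use inside an HTML attribute
    title_attribute = escape(urlencode(fields))
    return '<span class="Z3988" title="{}"></span>'.format(title_attribute)

# Placeholder values, not an entry from the project's bibliography
print(coins_span("Example Title", "Example Author", "2016"))
```

Because the span has no visible content, appending it to the bibliography HTML changes nothing on the page itself; only tools that look for the Z3988 class will notice it.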

Now that I had the data that I needed, I wanted to make it work a bit more cleanly in our workflow. Initially, the program returned each bibliographic entry in its own page and meant for the whole bibliography to also be a stand-alone page on the website. I got rid of all that and, instead, wanted to embed them within the website as I already had it. I have the Python program exporting the bibliography and COinS data into a small HTML file that I then attach to a <div id="includedContent"> inserted in the bibliography page. I use some jQuery to do so:

<script type="text/javascript">
$(function(){
    $("#includedContent").load("/zotero-bib.html");
});
</script>

Instead of distributing content across several different pages, I mark a placeholder area on the main site where all the bibliographic data and metadata will be dumped. All of the relevant data gets saved in a file ‘zotero-bib.html’ that gets automatically included inside the shell of the bibliography.html page. From there, I just modified the style so that it would fit into the aesthetic of the site.
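The export step on the Python side can be as small as writing one HTML fragment to disk. This is a sketch rather than the modified zot_bib_web code: the `write_fragment` helper and its arguments are hypothetical, and the output path simply needs to match whatever path the jQuery `load` call requests.

```python
from pathlib import Path

def write_fragment(bibliography_html, coins_spans, out_path):
    """Write the bibliography markup and its COinS spans to a single HTML fragment."""
    fragment = bibliography_html + "\n" + "\n".join(coins_spans) + "\n"
    Path(out_path).write_text(fragment, encoding="utf-8")
    return fragment

# Placeholder markup standing in for the generated bibliography and spans
write_fragment(
    "<ul><li>Example entry</li></ul>",
    ['<span class="Z3988" title="ctx_ver=Z39.88-2004"></span>'],
    "zotero-bib.html",
)
```

Keeping the fragment free of its own html and body tags is what lets it drop cleanly into the placeholder div.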

Now anyone going to our bibliography page with a Zotero extension will see this on the right of the address bar:

[Screenshot: the Zotero folder icon in the browser’s address bar]

Clicking on the folder icon will bring up the Zotero interface for downloading any of the items in our collection.

[Screenshot: the Zotero interface for downloading items from the collection]

And to update this information we only need to run a single Python script from the terminal to re-generate everything.

The code is not live on the Huon site just yet, but you can download and manipulate these pieces from an example file I uploaded to the Huon GitHub repository. You’ll probably want to start by installing zot_bib_web first to familiarize yourself with the configuration, and you’ll have a few settings to update before it will work for you: the library id, library type, api key, and collection ID will all need to be updated for your particular case, and the jQuery excerpt above will need to point to wherever you output the bibliography file.

These steps have strengthened the way in which we handle bibliographic metadata so that it can be more useful for everyone, and we were really only able to do it because of the many great open source libraries that allow others to build on them. It’s a great thing – not having to reinvent the wheel.

Categories
DH Pedagogy Research Projects

Reflections on a Year of DH Mentoring

[Cross-posted on the Scholars’ Lab blog]

This year I am working with Eric Rochester in the Scholars’ Lab on a fellowship project that has me learning natural language processing (NLP), the application of computational methods to human languages. We’re adapting these techniques to study quotation marks in the novels of Virginia Woolf (read more about the project here). We actually started several months before this academic year began, and, as we close out another semester, I have been spending time thinking about just what has made it such an effective learning experience for me. I already had a technical background from my time in the Scholars’ Lab at the beginning of the process, but I had no experience with Python or NLP. Now I feel more comfortable with the former than with any other programming language and familiar enough with the latter to experiment with it in my own work.

The general mode of proceeding has been this: depending on schedules and deadlines, we met once or twice every two weeks. Between our meetings I would work as far and as much as I could, and the sessions would offer a space for Eric and me to talk about what I had done. The following are a handful of things we have done that, I think, have helped to create such an effective environment for learning new technical skills. Though they are particular to this study, I think they can be usefully extrapolated to apply to many other project-based courses of study in digital humanities. They are primarily written from the perspective of a student but with an eye to how and why the methods Eric used proved so effective for me.

Let the Wheel Be Reinvented Before Sharing Shortcuts

I came to Eric with a very small program adapted from Matt Jockers’s book on Text Analysis with R for Students of Literature that did little beyond count quotation marks and give some basic statistics. I was learning as I built the thing, so I was unaware that I was reinventing the wheel in many cases, rebuilding many protocols for dealing with commonly recognized problems that come from working with natural language. After I had worked on my program and my approach to a degree of satisfaction, Eric pulled back the curtain to reveal that a commonly used Python module, the Natural Language ToolKit (NLTK), could address many of my issues and more. NLTK came as something of a revelation, and working inductively in this way gave me a great sense of the underlying problems the tools could address. By inventing my own way to read in a text, clean it to make its text uniformly readable by the computer, and break the whole piece into a series of words that could be analyzed, I understood the magic behind a couple lines of NLTK code that could do all that for me. The experience also helped me to recognize ways in which we would have to adapt NLTK for our own purposes as I worked through the book.

Have a Plan, but Be Flexible

After discussing NLTK and how it offered an easier way of doing the things that I wanted, Eric had me systematically work through the NLTK book for a few months. Our meetings took on the character of an independent study: the book set the syllabus, and I went through the first seven chapters at my own pace. Working from a book gave our meetings structure, but we were careful not to hew too closely to the material. Not all chapters were relevant to the project, and we cut sections of the book accordingly. We shaped the course of study to the intellectual questions rather than the other way around.

Move from Theory to Practice / Textbook to Project

As I worked through the book, I was able to recognize certain sections that felt most relevant to the Woolf work. Once I felt as though I had reached a critical mass, we switched from the book to the project itself and started working. I tend to learn from doing best, so the shift from theory to execution was a natural one. The quick and satisfying transition helped the work to feel productive right away: I was applying my new skills as I was still learning to feel comfortable with them. Where the initial months had more the feel of a traditional student-teacher interaction, the project-based approach we took up at this point felt more like a real and true collaboration. Eric and I would develop to-do items together, we would work alongside each other, and we would talk over the project together.

Document Everything

Between our meetings I would work as far and as much as I could, carefully noting places at which I encountered problems. In some cases, these were conceptual problems that needed clarifying, and these larger questions frequently found their way into separate notes. But my questions were frequently about what a particular line of code, a particular command or function, might be doing. In that case, I made comments directly in the code describing my confusion. I quickly found that these notes were as much for me as for Eric–I needed to get back in the frame of mind that led to the confusion in the first place, and copious notes helped remind me what the problem was. These notes offered a point of departure for our meetings: we always had a place to start, and we did so based on the work that I had done.

Communicate in as Many Ways as Possible

We met in person as much as possible, but we also used a variety of other platforms to keep things moving. Eric and I had all of our code on GitHub so that we could share everything that we had each been working on and discuss things from a distance if necessary. Email, obviously, can do a lot, but I found the chat capabilities of the Scholars’ Lab’s IRC channel to be far better for this sort of work. If I hit a particular snag that would only require a couple minutes for Eric to answer, we could quickly work things out through a web chat. With Skype and Google Hangouts we could even look at the code on the other person’s computer from hundreds of miles away. All of these things meant that we could keep working around whatever life events happened to call us away.

Recognize Spinning Wheels

These multiple avenues of communication are especially important when teaching technical skills. Not all questions or problems are the same: students can work through some on their own, but others can take them days to troubleshoot. Some amount of frustration is a necessary part of learning, and I do think it’s necessary that students learn to confront technical problems on their own. But not all frustration is pedagogically productive. There comes a point when you have tried a dozen potential solutions and you feel as though you have hit a wall. An extra set of eyes can (and should) help. Eric and I talked constantly about how to recognize when it was time for me to ask for help, and low-impact channels of communication like IRC could allow him to give me quick fixes to what, to me at least, seemed like impossible problems. Software development is a collaborative process, and asking for help is an important skill for humanists to develop.

In-person Meetings Can Take Many Forms

When we met, Eric and I did a lot of different things. First, we would talk through my questions from the previous week. If I felt a particular section of code was clunky or poorly done, he would talk and walk me through rewriting the same piece in a more elegant form. We would often pair program, where Eric would write code while I watched, carefully stopping him each time I had a question about something he was doing. And we often took time to reflect on where the collaboration was going – what my end goal was as well as what my tasks before the next meeting would be. Any project has many pieces that could be dealt with at any time, and Eric was careful to give me solo tasks that he felt I could handle on my own, reserving more difficult tasks for times in which we would be able to work together. All of this is to say that any single hour we spent together was very different from the last. We constantly reinvented what the meetings looked like, which kept them fresh and pedagogically effective.

This is my best attempt to recreate my experience of working in such a close mentoring relationship with Eric. Obviously, the collaboration relies on an extremely low student-to-teacher ratio: I can imagine this same approach working very well for a handful of students, but this work required a lot of individual attention that would be hard to sustain for larger classes. One idea for scaling the process up might be to divide a course into groups, begin training one, and then have students later in the process begin to mentor those who are just beginning. Doing so would preserve what I see as the main advantage of this approach: it helps to collapse the hierarchy between student and teacher and engage both in a common project. Learning takes place, but it does so in the context of common effort. I’d have to think more about how this mentorship model could be adapted to fit different scenarios. The work with Eric is ongoing, but it’s already been one of the most valuable learning experiences I have had.