JSTOR Data for Research (DfR) is designed for text mining the publications in the JSTOR archive. The site is significantly more difficult to use than the standard JSTOR interface but is suitable for various levels of undergraduate assignments. DfR is free and provides data for the entire JSTOR archive and not just the publications to which the library subscribes.
DfR is an excellent tool for building data sets to identify previously unknown patterns and relationships among articles. DfR offers two levels of data:
- Web interface with charts, key terms, top words, N-grams of each article
- Downloads of citations, metadata, word frequencies, key terms, and N-grams for up to 1,000 documents.
Web interface of DfR
DfR opens with a typical search bar at the top and displays the results of a default search. Just below the search bar is a dropdown menu for limiting a search to title, author, abstract, caption, key terms, or references. On the left side is a menu for narrowing search results. Refining the search results is very important when submitting a request to download data, since you will want to narrow your results to no more than 1,000 documents. The search box also supports Boolean operators and wildcard characters, as well as excluding terms in certain fields. Details of the search query syntax can be found under the About menu of the DfR site. On the right side of the screen, just below the Next page button, are two icons for switching between the Charts View and the Search Results View.
Do a search and select the Charts View. JSTOR provides two charts depicting the pattern of publications by year and by subject group. Both charts allow for zooming, which is particularly useful for the Year of Publication chart. To zoom: position your mouse inside the chart, press the left mouse button, and draw a rectangle over the area you want to zoom. To revert from the zoomed view, simply double-click anywhere inside the chart. You can narrow the search results by clicking through any of the options in the left menu. For instance, in this search for “Jorge Luis Borges” we can limit to the article type of book reviews by clicking on Article Type and then Book review. This narrows the search results from 7,325 items to 1,673. Remember that JSTOR does not archive recent years, which is plainly visible in the dropoff in the year of publication chart. At the bottom of the Charts View screen is an option to Download data for year chart. This link downloads the data (year and article count) in CSV format, which can be pulled into Excel and other tools. To switch to the actual search results, click on the “Search Results View” icon in the upper right.
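The year chart download can also be explored outside of Excel. The following is a minimal Python sketch, assuming the file holds one year and one article count per row; the header names and the numbers in SAMPLE are invented for illustration and are not taken from an actual DfR download.

```python
import csv
import io

# Stand-in for the downloaded file. The header names and counts are
# invented for illustration; check them against your own download.
SAMPLE = """year,count
1961,4
1962,9
1963,7
"""

def load_year_counts(handle):
    """Read (year, count) rows into a dict, skipping header and blank rows."""
    counts = {}
    for row in csv.reader(handle):
        if len(row) < 2 or not row[0].strip().isdigit():
            continue  # not a data row (header, blank line, etc.)
        counts[int(row[0])] = int(row[1])
    return counts

year_counts = load_year_counts(io.StringIO(SAMPLE))
peak_year = max(year_counts, key=year_counts.get)  # year with the most articles
total = sum(year_counts.values())                  # articles across all years
print(peak_year, total)
```

To run this against a real download, replace io.StringIO(SAMPLE) with a file opened via open(path, newline="", encoding="utf-8").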
Error messages when searching
Occasionally when using JSTOR DfR, especially when narrowing a search, you will get a cryptic series of error messages. DfR is beta software. When you get those messages, wait a few seconds and reload the screen. In most cases, the error will go away.
Understanding Search Results
Results default to sorting by relevance, but can be sorted by date or CiteRank. The citation title (displayed in red) is a hyperlink to the full text of the article. Since W&L does not subscribe to the full JSTOR archive, some titles may not display the full text. In those cases a button connecting to the library’s linking system provides an option for checking for the full text in another resource or obtaining the article via interlibrary loan. Book reviews are often listed as [untitled], as in the second citation in the above example. The author name is a hyperlink that searches JSTOR for other articles by that author that meet the same search criteria. For example, clicking on James E. Irby in the first citation above will return search results for book reviews by him that mention “Jorge Luis Borges”. The hyperlinks for the subjects in the citation work the same way as the author hyperlink: the result is a search for the same criteria within that subject. Key terms are automatically generated by a word weighting algorithm and do not include keywords assigned by the author or publisher. The More Info link under each citation displays the top N-grams for each article. N-grams are sequences of adjacent words in a text, listed with how often each occurs. A bigram is two words that appear together, a trigram is three words, and so on.
Downloading data sets
When you have refined your search results to a satisfactory degree, you may want to download a data set. The downloaded data does NOT include the full text of the articles. To download: from the top menu, select Dataset Requests and Submit new request. Important: you can only download up to 1,000 items in a single request. If your search returns more than 1,000 items, the data set will be a random sample of the items in the search. If you need larger data sets, you can contact JSTOR and provide a project description and the number of items needed.
Before requesting a larger data set, work with a smaller set or a sample of the data first to verify that the data will meet your research needs. A simple way of reducing the data set to below the 1,000 item threshold is to limit by publication year. You can submit several requests for the same search, each covering a different time period. After downloading several data sets in this manner, you can combine the data into a single large data set.
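Combining several downloads for the same search can be sketched in Python as follows. This is an illustration under assumptions: the column names ("id", "title", "pubdate") and the rows are hypothetical, so check the header of your own CSV files and adjust the key column. Rows are de-duplicated on that key in case two requests overlap.

```python
import csv
import io

# Two stand-in downloads for the same search, split by publication year.
# Column names and rows are invented for illustration.
PART_1 = "id,title,pubdate\nA1,First review,1965\nA2,Second review,1969\n"
PART_2 = "id,title,pubdate\nA2,Second review,1969\nA3,Third review,1974\n"

def combine(handles, key="id"):
    """Merge rows from several CSV files, keeping the first row seen per key."""
    merged = {}
    for handle in handles:
        for row in csv.DictReader(handle):
            merged.setdefault(row[key], row)  # ignore later duplicates
    return list(merged.values())

rows = combine([io.StringIO(PART_1), io.StringIO(PART_2)])
print(len(rows), [row["id"] for row in rows])
```

With real files, pass handles opened via open(path, newline="", encoding="utf-8") in place of the io.StringIO objects.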
What’s in a data set?
All data sets come with citations by default. You can also request additional data such as word counts, bigrams, trigrams, quadgrams, and key terms. Downloads of data sets are not instantaneous. The request is submitted and you will be notified by email when the data set is ready. It may take hours or days for the data set to be ready. However, most data sets are ready in less than an hour.
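To make the terms bigram, trigram, and quadgram concrete, here is a minimal Python sketch that counts N-grams in a short phrase. It assumes simple lowercasing and whitespace splitting; how DfR itself tokenizes and counts is not described here, so its numbers may differ.

```python
from collections import Counter

def ngrams(text, n):
    """Count every run of n adjacent words in the text."""
    words = text.lower().split()  # naive tokenization, an assumption
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

text = "the garden of forking paths and the garden of sand"
bigrams = ngrams(text, 2)    # two-word sequences
trigrams = ngrams(text, 3)   # three-word sequences
quadgrams = ngrams(text, 4)  # four-word sequences

print(bigrams.most_common(2))
print(trigrams.most_common(1))
```

In this sample, "the garden" and "garden of" each occur twice as bigrams, and "the garden of" occurs twice as a trigram.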
What format is the data?
Data is available in either XML or CSV. The choice depends upon your needs, particularly the types of tools you are using and the type of content in the data. For most people, CSV works well since it can be easily imported into Excel and many other tools. However, if your content is foreign language material then you may want to choose XML. The diacritical marks in foreign language text will be garbled if you choose CSV, whereas XML retains the diacritics. You can convert XML to CSV by using OpenRefine. Be aware that Excel needs extra caution when importing text files that use diacritics.
Examples of data sets for research
JSTOR DfR data sets lend themselves to visualization with a variety of tools. The graphic above represents the results of a search for book reviews and the writers Jorge Luis Borges and Julio Cortázar. On the left are the reviews that mention only Borges, and on the right are reviews that mention only Cortázar. In the middle are reviews that mention both writers. The visualization was generated with the Palladio tool. One of the most common uses of JSTOR DfR data sets within the digital humanities is topic modeling, which will be the focus of a future tutorial.
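The three-way grouping behind the Borges and Cortázar graphic can be sketched with set operations in Python. This only illustrates the grouping logic, not how Palladio builds the visualization, and the document IDs are hypothetical placeholders for the identifiers in two downloaded data sets.

```python
# Hypothetical document IDs for reviews mentioning each writer.
borges = {"R1", "R2", "R3", "R4"}
cortazar = {"R3", "R4", "R5"}

only_borges = borges - cortazar    # reviews mentioning only Borges
only_cortazar = cortazar - borges  # reviews mentioning only Cortázar
both = borges & cortazar           # reviews mentioning both writers

print(sorted(only_borges), sorted(both), sorted(only_cortazar))
```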
For further information
If you are a W&L faculty member or student, contact the Digital Humanities Action Team for assistance in using JSTOR Data for Research by sending email to firstname.lastname@example.org.