Tuesday, June 23, 2009

DH09: Animating the Knowledge Radio; Text analysis using High Throughput Computing; Appropriate Use Case Modeling for Humanities Documents

This session is chaired by Chuck Bush. The first presentation is by Geoffrey Rockwell (Alberta) and Stefan Sinclair (McMaster). They referred back to 'Hacking as a Way of Knowing' (2009), presented earlier this year, which aimed to create a literary technology of a sort (see the NiCHE website). Questions posed: What opportunities are there for using HPC in the humanities? What are examples of good practice? What are the barriers to humanists using HPC facilities? What concrete steps can be taken for outreach to the DH community?
Recommendations: introductory documentation on HPC for humanists, training opportunities, fellowships, and so on; visualization is the key.

The Big See (website?) is a prototype in DH/HPC visualization: a scalable dataset, processing, and display of a weighted centroid in 3D over time. The Big See took a 2D text work and converted it into 3D art in real time. The final product was a visualization that could be understood in terms of how it was created; animation can stand in as a way of understanding what the final visualization represents. The Knowledge Radio goes beyond the Big See in the following way: instead of a static corpus, why not use a dynamic corpus, such as streaming or playable data? For example, a blogger could examine how a particular word evolves over time on the web (compare the Code_Swarm project). Voyeur Prototype Radios, three examples: the Voyeur website, where you load your own text into a text box; the LAVA website, which lines up the use of words in various documents so they can be compared; and the Balls Reader interface.
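The 'dynamic corpus' idea can be made a little more concrete with a small sketch: tracking how often a word appears in a stream of dated documents, bucketed by time. This is only an illustration of the concept, not the Voyeur or Knowledge Radio code; the stream of (month, text) pairs and the tracked word are hypothetical.

```python
from collections import Counter
import re

def word_frequency_over_time(stream, word):
    """Count occurrences of `word` per time bucket in a stream of
    (bucket, text) pairs, e.g. blog posts keyed by month.
    Purely illustrative; not the Voyeur/Knowledge Radio implementation."""
    counts = Counter()
    for bucket, text in stream:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts[bucket] += tokens.count(word.lower())
    return sorted(counts.items())

# Hypothetical example: how does the use of "nature" evolve month by month?
posts = [
    ("2009-04", "The nature of hacking as a way of knowing ..."),
    ("2009-05", "Visualization and nature, nature and time ..."),
    ("2009-06", "High performance computing for humanists ..."),
]
for month, n in word_frequency_over_time(posts, "nature"):
    print(month, n)
```

A streaming interface like the Knowledge Radio would presumably update these counts continuously as new documents arrive, rather than over a fixed list.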

Question from the audience: this is fun, but what interesting questions in the humanities can we answer with these tools?

The second presentation, 'Text analysis of large corpora using High Throughput Computing,' is by Mark Hedges, Tobias Blanke, Gerhard Brey, and Richard Palmer, all from King's College London. HiTHeR stands for High ThroughPut Computing in Humanities e-Research, funded by the JISC ENGAGE development programme (http://www.engage.ac.uk/). The aim is to create a campus grid for high throughput computing and connect it to the NGS (National Grid Service). An earlier project, the Nineteenth-Century Serials Edition (NCSE), http://www.ncse.ac.uk, is a digital edition of six newspapers and periodicals (3,526 issues, 98,000 pages, 427,000 articles). It was produced with Olive's Viewpoint software from microfilm and print works held by the British Library, and can be searched by names, places, institutions, subjects, events, genres, and document similarity. Document similarity has advantages: it is an unsupervised technology with well-established methods, and from a user perspective it is intuitive and supported by ideas from prototype theory. Two approaches were tested: TF-IDF and Latent Semantic Indexing (LSI). LSI builds relationships based on words co-occurring in multiple documents, exposing underlying relationships. One problem was OCR quality, which is a challenge for nineteenth-century newspapers digitized from microfilm; it was overcome by using tokens and their character n-grams as the basic textual units. Comparing all the articles (those longer than 300 characters) would require over 950,000 comparisons: we do not have the time! After three days, only 350 articles had been compared. HPC relies on specially designed systems, whereas HTC (High Throughput Computing) relies on loosely coupled computing in the form of grid computing environments. Tools used: the Condor toolkit from Scout, and other tools mentioned.
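As a rough illustration of the similarity approach described (TF-IDF over character n-grams, chosen to be robust against noisy OCR), here is a minimal sketch using scikit-learn. The article texts are invented, and this is not the HiTHeR pipeline, which distributed the work as many jobs on a Condor-based grid.

```python
# Minimal sketch of TF-IDF document similarity over character n-grams,
# assuming a small list of article texts; not the actual HiTHeR pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "The Corn Laws were repealed amid great controversy ...",
    "Repeal of the Corn Laws stirred controversy in Parliament ...",
    "A report on the Great Exhibition at the Crystal Palace ...",
]

# Character n-grams within word boundaries tolerate OCR errors better than
# whole-word tokens: a few bad characters only corrupt a handful of n-grams
# rather than the entire token.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
tfidf = vectorizer.fit_transform(articles)

# Pairwise cosine similarity; comparing all pairs is O(n^2), which is why
# the full NCSE corpus needed high throughput computing.
sim = cosine_similarity(tfidf)
print(sim.round(2))
```

The quadratic blow-up in pairwise comparisons is exactly the part that was farmed out to the campus grid, with each job handling a slice of the comparison matrix.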

Evaluation of the HiTHeR project. Problems: there is no TREC (Text REtrieval Conference) style evaluation for this material; we can identify false positives, but we will almost never know about false negatives, so we will never know what we didn't find; and what is 'similarity'? Similarity depends on the user's perspective.

The third presentation is 'Appropriate Use Case Modeling for Humanities Documents', by Aja Teehan and John Keating of An Foras Feasa (website), the Institute for Research in Irish Historical and Cultural Traditions at the National University of Ireland, Maynooth. One example document: a 400-metre scroll about a metre wide. They used Activity Theory (AT), which considers human activities as complex, socially situated phenomena; these activities result in tools. AT is a useful way to model the activities of the digital humanities. HCI has 'ignored the study of artifacts, insisting on the mental representations as the proper focus of study; AT is seen as a way of addressing this deficit.' An example would be data encoding, which involves modeling raw data in order to manipulate and query it. 'How do we know which information we will be able to mine for?' XML:TEI is a single-use-case text encoding method. We are interested in producing Human Usable Texts, which require a different kind of encoding from XML:TEI, since our activities are different. Document encoding as a community activity should be encouraged. Aja Teehan illustrated a process of document encoding: physical --> logical --> interaction. Encoding occurs at each level, and the choice of tool changes depending on the human activity. Software engineers use Use Cases to model activity. 'Primary Use Cases are those that are aligned to the original reason for encoding the data.' For example, medical records should support querying for medical data as the primary concern, and prosopography searches as a secondary concern. Another example: 'The Chymistry of Sir Isaac Newton' contains Newton's notebooks, and a primary Use Case would be a query such as 'how many experiments did Newton conduct using copper?' Conclusions: we adopted a whole-system approach using AT, the Conversational Framework, GOMS (Goals, Operators, Methods, and Selection), and SFL to examine the activities of the communities, and we developed text encoding systems that support human usable texts. The identification, selection, and realization of appropriate Use Cases is necessary for document encoding and the production of human usable texts. Demo: the Alcalá account book project.
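The copper query gives a concrete sense of what a primary Use Case demands of the encoding: the markup must make 'experiment' and 'material' explicit if the question is to be answerable at all. The sketch below uses a hypothetical, much-simplified XML structure, not the actual encoding of the Chymistry of Sir Isaac Newton project; it only illustrates how the encoding choice determines which queries the text supports.

```python
# Hypothetical markup and query illustrating a primary Use Case;
# not the real Chymistry of Sir Isaac Newton encoding.
import xml.etree.ElementTree as ET

notebook_xml = """
<notebook>
  <experiment n="1">
    <material>copper</material>
    <material>vitriol</material>
  </experiment>
  <experiment n="2">
    <material>antimony</material>
  </experiment>
  <experiment n="3">
    <material>copper</material>
  </experiment>
</notebook>
"""

root = ET.fromstring(notebook_xml)

# Primary Use Case: "How many experiments did Newton conduct using copper?"
copper_experiments = [
    exp for exp in root.findall("experiment")
    if any(m.text.strip().lower() == "copper" for m in exp.findall("material"))
]
print(len(copper_experiments))  # -> 2
```

Had the notebooks been encoded only for page layout (the physical level), the same question would require free-text searching and guesswork rather than a direct count.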
