Thursday, June 25, 2009

DH09: Funding the Digital Humanities

'Funding the Digital Humanities', a roundtable discussion. Upcoming conferences: DH2010 in London (www.cch.kcl.ac.uk/dh2010) and DH2011 at Stanford, CA.

Office of Digital Humanities at NEH by Bret. Start-Up Grants program: seed money for innovation. Example: the Indiana Philosophy Ontology Project, which has published many journal articles in science and humanities journals. Get a small start-up grant; be successful, and larger grants may come your way. Also, methodological training: if your institution is good at something, give seminars and train other people; we will fund you to do so. 'Digging into Data Challenge'.

Helen Cullyer, 'Building Digital Environments for Scholarship: the IDP' at the Andrew W. Mellon Foundation. Mellon funds the Bamboo Project. DDBDP, HGV, and APIS are all databases of dated papyri; the IDP (Integrating Digital Papyrology) takes these three projects and incorporates them into a new one.

The funding:
1. Is driven by scholarly needs
2. Incorporates existing electronic resources into an interoperable environment
3. Develops editing software that can be reused in other projects
4. Takes a standards-based approach that facilitates interoperability with other resources beyond papyri
5. Takes a staged approach to building a digital environment
6. Supports resources and tools developed and sustained by a number of funding agencies and institutions, of which Mellon is just one.

IMLS by Rachel Frick. Funds many categories of projects. Collaboration is an important theme in getting funded. Example: DigCCurr II: Extending an International Digital Curation Curriculum.

SSHRC (Social Sciences and Humanities Research Council of Canada) by Murielle Gagnon.

Thematic research priorities:
image, text, sound and technology
environment and sustainability
north
aboriginal research
cultural, citizenship and identities
management, business and finance

For DH: image, text, sound and technology (ITST), new and creative applications. Also the Digging into Data Challenge.

National Science Foundation by Stephen Griffin.
Deutsche Forschungsgemeinschaft by Christoph Kummel.
Arts and Humanities Research Council of UK by Shearer West.

DH09: History and Future of the Book, three presentations

1) 'History and Future of the Book' by Christian Vandendorpe from the University of Ottawa. Author of 'From Papyrus to Hypertext.' Given the discrete nature of the written word, readers want more control over their reading matter. The shift from verbal exchanges to written messages: SMS, Twitter.... In Nicholas Carr's 'Is Google Making Us Stupid?,' Carr says he can no longer do deep readings; he is always distracted after a few pages. The capacity to concentrate should not be equated with the intimate reading of a novel. McLuhan wrote that 'the mere fact of reading is itself a lulling and semi-hypnotic experience' (From Cliché to Archetype). Ergative reading is becoming the default mode of reading. The book as a sacred entity has been disappearing since the 1960s, as theorized by Foucault and Derrida. From the reader to the wreader (Proust and the Squid). A wish list by Vandendorpe: hyperlink passages and establish relations between all the books of the Universal Library (Kevin Kelly, 'Scan This Book!', NY Times, May 14, 2006); the ability to denote basic semantic relations between documents; and the ability to visualize those relations in semantic maps that would make mass trends immediately perceptible.


2) 'Expanding Text: D.F. McKenzie's 'text' and New Knowledge Environments' by Richard Cunningham and Alan Galey. INKE is a project designed to implement replacements for the book. D.F. McKenzie, 'Bibliography and the Sociology of Texts', page 13 (passage read aloud).

Architectures of the Book: ArchBook Proposal
Will consist of four research groups: Textual studies, reader studies, interface design, information management.
Goals Sought:
The accessibility and ubiquity of Wikipedia; the scholarly quality of the 'History of the Book in Canada,' as an essential introduction to textual scholarship; and the visual richness of digital resources like the British Library's 'Treasures in Full' (http://www.bl.uk/treasures/treasuresinfull.html) and McMaster University's 'Peace and War in the 20th Century' (pw20c.mcmaster.ca).


3) 'NewRadial: Revisualizing the William Blake Archive' by Jon Saklofske. 'Songs of Innocence and of Experience'. HyperPo and Many Eyes tools. PaperScope (for astronomy). NewRadial is in beta but is going to be released soon.

DH09: The Digital Classicist presentations: on reusing data

'The Digital Classicist: Re-use of Open Source and Open Access Publications in Ancient Studies'. Three presentations:

1) 'Linking and Querying Ancient Texts (LaQuAT)' by Gabriel Bodard et al. (digitalclassicist.org/wip/seminar.xml) makes raw data available to be reused. OGSA-DAI is concerned with sharing structured data; OGSA-DQP with distributed query processing.

LaQuAT data resources: HGV (FileMaker Pro, in German; uses views to translate them); Projet Volterra (Access database with Perl-script-based publication, mainly text-based searches); and the IAph XML database (XML data source in EpiDoc, overlapping in time with Volterra). http://laquat.cerch.kcl.ac.uk/

2) 'Data and Code for Ancient Geography' by Thomas Elliott. Project called 'Pleiades.' Open source software is the backbone of Pleiades; Pleiades wouldn't exist without it. Demos of the project were shown, with live Twitter feeds appearing on screen as the demos were presented. Data from the Barrington Atlas. http://atlantides.org/trac/pleiades/wiki/PleiadesSoftware, http://openlayers.org, www.plone.org, trac.gispython.org/lab/wiki

e.g. Termessos in the Pleiades Beta Portal. Uses Google Maps.
License: http://atlantides.org/trac/pleiades/wiki/PleiadesContributorAgreement.

3) 'Recent work at the Center for Hellenic Studies (CHS): reuse of digital resources' by Neel Smith of the College of the Holy Cross. Three projects: the Homer Multitext Project, the First Thousand Years of Greek project (FTYG), and infrastructure for other projects (the CITE architecture). The mission of CHS is to share and expand knowledge of classical studies. Reuse is the mandate.

Licenses are important; proprietary licenses like the TLG's complicate reuse of other material (poorly written licenses complicate things). Recognized open licenses simplify collaboration: the Multitext experience with the Marciana (a site visited in Italy that holds Homer manuscripts).

Data: in well-known and/or well-specified formats; data models; data structures; identifiers. Data in familiar formats in FTYG: the project used the TLG word list, Perseus' morphological parser, the TEI-compliant LSJ from Perseus, and analytical indices. Sharing data models: a rich-markup bibliographical model of versions in hierarchical relation (bibliographers). The textual model in CTS combines four structural properties of text, defines functional equivalence of different implementations, and is realized in two distinct data models (XML texts plus a TextInventory, and a tabular structure of nodes).

Identifiers in FTYG: an inventory of extant Greek poetry known from MS transmission reuses, where possible, identifiers from the TLG canon (not always possible); an inventory of Greek lexical entities reuses, where possible, identifiers from the Perseus edition of LSJ.

Shared software: source code, running services with specified APIs, and [something else]
Source code can be modified: tlgu, the EpiDoc transcoding transformer, and the Diogenes software. Reusable modules and code libraries: the EpiDoc transformer is useful (instead of dumping data on the web, this app can be downloaded to use with the data); CTS URN manipulation with the CTS library.

Texts and specifications: a simple XML structure encodes requests; expected replies are encoded, etc.

CTS (Canonical Text Service)
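The CTS URNs mentioned above follow a colon-delimited citation scheme (namespace, work hierarchy, passage). As a rough illustration of the kind of manipulation a CTS library provides, here is a minimal Python sketch that splits a URN into its parts; the example URN and field names are illustrative, not drawn from the talk.

```python
# A minimal sketch of parsing a CTS URN, assuming the canonical
# colon-delimited form urn:cts:<namespace>:<work>:<passage>.

def parse_cts_urn(urn: str) -> dict:
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn}")
    namespace = parts[2]                             # e.g. a text inventory namespace
    work = parts[3].split(".")                       # textgroup.work[.version]
    passage = parts[4] if len(parts) > 4 else None   # e.g. book.line reference
    return {
        "namespace": namespace,
        "textgroup": work[0],
        "work": work[1] if len(work) > 1 else None,
        "version": work[2] if len(work) > 2 else None,
        "passage": passage,
    }

print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1"))
```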

DH09: 2 Presentations: Automatic Conjecture Generation in the Digital Humanities; Co-word Analysis of Research Topics in the Digital Humanities

1) 'Automatic Conjecture Generation in the Digital Humanities' by Patrick Juola and Ashley Bernola. 'Computers are useless because they can only give you answers' (Picasso). 'I believe that we can use computers to generate questions.' The computer should perform light analysis as well as data organization, farm out 'routine' reading to the computer, and intervene only when something 'unordinary' occurs. The structure of Graffiti has: a list of graph invariants (max degree, girth, # of nodes...), template-based conjectures, a program that generates random graphs, and, if a conjecture passes many tests, publication for mathematicians to prove. Epistemological background: Graffiti generates ideas. The program called 'the Conjecturator' applies a Graffiti-like structure to the generation of text-analytic conjectures: generate random testable hypotheses, etc. Components include a template, thesaurus, database and metadata, statistics, a prototype, explanations of conjecture outputs, etc. See www.twitter.com/conjecturator for examples of conjectures that mostly turn out to be untrue and/or uninteresting. What makes 'interesting' conjectures (i.e. conjectures worth working on)? How can this be improved? The Conjecturator can generate evidence and a reading list for further research.
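To make the Graffiti-like loop concrete, here is a hedged Python sketch of the general pattern: fill a conjecture template with randomly chosen terms, test it against a corpus database, and keep only conjectures that pass. The toy genres, words, frequencies, and acceptance margin are invented for illustration and are not the Conjecturator's actual components.

```python
import random

# Invented word frequencies per 1000 words, by genre, standing in for the
# project's database and metadata.
corpus_db = {
    "gothic novels":      {"shadow": 2.1, "light": 0.9, "money": 0.4},
    "realist novels":     {"shadow": 0.5, "light": 1.0, "money": 1.8},
    "sentimental novels": {"shadow": 0.7, "light": 1.4, "money": 0.6},
}

TEMPLATE = "Texts in genre '{a}' use the word '{w}' more often than texts in genre '{b}'."

def generate_conjecture():
    # Randomly instantiate the template, Graffiti-style.
    a, b = random.sample(list(corpus_db), 2)
    w = random.choice(list(corpus_db[a]))
    return a, b, w

def test_conjecture(a, b, w, margin=0.5):
    # "Light analysis": keep only conjectures where the difference is large.
    return corpus_db[a][w] - corpus_db[b][w] > margin

for _ in range(20):
    a, b, w = generate_conjecture()
    if test_conjecture(a, b, w):
        print(TEMPLATE.format(a=a, b=b, w=w))
```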


2) 'Co-word Analysis of Research Topics in the Digital Humanities' by Xiaoguang Wang and Mitsuyuki Inaba. How do we define 'digital humanities'? The aim is to get a clear view of the structure and evolution of digital humanities research based on the methodology of scientometrics, using a second-order study to examine research questions.

A variety of methods were discussed. One method, co-word analysis, explores conceptual networks in various disciplines, such as scientometrics. The authors used a coefficient index; data collection drew on 600 articles from DH journals and organizations: ALLC, ACH, LLC, DHQ. They listed the top 36 high-frequency keywords and the temporal change of term frequency. For example, contrasting the use of 'digital humanities' and 'humanities computing', one sees an increase in use of the former and a decrease in use of the latter. Using a cluster map, one can see 'humanities computing' moving from the center of the map, as a large node, in earlier years to a peripheral, smaller node in 2008. Further discussion: full-text analysis vs keyword analysis, combining keywords vs splitting keywords, synonymous terms vs different terms, and inductive approach vs deductive approach. Conclusions: an enhanced understanding of the direction of development of the term 'digital humanities,' no clear subdisciplines in digital humanities, and digital humanities can be considered a gateway for emerging research topics.
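As an illustration of the co-word method, the sketch below counts keyword co-occurrences across article keyword lists and scores pairs with the equivalence coefficient, a standard co-word measure; whether this is the exact 'coefficient index' the authors used is not recorded in my notes, and the keyword lists are invented.

```python
from itertools import combinations
from collections import Counter

# Invented keyword sets standing in for the 600-article dataset.
articles = [
    {"digital humanities", "text encoding", "TEI"},
    {"humanities computing", "text analysis"},
    {"digital humanities", "text analysis", "visualization"},
    {"digital humanities", "TEI"},
]

word_freq = Counter(w for kws in articles for w in kws)
pair_freq = Counter(frozenset(p) for kws in articles
                    for p in combinations(sorted(kws), 2))

def equivalence(w1, w2):
    # Equivalence coefficient: E = c_ij^2 / (c_i * c_j)
    c_ij = pair_freq[frozenset((w1, w2))]
    return (c_ij ** 2) / (word_freq[w1] * word_freq[w2])

for pair, _ in pair_freq.most_common(5):
    w1, w2 = tuple(pair)
    print(w1, "<->", w2, round(equivalence(w1, w2), 2))
```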

Wednesday, June 24, 2009

DH09: 3 Presentations: Corpus Analysis and Literary History; Predicting New Words from Newer Words; Our Americas Archives Partnership

1) 'Corpus Analysis and Literary History' by Matthew Wilkens at Rice University. What is literary criticism for? Old view: the true, the good, and the beautiful. New view: social systems, symptoms and structures.

What effects result from this shift? New objects, old methods: media studies, cultural studies, canon changes; theory.

Certain tensions: close analysis methods don't scale well; theoretical methods require extrapolation from limited cases.

Example: periodization and literary history
Why study it? Historical connections to particular changes; metaphorical similarity to other kinds of change. What's the standard view? Punctuated equilibrium, Kuhnian paradigms.

Test case: Model System and result
American Late Modernism: 25 years after WWII. Allegory and event: allegory as mechanism of paradigm change; explains how events stabilize and propagate.

Allegory and events--problems: selection bias and generalizability.

Corpus analysis: advantages
historical breadth
genre and domain breadth
In principle, no selection bias

Three potential corpora: MONK Project, Project Gutenberg, and Open Content Alliance/Google.

Classical allegory is rhetorically and grammatically simple, short, and uses a limited set of tropes.

Tag the MONK corpus using MorphAdorner.

Future research areas:
larger corpora, metadata difficulties, word and n-gram frequencies, machine learning.

Matthew Wilkens' website: http://workproduct.wordpress.com


2) 'Predicting New Words from Newer Words: Lexical Borrowings in French' by Paula Horwath Chesley and R. Harald Baayen.

Of all the new words out there, which will stick around in the lexicon? We did a case study of lexical borrowing in French. In the Le Monde corpus, 0.081% of all tokens are new borrowings. We compared Le Monde and Le Figaro. Conclusions: frequency and dispersion play a role in predicting lexical entrenchment, lexical entrenchment of borrowings is highly context-dependent, and lexical entrenchment of borrowings is predictable. I have a handout for interested parties.
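A hedged sketch of the two predictors emphasized above, frequency and dispersion, computed over a toy corpus; the French snippets and borrowings are invented, and the real study fitted statistical models over Le Monde and Le Figaro rather than anything this simple.

```python
from collections import Counter

# Invented mini-corpus: three short "articles".
corpus = {
    "article1": "le week-end fut calme mais le buzz continua",
    "article2": "un nouveau buzz sur internet",
    "article3": "le meeting politique et le buzz mediatique",
}
borrowings = ["week-end", "buzz", "meeting"]

tokens_per_doc = {doc: text.split() for doc, text in corpus.items()}
freq = Counter(t for toks in tokens_per_doc.values() for t in toks)

for b in borrowings:
    frequency = freq[b]                                              # raw frequency
    dispersion = sum(1 for toks in tokens_per_doc.values() if b in toks)  # docs containing it
    print(f"{b}: frequency={frequency}, dispersion={dispersion}/{len(corpus)} documents")
```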


3) 'Our Americas Archives Partnership (OAAP): Charting New Cultural Geographies' by Lisa Spiro from Rice University. A collaboration among scholars, librarians, IT folks, etc. At the transnational level: what roles do collaborative, comparative, border-crossing research play in this reconfigured field? Access to open collections, new research tools, an interactive research community, and pedagogical innovation. The partners are Rice University, the University of Maryland (MITH) and Instituto Mora, Mexico City. The collections: the Early Americas Digital Archive (based at MITH), the Americas Collection 1811-1920, and the Instituto Mora's social history collections. Federation of content is utilized. Tools being developed: view search results in SIMILE's Timeline interface. A challenge is to get other institutions to participate. Rice is currently hosting Instituto Mora's digital collections.

DH09: 3 Presentations: The Thomas Middleton Edition; A Modern Genetic Edition of Goethe's Faust; An E-Framework for Scholarly Editions

1) 'Experiences of the Previous Generation: The Thomas Middleton Edition' by John Lavagnino at NUI Galway and King's College, London. Print edition finished publishing in 2007.

Some problems considered

Exchanging text problem
Rebirth of lightweight markup: Wikipedia, Markdown (e.g. jottit.com). The problem with this approach: it starts out great but gets overloaded. Linking is a tedious chore, especially without software support, and there is lots of it in scholarly editions; the Middleton Edition has 60,000 links.

Line-number problem
Line numbers for editions are usually done by many graduate students.
The solution to the two problems: prepare a draft version, print line numbers, key notes to the texts, and link both works. Advantage: lemmas and line numbers can be checked. Pitfalls: notes to small common words can be mislocated, editors want to have notes to parts of words, and the effort to build it.
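The 'check lemmas and line numbers' advantage can be illustrated with a small sketch: given notes keyed by line number and lemma, flag any note whose lemma does not occur on the cited line. The play text and notes below are invented examples, not the Middleton Edition's data or tooling.

```python
# Toy text and notes for illustration only.
text_lines = {
    1: "A game at chess, the noblest of all games",
    2: "Wherein the players flourish and then fall",
}
notes = [
    {"line": 1, "lemma": "chess", "note": "the central conceit of the play"},
    {"line": 2, "lemma": "flourish", "note": "both thrive and make a fanfare"},
    {"line": 2, "lemma": "checkmate", "note": "a deliberately mislocated note"},
]

for n in notes:
    line_text = text_lines.get(n["line"], "")
    if n["lemma"].lower() not in line_text.lower():
        print(f"WARNING: lemma '{n['lemma']}' not found on line {n['line']}")
```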

The machine problem
A machine transforms the XML into printed pages, and it must keep running in order to prepare new reprints. Currently built on top of TeX, shared body of code, EDMAC, its distance from the full system needed, and the diversity of scholarly editions.

The update problem
Outdated things like DSSSL, pre-Unicode, need to keep machine running, maintenance, the delusion of continuous expansion, potentially a bigger problem for online editions.

The variety of texts: editorial methods include multiple versions and parallel texts of various sorts; unusual works: six-column parallel layout, marginal notes that run down the page; editorial conventions: decimal line numbering of stage directions. For example, the MGH has editorial footnotes.

2) 'Requirements and tools for a modern genetic edition of Goethe's Faust' by Fotis Jannidis. Earlier historical-critical editions: the Weimarer Ausgabe. Goals: a new critical edition, diplomatic transcription of all source texts, and a genetic view of all texts to show their sources.

Problem 1: text-centered vs document-centered views. The text-centered view covers the drama: parts, acts, etc.; the document-centered view renders the witness as closely as possible. Possible solutions: render the document using SVG and encode the text as standoff markup; or, secondly, encode the text and add selected document information.

Problem 2: genetic encoding: requirements involve linking all text in all documents to the final version of the text, include all temporal information. Solution: TEI SIG manuscripts--working on genetic editions. Working in the area of genetic editions: Elena Pierazzo, Malte Rehbein and me.

Outline:
Textual alterations
time
critical apparatus
genetic relations between and within documents
document editorial decisions

Problem 3: Interoperability
Requirements: open access, Creative Commons, access to the XML-encoded text, persistent identifiers. The solution involves linking to the XML version in the HTML header; linking to a canonical reference schema (part, act, scene, verse); the DNB; and last, open for PI for which parts/views of the edition?

Relevant websites:
www.faustedition.de
www.textgrid.de
wiki.tei-c.org/index.php/Genetic_Editions

3) 'An E-Framework for Scholarly Editions' by Susan Schreibman of the DHO. Why an e-framework? The crisis in scholarly publishing is acute, yet very little has been done in the last 15 years. The digital silo approach is both a philosophical and a pragmatic approach, yet it is increasingly expensive and leaves large legacy issues (upkeep, migration, preservation). Academic credit is another issue: there is no universally recognized peer review system to evaluate e-scholarship (NINES is investigating a model); there is a larger diversity of scholarly outputs than in analogue form; and the humanities have not come to terms with collaboration. Is the digital a solution? Digital is not a cheap solution; there are issues of sustainability, etc.

Models: scholarly communities
TextGrid and NINES are good examples.

Models: university presses
University of Virginia Press Scholarly Imprint
Rice University Press

Principles for emerging systems for scholarly publishing (2000).


The Idea of an Irish Digital Scholarly Imprint, March 2009.
Taking the best of these models: infrastructure, social structures and preservation.

  • Types of text to be published: traditional scholarly forms, backlists, reprints of out-of-print texts, runs of texts; new forms: online scholarly editions, thematic research collections, new types of editions based on the primary sources of print editions.
  • Economic considerations must include a cost recovery model: print on demand, subscription, pay per view; who pays for making texts available for the public good?
  • Issues of collaboration: how to acknowledge and give credit for new roles, and what can be expected from authors in these new roles?
  • Possibilities for collaborations: libraries, tourism, cultural heritage.
  • Technical issues: permanent identifiers, migration, new compilations, authenticity of texts to allow multiple editions to be displayed simultaneously.

See the same-day blog entry on the DHO for how the DHO is addressing these problems.

DH09: Panel discussion: The Digital Humanities Observatory (DHO): Building a National Collaboratory

'A Potted Pre-History of the DHO' by Jennifer Edmond, Trinity College, Ireland. Edmond gave a historical overview from the early 1990s. A recommendation from the 2004 OECD Report on Higher Education: 'That greater collaboration between institutions be encouraged and incentivised through funding mechanisms in research....' HSIS: Humanities Serving Irish Society. This is a platform coordinating research.

'The DHO: a 21st Century Collaboratory' by Susan Schreibman. The DHO is part of the HSIS consortium to develop an all-island, inter-institutional research infrastructure for the humanities, defining best practices for digitisation, curation, discovery, and presentation. The DHO is coordinated by the Royal Irish Academy; it makes recommendations but cannot enforce them. It is not an institution of higher learning, so it is not eligible for funding through traditional routes, and it has three years of funding only. It provides data management, curation and discovery services. How do you empower a community that is scattered, has only nascent interest in DH, and has only three years of funding? Challenges: not enough institutional capacity in critical areas; not enough DHO staff to fill in gaps or work closely with each project; and ideally, the DHO would have begun a year before project partners so as to have infrastructure and policies in place. Workshops work better than lectures, along with small group seminars on strategic issues and outreach events to build capacity at partner institutions. There is collaboration beyond HSIS partners: DHO Recommended Metadata Standards (first draft expected August 2009), the Text-Image Linking Environment (TILE), SEASR, the Digital Narrative Discussion Group, and others.

'DHO Outreach Activities' by Shawn Day. Connect with many Irish educational institutions. An evolving mission from digitization to collecting, encoding, standards, tools, and moving in the third year to dissemination. The audience is humanists with digital potential as well as digital humanists, existing projects at partner institutions and cultural and heritage institutions. DHO events: partner clinics, workshops, symposia, consultations, and raising the DHO presence on the island.

'Metadata in support of a National Collaboratory' by Dot Porter. Metadata supports outreach and IT initiatives (DRAPIer and a Fedora repository). In outreach activities, the metadata recommendations will depend on the specific needs of the individual project: MODS, CDWA, VRA Core. The DHO may also provide recommendations for controlled vocabularies, taxonomies, and ontologies (e.g. CIDOC CRM). DRAPIer (Digital Research and Projects in Ireland) allows digital humanists to know who is working on what. Some scholars are unaware of who is working on something at their own institution!

The final speaker was Don Gourley. How do you build infrastructure to support collaboration? We are defining the IT deliverables: a community portal (DHO news, events, profiles, forums, wikis, access to projects, e-resources), DRAPIer (databases of projects and methods), and a demonstration repository (demonstrator project requirements, generalize useful services, create a laboratory for experimenting).

DHO Strategies
Strategy 1: build on open source software (Drupal, Fedora, PHP, MySQL, PostgreSQL, Linux).
Strategy 2: use rapid application development.
Strategy 3: partner with Irish digital initiatives.
Strategy 4: employ object-oriented design patterns.
Strategy 5: integrate tools for multiple use cases and skill sets.

Questions from the audience: How do you monitor or measure your success? We measure by the number of deliverables (which we have exceeded). You are well funded; how did you get it? Quote of the day by Susan Schreibman: 'The Celtic Tiger has been shot and the kitten is dying somewhere in the room.' Funding continues to be a challenge and the Irish government funds only once (this results in many organizations mutating or evolving into other types of institutions to continue funding and, consequently, to survive).

Addendum: Susan Schreibman et al. gave a lecture on the crisis in the digital humanities back in January 2009.

DH09: TEI Encoding Projects: In the Header, but Where?; A Tool Suite for Automated TEI Encoding; Modernist Magazines Project

1) 'In the Header, but Where?' by Syd Bauman (Women Writers Project) and Dot Porter. Problems with the TEI Guidelines? One problem: the encoder is instructed to put something in the header, but where? In the Guidelines (a 1500-page document), they located the instances where 'header' was mentioned, noting where they are clear and precise and where they need work (too ambiguous). The TEI-C needs an editorial document to guide the encoding of new sections and chapters--quite possibly in the ODD for the TEI Guidelines.


2) 'A Tool Suite for Automated TEI Encoding' by Laura Mandell, Holly Connor, and Gerald Gannod. They established CHAT (Collaboratory for Humanities, Art and Technology) at Miami University in Ohio. Goal: create tools to enhance the end-user experience without the end-user needing to learn code. Existing tools: standalone desktop applications--Oxygen, XMetaL, etc. Some examples from CHAT were shown. Last, it was noted that Word 2007 has many XML features that can be used, better than OpenOffice. Blog: http://miamichat.wordpress.com/


3) 'Modernist Magazines Project' by Federico Meschini. Three printed volumes have been created so far (OUP and a website). The original aim of the project was to complement the print publication with a bibliographical index and a selection of facsimile reproductions, but it has moved to being a project 'in itself.' How do you encode indexes (tables of contents) of serial publications using currently available metadata standards: TEI, Dublin Core, MARC[XML] or MODS? How do you manage this? Second problem: the indexes were compiled by modernist scholars not versed in library and information science. Teaching the basics of information literacy was quite easy--not the same as XML encoding! Two solutions: batch processing or an interactive form (web or standalone app). The project went with the first option: OOXML (Excel) and XSLT (1.0) for version 1; version 2 used Excel and Java (Apache POI/JDOM). Processing involved reconciling Excel's tabular model with XML's tree model. Examples were shown. What can you do with the XML files? eXist, plus XQuery/XSLT/JavaScript, for almost anything.
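As a rough illustration of the underlying task (turning a tabular index into XML), here is a minimal Python sketch; the column names and output elements are invented, and the project itself used OOXML/XSLT and later Excel with Java (Apache POI/JDOM), not Python.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A stand-in for a spreadsheet export of a magazine's table of contents.
tabular_index = io.StringIO(
    "issue,page,title,author\n"
    "1,3,Editorial,Anonymous\n"
    "1,5,On the New Poetry,E. Pound\n"
)

root = ET.Element("contents")
for row in csv.DictReader(tabular_index):
    item = ET.SubElement(root, "item", issue=row["issue"], page=row["page"])
    ET.SubElement(item, "title").text = row["title"]
    ET.SubElement(item, "author").text = row["author"]

print(ET.tostring(root, encoding="unicode"))
```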

Tuesday, June 23, 2009

DH09: Christine Borgman, Scholarship in the Digital Age: blurring the boundaries between the sciences and the humanities

The Digital Humanities Manifesto 2.0 illustrates the difficulty of defining the 'humanities'. Whose problem is it to organize digital information? Answer: 'Humanists must plan their digital future', which is also the title of a recent article in the Chronicle of Higher Education by Johanna Drucker.

Outline of talk:
  • scholarly information infrastructure
  • science (and, or, vs ) humanities
  • call to action

scholarly information infrastructure

Infrastructure must be convergent. Cyberinfrastructure, eScience, eSocialScience, eHumanities, and eResearch are all used to describe this phenomenon. Goal: enable new forms of scholarship that are emergent (information-intensive, data-intensive, distributed, collaborative). We must build a research agenda for digital scholarship.

science (and, or, vs ) humanities

a) Publication practices
Why do scholars publish? Legitimization, dissemination, access, preservation, and curation. Things in the humanities are out of print before they are out of date; this is the opposite of the sciences. ArXiv.org received 5000 articles for publishing and gets over a million hits per day.

b) Data
Data in digital scholarship is becoming scholarly capital. Data are no longer thrown away but kept and reused, posted and shared. Also, new questions are being asked with extant data, e.g. computational biology, collaborative research. This is coming to the humanities! What is data? Observational, computational, experimental, and records, mostly applying to science (long-lived data, NSF, 2005). Are data objective or subjective? Data can be facts, 'alleged evidence' (Buckland, 2006). Examples of scientific data: weather, ground water, etc. Examples in the social sciences include opinion polls, interviews, etc. Humanities and arts data examples would be newspapers, photographs, letters, diaries, books, articles, marriage records, maps, etc. (A comment from the audience afterward noted that these aren't our data; our data are arguments, patterns, etc.) Sources can be libraries, archives, museums, public records, corporate records, mass media, acquisitions from other scholars, and data repositories. (Read 'The End of Theory: The Data Deluge Makes the Scientific Method Obsolete' in Wired 16.07.) What counts as data in the humanities will drive what is captured, reused and curated in the future!

c) Research methods
The research labs of the future humanities will be blended, common workplaces. New problem-solving methods in the sciences: empirical, theory, simulation, and now data (Jim Gray). y = e(x squared). E.g. the Sloan Digital Sky Survey provides open access to data in astronomy; e.g. Life Under Your Feet; e.g. the Rome Reborn Project.

d) Collaboration
Science is mostly collaborative; the humanities are not. Spending 6-10 years working alone on a dissertation makes it difficult to work in a collaborative environment. CENS (Center for Embedded Networked Sensing). What are CENS data? E.g. dozens of ways to measure temperature.

e) Incentives (Chapter 8 of her book)
Motivations include coercion, open science, etc. Rewards for publication, effort to document data, competition, priority of claims, intellectual property issues like control over data.

f) Learning
'Fostering Learning in the Networked World', 2008. Cyberlearning is the use of networked computing and communications. A couple of recommendations: build a cyberlearning field and instill a 'platform perspective' like Zotero's; enable students to use data; promote open educational resources. Why openness matters: interoperability trumps all, adds value; discoverability of tools; reusability.

Call to action

Publication practices: increase speed and scope of dissemination through online publishing and open access.
Data: define, capture, manage, share, and reuse data.
Research methods: adapt practices to ask new questions, at scale, with a deluge of data.
Collaboration: find partners whose expertise complements yours, listen closely, and learn.
Incentives: identify best practices for documentation, sharing and licensing humanities content.
Learning: build a vibrant digital humanities community starting in the primary schools.
Generally: err towards openness, reusability, and generalizability.

Future Research questions/problems:

What is data?
What are the infrastructure requirements?
Where are the social studies of digital humanities?
What is the humanities laboratory of the 21st century?
What is the value proposition for digital humanities in an era of declining budgets?

DH09: 3 presentations: the Archimedes Palimpsest; Language and Image: T3 = Text, Tags and Trust; and Mining Texts for Image Terms: the CLiMB Project

1) 'Integrating Images and Text with Common Data and Metadata Standards in the Archimedes Palimpsest' by Doug Emery and Michael Toth of Toth Associates and Emery IT.

a) Image registration
Reconstructing the underwriting began with multiple overlapping photographs that were combined to get 'one' image. The technique was chosen so that two images were combined to create a level of contrast between the two different writings. An LED illumination panel was used in 2007 to capture an entire leaf in one shot; the result was a series of shots along the visible light wavelengths. In some cases, the text was simply gone, and using raking lights one was able to see the indentation in the palimpsest. There were six additional texts in addition to Archimedes; for one author, 30% of his known writings are found in this palimpsest. Question: Ink to Digits to ??? Once this knowledge is captured, how do you preserve it for posterity given the issues of digital preservation? Goals of the program were to set up standards for authoritative digital data, provide derived information, etc. 2.4 TB of raw and processed data were created once the manuscript was digitized. Every image is registered at 8160 x 10880 pixels, TIFF 6.0, eight bits per sample, resolution 33 per pixel(?).
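To illustrate the general idea of combining two spectral bands so that the underwriting stands out, here is a hedged numpy/PIL sketch using a normalised difference of two band images; the file names are placeholders, and the actual Archimedes processing pipeline was far more sophisticated than this.

```python
import numpy as np
from PIL import Image

# Placeholder file names for two registered captures of the same leaf
# under different illumination bands.
band_a = np.asarray(Image.open("leaf_red_band.tif").convert("F"))
band_b = np.asarray(Image.open("leaf_uv_band.tif").convert("F"))

# Normalised difference: regions where the two bands respond differently
# (e.g. the erased underwriting) stand out against the overwriting.
eps = 1e-6
diff = (band_a - band_b) / (band_a + band_b + eps)

# Rescale to 0-255 for viewing and save.
out = ((diff - diff.min()) / (np.ptp(diff) + eps) * 255).astype(np.uint8)
Image.fromarray(out).save("leaf_enhanced.png")
```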

b) Project Metadata Standards
The Content Standard for Digital Geospatial Data was used. The aim was long-term data set viability beyond current technologies. Six types of metadata elements were used; metadata is stored in TIFF description fields. Uses OAIS. The team is working on hyperspectral imaging of a Syriac palimpsest of Galen. Also, see the Walters Digital Tool.



2) 'Language and Image: T3 = Text, Tags and Trust' by Judith Klavans, Susan Chun, Jennifer Golbeck, Carolyn Sheffield, Dagobert Soergel, and Robert Stein. The practical problem addressed: the limited subject description of an image, with an image of Nefertiti as an example. Viewers tag 'one eye, bust of a woman, Egypt, etc.' The T3 Project is driven by this challenge of matching the image and the words used to describe it. Text: Computational Linguistics for Metadata Building (CLiMB); Tags: the Steve.Museum project collects tags; and Trust: a method of computing interpersonal trust. Each 'T' had a different goal: CLiMB was used for text while working with catalogers; Steve (the Art Museum Social Tagging Project) worked with the public to find the images they wanted through tagging; and for Trust, ambiguity and synonymy are the biggest hurdles. Trust (http://trust.mindswap.org) is useful as a weight in recommender systems. Some research questions posed: How do you compute trust between people from their tags? Can clustering trust scores lead to an understanding of users? What types of guidance are acceptable? Does 'guidance' lead to 'better' tagging?
Current work of CLiMB:
1. analyze tags from Steve with CL tools (50,000 tags): morphology and ambiguity issues (e.g. gold as metal or color?)
2. experiments on trust in the T3 setting: how to relate trust in social networks
3. consideration of tag cloud for ambiguity: blue (the color or your mood?)
4. review of a guided tagging environment


3) 'Mining Texts for Image Terms: the CLiMB Project' by Judith Klavans, Eileen Abels, Jimmy Lin, Rebecca Passonneau, Carolyn Sheffield, and Dagobert Soergel. Judith Klavans of the University of Maryland also gave this presentation. Image catalogers-->catalog records-->image searchers. Terms are informed by art-historical criteria, the ability to find related images, and leveraging meaning from the Art and Architecture Thesaurus (AAT). One technique for disambiguation is to use SenseRelate to find the correct meaning of the head noun, compare that meaning's definition to the definitions of all the AAT senses, and compare the results.
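In the spirit of that gloss-comparison step, here is a small sketch that picks a sense for 'gold' by overlapping context words with candidate definitions; the two glosses are paraphrased placeholders rather than real AAT records, and this is not the SenseRelate algorithm itself.

```python
# Toy gloss-overlap disambiguation.
STOPWORDS = {"a", "an", "the", "of", "in", "with", "used", "for", "or", "and"}

def content_words(text):
    return {w.strip(".,").lower() for w in text.split()} - STOPWORDS

# Paraphrased placeholder senses, not actual AAT entries.
candidate_senses = {
    "gold (metal)": "a soft yellow metallic element used in coins and jewelry",
    "gold (color)": "a deep yellow color resembling the metal gold",
}

caption = "Portrait with a gold background and warm yellow tones"
context = content_words(caption)

# Choose the sense whose gloss shares the most content words with the caption.
best = max(candidate_senses,
           key=lambda s: len(context & content_words(candidate_senses[s])))
print("chosen sense:", best)
```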

DH09: Three presentations on Reading Ancient, Medieval and Modern Documents

1) Towards an Interpretation Support System for Reading Ancient Documents (missed part of this presentation, see abstracts). Concepts and web sites mentioned:

ISS prototype
Evidence-based decision process
EpiDoc
Contextual coding markup
index searcher (Ajax Live Search)
RESTful Web Service
Vindolanda Knowledge Base Web Service (Vindolanda Tablets)
eSAD Project

2) 'Image as Markup: adding Semantics to Manuscript Images', presented by Hugh Cayless, New York University.

The img2xml project (begun June 1, 2009) uses SVG tracing of text on a manuscript page as a basis for linking page images to transcriptions, annotations, etc. The goal is to produce a web publishing environment that uses an open-source, open-standard stack to integrate the display of text/annotation/image. The test case is a 19th-century journal, the prose of a student James Dusedery (sic) from the 1840s. Cayless contrasted raster, zoomed raster and vector imaging. The vector version is 'a shape with a black fill' rather than the pixelization of raster images. In the SVG, 'words' consist of combinations of shapes; the relationship between shapes is based entirely on their position. To align the image with the transcription, we must make the structure in the SVG explicit. An OHCO approach could be attempted using groups (SVG <g> elements) but is a non-starter due to overlapping handwritten words (descenders, e.g. the letter 'f' descending into the word below it). He drew on Drucker and McGann (Images as the Text: Pictographs and Pictographic Logic). In describing an entity: image-->abstract entity (concept)-->linguistic entity (word). Not only symbols on the page but structure as well; e.g. articulating the relationships among entities is important too.

Conclusions: the act of reading rescues us from attending to the presentational logic of the text (Drucker and McGann); you can do many things with an XML sketch of a text; a machine-actionable language is needed to run this (RDF perhaps?); add ids to the SVG for annotation with TEI; you can create URLs to any text point. img2xml: http://github.com/hcayless/img2xml/tree/master
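To make the linking idea concrete, here is a minimal sketch that wraps traced SVG paths into word-level groups and emits standoff links from transcription tokens to those groups; the element names, ids, and path data are illustrative, not the img2xml schema.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)

svg = ET.Element(f"{{{SVG_NS}}}svg")
# Pretend these paths came from tracing the page image (e.g. with Potrace).
shapes_for_word = {
    "w1": ["M10 10 L20 10", "M22 10 L30 10"],   # strokes of the first word
    "w2": ["M40 10 L55 10"],
}
for word_id, paths in shapes_for_word.items():
    g = ET.SubElement(svg, f"{{{SVG_NS}}}g", id=word_id)
    for d in paths:
        ET.SubElement(g, f"{{{SVG_NS}}}path", d=d)

# Standoff links from transcription tokens to SVG groups.
links = ET.Element("links")
for word_id, token in [("w1", "Dear"), ("w2", "Journal")]:
    ET.SubElement(links, "link", target=f"#{word_id}").text = token

print(ET.tostring(svg, encoding="unicode"))
print(ET.tostring(links, encoding="unicode"))
```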

Interesting web site for transforming bitmaps into vector graphics: Potrace

3) The third presentation was titled 'Computer-aided Palaeography: Present and Future' by Peter Stokes, Cambridge University. Palaeography is defined as 'the study of medieval and ancient handwriting.' Problems with palaeography: manuscripts are difficult to read; multiple authors write on the same page over time. Palaeographers offer subjective opinions rather than objective analysis. Truth... 'depends on the authority of the palaeographer and the faith of the reader.' We must replace the qualitative data provided by the palaeographer with quantitative data. Were these two writings written by the same author? Instead ask, 'Is it even possible to objectively decide whether these two pieces of writing were written by the same person?' There is some objective basis on which this process can be quantified: fully automatic systems have been developed that can identify writers accurately 95% of the time (Srihari 2002, 2008). Computational palaeography. See: The Practice of Handwriting Identification, The Library (2007), p. 266, n. 27. What can we do with this material? Requirements: 'cross-examinable', including interpretable, reproducible, communicable, and allowing variation and flexibility in analyzing handwriting. Software: the 'Hand Analyser': both the data and the process are recorded, data and processes can be shared (Java), and the system will never be 'finished' but is extensible with plugins.
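As an illustration of replacing qualitative judgements with quantitative measures, the sketch below computes a few toy features (ink density, aspect ratio, vertical centre of mass) from a binarised glyph; these are invented examples and not the measurements used by the Hand Analyser.

```python
import numpy as np

# A crude stand-in for a binarised glyph image (True = ink).
glyph = np.zeros((20, 12), dtype=bool)
glyph[4:16, 5:7] = True          # a simple vertical stroke

rows, cols = np.nonzero(glyph)
height = rows.max() - rows.min() + 1
width = cols.max() - cols.min() + 1

features = {
    "ink_density": glyph.mean(),                 # proportion of ink pixels
    "aspect_ratio": height / width,              # tall vs wide letter forms
    "vertical_centre": rows.mean() / glyph.shape[0],
}
print(features)
```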

DH09: Animating the Knowledge Radio; Text analysis using High Throughput Computing; Appropriate Use Case Modeling for Humanities Documents

This session is chaired by Chuck Bush. The first presentation is by Geoffrey Rockwell (Alberta) and Stefan Sinclair (McMaster). 'Hacking as a Way of Knowing' was presented earlier this year (2009). They wanted to create a literary technology of a sort. (See the NiCHE website.) Questions posed: What opportunities are there for using HPC in the humanities? What are examples of good practices? What are the barriers to humanists using HPC facilities? What concrete steps for outreach to the DH community?
Recommendations: introductory documentation on HPC for humanists, training opportunities, fellowships, etc.; and visualization is the key.

The Big See (website?) created a prototype in DH/HPC visualization: a scalable dataset, processing, and display of a weighted centroid in 3D with time. The Big See took a 2D text work and converted it into 3D art in real time. The final product was a visualization that could be understood through how it was created; animation can stand in as a way of understanding what the final visualization represents. The Knowledge Radio goes beyond the Big See in the following way: instead of a static corpus, why not use a dynamic corpus, like streaming data or playable data? For example, a blogger could examine how a particular word evolves over time across the web. (Cf. the Code_Swarm project.) Voyeur prototype radios, three examples: the Voyeur website, where you load your own text box; second, the LAVA website, which lines up the use of words in various documents to compare them; and third, the Balls Reader interface.
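The 'dynamic corpus' idea can be sketched very simply: count a target word in a stream of dated posts and aggregate by month. The posts below are invented; a real Knowledge Radio would read a live feed and animate the result.

```python
from collections import defaultdict

# Invented timestamped posts standing in for a live web feed.
stream = [
    ("2009-04-03", "the economy dominates the news again"),
    ("2009-05-11", "twitter and the economy in one headline"),
    ("2009-05-28", "more talk of twitter this month"),
    ("2009-06-02", "twitter everywhere at the conference"),
]
target = "twitter"

counts = defaultdict(int)
for date, text in stream:
    month = date[:7]
    counts[month] += text.lower().split().count(target)

for month in sorted(counts):
    print(month, "#" * counts[month])   # a crude textual 'radio dial'
```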

Question from audience: this is fun but what interesting questions in the humanities can we answer with these tools?

The second presentation, 'Text analysis of large corpora using High Throughput Computing,' was by Mark Hedges, Tobias Blanke, Gerhard Brey, and Richard Palmer, all from King's College London. HiTHeR: High Throughput Computing in Humanities e-Research, funded by the JISC ENGAGE development programme (http://www.engage.ac.uk/). The aim is to create a campus grid for high throughput computing and connect to the NGS (National Grid Service). An earlier project: the Nineteenth-Century Serials Edition (NCSE), http://www.ncse.ac.uk. This is a digital edition of 6 newspapers and periodicals (3,526 issues, 98,000 pages, 427,000 articles). It used Olive's ViewPoint software, working from microfilm and print works from the British Library. Search by names, places, institutions, subjects, events, genres, and document similarity. Document similarity has advantages: it is an unsupervised technology with well-established methods, and from a user perspective it is intuitive and supported by ideas from prototype theory. Two approaches were tested: TF-IDF and Latent Semantic Indexing. LSI builds relationships based on co-occurring words in multiple documents, exposing underlying relationships. The problem was OCR quality: 19th-century newspapers on microfilm were a challenge. They overcame the problem by using tokens and their character n-grams as basic textual units. To compare all the articles (longer than 300 characters) would require over 950,000 comparisons: we do not have the time! After three days, only 350 articles had been compared. HPC relies on specially designed systems, whereas HTC (High Throughput Computing) relies on loosely coupled computing in the form of grid computing environments. Tools used: the Condor toolkit (from Scout), and other tools mentioned.
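A hedged sketch of why character n-grams help with noisy OCR: two copies of the same article with different OCR errors still share most of their character n-grams, so their TF-IDF vectors remain close. The snippets are invented, and the project ran comparisons like these on a Condor grid rather than in scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented articles: the second is a noisy OCR copy of the first.
articles = [
    "The Chartist meeting was held last evening in the town hall",
    "The Chart1st meetlng was he1d last evenmg in the town hall",
    "Foreign intelligence: the harvest in France is reported poor",
]

# Character n-grams within word boundaries, weighted by TF-IDF.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
tfidf = vectorizer.fit_transform(articles)

# Pairwise similarities: the first two articles score much higher
# against each other than against the third.
print(cosine_similarity(tfidf).round(2))
```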

Evaluation of the HiTHeR project. Problems: there is no TREC-style (Text Retrieval Conference) test collection; we can identify false positives, but we will almost never know about false negatives; we will never know what we didn't find; and what is 'similarity'? Similarity depends on the user's perspective.

The third presentation is 'Appropriate Use Case Modeling for Humanities Documents' by Aja Teehan and John Keating of An Foras Feasa (website), the Institute for Research in Irish Historical and Cultural Traditions at the National University of Ireland, Maynooth. The example source is a 400-metre scroll about a metre wide. They used Activity Theory (AT), which considers human activities as complex, socially situated phenomena; these activities result in tools. AT is a useful tool to model the activities of the digital humanities. HCI has 'ignored the study of artifacts, insisting on mental representations as the proper focus of study'; AT is seen as a way of addressing this deficit. An example would be data encoding, which involves modeling raw data in order to manipulate and query it. 'How do we know which information we will be able to mine?' XML:TEI is a single-use-case text encoding method. We are interested in producing Human Usable Texts, which require a different kind of encoding from XML:TEI since our activities are different. Document encoding as a community activity should be encouraged. A process of document encoding was illustrated by Aja Teehan: physical-->logical-->interaction. Encoding occurs on each level, and the choice of tool changes depending on the human activity. Software engineers use Use Cases to model activity. 'Primary Use Cases are those that are aligned to the original reason for encoding the data.' For example, medical records should support querying for medical data as the primary concern and, as a secondary concern, prosopography searches. Another example: 'The Chymistry of Sir Isaac Newton' contains the notebooks of Newton; a primary Use Case would be a query such as 'how many experiments did Newton conduct using copper?' Conclusions: we adopted a whole-system approach using AT, the Conversational Framework, GOMS (Goals, Operators, Methods and Selection rules), and SFL to examine the activities of the communities. We developed text encoding systems that support human usable texts. The identification, selection and realization of appropriate Use Cases is necessary for document encoding and the production of human usable texts. Demo: the Alcalá account book project.

Monday, June 22, 2009

Digital Humanities 2009 Conference: opening speaker Lev Manovich

Digital Humanities 2009 is underway! Lev Manovich is the opening keynote speaker. His talk was entitled 'Cultural Analytics: theory, methodology, practice'. He discussed three things tonight regarding the analysis and visualization of large cultural data sets: theoretical implications, interfaces/visualization techniques, and finally methods for the analysis of visual media and born-digital culture.

The availability of large cultural data sets from museums and libraries, digital traces and self-presentation, cultural information (the web presence of all cultural agents), the tools already employed in the sciences to analyze data, and the techniques developed in new media art make new methodologies for studying culture feasible.

Manovich used an analogy or parallel with fMRI in neuroscience: a global 'cultural brain'. We "need to start tracking, analyzing, and visualizing larger cultural structures, including connectivity, dynamics, over space and time." See: the world heat map made up from 35 million geo-coded Flickr photos (http://www.cs.cornell.edu/~crandall/photomap/).

We must have a new scale: digitization of existing cultural institutions; instant availability of cultural news. See the Coroflot Portfolios website.

In studying cultural processes and human beings, distinguish between shallow data and deep data. Shallow data is about many people and objects (e.g. statistics, sociology). Deep data is about a few people and objects (e.g. psychology, psychoanalysis, ethnography): the 'thick description' of the humanities. The development of high-performance computing and social computing has destroyed this division.

Cultural Analytics

Why cultural analytics? Knowledge discovery, from data to knowledge: Google Analytics, web analytics, business analytics, visual analytics--the science of analytical reasoning facilitated by 'interactive visual interfaces.'

What types of interfaces would you want in cultural analytics? An example would be the Platform for Cultural Analytics Research environment: HIperSpace (287 megapixels). Other examples: Barco's iCommand, the AT&T Control Center.

Such interfaces would have multiple windows, a long tail, e.g. looks on lookbook.nu (created by softwarestudies.com)

What to use in analytics? The following three items:
INTERFACES which combine media browsing and visualization to enable visual exploration of data.
GRAPHS which are built from actual media objects.
VISUALIZATION of a structure of a cultural artifact--borrowing and extending techniques from media art and 'artistic visualization.'

Dr. Manovich used many examples from his lab. Examples of works of art used include Rothko's abstract paintings, used to create new images and analyze the constituents of the paintings; manipulating the entire text of Hamlet by examining blocks of text for visual patterns; and finally, manipulating images of Betty Boop.

Theoretical Issues

Jeremy Douglass, Dr. Manovich's postdoctoral student, presented at this point. He used small sets of types of data: webcomics and 'juxtaposed images', e.g. A Softer World and FreakAngels. In quantizing, count the number of panels over hundreds of pages; we could treat it as a series of sequences, seeing panel types, aspect ratios, hues and saturation of colors (similar to DNA sequences), and the brightness of each page. The OS X Finder can be used as a visual browser of the data. Two other examples of manipulating large data sets include GamePlay--a graphical template of joystick moves coordinated with the video music (Zelda)--and 40 hours of playing Knights of the Old Republic; e.g. Markov chains can be created.
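As an illustration of the per-page measurements described above, here is a minimal sketch that reports average hue, saturation, and brightness for each page image; the file names are placeholders for a directory of scanned pages.

```python
from PIL import Image
import numpy as np

# Placeholder file names for scanned comic pages.
pages = ["page_001.png", "page_002.png"]

for path in pages:
    # Convert to HSV and average each channel over the whole page.
    hsv = np.asarray(Image.open(path).convert("HSV"), dtype=float)
    h, s, v = hsv[..., 0].mean(), hsv[..., 1].mean(), hsv[..., 2].mean()
    print(f"{path}: hue={h:.1f} saturation={s:.1f} brightness={v:.1f}")
```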

Back to Dr. Manovich. Some theoretical issues around cultural data mining/cultural visualization:

  • culture does not equate to cultural artifacts
  • statistical paradigm (using a sample) vs data mining paradigm (analyzing the complete population)
  • pattern as a new epistemological object--from meaning to pattern: we know how to interpret the meaning of a cultural artifact, but we don't know the larger patterns that artifacts form.
  • new digital divide, those who leave digital traces and those who do not.
  • from small number of genres to multidimensional space of features where we can look for clusters and patterns.

Analysis of Born Digital Culture

  • new technologies offer the possibility of analyzing interactions between readers and cultural objects/processes.
  • 'big data' has a new meaning in the case of interactive digital media (e.g. one single video game, but many users in one session to analyze).
  • digital objects have a self-describing property.
  • digital objects are sets of possibilities. Every moment of interaction can be tagged through structures of interaction. In the case of time-based procedural media, such range of possibilities will change over time.

Dr Manovich's web site for Cultural Analytics:
softwarestudies.com
Free downloads to experiment with media.