Problems and challenges in the digitisation and markup of immigrant letter collections
Over the past few decades there has been a growing interest in the immigrant letter and in how correspondence collections might inform our understanding of social history during the postal era of globalization. However, whilst their value as socio-historical artifacts is generally accepted, what to do with these letters remains a subject of debate. Central to these discussions is the issue of digitisation. What, for example, is meant by the ‘digitisation of letter collections’? How can letter collections speak to one another? How can they speak to different audiences? And how can they be made more accessible for research?
Typically, when textual objects such as letters are digitised, they are scanned with OCR or transcribed, and then saved in an electronic format. This process makes the letter more accessible and, to a certain extent, more searchable; however, the type of search that can be carried out is often very restricted (usually to a single word) and the search criteria are generally quite limited. To give an example, a search for the word ‘homesick’ in a small corpus of Irish immigrant correspondence did not produce any results. That is not to say, of course, that homesickness is not expressed in any of the letters; rather, it tended to be expressed in other ways, through phrases such as ‘I think about X’, ‘I dreamt about Y’, ‘I remember Z’. A closer examination also showed certain gender differences in the use and frequency of these phrases.
With the right system of markup, it would be possible for this type of information to be represented, allowing the letter collection to be explored in more useful, meaningful and creative ways, both quantitatively and qualitatively. A search on the theme of ‘homesickness’, for example, would produce the phrases described above, and it would be easy to then refine that search to look only at letters written by men or letters written by women.
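The kind of thematic search described above can be sketched in a few lines of Python. This is only an illustration, assuming that each phrase has already been annotated by hand with a theme and with the writer's gender; the `AnnotatedPhrase` class, the `search_theme` function, the letter identifiers and the sample sentences are all invented for this sketch and are not drawn from any actual corpus.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedPhrase:
    """One marked-up phrase plus the contextual metadata of its letter."""
    letter_id: str      # hypothetical identifier
    writer_gender: str  # hypothetical values: "female" or "male"
    theme: str          # thematic label assigned by the annotator
    text: str


# Invented sample annotations, for illustration only.
PHRASES = [
    AnnotatedPhrase("L001", "female", "homesickness", "I think about home every evening"),
    AnnotatedPhrase("L002", "male", "homesickness", "I dreamt about the old farm"),
    AnnotatedPhrase("L002", "male", "money", "I am sending two pounds"),
    AnnotatedPhrase("L003", "female", "homesickness", "I remember the fair day well"),
]


def search_theme(phrases, theme, gender=None):
    """Return phrases tagged with `theme`, optionally limited to one gender."""
    return [
        p for p in phrases
        if p.theme == theme and (gender is None or p.writer_gender == gender)
    ]


# All phrases annotated as homesickness, then only those written by women.
all_hits = search_theme(PHRASES, "homesickness")
female_hits = search_theme(PHRASES, "homesickness", gender="female")
```

None of the sample phrases contains the word ‘homesick’, yet all three are retrieved by the thematic search, because the search runs over the annotation rather than over the surface text.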
A markup language provides a way of describing an object (a poem, novel, letter etc.); it allows various layers of meaning to be added, so that the object can be looked at in different ways, and in several ways at the same time. The level and depth of markup will be driven by the type of research questions being asked and will reflect our ideas about the document. Researchers coming from a range of disciplinary perspectives will have different research aims and different ideas about how the letter should be represented and what features should be drawn out. Decisions relating to markup, therefore, should always be a collaborative effort and adopt an interdisciplinary approach, if these collections are to speak to different research communities.
The TEI (Text Encoding Initiative) offers a potential framework for the markup of correspondence, which would enable both contextual information (such as age, gender, location, occupation, class, religious denomination etc.) and textual information (such as sentences, paragraphs, parts of speech, spelling variations etc.) to be represented. It also has the flexibility to allow thematic content (such as love, work, money etc.) as well as pragmatic features (such as apologising, requesting, promising etc.) to be annotated, which would potentially open up letter collections in new and exciting ways. Once fully marked up, the collections can be used with various corpus interfaces or visualization tools, making the data more accessible to a wider range of stakeholders and allowing comparative studies to be carried out within one collection or across many collections.
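As a rough illustration of how these layers might sit together, the Python sketch below parses a small TEI-style fragment with the standard library. The fragment is deliberately simplified and has not been validated against the TEI schema (a valid header would need further elements); the choice of `<seg>` with an `ana` attribute for thematic and pragmatic labels, the label names, the `sex` and occupation values and the sample sentences are assumptions made for this example, not an agreed encoding scheme.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; the prefix "tei" is only a local convenience.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Simplified, unvalidated fragment with invented content.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc>
      <particDesc>
        <listPerson>
          <person xml:id="writer1" sex="F">
            <occupation>domestic servant</occupation>
          </person>
        </listPerson>
      </particDesc>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        <seg ana="#homesickness">I dreamt about the house last night.</seg>
        <seg ana="#money #promising">I will send you a pound at Christmas.</seg>
      </p>
    </body>
  </text>
</TEI>"""


def segments_by_label(root, label):
    """Return the text of every <seg> whose ana attribute points to `label`."""
    target = "#" + label
    return [
        "".join(seg.itertext()).strip()
        for seg in root.iterfind(".//tei:seg", TEI_NS)
        if target in seg.get("ana", "").split()
    ]


def writer_sex(root):
    """Return the sex attribute of the first <person>, or None if absent."""
    person = root.find(".//tei:person", TEI_NS)
    return person.get("sex") if person is not None else None


root = ET.fromstring(SAMPLE)
homesick_segments = segments_by_label(root, "homesickness")
promising_segments = segments_by_label(root, "promising")
sex_of_writer = writer_sex(root)
```

Because one segment can carry several labels, the same sentence can be retrieved as thematic content (money) and as a pragmatic act (promising), while the contextual information about the writer is held once, in the header.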
Developing and agreeing on a system of markup for immigrant letter collections is a challenging aspect of the digitisation process; however, it could potentially lead to fully integrated resources and more collaborative research opportunities.
University of Birmingham, Coventry University