From disaster recovery to digital preservation

Computers in Libraries [May 2012]

Breeding, Marshall.

Copyright (c) 2012 Information Today

Over the course of my career, I've seen libraries increasingly involved in digital content. Today, hardly any aspect of what libraries do remains untouched. Collections have shifted from physical to digital forms, though not necessarily uniformly. This shift has reshaped many aspects of how libraries operate, with profound implications not only for how they provide access to materials but, especially, in how these digital collections will be preserved for future generations. Libraries face enormous challenges in finding ways to preserve their collections as they move more deeply into the digital arena.

Disaster Recovery Versus Digital Preservation

The fragility of digital content cannot be overstated. Many things can go wrong, with great potential for catastrophic loss. Hardware will eventually fail, taking active copies of data with it. Software can malfunction in ways that corrupt files. Malware can invade computer systems in ways that not only disrupt access but might also destroy data. In this day of activist hackers, any organization, even a library, can suddenly find itself the victim of an intense and sophisticated attack.

Any organization with operations that depend on computer systems and their associated data will naturally implement procedures for disaster recovery. These procedures ensure that the organization can quickly recover from any sort of problem with its technical infrastructure, including restoration of any lost data. Organizations with great operational dependence on their computer systems, such as hospitals, financial institutions, or global internet-based businesses, will have one or more standby systems that can instantly take over should the primary systems fail. Organizations with globally distributed infrastructure, such as Google, have architectures in place with massive redundancy designed to work around failures of any given hardware or software component. Libraries and other organizations with more limited resources tend not to have this level of failover redundancy but rather focus on keeping up-to-date backups that can be restored once equipment has been repaired following a failure. Disaster recovery involves the ability to maintain the continuity of the organization, focusing on restoring data to its current state.

Digital preservation builds on a base level of disaster recovery, extending the scope of concern into the distant future. Digital preservation goes beyond addressing problems with restoring data to its current state to creating processes and infrastructure capable of carrying data forward hundreds of years, assuming that any formats, media, and equipment in place today will be obsolete and unsupported. The challenges of this long-term objective include not only creating highly resilient storage architectures but also maintaining metadata to support the re-creation of the content in future formats. Digital preservation includes an organizational strategic commitment to the forward migration of data through the inevitable cycles of technology. While disaster recovery ensures that a given organization can deal with any given failure, digital preservation also addresses the possibility of widespread and enduring failures, including disasters with extensive geographic reach and disruptions in communications that might endure for days, weeks, or years. The Open Archival Information System (OAIS) reference model provides some guidance for the design of repositories for digital preservation.

The relative strategic positions of the commercial sector and libraries come into play when considering disaster recovery versus digital preservation. In a business environment where strategies focus on the shorter term, society cannot necessarily count on publishers and other content providers to invest in strategies beyond those related to disaster recovery and business continuity. While they certainly have interests in longer-term preservation of their content, it is much more likely that forward migration through formats and other key processes will be deferred until times of more immediate need. Society has a broad interest in the preservation of the scholarship and cultural heritage represented in the products of publishers, and it may be up to libraries, rather than the primary publishers, to place these materials into the realm of long-term digital preservation. This reality isn't different from the print realm where libraries and archives play the lead role in conservation and preservation, more so than the original publishers, many of which may no longer be in business.

Challenges of Each Collection Component

The digital shift impacts each type of library material differently, creating an incredibly complex scenario for access and preservation. Books, periodicals, and special collections have their own trajectories in the shift from print to digital formats, and they also bring different challenges for digital preservation.

Scholarly journals. Journal collections have transitioned almost entirely from print to digital. From the earliest volumes of the backruns to current receipts, libraries routinely access scholarly journals and newspapers, as well as scholarly and professional magazines, electronically. Access to these materials generally happens through the platforms maintained by publishers and aggregators, with mechanisms in place to limit access to the patrons associated with subscribing libraries and not to the general public. We can have confidence that the publishers and providers have solid processes in place for disaster recovery.

Libraries, however, have taken significant measures to create an additional layer of protection for these materials to cover additional contingencies and to address concerns of digital preservation. In addition to any efforts that the publishers may take, the library community has collectively developed and operates a digital preservation strategy for these materials, called LOCKSS ("lots of copies keep stuff safe"), in which copies of each issue are maintained across a number of low-cost servers distributed across many libraries that communicate with each other to replicate and validate multiple digital copies. When the material is no longer available from the publisher, either through business failure or technical problems, libraries can activate access to this material in LOCKSS. Beyond restoration of access for short-term events, LOCKSS also addresses principles of forward format migrations and other concerns of digital preservation as specified in OAIS. LOCKSS provides a good example of libraries taking the initiative for the preservation, not necessarily exclusive ownership, of content in which they have a strategic interest. A related project, CLOCKSS (controlled LOCKSS), is a nonprofit organization of publishers and libraries that maintains its own LOCKSS environment. CLOCKSS provides a digital preservation environment into which publishers can deposit content, which will be released to the public following specific triggering events, such as discontinuation of publication. Currently more than 7,400 e-journal titles have been deposited in CLOCKSS.
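
The idea of many copies that validate one another can be illustrated with a short sketch. This is not the LOCKSS protocol itself, which is considerably more elaborate; it is a simplified majority vote over SHA-256 checksums, written under the assumption that each participating library holds a full copy of an issue. The library names, the sample content, and the function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of one stored copy."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(copies: Dict[str, bytes]) -> List[str]:
    """Compare all copies, overwrite outliers with the majority version,
    and return the names of the peers that were repaired."""
    digests = {peer: digest(data) for peer, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    # Refuse to repair unless a strict majority of peers agree.
    if votes * 2 <= len(copies):
        raise ValueError("no majority agreement; manual review needed")
    good_peer = next(p for p, d in digests.items() if d == winner)
    good_copy = copies[good_peer]
    repaired = [p for p, d in digests.items() if d != winner]
    for peer in repaired:
        copies[peer] = good_copy
    return repaired


# Hypothetical example: three libraries hold the same issue, one is damaged.
copies = {
    "library_a": b"Volume 12, Issue 3",
    "library_b": b"Volume 12, Issue 3",
    "library_c": b"Volume 12, Issue E",  # simulated corruption
}
repaired = audit_and_repair(copies)
```

In this sketch a damaged copy is detected only because the other copies outvote it, which is why the number and independence of the copies matter more than the reliability of any single server.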

Newspapers, popular magazines, and professional journals. As periodicals shift from print to web publishing models, new complications for digital preservation arise. As traditional publications morph from discrete issues to constantly updated websites comprising feature stories, blog posts, comments, and related video, they bring in a much more complicated set of digital preservation issues. Yet, future scholars will require access to these materials despite their less structured and diffuse publication models.

Books. Likewise, books are undergoing a major transformation in the digital age. Although change has been underway for many years, only recently has it crossed thresholds that stand to make a systemic impact on libraries. Today, lending of ebooks in public libraries is considered a routine service that complements, but does not replace, the circulation of physical materials. Academic libraries now enjoy more opportunities to work with digitized versions of their print collections, both for enhanced discovery through full-text search mechanisms and for online reading. The availability of book material today is quite uneven, with access constrained by gaps in digitization, copyright restrictions, and publisher participation.

Ebooks, a Particular Challenge

The issues with books are more complicated, with different scenarios that apply to ebooks offered by publishers and for those created through scanning projects. In contrast to journals, interest in print books remains strong. In the last few years, ebooks have entered the scene, both in general consumer society and in libraries. Availability of moderately priced readers, such as the NOOK, Kindle, and iPad, and plentiful titles spark an ever-increasing trend toward ebook reading. Even though ebooks are now available in greater numbers, so far there has not been a drastic shift away from print. In public libraries, the circulation of printed books and other physical materials remains vigorous. Libraries naturally want to provide similar services in lending ebooks as they have for printed ones.

The role of libraries in ebook lending has been quite controversial in recent times. Hot-button topics have included restrictions on the number of times an ebook might be lent by a library, publishers that choose not to offer their titles for library lending, threefold price increases, and the question of whether Amazon can mediate lending.

With controversies currently roiling around libraries' access to ebooks, digital preservation of these materials has not necessarily been a top priority. As long as print versions continue to be produced in sufficient quantities, the content isn't necessarily in serious jeopardy. But libraries do have an interest in preserving their licensed digital versions and will want to ensure long-term preservation through programs such as LOCKSS and CLOCKSS. In the current climate, in which publishers' interest in making ebook titles available to libraries remains unsettled, participation in digital preservation programs may not seem urgent for these materials, especially when print copies are available. But should the time come when publishers begin to issue at least some titles as ebooks only, the digital preservation issues will become more critical.

In addition to current ebook offerings from publishers, mass digitization of library print collections is underway. A number of efforts, including the Google Books project, the Open Content Alliance, and many other projects around the world, have made enormous progress toward digitizing the realm of print. Once thought an impossible undertaking, it now seems plausible that the entire body of printed material can be digitized within this generation.

Digitized books present different preservation challenges than ebooks issued by publishers. Fortunately, many of these issues have been addressed through HathiTrust. This repository was created by a consortium of libraries, led by the University of Michigan, as a preservation and access platform for books digitized through the various mass digitization projects (www.hathitrust.org). Books digitized from the collections of libraries present a complex matrix of access and preservation scenarios, depending on whether a given title has passed the thresholds of copyright and lies in the public domain, whether it is a work that may be out of print and whose copyright owner cannot be identified (orphaned works), or whether it remains under full copyright protection. Even if the work itself cannot be made available to be fully read online or as a downloadable ebook, the full text of a digitized book may be indexed to enhance its findability in resources such as Google or HathiTrust, along with small snippets of text. Apart from issues of indexing and access, these scanned books can be preserved in dark archives for posterity. In most cases, at least some copies of the original printed versions will also be preserved.

Digitizing Special Collections

Libraries and others have also been busy for the past decade or so digitizing photographs, manuscripts, audio recordings, videotape, film, and other items of interest from their special collections. While preservation of and access to the body of published material fall within initiatives or projects with opportunities for participation among publishers and broad library organizations, the digitization of special collections relies primarily on the initiative of individual libraries. Libraries of all kinds, but especially national and major academic or public libraries, routinely produce digital collections of the most valuable and interesting materials from their special collections. These unique materials will only rarely be digitized as part of national initiatives, leaving it up to the holding libraries to take the initiative. Digitizing provides broad access to the materials, as well as providing an additional layer of preservation.

Although physical items will be preserved when possible, there are times when the original may already be compromised and the digital version will have to stand as the best path for preservation. I've been involved in at least two projects in which the product of digitization became the primary copy. When we digitized the collection of the Vanderbilt Television News Archive, the videotapes from which we created the digital video files in MPEG2 format were already in a format considered obsolete 10 years ago when the project began, and fewer playback machines remain every year. It was important to digitize the collection while equipment was still available and before the videotapes further degraded. Although we chose to retain the videotapes after the digitization was complete, we have far more confidence in the quality and integrity of the digital files.

I've also been involved in a project that focused on digitizing materials from church and civic archives in Latin America to help trace the migrations of slaves through the New World (www.vanderbilt.edu/esss/). The digitization of these materials has been carried out by historians using handheld cameras to photograph records that were often hundreds of years old and held in archives with little or no environmental protection. This meant the materials were deteriorating rapidly through moisture, mold, and insect damage. Access to the archives was only temporary, with no opportunity to remove the materials or administer proper conservation processes. These "guerrilla digitizing" projects often had to be done quickly and with minimal equipment. In at least some cases, the original archive material no longer exists.

It's also interesting to consider the ways that technology has changed the challenges that libraries face as they accept the collections from retiring scholars or literary figures. In times past, such a collection sold or donated to a library or archive would consist of books and boxes of notes, manuscripts, correspondence, and other physical artifacts. Today, retiring scholars might also include their word processing files, research data sets, or even the computers themselves. The files might reside on media long obsolete. How many libraries, for example, still have equipment that can read 5.25" floppy diskettes that were standard in the early days of computers and word processors? Or magnetic tapes from a 1970s vintage mainframe? Processing such collections may involve increasing measures of digital archaeology.

In these days when many writers and scholars may use cloud-based tools for their work, such as Google Docs and Gmail, another set of complications may arise as their "papers" enter libraries and archives. There may not be any physical media to turn over, posing interesting challenges for how the content and correspondence can be transformed into a curated collection.

For me, these locally digitized collections raise the most concerns regarding preservation. Where published materials have many opportunities for broad initiatives to support long-term preservation, libraries draw against their own limited resources as they digitize their own special collections. These resources may be sufficient to support the initial digitization and for platforms to provide access. These local digital collections often have processes in place for disaster recovery, but not necessarily for long-term preservation. A true OAIS-compliant, trusted digital repository to support long-term preservation exceeds in cost and complexity what the typical library can support. Many libraries are able to depend on repositories operated by larger bodies, such as regional or statewide consortia, national libraries, or other organizations, such as DuraSpace, which offers the DuraCloud managed service to help libraries archive their digital collections (http://duraspace.org).

Toward Long-Term Preservation

I don't see any slowing in the movement toward greater involvement with electronic and digital materials in library collections. While books, journals, and special collections each take a different path and at a different pace, each seems to be heading in the same direction. While the initial focus in this digital transition tends to center on delivering access to library users and providing disaster recovery, the next step of responsible management advances into long-term digital preservation. Despite cost and complexity, we owe future generations of scholars our best efforts in passing along the cumulative body of knowledge entrusted to libraries, even as it takes form in fragile and ephemeral digital media.

Publication Year:2012
Type of Material:Article
Language:English
Published in: Computers in Libraries
Publication Info:Volume 32 Number 04
Issue:May 2012
Publisher:Information Today
Place of Publication:Medford, NJ
Notes:Systems Librarian Column
ISSN:1041-7915
Permalink: http://librarytechnology.org/ltg-displaytext.pl?RC=16821
Record Number:16821
Last Update:2013-03-17 19:19:50
Date Created:2012-05-19 14:49:51