Saving the Whale and Saving the Plankton: Digital Preservation and the Environment

 

William Kilbride (DPC)


"The file is not necessarily the atomic unit of data so preservation approaches based on files alone are never likely to be sufficient. Some data sets were always complex aggregates: even the most rudimentary GIS projects are dependent on complicated networks of data drawn from data sources that proliferate and change through time."

 


I no longer know what data is. That's a problem for someone who thinks some of it is valuable now and that a proportion of it will be valuable in the long term.

I've always side-stepped sterile discussions about data's empirical role within research. For the record, I've never thought that data could speak for itself, nor for that matter that objectivity was anything other than a chimera. But I'm also clear that there's enough in our deeper intellectual traditions to prevent a slide into relativism. In any case I've avoided these self-regarding debates by adopting a more practical, if prosaic, stance. Data, I've told myself, is everything that's not hardware or software. I'm not sure this naive distinction works any more.

It never really did work. A decade or more ago I was asked to comment on a draft digital preservation strategy for a large government department. The draft was emailed to me as a PDF, set in a series of bespoke fonts which some tech-savvy mandarin had decreed would lend authenticity and authority to official documents. Of course these special fonts were not for distribution, and for whatever reason my computer, in fact the whole of the university network, decided that Wingdings would be an appropriate replacement. !^&*? as Asterix might say. In the end my government contact, safe inside his departmental computing environment with its fancy bespoke fonts, printed the document and posted it to me. The text, that is to say the data, was fine. It was the external dependency that confounded us.

This was before the invention of PDF/A and all that clever work to reduce external dependencies. But reducing external dependencies is not fashionable these days: if anything, maximising our dependencies is the zeitgeist, and for good reason. If our entire digital infrastructure is packaged as a service then every tool and application we could ever need is at our desktops as and when we need it. The highly distributed architecture means services are maintained and distributed by the right people in a timeframe that is highly responsive to needs. So we no longer need to wait for a new release to fix that bug or add that extra bit of flexibility. Because we can all distribute all our fixes, the old intermediaries are done for. It's the practical manifestation of globalisation and just-in-time business, delivered to our desktops daily. Maximising our dependencies means we can concentrate on the part of the business that really matters.

I get it. I even like it. But if I'm struggling to know the edges of data in a PDF document attached to an email circa 1999 then I'm going to be totally flummoxed by data floating on an open ocean of tools and services.

PDF was specifically designed so that a document could pass intact from one printer to another (more accurately, as an "electronic representation of a page-oriented aggregation of text and graphic data, and metadata used to identify, understand and render that data, that can be reproduced on paper or optical microform without significant loss of its information content"; see ISO 2005). It's ubiquitous and should be simple. You don't have to look far for more complex examples: CAD drawings have always relied on sophisticated libraries; audio and video rely upon codecs before they can be rendered. File-based systems have almost always assumed interdependencies.
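To make that kind of hidden dependency concrete, here is a minimal sketch in Python, assuming the third-party pypdf library; the function name, the checks and the sample filename are my own illustration rather than a standard recipe. It walks a PDF's font resources and reports any font the document references but does not carry with it, which is exactly the sort of external dependency that reduced that strategy document to Wingdings on my machine.

```python
# Minimal sketch, assuming the third-party pypdf library (pip install pypdf).
# It lists fonts a PDF references but does not embed, i.e. external dependencies.
from pypdf import PdfReader

EMBEDDED_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def unembedded_fonts(path):
    """Return names of fonts the PDF references but does not carry with it."""
    missing = set()
    for page in PdfReader(path).pages:
        resources = page.get("/Resources")
        if resources is None:
            continue
        fonts = resources.get_object().get("/Font")
        if fonts is None:
            continue
        for ref in fonts.get_object().values():
            font = ref.get_object()
            descriptor = font.get("/FontDescriptor")
            if descriptor is None and "/DescendantFonts" in font:
                # Composite (Type0) fonts keep their descriptor one level down.
                descendant = font["/DescendantFonts"].get_object()[0].get_object()
                descriptor = descendant.get("/FontDescriptor")
            name = str(font.get("/BaseFont", "unknown"))
            if descriptor is None:
                # No descriptor at all: typically one of the standard 14 fonts,
                # which the viewer, not the file, has to supply.
                missing.add(name)
            elif not any(key in descriptor.get_object() for key in EMBEDDED_KEYS):
                missing.add(name)
    return missing


if __name__ == "__main__":
    for name in sorted(unembedded_fonts("strategy.pdf")):
        print("depends on an external font:", name)
```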

The file is not necessarily the atomic unit of data, so preservation approaches based on files alone are never likely to be sufficient. Some data sets were always complex aggregates: even the most rudimentary GIS projects are dependent on complicated networks of data drawn from multiple sources that proliferate and change through time; complex volatile relational databases are scarcely encountered except through highly tailored queries or views; email servers are really only the base stations for transfers between numerous packaging and unpackaging tools. And these things - electronic documents, relational databases, digital sound and vision, GIS and email - are pervasive. In some senses they are legacy tools that have existed for decades. My point: if the preservation community has found it hard to define the edges of data in a computing paradigm where our tools and data sets were relatively self-contained, then we're really going to struggle when they are highly distributed.
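One modest response is simply to describe the edges of an aggregate explicitly. The sketch below is a hypothetical manifest format of my own rather than any standard: it records each local component with a checksum and each remote source with the URL and the date it was consulted, so that at least the shape of the aggregate is written down alongside the data.

```python
# Minimal sketch of a dependency manifest for an aggregate data set; the field
# names and example entries are illustrative, not a recognised standard.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def checksum(path):
    """SHA-256 of a local component, so later change or loss is detectable."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def manifest_entry(name, path=None, url=None):
    entry = {"name": name, "recorded": datetime.now(timezone.utc).isoformat()}
    if path is not None:
        entry.update(kind="file", path=str(path), sha256=checksum(path))
    if url is not None:
        # A remote layer can only be described, not fixed: we do not hold it.
        entry.update(kind="remote", url=url)
    return entry


if __name__ == "__main__":
    manifest = [
        manifest_entry("site boundaries", path="boundaries.shp"),
        manifest_entry("base mapping", url="https://example.org/wms"),
    ]
    Path("manifest.json").write_text(json.dumps(manifest, indent=2))
```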

More importantly, if remembering is not built into the infrastructure then for the time being we live on the edge of forgetfulness and manipulation: authentic fragments of antiquity may be simply lost, or worse supplanted with dissembling figments of iniquity. Data loss is not the problem. It's the resulting futility, confusion, and deceit that keeps me awake.

Let me replay this in plainer terms. In a world where services and data are managed remotely, the consumer is dependent on remote producers and suppliers. These may depend on other services and producers, who may in turn rely on others. This creates a long chain of interdependency. But each supplier is at liberty to update, alter or withdraw a service, meaning that the supply chain may change without the consumer knowing. That's fine so long as it always produces the same results. Changes - whether subtle variations or gross deviations - will only become apparent after the fact, and potentially after it is too late. So the Internet is an Internet of Dependencies and, except in very specialist circumstances, this means that if we are concerned to preserve authenticity, we need to think about more than data.
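As a hedged illustration of how such drift might at least be noticed, the following standard-library sketch fingerprints a remote dependency and compares it with the fingerprint recorded at ingest; the URL and baseline filename are placeholders of my own.

```python
# Minimal, standard-library sketch: fingerprint a remote dependency and compare
# it with the fingerprint recorded earlier, so silent change becomes visible.
# The URL and baseline filename below are placeholders.
import hashlib
import json
import urllib.request
from pathlib import Path

BASELINE = Path("dependency_baseline.json")


def fingerprint(url):
    """SHA-256 of whatever the remote service returns today."""
    with urllib.request.urlopen(url) as response:
        return hashlib.sha256(response.read()).hexdigest()


def check(url):
    current = fingerprint(url)
    if not BASELINE.exists():
        BASELINE.write_text(json.dumps({url: current}, indent=2))
        return "baseline recorded"
    recorded = json.loads(BASELINE.read_text()).get(url)
    return "unchanged" if recorded == current else "the supplier has changed something"


if __name__ == "__main__":
    print(check("https://example.org/service/schema.json"))
```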

Advocates and close followers of TIMBUS will understand this already and would point to things like context capture, environment description, risk assessment and legalities lifecycle management as key tools to help us address the topic. For my part I wonder what this means for some of the assumptions that frame digital preservation.

For one thing, it suggests that much of our previous concern about file types and obsolescence is misplaced. It's taken me a long while to understand this point. Others have argued it more coherently for longer, and I can certainly see the capacity issue when I look at the growing quantities of data that we face. But a few years ago those who dismissed concerns about file types seemed to be tilting at windmills. Perhaps it's the emergence of the computing environments they accurately predicted that makes the point seem more real to me now. Any repository whose workflows assume that files are the basic unit of ingest and access will need to shape up soon, and its preservation planning will need to adapt pretty quickly too.

If we're spending less time worrying about file types in future, it will be because we're more worried about managing relationships between files, services and applications. It's not that the contents or structure of files will become unimportant: rather, we will need an extra layer of skills to manage and authenticate the environments in which they make sense. It's not that migration will cease to be important, but that emulation and virtualisation tools will become more important than they have been for a decade or so.
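In that spirit, and only as an illustration rather than anything drawn from TIMBUS itself, a first step towards authenticating environments is simply to record them: the sketch below captures the operating system, interpreter and installed packages into a sidecar file that a future emulation or virtualisation effort could start from.

```python
# Minimal sketch of environment description: capture the software environment
# alongside the data so a later emulation or virtualisation effort has
# something to work from. The record structure is illustrative only.
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def describe_environment():
    return {
        "captured": datetime.now(timezone.utc).isoformat(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version,
        "packages": sorted(
            f"{dist.metadata['Name']}=={dist.version}"
            for dist in metadata.distributions()
        ),
    }


if __name__ == "__main__":
    Path("environment.json").write_text(json.dumps(describe_environment(), indent=2))
```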

These long chains of dependency also mean that we will soon be testing the coherence of the repository as a concept. To the dismay of colleagues I've already argued that the idea of the trusted digital repository is beginning to look a bit jaded. Digital preservation is not a place, it's an activity, so if we concentrate our efforts on a 'repository' then we ignore all the complex interactions and tools that have to go on around it. It's not the repository we trust; it's the staff and the tools and the peer review and the standards that matter. And in the context of preservation micro-services, in which 'everything as a service' holds, we need to find better ways of managing trust across the same sorts of dependencies described above.

But there's more to it than that. There's a limit to the amount of material we can reasonably expect to place in a repository. The idea of a single place to store and secure digital content works fine where you can replicate, fix and transfer relatively self-contained content and components. But if you want to preserve an entire ecosystem then a repository is only one part of the solution, and not necessarily the most important part. We need to add a more active fieldwork component to our understanding of digital preservation. The result is closer to nature conservation or heritage management than archiving. It's more like designating ancient monuments or protecting nature reserves, where the specific, highly valued services can be managed in situ. You can't save the whale without also saving the plankton.

I am not sure that I know what data is, but I can now see my way through the undergrowth to know that the data/environment combination which used to trouble me can be managed. It just needs a more subtle approach. TIMBUS shows the way. If we can no longer tell the difference between data and service, then we can no longer assume that the repository will be the solution to our digital preservation needs. If we're serious about saving data we're going to have to save the environment too: and that means venturing out of the repository.

(ISO 19005-1:2005, Document management - Electronic document file format for long-term preservation - Part 1: Use of PDF 1.4 (PDF/A-1))
