Thesis: Information System Agnostic Ancestry for Digital Objects

Abstract

More and more information is becoming available in digital form, most of it derived from digital sources. Digital information is made available as digital objects composed of a sequence of bits and managed by information systems. To date, these digital objects have no independent identification which can be referred to outside of a specific information system. However, they normally outlive these systems and can be copied to other systems. In order for the ancestry of digital information to span more than one generation, each source, and therefore each digital object, must provide information on its ancestry. This information needs to be accessible independent of a specific information system.

In this thesis, we present the theoretical foundation for the identification of digital objects independent of information systems and for enhancing digital objects with ancestry information. The ancestry information allows the identification of digital objects used in the creation of the digital object and its ancestors. We discuss the existing definition of a digital object, its equality relationships, and its identification through intrinsic properties. For the representation of ancestry relationships we propose the application of graph theory and discuss possibilities of embedding digital object identification in existing identifier systems.
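
As an illustration of how ancestry relationships might be treated as a graph, the following minimal sketch stores, for each digital object, the identifiers of the objects used directly in its creation, and then collects all ancestors across generations. The function name `ancestors`, the dictionary layout, and the example identifiers are assumptions made for this sketch and are not taken from the thesis.

```python
from collections import deque


def ancestors(sources: dict[str, list[str]], obj_id: str) -> set[str]:
    """Return every direct and indirect ancestor of obj_id.

    `sources` maps an object identifier to the identifiers of the
    objects used directly in its creation (hypothetical layout).
    """
    seen: set[str] = set()
    queue = deque(sources.get(obj_id, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        # Follow the edges one generation further back.
        queue.extend(sources.get(current, []))
    return seen


# Hypothetical example: a report derived from a draft and a scan,
# where the draft itself was derived from a set of notes.
example_sources = {
    "report": ["draft", "scan"],
    "draft": ["notes"],
}
example_result = ancestors(example_sources, "report")
```

Here `example_result` contains `draft`, `scan`, and `notes`: the ancestry spans more than one generation because each source carries its own ancestry entry.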

We specify requirements for the information system agnostic integration of digital object ancestry with the digital object and propose a method which implements these requirements. Our method is called SIMPLE (Simple Identifiable Metadata with Persistent Lineage Embedding) and integrates the digital object with its metadata so that these metadata become an integral part of the digital object independent of the information system.

The claims are validated with a prototypical implementation of SIMPLE and its deployment in a case study with the State Archives of Appenzell Ausserrhoden. In this case, the ability to document the ancestry of a digital object results in transparency and traceability in the processing of the digital object.

Alternative Abstract by Google Gemini 1.5 Pro

This thesis addresses the challenge of establishing and preserving the ancestry of digital objects, independent of the information systems managing them. The inherent malleability of digital information makes it difficult to track its origins and transformations, hindering the ability to answer the fundamental question: "Which digital objects were used to create the digital object at hand?"

This work proposes a novel approach called "Digital DNA," which focuses on identifying digital objects based on their intrinsic properties, specifically their bit-level content. It defines a context-free notion of digital object equality and introduces identification functions, leveraging cryptographic hash functions to generate unique and verifiable identifiers. This approach enables the identification of digital objects regardless of their location or the information system managing them.
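
The idea of identifying a digital object purely by its bit-level content can be sketched with a cryptographic hash function. The snippet below is a minimal illustration using Python's standard `hashlib` module. The function names `identify` and `is_equal`, the choice of SHA-256, and the `algorithm:digest` identifier format are assumptions made for this sketch, not the identification functions defined in the thesis.

```python
import hashlib


def identify(bits: bytes, algorithm: str = "sha256") -> str:
    """Derive an identifier from the object's content alone.

    The result depends only on the sequence of bits, not on a file
    name, a location, or the information system holding the object.
    """
    digest = hashlib.new(algorithm, bits).hexdigest()
    return f"{algorithm}:{digest}"


def is_equal(a: bytes, b: bytes) -> bool:
    """Treat two objects as equal when their identifiers match."""
    return identify(a) == identify(b)


# Two copies with identical bits yield the same identifier,
# while a single changed byte yields a different one.
original = b"scanned family record"
copy_elsewhere = bytes(original)
modified = b"scanned family recorD"

same = is_equal(original, copy_elsewhere)
different = not is_equal(original, modified)
```

Because the identifier is recomputed from the bits themselves, anyone holding a copy can verify it without consulting the system that originally managed the object.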

Furthermore, the thesis introduces a method called SIMPLE (Simple Identifiable Metadata with Persistent Lineage Embedding) to integrate lineage metadata with digital objects. SIMPLE embeds a self-documenting XML container within the digital object, containing information about its ancestors and ensuring the inseparability of metadata and data. This container can be accessed and interpreted independently of the digital object's format, facilitating lineage tracking across different information systems.

A prototype implementing the SIMPLE method was developed and evaluated in a case study at the State Archives of Appenzell Ausserrhoden. The prototype successfully facilitated the traceable migration of scanned family records, demonstrating the practical applicability and benefits of the proposed approach.

This thesis contributes to the field of digital preservation by providing a theoretical framework and practical tools for establishing and preserving the ancestry of digital objects. It lays the foundation for future research on digital object autonomy, integrity, referenceability, and ancestry, ultimately enhancing the long-term value and trustworthiness of digital information.

Evaluation by Prof. Dr. Maria Guercio, University of Urbino

Report on thesis of Stephan Heuscher "Information System Agnostic Ancestry for Digital Objects" by the co-examiner of the thesis Prof. Dr. Maria Guercio, University of Urbino (published with the approval of the author).

1. Introduction

The topics of the integrity and identity of digital objects and the capacity to track origins and provenance have been widely discussed in the literature, but the results have so far not been convincing with regard to long-term preservation and the persistence of the digital resources themselves and of the related contextual information. The thesis discussed here has developed an original approach to the basic requirements of unique identification of the resource and control of its integrity over time, in a way that is accessible independently of any specific information system. As discussed in more detail in the final part of this report, and as clearly described in the OAIS model, persistent identifiers of digital objects at various levels of granularity can play a crucial role in the chain of digital preservation and can provide relevant information for the history of custody and provenance.

2. The structure of the thesis

To evaluate the structure of the thesis, it is important to stress that the main effort concerns the theoretical foundation for the identification of digital objects independent of information systems, because of its capacity to hold "its easily identifiable ancestry" and thanks to the technical and conceptual possibility of embedding digital object identification based on intrinsic properties in existing identifier systems. All the main aspects relevant to this specific research, to the prototypical implementation, and to the case study are consistent and well detailed. From a general point of view the structure is clear and well balanced. However, more attention could have been dedicated to the interconnection with related work in the field, specifically with reference to the overall picture of the question, as in the case of the examination of the literature dedicated to significant properties and to persistent identifiers. In particular, with reference to the definitions and the motivation of the research, the crucial aspects are always well identified and consistently described, but not explicitly discussed with reference to the interrelations between related works and this specific investigation. One of the reasons could be that the most important definitions and concepts present in the main standards and research projects with a similarly granular approach, such as PREMIS and OAIS, are not in contradiction with the outputs of this work. An explicit mention of this contiguity would have improved the comprehension of the whole product. It should be noted that in the analysis of specific areas (i.e. the concept of ancestry and data lineage, or metadata adhesion), the comparison is more direct and very exhaustive and detailed.
Chapter 6 ("Limitations and Future Work") and the conclusions clearly summarize the open questions and testify to the author's awareness of the complexity of the whole problem, of the specific role the research outputs could play, and of their future potential.

3. Analysis of the hypothesis, methods and critical assessment

The thesis presents an innovative perspective in the sector of digital preservation and curation, based on the simple concept of identifying a digital object through its 'intrinsic properties'. The literature and the research related to unique persistent identifiers have usually focused on the principle of external metadata and on significant properties. According to this point of view, as the author has clearly underlined, the system for identification is and always will be "bound to an information system the digital object can no longer be identified by its associated metadata […] when removed from this identifying information system". The risk is immediately evident in the case of medium or long-term preservation. Conversely, the capacity to manage the basic contextual elements of the digital object independently of the information system provides repositories with strong control over their digital assets in one of the most critical aspects of the chain of digital preservation, namely the maintenance of persistent and independent links among the original objects in the course of the preservation function. The research concentrates its attention on the specific area of handling a digital object as a transformed copy of a previous original and on the capacity to prove its integrity by controlling the changes that have occurred to the object. Of course, this is only one aspect of the whole relevant contextual information, but the context made explicit by the ancestry relations is the basic one, even if it is not sufficient on its own for the interpretation of the object to be preserved.

The strengths of the thesis are the following:

  1. the consistency of the approach and the exposition, including the recognition of the limitation of the solution provided,
  2. the possibilities for practical implementation as testified by the testbed in a sector where the efforts for applications are relevant and frequently sustained by research funds but the concrete results are rare and difficult to use,
  3. with reference to the scientific community involved in this field the capacity to provide an independent and general framework for the issues related to the identification and the integrity of digital objects,
  4. the simplicity of the system put in place able to be commonly used without further expensive investments by the users and the researchers,
  5. the awareness of the advantages and the limits of the solutions and the required implementations to be further developed,
  6. the convincing idea that the digital object should be seen as the package that contains its history (even if at the moment very partially): this is the same idea recently developed in international projects like CASPAR and is crucial according to the most important model for digital preservation approved as an ISO standard (ISO 14721 – OAIS Model) and considered the basic requirement for the certification of digital repositories,
  7. even beyond the scope of the thesis, the concept of authenticity and the capacity of its presumption could be positively supported by the methods developed in the thesis, with specific attention to the traceability of the handling of the digital objects and the documentation of its history,
  8. the decision not to base the solution for integrity on the use of cryptographic signatures, because of their unnecessary and demanding complexity, specifically in the case of small archives.

As already mentioned, a minor limitation of the thesis concerns the lack of explicit interaction between the analysis of the main general literature and standards in the field and the core outputs of the research. This limitation can partially lower the potential of the solution developed here, for instance with reference to the most crucial aspects like authenticity (see point 7 above). From this point of view, a deeper examination of the relevant concepts in the digital environment (like the concepts of original and copy) could have provided a more comprehensive vision of the whole question and given the work a closer and more direct relationship with the international debate in this area, like the discussion on significant properties and on the “degree of loss” to be considered acceptable in preservation processes. However, as already mentioned, the author does discuss the specific standards (like METS or PREMIS) when he describes the individual aspects of his research.

More Metadata

The thesis is a doctoral thesis for the degree of a doctor in informatics at the Faculty of Economics, Business Administration and Information Technology of the University of Zurich by Stephan Jakob Benedikt Heuscher from Herisau, AR, Switzerland. It was accepted in 2010 on the recommendation of Prof. Dr. A. Bernstein and Prof. Dr. M. Guercio.

The Faculty of Economics, Business Administration and Information Technology of the University of Zurich herewith permits the publication of the aforementioned dissertation without expressing any opinion on the views contained therein.

Zurich, April 14, 2010

The Vice Dean of the Academic Program in Informatics: Prof. Dr. H. C. Gall
