ISC 404: DIGITAL LIBRARIES
DEPARTMENT OF LIBRARY AND INFORMATION SCIENCE
KENYATTA UNIVERSITY
LECTURE 10: STORAGE FORMATS FOR DIGITAL OBJECTS
Storage File Formats
Standards ensure that a product operates seamlessly: it can be interpreted, manipulated, built
upon and recognized through an agreed protocol that allows users to view, interpret and use
information. Hence, a product developed for consumer use must adhere to a standard that is
recognized throughout the industry. The same applies to digital collections, where adherence to
standards is essential for long-term preservation of digital objects. Creating a digital project
involves two tasks: digitizing the actual material for online presentation, and preserving the
material for long-term archiving, which requires adherence to standard file formats. Libraries
intending to convert their collections to digital format need to define storage formats that meet
the needs of those collections, because standard file formats help ensure that information is not
lost. Librarians must also decide whether their digital libraries will house documents,
multimedia objects or a combination of both. Storage formats for various data types are
indicated in the table below:
No.  Content Type                                          Format
1    Alphanumeric data, e.g. unstructured textual data     American Standard Code for Information
                                                           Interchange (ASCII) or Extensible Markup
                                                           Language (XML)
2    Image data, e.g. digital or scanned line drawings,    Tagged Image File Format (TIFF) and Joint
     illustrations, paintings, maps                        Photographic Experts Group (JPEG)
3    Audio data, e.g. music, oral histories                Audio Interchange File Format (AIFF)
4    Video, i.e. moving image files                        Moving Picture Experts Group (MPEG) and
                                                           QuickTime
5    Web documents                                         Hypertext Markup Language (HTML) or
                                                           Extensible Markup Language (XML)
6    Music                                                 Musical Instrument Digital Interface (MIDI)

ACKNOWLEDGEMENTS: Notes derived from a module by Dr. C. Mutwiri. Do not re-circulate
References
1. Witten, Ian H., Bainbridge, David and Nicholas, David P. (2002). How to Build a Digital
Library. 2nd ed. USA: Elsevier.
2. Arms, William Y. (2001). Digital Libraries. Mass.: MIT Press.
10.5 Self Test Questions
1. Discuss, with justifications, the most appropriate format for archiving photographs.
2. Explain why it is necessary to use standard file formats in archiving of materials in a
digital library system.
LESSON 11: PRESERVATION OF DIGITAL COLLECTIONS
11.1 Introduction
Jeff Rothenberg compared digital content to "invisible ink" because such content is
written in the language of machines and requires the right machine resources to read it.
This lesson looks at preservation, which is a commitment to do as much as possible to
ensure that future generations are able to use digital content created today despite
changes in technology. Various methods of preservation will be discussed.
11.2 Objectives
By the end of the lesson you should be able to:
Discuss the approaches to preservation of digital content
Distinguish between various methods or strategies of preservation
11.3 Digital Preservation Approaches
Digital collections are large databases of information accessed through a computer
terminal. Digital preservation is the process of keeping materials produced in digital formats in
a condition suitable for use. It becomes unavoidable once a project goes online.
11.3.1 Migration
This is the periodic updating or rewriting of old data so that it runs on new configurations or
platforms. File formats change continually, and current computers cannot run programs written for
some earlier machines. Hardware is often replaced and software systems are revised. The basic
principle of migration is that the format and structure of data may change, but the semantics of
the underlying content must be preserved. In migration, a digital object is transferred from one
software or hardware environment to another. Migration is also known as normalization, where
digital information is copied from old formats into newer formats. This approach ensures that
digital objects are kept in current file formats. For example, institutions that had encoded their
files using Standard Generalized Markup Language (SGML) migrate them to Extensible Markup
Language (XML), which is a more current format. Migration as a preservation strategy is
prompted by technological change. In the past the floppy diskette was used to store content, but
nowadays very few machines support floppy diskettes, forcing individuals and organizations to
adopt devices that are compatible with the technology in use.
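The principle that the format may change while the content must not can be illustrated with a
small Python sketch. It is only an illustration under stated assumptions: the file names, the
sample text and the choice of Latin-1 as the "old" format and UTF-8 as the "current" format are
hypothetical, and a real SGML-to-XML migration would need dedicated conversion tools.

```python
import tempfile
from pathlib import Path


def migrate_text(src: Path, dst: Path, old_encoding: str = "latin-1") -> None:
    """Rewrite a text file from a legacy encoding into UTF-8."""
    text = src.read_text(encoding=old_encoding)  # decode using the old format
    dst.write_text(text, encoding="utf-8")       # re-encode in the current format
    # The bytes on disk may differ, but the decoded content must be identical.
    if dst.read_text(encoding="utf-8") != text:
        raise ValueError("content changed during migration")


# Hypothetical demonstration files created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "catalogue_old.txt"
    new = Path(tmp) / "catalogue_new.txt"
    old.write_bytes("Café archive, 1995".encode("latin-1"))
    migrate_text(old, new)
    old_bytes = old.read_bytes()
    new_bytes = new.read_bytes()
    migrated_text = new.read_text(encoding="utf-8")
```

In this sketch the stored bytes differ between the two files, yet a reader of the migrated file
sees the same text, which is the outcome migration aims for.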
11.3.2 Refreshing
Digital libraries must refresh their collections periodically, that is, copy data onto new storage
media or from one medium to another, for example from Compact Disc (CD) to Digital Versatile
Disc (DVD). The file format is not changed; refreshing ensures that data is always stored on a
medium that is readable by current technology.
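Refreshing can be pictured as a copy followed by a check that nothing changed on the way. The
Python sketch below is a minimal illustration, assuming that the old and new media are simply two
directories (named "cd" and "dvd" here for the example) and that a SHA-256 checksum is an
acceptable test of a faithful copy. The function names and the sample file are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(src: Path, new_medium: Path) -> Path:
    """Copy a file to a new storage location and verify the copy bit for bit."""
    new_medium.mkdir(parents=True, exist_ok=True)
    copy = new_medium / src.name
    shutil.copy2(src, copy)  # the format is untouched; only the medium changes
    if sha256_of(copy) != sha256_of(src):
        raise OSError(f"copy of {src.name} does not match the original")
    return copy


# Two directories stand in for the old and the new medium (an assumption).
with tempfile.TemporaryDirectory() as tmp:
    old_medium = Path(tmp) / "cd"
    old_medium.mkdir()
    original = old_medium / "oral_history.aiff"
    original.write_bytes(b"example audio bytes")
    refreshed = refresh(original, Path(tmp) / "dvd")
    same_bits = refreshed.read_bytes() == original.read_bytes()
    refreshed_name = refreshed.name
```

Unlike migration, nothing about the object itself is rewritten here, which is why a checksum
comparison is enough to confirm success.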
11.3.3 Emulation
This involves mimicking older platforms so that the operating environment for obsolete
software and data files can be recreated. A replica of the original computing environment is
supplied so that users can interact with a resource as its creators intended. Emulation therefore
keeps digital objects in their original data formats but recreates some or all of the original
processes, enabling the object to be used on current platforms. The behavior of the original
hardware and software is reconstructed in the future environment so as to recreate the look and
feel of the original object in its old environment. This requires cooperation between hardware
and software developers to provide access to proprietary information.
11.3.4 Universal Decoding
This strategy relies upon a set of widely promulgated, machine-independent applications that
can unlock files and run application software at any time in the future. As content is created, a
copy is saved in a format that such applications can decode.
11.3.5 Encapsulation
This preservation method bundles together digital information resources, the preservation
metadata associated with them and even the software required for access. In effect, digital
objects are supplied with instructions on how to recreate the platform needed to use them.
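The bundling described above can be sketched in Python as a single ZIP package holding the
object, its metadata and a note on how to access it. This is an illustration only: the package
layout (an "object/" folder, "metadata.json" and "ACCESS.txt"), the function name and the sample
values are assumptions made for the example, not an established packaging standard.

```python
import io
import json
import zipfile


def encapsulate(name: str, content: bytes, metadata: dict, access_notes: str) -> bytes:
    """Bundle an object, its metadata and access instructions into one ZIP package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        bundle.writestr(f"object/{name}", content)                # the digital object
        bundle.writestr("metadata.json", json.dumps(metadata, indent=2))
        bundle.writestr("ACCESS.txt", access_notes)               # how to use it
    return buffer.getvalue()


# Hypothetical object and metadata used only to exercise the sketch.
package = encapsulate(
    "map_1920.tiff",
    b"example image bytes",
    {"title": "Survey map", "format": "TIFF"},
    "View with any TIFF-capable image viewer.",
)

# Anyone holding the package can recover all three parts together.
with zipfile.ZipFile(io.BytesIO(package)) as bundle:
    members = sorted(bundle.namelist())
    restored_metadata = json.loads(bundle.read("metadata.json"))
```

Because the three parts travel together, a future user who receives the package also receives
the information needed to interpret it.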
11.3.6 Replication / Multiple Strategic Backups
Important data that exists only in a single copy on one computer is vulnerable, because
hardware can fail or dishonest employees can remove data. Replication involves creating copies
of data on more than one system. With duplicate copies on many systems, information is far less
vulnerable to software and hardware failure. In addition, in the event of accidental deletion,
information can still be retrieved from other systems.
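The value of holding several copies can be shown with a short Python sketch that audits replicas
and repairs a damaged one from the others. It assumes, for illustration, that each system's copy
is available as bytes and that the version held by most systems is the correct one. The server
names and function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Tuple


def audit_replicas(replicas: Dict[str, bytes]) -> Tuple[str, List[str]]:
    """Return the checksum most systems agree on and the systems that differ."""
    digests = {site: hashlib.sha256(data).hexdigest() for site, data in replicas.items()}
    agreed, _count = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(site for site, digest in digests.items() if digest != agreed)
    return agreed, damaged


def repair(replicas: Dict[str, bytes], damaged: List[str]) -> Dict[str, bytes]:
    """Overwrite damaged copies with the content of an intact copy."""
    good = next(data for site, data in replicas.items() if site not in damaged)
    return {site: good for site in replicas}


# Three hypothetical systems; one copy has been corrupted.
replicas = {
    "server_a": b"thesis text",
    "server_b": b"thesis text",
    "server_c": b"thesis t\x00xt",
}
agreed_digest, damaged = audit_replicas(replicas)
repaired = repair(replicas, damaged)
```

With a single copy there would be nothing to compare against and nothing to restore from, which
is the vulnerability replication is meant to remove.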
11.4 References
1. Graham, P.S. (1995). Long-term intellectual preservation. Available:
http://aultnis.rutgers.edu/texts/dps.html
2. RLG (1995). Preserving digital information: The report of the Task Force on
Archiving of Digital Information. Commissioned by the Commission on Preservation
and Access and the Research Libraries Group.
Available: http://www.rlg.org/ArchTF/tfadi.index.htm
3. Gladney, H.M. (2006). Principles of digital preservation. ACM.
4. Granger, Stewart (2000). Emulation as a digital preservation strategy. D-Lib
Magazine.
11.5 Self Test Questions
1. Explain three types of preservation with regard to digital materials and the
challenges associated with each of them
2. What cost connected with preservation of content and metadata is the content owner
(digital library owner) expected to bear?
LESSON 12: CHALLENGES IN ESTABLISHING DIGITAL LIBRARIES
12.1 Introduction
Creating effective digital libraries poses serious challenges because of the integration of
digital media into traditional collections. Digital information is unique, in that it is
easily copied and remotely accessible by many users simultaneously. The challenges
facing development of digital libraries are discussed in this lecture.
12.2 Objectives
By the end of the lesson you should be able to:
Outline some challenges encountered in establishing a digital library
Outline solutions to some of the challenges
12.3 Challenges in Creating Digital Libraries
12.3.1 Technical architecture and lack of common standards
Libraries must upgrade their current technical architectures to accommodate digital
materials. The architecture for digital libraries includes components such as high-speed local
networks, fast connections to the internet, relational databases and full-text search engines that
provide access to resources. In addition, a variety of servers, such as web and FTP servers, are
required. Hence the architecture of a digital library is a collection of separate systems and
resources connected through a network and integrated through one interface (a web interface).
In such a scheme, common standards are needed to allow digital libraries to interoperate and
share resources. However, the problem facing digital libraries is the wide variety of data
structures, search engines, interfaces, controlled vocabularies and document formats in use,
which makes interoperability difficult to achieve.
12.3.2 Building Digital Collections
This is a major issue in creating digital libraries. The degree to which libraries will digitize
existing materials and acquire original digital works is an issue because of other requirements
such as long-term access and preservation. Digitization, one of the primary methods of
digital collection building, is a very expensive process.
12.3.3 Metadata
This is another issue central to the development of digital libraries. Metadata is important in
digital libraries because it is key to the discovery and use of any document. Metadata
records are very time-consuming to create and require specially trained personnel. Hence,
simpler metadata schemes are being sought. Besides being complex, metadata schemes are
numerous; the best known is the Dublin Core metadata scheme. The lack of common metadata
standards is another barrier to information access and use in a digital library.
Other issues related to metadata include duplication, inaccurate metadata and missing metadata.
Duplicates can arise during the scanning process. Scanning a work involves image processing
and quality assurance, which are time-consuming, so duplication cannot be afforded.
Most resources come with metadata from libraries and archives. For others, metadata has to be
created and entered into the computer manually and is therefore prone to errors. Inaccurate
metadata hinders fruitful search and retrieval of materials. Missing metadata, on the other hand,
renders resources lost, as they cannot be retrieved.
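The problems of missing and duplicated metadata lend themselves to simple automated checks. The
Python sketch below uses the fifteen element names of the Dublin Core element set. Everything
else is an assumption made for illustration: Dublin Core itself does not make any element
mandatory, so the "required" list is a hypothetical local policy, and the sample records and the
rule for spotting duplicates (same title and creator) are invented for the example.

```python
# The fifteen elements of the Dublin Core metadata element set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Hypothetical local policy: elements a record must carry to be retrievable.
REQUIRED = ("title", "creator", "date", "identifier")


def missing_elements(record, required=REQUIRED):
    """List the required elements that are absent or blank in a record."""
    return [name for name in required if not str(record.get(name, "")).strip()]


def find_duplicates(records):
    """Return index pairs of records sharing the same title and creator."""
    seen = {}
    duplicates = []
    for index, record in enumerate(records):
        key = (record.get("title", "").strip().lower(),
               record.get("creator", "").strip().lower())
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates


# Invented sample records: the second repeats the first, the third is incomplete.
records = [
    {"title": "Oral History Interview 12", "creator": "Library Staff",
     "date": "1998", "identifier": "obj-001"},
    {"title": "oral history interview 12 ", "creator": "Library Staff",
     "date": "1998", "identifier": "obj-002"},
    {"title": "Survey Map", "creator": "", "date": "1920"},
]
incomplete = missing_elements(records[2])
duplicate_pairs = find_duplicates(records)
```

Checks of this kind do not remove the need for trained personnel, but they can flag records that
would otherwise be hard to find or would waste scanning effort.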
12.3.4 Naming, Identifiers and Persistence
This issue is related to metadata. Names are strings that uniquely identify digital objects. Names
are part of a document's metadata. In a digital library, names are as important as the
International Standard Book Number (ISBN) is in a traditional library. Names are
needed to uniquely identify digital objects for purposes of information retrieval, citation, making
links to objects and managing copyright.
The system of naming used must be permanent. The name must not be bound up with a specific
location; the name and location must be separate. However, the Uniform Resource Locator
(URL), which is the current method used to identify objects on the internet, makes a very poor
name. This is because it combines in one string several items that should be separate, i.e.
the method by which the document is accessed (http), the machine name, the document path
(location) and the document file name.
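The parts fused into a URL can be separated with Python's standard urllib.parse module, as the
sketch below shows. The two URLs are hypothetical examples on reserved example.org hosts; the
second stands for the same file after it has been moved to another machine and folder.

```python
from urllib.parse import urlsplit

# Hypothetical URL of a digitized map.
url = "http://library.example.org/collections/maps/map_1920.tiff"

parts = urlsplit(url)
access_method = parts.scheme    # how the document is accessed
machine_name = parts.hostname   # which machine holds it
document_path, _, file_name = parts.path.rpartition("/")  # where on that machine

# The same file after a move: the file name survives, the URL does not.
moved_url = "http://archive.example.org/maps/1920/map_1920.tiff"
moved_file_name = urlsplit(moved_url).path.rpartition("/")[2]
old_url_still_valid = moved_url == url
```

Only the last component identifies the document itself; the others describe where and how it
happens to be stored at the moment.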
URLs are bad names because when a file is moved, links to it break and the document is
effectively lost. There is need for a global scheme of unique identifiers which are not tied to
specific locations or processes. Such names must remain valid whenever a document is moved from
one location to another or migrated from one storage medium to another.
To solve the problem of persistent identifiers, the following schemes have been proposed:
Persistent Uniform Resource Locators (PURLs)
These are persistent URLs developed by the Online Computer Library Center (OCLC). They
attempt to separate the document name from its location, hence increasing the probability of its
being found. If a document moves, the URL is updated but the PURL stays the same. A user
requests a document through its PURL and the PURL server looks up the corresponding URL in a
database. However, PURLs are still not ideal names because they confound a name with an
access method.
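The lookup performed by a PURL server can be sketched in Python as a table that maps each
persistent name to the document's current URL. This is a simplified illustration, not OCLC's
implementation: a real PURL server answers over HTTP with a redirect, whereas here the class
name, its methods and the example.org addresses are all hypothetical.

```python
class PurlResolver:
    """A toy resolver: persistent names on the left, current locations on the right."""

    def __init__(self):
        self._table = {}

    def register(self, purl: str, url: str) -> None:
        """Record a new persistent name and its first location."""
        self._table[purl] = url

    def update(self, purl: str, new_url: str) -> None:
        """Point an existing persistent name at the document's new location."""
        if purl not in self._table:
            raise KeyError(f"unknown PURL: {purl}")
        self._table[purl] = new_url

    def resolve(self, purl: str) -> str:
        """Return the URL where the document currently lives."""
        return self._table[purl]


resolver = PurlResolver()
purl = "https://purl.example.org/maps/1920"
resolver.register(purl, "http://library.example.org/collections/maps/map_1920.tiff")
first_location = resolver.resolve(purl)

# The document moves; only the table entry changes, the PURL that users cite does not.
resolver.update(purl, "http://archive.example.org/maps/1920/map_1920.tiff")
second_location = resolver.resolve(purl)
```

The sketch also makes the remaining weakness visible: the persistent name is itself a URL, so it
still carries an access method and depends on the resolver's own address staying reachable.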
Uniform Resource Name (URN)
This is not a naming scheme; it is a framework for defining identifiers. A URN contains a naming
authority identifier and an object identifier. Like PURLs, URNs must be resolved through a
database or similar system into actual URLs. Unlike PURLs, URNs can be resolved to more
than one URL, e.g. one for each of several formats.
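The two parts of a URN and its resolution to several URLs can be sketched in Python. The sketch
assumes the common written form "urn:" followed by the naming authority identifier, a colon and
the object identifier. The resolution table, the function names and the example.org addresses are
hypothetical and stand in for whatever database a real resolver would consult.

```python
def parse_urn(urn: str):
    """Split a URN into (naming authority identifier, object identifier)."""
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn" or not parts[1] or not parts[2]:
        raise ValueError(f"not a valid URN: {urn!r}")
    return parts[1].lower(), parts[2]


# Hypothetical resolution database: one URN, one URL per available format.
RESOLUTION_TABLE = {
    ("isbn", "0451450523"): {
        "pdf": "http://library.example.org/books/0451450523.pdf",
        "html": "http://library.example.org/books/0451450523.html",
    },
}


def resolve_urn(urn: str) -> dict:
    """Return every URL registered for a URN, keyed by format."""
    return RESOLUTION_TABLE[parse_urn(urn)]


authority, object_id = parse_urn("urn:isbn:0451450523")
locations = resolve_urn("urn:isbn:0451450523")
available_formats = sorted(locations)
```

Nothing in the URN itself says where the object is or how to fetch it; that knowledge lives
entirely in the resolver, which is what allows one name to lead to several copies.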
Digital Object Identifier (DOI) System
It is an initiative by the Association of American Publishers and the Corporation for National
Research Initiatives to provide a method by which digital objects can be reliably identified and
accessed. It also provides publishers with a method by which intellectual property rights
associated with their materials can be managed.
NB: A system to handle names is possible, but unique identifiers require an institution that can
take responsibility for their management and for their migration from current technology to
succeeding generations of technology.
12.3.5 Copyright / Rights Management
This is a great barrier to digital library development. Digital objects are remotely accessible to
multiple users simultaneously, but they are also easily copied. Unlike publishers, who own their
information, libraries are caretakers of information because they do not own the copyright of the
material they hold. This limits the ability of libraries to freely digitize and provide access to
copyrighted materials in their collections. The solution is for libraries to develop mechanisms
for managing copyright that allow them to provide information without violating copyright.
12.3.6 Preservation
Keeping digital information available in perpetuity is another major issue. Preservation of
digital materials is a real issue because of technical obsolescence, which is comparable to the
deterioration of paper in the paper age. Preservation of digital information requires libraries to
constantly come up with new technical solutions.
Data migration, which is transferring data from one format to another, is one solution to
preserving the ability of users to retrieve and display the information content. However,
migration is costly and there are as yet no standards for it. In addition, every time data is
migrated from one format to another, there is usually some distortion or information loss.
12.4 Summary
Libraries all over the world have been creating digital libraries even with the above
challenges. Many digital library projects are ongoing. Despite the initial enthusiasm among
librarians to develop digital libraries, most of them have given the idea a second thought.
This is because the majority of them are realizing that investing in digital technology is more
difficult than they envisioned, especially if they must overcome the constraints discussed.
12.5 References
1. Cleveland, Gary (1998). Digital libraries: definitions, issues and challenges. Available:
http://www.ifla.org
2. Fox, Edward A. (1999). The digital libraries initiative: update and discussion. Bulletin of
the American Society for Information Science, vol. 26, no. 1.
12.6 Self Test Questions
1. What can libraries jointly do in a coordinated scheme to tackle some of the
challenges discussed in lesson 12 above?
2. What models and national initiatives already exist in other countries to
deal with copyrighted works in digital library projects?
3. Discuss the methods used in preservation of digital collections of materials.