Import and Export with DSpace: ItemExport
Author: Jon Bell
Date: 12/9/05

This report investigates the import and export of items using DSpace. It describes the DSpace export facilities and considers the matters arising. Two alternative export classes, ItemExport and METSExport, are concerned with exporting actual items. In addition, there is an OAI-PMH mapping, concerned with the harvesting of DSpace metadata. This report describes each of these in turn.
ItemExport
This is a Java class, org.dspace.app.itemexport.ItemExport. It is, in fact, the only class in that package. There is a corresponding import package that imports material into DSpace, so this package can be used for transferring items (or collections) from one DSpace repository to another. It generates a simple AIP of the DSpace content. There is a section on information packages later in this report.

ItemExport can be run from the command line, using dsrun. It can export either a single item or a collection. The syntax is:

dsrun org.dspace.app.itemexport.ItemExport -t <ITEM or COLLECTION> -i <id> -d <target_directory> -n <sequence>

The options are:

-t  the type, either ITEM or COLLECTION.
-i  the id of the item or collection. It works with either the database item id or the item's handle, in the form 123456789/36.
-d  the target directory. The exporter creates an individual subdirectory per item, starting with the sequence number (see below). So the first item might be in <target>/0. If the subdirectory exists, the export will fail and throw an exception.
-n  the sequence number. This is the number used as the name of the first subdirectory for exporting items.

In addition there is a -h option that prints some rather sketchy help to the screen.

Exporting an item results in a subdirectory in the specified target directory, which contains the following:

contents         a text file that lists the content information in the AIP.
dublin_core.xml  an XML file for the descriptive metadata, listed by qualified Dublin Core elements. It does, of course, use contributor.author, as does DSpace, in preference to creator.
handle           a text file showing the item's handle.
content files    the files for the content. In the simple examples, there are two such files, the original file (e.g. the PDF) and the licence text file.

The contents file lists the content files in bundles. In the simple cases tried, there are just two bundles, license and original, each with one file listed. The Dublin Core for my thesis as exported by ItemExport is reproduced at the end of this report.
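The command line described above can also be assembled programmatically. What follows is a minimal sketch in Python, not part of DSpace: the function name build_item_export_command and the location of dsrun are assumptions made for illustration. It only builds the argument list, and checks the one precondition noted above, that the subdirectory for the first item must not already exist.

```python
from pathlib import Path


def build_item_export_command(dsrun, export_type, identifier, target_dir, sequence=0):
    """Return the argument list for one ItemExport run (nothing is executed).

    dsrun is the path to the dsrun script, which depends on the local
    installation (an assumption here). The options -t, -i, -d and -n are
    the ones described in the text.
    """
    if export_type not in ("ITEM", "COLLECTION"):
        raise ValueError("export_type must be ITEM or COLLECTION")
    # The exporter fails if an item's subdirectory already exists, so check
    # the first one, which is named after the sequence number.
    first_subdir = Path(target_dir) / str(sequence)
    if first_subdir.exists():
        raise FileExistsError(f"{first_subdir} already exists; the export would fail")
    return [
        str(dsrun),
        "org.dspace.app.itemexport.ItemExport",
        "-t", export_type,
        "-i", str(identifier),
        "-d", str(target_dir),
        "-n", str(sequence),
    ]
```

The returned list could then be handed to subprocess.run, once the real path to dsrun is known.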
Incidentally, running the exporter on items submitted while the Tapir was installed failed. The Dublin Core was written correctly, but the content file was not found, so a FileNotFoundException was thrown. As the Tapir is no longer installed, it is impossible to check whether Tapir submissions can be exported with the Tapir available.

When exporting an item with several files, the result is similar: a directory containing each of the item's content files, a contents list, the original licence file and the file that holds the item's handle. There is a discussion of both exporters' handling of items that contain multiple files later in this report.
METSExport
Another Java class, this is the only class in the org.dspace.app.mets package. Its purpose is similar to that of the ItemExport class, but the metadata is serialised in METS format. It can also be run from the command line using dsrun. The options available are:

-i  export an item.
-c  export a collection.
-a  export all items in the archive.
-d  the destination directory.

It identifies the item (or collection) by the handle, not the id, and uses the handle to name the item's subdirectory in the target directory. There is therefore no need for an equivalent of the sequence number used by ItemExport.

The tests indicate that the item's subdirectory contains two files: the METS metadata in XML format and a file with no type extension, apparently named using the item's checksum, that contains the original. Opening this file with Acrobat Reader worked (given that the original was a PDF). Multiple files are likewise exported as new files with no type extension, apparently named using the checksum.

The METS XML is, as might be expected, a good deal more complex than the ItemExport Dublin Core. This might be argued to make it less appropriate for use when all that is wanted is export of the descriptive metadata for the Voyager. The mets.xml file for my thesis is reproduced at the end of this report.

One difference between the two exporters is that the METS exporter appears not to export the licence file. Instead, the licence text is reproduced (binary coded) in the mets.xml file.
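The observation that the content file appears to be named by its checksum can be tested directly. The sketch below assumes that the name is the file's MD5 checksum in hexadecimal, which the tests suggest but do not prove. The helper names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 checksum of a file as a lowercase hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that a large thesis PDF is not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def name_matches_checksum(path):
    """True if the file's name equals its own MD5 checksum.

    This encodes the assumption that METSExport names content files
    by their MD5 checksum.
    """
    path = Path(path)
    return path.name.lower() == md5_of_file(path)
```

Running name_matches_checksum over the files in an exported item's subdirectory (other than mets.xml) would confirm or refute the naming convention for a given export.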
OAI-PMH
The output returned on passing a GetRecord request for my thesis is appended to this report. It will be seen that the fields do not use qualified Dublin Core, so the distinction between the three dates, for example, is lost, as is the nature of Neal Snooke's contribution. On the other hand, it does replace DSpace's contributor.author field with the more generally preferred creator.

One point that barely affects our bridge, but that surprises me a little, is that EThOS seem to want to use the qualified Dublin Core based UK-ETD metadata set and also to use OAI-PMH without the qualifiers. It is not impossible, I suppose, that they can restore the lost information on importing the metadata, but it does seem odd that they are apparently accepting this loss in the first place. Communications from EThOS suggest that they are aware of this, and that a repository can provide richer metadata. Cranfield have an alternative schema for metadata harvesting in their DSpace, and doubtless we can either use that (if they make it available) or develop something similar.

There seems to be no great difficulty in using OAI as the basis for finding new theses for export. The type field (assuming the UK-ETD metadata set) includes the word thesis (or dissertation), so this can be used to identify a thesis. As that field also includes the level and degree, it can also be used, at least to some extent, to sort PhD and Masters theses. Maybe we could distinguish between research and taught masters in this field if that helps selection. There are three date fields, unidentified because of the loss of the qualifier. Presumably the date.available qualified Dublin Core field is the one to use. The obvious approach is simply to use the latest of the three dates as the basis for selection. An alternative approach is to identify new theses by their handles not being recognised, but this does require checking the persistent ids in the NLW repository.
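The selection rules suggested here (a type containing thesis or dissertation, and the latest of the unqualified dates) might be sketched as follows. This is an illustration only. It assumes the dates are ISO 8601 strings, as in the appended GetRecord output, so that plain string comparison puts them in chronological order. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of unqualified Dublin Core elements, as used in oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"


def is_thesis(dc_element):
    """True if any dc:type value mentions a thesis or dissertation."""
    types = [(t.text or "").lower() for t in dc_element.iter(DC + "type")]
    return any("thesis" in t or "dissertation" in t for t in types)


def latest_date(dc_element):
    """Return the latest dc:date value, or None if the record has no dates.

    Relies on ISO 8601 strings sorting chronologically as text.
    """
    dates = [d.text.strip() for d in dc_element.iter(DC + "date") if d.text]
    return max(dates) if dates else None


def parse_dc(xml_text):
    """Parse an oai_dc:dc fragment into an element tree."""
    return ET.fromstring(xml_text)
```

A harvester could keep a record when is_thesis is true and latest_date is later than the date of the previous harvest.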
The OAI vocabulary allows items to be selected for harvest both by collection (the approach EThOS likes) and by date, so it can be used to ask for a listing of the items in a thesis collection added since the previous harvest.
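Such a request is an OAI-PMH ListRecords call with the set and from arguments. A minimal sketch of building the request URL is given below. The base URL in the usage note is a placeholder rather than a real repository, the set name follows the form of the setSpec values in the appended GetRecord output, and the function name is illustrative.

```python
from urllib.parse import urlencode


def list_records_url(base_url, set_spec, from_date, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL for one set, from a given date.

    verb, metadataPrefix, set and from are standard OAI-PMH arguments.
    The date granularity a repository accepts is reported by its
    Identify response and is not checked here.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
        "from": from_date,
    }
    return base_url + "?" + urlencode(params)
```

For example, list_records_url("http://example.org/oai/request", "hdl_123456789_6", "2005-08-26") asks for the records in that set added or changed on or after 26 August 2005.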
Metadata
The table shows the descriptive metadata fields from both exporters, taken from the listings of the XML files at the end of this report.

ItemExport dc value      METS tag                           Comments
contributor.advisor      mods:roleTerm advisor
contributor.author       mods:roleTerm author               UK-ETD uses creator
date.accessioned         mods:dateAccessioned
date.available           mods:dateAvailable
date.issued              mods:dateIssued
identifier.uri           mods:identifier type=uri
description.abstract     mods:abstract
description.provenance   mods:note type=provenance
format.extent            mods:extent
format.mimetype          mods:internetMediaType
language.iso             mods:languageTerm
publisher                mods:originInfo / mods:publisher
subject                  mods:subject / mods:topic
title                    mods:titleInfo
type                     mods:genre

It will be seen that, based on this example, there is no descriptive metadata included in one exporter and not the other, though it is not clear that this would apply to fields not used in this example (such as subject classifications). Apparently Fedora has a limited set of Dublin Core values. This raises the question of whether the descriptive metadata that is not included in these few fields is stored in some other way in the Fedora record or whether it is lost.
Licence files
On submitting an item, DSpace adds a license.txt bitstream to the item, alongside the content bitstream(s). This is exported alongside the actual content by the ItemExport tool, though not by the METS tool. How this is managed on exporting items needs some consideration. Either the licence bitstream is copied across with the item itself, in which case, naively, the NLW's items will have UWA licence files (rather than the right ones!), or new licence files need adding, in which case the submitter's agreement is needed.

One possible solution is to add an NLW licence to be agreed on submission, so the final licence file covers both institutions. This does, of course, mean that no item will ever consist of a single bitstream, which will slightly complicate the export process.

Another approach is to have the NLW's licence agreed on submission but, rather than copying the licence file over, to have the NLW associate the imported item with a copy of its own licence. This has two consequences. One is that the submitter's agreement to the NLW's licence is treated as a condition of importing the item; the other is that one NLW licence file is suitable for all imported theses. This might also be affected by restrictions on individual theses, of course.

The METS exporter includes the licence text in the mets.xml metadata file, under the rightsMD tag. The licence text is labelled as being text/plain, but is then binary coded. This needs investigating. Of course, it is not clear that it is the right text!
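On the binary coding: the content of the binData element looks like Base64, which is the encoding the METS schema defines for binData. The sketch below decodes such a value under that assumption. The sample string is the opening of the binData value in the mets.xml listing at the end of this report, and the function name is illustrative.

```python
import base64


def decode_bindata(text):
    """Decode the text content of a METS binData element.

    Assumes the content is Base64-encoded UTF-8 text. Whitespace is
    removed first, because the encoded text is wrapped over many lines.
    """
    compact = "".join(text.split())
    return base64.b64decode(compact).decode("utf-8")


# Opening characters of the binData value from the mets.xml listing.
sample = "TGljZW5zZSBncmFudGVk IGJ5IEpvbiBCZWxs"
print(decode_bindata(sample))  # prints "License granted by Jon Bell"
```

Decoding the whole value in the same way would show whether the embedded text is the licence that was actually agreed on submission.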
Information packages
As the documentation of the export tools mentions AIPs, it seemed worth adding a brief explanation of AIPs and other information packages. An information package contains the Content Information (which I take to be the item itself and possibly its descriptive metadata) together with the information required for the item's preservation (known as the Preservation Description Information). There is associated Packaging Information to delimit and identify the Content Information and the Preservation Description Information. It is assumed that there is a place for conventional descriptive metadata as well. There are three kinds of information package:

Archival Information Package (AIP)
    The information package as preserved in an OAIS.
Dissemination Information Package (DIP)
    The information, derived from one or more AIPs, that is received by a user in response to a request to the repository. It will, I imagine, include information on how to read the item.
Submission Information Package (SIP)
    The information package delivered by the producer of the information to the repository and used in the production of one or more AIPs.

The definitions are derived from the glossary in Attributes of a Trusted Digital Repository: Meeting the Needs of Research Resources, an RLG-OCLC report. It is available on the Web at http://www.rlg.org/longterm/attributes01.pdf.
Metadata listings
Below is the dublin_core.xml file from the item exporter for my thesis, altered only by shortening the abstract.
<dublin_core>
  <dcvalue element="contributor" qualifier="advisor">Snooke, Neal</dcvalue>
  <dcvalue element="contributor" qualifier="author">Bell, Jonathan</dcvalue>
  <dcvalue element="date" qualifier="accessioned">2005-08-26T10:36:56Z</dcvalue>
  <dcvalue element="date" qualifier="available">2005-08-26T10:36:56Z</dcvalue>
  <dcvalue element="date" qualifier="issued">2005-08</dcvalue>
  <dcvalue element="identifier" qualifier="uri">http://hdl.handle.net/123456789/35</dcvalue>
  <dcvalue element="description" qualifier="abstract">While most work in the qualitative and model based reasoning community has been concerned with simulation, this is arguably only one of two aspects of model based design analysis. The main focus of work has been in devising appropriate methods of simulation to enable knowledge of the behaviour of the system under analysis to be established from the available knowledge of its structure and the behaviour or function of its components and domain. However, for design analysis to be automated more fully, it is also necessary that the results of the simulation be interpreted in terms appropriate to the design analysis task being undertaken.</dcvalue>
  <dcvalue element="description" qualifier="provenance">Submitted by Jon Bell ([email protected]) on 2005-08-26T10:36:23Z No. of bitstreams: 1 Thesis.pdf: 1202337 bytes, checksum: 828cee64d3767b7f10ecc67dd98435b2 (MD5)</dcvalue>
  <dcvalue element="description" qualifier="provenance">Made available in DSpace on 2005-08-26T10:36:56Z (GMT). No. of bitstreams: 1 Thesis.pdf: 1202337 bytes, checksum: 828cee64d3767b7f10ecc67dd98435b2 (MD5) Previous issue date: 2005-08</dcvalue>
  <dcvalue element="format" qualifier="extent">1202337 bytes</dcvalue>
  <dcvalue element="format" qualifier="mimetype">application/pdf</dcvalue>
  <dcvalue element="language" qualifier="iso">en</dcvalue>
  <dcvalue element="publisher" qualifier="none">Univeristy of Wales.Aberystwyth.Computer Science</dcvalue>
  <dcvalue element="subject" qualifier="none">functional description</dcvalue>
  <dcvalue element="subject" qualifier="none">model based reasoning</dcvalue>
  <dcvalue element="title" qualifier="none">Interpretation of simulation for model-based design analysis of engineered systems</dcvalue>
  <dcvalue element="type" qualifier="none">thesis.doctoral.PhD</dcvalue>
</dublin_core>
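A file in this format is straightforward to read back. The sketch below is not DSpace's own importer; it assumes only the dcvalue structure visible in the listing, and collects the values under element.qualifier keys, treating a qualifier of none as unqualified. The function name is illustrative.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def read_dublin_core(xml_text):
    """Parse dublin_core.xml text into a dict of field name -> list of values.

    A qualified field is keyed as element.qualifier. A qualifier of
    "none" (or a missing qualifier) gives the bare element name.
    """
    fields = defaultdict(list)
    for dcvalue in ET.fromstring(xml_text).iter("dcvalue"):
        element = dcvalue.get("element")
        qualifier = dcvalue.get("qualifier")
        if qualifier in (None, "none"):
            key = element
        else:
            key = f"{element}.{qualifier}"
        fields[key].append((dcvalue.text or "").strip())
    return dict(fields)
```

Applied to the listing above, this would give, for instance, a single contributor.author value and two subject values.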
This is the corresponding mets.xml file from the METS exporter, again with a short abstract.
<?xml version="1.0" encoding="utf-8" standalone="no" ?>
<mets OBJID="hdl:123456789/35" LABEL="DSpace Item"
      xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/TR/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:mods="http://www.loc.gov/mods/v3"
      xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-0.xsd">
  <metsHdr CREATEDATE="2005-08-26T14:31:38">
    <agent ROLE="CUSTODIAN" TYPE="ORGANIZATION">
      <name>Jon's DSpace testbed</name>
    </agent>
  </metsHdr>
  <dmdSec ID="DMD_hdl_123456789/35">
    <mdWrap MDTYPE="MODS">
      <xmlData>
        <mods:name>
          <mods:role>
            <mods:roleTerm type="text">advisor</mods:roleTerm>
          </mods:role>
          <mods:namePart>Snooke, Neal</mods:namePart>
        </mods:name>
        <mods:name>
          <mods:role>
            <mods:roleTerm type="text">author</mods:roleTerm>
          </mods:role>
          <mods:namePart>Bell, Jonathan</mods:namePart>
        </mods:name>
        <mods:extension>
          <mods:dateAccessioned encoding="iso8601">2005-08-26T10:36:56Z</mods:dateAccessioned>
        </mods:extension>
        <mods:extension>
          <mods:dateAvailable encoding="iso8601">2005-08-26T10:36:56Z</mods:dateAvailable>
        </mods:extension>
        <mods:originInfo>
          <mods:dateIssued encoding="iso8601">2005-08</mods:dateIssued>
        </mods:originInfo>
        <mods:identifier type="uri">http://hdl.handle.net/123456789/35</mods:identifier>
        <mods:abstract>While most work in the qualitative and model based reasoning community has been concerned with simulation, this is arguably only one of two aspects of model based design analysis. The main focus of work has been in devising appropriate methods of simulation to enable knowledge of the behaviour of the system under analysis to be established from the available knowledge of its structure and the behaviour or function of its components and domain. However, for design analysis to be automated more fully, it is also necessary that the results of the simulation be interpreted in terms appropriate to the design analysis task being undertaken.</mods:abstract>
        <mods:note type="provenance">Submitted by Jon Bell ([email protected]) on 2005-08-26T10:36:23Z No. of bitstreams: 1 Thesis.pdf: 1202337 bytes, checksum: 828cee64d3767b7f10ecc67dd98435b2 (MD5)</mods:note>
        <mods:note type="provenance">Made available in DSpace on 2005-08-26T10:36:56Z (GMT). No. of bitstreams: 1 Thesis.pdf: 1202337 bytes, checksum: 828cee64d3767b7f10ecc67dd98435b2 (MD5) Previous issue date: 2005-08</mods:note>
        <mods:physicalDescription>
          <mods:extent>1202337 bytes</mods:extent>
        </mods:physicalDescription>
        <mods:physicalDescription>
          <mods:internetMediaType>application/pdf</mods:internetMediaType>
        </mods:physicalDescription>
        <mods:language>
          <mods:languageTerm authority="rfc3066">en</mods:languageTerm>
        </mods:language>
        <mods:originInfo>
          <mods:publisher>Univeristy of Wales.Aberystwyth.Computer Science</mods:publisher>
        </mods:originInfo>
        <mods:subject>
          <mods:topic>functional description</mods:topic>
        </mods:subject>
        <mods:subject>
          <mods:topic>model based reasoning</mods:topic>
        </mods:subject>
        <mods:titleInfo>Interpretation of simulation for model-based design analysis of engineered systems</mods:titleInfo>
        <mods:genre>thesis.doctoral.PhD</mods:genre>
      </xmlData>
    </mdWrap>
  </dmdSec>
  <amdSec ID="TMD_hdl_123456789/35">
    <rightsMD>
      <mdWrap MIMETYPE="text/plain" MDTYPE="OTHER" OTHERMDTYPE="TEXT">
        <binData>TGljZW5zZSBncmFudGVkIGJ5IEpvbiBCZWxsIChqcGJAYWJlci5h Yy51aykgb24gMjAwNS0w OC0yNlQxMDozNjoyM1ogKEdNVCk6CgpOT1RFOiBQTEFDRSBZT1VSIE9XTi BMSUNFTlNFIEhFUkUK VGhpcyBzYW1wbGUgbGljZW5zZSBpcyBwcm92aWRlZCBmb3IgaW5mb3Jt YXRpb25hbCBwdXJwb3Nl cyBvbmx5LgoKTk9OLUVYQ0xVU0lWRSBESVNUUklCVVRJT04gTElDRU5TR QoKQnkgc2lnbmluZyBh bmQgc3VibWl0dGluZyB0aGlzIGxpY2Vuc2UsIHlvdSAodGhlIGF1dGhvcihzK SBvciBjb3B5cmln aHQKb3duZXIpIGdyYW50cyB0byBEU3BhY2UgVW5pdmVyc2l0eSAoRFNV KSB0aGUgbm9uLWV4Y2x1 
c2l2ZSByaWdodCB0byByZXByb2R1Y2UsCnRyYW5zbGF0ZSAoYXMgZGVm aW5lZCBiZWxvdyksIGFu ZC9vciBkaXN0cmlidXRlIHlvdXIgc3VibWlzc2lvbiAoaW5jbHVkaW5nCnRoZS BhYnN0cmFjdCkg d29ybGR3aWRlIGluIHByaW50IGFuZCBlbGVjdHJvbmljIGZvcm1hdCBhbm
QgaW4gYW55IG1lZGl1 bSwKaW5jbHVkaW5nIGJ1dCBub3QgbGltaXRlZCB0byBhdWRpbyBvciB2a WRlby4KCllvdSBhZ3Jl ZSB0aGF0IERTVSBtYXksIHdpdGhvdXQgY2hhbmdpbmcgdGhlIGNvbnRlbn QsIHRyYW5zbGF0ZSB0 aGUKc3VibWlzc2lvbiB0byBhbnkgbWVkaXVtIG9yIGZvcm1hdCBmb3IgdGhl IHB1cnBvc2Ugb2Yg cHJlc2VydmF0aW9uLgoKWW91IGFsc28gYWdyZWUgdGhhdCBEU1UgbWF 5IGtlZXAgbW9yZSB0aGFu IG9uZSBjb3B5IG9mIHRoaXMgc3VibWlzc2lvbiBmb3IKcHVycG9zZXMgb2Y gc2VjdXJpdHksIGJh Y2stdXAgYW5kIHByZXNlcnZhdGlvbi4KCllvdSByZXByZXNlbnQgdGhhdCB0 aGUgc3VibWlzc2lv biBpcyB5b3VyIG9yaWdpbmFsIHdvcmssIGFuZCB0aGF0IHlvdSBoYXZlCnR oZSByaWdodCB0byBn cmFudCB0aGUgcmlnaHRzIGNvbnRhaW5lZCBpbiB0aGlzIGxpY2Vuc2UuIFl vdSBhbHNvIHJlcHJl c2VudAp0aGF0IHlvdXIgc3VibWlzc2lvbiBkb2VzIG5vdCwgdG8gdGhlIGJlc3 Qgb2YgeW91ciBr bm93bGVkZ2UsIGluZnJpbmdlIHVwb24KYW55b25lJ3MgY29weXJpZ2h0Lg oKSWYgdGhlIHN1Ym1p c3Npb24gY29udGFpbnMgbWF0ZXJpYWwgZm9yIHdoaWNoIHlvdSBkbyBu b3QgaG9sZCBjb3B5cmln aHQsCnlvdSByZXByZXNlbnQgdGhhdCB5b3UgaGF2ZSBvYnRhaW5lZCB0aG UgdW5yZXN0cmljdGVk IHBlcm1pc3Npb24gb2YgdGhlCmNvcHlyaWdodCBvd25lciB0byBncmFudCB EU1UgdGhlIHJpZ2h0 cyByZXF1aXJlZCBieSB0aGlzIGxpY2Vuc2UsIGFuZCB0aGF0CnN1Y2ggdGhp cmQtcGFydHkgb3du ZWQgbWF0ZXJpYWwgaXMgY2xlYXJseSBpZGVudGlmaWVkIGFuZCBhY2tu b3dsZWRnZWQKd2l0aGlu IHRoZSB0ZXh0IG9yIGNvbnRlbnQgb2YgdGhlIHN1Ym1pc3Npb24uCgpJRiB USEUgU1VCTUlTU0lP TiBJUyBCQVNFRCBVUE9OIFdPUksgVEhBVCBIQVMgQkVFTiBTUE9OU09SR UQgT1IgU1VQUE9SVEVE CkJZIEFOIEFHRU5DWSBPUiBPUkdBTklaQVRJT04gT1RIRVIgVEhBTiBEU1U sIFlPVSBSRVBSRVNF TlQgVEhBVCBZT1UgSEFWRQpGVUxGSUxMRUQgQU5ZIFJJR0hUIE9GIFJFV klFVyBPUiBPVEhFUiBP QkxJR0FUSU9OUyBSRVFVSVJFRCBCWSBTVUNICkNPTlRSQUNUIE9SIEFH UkVFTUVOVC4KCkRTVSB3 aWxsIGNsZWFybHkgaWRlbnRpZnkgeW91ciBuYW1lKHMpIGFzIHRoZSBhd XRob3Iocykgb3Igb3du ZXIocykgb2YgdGhlCnN1Ym1pc3Npb24sIGFuZCB3aWxsIG5vdCBtYWtlIGF ueSBhbHRlcmF0aW9u LCBvdGhlciB0aGFuIGFzIGFsbG93ZWQgYnkgdGhpcwpsaWNlbnNlLCB0byB 5b3VyIHN1Ym1pc3Np b24uCg==</binData> </mdWrap> </rightsMD> </amdSec> - <fileSec> - <fileGrp USE="ORIGINAL"> - <file ID="123456789_35_1" MIMETYPE="application/pdf" SIZE="1202337" 
CHECKSUM="828cee64d3767b7f10ecc67dd98435b2" CHECKSUMTYPE="MD5" OWNERID="http://ISSVSTAFF075.staff.aber.ac.uk:8080/dspace/bitstream/123456789/35/1/Thesis.pdf" GROUPID="GROUP_123456789_35_1">
This is the output returned by the OAI-PMH GetRecord request for my thesis.
<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2005-09-08T09:08:13Z</responseDate>
  <request identifier="oai:ISSVSTAFF075.staff.aber.ac.uk:123456789/35" metadataPrefix="oai_dc" verb="GetRecord">http://iswsstaff075:8080/dspaceoai/request</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:ISSVSTAFF075.staff.aber.ac.uk:123456789/35</identifier>
        <datestamp>2005-08-26T10:36:57Z</datestamp>
        <setSpec>hdl_123456789_6</setSpec>
        <setSpec>hdl_123456789_22</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
          <dc:contributor>Snooke, Neal</dc:contributor>
          <dc:creator>Bell, Jonathan</dc:creator>
          <dc:date>2005-08-26T10:36:56Z</dc:date>
          <dc:date>2005-08-26T10:36:56Z</dc:date>
          <dc:date>2005-08</dc:date>
          <dc:identifier>http://hdl.handle.net/123456789/35</dc:identifier>
          <dc:description>While most work in the qualitative and model based reasoning community has been concerned with simulation, this is arguably only one of two aspects of model based design analysis. The main focus of work has been in devising appropriate methods of simulation to enable knowledge of the behaviour of the system under analysis to be established from the available knowledge of its structure and the behaviour or function of its components and domain. However, for design analysis to be automated more fully, it is also necessary that the results of the simulation be interpreted in terms appropriate to the design analysis task being undertaken.</dc:description>
          <dc:format>1202337 bytes</dc:format>
          <dc:format>application/pdf</dc:format>
          <dc:language>en</dc:language>
          <dc:publisher>University of Wales.Aberystwyth.Computer Science</dc:publisher>
          <dc:subject>functional description</dc:subject>
          <dc:subject>model based reasoning</dc:subject>
          <dc:title>Interpretation of simulation for model-based design analysis of engineered systems</dc:title>
          <dc:type>thesis.doctoral.PhD</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>