Digitization Guidelines

These are the technical guidelines for materials originating from print (digitized from hard copies or microform) to be included in SAOA’s digital collections. Digitization providers (commercial entities as well as academic institutions) will be expected to conform to these specifications to ensure consistency of the digital materials for ingest into the SAOA digital asset management system. The following are the ideal specifications for ingesting image-based material into SAOA’s collections.

  1. At the outset of each project, the digitization provider or content contributor should provide SAOA with the total number of images, total number of volumes (for serials and multi-volume monographs), and total file size (in MB, GB, or TB).
  2. Descriptive Metadata – for cases where SAOA requests metadata from a content contributor, metadata should:
    1. Use one of the following metadata schemes: Dublin Core or MARC21,

    2. Be provided in one of the following metadata/catalog record file formats: MARC XML or CSV,

    3. Use the correct standard metadata template (for example, for monographs vs serials),

    4. Include accurate holdings information for serials or multipart titles,

    5. Have been provided in a sample set of records for SAOA staff to review during the proposal phase, as specified above.

  3. Structural Metadata – if available, appropriate structural metadata should be provided to help SAOA organize the image files and to allow navigation within the item (for example, by chapter).

  4. Asset File Types – when requested, the following file types for each image of a given title should be provided:
    1. Master image files for preservation: TIFF images,
    2. Access files (image surrogates): JPEG or JPEG2000 (JP2),
    3. OCR files (where available):
      1. Plain text (.txt), and
      2. OCR XML or hOCR.
  5. Image Capture
    1. TIFF master image files
      1. Resolution: 400 ppi for new digitization (300 ppi may be acceptable in some cases if that is what is already available).
      2. Uncompressed TIFF 6.0 images, in either “little endian” (IBM PC) or “big endian” (Mac) byte order.
      3. All files should be able to pass JHOVE format validation as valid and well-formed.
      4. 24-bit color for new digitization. 8-bit grayscale may be acceptable for items already digitized or with no color content, with either no gray profile or Gray Gamma 2.2. No proprietary scanner profiles.
      5. One page per image (in exceptional cases, two pages per image).
    2. JPEG or JP2 Access files (image surrogates)
      1. Resolution: keep the surrogate resolution the same as that of the master TIFF file.
      2. Compression: on average the access files should compress to about 500 kilobytes (compression ratio between 10:1 and 15:1, depending upon the size and color of the original).
    3. Image quality: images should meet the following characteristics, many of which may be available as automated settings on the scanner as part of the image capture option (e.g. microfilm scanning). In exceptional cases, manual post-processing or correction might be necessary:
      1. Achieve desired tone distribution
      2. Sharpen images to match appearance of the originals
      3. Crop and/or deskew the images, oriented to the text (not to the page)
  6. File Naming
    1. Monographs (Single Volume)
      1. Format: titleID_YEAR_sequential image #.tif
      2. Example: 986786411_1915_00135.tif
        1. This would be for a monograph (single volume) published in 1915, 135th consecutive image.
        2. For the Title ID, assign the OCLC# if available.
        3. Allow for 5 digits for sequential image numbering.
    2. Monographs (Multi-Volume)
      1. Format: titleID_YEAR_VOLUME #_sequential image #.tif
      2. Example: 990512780_1918_003_00115.tif
        1. This would be for a monograph (multi-volume) published in 1918, volume 3, 115th consecutive image.
        2. For the Title ID, assign the OCLC#.
        3. Allow for 5 digits for sequential image numbering.
    3. Serials
      1. Format: titleID_YEAR_VOLUME #_ISSUE #_sequential image #.tif
      2. Example: 990312980_1915_002_001_00253.tif
        1. This would be for a serial published in 1915, volume 2, issue 1, 253rd consecutive image.
        2. For the Title ID, assign the OCLC#.
        3. Allow for 5 digits for sequential image numbering.
    4. FOR ALL THE ABOVE: file naming of the derivative access files must follow the same pattern. REQUIRED: the .jp2 or .jpg derivative must have precisely the same filename as its corresponding master .tif file, except for the filename extension. For example, "990512780_1918_003_00115.jp2" is derived from (corresponds to) the image "990512780_1918_003_00115.tif".
  7. File Transfer
    1. Acceptable methods of file transfer are via hard drive, USB drive, FTP, Dropbox, Google Drive, and CD.
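
The file naming rules in item 6 can be checked mechanically before delivery. The following is a minimal Python sketch, not an official SAOA tool. It assumes details the guidelines do not state explicitly: that the title ID is alphanumeric with no underscores, that the year has four digits, and that volume and issue numbers are zero-padded to three digits as in the examples. The names `parse_name`, `build_name`, and `is_matching_derivative` are hypothetical.

```python
import re
from pathlib import PurePath
from typing import NamedTuple, Optional

# Assumed pattern, inferred from the three examples in item 6:
# titleID_YEAR[_VOLUME[_ISSUE]]_SEQUENCE.ext
NAME_RE = re.compile(
    r"^(?P<title_id>[A-Za-z0-9]+)_(?P<year>\d{4})"
    r"(?:_(?P<volume>\d{3})(?:_(?P<issue>\d{3}))?)?"
    r"_(?P<seq>\d{5})\.(?P<ext>tif|jp2|jpg)$"
)


class ImageName(NamedTuple):
    title_id: str
    year: int
    volume: Optional[int]
    issue: Optional[int]
    seq: int
    ext: str


def parse_name(filename: str) -> Optional[ImageName]:
    """Return the parts of a conforming filename, or None if it does not conform."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    vol, iss = m["volume"], m["issue"]
    return ImageName(
        m["title_id"],
        int(m["year"]),
        int(vol) if vol else None,
        int(iss) if iss else None,
        int(m["seq"]),
        m["ext"],
    )


def build_name(title_id, year, seq, volume=None, issue=None, ext="tif") -> str:
    """Assemble a filename; the sequence number is always padded to 5 digits."""
    if issue is not None and volume is None:
        raise ValueError("an issue number requires a volume number")
    parts = [str(title_id), f"{year:04d}"]
    if volume is not None:
        parts.append(f"{volume:03d}")  # 3-digit padding is an assumption
    if issue is not None:
        parts.append(f"{issue:03d}")
    parts.append(f"{seq:05d}")
    return "_".join(parts) + "." + ext


def is_matching_derivative(master: str, derivative: str) -> bool:
    """True if the derivative differs from the master .tif only by extension."""
    m, d = PurePath(master), PurePath(derivative)
    return m.suffix == ".tif" and d.suffix in {".jp2", ".jpg"} and m.stem == d.stem
```

For instance, `build_name("990312980", 1915, 253, volume=2, issue=1)` reproduces the serial example, and `parse_name` returns `None` for any name that does not fit the assumed pattern, so nonconforming files can be listed before transfer.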

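The project totals requested in item 1 (number of images, number of volumes, and total file size) can be tallied from a file listing. The sketch below is illustrative only and rests on several assumptions that the guidelines do not specify: only master .tif files are counted, a volume is identified by the title ID and volume fields of the filename, and sizes use decimal units (1 MB = 1,000,000 bytes). The names `summarize` and `human_size` are hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple


class Inventory(NamedTuple):
    images: int
    volumes: int
    total_bytes: int


def summarize(files: Iterable[Tuple[str, int]]) -> Inventory:
    """Tally master images, distinct volumes, and total size in bytes.

    `files` yields (filename, size_in_bytes) pairs.
    """
    images = 0
    total = 0
    volumes = set()
    for name, size in files:
        if not name.lower().endswith(".tif"):
            continue  # assumption: count master TIFF files only
        images += 1
        total += size
        parts = name.rsplit(".", 1)[0].split("_")
        # Multi-volume monographs have 4 fields and serials have 5;
        # in both cases the third field is the volume number.
        if len(parts) >= 4:
            volumes.add((parts[0], parts[2]))
    return Inventory(images, len(volumes), total)


def human_size(n: int) -> str:
    """Format a byte count in MB, GB, or TB (decimal units, by assumption)."""
    for unit, factor in (("TB", 10**12), ("GB", 10**9)):
        if n >= factor:
            return f"{n / factor:.2f} {unit}"
    return f"{n / 10**6:.2f} MB"
```

With a hypothetical delivery folder named `delivery`, the listing could be produced with `pathlib`: `summarize((p.name, p.stat().st_size) for p in Path("delivery").rglob("*") if p.is_file())`.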