In this series of articles about the digital preservation effort going on at the John F. Kennedy Presidential Library and Museum, I've yet to describe the actual hardware configuration that is currently in operation.
I've covered the EMC software in use, the archivists' process, the crucial role played by Documentum, and conformance to standards such as OAIS, NARA, and Dublin Core.
This post more fully describes the EMC hardware architecture put in place to make it all happen.
Update: since this article was first published a hardware and software refresh has occurred (2010).
The inventory is as follows: two scanners, two scanner workstations, four servers, one CLARiiON CX300i, and a Centera. Not pictured are the backup infrastructure (tape), a KVM switch, the details of the IP networks, and various other servers and workstations (e.g. CLARiiON system management, additional workstations). What is shown above is enough to map the process and the software onto the hardware.
Scanners and Scanner Workstations
At the top of the diagram are the scanners and the workstations they are attached to. The workstations are Dell GX620s. The scanners are Fujitsu fi-5750C devices. These are the primary interface devices for the archivists and interns working at the JFK Library. Installed on the workstations are the Documentum products, including AppGen, ApplicationXtender Document Manager, and ApplicationXtender Image Capture. As described previously, the applications running on these workstations set up permissions, interface with the scanner, and enable metadata cataloguing. Note that metadata cataloguing can also be done from other workstations (not pictured).
ApplicationXtender Servers
At the center of the diagram four servers are depicted. These servers all interact with Documentum ApplicationXtender for different purposes, yet this interaction is seamless to the archivists (while providing an extremely rich archiving implementation). The four servers are all Dell PowerEdge 1850s with 3.2 GHz CPUs and 2 GB of memory. They are all running Windows Server 2003. Each is connected to a front-end IP network and, via a back-end iSCSI network, to a CLARiiON. PowerPath (described in the CLARiiON section) is installed on all servers. The responsibilities of these servers are listed below:
- A database server (Microsoft SQL Server 2005) is being used to store ApplicationXtender's database (DocBase).
- A WebXtender server will ultimately be used to display (via the web) the digital assets currently being scanned into the archive.
- A DiskXtender server runs the Legato DiskXtender software. ApplicationXtender stores scanned images to this server, and they all initially land on the CLARiiON. At the end of each day DiskXtender transparently moves them to Centera.
- A Verity server runs the optical character recognition (OCR) software used to parse the scanned documents and add metadata for a richer search experience.
CLARiiON CX300i
The CLARiiON device (located at the bottom of the diagram) contains fifteen drives and doles out storage among all of the servers described above. Initially each server received a minimum of 80 GB of iSCSI LUN space to support application operation. The DiskXtender server was also allocated an additional 500 GB to store newly scanned images. PowerPath is running on all servers attached to the CLARiiON. If a failure occurs in any network path, PowerPath can work around it, whether the failure is in the server, in the network, or in a CLARiiON network port.
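As a quick sanity check on the numbers above, here is a small Python sketch that adds up the initial minimum allocation. The short role labels are my own shorthand for the four servers, and the figures are the minimums quoted in this post; actual LUN sizes may have been larger.

```python
# Figures quoted in the post: four servers, at least 80 GB each,
# plus an extra 500 GB on the DiskXtender server.
SERVER_ROLES = ["database", "WebXtender", "DiskXtender", "Verity"]  # shorthand labels
BASE_LUN_GB = 80             # minimum iSCSI LUN space per server
DISKXTENDER_EXTRA_GB = 500   # landing area for newly scanned images


def initial_allocation_gb():
    """Return a mapping of server role to its minimum allocation in GB."""
    allocation = {role: BASE_LUN_GB for role in SERVER_ROLES}
    allocation["DiskXtender"] += DISKXTENDER_EXTRA_GB
    return allocation


allocation = initial_allocation_gb()
total_gb = sum(allocation.values())
print(allocation)
print(f"minimum total carved from the CLARiiON: {total_gb} GB")  # 820 GB
```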
The 500 GB of DiskXtender storage is one of the keys to the entire archiving solution, and one of the reasons that the solution is state-of-the-art for digital archiving. The sheer volume of scanned objects in the JFK collection can fill up a 15-drive CLARiiON in a matter of weeks. CLARiiON is an easy-to-scale product, with a modular, building-block hardware architecture. However, asking a digital archivist to continually carve out new iSCSI LUNs and configure new file systems is not a good use of their time. Instead, when new images land on the CLARiiON they are transparently moved (daily) to the much larger Centera archive. Small stub files, known as content addresses, are left behind. It will take a long time to fill up 500 GB with stub files (sizes on the order of bytes). When that time does come, a CLARiiON expansion will be performed.
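The daily move-and-stub behaviour can be sketched in a few lines of Python. This is an illustrative in-memory model, not DiskXtender's implementation: the function names, the dictionaries standing in for the CLARiiON landing area and the Centera, and the use of SHA-256 to derive an address are all assumptions made for the example.

```python
import hashlib


def migrate_daily(landing, archive):
    """Move full-size objects to the archive and leave small stubs behind.

    landing: dict of file name -> bytes (content) or str (stub holding an address)
    archive: dict of address -> bytes
    Returns the number of objects moved on this run.
    """
    moved = 0
    for name, value in list(landing.items()):
        if isinstance(value, bytes):                 # not yet migrated
            address = hashlib.sha256(value).hexdigest()  # stand-in address scheme
            archive[address] = value                 # content now lives in the archive
            landing[name] = address                  # only the stub stays behind
            moved += 1
    return moved


def landing_size(landing):
    """Total size of everything still held in the landing area."""
    return sum(len(value) for value in landing.values())


# Two hypothetical scans of 1 MB and 2 MB.
landing = {"scan-0001.tif": b"\x00" * 1_000_000,
           "scan-0002.tif": b"\x01" * 2_000_000}
archive = {}

before = landing_size(landing)
moved = migrate_daily(landing, archive)
after = landing_size(landing)
print(moved, before, after)  # 2 3000000 128
```

In this toy model, three million bytes of images shrink to two 64-character stubs, which is why the landing area fills so slowly once the daily migration is running.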
Centera
If Documentum is the centerpiece of the front end of this solution, Centera is the centerpiece of the back end. An initial 5 TB of Centera storage was installed at the site. All of the scanned images eventually land on the Centera. The content addresses left behind on the DiskXtender server serve two purposes: (1) they allow Legato to re-stage content back onto the CLARiiON (should a user wish to view a given document or photo), and (2) they authenticate that the content is genuine (content addresses contain hash values based on every byte of the content).
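Both purposes fall out naturally when the address is derived from the content itself. The Python sketch below shows the idea under stated assumptions: `FlatArchive`, `content_address`, and `restage` are hypothetical names, and SHA-256 is a stand-in, since this post does not go into which hash Centera uses or how its addresses are formatted.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Derive an address from every byte of the content (SHA-256 as a stand-in)."""
    return hashlib.sha256(data).hexdigest()


class FlatArchive:
    """Toy content-addressed store: callers hold addresses, never locations."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        address = content_address(data)
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]


def restage(archive: FlatArchive, address: str) -> bytes:
    """Fetch content by address and confirm it still hashes to that address."""
    data = archive.get(address)
    if content_address(data) != address:
        raise ValueError("content does not match its address")
    return data


archive = FlatArchive()
address = archive.put(b"scanned page")
print(restage(archive, address) == b"scanned page")  # True

# Simulate silent corruption inside the archive.
archive._objects[address] = b"tampered page"
try:
    restage(archive, address)
except ValueError as err:
    print("rejected:", err)
```

The same address that lets the stub find its content also exposes any change to that content, because altered bytes no longer hash to the address stored in the stub.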
Clearly the Centera will fill up, especially given the pace of the archivists' process. In the first 6 months, 70 photos and 70,000 documents caused the Centera to fill. A "full system" is another reason why Centera shines as the centerpiece for storage in a digital archive: it allows for the dynamic addition of storage with zero need for reconfiguration by the customer. There are no LUNs to create, and no file systems to deal with. Centera presents a completely flat address space. Legato DX doesn't "place" the content into a specific Centera location; it simply "throws it over the wall" and receives a content address in return. Centera automatically takes care of recognizing and initializing new storage.
Of course, Centera's retention feature prevents accidental (or intentional) deletion of the content.
An Elegant Solution
I enjoy watching this system in operation. So do the archivists at the JFK Library! After all, they have MILLIONS OF DOCUMENTS to scan into the system!!! The fact that capacity upgrades of Centera are straightforward is a big win.
Another requirement for the JFK Library was related to backup. The library wants three copies of JFK's documents: (1) the original, (2) a digital copy on Centera, and (3) a digital copy on tape. I hope to more fully discuss the currently functioning tape backup solution in a future post.
Steve