Library Journal Mobile

Breaking Through the Invisible Web

Mark Ludwig discusses the University at Buffalo's attempts to move its catalog content to the web's surface

Mark Ludwig (netConnect) -- netConnect, 1/15/2003

More and more web content, including library content, never sees the light of day. It is hidden behind proprietary database interfaces where it can't be found by popular Internet search engines. Librarians must bring the deep, invisible web to the surface so that our public can discover our resources through a query on a popular search engine. While using proprietary database technology to automate content generation and manage lists is efficient, it ultimately undermines the openness and ease of access that once made the web great.

Our students would be better served by library content on persistent pages that continue to exist beyond the end of a database search. This would help them find information where they look for it: online. We can accomplish this by using web site development products that allow for publishing web sites and database packages that enable data export. We can update, change, and maintain content in background databases and still offer a persistent presence for our content on the web server. But if we want the open, visible portion of our web sites to offer meaningful content, we also have to leave our pages in the surface web so that link-following spiders can easily find and crawl through them.

There are many initiatives today to bring library content to the surface web. Projects such as the Open Archives Initiative and products such as SFX seek to harvest and mine information from the depths of myriad databases. At the University at Buffalo (UB), State University of New York, we have concluded experiments that show millions of web pages can be extracted from databases, such as the catalog, and stored persistently on the web. This allows libraries to have the best of both worlds: database control and bookmarkable persistent documents. It will also allow students to use their interface of choice, that of a web search engine, to find authoritative materials.

Hide and seek

 

Link List

Inktomi Enterprise Search
www.verity.com

MarcEdit
www.onid.orst.edu/~reeset/marcedit/html

SofTouch System's Crossplex
www.softouch.com

University at Buffalo's NetCatalog
libnet.buffalo.edu

How have libraries ended up with so much of their content hidden? Over the past four or five years, web site management has become a significant challenge as we accumulate the work of many librarians and web authors on our sites. Tools such as Dreamweaver, ColdFusion, Microsoft SQL Server, and Oracle are being used to deal with issues such as constantly changing content, new material, and the need to globally control and mass-change web sites.

In many cases, we are building databases, like our catalogs, that dynamically produce search results and content pages in response to user queries. In addition, much of our content comes from third-party sources; we essentially offer menus and services of other web sites. All this leaves valuable library content locked away.

Original web paradigm

If we consider the evolutionary development of the Internet and the World Wide Web, we can trace a certain synergy. Originally, a few simple concepts evolved into sophisticated components that make everything work harmoniously. At its core, the web delivers files from one computer to another over a common network in a standard way. Despite the advent of graphics, image, sound, and text, we still have file transfers.

For some time, most of us have risen above bits and bytes to deal instead with characters, fonts, HTML tags, pages, and web sites. This is a powerful improvement over the old, grueling way of building computer applications. Almost anyone can now add content, often very sophisticated content, to web sites. Around the mid-1990s many people caught the vision of the web: a new place to store all kinds of textual documentation, where anyone and everyone could retrieve it. The storehouse was relatively easy to set up. After purchasing and installing the server hardware and software, any number of web pages could be added at little or no additional cost.

The paradigm shifts

It didn't take long before we developed a strong desire to do all of our computing on the web. During the late 1990s we brought interfaces to just about any system one could imagine. The web became more than a purveyor of file content. It became a transport mechanism to the web façades of seemingly every business system on the planet.

Just as brokerage houses interfaced their host-based securities trading systems to the web and universities interfaced their mainframe registration systems to the web, library vendors interfaced their old transaction processing systems to the web. The data did not move to the web, only the user interfaces did. The web browser became a nice façade to the backend systems, rather than a new system.

The web was used to connect hordes of new users to hordes of old systems. This gave vastly more people access to services. But many online catalogs exhibited the same limitations as Z39.50 interfaces because they were only front ends to old systems. Libraries like UB implemented web-based catalogs over their traditional integrated library systems (ILS). At UB, we used a middleware product, Crossplex from SofTouch Systems, to screen-scrape the old mainframe NOTIS catalog and give it a respectable web look and feel.

UB's library web designers have begun to build a database of databases just to keep track of the licensing and funding of hundreds of subscriptions to online databases, indexes, and resources. In the past, we relied upon static pages to describe resources and guide users to the products. Now we are considering replacing these product descriptions with a separate searchable database. Thus, users will search a database of databases before they even begin their original search. Our deep web is getting deeper and deeper by levels. Scripting languages, interfaces, and web development tools have moved us further away from the free and easy paradigm of the early World Wide Web. UB's web manager, Gemma DeVinney, recently noted that while we once gained visibility for a resource when we put it on the web, now we often lose it when we place it there.

Attracting spiders

Rather than burying all the material, why not bring it forward on a web site? At UB, we wondered if a web server could handle an unlimited number of persistent web pages. Many web sites contain tens of thousands of pages, but could a web site handle millions of pages? Modern disks certainly have the required capacity.

We got to find out when, as part of beta testing an ILS conversion project, we needed to verify the transfer of MARC data from one system to another. We set out to find a way to retrieve individual MARC records and view their state before loading them into the new vendor's database. As a proof-of-concept test, and to meet the practical needs of conversion testing, we made a file for each of our 2.2 million MARC records. We placed exactly one MARC record in each file and named the files in a consistent way that included the unchanging, unique record number from the mainframe system. Not only did the web server handle this without a problem, but with the help of a free program called MarcEdit, we could download any record at will and display or edit it on a desktop computer.
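
The one-file-per-record idea can be sketched in a few lines of Python. This is only an illustration, not the program UB ran: it assumes the export is a standard ISO 2709 MARC file, that field 001 holds the unchanging record number, and that files are spread over subdirectories by the leading characters of that number. The function names and the directory layout are hypothetical.

```python
import os

RECORD_TERMINATOR = b"\x1d"  # ends every ISO 2709 (MARC) record
FIELD_TERMINATOR = b"\x1e"   # ends the directory and each field


def control_number(record: bytes) -> str:
    """Return the contents of field 001, the record's control number."""
    base = int(record[12:17])        # leader bytes 12-16: base address of data
    directory = record[24:base - 1]  # 12-byte entries: tag(3) length(4) start(5)
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        if entry[:3] == b"001":
            length = int(entry[3:7])
            start = int(entry[7:12])
            field = record[base + start:base + start + length]
            return field.rstrip(FIELD_TERMINATOR).decode("ascii").strip()
    raise ValueError("record has no 001 field")


def shard_path(root: str, number: str) -> str:
    """Hypothetical layout: two directory levels taken from the padded number."""
    padded = number.zfill(9)
    return os.path.join(root, padded[:3], padded[3:6], padded + ".mrc")


def split_file(marc_path: str, root: str) -> int:
    """Write each record of marc_path to its own file; return the record count."""
    with open(marc_path, "rb") as source:
        data = source.read()  # fine for a sketch; a full export should be streamed
    count = 0
    for chunk in data.split(RECORD_TERMINATOR):
        if not chunk.strip():
            continue  # skip the empty tail after the last terminator
        record = chunk + RECORD_TERMINATOR
        target = shard_path(root, control_number(record))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as out:
            out.write(record)
        count += 1
    return count
```

Because the file name is derived from the record number alone, the URL of any record is predictable, which is what makes the pages bookmarkable and easy to fetch one at a time.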

If we could put two million files on a single web site, why not more? It is possible, under the UNIX operating system, to configure a file system to hold hundreds of millions of files. We set up another disk and used a local C program to convert the mainframe NOTIS MARC data to HTML. This was quite easy since MARC tagging is more granular and detailed than the relatively simple HTML tags. You can tag title and author with a large font on a shaded background and use various colors for certain types of tags to achieve an interesting display. In short order we had 4.4 million files: 2.2 million MARC records and, for each record, a corresponding HTML page that contained a link to the MARC record. Then we decided to index the HTML pages. Would Google crawl through the web site and make us a free catalog?
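
A conversion of this kind might look like the following Python sketch. The original was a local C program whose details are not given here, so everything below is an assumption for illustration: the input is taken to be a list of (MARC tag, text) pairs already pulled from one record, and the labels, style rules, and function name are hypothetical.

```python
from html import escape

# Display labels for a few common MARC tags; anything else shows its raw tag.
LABELS = {"100": "Author", "245": "Title", "260": "Imprint", "650": "Subject"}
MAJOR_TAGS = {"100", "245"}  # shown larger, on a shaded background


def record_to_html(number, fields, marc_href):
    """Build one catalog page from (tag, text) pairs for a single record."""
    title = next((text for tag, text in fields if tag == "245"), "Record " + number)
    lines = [
        "<html><head>",
        "<title>" + escape(title) + "</title>",
        "<style>.major{font-size:larger;background:#ddd} .minor{color:#336}</style>",
        "</head><body>",
    ]
    for tag, text in fields:
        css = "major" if tag in MAJOR_TAGS else "minor"
        label = LABELS.get(tag, tag)
        lines.append('<p class="%s"><b>%s:</b> %s</p>' % (css, escape(label), escape(text)))
    # Each page links back to the file holding the raw MARC record.
    lines.append('<p><a href="%s">MARC record %s</a></p>' % (escape(marc_href), escape(number)))
    lines.append("</body></html>")
    return "\n".join(lines)
```

Putting the title in the page's title element and keeping the body as plain text and links gives a link-following spider something simple to read on every page.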

After several weeks, a Google spider found our site and crawled away. Unfortunately, Google only picked up our 80,000 directory entries, crawled about 20,000 of our HTML MARC pages, and then stopped crawling. Apparently, too much content from one site is to be feared. We did get to see what a Google library catalog might look like. More importantly, we proved that it is possible to create many catalog pages, index them, and search them with an Internet search engine. This can be done without an ILS and without a relational database.

An XML catalog: NetCatalog

Our experiments demonstrated that the old web paradigm of putting lots of files directly onto the web site is still a useful scenario. Size does not mandate a relational database, and since the mainframe system can handle updates to the data, we can achieve the required dynamism by promulgating transactions from the ILS. A new paradigm, based on the old web paradigm, began to emerge. We realized we can build almost any library information system in several easy steps: construct catalog pages that contain one result each; place them on a directory under a web server; tag interesting data elements; index the pages; and use one or more Internet search engines to search and retrieve result pages.
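
The indexing and searching steps can be illustrated with a toy inverted index in Python. This is a deliberately minimal sketch of what a search engine does with one-result-per-page files, not a description of any engine used at UB; the function names are hypothetical, and the pages are passed in as a dictionary of path to text rather than read from a web server.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page paths that contain it."""
    index = defaultdict(set)
    for path, text in pages.items():
        for word in tokenize(text):
            index[word].add(path)
    return index


def search(index, query):
    """Return the sorted paths of pages containing every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)
```

Since every page holds exactly one record, a hit on a page is a hit on a catalog record, and the page's URL doubles as the persistent link to that record.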

It seemed too simple to be true, so we decided to try another experiment. By using XML (eXtensible Markup Language), we could maintain the richness of the MARC tagging. By using an XML-aware search engine, we would be able to support even more serious searching.

For our next experiment, we converted mainframe NOTIS MARC records to XML catalog pages that contained bibliographic information, holdings information, and more. The nature of XML allows the tagging and organization of multiple digital objects in a file. Since we were developing a web catalog and not a full ILS, we implemented a 'lite' version of Stanford's XMLMARC. This standard uses data elements like v245 for title and v100 for author. This work predated the Library of Congress scenario, whereby all elements are tagged with a numeric attribute to specify the MARC field. Stanford's tagging methodology gave us a way to make the data almost 'tag itself' upon conversion to XML catalog pages.
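
The 'tag itself' idea can be sketched as follows: each MARC tag number becomes an element name by prefixing it with v (XML element names cannot begin with a digit). Only the v100 and v245 style of naming comes from the description above; the surrounding record, bib, holdings, and item elements, the attribute names, and the function name are hypothetical and are not taken from Stanford's XMLMARC.

```python
import xml.etree.ElementTree as ET


def record_to_xml(number, fields, holdings):
    """Build one XML catalog page.

    fields   -- (MARC tag, text) pairs, e.g. ("245", "Moby Dick")
    holdings -- (location, call number) pairs
    """
    root = ET.Element("record", id=number)
    bib = ET.SubElement(root, "bib")
    for tag, text in fields:
        # The element name is derived mechanically from the MARC tag.
        ET.SubElement(bib, "v" + tag).text = text
    held = ET.SubElement(root, "holdings")
    for location, call_number in holdings:
        item = ET.SubElement(held, "item", location=location)
        item.text = call_number
    return ET.tostring(root, encoding="unicode")
```

Because the element names are computed rather than looked up, no per-field conversion table is needed, and every field in the source record survives into the page.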

Using an XSL style sheet with our properly tagged XML files, we gained control of the page's look and feel and preserved all of the data in the MARC records. We displayed and labeled citation elements of interest to the general user, such as author, title, imprint, call number, and subject. We kept all of the elements but did not, by default, display elements such as ISSN, ISBN, and LC card number. Those elements are searchable and displayable in the source of the catalog page. The actual layout and display format were the work of a combined committee of catalogers, reference librarians, and systems staff.

Further experiments with web spidering engines confirmed that it is possible to index two million XML catalog pages. There are scalability problems, however. We found one free engine that handled XML beautifully but was memory bound while building the index and thus limited to about 100,000 of our pages. Another free engine successfully indexed the whole two million-record site but exhibited slow response time when searching.

The Inktomi Enterprise Search Engine had the scalability and XML capability we were seeking. This is a commercially available product, used by thousands of commercial and academic web sites to provide basic search capability. It has recently been sold to Verity Corporation and will become part of its web software product line. This search engine supports the definition of multiple XML maps, allowing us, for example, to define author as referring to the appropriate MARC tag set, i.e., 100, 110, 700, 710, etc. The keyword engine offers true author searching, subject searching, title searching, and many other specific types of searches we set up. The engine can also handle multiple collections in the same search interface, with different tagging maps. We added an index to our web site as an additional collection.
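
The idea of a map from a search field to a set of MARC tags can be shown in plain Python. This is not Inktomi's configuration syntax, which is not reproduced here; it is a sketch that assumes catalog pages use element names such as v100 and v245, and the map and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# One named search field can cover several MARC tags.
FIELD_MAPS = {
    "author": {"v100", "v110", "v700", "v710"},
    "title": {"v245"},
}


def fielded_match(xml_text, field, term):
    """True if term occurs in any element that the named field maps to."""
    wanted = FIELD_MAPS[field]
    needle = term.lower()
    root = ET.fromstring(xml_text)
    return any(
        needle in (element.text or "").lower()
        for element in root.iter()
        if element.tag in wanted
    )
```

For example, a page containing `<v700>Melville, Herman</v700>` matches an author search for "melville" but not a title search, which is the distinction a plain keyword index cannot make.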

The simple interface

The latest generations of college students have grown up with the web and are highly familiar with Internet search engines. While we hammer away perfecting library web sites, our students are off using Google and Yahoo. Putting our content directly onto the surface web gets us into the result sets the students see.

Some librarians are defensive about competing with these popular search engines. They miss the browse lists of authorized headings pointing out correct terminology, the mainstay of traditional OPACs. They know the work that goes into cataloging and the value it adds.

In the NetCatalog, we have added live links to generate searches on authorized name and subject headings; these can be launched from results pages. Inktomi offers a 'find similar' button after each entry in a results list. We are eager to see if the artificial intelligence behind that function is appreciated by our users. We are planning to monitor and test user satisfaction with the new technology, and several projects are in development to analyze the effectiveness of the Internet search engine methodology. We will tune and adjust the search engine and look to future experiments with authority records.

Librarians working closely on the design of the NetCatalog have already pointed out some benefits. These stem from the fundamental design model. Because all of the metadata is tagged and included on a results document, any and all information can be combined in a search. Novel combinations of search criteria allow slicing and dicing that was formerly impossible in NOTIS. For example, location-based searching and call number searching can be combined with author, title, and other bibliographic selection. Limits by language, material type, and even physical attributes like book cover color are now possible. Searches and results pages can be easily saved and rerun. Web servers run 24/7, unlike mainframe systems, so for the first time we have round-the-clock operation of the catalog.

Future implications

The NetCatalog will be released on UB's web site as an experimental catalog for our university community during the spring 2003 semester. We will use it as an interim system until we finally implement our new ILS. NetCatalog may be fed from the new ILS, depending upon evaluation and popularity.

Beyond the catalog, we have already installed an electronic reserve system that provides controlled access to XML documents representing lists of class readings. This functionality is absent in the new ILS, which offers a web catalog architecture almost parallel to our old system. Evolution of the NetCatalog, as well as the new ILS, will ultimately set our course for implementing an official catalog.

Additional projects that might benefit from this methodology are being evaluated. After all, any catalog project could employ a similar paradigm. We are already considering digital objects beyond the bibliographic catalog.

We are on the threshold of a new systems development era that can readily speed up catalog deployment by using standards like the Library of Congress's MARCXML and Dublin Core along with developing XML technologies and free or low-cost tools. UB's experiments have shown that as disk capacities grow, we can return to a synergistic web paradigm and the original vision of a web for storing and retrieving information. If we model all content as tagged documents, in collections that are indexed on the open Internet, we can get beyond relational databases, proprietary ILS software, and even the impenetrable and buried deep web.


Author Information
Mark J. Ludwig (uldmjl@buffalo.edu) is Library Systems Manager, University Libraries, University at Buffalo, State University of New York

©2009 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.