Library Journal Mobile

Integrating Internet Content

Kevin Broun shows how RSS feeds can bring content into the library system

Kevin Broun -- netConnect, 10/15/2003

Just as much of the Internet sprang from grass-roots efforts, so grass-roots technologies for information management and delivery have become important tools for librarians to enhance service to their users. One now-mature technology is RSS, which provides web sites and web-based services with a way to disseminate simple bibliographic information about their content.

At the National Cancer Institute (NCI) of the National Institutes of Health (NIH), Bethesda, MD, we have used RSS both to integrate Internet content into the NCI library system and to make content from the library system available on our intranet in the form of RSS news feeds. This new content makes our library system a more useful and timely resource, allowing us to better 'feed' the information appetites of our clients, whose jobs require that they keep up with cancer and healthcare news, events, research, and politics. After the initial investment of time and technology, the information flows without requiring hands-on staff effort.

The NCI library is not a scientific library—the NIH library serves that role. Instead, we focus on the institute's communications functions, support such clients as the NCI Director's Office and the NCI Press Office, and document the history and activities of NCI. The collection holds over 62,000 items, including books and journals, published and internal documents, and artifacts.

RSS in brief

RSS is a somewhat less formal standard than those developed through Internet RFCs. In fact, it is so informal that there is no agreement on what the acronym stands for: Really Simple Syndication, RDF Site Summary, or Rich Site Summary. Whatever you call it, RSS is a simple dialect of XML, the eXtensible Markup Language.

As a grass-roots technology, RSS and its adoption have been driven by the growing community of web log creators, or 'bloggers.' Many blogging utilities support RSS as an easy way for bloggers to make their creations available beyond just their own web pages. The web log utilities create a corresponding XML file in a standard format, usually called a 'feed' or 'news feed,' which others across the Internet can read and subscribe to using 'aggregator' software that consumes one or more of the several RSS formats in use. As Roy Tennant, California Digital Library and LJ columnist, has described, this has become a standard medium for the Internet-wide exchange of simple metadata, i.e., links, titles, and descriptions of items on web sites.

This allows a web site, especially one with frequent updates, to 'tell' anyone who is interested about the new information on the site. RSS lets users keep up with a site without actually having to browse through the various locations of potential new information on the site. When an item of interest shows up in the RSS feed, a user can then follow the included link to see the full item.

Because it is standard XML, RSS is easy for many kinds of software to process. This lets users choose from many aggregator programs to create their own view of the web. Further, it enables a custom web site (like the NCI library site described below) to work behind the scenes to gather information about new content and then make the links available to users in a familiar format.

The first RSS versions were developed at Netscape Communications as part of its My Netscape portal project (my.netscape.com), with the first standards published in 1999. Development has continued in a somewhat fragmented way, and seven formats are currently in use. Most feeds are in the RSS 0.91, 1.0, or 2.0 formats. (For more about RSS standards and techniques, see Mark Pilgrim's 'What Is RSS?' and Ben Hammersley's Content Syndication with RSS.)

Homegrown advantage

Staff at NCI's library became interested in RSS because it would allow us to gather new content from around the Internet, enhance current awareness services, and make information about our collection available in a new way. In part, we were able to think about implementing RSS because of the library's unique technological environment.

In 2001, the NCI library undertook a modernization project with several goals: make the collection available online, facilitate the collection of electronic materials, offer personalized and proactive services, and increase usage. We couldn't find a commercial integrated library system (ILS) that offered the combination of personalized services and electronic document management we wanted. Given the size of our budget, collection, and user base, and because we did not require many typical ILS features (like circulation), we decided to develop our own system, LION (Library Online), a web application with a database management backend. The existing computing infrastructure at NCI facilitated the project, letting us build on NCI's Oracle RDBMS and ColdFusion web application servers without additional costs.

LION went live in early 2002, putting online NCI's collections and services, including an increasingly electronic collection (4000 local files and 2000 Internet links to date). Also, over 400 staff members are registered with LION and receive a variety of personalized services. Most heavily used are the daily current awareness service, NCI Current Clips, and our electronic license to the weekly Cancer Letter newsletter.

LION's open architecture and standard tools enable us to extend the system to provide new benefits. For example, with the creation of several ColdFusion templates to facilitate processing, LION now hosts NCI's database of cleared information—content that can be reused in public communications without going through additional approval processes.

Feeding the LION

What has made RSS compelling is not the spread of this technology among bloggers—their content is largely too ephemeral for our library. Rather, it's the adoption of RSS by commercial web site publishers. This makes content of interest to our users available in a new way. Even more attractive, this content is available for free—largely because the content is being distributed as metadata only, not full text. To see the full text, users follow the links back to the original sites, where publishers can fund this model through third-party advertising and marketing of their own 'premium' content, or simply meet their goals of increased site traffic.

On random explorations of the web, our staff discovered that Reuters and the BBC were making content available in RSS and that they had specific feeds available that focused on health. This motivated us to pursue this technology and discover additional feeds of interest. Using RSS aggregator software and online directories, including AmphetaDesk, NewsIsFree, and Syndic8, we selected an initial collection of feeds for LION. These included BBC News: Health, Moreover: Breast Cancer News, Moreover: Cancer News, New York Times: Health, and Reuters Health eLine.

In conjunction with collection development, a second track of the project was to develop LION's tools for processing the RSS feeds—to turn the raw XML data into information LION can store and our users can view. Processing XML is usually called 'parsing.' We needed to parse the information about web links—item titles, descriptions, and URLs—in the RSS feeds so that it was no longer in XML but instead in a simpler format LION could store in its database records and make available to our clients. This kind of parsing is similar to what a web browser does to take raw HTML code and display it as a nicely formatted page—though RSS parsing is much easier since only a few kinds of tags are allowed in an RSS document.
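The parsing step described above can be sketched in a few lines. This is only an illustration in Python, not the ColdFusion code LION uses; the sample feed, its URLs, and the function name parse_items are all invented for the example, and the sketch assumes an RSS 2.0 feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; the titles and URLs are invented for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Health News</title>
    <link>http://news.example.org/health/</link>
    <description>Sample health headlines</description>
    <item>
      <title>First headline</title>
      <link>http://news.example.org/health/1</link>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://news.example.org/health/2</link>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return the title, link, and description of each item as plain strings."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.findall("./channel/item"):
        items.append({
            # findtext returns None for a missing element, so fall back to "".
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items


parsed_items = parse_items(SAMPLE_RSS)
```

Each item ends up as a small record of plain strings, which is the kind of simpler format a database-backed system can store.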

Parsing the details

Because LION's structure doesn't natively understand XML, we realized we had to parse the XML data into a simplified HTML sequence of text and links. The system would then update the relevant item's bibliographic record to include this HTML code and update the system's text search index, to make this new information searchable in the user interface. For more on the process, see the LION RSS Architecture diagram.
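As a rough illustration of turning parsed items into a simplified HTML sequence of text and links, here is a Python sketch. It is not LION's ColdFusion code; the function name items_to_html and the sample items are hypothetical, and the choice to escape descriptions as plain text is an assumption made for the example.

```python
from html import escape


def items_to_html(items):
    """Render parsed feed items as a simple HTML list of links and text."""
    lines = ["<ul>"]
    for item in items:
        # Escape everything so stray markup in a feed cannot break the page.
        link = escape(item["link"], quote=True)
        title = escape(item["title"])
        entry = f'<li><a href="{link}">{title}</a>'
        if item.get("description"):
            entry += ": " + escape(item["description"])
        entry += "</li>"
        lines.append(entry)
    lines.append("</ul>")
    return "\n".join(lines)


# Hypothetical items, in the shape a feed parser might produce.
example_items = [
    {"title": "Trial results & analysis",
     "link": "http://news.example.org/1?a=1&b=2",
     "description": "A short summary."},
    {"title": "Second headline",
     "link": "http://news.example.org/2",
     "description": ""},
]
html_fragment = items_to_html(example_items)
```

Escaping is the conservative choice here: feed descriptions sometimes carry their own HTML, and a real system would have to decide whether to strip, keep, or escape it.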

Many technologies and methods are available for XML parsing, including built-in features or add-in modules for common web application environments such as Perl, PHP, ASP, and so on. Because LION was developed with ColdFusion, we elected to use the XML parsing capabilities of the ColdFusion MX server. Earlier ColdFusion versions can support XML through various extension mechanisms, but the MX built-in support has given us more features and better reliability.

We went through a substantial testing and debugging process on our development server to create reliable processing routines, overcoming such issues as RSS 1.0 feeds having a significantly different structure than those in either the 0.91 or 2.0 formats. Once satisfied with the reliability and formatting of the results, we moved the processing scripts to the production server. The resulting records look and operate like any other item record in LION—they can be searched, browsed, displayed, and saved in users' personal areas. The key difference is that the information is dynamic; we retain only the current information, as archival links would be broken quite quickly on the publishers' web sites.
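The structural difference between formats is easy to see in code. In RSS 1.0 the root element is rdf:RDF, the elements are namespaced, and items sit beside the channel; in 0.91 and 2.0 the root is rss and items sit inside the channel. The Python sketch below shows one way to handle both. It is an assumption-laden illustration (the function name extract_items and both sample feeds are invented), not the routine LION runs, and it does not attempt the other formats in use.

```python
import xml.etree.ElementTree as ET

RSS1_NS = "{http://purl.org/rss/1.0/}"
RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"

# Hypothetical sample feeds, one per structure.
RSS10_SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://news.example.org/rss">
    <title>Example 1.0 feed</title>
    <link>http://news.example.org/</link>
    <description>Sample</description>
  </channel>
  <item rdf:about="http://news.example.org/a">
    <title>Item A</title>
    <link>http://news.example.org/a</link>
  </item>
</rdf:RDF>"""

RSS20_SAMPLE = """<rss version="2.0"><channel>
  <title>Example 2.0 feed</title>
  <link>http://news.example.org/</link>
  <description>Sample</description>
  <item><title>Item B</title><link>http://news.example.org/b</link></item>
</channel></rss>"""


def extract_items(feed_text):
    """Return title/link pairs from an RSS 1.0 feed or an RSS 0.91/2.0 feed."""
    root = ET.fromstring(feed_text)
    if root.tag == RDF_ROOT:
        # RSS 1.0: namespaced items are direct children of the rdf:RDF root.
        prefix = RSS1_NS
        nodes = root.findall(prefix + "item")
    elif root.tag == "rss":
        # RSS 0.91 and 2.0: un-namespaced items live inside <channel>.
        prefix = ""
        nodes = root.findall("channel/item")
    else:
        raise ValueError(f"Unrecognized feed root element: {root.tag}")
    return [
        {
            "title": (node.findtext(prefix + "title") or "").strip(),
            "link": (node.findtext(prefix + "link") or "").strip(),
        }
        for node in nodes
    ]


items_v1 = extract_items(RSS10_SAMPLE)
items_v2 = extract_items(RSS20_SAMPLE)
```

Raising an error for an unrecognized root element, rather than returning an empty list, makes a changed or unsupported format visible instead of silently producing an empty record.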

To keep the records continually up-to-date, we added the RSS processing to the ColdFusion server's scheduled tasks. This runs the update routine several times each weekday and provides quality control by generating an email message to the LION administrators whenever an error occurs. We have some assurance that if an RSS feed goes offline, changes its format, or otherwise fails to process, we will know about it quickly and can then modify its parsing or remove it from the system.
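The pattern of a scheduled update that reports its own failures might look like the following Python sketch. It stands in for the ColdFusion scheduled task only conceptually: the function names, the feed list, the addresses, and the fake fetcher are all hypothetical, and the sketch builds the email message without sending it.

```python
from email.message import EmailMessage


def update_feeds(feeds, fetch_and_parse):
    """Process every feed; collect failures instead of stopping at the first one."""
    results, errors = {}, []
    for name, url in feeds.items():
        try:
            results[name] = fetch_and_parse(url)
        except Exception as exc:  # any failure is reported, not raised
            errors.append(f"{name} ({url}): {exc}")
    return results, errors


def build_error_report(errors, sender, recipient):
    """Build an email describing the failures, or return None if there were none."""
    if not errors:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"RSS update: {len(errors)} feed(s) failed"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(errors))
    return msg


def fake_fetch(url):
    """Hypothetical stand-in for downloading and parsing a feed."""
    if "broken" in url:
        raise ValueError("not well-formed XML")
    return [{"title": "ok", "link": url}]


demo_feeds = {
    "Good feed": "http://news.example.org/good.xml",
    "Broken feed": "http://news.example.org/broken.xml",
}
results, errors = update_feeds(demo_feeds, fake_fetch)
report = build_error_report(errors, "lion@example.org", "admins@example.org")
```

One failing feed does not block the others, and the administrators hear about it in a single message. Actually delivering the report could be done with smtplib's SMTP.send_message, given a mail server to connect to.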

LION's RSS capabilities have recently gone online, so we are still marketing the service and educating our users. LION includes an online help system, so the start of our education efforts is the Newsfeeds Help page. This provides a quick overview of the RSS concept, links to the item records of all of our news feeds, and some ideas on how our users can keep up with the news feed content. We have also publicized the service through management updates that go out to about 150 communications staff members and include news feeds in the 'Introduction to LION' presentations given to workgroups around NCI.

Publishing LION content as feeds

Another exciting aspect of this project is the chance to use the RSS technology bidirectionally: in addition to bringing content into LION, we also publish or syndicate content from LION in RSS format out to our users. While the use of RSS news reader or aggregator software among NCI staff is presently minimal, we feel that this is an opportunity to be on the leading edge in demonstrating new technology. It also gives us the chance to learn more about XML.

When creating RSS feeds, the XML learning curve is quite modest because of the simplicity of the format (once you settle on which one to use!). We decided to work in the RSS 2.0 format, since it is current, supported by most RSS aggregators, and compatible with the earlier 0.91 format still in use. Considering how aggregators are used—primarily to keep up-to-date with content from other sites—we elected to create two feeds that offer 'current awareness' about content on the LION system: a snapshot of recently added items in our collection and the latest headlines from NCI Current Clips.

Developing the ColdFusion scripts that produce these feeds was relatively simple: we could reuse code, particularly SQL database queries, from other ColdFusion templates. For example, the script that produces the snapshot of recently added items uses basically the same query as the Recent Additions section of LION's homepage. Instead of wrapping the results in HTML tags, we put them in RSS 2.0 tags, with document headers and footers appropriate for RSS and XML, replacing LION's normal web layout. Likewise, the Current Clips RSS feed reuses the query of an existing template that sends out email notifications to our Clips subscribers. Samples of the RSS code produced for each feed are available; see the Link List for details.
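To show how little is involved in wrapping query results in RSS 2.0 tags, here is a minimal Python sketch. It is not the ColdFusion template described above: the function name build_rss, the row fields (title, url, summary), and the sample data are assumptions standing in for whatever a real query returns. It assumes Python 3.8 or later for the xml_declaration argument.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, channel_description, rows):
    """Wrap query-style rows in RSS 2.0 tags and return the feed as UTF-8 bytes."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_description
    for row in rows:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = row["title"]
        ET.SubElement(item, "link").text = row["url"]
        ET.SubElement(item, "description").text = row["summary"]
    # ElementTree escapes characters such as & and < in the text automatically.
    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)


# Hypothetical rows standing in for the result of a "recent additions" query.
recent_rows = [
    {"title": "Grants & Funding Report",
     "url": "http://library.example.org/item/101",
     "summary": "Annual summary of funding activity."},
    {"title": "Press Kit",
     "url": "http://library.example.org/item/102",
     "summary": "Background materials for reporters."},
]
feed_bytes = build_rss(
    "Recently Added Items",
    "http://library.example.org/",
    "A snapshot of items recently added to the collection.",
    recent_rows,
)
```

Building the document through an XML library rather than by concatenating strings means titles containing ampersands or angle brackets cannot produce a malformed feed.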

As with bringing Internet feeds into LION, we are also trying to support users who want to access the new RSS-formatted LION feeds. The Newsfeeds Help page provides information about the LION feeds, including URLs, descriptions, and basic information about aggregator software. Though our role is not information technology support, as early implementers of RSS in the NCI environment we provide some user assistance. As part of upcoming marketing and research efforts, we plan to discuss these technologies with our clientele, to find and work with early adopters in the community, and to spread the word to others about the potential RSS offers to keep current.

The to-do list

As with most any project, development of LION's initial RSS implementations has fostered a substantial list of additional ideas and plans.

The search for additional content is ongoing. Given the lack of a single, authoritative directory of RSS feeds, we need to scour the Internet on a continuing basis for health- and cancer-related content. The blogging community commonly asks web sites to offer an RSS feed of their headlines or new content. We will ask health-related sites that do not yet offer RSS feeds to do so and will also encourage the managers of NCI's web site, cancer.gov, to make this service available.

On the technology side, we plan to do more to deliver the news feed information directly to our users. At this point, RSS news feed content is only available by coming to the LION site, while we email other information to our users on a proactive basis. We plan to offer a daily update of all the new headlines gathered from our RSS news feeds as a supplement to Current Clips. This push service, currently in testing on our development server, will align news feeds with our other current awareness offerings and allow us to educate users, co-market the two services, and increase usage of both Clips (as the 'hand-selected premium content') and news feeds (as the 'snapshot of news from around the Net'). This will require some additional architecture on the LION system, since we will build and maintain a database table of headlines as they arrive in the feeds, allowing us to send out updates that list only the new items.
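The idea of a headlines table that lets an update list only the new items can be sketched as follows. This Python example uses an in-memory SQLite database purely for illustration; LION runs on Oracle, and the table name, column names, and function names here are hypothetical. It also assumes that an item's link is a good enough key for deciding whether a headline has been seen before.

```python
import sqlite3


def open_headline_store(path=":memory:"):
    """Create (if needed) a table that remembers every headline already seen."""
    store = sqlite3.connect(path)
    store.execute(
        "CREATE TABLE IF NOT EXISTS headlines ("
        " link TEXT PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " first_seen TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return store


def record_new_items(store, items):
    """Store items not seen before and return only those new items."""
    new_items = []
    for item in items:
        seen = store.execute(
            "SELECT 1 FROM headlines WHERE link = ?", (item["link"],)
        ).fetchone()
        if seen is None:
            store.execute(
                "INSERT INTO headlines (link, title) VALUES (?, ?)",
                (item["link"], item["title"]),
            )
            new_items.append(item)
    store.commit()
    return new_items


# Hypothetical headlines arriving in two successive update runs.
story_a = {"title": "Story A", "link": "http://news.example.org/a"}
story_b = {"title": "Story B", "link": "http://news.example.org/b"}
story_c = {"title": "Story C", "link": "http://news.example.org/c"}

headline_store = open_headline_store()
first_run = record_new_items(headline_store, [story_a, story_b])
second_run = record_new_items(headline_store, [story_b, story_c])
```

On the second run only Story C comes back, so a daily email built from that result would not repeat Story B.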

Behind the scenes, we also want to capture data about the usage of links from the RSS items, as well as the other 2000 links in LION item records. As we have already done for locally archived electronic files, we will add a click-through script and corresponding database tables, so that a click on a LION link records a usage tick in the database before the user is redirected to the intended location.
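A click-through script of this kind reduces to two steps: record the click, then hand back the destination. The Python sketch below illustrates the logic with an in-memory SQLite database; the table names, column names, and function names are hypothetical, and the web layer that would issue the actual HTTP redirect is left out.

```python
import sqlite3


def open_link_store():
    """Create hypothetical tables for links and for one row per click."""
    store = sqlite3.connect(":memory:")
    store.execute(
        "CREATE TABLE links (link_id INTEGER PRIMARY KEY, url TEXT NOT NULL)"
    )
    store.execute(
        "CREATE TABLE link_clicks ("
        " link_id INTEGER NOT NULL REFERENCES links(link_id),"
        " clicked_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return store


def click_through(store, link_id):
    """Record one usage tick for link_id and return the URL to redirect to."""
    row = store.execute(
        "SELECT url FROM links WHERE link_id = ?", (link_id,)
    ).fetchone()
    if row is None:
        return None  # unknown link: a web handler would answer "not found"
    store.execute("INSERT INTO link_clicks (link_id) VALUES (?)", (link_id,))
    store.commit()
    return row[0]


link_store = open_link_store()
link_store.execute(
    "INSERT INTO links (link_id, url) VALUES (1, 'http://news.example.org/story')"
)
target = click_through(link_store, 1)
click_through(link_store, 1)
missing = click_through(link_store, 99)
click_count = link_store.execute(
    "SELECT COUNT(*) FROM link_clicks WHERE link_id = 1"
).fetchone()[0]
```

Storing one row per click, rather than a running counter, keeps the timestamps, so usage can later be summarized by day or week.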

Finally, it is crucial that we keep up with developments in the RSS arena and in any related technologies. For example, ongoing controversies over the various RSS formats have spawned the development of a new format, variously called Atom, Pie, and Echo—all of which have turned out to be legally unsuitable names. This is such a grass-roots movement that, as of this writing, voting is taking place on what to rename the new format. Whatever the name, the format is expected to resolve bugs in and conflicts between the RSS formats, provide specifics on how information is encoded in a syndicated feed, and support multiple languages, among other improvements.

As we've seen repeatedly during the last decade of the web's development, library services can greatly benefit if we learn about and take advantage of new technologies as they evolve—even if we can't always anticipate where this evolution may take us next.


Link List
LION RSS Feed Samples

Health/Cancer Newsfeeds Selected for LION

More About RSS


Author Information
Kevin Broun (kbroun@nih.gov) is Senior Web Developer and Lead, Electronic Information and Library Services, Communication Services Branch, National Cancer Institute, Bethesda, MD


Acknowledgements
References to commercial products do not imply endorsement by the National Cancer Institute. Thanks to my colleagues Lori Frederick and Judy Grosberg for their comments on this article.

 

Vendor Support for RSS

A few commercial ILS vendors are beginning to support RSS for integrating content into their systems. In a recent survey of 12 large system vendors, three expressed interest. While Brodart has no current support for RSS, it is considering offering XML output from Amlib databases in a future release, with a timeframe of at least 12 months. Dynix responded by highlighting its Horizon Information Portal product, which includes XML support and could, in the firm's view, 'handle' RSS content.

The most substantial response came from Sirsi. Its Sirsi Rooms product can integrate RSS feeds with many other types of content into appropriate 'rooms,' which libraries using this software would typically organize by subject area. While Sirsi doesn't natively support RSS exports or publishing, the Unicorn application program interfaces (APIs) would allow local development of RSS feeds.

For vendors that want to be innovative by integrating this new kind of content, the potential to be on, or at least near, the leading edge is wide open.

©2009 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.