Delving into Data
Businesses have used data mining for years. Now libraries are getting into the act
By Kevin Cullen -- Library Journal, 08/15/2005
Corporations employ data mining to analyze operations, find trends in recorded information, and look for new opportunities. Libraries are no different. Librarians manage large stores of data—about collections and usage, for example—and we also want to analyze this data to serve our users better.
Analysts use data mining to query a data warehouse for patterns that a human couldn't manually spot. For example, an online vendor or credit card company can map product types against zip codes, shipping preferences, time of day, and card expiration dates to flag potentially fraudulent purchases. (For more on how businesses use data mining, see "We're Mining You," p. 31.)
Data mining is performed in a data warehouse. While an operational database system like an integrated library system (ILS) is optimized for processing transactions (circulation, purchases, cataloging, etc.), a data warehouse is optimized for analysis. This makes it easier to find patterns and avoids bogging down the transactional system.
Why bother?
Recently, Katharine Treptow Farrell and Marc Truitt wrote that the Princeton University Library could not answer the question, "What journals do we subscribe to that [a specific company] publishes and what exactly are their price increases over previous years?" ["The Case for Acquisitions Standards in the Integrated Library System," Library Collections, Acquisitions, and Technical Services, 27(4), pp. 483–492]. It's a fundamental management question, yet the answer is elusive.
Dan Walters, executive director of the Las Vegas–Clark County Library District, NV, points out the need to take this kind of question further by mapping data from the ILS against other sources. For example, Walters would like to know what kinds of ebooks are moving and to what types of users. But currently he can't map circulation by media formats against types of users.
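The kind of mapping Walters describes, circulation by media format against type of user, is at heart a cross-tabulation. The following is a minimal sketch in Python, not any vendor's method: the record layout, the format names, and the user categories are all invented for illustration, and a real ILS export would look different.

```python
from collections import Counter

# Hypothetical circulation records: (media_format, user_type).
# These values are illustrative and do not come from any real ILS.
checkouts = [
    ("ebook", "adult"),
    ("ebook", "teen"),
    ("print", "adult"),
    ("ebook", "adult"),
    ("dvd", "senior"),
    ("print", "teen"),
]


def cross_tabulate(records):
    """Count checkouts for each (media format, user type) pair."""
    return Counter(records)


table = cross_tabulate(checkouts)
for (media_format, user_type), count in sorted(table.items()):
    print(f"{media_format:6} {user_type:7} {count}")
```

The hard part in practice is not the counting but getting format and user-type fields out of the ILS in the same record, which is the obstacle the directors quoted here describe.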
Sam Clay, director of the Fairfax County Public Library, VA, also believes that libraries need to use more quantitative decision-making tools. Working as assistant to a city manager convinced him. "[We] have to apply business models to what we do," Clay says. "Don't think, feel, or intuit. Do it because of what you know." His library has gathered roughly 20 years of trend data that it uses to make decisions in an aggressive, entrepreneurial way.
The challenges
Data mining is just emerging as a library practice. Scott Nicholson of Syracuse University's School of Information Studies, NY, even coined the term "bibliomining" because a literature search with the phrase "data mining and libraries" doesn't return hits about data mining in libraries. Instead, searchers find technical discussions about code libraries, which software programmers use to perform data mining.
Terminology is not the only barrier to data mining within libraries (see "A Data Mining Glossary," p. 32). While some ILS vendors are creating data-mining solutions, others don't see the need. Innovative Interfaces' Gene Shimshock, VP of marketing, and Betsy Graha, VP of product management, feel their product has strong statistical reporting capabilities built in and that it's already easy to get good information. They're not hearing much demand for enhancements.
Nevertheless, built-in reports cannot always answer the questions likely to emerge in a dynamic modern organization. Las Vegas's Walters wants a broad variety of management data about his users and his inventory—to understand better how customers use resources. "The current products are very limited in that regard, but that's symptomatic of the thrust of most of the current systems," Walters says. "They don't try to maximize use of inventory."
Cracking the black box
Corporations automate the move of information from their enterprise resource planning (ERP) systems into data warehouses. While ERP systems make ILS software seem trivial in scope, libraries naturally look to the ILS as a starting point when planning a warehouse. An ILS holds data about our collections, circulation, users, purchases, database use, and even gate counts. Unfortunately, libraries often find that their transactional data is locked up in the ILS "black box" and can only be viewed with vendor-supplied reports. Joe Zucca, assessment, planning, and publications librarian at the University of Pennsylvania Library, said that without open database connectivity to their Endeavor Voyager system, they would have been "sunk" when trying to create their data warehouse.
Having an ILS built on a relational database doesn't necessarily make it simple to create a data warehouse. To make data mining feasible, the information must be accessible from external systems, or exportable on an automated basis and be in well-designed structures. Traffic on the discussion group for one vendor's product is full of complaints: for all the good the tables do, the information might as well be in flat files.
Andrew Pace, head of systems at North Carolina State University Libraries, Raleigh, attributes this "artificial data structure in tables" to the rush by vendors to implement relational databases after Endeavor built its Voyager system. Las Vegas's Walters says he doesn't think any of the ILS systems use robust relational databases and that vendors have only deployed relational database underpinnings because universities want uniformity across systems. "It's not that we intrinsically want a relational database but that we want certain functionality," Walters says. But "we haven't been effective in stating our needs."
Early implementers
While library software vendors don't always make it easy to perform data mining, there are success stories.
Two Endeavor Voyager libraries, University of Waterloo, Ontario, and University of Pennsylvania, Philadelphia, have developed robust data warehouses that are solving real-world problems. Back in 1998, when the librarians at Waterloo learned the university had a site license for business intelligence software from Cognos, they started working with the system. Gail Sperling, systems analyst, explains that most of their data comes from Voyager's Oracle database. She extracts data weekly into Cognos PowerPlay's "power cubes," which can then create analytical reports from desktop and web clients. Waterloo staff also employ a separate Cognos web interface to run Impromptu® reports against the live Voyager database.
Sperling points out that PowerPlay cubes are static data, so staff queries don't hit the live Voyager system and slow response times for users. Because the system is optimized for analytical data rather than transactions, they usually get subsecond response time on highly complex queries. Sperling likes PowerPlay's "Drill Through" feature. "Imagine you've looked at how many items in a certain location were added to the collection in a certain date range in the LC class QA and circulated more than ten times," she says. "PowerPlay would give a number of items that met the criteria. Drill Through allows you to view the items that actually meet the criteria, rather than just the number."
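The drill-through idea in Sperling's example can be illustrated without Cognos at all. The sketch below uses Python's built-in sqlite3 module as a stand-in for the warehouse; it is not how PowerPlay works internally, and the table name, columns, titles, and dates are hypothetical. The point is only that the summary count and the item-level view come from the same criteria.

```python
import sqlite3

# In-memory stand-in for an extracted, static copy of item data.
# Table layout and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (title TEXT, location TEXT, lc_class TEXT, "
    "added TEXT, circ_count INTEGER)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?, ?)",
    [
        ("Linear Algebra", "Main", "QA", "2003-02-10", 14),
        ("Number Theory", "Main", "QA", "2003-06-01", 3),
        ("Topology Notes", "Main", "QA", "2004-01-15", 11),
        ("Organic Chemistry", "Main", "QD", "2003-03-20", 25),
        ("Calculus", "Annex", "QA", "2003-05-05", 30),
    ],
)

# One set of criteria: location, LC class, date range, minimum circulation.
CRITERIA = (
    "location = ? AND lc_class = ? "
    "AND added BETWEEN ? AND ? AND circ_count > ?"
)
params = ("Main", "QA", "2003-01-01", "2004-12-31", 10)

# Summary level: how many items meet the criteria.
summary_count = conn.execute(
    f"SELECT COUNT(*) FROM items WHERE {CRITERIA}", params
).fetchone()[0]

# Drill through: the items behind that number.
detail_titles = [
    row[0]
    for row in conn.execute(
        f"SELECT title FROM items WHERE {CRITERIA} ORDER BY title", params
    )
]

print(summary_count, detail_titles)
```

Because the queries run against an extracted copy rather than the live circulation system, analysis of this kind does not compete with patrons for response time.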
Waterloo's Linda Teather, manager of library systems support services, agrees that Cognos has been valuable. The data always was available but not at her fingertips. An example of valuable decision-making information that is easier to get with data mining? "The big one is external reports," Teather says, such as the annual insurance evaluation for each library building.
The Fairfax County Public Library is taking advantage of Director's Station, a data-mining and analysis tool that SirsiDynix developed with SwiftKnowledge. Clay uses Director's Station for management decisions, grant applications, and marketing. He says the library has begun integrating information from outside the ILS, including data from its accounting system and demographic information. Data is recoded so every expense can be charged back to a branch, allowing Clay to determine the cost effectiveness of various outlets.
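Charging every expense back to a branch makes a simple cost-effectiveness measure possible. The sketch below shows one such measure, cost per checkout, in Python. It is an assumption about what such a calculation might look like, not a description of Director's Station; the branch names, expense categories, and figures are invented.

```python
from collections import defaultdict

# Hypothetical expenses already recoded to a branch: (branch, category, amount).
expenses = [
    ("Branch A", "staff", 9000.0),
    ("Branch A", "materials", 3000.0),
    ("Branch B", "staff", 6000.0),
    ("Branch B", "materials", 2000.0),
]

# Hypothetical checkout totals per branch for the same period.
checkouts = {"Branch A": 6000, "Branch B": 2000}


def cost_per_checkout(expense_rows, checkout_totals):
    """Sum each branch's expenses and divide by its checkouts."""
    totals = defaultdict(float)
    for branch, _category, amount in expense_rows:
        totals[branch] += amount
    return {
        branch: totals[branch] / checkout_totals[branch]
        for branch in checkout_totals
    }


result = cost_per_checkout(expenses, checkouts)
for branch, cost in sorted(result.items()):
    print(f"{branch}: ${cost:.2f} per checkout")
```

Checkouts are only one possible denominator; visits, program attendance, or database sessions could be substituted in the same way.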
Down on the farm
The University of Pennsylvania's Data Farm tries to have a body of unprocessed data available for "when the question comes up, without knowing the question in advance," says Zucca. The Penn Data Farm contains a wide variety of information about the organization and its activities, transactions, and users. Zucca and his colleagues can find elusive answers that allow the library to serve its users better, spend funds more efficiently, or just do more with less, a frequent expectation these days. Zucca also sees a future in using the Data Farm to perform more workflow analysis.
Zucca provided an entertaining example about his library's rental of lockers. Students complained that the library wasn't supporting their locker needs, and front-line managers were hesitant to do anything. Zucca's team separated trend data from the Data Farm's raw feed of circulation statistics. They performed a quick five-year longitudinal study that confirmed that use was increasing—but only a third of the locker key inventory was circulating and the number was going down. The numbers helped them discover that locker keys were disappearing. Time required for the initial analysis? Less than 30 minutes.
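A longitudinal check like the locker-key study comes down to two yearly measures: total loans, and the share of the inventory that actually circulated. The Python sketch below illustrates the shape of that analysis under stated assumptions. The inventory size, years, and loan counts are invented and are not Penn's data; they were chosen only so that loans rise while the number of distinct keys in circulation falls.

```python
from collections import defaultdict

KEY_INVENTORY = 12  # hypothetical number of locker keys owned

# Hypothetical pattern: (year, distinct keys loaned, loans per key).
pattern = [(2001, 8, 2), (2002, 7, 3), (2003, 6, 4), (2004, 5, 5), (2005, 4, 7)]

# Expand the pattern into raw loan records of the form (year, key_id).
loans = [
    (year, key_id)
    for year, keys, repeats in pattern
    for key_id in range(1, keys + 1)
    for _ in range(repeats)
]


def yearly_trend(loan_records, inventory):
    """Return (year, total loans, share of inventory circulating) per year."""
    loans_per_year = defaultdict(int)
    keys_per_year = defaultdict(set)
    for year, key_id in loan_records:
        loans_per_year[year] += 1
        keys_per_year[year].add(key_id)
    return [
        (year, loans_per_year[year], len(keys_per_year[year]) / inventory)
        for year in sorted(loans_per_year)
    ]


trend = yearly_trend(loans, KEY_INVENTORY)
for year, total, share in trend:
    print(f"{year}: {total} loans, {share:.0%} of keys circulating")
```

Rising loans spread over a shrinking set of keys is the kind of pattern that points to missing inventory rather than falling demand.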
The vendor lineup
Some vendors see the need for data mining, while others are more cautious. Dynix, now SirsiDynix, developed Web Reporter, which runs as a standalone application for both Horizon and Corinthian users. Web Reporter was built on software from MicroStrategy, a business intelligence vendor. Product manager Brian Rawlings said Dynix chose to partner with MicroStrategy for several reasons, including the ability to restrict access to nearly every element of the software, "boardroom quality" reporting, automatic report generation and delivery, and a web-based interface for end users.
As mentioned, SirsiDynix also has teamed up with SwiftKnowledge to create its Director's Station product, which is touted as a robust solution. Tom Gates, VP of marketing at SirsiDynix, said the company "didn't want to reinvent the wheel" and that SwiftKnowledge provided them with online analytical processing (OLAP) tools and data-cube technology. Director's Station migrates data out of the ILS and can include external information. According to Gates, "If the data can be mapped into the database, then it can be used."
SirsiDynix is also involved in the Normative Data Project (NDP), a way to merge data about many different libraries into a normalized database to see industry norms. For instance, the NDP could be used to plan a new branch library in a growing library district, allowing a director to find other libraries with similar demographics and identify the services and collections that have succeeded.
Innovative Interfaces customers are limited to Report Writer's use of data within the ILS. Innovative's XML Server has some potential to ease the process of automated data extraction, but it is currently limited to bibliographic and authority data. Innovative's move to MySQL databases for some statistics holds promise for customers willing to roll out their own data-mining solutions, but Ted Fons, product manager, says, "There are no specific plans" to open these databases to access from external programs.
Endeavor's partnership with business intelligence vendor Cognos is focused on its Meridian electronic resources management (ERM) product. At this stage, it's designed to load usage statistics from vendors and allow reports to be run from within Meridian. Sara Randall, director of strategic products, says that market research showed that statistics and usage were important as a part of ERM. "[The] initial license is for Meridian, but we plan to move it across all our products," Randall says.
It's easy to argue that vendors have reason to move slowly. Las Vegas's Walters is sympathetic with the vendor contention that there isn't a willingness to pay for the functionality we demand. "We're…seeing a squeeze on budgets, and the amount of money we're willing to invest in IT is under pressure."
From edge to center
The immediate future may see data mining develop on the fringes, with an emphasis on analyzing usage of licensed resources. North Carolina's Pace says, "ERM is huge…. You need a system that does serious evaluation of serials because that's where we're spending our money."
For libraries with limited technology resources, products like the forthcoming usage statistics software from MPS Technologies (a Macmillan subsidiary) may help solve the ERM portion of the problem. Product manager Martha Sedgwick says the software will initially work with COUNTER-compliant statistics, but the company knows that "for many libraries, receiving consolidated reports that include non-COUNTER-compliant data is also very useful, and we are looking…to include these."
Pace thinks data mining is coming, however. He hopes that we develop standards compliance for the final warehouse format because "librarians are going to want to compare." Fairfax's Clay agrees, stating, "The major gap is industry standards. We don't have any."
Librarians want a solution that will integrate many types of information into a system that allows them to analyze their usage, expenditures, customer base, collections, and more. The ILS is just one source of that information, though it is clearly the largest source—and ILS vendors are the best situated to create data-mining solutions.
Until vendors come up with relatively easy solutions, data mining probably isn't for every library. When the time comes to consider a system, Fairfax's Clay offers some advice. Think about what you need to know more about. "What are your needs and requirements? Use that as a template for judging products. Finally, is it worth the investment?" Many, however, would argue that it's an investment we can no longer ignore.
Author Information
Kevin Cullen is Digital Projects Librarian, Colorado State University, Fort Collins. In September, he will begin a year-long fellowship at the Republic of Ireland's Marine Institute, Oranmore, County Galway.







