
Delving into Data

Businesses have used data mining for years. Now libraries are getting into the act

By Kevin Cullen -- Library Journal, 8/15/2005

Corporations employ data mining to analyze operations, find trends in recorded information, and look for new opportunities. Libraries are no different. Librarians manage large stores of data—about collections and usage, for example—and we also want to analyze this data to serve our users better.

Analysts use data mining to query a data warehouse for patterns that a human couldn't manually spot. For example, an online vendor or credit card company can map product types against zip codes, shipping preferences, time of day, and card expiration dates to flag potentially fraudulent purchases. (For more on how businesses use data mining, see "We're Mining You," below.)

Data mining is performed in a data warehouse. While an operational database system like an integrated library system (ILS) is optimized for processing transactions (circulation, purchases, cataloging, etc.), a data warehouse is optimized for analysis. Keeping the two separate makes it easier to find patterns and avoids bogging down the transactional system.
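The split between a transactional system and a warehouse can be illustrated with a small sketch. It assumes a hypothetical `loans` table standing in for ILS circulation records and a hypothetical `loan_summary` table standing in for the warehouse; neither reflects the schema of any real ILS, and SQLite is used only because it ships with Python.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE loans (          -- transactional: one row per checkout
        loan_id   INTEGER PRIMARY KEY,
        item_type TEXT,
        branch    TEXT,
        loan_date TEXT            -- ISO date, e.g. 2005-03-14
    );
    CREATE TABLE loan_summary (   -- analytical: one row per month/branch/type
        month     TEXT,
        branch    TEXT,
        item_type TEXT,
        loans     INTEGER
    );
""")
conn.executemany(
    "INSERT INTO loans (item_type, branch, loan_date) VALUES (?, ?, ?)",
    [
        ("book", "Main", "2005-03-02"),
        ("book", "Main", "2005-03-19"),
        ("dvd", "Main", "2005-03-21"),
        ("book", "East", "2005-04-05"),
    ],
)

def refresh_summary(conn):
    """Rebuild the summary table from the transactional rows."""
    conn.execute("DELETE FROM loan_summary")
    conn.execute("""
        INSERT INTO loan_summary (month, branch, item_type, loans)
        SELECT substr(loan_date, 1, 7), branch, item_type, COUNT(*)
        FROM loans
        GROUP BY substr(loan_date, 1, 7), branch, item_type
    """)
    conn.commit()

refresh_summary(conn)

# Analysts query the small summary table, not the live loans table.
summary = conn.execute(
    "SELECT month, branch, item_type, loans FROM loan_summary "
    "ORDER BY month, branch, item_type"
).fetchall()
print(summary)
```

In practice the refresh would run on a schedule against a copy of the data, so that analytical queries never compete with circulation transactions.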

Why bother?

Recently, Katharine Treptow Farrell and Marc Truitt wrote that the Princeton University Library could not answer the question, "What journals do we subscribe to that [a specific company] publishes and what exactly are their price increases over previous years?" ["The Case for Acquisitions Standards in the Integrated Library System," Library Collections, Acquisitions, and Technical Services, 27(4), pp. 483–492]. It's a fundamental management question, yet the answer is elusive.

Dan Walters, executive director of the Las Vegas–Clark County Library District, NV, points out the need to take this kind of question further by mapping data from the ILS against other sources. For example, Walters would like to know what kinds of ebooks are moving and to what types of users. But currently he can't map circulation by media formats against types of users.
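The kind of cross-tabulation Walters describes is simple once each circulation record carries both a media format and a patron category. The following is a minimal sketch in Python; the record layout and sample values are hypothetical and are not drawn from any ILS.

```python
from collections import Counter

# Hypothetical checkout records: (media_format, patron_type).
checkouts = [
    ("ebook", "adult"),
    ("ebook", "teen"),
    ("print", "adult"),
    ("ebook", "adult"),
    ("dvd", "senior"),
    ("print", "teen"),
]

def crosstab(records):
    """Count checkouts for every (format, patron type) pair."""
    return Counter(records)

def format_table(counts):
    """Render the counts as a text grid: formats down, patron types across."""
    formats = sorted({fmt for fmt, _ in counts})
    patrons = sorted({ptype for _, ptype in counts})
    lines = ["format".ljust(8) + "".join(p.rjust(8) for p in patrons)]
    for fmt in formats:
        # A Counter returns 0 for pairs that never occurred.
        row = "".join(str(counts[(fmt, p)]).rjust(8) for p in patrons)
        lines.append(fmt.ljust(8) + row)
    return "\n".join(lines)

counts = crosstab(checkouts)
print(format_table(counts))
```

The hard part in a real library is not the counting but getting format and patron-type fields out of the ILS and into the same record.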

Sam Clay, director of the Fairfax County Public Library, VA, also believes that libraries need to use more quantitative decision-making tools. Working as assistant to a city manager convinced him. "[We] have to apply business models to what we do," Clay says. "Don't think, feel, or intuit. Do it because of what you know." His library has gathered roughly 20 years of trend data that it uses to make decisions in an aggressive, entrepreneurial way.

The challenges

Data mining is just emerging as a library practice. Scott Nicholson of Syracuse University's School of Information Studies, NY, even coined the term "bibliomining" because a literature search with the phrase "data mining and libraries" doesn't return hits about data mining in libraries. Instead, searchers find technical discussions about code libraries, which software programmers use to perform data mining.

Terminology is not the only barrier to data mining within libraries (see "A Data Mining Glossary," below). While some ILS vendors are creating data-mining solutions, others don't see the need. Innovative Interfaces' Gene Shimshock, VP of marketing, and Betsy Graha, VP of product management, feel their product has strong statistical reporting capabilities built in and that it's already easy to get good information. They're not hearing much demand for enhancements.

Nevertheless, built-in reports cannot always answer the questions likely to emerge in a dynamic modern organization. Las Vegas's Walters wants a broad variety of management data about his users and his inventory—to understand better how customers use resources. "The current products are very limited in that regard, but that's symptomatic of the thrust of most of the current systems," Walters says. "They don't try to maximize use of inventory."

Cracking the black box

Corporations automate the move of information from their enterprise resource planning (ERP) systems into data warehouses. While ERP systems make ILS software seem trivial in scope, libraries naturally look to the ILS as a starting point when planning a warehouse. An ILS holds data about our collections, circulation, users, purchases, database use, and even gate counts. Unfortunately, libraries often find that their transactional data is locked up in the ILS "black box" and can only be viewed with vendor-supplied reports. Joe Zucca, assessment, planning, and publications librarian at the University of Pennsylvania Library, said that without open database connectivity to their Endeavor Voyager system, they would have been "sunk" when trying to create their data warehouse.

Having an ILS built on a relational database doesn't necessarily make it simple to create a data warehouse. To make data mining feasible, the information must be accessible from external systems, or exportable on an automated basis and be in well-designed structures. Traffic on the discussion group for one vendor's product is full of complaints: for all the good the tables do, the information might as well be in flat files.

Andrew Pace, head of systems at North Carolina State University Libraries, Raleigh, attributes this "artificial data structure in tables" to the rush by vendors to implement relational databases after Endeavor built its Voyager system. Las Vegas's Walters says he doesn't think any of the ILS systems use robust relational databases and that vendors have only deployed relational database underpinnings because universities want uniformity across systems. "It's not that we intrinsically want a relational database but that we want certain functionality," Walters says. But "we haven't been effective in stating our needs."

Early implementers

While library software vendors don't always make it easy to perform data mining, there are success stories.

Two Endeavor Voyager libraries, University of Waterloo, Ontario, and University of Pennsylvania, Philadelphia, have developed robust data warehouses that are solving real-world problems. Back in 1998, when the librarians at Waterloo learned the university had a site license for business intelligence software from Cognos, they started working with the system. Gail Sperling, systems analyst, explains that most of their data comes from Voyager's Oracle database. She extracts data weekly into Cognos PowerPlay's "power cubes," which can then create analytical reports from desktop and web clients. Waterloo staff also employ a separate Cognos web interface to run Impromptu® reports against the live Voyager database.

Sperling points out that PowerPlay cubes are static data, so staff queries don't hit the live Voyager system and slow response times for users. Because the system is optimized for analytical data rather than transactions, they usually get subsecond response time on highly complex queries. Sperling likes PowerPlay's "Drill Through" feature. "Imagine you've looked at how many items in a certain location were added to the collection in a certain date range in the LC class QA and circulated more than ten times," she says. "PowerPlay would give a number of items that met the criteria. Drill Through allows you to view the items that actually meet the criteria, rather than just the number."
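The idea behind drilling through, moving from a summary count to the rows behind it, can be shown in plain Python. This is only a sketch of the concept using Sperling's example criteria. It does not use or imitate the Cognos software, and the `Item` fields and sample titles are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Item:
    """Hypothetical item record; field names are illustrative only."""
    title: str
    location: str
    lc_class: str
    added: date
    circ_count: int

ITEMS = [
    Item("Linear Algebra Notes", "Math Library", "QA", date(2003, 2, 1), 14),
    Item("Intro to Topology", "Math Library", "QA", date(2003, 9, 15), 3),
    Item("Statistics Primer", "Math Library", "QA", date(2004, 1, 20), 22),
    Item("Organic Chemistry", "Science Library", "QD", date(2003, 5, 5), 30),
]

def matching(items, location, lc_class, start, end, min_circ):
    """Items in a location and LC class, added within [start, end],
    that circulated more than min_circ times."""
    return [
        i for i in items
        if i.location == location
        and i.lc_class == lc_class
        and start <= i.added <= end
        and i.circ_count > min_circ
    ]

hits = matching(ITEMS, "Math Library", "QA",
                date(2003, 1, 1), date(2004, 12, 31), 10)

summary_count = len(hits)                 # what a summary report shows
drill_through = [i.title for i in hits]   # the items behind that number

print(summary_count)
print(drill_through)
```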

Waterloo's Linda Teather, manager of library systems support services, agrees that Cognos has been valuable. The data always was available but not at her fingertips. An example of valuable decision-making information that is easier to get with data mining? "The big one is external reports," Teather says, such as the annual insurance evaluation for each library building.

The Fairfax County Public Library is taking advantage of Director's Station, a data-mining and analysis tool that SirsiDynix developed with SwiftKnowledge. Clay uses Director's Station for management decisions, grant applications, and marketing. He says the library has begun integrating information from outside the ILS, including data from its accounting system and demographic information. Data is recoded so every expense can be charged back to a branch, allowing Clay to determine the cost effectiveness of various outlets.

Down on the farm

The University of Pennsylvania's Data Farm tries to have a body of unprocessed data available for "when the question comes up, without knowing the question in advance," says Zucca. The Penn Data Farm contains a wide variety of information about the organization and its activities, transactions, and users. Zucca and his colleagues can find elusive answers that allow the library to serve its users better, spend funds more efficiently, or just do more with less—a frequent expectation these days. Zucca also sees a future in using the Data Farm to perform more workflow analysis.

Zucca provided an entertaining example about his library's rental of lockers. Students complained that the library wasn't supporting their locker needs, and front-line managers were hesitant to do anything. Zucca's team separated trend data from the Data Farm's raw feed of circulation statistics. They performed a quick five-year longitudinal study that confirmed that use was increasing—but only a third of the locker key inventory was circulating and the number was going down. The numbers helped them discover that locker keys were disappearing. Time required for the initial analysis? Less than 30 minutes.
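An analysis along these lines compares two yearly series drawn from the same circulation feed: total loans and the number of distinct keys actually in use. The sketch below uses invented data and a hypothetical inventory size, chosen only to show how rising use can coexist with a shrinking pool of keys; it does not reproduce Penn's figures or method.

```python
from collections import defaultdict

INVENTORY_SIZE = 12  # hypothetical number of locker keys the library owns

# Hypothetical raw circulation feed: (year, key_id), one tuple per loan.
loans = (
    [(2000, k) for k in (1, 2, 3, 4, 5, 6, 7)]
    + [(2001, k) for k in (1, 2, 3, 4, 5, 6, 1, 2)]
    + [(2002, k) for k in (1, 2, 3, 4, 5, 1, 2, 3, 4)]
    + [(2003, k) for k in (1, 2, 3, 4, 1, 2, 3, 4, 1, 2)]
    + [(2004, k) for k in (1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2)]
)

def yearly_trend(loans, inventory_size):
    """Return (year, total loans, distinct keys used, share of inventory)."""
    total = defaultdict(int)
    keys = defaultdict(set)
    for year, key_id in loans:
        total[year] += 1
        keys[year].add(key_id)
    return [
        (year, total[year], len(keys[year]), len(keys[year]) / inventory_size)
        for year in sorted(total)
    ]

trend = yearly_trend(loans, INVENTORY_SIZE)
for year, n_loans, n_keys, share in trend:
    print(f"{year}: {n_loans} loans, {n_keys} keys in use "
          f"({share:.0%} of inventory)")
```

When loans climb while distinct keys fall, the remaining keys are being worked harder, which is the signal that points to missing inventory rather than falling demand.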

The vendor lineup

Some vendors see the need for data mining, while others are more cautious. Dynix, now SirsiDynix, developed Web Reporter, which runs as a standalone application for both Horizon and Corinthian users. Web Reporter was built on software from MicroStrategy, a business intelligence vendor. Product manager Brian Rawlings said Dynix chose to partner with MicroStrategy for several reasons, including the ability to restrict access to nearly every element of the software, "boardroom quality" reporting, automatic report generation and delivery, and a web-based interface for end users.

As mentioned, SirsiDynix also has teamed up with SwiftKnowledge to create its Director's Station product, which is touted as a robust solution. Tom Gates, VP of marketing at SirsiDynix, said the company "didn't want to reinvent the wheel" and that SwiftKnowledge provided them with online analytical processing (OLAP) tools and data-cube technology. Director's Station migrates data out of the ILS and can include external information. According to Gates, "If the data can be mapped into the database, then it can be used."

SirsiDynix is also involved in the Normative Data Project (NDP), a way to merge data about many different libraries into a normalized database to see industry norms. For instance, the NDP could be used to plan a new branch library in a growing library district, allowing a director to find other libraries with similar demographics and identify the services and collections that have succeeded.

Innovative Interfaces customers are limited to Report Writer's use of data within the ILS. Innovative's XML Server has some potential to ease the process of automated data extraction, but it is currently limited to bibliographic and authority data. Innovative's move to MySQL databases for some statistics holds promise for customers willing to roll out their own data-mining solutions, but Ted Fons, product manager, says, "There are no specific plans" to open these databases to access from external programs.

Endeavor's partnership with business intelligence vendor Cognos is focused on its Meridian electronic resources management (ERM) product. At this stage, it's designed to load usage statistics from vendors and allow reports to be run from within Meridian. Sara Randall, director of strategic products, says that market research showed that statistics and usage were important as a part of ERM. "[The] initial license is for Meridian, but we plan to move it across all our products," Randall says.

It's easy to argue that vendors have reason to move slowly. Las Vegas's Walters is sympathetic with the vendor contention that there isn't a willingness to pay for the functionality we demand. "We're…seeing a squeeze on budgets, and the amount of money we're willing to invest in IT is under pressure."

From edge to center

The immediate future may see data mining develop on the fringes, with an emphasis on analyzing usage of licensed resources. North Carolina's Pace says, "ERM is huge…. You need a system that does serious evaluation of serials because that's where we're spending our money."

For libraries with limited technology resources, products like the forthcoming usage statistics software from MPS Technologies (a Macmillan subsidiary) may help solve the ERM portion of the problem. Product manager Martha Sedgwick says the software will initially work with COUNTER-compliant statistics, but the company knows that "for many libraries, receiving consolidated reports that include non-COUNTER-compliant data is also very useful, and we are looking…to include these."

Pace thinks data mining is coming, however. He hopes that we develop standards compliance for the final warehouse format because "librarians are going to want to compare." Fairfax's Clay agrees, stating, "The major gap is industry standards. We don't have any."

Librarians want a solution that will integrate many types of information into a system that allows them to analyze their usage, expenditures, customer base, collections, and more. The ILS is just one source of that information, though it is clearly the largest source—and ILS vendors are the best situated to create data-mining solutions.

Until vendors come up with relatively easy solutions, data mining probably isn't for every library. When the time comes to consider a system, Fairfax's Clay offers some advice. Think about what you need to know more about. "What are your needs and requirements? Use that as a template for judging products. Finally, is it worth the investment?" Many, however, would argue that it's an investment we can no longer ignore.


Author Information
Kevin Cullen is Digital Projects Librarian, Colorado State University, Fort Collins. In September, he will begin a year-long fellowship at the Republic of Ireland's Marine Institute, Oranmore, County Galway.

 

We're Mining You

Data mining by corporations is commonly performed on information from customer relationship management and business-to-consumer e-commerce systems. Analysts connect each consumer's online and in-person purchases via hooks such as credit card numbers, name, phone number, and social security number.

Over time, this creates pictures of the buying habits of specific consumers. Organizations also record paths customers follow through web sites, items they add to and remove from shopping carts, and the points at which they abandon those virtual shopping trips.

By examining trends, corporations group customers into categories, predict what offers and products may appeal to them, and organize targeted campaigns.

Patterns flag customers who may be ready to sever ties, allowing companies to offer those customers special incentives or services.


A Data Mining Glossary

DATA CLEANING The process of manipulating transactional data into a format suitable for data mining. In the case of libraries, this may involve eliminating personally identifying information about users.

DATA CUBE A data storage format used for data mining and optimized for pattern finding and analysis.

DATA MART A small data warehouse for a specific topical area.

DATA MINING The process of finding patterns within large stores of data. The term does not refer to mining to get data (as it is sometimes used incorrectly). Think of open-pit mining, in which the open pit is the place where mining occurs. Nobody mines for open pits. The data is what is sifted through, rather than what is sought.

DATA WAREHOUSE A large store of data with a structure optimized for analysis and pattern finding. This is where data mining is performed.

OLAP ONLINE ANALYTICAL PROCESSING. Refers to a system designed for analysis and pattern finding rather than for recording transactions and/or inventory. A data warehouse is an OLAP system.

OLTP ONLINE TRANSACTION PROCESSING. Refers to a system designed for recording transactions and/or inventory. OLTP systems are much slower than OLAP systems for complicated queries.
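
Two of the glossary terms, data cleaning and data cube, can be illustrated together. The sketch below strips personally identifying fields from hypothetical circulation records, replaces the patron ID with a salted hash, and rolls the cleaned rows up by branch and format. The field names, the salt, and the records are all assumptions made for illustration. Note that salted hashing is pseudonymization, not full anonymization, so a library's privacy policy may call for stronger measures.

```python
import hashlib
from collections import Counter

# Hypothetical secret; it should be stored outside the warehouse.
SALT = b"replace-with-a-secret-value"

# Hypothetical transactional records as they might leave an ILS.
raw = [
    {"patron_id": "P1001", "name": "A. Reader", "branch": "Main", "format": "book"},
    {"patron_id": "P1001", "name": "A. Reader", "branch": "Main", "format": "dvd"},
    {"patron_id": "P2002", "name": "B. Borrower", "branch": "Main", "format": "book"},
    {"patron_id": "P3003", "name": "C. Patron", "branch": "East", "format": "book"},
]

def clean(record):
    """Drop the name and replace the patron ID with a salted hash token."""
    digest = hashlib.sha256(SALT + record["patron_id"].encode("utf-8"))
    return {
        "patron_token": digest.hexdigest()[:12],
        "branch": record["branch"],
        "format": record["format"],
    }

cleaned = [clean(r) for r in raw]

# A tiny two-dimensional "cube": loan counts by (branch, format).
cube = Counter((r["branch"], r["format"]) for r in cleaned)

print(cube[("Main", "book")])
```

Because the same patron ID always yields the same token, repeat use by one patron can still be counted without storing who that patron is.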

©2008 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.