Library Journal Mobile

Proof in the Pattern

Librarians follow the corporate sector toward more data-driven management, writes Scott Nicholson

By Scott Nicholson -- netConnect, 1/15/2006

Demands on libraries continue to grow, outpacing budget increases. More and more, librarians are forced to make difficult decisions about which materials and services stay and which go. Charles R. McClure has written that many librarians use an “adhocracy” method to make these decisions, relying on no data or simple aggregates in determining a course of action.

Others, however, have turned to a more data-driven approach that moves beyond generalized aggregations, such as running totals and overall means, to reveal underlying patterns in the data that clarify which services or materials are worth retaining.

The first challenge facing librarians in using data for decision-making is gathering the information. Most libraries are supported by myriad computer systems. The integrated library system (ILS) reduces this problem, but with interlibrary loan (ILL), websites or content management systems, digital reference and other electronic resources, community-based information such as blogs and wikis, and whatever comes next, librarians are left with a disarray of data sources and system logs.

Gathering the data

Others have faced this type of problem. In the 1980s, the corporate sector had similar data structures; each computer system produced a different type of data and worked independently. The phenomenon, called stovepiping, creates an overwhelming challenge in data-based measurement and evaluation across an organization.

Libraries have not yet fully realized what the corporate sector already knows in facing the stovepiping problem: there is a difference between data for operational systems and data for analysis and evaluation. Most systems supporting library services focus on the operational needs of the library. Once a transaction is complete, many systems either delete or hide the resulting data, as it is seldom needed for future operations. Personal information is a prime example. Once a patron completes a transaction, the system doesn't have an operational need to retain the data, so deleting it for reasons of patron privacy may seem the obvious thing to do.

But librarians are discovering uses for data from operational systems beyond the initial transaction. Unfortunately, library systems can make it difficult to extract, match, and clean the data needed for a specific project. When a librarian must make a decision quickly, such hurdles often mean falling back on McClure's “adhocracy.”

Figure 1: Choices from the Penn Library Data Farm

Data warehousing

How did the corporate sector address stovepiping? It created the data warehouse, a place to collect and normalize data from different systems. Extracted data is brought into the data warehouse and matched using common variables. The unified structure of the data warehouse makes future analysis easier.
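
The matching step can be sketched in a few lines. The example below assumes two hypothetical extracts, one from a circulation system and one from an ILL system, that share a patron barcode under different field names. Every field name and record is invented for illustration; no real ILS export format is implied.

```python
# Hypothetical extracts from two stovepiped systems. All field names and
# values are illustrative assumptions, not a real ILS export format.
circulation = [
    {"barcode": "P001", "item": "Title A", "action": "checkout"},
    {"barcode": "P002", "item": "Title B", "action": "checkout"},
]
ill_requests = [
    {"patron_barcode": "P001", "title": "Title C", "status": "filled"},
    {"patron_barcode": "P003", "title": "Title D", "status": "pending"},
]


def normalize(record, key_field, source):
    """Map a source-specific record onto one shared warehouse layout."""
    details = {k: v for k, v in record.items() if k != key_field}
    return {"patron_key": record[key_field], "source": source, "details": details}


def load_warehouse(*extracts):
    """Group normalized records from every source under the common key."""
    warehouse = {}
    for records, key_field, source in extracts:
        for record in records:
            row = normalize(record, key_field, source)
            warehouse.setdefault(row["patron_key"], []).append(row)
    return warehouse


warehouse = load_warehouse(
    (circulation, "barcode", "circulation"),
    (ill_requests, "patron_barcode", "ill"),
)

# Activity for patron key P001 from both systems now sits in one place.
for row in warehouse["P001"]:
    print(row["source"], row["details"])
```

In a production warehouse the matching key itself would normally be replaced before analysis, since a barcode still identifies a person.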

For example, a digital reference service in an academic library has one data source about the users and another for the questions and answers. The university maintains the student data with current status information (major, grade level, etc.). The librarian wants to understand who's asking what questions and, more importantly, what resources are useful in providing answers to that cohort. If the librarian only matches the university's current status information to past questions and answers, the query won't reveal what the patron was like while using the service.

A data warehouse facilitates the capture of the transaction along with data about the patron's characteristics at the time. People change addresses, grades, majors, and positions. The data warehousing process takes a snapshot of the patron status, attaches it to information about resources used, and brings both into a separate data source. The data warehouse is always ready for analysis and provides a more accurate picture of library use.
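
A minimal sketch of the snapshot idea follows, assuming a hypothetical student-status lookup that the university overwrites whenever a student's record changes. The identifiers, fields, and dates are invented; the point is only that the status is copied into the warehouse row when the question is asked, so later changes do not rewrite history.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical operational data: overwritten whenever the student's record changes.
current_status = {"S100": {"major": "History", "level": "sophomore"}}


@dataclass(frozen=True)
class WarehouseRow:
    asked_on: date
    question_topic: str
    resource_used: str
    major: str  # copied at transaction time, never updated afterward
    level: str  # copied at transaction time, never updated afterward


def capture(student_id, asked_on, topic, resource):
    """Snapshot the patron's status as it stands when the transaction occurs."""
    status = current_status[student_id]
    return WarehouseRow(asked_on, topic, resource, status["major"], status["level"])


rows = [capture("S100", date(2005, 10, 3), "primary sources", "history database")]

# The student later changes major and advances a year.
current_status["S100"] = {"major": "Economics", "level": "junior"}
rows.append(capture("S100", date(2006, 10, 2), "trade statistics", "census tables"))

# The first row still describes the patron as they were in 2005.
for row in rows:
    print(row.asked_on, row.major, row.level, row.question_topic)
```

Joining past questions to `current_status` directly would label both questions as coming from an economics junior, which is the distortion the paragraph above describes.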

Protecting patron privacy

Current legislation makes libraries increasingly concerned about the privacy of individuals using their services, and some libraries delete large amounts of data about library use. At the same time, libraries are called upon to justify themselves through more data and outcomes-based evaluation. Librarians must balance the desire to protect patron information with the need to be accountable to funders.

It's possible to accomplish both goals in a data warehouse through the use of demographic surrogates. Just as bibliographic surrogates represent materials within library systems, demographic surrogates represent users in a library data warehouse. The personal information about a user is scrubbed and replaced with demographic data that is attached to information about the item used. In building surrogates, library administrators can select demographic fields that best serve management needs. [For more information about creating demographic surrogates, see the author's “The Bibliomining Process” in Information Technology and Libraries, 22(4), 2003.]
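
As a rough sketch of a demographic surrogate, the example below keeps only an approved list of demographic fields and attaches them to the item used. The field names and the choice of fields are assumptions made for illustration, not the procedure from the cited article.

```python
# Demographic fields an administrator might choose to keep
# (an assumption for this sketch, not a recommended list).
SURROGATE_FIELDS = ("status", "department", "age_group")


def to_surrogate(patron):
    """Keep only the chosen demographic fields; everything else is dropped."""
    return {name: patron.get(name, "unknown") for name in SURROGATE_FIELDS}


def warehouse_record(patron, item):
    """Attach the surrogate, not the patron record, to the item used."""
    return {"user": to_surrogate(patron), "item": dict(item)}


# Invented patron and item records.
patron = {
    "name": "Pat Example",
    "barcode": "P001",
    "email": "pat@example.org",
    "status": "graduate",
    "department": "History",
    "age_group": "25-34",
}
item = {"call_number": "Z678.9", "format": "book"}

record = warehouse_record(patron, item)
print(record)
```

Copying an approved list of fields, rather than deleting a list of forbidden ones, means a personal field added to the patron record later cannot slip into the warehouse by default.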

The data warehouse has other advantages. It can integrate tools and common reports used for analysis so more staff in the library can explore the data. Since the data warehouse is separate from the operational systems, running routines that heavily tap into it will not interfere with library operations.

Figure 2: Data from the Normative Data Project for Libraries

The Data Farm

A leading data warehousing application is the Data Farm at the University of Pennsylvania, Philadelphia. Spearheaded by librarian Joseph Zucca, the Data Farm has grown over the past six years. Zucca started by bringing data feeds from different systems, both human and automated, into the Data Farm. He matched these feeds and cleaned the data, removing personal identifiers. The Data Farm captures these data in nonaggregated form to allow flexible exploration.

Zucca describes what's in the Data Farm by focusing on the concept of a service event, which he defines as “some user interaction with the library.” Service events include these data points: resource (book, database, website, librarian); some activity (usually a server activity, like a checkout, but it could be a term paper clinic or another human activity); attributes of the resource (call number, language code, publication date, or the reference librarian's specialty or liaison role); attributes of the resource user (school, status, year, GPA, gender); environment (location, date/time, in/out of the library, in-person or by chat); and event outcomes. Zucca matches each of these data points to others as appropriate and strips personal identifiers to produce a record of the service event, which goes into the Data Farm.
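
One way to picture a service event record is the sketch below. The class, field names, and identifier list are hypothetical and chosen to mirror the data points listed above; this is not the Data Farm's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Keys treated as personal identifiers in this sketch (an assumption).
IDENTIFIERS = {"name", "patron_id", "email"}


@dataclass
class ServiceEvent:
    resource: str                                   # book, database, website, librarian
    activity: str                                   # checkout, chat session, clinic
    resource_attributes: dict = field(default_factory=dict)
    user_attributes: dict = field(default_factory=dict)
    environment: dict = field(default_factory=dict)
    outcome: Optional[str] = None


def build_event(raw):
    """Assemble one event from matched feeds, dropping personal identifiers."""
    user = {k: v for k, v in raw["user"].items() if k not in IDENTIFIERS}
    return ServiceEvent(
        resource=raw["resource"],
        activity=raw["activity"],
        resource_attributes=raw.get("resource_attributes", {}),
        user_attributes=user,
        environment=raw.get("environment", {}),
        outcome=raw.get("outcome"),
    )


# Invented raw record, as it might look after feeds have been matched.
raw = {
    "resource": "book",
    "activity": "checkout",
    "resource_attributes": {"call_number": "BL48", "language": "eng"},
    "user": {"patron_id": "P001", "name": "Pat Example", "school": "Arts", "status": "grad"},
    "environment": {"location": "main stacks", "in_library": True},
}

event = build_event(raw)
print(event)
```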

Zucca's goal is to integrate assessment into daily library operations. As the Data Farm grows and more library staff use it, the library employs more data-driven assessments in making decisions. “Assessment is not a series of one-off projects or the preserve of some special office in the library,” Zucca says. “It's a priority of the staff, so the staff needs to be equipped.”

For example, the University of Pennsylvania Library used the Data Farm to make decisions about moving works to off-site storage. After beginning the project, the librarians realized they needed to do mindful weeding, instead of just transferring books from one place to another. Without a data warehouse, the librarians could either have waited for systems staff to identify low-circulating works or just taken a guess and piled items onto the cart. The Data Farm changed that. “The religious studies selector can monitor the rate of intake and use within her stack ranges over different periods of times and devise a profile to help inform her transfer decisions,” explains Zucca. “All the data is up to the minute, quick to generate, structured in NATC [North American Title Count] schemes to make reports usable, and within easy reach of the managers.”

Figure 3: Multiple views of data through the Normative Data Project

ILS directions

The greater challenge faced by Zucca—and every librarian—is determining the impact of library services. We may know someone used a database or website, but what difference did it make? The rapidly growing field of outcome-based evaluation is focused on answering this question. (See “Beyond The Numbers” on p. 8 for more on outcomes.) There are few system-based measures that reveal outcomes, so these must come from users through other types of studies. Zucca is working to “create tools that help staff collaborate with faculty on outcomes assessment. In the end, library outcomes raise policy issues for the institution as a whole, issues that are themselves independent of management tools like the Data Farm.”

Most libraries don't have the resources to create a Data Farm. This opens the door for ILS vendors to get involved, and some are embracing the idea. (See “Below the Surface” on p. 12.) Like the Data Farm, SirsiDynix's Normative Data Project (NDP) gathers and normalizes data from different sources. NDP originated with data from libraries running the Unicorn ILS. The project matched Unicorn data with Census data, demographics from GIS data, and data about U.S. libraries from the National Center for Education Statistics (NCES).

Bob Molyneux, chief statistician for SirsiDynix, gets giddy when talking about the NDP. “The first time I saw it work I said to myself 'that's impossible' because I had spent 20 some years analyzing data, and I had never seen anything like it,” he exclaims. “Clearly, it is possible. I was looking at a revolution.”

Standards wanted

NDP's success brings up another important issue: the need for standards. Without standards in how data are collected, matched, cleaned, and kept, it's much harder to combine various data into a large database. (For more on standards initiatives, see “Bit by Bit” on p. 16.) In fact, the NDP is working to bridge different ILSs by developing a schema to bring data into the NDP. The problem of matching data from different systems is complex, not only in use and patron data but also in merging locally customized cataloging data. NDP can serve as a model for library consortia and library networks wishing to share data across different automation platforms.

As projects like the Data Farm and NDP demonstrate, surfacing patterns of use means librarians need data at the individual use level. General aggregations hide underlying patterns that are available to resource vendors, who use them in making business decisions. Libraries may not be money-making propositions, but they need the same item-level data and reporting standards to make solid management decisions and respond to demands for accountability in using public funds.
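
A small, invented example shows how an aggregate can hide the pattern underneath it. The two collections below have the same average circulation per title, yet the item-level data tell very different stories. The numbers are made up for illustration.

```python
from statistics import mean

# Invented checkout counts per title for two stack ranges.
collection_a = [5, 5, 5, 5, 5, 5]   # every title circulates steadily
collection_b = [30, 0, 0, 0, 0, 0]  # one title carries the whole range


def summarize(checkouts):
    """Report the aggregate alongside two item-level measures."""
    return {
        "mean": mean(checkouts),
        "never_used": sum(1 for count in checkouts if count == 0),
        "top_title_share": max(checkouts) / sum(checkouts),
    }


summary_a = summarize(collection_a)
summary_b = summarize(collection_b)

# Identical means, very different weeding and purchasing implications.
print(summary_a)
print(summary_b)
```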

Figure 4: Introductory Screen for OCLC’s WorldMap™

Exploring the data

The (sometimes unexpected) challenge after gathering data is finding interesting patterns and stories within it. While a data warehouse puts everything in one place, the amount of data is overwhelming. Corporate America discovered this, too. After building data warehouses, managers and other stakeholders did not see any benefit to the piles of data collected in the data warehouse. They lacked data mining techniques to find patterns in the data.

Data mining is the exploration of a large dataset for nontrivial, novel, and useful patterns, using different statistical, analytical, and visualization tools. The process starts by collecting relevant data on a particular topic. The collected data is matched into a single, large database and cleaned. Cleaning takes most of the time in a data mining process and often requires repeated attempts, as data mining tools highlight flaws in the data. Data mining programs, such as Clementine, SAS Enterprise Miner, or the open-source WEKA, offer many options that can be executed on the same data set.

Data mining options can be either descriptive or predictive. Descriptive tools help the librarian describe and compare the past and current patterns in the data. These include: traditional aggregates and averages; clustering and market-basket analysis to identify groups of items or users that belong together because of other aspects in the data; and online analytical processing (OLAP) to explore tables of data, clicking on headings and rows to “drill down” into the data or “rolling up” data into higher level categories to better understand groupings.
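
The drill-down and roll-up operations can be sketched without any OLAP product. The example below assumes a tiny, invented table of checkouts with three dimensions (branch, subject, month); real OLAP tools do the same kind of grouping interactively and at far larger scale.

```python
from collections import Counter

# Invented circulation facts: (branch, subject, month).
DIMENSIONS = ("branch", "subject", "month")
checkouts = [
    ("Main", "History", "2005-11"),
    ("Main", "History", "2005-12"),
    ("Main", "Science", "2005-12"),
    ("East", "History", "2005-12"),
    ("East", "Fiction", "2005-11"),
    ("East", "Fiction", "2005-12"),
]


def roll_up(facts, *dims):
    """Count facts grouped by the chosen dimensions only."""
    positions = [DIMENSIONS.index(d) for d in dims]
    return Counter(tuple(fact[p] for p in positions) for fact in facts)


def drill_down(facts, **filters):
    """Keep only the facts that match every dimension=value filter."""
    return [
        fact
        for fact in facts
        if all(fact[DIMENSIONS.index(d)] == v for d, v in filters.items())
    ]


# Highest-level view: totals per branch.
by_branch = roll_up(checkouts, "branch")

# "Click" on the Main branch, then view its checkouts by subject.
main_by_subject = roll_up(drill_down(checkouts, branch="Main"), "subject")

print(by_branch)
print(main_by_subject)
```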

Predictive tools determine the unknown from what is known. For example, past patterns of use may predict future resource needs, which is valuable in purchasing decisions. Some predictive tools include correlation and regression to find variables that go up or down together or variables that go up as others go down; rule generation, or creating a series of if-then rules describing patterns in the data; and neural networks, which take in a large number of variables and predict a single result from past performance. Some descriptive tools can be used in a predictive manner and vice versa, and there are many other data mining tools besides these. [For an accessible introduction to data mining, see Michael J.A. Berry and Gordon S. Linoff's Mastering Data Mining: The Art and Science of Customer Relationship Management (Wiley, 1999).]
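
Correlation and simple regression, the first predictive tools named above, can be sketched with nothing beyond the standard library. The enrollment and checkout figures are invented, and the straight-line model is an assumption; real use data would need checking before such a model is trusted.

```python
from math import sqrt

# Invented figures: course enrollment and course-reserve checkouts per term.
enrollment = [20, 35, 50, 65, 80]
reserve_checkouts = [42, 71, 98, 133, 160]


def pearson(xs, ys):
    """Pearson correlation: do the two variables rise and fall together?"""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


r = pearson(enrollment, reserve_checkouts)
slope, intercept = fit_line(enrollment, reserve_checkouts)

# Use the fitted line to estimate demand for a hypothetical 100-student course.
predicted = slope * 100 + intercept
print(f"correlation={r:.3f}, predicted checkouts={predicted:.0f}")
```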

A decision support system

Some of the projects described earlier have dedicated data mining tools. The Normative Data Project, for example, includes OLAP and visualization options to allow exploration that goes beyond basic reporting.

The NDP toolkit is “liberating because it makes it possible for sophisticated analysis to be done without having to know how to program or about analytical techniques, as common sense will suffice,” Molyneux reports. “What we have here, then, is a database of information working librarians can use to make better decisions. It is a decision support system. Given that public libraries in the United States had an income of almost $9 billion and spent $8.3 billion on operations, according to the latest estimates, better decision information will result in better decisions and better use of that money, particularly in budget-constrained times.”

Figure 5: OCLC’s WorldMap™ Displaying Search Results

Visualizing data

Like Molyneux, OCLC Research seeks to leverage library data in new ways. Current data mining projects use WorldCat and library holdings, circulation and interlibrary loan (ILL) statistics, and system transaction logs to identify collection trends, themes, and search behaviors. (For more on OCLC's data mining efforts see “Making Data Work Harder,” LJ, January 2006, p. 40.)

One such project is OCLC's WorldMap™, which presents WorldCat titles and holdings geographically, by state, province, and country of publication. WorldMap™ also displays other statistics, such as the number of libraries in a country and collection and expenditure data. Users first see a world map and select a desired variable; WorldMap™ then shades countries with colors to reflect values for that variable (Figure 5).

The user can then select a country to see its underlying data, either displayed on the map or in a table. Rather than starting with lists and tables, OCLC wanted to provide a graphical entry point to the data. WorldMap™ is a combination of OLAP and visualization that results in a more natural way of exploring information.

Another data visualization project is the Public Library Geographic Database (PLGDB), part of the GeoLib Program, directed by Christie Koontz, College of Information at Florida State University. PLGDB is free and allows researchers and libraries to explore map-based library data from different sources. For example, a map can show library locations and the percentage of the local population under age five.

Nancy Smith, coordinator for the Prairie Area Library System in Rockford, IL, uses the PLGDB for marketing and research about patron needs. “In Illinois, we have varying service area lines, which are ascribed by funders, and using the PLGDB allows librarians to estimate customer market areas or the geography in which our users really live,” Smith explains, “but the real power is in identifying potential library customers.”

Patterns through bibliomining

Exploring data with visualization or OLAP tools is akin to driving around a neighborhood looking for a dream house. A data mining tool, on the other hand, is like having a real estate agent who finds the kind of houses you want. Both options help librarians find patterns of interest in the data.

A related practice with roots in library and information science is bibliometrics, which explores patterns in the creation of information. Linking works and authors together through citation analysis is a common example of bibliometrics. The bibliomining concept combines bibliometrics and data mining to improve library decision-making. Data warehouses are at the core of bibliomining, allowing bibliometrics and data mining to operate within the same data space, so librarians can more easily discover patterns spanning both creation and use of works.

The library's story can be told in many ways. In the past, we have relied upon anecdotes and broad aggregations, but funding agencies increasingly require more business-like data analysis and outcomes to justify how they will allocate shrinking tax dollars to the greatest effect. Drawing together selected artifacts of our daily efforts uncovers patterns that give substance to our anecdotes and is a more powerful way to convince funders of our value.


Link List

The Bibliomining Information Center
bibliomining.com/
Clementine
www.spss.com/clementine
The Normative Data Project
www.libraryndp.info/
OCLC
www.oclc.org/research/researchworks/worldmap/prototype.htm
Public Library Geographic Database Mapping
www.geolib.org/PLGDB.cfm
SAS Enterprise Miner
www.sas.com/technologies/analytics/dataminig/index.html
University of Pennsylvania Library's Data Farm
metrics.library.upenn.edu/prototype/datafarm
WEKA
www.cs.waikato.ac.nz/~ml/weka/index.html


Author Information
Scott Nicholson is an Assistant Professor at the Syracuse University School of Information Studies, NY