Making Data Work Harder

As more activities move into networked spaces, more areas of our lives are shedding data. This data is increasingly being mined for intelligence that drives services. A major attribute of Internet companies like Google and Amazon is how they squeeze as much value as they can from the data they have. In other words, they make their data work harder.

A good example is Amazon's recommendation service. Amazon looks for patterns in sales data and creates services that inform (or even gently nudge) the user's book purchasing decisions. What Amazon does is based on the assumption that people who have agreed in the past will tend to agree in the future, meaning 'readers who purchased Book A also purchased Book B' or 'if you like Book C you will also like Book D.'

Another particularly innovative use of data to link potentially similar materials is the 'Statistically Improbable Phrases' associated with each title. Amazon scans the full text of books available through its 'Search Inside' feature and identifies phrases that are distinct to a particular book. Amazon then finds other books that contain these rare phrases and links them, on the assumption that several books with the same unusual phrases may be related in meaning or subject matter. This helps the reader discover new items, encourages exploration, and makes it more likely they'll become a customer.

In short, companies like Amazon repurpose data to create added value. This is a lesson librarians must learn if they want to improve their own visibility and value in increasingly crowded digital information spaces where users, as always, want good results without too much time or effort. The good news is that libraries don't come to this task empty-handed: they bring rich, structured information about the materials in their collections. Librarians also possess many other valuable data sources, like circulation statistics, reference transactions, interlibrary loan (ILL) requests, usage behaviors, and more. In their daily work, libraries create a wealth of information that offers an extremely rich perspective on the characteristics and use of the wide variety of materials in library collections.
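
The 'readers who purchased Book A also purchased Book B' pattern described above can be illustrated with a simple co-occurrence count. The following is a minimal sketch, not Amazon's actual method: it assumes hypothetical borrowing histories (one list of titles per patron) and simply counts how often two titles appear together. The function names and sample data are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_occurrence(baskets):
    """Count how often each pair of titles appears in the same basket."""
    pair_counts = defaultdict(Counter)
    for basket in baskets:
        # Use a set so a title repeated within one basket is counted once.
        for a, b in combinations(sorted(set(basket)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return pair_counts


def also_borrowed(pair_counts, title, top_n=3):
    """Return up to top_n titles most often seen alongside `title`."""
    # Sort by descending count, then alphabetically to break ties.
    ranked = sorted(pair_counts[title].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


# Hypothetical data: each inner list is one patron's borrowing history.
baskets = [
    ["Book A", "Book B", "Book C"],
    ["Book A", "Book B"],
    ["Book A", "Book D"],
    ["Book C", "Book D"],
]
counts = co_occurrence(baskets)
print(also_borrowed(counts, "Book A"))  # ['Book B', 'Book C', 'Book D']
```

A library could feed the same function with circulation or ILL records in place of sales data. Raw counts favor popular titles, so a production service would likely normalize for overall popularity.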

Data mining in libraries

Making data work harder for libraries will involve machine processing of existing data sources. Fortunately, computer processing power, storage capacity, and memory are such that even the largest databases can be mined for useful intelligence with little effort and cost. For example, the WorldCat bibliographic database contains more than 60 million records, but it could be easily stored as a single file of less than 50 gigabytes on a slightly upgraded computer workstation. Processing the file algorithmically takes about a half-hour, depending on the complexity of the algorithm. Larger and cheaper storage capacity also makes it practical to store historical copies of the database for dynamic analysis. It is now possible to mine intelligence from WorldCat quickly and conveniently.

The benefits of data mining often depend on the scale of the data sources from which intelligence is gathered. WorldCat again provides a good illustration. As WorldCat grows in size, the records get more useful, because the information they contain can be aggregated, compared, related, and understood within an expanding context. So, the larger the WorldCat database, the more valuable it becomes to the OCLC membership - not just for cooperative cataloging, resource sharing, and bibliographic research but for the scope and depth of the intelligence that can be mined from it.

What kinds of intelligence can be mined from bibliographic data like WorldCat? The possibilities seem endless, but here are a few general areas where librarians could realize immediate benefits from making their library's data work harder:

Collaborative Collection Management

Libraries already share collections across institutional boundaries, but data mining across library collections could open the door to new opportunities for shared collection management. Studies of holdings patterns for institutional clusters at the consortium, regional, or even national level could reveal opportunities to reduce cross-collection redundancies and free up resources to fill gaps in collections.

Collection views

New modes of slicing and dicing data naturally lead to new views of library collections for both users and librarians. Librarians need strong data analysis to help them discover how to segment their collections into useful views that facilitate discovery (e.g., by material type, format, or means of access) or support particular library functions (e.g., collection development, digitization, or preservation). Bibliographic records and other sources contain a wealth of information that could support the creation and maintenance of such views.

Library decision-making

A major purpose of collecting and mining business intelligence is to aid decision-making. For example, preservation decisions can be informed by data on the characteristics of a collection or set of collections, such as the number of items uniquely held by a library. Similarly, digitization decisions could be informed by data that sheds light on intellectual property rights implications of proposed initiatives or identifies publishers whose materials would be affected by digitization. Such information can often be extracted from bibliographic records.

User behavior

Studies of user interactions with library collections, especially in networked digital environments, provide valuable information for developing innovative data displays and interfaces that reflect user behaviors and enhance the value of library collections to users. Data sources, such as surveys, transaction logs, circulation data, ILL transactions, virtual reference transactions, and more, can be mined for information that reveals these behaviors.

Trend-spotting

Libraries increasingly operate in environments subject to rapid change, including shifting users, technology, and publishing models. Making sure they keep pace with change requires gathering intelligence that identifies, characterizes, and anticipates emerging trends. Collection analysis, user behavior studies, and other modes of analysis can generate information that speaks to these issues.

As libraries begin data mining in the above areas, librarians can anticipate converging toward sets of key questions and consistent metrics for answering them. For example, if a library is considering beginning a digitization program, it is first necessary to decide which resources to digitize. This will hinge on a variety of factors, such as uniqueness, accessibility, and demand. Analysis of library holdings, ILL borrowing requests, circulation statistics, and other data will help quantify these factors and, ultimately, help librarians make informed decisions. As librarians gain more experience in making their data work harder, both in terms of gathering information and applying it, the process of culling data within particular contexts will likely become systematized and consistent; standard decision-making criteria and performance benchmarks will emerge.
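
The digitization example above, where the choice hinges on uniqueness, accessibility, and demand, can be sketched as a simple ranking. This is one possible illustration under stated assumptions, not an established metric: the `Item` fields, the sample records, and the ordering rule (scarcest first, then highest combined ILL and circulation demand) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    title: str
    holding_libraries: int  # how many libraries hold the item (uniqueness)
    available_online: bool  # whether a digital copy already exists (accessibility)
    ill_requests: int       # interlibrary loan borrowing requests (demand)
    circulations: int       # local checkouts (demand)


def digitization_candidates(items):
    """Rank items not yet online: scarcest first, then highest demand."""
    candidates = [item for item in items if not item.available_online]
    return sorted(
        candidates,
        key=lambda item: (
            item.holding_libraries,                    # fewer holders ranks earlier
            -(item.ill_requests + item.circulations),  # more demand ranks earlier
        ),
    )


# Hypothetical records, invented for illustration only.
items = [
    Item("Local history pamphlet", 1, False, 4, 2),
    Item("Regional atlas", 3, False, 10, 5),
    Item("Popular novel", 900, False, 1, 40),
    Item("Town directory", 1, True, 6, 1),
    Item("Parish register", 1, False, 9, 3),
]
for item in digitization_candidates(items):
    print(item.title)
```

A strict sort like this lets uniqueness dominate demand entirely. A library that wanted to trade the factors off against each other would need to agree on weights, which is the kind of standard criterion the paragraph above anticipates.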

Data mining and OCLC Research

OCLC Research has been tackling data mining from two directions. One uses data to create enhanced user experiences. The OCLC WorldMap is a geographical representation of WorldCat title holdings by state, province, and country of publication. WorldMap can be used to identify groups of collections visually for potential collaborative collection development and management.

The Curiouser project demonstrates how value can be released from libraries' accumulated investment in structured data, in the form of a richer browsing experience. One approach deployed in Curiouser is to cluster bibliographic records using the OCLC Research 'FRBRization' algorithm, so that all versions of a title - editions of Hamlet, films, literary criticism, etc. - show up together. FRBR (Functional Requirements for Bibliographic Records) is a framework of relationships among bibliographic entities that helps cluster the multiple manifestations of works. FRBR takes existing information in bibliographic records and repurposes it to create new views of library collections, innovative search techniques and displays, and, finally, more effective discovery and collection management outcomes for users and librarians. [For more on FRBR see 'What Is FRBR?' by Linda Gonzalez, Spring 2005 netConnect, 4/15/05, p. 12.]

A second focus of OCLC Research projects is mining data reservoirs for useful intelligence that supports library management decisions. OCLC recently introduced a Collection Analysis Service that allows libraries to obtain and apply intelligence about their individual collections, as well as compare their collections to those of other libraries across a variety of dimensions.
OCLC Research has been addressing more general questions, such as:

The systemwide print book collection

Expanding opportunities for resource sharing created by networked digital environments have caused libraries to shift focus from local collections toward combined library holdings at the regional, national, or even international level. In collaboration with Ithaka, a not-for-profit agency promoting information technology in higher education, OCLC Research has undertaken a study to characterize the systemwide collection of print books, as reflected in the aggregate print book holdings in WorldCat. Findings from this study will help librarians place their local collections in the broader context of the systemwide collection and provide an empirical basis for collaboratively managing the systemwide collection in ways that create new value and allocate local resources effectively.

Audience levels

OCLC Research is using WorldCat holdings data to associate a target community or audience level to particular library profiles (e.g., Association of Research Libraries, public, K-12, etc.). The goal is to infer an item's audience level from the holdings pattern associated with it. Audience level information would be valuable in a variety of contexts, including collection management, readers' advisory, reference services, and information retrieval.

'Last copies'

More than 20 million WorldCat records have only a single holding attached. Such records can be used to help identify rare and valuable materials in library collections, especially the sole remaining copy of a particular item. Identification of rare materials and last copies is essential intelligence to support storage, digitization, and preservation decision-making. OCLC Research is working on techniques for using holdings data and other bibliographic information to identify rare and unique materials in library collections.
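
The single-holding test described under 'Last copies' can be sketched in a few lines. This is a minimal illustration, not OCLC Research's technique: it assumes holdings arrive as hypothetical (record ID, library) pairs, and the record and library names are invented.

```python
from collections import defaultdict


def holdings_by_record(holdings):
    """Group (record_id, library) pairs into record_id -> set of libraries."""
    grouped = defaultdict(set)
    for record_id, library in holdings:
        # A set ignores duplicate holdings reported by the same library.
        grouped[record_id].add(library)
    return grouped


def last_copies(holdings):
    """Return {record_id: library} for records held by exactly one library."""
    return {
        record_id: next(iter(libraries))
        for record_id, libraries in holdings_by_record(holdings).items()
        if len(libraries) == 1
    }


# Hypothetical holdings data, invented for illustration only.
holdings = [
    ("rec1", "LibA"),
    ("rec1", "LibB"),
    ("rec2", "LibA"),
    ("rec3", "LibC"),
    ("rec3", "LibC"),  # duplicate entry from the same library
]
print(last_copies(holdings))  # {'rec2': 'LibA', 'rec3': 'LibC'}
```

A single holding only flags a candidate. Duplicate records for the same work, or copies held by libraries that do not report holdings, mean the result would still need to be checked against other bibliographic information.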

Identifying and characterizing digital materials in WorldCat

OCLC Research is assessing how many digital materials have been cataloged in WorldCat, describing their attributes, and identifying common cataloging practices. Analysis of this kind demonstrates the growing presence of digital materials in library collections, trends in cataloging activity for these materials, and shifts in the types of digital materials libraries have acquired over the last several decades.

These are only a few examples of the data mining projects underway in OCLC Research, each devoted to making the data we have work harder. These activities focus on pulling more value from the rich, structured information in the WorldCat bibliographic database and associated holdings file. Whether the results help librarians know more about their collections, support the creation of interesting and useful data displays, or provide intelligence to support a range of library decision-making needs, projects of this kind create demonstrable value for librarians and users.

Don Waters, program officer of scholarly communication for the Mellon Foundation, has observed that 'what unites our interest in digitization and open access in a digital world is that the material becomes processable: it can be indexed, manipulated, mined, aggregated, decomposed, built up, and so on by algorithm, and it is this processability that makes digitized objects and open access materials valuable to scholars.' Libraries have great riches buried within their data, and this data can be processed. We must actively pull these riches to the visible surface, as Amazon and Google have done, to create more value for librarians and users.

Brian Lavoie is Senior Research Scientist, Lorcan Dempsey is Vice President and Chief Strategist, and Lynn Silipigni Connaway is Consulting Research Scientist, OCLC

OCLC Extras

Audience Levels: infers materials' target audience, or audience level, using holdings information.
www.oclc.org/research/presentations/connaway/lrsIII_audience.ppt

Mining for Digital Resources: identifies and characterizes digital resources cataloged in WorldCat.
www.oclc.org/research/presentations/lavoie/acrl2005.ppt

OCLC Research Data Mining Projects: overview of current activities.
www.oclc.org/research/projects/mining

OCLC WorldCat Collection Analysis Service: web-based service that provides analysis and comparison of library collections based on holdings information contained in the WorldCat database.
www.oclc.org/collectionanalysis

Publisher Name Server: resolves ISBN prefixes to publisher name; resolves variant publisher names to a preferred form; and captures and makes available various publisher attributes (e.g., location, language, genre/format, dominant subject domain) of the publisher's output.
www.oclc.org/research/projects/publisherns/default.htm

Systemwide Print Book Collection: analyzes the size and characteristics of aggregate print book holdings, with an emphasis on implications for digitization and preservation decision-making.
www.oclc.org/research/presentations/lavoie/cni2005.ppt

WorldMap: visualizes geographic distribution of selected library data. Currently available data include holdings and titles, each by place of publication (from OCLC WorldCat), and number of libraries, librarians, users, volumes, and annual expenditures (from other sources).
www.oclc.org/research/researchworks/worldmap/prototype.htm