Library Journal Mobile

Making Connections

From shuttling “bots” to library catalogs brimming with contextual data, Karen Coyle shows us the future made possible by Linked Data

By Karen Coyle -- netConnect, 4/15/2009

The Semantic Web was introduced to the general public in a May 2001 Scientific American article written by Tim Berners-Lee, James Hendler, and Ora Lassila. The article illustrated the Semantic Web with a futuristic scene in which “bots” (drawn as round, four-legged insects with a single antenna) dashed around the web, gathering and analyzing data while consulting “ontologies,” the latter represented by booklike objects.

In the end the bots make doctors' appointments and add them to the calendars of siblings who will accompany their mother to treatments. This vision of a future world with smart devices making appointments for you and home appliances that chat about the “use by” date of their contents probably left many people feeling seriously skeptical about the technology that the article presented. While the explanation of the underlying data structures of the Semantic Web may not have been as memorable as the conversation between the microwave oven and its contents, it is, in fact, an excellent summary of what today we call “Linked Data” and how it will facilitate a rich information future. Linked Data may not yet be on the radar of the manufacturers of household goods, but it is definitely knocking on the door of libraries and library applications.

From documents to data

As Berners-Lee, Hendler, and Lassila wrote, “To date, the World Wide Web has developed most rapidly as a medium of documents for people rather than of information that can be manipulated automatically. By augmenting Web pages with data targeted at computers and by adding documents solely for computers, we will transform the Web into the Semantic Web.”

The web was developed initially as a way for scientists to share documents with one another, worldwide, cheaply, and instantly. The linking between documents on the web is analogous to the use of citations in academic articles: from a place in one text, one points to another text. The links themselves have virtually no semantics; a link means simply “link.” To understand why the author chooses to link, one has to interpret the verbal context around the link. It may be a passage that the author is quoting from the other document, or it could be a reference to earlier documents that informed the author's thinking. Nothing about the link itself explains the meaning behind the link.

The existence of online reference works, such as dictionaries, maps, and encyclopedias, has allowed us to make a new kind of link that isn't between documents but between terms and concepts and some further information about them. You can refer to a person in an online document and link the person's name to the entry for that person in Wikipedia, or you can link from a place-name to Google Maps, with a map centered on that place. These capabilities have helped us understand that the documents on the web often have data within their texts—data that can be treated as information in its own right. Berners-Lee and colleagues recognized this when they envisioned that we could enhance the web of documents with a web of data that exists sometimes within and sometimes alongside the documents themselves. Think of it as moving beyond the human language contained within documents to discover and reveal the underlying information and making that information machine-actionable.

This requires us not only to understand how to discover the information within documents but how to create meaningful links among data elements. We have to go beyond a structure that says that A has some connection with B, to stating that A has relationship X with B. And these relationships need both to make sense to human readers and be actionable by machines.

Having things make sense to humans is fairly easy. I can write “Herman Melville is the author of Moby-Dick,” and human readers know exactly what I mean. Humans know what I mean not because “Herman Melville is the author of Moby-Dick” is in itself particularly meaningful, but because they have the context needed to know that Herman Melville is a person, that Moby-Dick is a book or some other writing, and that being “author of” means that Melville wrote the text that makes up Moby-Dick. Having this same phrase make sense to a computer and to programs that will manipulate the data means that we have to provide the context in a way that a machine can understand it. There are three key elements to this machine understanding: identities, relationships, and rules.

Identities

Identification of concepts and relationships is key to the functioning of Linked Data. Humans have a wide tolerance for ambiguity in their language. We can have a conversation with a friend in which the geographic place “Georgia” is mentioned, and it will in most cases be perfectly clear whether we are referring to a U.S. state or an Eastern European country. To make this work in the Semantic Web, however, we need to provide a different identifier for the U.S. state and the country to the east. We need this because the computer is much less adept at understanding context, but also because we may well want to use this datum outside of the context of the conversation, or, in the case of the web, the online document. Identities are generally fixed using Uniform Resource Identifiers (URIs), and the practice in the Semantic Web community is to create these identifiers in the format of a Uniform Resource Locator, or URL. This means that many identifiers will begin with “http://” and will in that sense be indistinguishable from URLs that serve as the locations of documents.
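The “Georgia” problem can be sketched in a few lines of Python. This is only an illustration: the example.org identifiers below are invented placeholders, not identifiers published by any real vocabulary.

```python
# Hypothetical identifiers: the example.org URIs are placeholders only.
GEORGIA_US_STATE = "http://example.org/place/georgia-us-state"
GEORGIA_COUNTRY = "http://example.org/place/georgia-country"

# Both identifiers carry the same human-readable label.
labels = {
    GEORGIA_US_STATE: "Georgia",
    GEORGIA_COUNTRY: "Georgia",
}


def same_thing(uri_a, uri_b):
    """Two mentions refer to the same thing only if their identifiers match."""
    return uri_a == uri_b


# The labels are identical, so a program comparing strings cannot tell them apart.
print(labels[GEORGIA_US_STATE] == labels[GEORGIA_COUNTRY])
# The identifiers differ, so a program comparing URIs can.
print(same_thing(GEORGIA_US_STATE, GEORGIA_COUNTRY))
```

The label is kept for display to humans, while the program compares only the identifiers.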

Groups of identified terms are often called vocabularies or, in the more formal language of the Semantic Web, ontologies. They can exist anywhere on the web, but the use of the URL format means that the best and easiest place to store formal information about an identified term is at the place located by the URI. In most cases, the vocabularies have meaning also to humans, so the URI can point to a document that describes the term for humans and may also provide additional information for machine-processing, such as relationships among terms in the vocabulary. A Semantic Web standard called Simple Knowledge Organization System (SKOS) defines the organization of terms into thesaurus form, with broader and narrower terms and alternate terms including alternate language entries. The terms may reside in a registry, where they are maintained by the community that is primarily interested in using them or promoting their use. Programs making use of the vocabularies can act on the machine-readable relationships and can also make use of human-readable definitions or instructions in the user interfaces of applications that call on the vocabulary.
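A thesaurus of this kind might be modeled roughly as follows. The sketch borrows the ideas of preferred labels, alternate labels, and broader terms from SKOS, but it is plain Python rather than SKOS syntax, and the namespace, terms, and field names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

BASE = "http://www.example.com/subjects/"  # hypothetical namespace


@dataclass
class Concept:
    uri: str
    pref_label: str
    alt_labels: dict = field(default_factory=dict)  # language code -> alternate label
    broader: list = field(default_factory=list)     # URIs of broader concepts


# A tiny, invented vocabulary keyed by identifier.
concepts = {
    BASE + "1": Concept(BASE + "1", "Mammals"),
    BASE + "2": Concept(BASE + "2", "Whales", {"fr": "Baleines"}, [BASE + "1"]),
    BASE + "3": Concept(BASE + "3", "Sperm whale", {}, [BASE + "2"]),
}


def narrower(uri):
    """Return the labels of concepts that name `uri` as a broader term."""
    return sorted(c.pref_label for c in concepts.values() if uri in c.broader)


print(narrower(BASE + "1"))               # narrower terms of "Mammals"
print(concepts[BASE + "2"].alt_labels["fr"])  # alternate-language entry for "Whales"
```

Only the broader links are stored; the narrower direction is derived from them, which is one way a program can act on machine-readable relationships.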

Relationships

We're used to defining terms and placing them in lists, with or without hierarchy. We are less accustomed to defining relationships in a formal way. In the example used above, we had a person and a book, and the relationship between them is that the person is the author of the book. This relationship is a common one in human speech, but how can we define it for machines? We define it by giving it an identity. This doesn't mean that the machine will understand the concept behind the identity, but it will always recognize the same relationship because it has the same identifier. The dialog with the machine goes something like:

John Smith has relationship X with Betty Jones

George Johnson has relationship X with Betty Jones

The machine doesn't know what “relationship X” means, but if you asked the question: Who has relationship X with Betty Jones? the algorithm can answer: John Smith, George Johnson. (We'll leave it to you to amuse yourself by playing with different possible meanings of “relationship X.”)
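This dialog can be sketched as a list of three-part statements and a lookup. The third statement and the function name are additions for illustration; in real Linked Data the people and the relationship would each be identified by a URI rather than a plain string.

```python
# Each statement is (subject, relationship, object).
statements = [
    ("John Smith", "relationshipX", "Betty Jones"),
    ("George Johnson", "relationshipX", "Betty Jones"),
    ("Betty Jones", "relationshipX", "Alice Brown"),  # hypothetical extra statement
]


def who_has(relationship, target):
    """Answer: who has `relationship` with `target`?"""
    return [s for s, r, o in statements if r == relationship and o == target]


# The program never learns what "relationshipX" means; it only matches identifiers.
print(who_has("relationshipX", "Betty Jones"))
```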

Rules

Those of us who have been involved in the creation and use of library-related metadata for a long time tend to focus on the data itself. While this work is important, there is little that can be done with the data in an automated environment without rules that govern what the Semantic Web community calls “inferences.” We are all familiar with the concept of inferences from our high school mathematics years, although it would have had another name in that context. The simple statement:

if A = B, and B = C, then A = C

is a rule that allows a mathematically defined inference. Rules can be quite complex, of course, but it is the set of rules that makes the Semantic Web dynamic. Each application environment can create its own rules, but they must be machine-readable. The Semantic Web's standard for retrieval is SPARQL, a query language that allows you to ask questions of Semantic Web data.
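The inference above can be sketched as a small program that applies two rules, symmetry and transitivity, until no new facts appear. This is a toy illustration in plain Python, not how a Semantic Web reasoner is actually built, and the function name is an assumption.

```python
def infer_equivalences(pairs):
    """Apply symmetry and transitivity repeatedly until nothing new is found."""
    facts = set(pairs)
    changed = True
    while changed:
        changed = False
        # Symmetry: if A = B then B = A.
        new = {(b, a) for a, b in facts}
        # Transitivity: if A = B and B = C then A = C.
        new |= {(a, d) for a, b in facts for c, d in facts if b == c and a != d}
        if not new <= facts:
            facts |= new
            changed = True
    return facts


stated = {("A", "B"), ("B", "C")}
inferred = infer_equivalences(stated) - stated

# ("A", "C") was never stated, but the rules produce it.
print(("A", "C") in inferred)
```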

Here is a simple example that shows a vocabulary and its rules, although not in a formal way that could be acted on algorithmically. For each term, we will have a human word, a human-readable definition, and an identifier that will be used in machine applications. We will also define one simple relationship. This is followed by a set of rules:


Term: Series
Identifier: http://www.example.com/publishingTerms/3279
Definition: A group of documents published in an order over time

Term: Book
Identifier: http://www.example.com/publishingTerms/101
Definition: An independently published document

Relationship: isMemberOf
Identifier: http://www.example.com/publishingTerms/73
Definition: Belonging to a set

Rules:
Book can be “isMemberOf” Series
Series cannot be “isMemberOf” Book
Series can be defined as the sum of all Books with relationship isMemberOf

With a little more information about books and series, you can expand the rules to state:

Series can be ordered by: series number
Series can be ordered by: publication date

With these defined terms and rules and the proper coding of books with the “isMemberOf” relationship, books in series can be identified anywhere on the web and can be displayed in one of the possible orders.
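One way these terms and rules might be expressed for a program is sketched below. The three vocabulary identifiers are the ones given in the example; the item identifiers, series numbers, publication dates, and function names are invented for illustration.

```python
VOCAB = "http://www.example.com/publishingTerms/"
SERIES = VOCAB + "3279"
BOOK = VOCAB + "101"
IS_MEMBER_OF = VOCAB + "73"

# Rule: a Book can be "isMemberOf" a Series; no other combination is allowed.
ALLOWED = {(BOOK, IS_MEMBER_OF, SERIES)}

# Hypothetical items and their types.
ITEMS = "http://www.example.com/items/"
SERIES_1, BOOK_A, BOOK_B = ITEMS + "series1", ITEMS + "bookA", ITEMS + "bookB"
types = {SERIES_1: SERIES, BOOK_A: BOOK, BOOK_B: BOOK}

# Hypothetical extra information used for ordering.
info = {
    BOOK_A: {"series_number": 2, "published": "2001-05-01"},
    BOOK_B: {"series_number": 1, "published": "2003-09-15"},
}

links = []


def add_link(subject, relationship, obj):
    """Record a link only if the rules permit it for these types."""
    if (types[subject], relationship, types[obj]) not in ALLOWED:
        raise ValueError("rule violation: this link is not permitted")
    links.append((subject, relationship, obj))


def members(series, order_by):
    """A series is the sum of all books linked to it, in the requested order."""
    books = [s for s, r, o in links if r == IS_MEMBER_OF and o == series]
    return sorted(books, key=lambda b: info[b][order_by])


add_link(BOOK_A, IS_MEMBER_OF, SERIES_1)
add_link(BOOK_B, IS_MEMBER_OF, SERIES_1)

print(members(SERIES_1, "series_number"))  # bookB first, then bookA
print(members(SERIES_1, "published"))      # bookA first, then bookB
```

Attempting `add_link(SERIES_1, IS_MEMBER_OF, BOOK_A)` raises an error, which is the program's version of “Series cannot be isMemberOf Book.”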

An important characteristic of Linked Data is that links can reach across the web. The real value of Linked Data is that the link from a book to a series, for example, only needs to be made once for that information to be available to all instances of that book on the web. Links easily become chains that can move from a single book to a series and then to all of the other books in that series. If at any point there is a connection made between the series and the publisher, then the chain extends to the other books issued by that publisher.
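The chain described here can be followed with two simple lookups. The sketch below uses short invented names in place of URIs, and the “publishedBy” relationship is an assumption added to illustrate the series-to-publisher connection.

```python
# Hypothetical links; real Linked Data would use URIs for every name here.
links = [
    ("book1", "isMemberOf", "seriesA"),
    ("book2", "isMemberOf", "seriesA"),
    ("seriesA", "publishedBy", "publisherP"),
    ("book3", "publishedBy", "publisherP"),
]
BOOKS = {"book1", "book2", "book3"}


def objects(subject, relationship):
    """Everything that `subject` points to through `relationship`."""
    return [o for s, r, o in links if s == subject and r == relationship]


def subjects(relationship, obj):
    """Everything that points to `obj` through `relationship`."""
    return [s for s, r, o in links if r == relationship and o == obj]


def related_books(book):
    """Follow book -> series -> other books, then series -> publisher -> books."""
    found = set()
    for series in objects(book, "isMemberOf"):
        found.update(subjects("isMemberOf", series))
        for publisher in objects(series, "publishedBy"):
            found.update(subjects("publishedBy", publisher))
    found.discard(book)
    return sorted(found & BOOKS)  # keep only books, not the series itself


print(related_books("book1"))
```

Note that book3 is reached from book1 even though no statement connects the two directly; the connection exists only through the chain.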

To connect Linked Data, however, you need data, not documents. Data within the documents needs to be identified so that it can be manipulated as Semantic Web elements. One large store of clearly identified data is Wikipedia, although even its data is not strictly in Semantic Web format. A project called DBpedia is creating a large set of Linked Data based on Wikipedia. This data set then functions as a central point for links into other Semantic Web–compatible data stores, many of which can be found through the Linked Data site, which also provides an interesting graphical view of the growing chain of Semantic Web links (for more on DBpedia and other developing uses of Linked Data, see Fiona Bradley's “Discovering Linked Data” on p. 48).

Linking it all to libraries

The Semantic Web faces some significant barriers in its development, not the least of which is that documents on the web today generally do not contain the data markup that will make linking possible. Names of persons have not been identified as names of persons, and if they were, there would still be no way to disambiguate the many John Smiths contained there. Berners-Lee et al.'s article explains that the web needs ontologies—that is, controlled vocabularies:

Ontologies can enhance the functioning of the Web in many ways. They can be used in a simple fashion to improve the accuracy of Web searches.... More advanced applications will use ontologies to relate the information on a page to the associated knowledge structures and inference rules.

Libraries, however, have fully developed, well-coded metadata that embraces both identification (authority control) and ontologies (controlled vocabularies). Of all the information communities, libraries are in the best position to transition their data into Linked Data because the basic elements already exist in our catalog data. What we need to do is to transform our data into Semantic Web structures and make that data available for linking.

To begin with, imagine what could be accomplished if all library authority data (names, uniform titles, and subjects) were available as Linked Data online. Projects like Wikipedia or the Open Library could link from names in their data to library authority records, creating a basis for name identification on the web. Accurate linking will require more than just the matching of alphanumeric strings; linking will need to use rule-based algorithms to support inference about identities. Such rules might look for matching titles in order to disambiguate author names, or may use birth dates recorded in Wikipedia entries to match names to authors with birth dates in the library name authority files. Nonlibrary applications would be able to link more easily to library data, even though the display forms of names may not be the same. Linked Data over the web could also use human input as part of its decision-making process, increasing the accuracy of linking especially in ambiguous cases (for more on refining data usage with context, see “A Sparrow with a Machine Gun” by R. David Lankes, p. 52).
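A rule of the kind described, using birth dates to choose among authors with the same name, might look like the sketch below. The records, identifiers, and field names are invented and do not reflect the structure of any real authority file.

```python
# Hypothetical authority records; headings differ in display form from web names.
authority_file = [
    {"id": "auth:1", "heading": "Smith, John, 1901-1975", "name": "John Smith", "birth_year": 1901},
    {"id": "auth:2", "heading": "Smith, John, 1950-", "name": "John Smith", "birth_year": 1950},
    {"id": "auth:3", "heading": "Jones, Betty, 1932-", "name": "Betty Jones", "birth_year": 1932},
]


def match_authority(name, birth_year):
    """Return the one authority id matching both name and birth year, else None."""
    candidates = [
        record for record in authority_file
        if record["name"] == name and record["birth_year"] == birth_year
    ]
    # Anything other than exactly one match is left for human review.
    return candidates[0]["id"] if len(candidates) == 1 else None


print(match_authority("John Smith", 1950))   # the name alone is ambiguous; the date decides
print(match_authority("John Smith", 1920))   # no confident match
```

Returning nothing in the uncertain case, rather than guessing, leaves room for the human input mentioned above.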

Next, think about the many vocabularies that libraries use in record creation: codes for places and languages, lists of resource types, musical forms, audience levels. Whether or not other communities wish to use these directly, Linked Data can be used in the so-called “switching” systems, where one vocabulary is translated to another by creating relationships among the different vocabularies' terms. The simplest relationship is “equivalent” for terms that mean the same thing, but other relationships, like broader or narrower, are also possible. With the sharing of vocabularies, different communities can link to one another's data, extending the information reach beyond one's own boundaries. In essence, the web and library catalogs can become extensions of each other.
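A switching system of this sort can be sketched as a table of relationships between terms in two vocabularies. The vocabulary prefixes, terms, and function name below are hypothetical.

```python
# Each mapping is (term in vocabulary A, relationship, term in vocabulary B).
mappings = [
    ("vocabA:juvenile", "equivalent", "vocabB:children"),
    ("vocabA:picture-books", "broader", "vocabB:children"),
    ("vocabA:adult", "equivalent", "vocabB:general"),
]


def switch(term):
    """Translate a term, preferring an equivalent and falling back to a broader term."""
    for wanted in ("equivalent", "broader"):
        targets = [t for s, rel, t in mappings if s == term and rel == wanted]
        if targets:
            return wanted, targets[0]
    return None


print(switch("vocabA:juvenile"))       # an exact translation
print(switch("vocabA:picture-books"))  # only a broader term is available
print(switch("vocabA:unknown"))        # no mapping recorded
```

Reporting which relationship was used lets the receiving application know whether the translation is exact or only approximate.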

One of the key ontologies that libraries have developed is the language of cataloging, primarily embodied today in the MARC21 standard. No other community has such a rich set of descriptors for bibliographic data. Recasting our bibliographic data elements in the form of Semantic Web Linked Data would make it possible to imagine web applications that are directly compatible with library catalog data. It could also be a generous gift from the library community to the many developers who work with bibliographic data in nonlibrary contexts but who do not have the benefit of the deep knowledge of bibliographic structures that is inherent in cataloging data elements.

The development of library metadata into Semantic Web Linked Data is not far away, and in some instances the work has already been done. After experimenting with a version of the Library of Congress Subject Headings (LCSH) in Linked Data format, LC has announced that it will provide LCSH and other authority data online in the near term, with URI identifiers. It has also committed to providing similar access to various vocabularies used in the creation of bibliographic records.

The data elements that support the new cataloging rules, Resource Description and Access (RDA), have already been identified and registered in the National Science Digital Library (NSDL) Metadata Registry and will interact online with the RDA Online service being provided by the publishing arm of the American Library Association. The registry will also include the controlled vocabularies defined in RDA and a registered version of the Functional Requirements for Bibliographic Records (FRBR) elements and relationships.

The availability of the data elements and the various vocabularies online and in a machine-actionable format will provide a basis for the development of programs that make use of this data.

Defining library data online in a form that is machine-actionable, human-understandable, and linkable also means that library data itself can populate the web and be understood and used by any web site that references bibliographic data. Experience with programs like EndNote and Zotero shows that at least some users like the effort-saving capability of downloading bibliographic data rather than re-creating it as they do research. It could also be possible to make connections between citations and library holdings, as well as to take advantage of the capabilities of the web to connect library catalog data to additional information, like author web pages, using algorithms and searches rather than hand-coding of data.

We cannot yet imagine all of the possible uses of library bibliographic data by others, but we have already seen the popularity of nonlibrary use of bibliographic data in systems like Wikipedia and LibraryThing. The data provided in OCLC's WorldCat Identities research project gives us a hint of the richness of library data once it is freed from the capsules of the catalog and of the bibliographic record. Opening library data to the web in a linkable form should allow that richness to be explored on a global scale.


Link List
DBpedia dbpedia.org
EndNote endnote.com
ID.LOC.GOV Web Service id.loc.gov
IFLA FRBR Final Report www.ifla.org/VII/s13/frbr
LibraryThing librarything.com
Linked Data linkeddata.org
NSDL Metadata Registry metadataregistry.org
Open Library openlibrary.org
Semantic Web w3.org/2001/sw
Simple Knowledge Organization System (SKOS) w3.org/2004/02/skos
WorldCat Identities orlabs.oclc.org/Identities
Zotero zotero.org


Author Information
Karen Coyle (kcoyle@kcoyle.net) is a digital library consultant

©2009 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.