Library Journal Mobile

Tennant: Digital Libraries   





Dirty Data

July 7, 2007

A while back I wrote a Library Journal column entitled "The Murky Bucket Syndrome" to describe a problem I perceived in our catalog data. The phrase "murky bucket" actually came from Lorcan Dempsey of OCLC (my new employer) in relation to issues he perceived in WorldCat data while trying to make the data work harder -- a long and now increasingly fruitful struggle to wrest more value from our collective database.

I was reminded about this issue when someone I know blogged about it recently from another perspective -- from the angle of formats and encodings. After solving a particularly knotty MARC character set issue, Ed Summers was prompted to state unequivocally: "Don’t build systems that import/export MARC in transmission format anymore unless you absolutely have to."

In that post Ed also points to William Denton's work to FRBRize Pride and Prejudice, which exposes a number of further problems with our data. Denton is even prompted to state that "There are a lot of ugly MARC records out there."

These experiences are by no means unique. Anyone who has tried to process large amounts of MARC data has encountered "dirty data" problems. These problems can range from simple typographical errors to incorrect uses of fields or subfields and just about everything in between.
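To make that range of problems concrete, here is a minimal sketch of the kind of check one might run over a batch of records. Everything in it is an assumption for illustration: the record shape (a list of tag and subfield pairs) is a simplification rather than a real MARC parser, the `EXPECTED_SUBFIELDS` table is a deliberately incomplete subset of subfield codes, and `find_problems` is a hypothetical helper name.

```python
import re

# Hypothetical, simplified record shape: a list of (tag, subfields) pairs,
# where subfields is a list of (code, value) pairs. Not a real MARC parser.
# Illustrative and deliberately incomplete: codes we expect on a few tags.
EXPECTED_SUBFIELDS = {
    "100": set("abcdq"),   # main entry, personal name
    "245": set("abchnp"),  # title statement
    "260": set("abc"),     # publication details
}

def find_problems(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    tags = [tag for tag, _ in record]
    if tags.count("245") != 1:
        problems.append("expected exactly one 245 (title) field")
    for tag, subfields in record:
        # A tag should be exactly three digits; typos often break this.
        if not re.fullmatch(r"\d{3}", tag):
            problems.append(f"malformed tag {tag!r}")
            continue
        allowed = EXPECTED_SUBFIELDS.get(tag)
        for code, value in subfields:
            if allowed is not None and code not in allowed:
                problems.append(f"unexpected subfield ${code} in {tag}")
            if not value.strip():
                problems.append(f"empty subfield ${code} in {tag}")
    return problems

record = [
    ("100", [("a", "Austen, Jane,"), ("d", "1775-1817.")]),
    ("245", [("a", "Pride and prejudice /"), ("z", "Jane Austen.")]),
    ("26O", [("a", "London :")]),  # letter O typed instead of zero
]

for problem in find_problems(record):
    print(problem)
```

On this sample the check flags the misplaced subfield in the title field and the mistyped tag. A check like this only catches errors of form; a correctly formed field holding the wrong information passes untouched.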

The essential problem is that as we try to do more with our data -- such as mining it for relationships to other records for FRBR-ized displays or to create recommender services -- the consequences of mistakes in our data will become both more apparent and more problematic.
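A toy example shows why mistakes matter more once we start relating records to each other. The sketch below groups records into "works" by a normalized author and title key. It is not the algorithm behind any actual FRBR work-set service; `work_key` and `group_into_works` are hypothetical names, and the three records are made up for illustration.

```python
from collections import defaultdict
import re

def work_key(author, title):
    """Build a crude grouping key: lowercase, drop punctuation, squeeze spaces."""
    def norm(text):
        text = re.sub(r"[^\w\s]", "", text.lower())
        return " ".join(text.split())
    return (norm(author), norm(title))

def group_into_works(records):
    """Map each grouping key to the list of record ids that share it."""
    works = defaultdict(list)
    for rec_id, author, title in records:
        works[work_key(author, title)].append(rec_id)
    return dict(works)

records = [
    ("r1", "Austen, Jane", "Pride and prejudice"),
    ("r2", "Austen, Jane.", "Pride and Prejudice /"),  # punctuation and case differ
    ("r3", "Austin, Jane", "Pride and prejudice"),     # misspelled author
]

works = group_into_works(records)
for key, ids in works.items():
    print(key, ids)
```

Normalization absorbs the differences in case and punctuation, so r1 and r2 land together. The single misspelled letter in r3 splits it off into a "work" of its own, and in a display built on these groups that edition simply goes missing from the set where a reader would look for it.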

11 July 2007 Update: My colleague Thom Hickey today posted about a new kind of error creeping into our systems. I guess this just goes to show that even if we could completely clean up our data, constant vigilance will still be required. This also points out the importance of Karen's comment -- it should be harder to make mistakes and easier to correct them. I think we'll see improvements on both of these aspects before long.

Posted by Roy Tennant on July 7, 2007 | Comments (3)




July 7, 2007
In response to: Dirty Data
K.G. Schneider commented:

Well... wouldn't it help if it were easier to fix errors in WorldCat (or even easily report them)?

Not only that, is part of this due to what I wrote about on Techsource, which is that cataloging in general is ruled by complex, implicit rule sets that are not explicitly and consistently described?

In other words, it needs to be easier to fix mistakes, and harder to make them.




July 8, 2007
In response to: Dirty Data
LAURA DAWSON commented:

I found this reminiscent of problems that Muze encountered in 1995 when it licensed Bowker's data - and attempted a form of FRBRization and other merchandising-oriented data-mining. And it's something I warn clients about all the time - if you are bringing your data out into the cold light of day, be prepared to clean it.




July 9, 2007
In response to: Dirty Data
Patricia Thompson commented:

Maybe there are so many ugly MARC records out there because there are so many libraries who didn't think it was important to train their catalogers. "Just get a record in there-- who cares if all those subfields aren't quite right? As long as it works ok in our local system, that's all that matters." or "We don't have time to scrutinize every record that much. We need to get more done faster."

Also, I wonder if other forms of data aren't just as dirty. My ex-husband used to be a programmer, and he was always fixing bad data, and it wasn't MARC.










©2009 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.