Dirty Data
July 7, 2007

A while back I wrote a Library Journal column entitled "The Murky Bucket Syndrome" to describe a problem I perceived in our catalog data. The phrase "murky bucket" actually came from Lorcan Dempsey of OCLC (my new employer), who used it to describe issues he perceived in WorldCat data while trying to make the data work harder -- a long and now increasingly fruitful struggle to wrest more value from our collective database.

I was reminded of this issue when someone I know blogged about it recently from another perspective: that of formats and encodings. After solving a particularly knotty MARC character set issue, Ed Summers was prompted to state unequivocally: "Don’t build systems that import/export MARC in transmission format anymore unless you absolutely have to." In that post Ed also points to William Denton's work to FRBRize Pride and Prejudice, which exposes a number of further problems with our data. Denton is even prompted to state that "There are a lot of ugly MARC records out there."

These experiences are by no means unique. Anyone who has tried to process large amounts of MARC data has encountered "dirty data" problems. These problems range from simple typographical errors to incorrect uses of fields or subfields and just about everything in between. The essential problem is that as we try to do more with our data -- such as mining it for relationships to other records for FRBR-ized displays or to create recommender services -- the consequences of mistakes in our data will become both more apparent and more problematic.

11 July 2007 Update: My colleague Thom Hickey today posted about a new kind of error creeping into our systems. I guess this just goes to show that even if we could completely clean up our data, constant vigilance would still be required. It also underscores the importance of Karen's comment: it should be harder to make mistakes and easier to correct them. I think we'll see improvements on both of these aspects before long.

Posted by Roy Tennant on July 7, 2007 | Comments (3)
July 7, 2007
In response to: Dirty Data K.G. Schneider commented: Well... wouldn't it help if it were easier to fix errors in WorldCat (or even easily report them)?
July 8, 2007
In response to: Dirty Data LAURA DAWSON commented: I found this reminiscent of problems that Muze encountered in 1995 when it licensed Bowker's data - and attempted a form of FRBRization and other merchandising-oriented data-mining. And it's something I warn clients about all the time - if you are bringing your data out into the cold light of day, be prepared to clean it.
July 9, 2007
In response to: Dirty Data Patricia Thompson commented: Maybe there are so many ugly MARC records out there because there are so many libraries who didn't think it was important to train their catalogers. "Just get a record in there -- who cares if all those subfields aren't quite right? As long as it works ok in our local system, that's all that matters." Or "We don't have time to scrutinize every record that much. We need to get more done faster."