Newswire Analysis: Google Scholar's Ghost Authors, Lost Authors, and Other Problems
Why the popular tool can't be used to analyze the publishing performance and impact of researchers
Peter Jacso, University of Hawai'i at Manoa -- Library Journal, 09/24/2009
Geoffrey Nunberg’s essay criticizing Google’s Book Search (GBS), which he subtitled “A Disaster for Scholars,” emphasized that disturbing errors are endemic. He well recognizes that for mainstream “Googling” purposes “we don’t really care about metadata … provided by a library catalog.” In perhaps his most discouraging point, Nunberg notes that the Google team blames libraries and publishers for bad data.
All these rhyme perfectly with my experience working with another of the search giant’s data-crunching products, Google Scholar (GS). With GS, however, I blame mostly the developers. They decided—very unwisely—not to use the good metadata generously offered to them by scholarly publishers and indexing/abstracting services, but instead chose to try and figure them out through ostensibly smart crawler and parser programs.
Thus research faculty and academic/special libraries dealing with GS face their own metadata disaster, one with dire consequences in evaluating the scholarly publishing productivity and impact of researchers, institutions, journals, and even countries. Millions of records have erroneous metadata, as well as inflated publication and citation counts.
A free tool, Google Scholar has become the most convenient resource to find a few good scholarly papers—often in free full-text format—on even the most esoteric topics. For topical keyword searches, GS is most valuable. But it cannot be used to analyze the publishing performance and impact of researchers.
Citation problems
Google’s algorithms create phantom authors for millions of papers. They derive false names from options listed on the search menu, such as P Login (for Please Login).
Very often, the real authors are relegated to ghost authors deprived of their authorship along with publication and citation counts. In the scholarly world, this is critical, as the mantra “publish or perish” is changing to “publish, get cited or perish.”
Compounding the problem, the inflated publication and citation counts produced by GS will embarrass those who take the reported numbers at face value, as they discover that many of the publications, randomly scattered in the detailed result lists, are just variant formats of the same paper, and the citations are mismatched.
While GS developers have fixed some of the most egregious problems that I reported in several reviews, columns and conference/workshop presentations since 2004—such as the 910,000 papers attributed to an author named “Password”—other large-scale nonsense remains and new absurdities are produced every day.
Skewed data
Google Scholar's publication and citation counts, and its metadata for bibliometric and scientometric evaluations, too often resemble Bernie Madoff’s profit numbers. Just as investors preferred the non-existent reality described by Madoff’s figures, users may like the publication and citation counts reported by Google Scholar, and the many inflated indicators derived from them.
We scholars, too, are well treated by GS, at least at first glance. For my own scholarly works, the publication and citation counts, the h-index, and derivative indices are considerably higher than they should be. I ought to be happy with Google Scholar—but I am not.
Inflated numbers
The numbers in GS are inflated for two main reasons. First, when reporting the total hits for an author name search, GS lumps together the number of master records (created from actual publications) and the number of citation records (distinguished by the prefix [citation]).
For example, it claims that the search for my name as author found 578 records, so I have 578 publications. However, 403 (about 70%) of these are records extracted from the reference sections of publications that cite my papers. The example below shows five entries (all of the [citation] type), and the single paper is counted as five publications instead of one. In some cases, the inflation rate is much higher.
In this case GS did not create a master record for my first review after its debut, probably because the crawler did not go deep enough on the site of the GaleGroup, which has hosted my reviews. Nor has it aggregated the [citation] type entries, because one or more of their metadata elements are wrong or incomplete. The same happens very often even when GS has created a master record: differences among the [citation] records prevent them from being aggregated under it. I call these entries stray references/citations.
In fairness, there are some correct citations for papers that do not have master records in GS. These orphan references/citations can and should be counted as a proxy for master records for the given publication, because the crawler did not pick up the document.
In contrast, fee-based Web of Science and Scopus have lower article and citation counts and scientometric indicators, as they have a far more selectively defined source base with fewer journals from which to gather publication and citations data. In addition, they count only the master records for the authors’ publication count (as they should), and keep the stray and orphan citations in a separate file.
These stray and orphan citations in Web of Science and Scopus are normalized manually only in exceptional cases, because it is a very tedious process. I did so in my contribution to the Festschrift edition of Library Trends, celebrating the 75th birthday of F.W. Lancaster, to calculate a realistic publishing performance and impact indicator for him.
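The difference between the two counting conventions can be made concrete with a minimal Python sketch. The figures 578 and 403 are the ones reported for the author search above; the function name and the idea of an "inflation factor" are illustrative assumptions, not anything GS, Web of Science, or Scopus exposes.

```python
def master_record_count(total_hits, citation_records):
    """Publication count when only master records are counted.

    total_hits: everything GS reports for the author search.
    citation_records: hits of the [citation] type.
    """
    if citation_records > total_hits:
        raise ValueError("citation records cannot exceed total hits")
    return total_hits - citation_records


# Figures from the author search described in the text.
total_hits = 578
citation_records = 403

masters = master_record_count(total_hits, citation_records)
inflation = total_hits / masters  # reported hits per master record

print(masters)              # 175
print(round(inflation, 1))  # 3.3
```

Under these assumptions, counting only master records leaves 175 publications, so the reported hit count is roughly 3.3 times the master-record count. Orphan citations that stand in for a missing master record would raise the true figure somewhat, which this sketch does not model.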
In the example below GS produces two master records from the publisher’s site (there should be only one), and a single, obviously stray [citation] type record which, however, cannot be verified as GS can’t find it when clicking on it. There are higher duplicate rates beyond this, as when a paper or its manuscript version is also posted on several pre-print and reprint archives, and on other web sites.
Such duplicates, triplicates, quadruplicates, etc. also inflate the citation counts as illustrated here for the publication records of the 2004 review shown earlier.
Compounding errors
Unfortunately, the bad metadata has a long reach. These numbers are taken at face value by the free utilities such as the Google Scholar Citation Count gadget by Jan Feyereisl and the sophisticated and pretty Publish or Perish (PoP) software (produced by Tarma Software).
Such utilities turn many people into neophyte citation analysts who don’t see, don’t want to see, or assertively deny the metadata mess in GS, and produce ranking lists of researchers and journals based on both metadata and publication and citation count reported by GS.
Such programs produce simple and complex research performance indicators, such as total number of papers and citations, average citations per paper and author, authors per paper, citations per year, ratio of papers cited/published, and the widely popular single-number Hirsch index and its many derivatives.
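How duplicate records feed through to these indicators can be shown with a short Python sketch. The h-index logic follows Hirsch's standard definition (the largest h such that h papers have at least h citations each); the citation counts are invented for illustration and do not describe any real author, and the function names are hypothetical rather than taken from any of the utilities named above.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def indicators(citation_counts):
    """A few of the simple indicators such utilities report."""
    papers = len(citation_counts)
    citations = sum(citation_counts)
    return {
        "papers": papers,
        "citations": citations,
        "cites_per_paper": citations / papers if papers else 0.0,
        "h_index": h_index(citation_counts),
    }


# Hypothetical author with six distinct papers.
clean = [10, 8, 5, 4, 3, 0]
# The same author when three papers also appear as duplicate records,
# each duplicate carrying its own copy of the citation count.
inflated = clean + [10, 8, 5]

print(indicators(clean)["h_index"])     # 4
print(indicators(inflated)["h_index"])  # 5
print(indicators(inflated)["papers"])   # 9
```

In this made-up case three duplicate records raise the paper count from 6 to 9, the citation total from 30 to 53, and the h-index from 4 to 5, without a single new publication or citation.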
Ghosts in the machine
As about 10.2 million records from GBS are incorporated now in GS, the metadata disaster likely will continue unabated. It is bad enough to have so many records with erroneous publication years, titles, authors, and journal names.
It becomes much worse with millions of ghost author names fabricated by the artificial unintelligence of the GS parser.
False names are created from options on the search menu, such as P Options (for Payment Options); from parts of the author affiliation (CA San Diego, C Ltd, M View for Mountain View); from Table of Contents pages on publishers’ web sites; and from section headings of articles (B Methods, D Definitions, G Assessment, H Variables, I Evaluation). The initial varies depending on the section-identifying letter or Roman numeral.
In its stupor, the parser fancies as author names (parts of) section titles, article titles, journal names, company names, and addresses, such as Methods (42,700 records), Evaluation (43,900), Population (23,300), Contents (25,200), Technique(s) (30,000), Results (17,900), Background (10,500), or, in a whopping number of records, Limited (234,000) and Ltd (452,000). The cumulative total for the above “authors” kept growing by several hundred thousand hits during the few days this paper was being written. More screenshots are available here.
The parser apparently gives high priority to creating nonsense author names from menu options on the journals’ home page, such as Login (5,230), or Subscribe (73,400).
Subscribe seems to be a particularly common author name in GS. This name may not be easy to spot just by browsing the result lists, however, because it may appear in masked form under the initials STO, or SOR if the menu option is Subscribe To or Subscribe or Renew.
The parser knows no hurdles, and fabricates a single initial or several initials from the letter or Roman numeral preceding the section titles (I Background, V Findings, X Conclusions) and from the first word or words of the menu options of the home page: P Login (from Please Login), N Subscriber (from New Subscriber), A Registered (from Already Registered), SD Access (from Science Direct Access).
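Google has not published how its parser works, so the following Python sketch is purely hypothetical. It only shows that one naive rule, "treat the last word as a surname and abbreviate everything before it to initials," is enough to reproduce the phantom names listed above when applied to text that is not an author byline. The function name and the rule itself are assumptions made for illustration.

```python
def naive_author_guess(text):
    """Hypothetical rule: last word = surname, earlier words = initials.

    Applied blindly, it turns menu options and section headings
    into plausible-looking author names.
    """
    words = text.split()
    if len(words) < 2:
        return text.strip()
    initials = "".join(word[0].upper() for word in words[:-1])
    return f"{initials} {words[-1]}"


samples = [
    "Please Login",           # menu option
    "New Subscriber",         # menu option
    "Science Direct Access",  # menu option
    "X Conclusions",          # section heading with a Roman numeral
]
for sample in samples:
    print(naive_author_guess(sample))
# P Login
# N Subscriber
# SD Access
# X Conclusions
```

A parser that consumed publisher-supplied author fields would never need such a guess, which is the point of the criticism here.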
GS ignores existing, correct publication years, fancying page numbers, volume numbers, parts of document codes, ZIP codes, and street addresses of author affiliations as publication years instead. Table of contents pages are favorite hunting grounds for GS crawlers, and the parser can really make a big mess.
Lost authors abound
These errors could be considered relatively harmless if they did not affect the contributions of genuine, real scholars. But the biggest problem is when the mess replaces real scholars with ghost authors, leaving the former as lost authors.
In the sample record below, GS manages to deprive all eight authors not only of their authorship but also of their substantial volume of citations received (neither of which is usually prorated or fractionalized in measuring researchers’ publishing performance and impact).
From the correct short record of the original article (above), GS manages to produce the record below and advises that it was cited 1709 times. Authors do make erroneous references, but no one could be so far under the influence as to use these author names in a reference. This should raise more concern about the poor quality of the citation-matching algorithm and process of GS, to say the least.
Searching by the first author’s name and part of the title, GS finds one master record and nine [citation] records with fewer than a total of 100 citations combined. Searching by the other authors—using their names with the parts of the title—retrieves fewer records and citations, so the loss is different but quite significant for all the eight real authors who were robbed of their authorship and citations.
If you wonder how the parser came up with this nonsense, the answer is easy. It took from the table of contents page the title of one paper, fabricated the first initials and the last name of the first author from the subtitle of another paper "Where Do We Stand Now," and the second author’s name from the title of a third paper, taking good care of concocting also double initials from the title: "Melanoma Risk Assessment." This is a joke and a very bad one.
It happens very often in records for papers in The Lancet, but this type of error is endemic. It may be that the data harvested from The Lancet is on the route of the most under-trained crawler/parser puppy GS has unleashed.
Attributing errors
The above examples show only the tip of the iceberg. Certainly the entire database isn’t rotten, just a few million records. That may be a relatively small percentage—Google won’t reveal the total number of records, and these are just my few forensic search test queries—but there’s ample cause for worry.
In the case of GBS, Google relied on its collective Pavlovian reflex to blame the publishers and libraries (meaning the librarians, catalogers, indexers) for the wrong metadata.
In the case of Google Scholar, these same Googlish arguments will not fly, because practically all the scholarly publishers gave Google—hats in hand—their digital archive with metadata. The idea was to have Google index it and drive traffic to the publishers’ sites.
Yes, GS has fairly quickly fixed some of the major errors that I earlier used to demonstrate its illiteracy and innumeracy, but it has so far left millions of others untouched.
I am happy that I no longer see many of my most disliked phantom authors fabricated by Google Scholar, such as members of the Password family credited with authoring 910,000 papers (with F Password alone infringing the rights of the real authors of 102,000 papers). But there’s a lot more work for Google to do.
How did we get here?
It must have taken some time to create such an imbecile parser. In the early days the GS developers decided not to use the metadata readily available from most of the scholarly publishers. This is obvious from the highly improved, intelligent (free) Scirus system that has made smart use of the publishers' metadata after its first bad steps that I criticized upon its debut.
The press and the public were so enamored of anything with the word Google in it that GS developers apparently believed they could create a parser to identify the metadata better than the human indexers at the publishers, repositories, and indexing/abstracting services who assigned metadata by listing author, title, journal name, publication year, and other metadata elements.
GS designers have sent very under-trained, ignorant crawlers/parsers to recognize and fetch the metadata elements on their own. Not all of the indexing/abstracting services are perfect and consistent, but their errors are dwarfed by the types and volume of those in GS. This is the perfect example of the lethal mix of ignorance and arrogance GS developers applied to metadata and relevance ranking issues.
No reason for optimism
I believe that, just as with GBS, commercial passion is the deciding factor for Google. So I am far less optimistic than Nunberg about Google’s pledges to improve the metadata “train-wreck” (to borrow his term).
The parsers have not improved much in the past five years despite much criticism. GS developers corrected some errors that got negative publicity, but these were Band-Aids, where brain surgery and extensive parser training are required. Without these, GS will keep producing similar errors on a mega-scale.
Peter Jacso is professor and chair of the Library and Information Science Program in the Department of Information and Computer Sciences at the University of Hawai'i at Manoa.