Volume 43, no. 2, 2006



Editorial, Sandra K. Roe, Editor-in-Chief

Cataloging News, Daniel Lovins, Cataloging News Editor

 

Observations on the Catalogers' Role in Descriptive Metadata Creation in Academic Libraries
Jeanne M. K. Boydston and Joan M. Leysen

ABSTRACT: This article examines the case for the participation of catalogers in the creation of descriptive metadata. Metadata creation is an extension of catalogers' existing skills, abilities, and knowledge; as such, it should be encouraged and supported. Issues in this process, such as cost, the supply of catalogers, and the need for further training, are also examined. The authors use examples from the literature and their own experiences in descriptive metadata creation. Suggestions for future research on the topic are included.

KEYWORDS: Metadata, cataloging, digitization, cataloger's role

 

Growing Our Own: Mentoring the Next Generation of Catalog Librarians
Christine DeZelar-Tiedman, Beth Picknally Camden, and Rebecca Uhl

ABSTRACT: This paper traces the development of a mentoring program for aspiring catalogers, sponsored and administered by the ALCTS CCS Committee on Education, Training, and Recruitment for Cataloging (CETRC). Background is given on the reasons for establishing the program, as well as the two pilot programs that preceded the current, ongoing mentoring service. Results of the assessment of the second pilot are shared. Though CETRC still faces challenges in sustaining the program on an ongoing basis, the Committee feels it is a valuable endeavor worth continuing.

KEYWORDS: Cataloging; catalogers; mentoring; mentors; mentees; Association for Library Collections & Technical Services (ALCTS); Committee on Education, Training and Recruitment for Cataloging (CETRC); recruitment; professional development

 

The Application of AACR2's Rules for Personal Names in Certain Languages
Philip Hider and Saralee Turner

ABSTRACT: The Anglo-American Cataloguing Rules include special rules for personal name headings in certain languages under 22.21-22.28. This article investigates the extent to which four of these rules, pertaining to Indonesian, Malay, and Thai names, have been applied by catalogers contributing to the Australian National Bibliographic Database, and discusses the value of these rules in the context of the general rules they supplement. It was found that many headings were not compliant with the rules, especially those resulting from English-language cataloging. Given catalogers' apparent difficulty in applying the special rules, it is recommended that they be deleted, that the general rules be further generalized, and that more use be made of relevant linguistic and cultural resources.

KEYWORDS: Anglo-American Cataloguing Rules, personal name headings

 

Romanization in Cataloging of Korean Materials
SungKyung Kim

ABSTRACT: This paper analyzes cataloging rules for Korean materials, focusing on the McCune-Reischauer (MR) system, the Korean romanization scheme currently used in the United States. This system has long been used in many Western countries and was officially adopted by the Library of Congress (LC) for the cataloging of Korean language materials. Considering users' information-seeking behavior and searching abilities, however, the MR system has many drawbacks that limit users' ability to retrieve information. This paper analyzes bibliographic records in academic libraries, the LC, and the Research Libraries Information Network (RLIN) to identify the issues and problems of the MR system. A user survey demonstrates that the MR system is not well matched to users' searching abilities. Several solutions are suggested to overcome the limitations of the MR system.

KEYWORDS: Cataloging, Korean Romanization, McCune-Reischauer system, CJK, information retrieval

 

Main Issues in Cataloging Persian Language Materials in North America
Fereshteh Molavi

ABSTRACT: The main problems of cataloging Persian language materials as practiced in North America are discussed. The problems are grouped by origin: those that originate from the implementation of the ALA-LC Romanization Tables for Persian; those that occur either because of misleading examples in that table's rules-for-application section or because of a lack of functional knowledge of Persian; and those that appear in the treatment of names generally, and in the choice and form of main entry specifically, due to the application of inappropriate rules for Persian names. Suggestions for dealing with these cataloging issues are presented.

KEYWORDS: Cataloging, Persian language materials, ALA-LC Romanization Tables, North America

 

Objectivity and Subject Access in the Print Library
Chew Chiat Naun

ABSTRACT: Librarians have inherited from the print environment a particular way of thinking about subject representation, one based on the conscious identification by librarians of appropriate subject classes and terminology. This conception has played a central role in shaping the professionís characteristic approach to upholding one of its core values: objectivity. It is argued that the social and technological roots of traditional indexing practice are closely intertwined. It is further argued that in traditional library practice objectivity is to be understood as impartiality, and reflects the mediating role that librarians have played in society. The case presented here is not a historical one based on empirical research, but rather a conceptual examination of practices that are already familiar to most librarians.

KEYWORDS: Subject access, cultural bias, literary warrant, information retrieval

 


Editorial by Sandra K. Roe

Cataloging & Classification Quarterly is only possible with the help of many individuals who contribute a wide variety of expertise to the work: authors, reviewers, and columnists. Manuscripts submitted to CCQ go through a double-blind peer review process, and those reviewers are generally assigned from among the members of the editorial board. Less frequently, someone outside the board is asked to serve as a reviewer. I would like to publicly thank those who have graciously agreed to review manuscripts over the past two years but whose names you will not find on our list of editorial board members. These include James Agenbroad, Pauline Atherton Cochrane, J. McRee Elrod, Pat Lawton, and Priscilla Matthews. Thank you.

This issue begins with an article that continues the current discussion on the future role of the cataloger, here by examining the involvement of catalogers in the creation of descriptive metadata for digitization programs in academic libraries. The second article documents the development and assessment of a mentoring program for aspiring catalogers.

Three articles within this issue speak to personal name headings for Indonesian, Malay, Thai, Korean, and Persian names. The first evaluates the special rules for personal name headings in AACR2, examines how those rules have been applied by catalogers contributing to the Australian National Bibliographic Database, and recommends deleting these special rules and generalizing others. The second article describes an analysis of bibliographic records and a user study related to the McCune-Reischauer romanization scheme for Korean currently in use in our catalogs, and introduces us to a newer scheme developed by the South Korean Ministry of Culture and Tourism. The third article articulates problems that result from the implementation of the ALA-LC Romanization Tables for Persian and suggests solutions.

The final article in this issue speaks thoughtfully about the values of objectivity and impartiality as they relate to subject representation, whether accomplished through traditional library indexing or through newer methods, and to the social mission of libraries.

The cataloging news column concludes this issue, and there has certainly been no shortage of news. A few of the topics you will find discussed include the discontinuation of series authority records by the Library of Congress; the Calhoun report and Thomas Mann's response; catalog innovations such as the XML-based Extensible Catalog (XC) at the University of Rochester and the Endeca implementation at North Carolina State University; the RLG-OCLC merger; and the long tail.

 


 

Cataloging News

Daniel Lovins, News Editor

Welcome to the news column. Its purpose is to disseminate information on any aspect of cataloging and classification that may be of interest to the cataloging community. This column is not just intended for news items, but serves to document discussions of interest as well as news concerning you, your research efforts, and your organization. If you have any pertinent materials, notes, minutes, or reports, please contact Daniel Lovins (email: daniel.lovins(at)yale.edu; phone: 203-432-1707). News columns will typically be available prior to publication in print from the CCQ website at http://catalogingandclassificationquarterly.com/.

We would appreciate receiving items having to do with:

Research and Opinion

Events

People

 

June 2006

Research and Opinion

Introduction

The revolution in cataloging (and librarianship in general) has continued unabated since my last column. For example, the Library of Congress (LC) announced the discontinuation of series authority records (SARs) effective June 1, 2006; the Research Libraries Group (RLG) announced its merger with OCLC, effective July 1, 2006; and a steady stream of reports, white papers, and opinion pieces has continued to fuel debate on the future of our profession. At the same time, the new cataloging code is moving ahead quickly, innovative OPAC designs are being introduced, and the dream of a single open-access universal library seems closer than ever to being realized.

LC Pulls Plug on Series Authority Records

According to official sources, LC stopped creating series authority records (SARs), effective June 1, 2006. Reaction to the new policy has been swift and mostly negative:

The ALA Executive Board stated that "Keyword search is not an adequate substitute for authority controlled series access" and that "Any diminution of the quality or quantity of cataloging provided by the Library of Congress has an enormous financial impact on all of the nation's libraries," and expressed concern that "the importance of Library of Congress cataloging to the nation's libraries and to the development of an educated and informed populace is not sufficiently appreciated by the Library's senior administration."

The Library of Congress Professional Guild issued a resolution to "strongly oppose the recent decision by Library of Congress management to cease the production of series authority records (SARs), based on the extreme nature of the decision and the unilateral manner in which it was handed down without any opportunity for the staff and other concerned parties to voice their opinions," and that it will "erode the Library of Congress's role as a proponent of high standards" and increase rather than reduce costs (e.g., by requiring the analysis and separate classification of works otherwise included in collected set records).

The Africana Librarians Council issued an open letter, noting, among other things, that "series control, as with all aspects of bibliographic control, is critically important in the ever-expanding world of book publishing in Africa … African studies readers in the U.S. rely upon series names as brands of quality."

The Music Library Association expressed "serious concerns about several aspects of LC's recently announced decision to cease creating series authority records. The decisions will have a detrimental impact on the music cataloging community."

Other groups, such as the American Association of Law Libraries and the Special Libraries Association, have similarly come out against the decision. The Program for Cooperative Cataloging (PCC) issued a more neutral statement, recognizing and supporting "the right of the Library of Congress (LC) to make cataloging decisions in its own best interest," but also warning that the series policy change "has widespread ramifications—especially in a context where, until now, there has been a one-to-one correspondence between LC and PCC standards."

The one group (of which I'm aware) to give full support to the LC decision was the Association of Research Libraries (ARL). Perhaps not surprisingly, the administration-oriented ARL endorsed LC's efforts "to redesign its services in order to focus better on the needs of the end-user -- the individual researcher -- and to streamline processes in order to make information accessible more conveniently and more quickly."

Is Cataloging Obsolete?

Perhaps nowhere is the current debate on cataloging more sharply drawn than in the public statements of Karen Calhoun and Thomas Mann. In her 2006 LC-commissioned report, "The Changing Nature of the Catalog and its Integration with Other Discovery Tools," Calhoun, A.U.L. for Technical Services at Cornell, asserts that "a large and growing number of students and scholars routinely bypass library catalogs in favor of other discovery tools, and the catalog represents a shrinking proportion of the universe of scholarly information." Her strategy for regaining market share includes qualified support of RDA, promotion of FRBR, simplification of cataloging practices, and possibly the "dismantling" of LCSH. Similarly, in her February 2006 address to the PCC Participants Meeting, she criticized librarians' "marketing myopia" in the face of their shrinking presence in the "global infosphere." The good news, Calhoun seemed to suggest, is that catalogers know how to disintermediate themselves gracefully when asked. Witness the public catalog, she points out: the original disintermediation tool, which allowed patrons to bypass the librarian and, with call number in hand, help themselves to books in the stacks. Are we now going through another phase of disintermediation, this time courtesy of intelligent search engines?

Thomas Mann, a reference librarian at LC and author of the Oxford Guide to Library Research, takes Calhoun to task for (in his view) misrepresenting research data and having more of a pro-business than pro-scholarship agenda. Profit-oriented solutions ought not to be applied, he believes, to an institution long accepted as a subsidized public good. He also worries about misleading and unflattering comparisons between OPACs and Internet search engines. Google-type relevancy ranking "is expressly designed and optimized for quick information seeking rather than scholarship," he points out, whereas classification, authority control, subject analysis, and other forms of bibliographic control are precision tools for serious research. He further believes Calhoun misreads the survey data, conflating a growing preference for starting research on the Web with what may be undiminished demand for traditional library services at the more advanced stages.

In Calhoun's defense, and irrespective of the ambiguous empirical evidence she cites (correctly or incorrectly), what if she is simply the bearer of bad tidings, i.e., pointing out that the public's willingness to subsidize this public good is itself in decline? Even though libraries are not profit-seeking businesses, they nevertheless operate on the principle of supply and demand. When demand drops off, provosts, trustees, taxpayers, granting agencies, et al., may simply look for other pressing causes to support, and libraries must struggle to survive.

Changing perceptions of the library catalog are evidenced not just in the flurry of recent reports—e.g., the Calhoun report, Indiana University's "White Paper on the Future of Cataloging", Deanna Marcum's "Future of Cataloging" address, the Report of the Task Group on the PCC Mission Statement, and the University of California Libraries Bibliographic Services Task Force Final Report—but also in the blogosphere and at IT conferences. In a previous column I cited Ellyssa Kroski's InfoTangle blog posting, that "the wisdom of crowds, the hive mind, and the collective intelligence are doing what heretofore only expert catalogers, information architects, and Website authors have done. They are categorizing and organizing the Internet and determining the user experience, and it's working." Clay Shirky and others have argued that, in the era of ubiquitous, networked, digital information, semantic systems such as LC Classification have outlived their usefulness; folksonomies and social tagging, in his view, do the job better. A wave of new Web applications such as LibraryThing, del.icio.us, Technorati, and Flickr promises to make cataloging a more spontaneous, distributed, and even 'recreational' activity.
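
To make the folksonomy idea concrete, here is a minimal sketch (in Python, with invented users, items, and tags) of how a consensus vocabulary can emerge from uncoordinated social tagging: each user's tags are simply pooled, and the most frequent tags for an item become its de facto subject terms.

    from collections import Counter

    # Hypothetical tagging events: (user, item, tag), as a site like
    # del.icio.us or LibraryThing might record them.
    events = [
        ("u1", "isbn:9780306406157", "physics"),
        ("u2", "isbn:9780306406157", "physics"),
        ("u3", "isbn:9780306406157", "textbook"),
        ("u1", "isbn:9780306406157", "mechanics"),
        ("u4", "isbn:9780306406157", "physics"),
    ]

    def consensus_tags(events, item, top_n=3):
        """Pool every user's tags for one item and rank them by frequency."""
        counts = Counter(tag for _, i, tag in events if i == item)
        return counts.most_common(top_n)

    # The crowd's top-ranked tags play the role of subject headings.
    print(consensus_tags(events, "isbn:9780306406157"))
    # [('physics', 3), ('textbook', 1), ('mechanics', 1)]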

Reports of Obsolescence May Be Premature

Despite all the hand-wringing about possible catalog obsolescence, there are some notable bright spots that suggest the opposite, namely, that catalogers and catalog designers are keeping up with Google-era expectations. The University of Rochester, for example, has won a Mellon Foundation grant to support its innovative XML-based Extensible Catalog (XC). Rochester's River Campus Libraries will receive $283,000 to perform planning and requirements analysis for the new system. It is hoped that the project will spawn "inexpensive, flexible alternatives to using off-the-shelf software to provide access to library collections."

Another celebrated development is the North Carolina State University (NCSU) implementation of Endeca's ProFind faceted navigation tool. Andrew Pace, NCSU library systems director, had famously complained that interfaces to library OPACs were tantamount to "putting lipstick on a pig." Now it has been declared (by Roy Tennant in Web4Lib, I think) that finally "NCSU has butchered the pig." One of the remarkable things about ProFind is that it exploits the thesaural power of classification tables and subject headings better than homegrown library tools have been able to do. At a time when some administrators question the value of structured ontologies and controlled vocabularies (see "Is Cataloging Obsolete?" above), it is interesting to see how a non-library company like Endeca can remind us of how exquisitely useful these tools can be.

Because of the way it uses thesaural relationships, a user can "drill down" through the classification outline or trace LCSH facets to the single-item level without ever executing a search argument, instead browsing the entire collection via semantically linked facets. The ProFind-powered OPAC has also been praised for its ability to integrate NCSU's Sirsi catalog with other research databases in a unified browse display.
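
For readers who want a feel for the mechanics, here is a minimal sketch (in Python, over invented records) of faceted drill-down: count the values of each facet across the current result set, then narrow the set by whichever facet value the user clicks. This illustrates the general technique, not Endeca's actual implementation.

    # Invented bibliographic records with two facets: an LC class and an LCSH term.
    records = [
        {"title": "Pig Husbandry", "lcc": "SF", "lcsh": "Swine"},
        {"title": "Farm Economics", "lcc": "HD", "lcsh": "Agriculture"},
        {"title": "Veterinary Swine Care", "lcc": "SF", "lcsh": "Swine"},
        {"title": "Barn Architecture", "lcc": "NA", "lcsh": "Farm buildings"},
    ]

    def facet_counts(recs, facet):
        """How many records carry each value of the given facet."""
        counts = {}
        for r in recs:
            counts[r[facet]] = counts.get(r[facet], 0) + 1
        return counts

    def drill_down(recs, facet, value):
        """Narrow the result set to records matching the clicked facet value."""
        return [r for r in recs if r[facet] == value]

    print(facet_counts(records, "lcc"))          # {'SF': 2, 'HD': 1, 'NA': 1}
    print(drill_down(records, "lcsh", "Swine"))  # the two swine titles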

NCSU is Endeca's first library customer, incidentally. It's interesting to read who some of the other customers are: IBM, NASA, Boeing, Wal-Mart, Home Depot, the U.S. Defense Intelligence Agency, and Toys-R-Us.

The image of a lipstick-besmeared pig seems to have stuck in our collective mind's eye. At the first ever code4lib conference in Corvallis, Oregon, for example, Jim Robertson of New Jersey Institute of Technology (NJIT) discoursed on the subject: "Lipstick on a Pig: 7 Ways to Improve the Sex Life of your OPAC." He has been tweaking the Voyager implementation at NJIT to include book cover art, book reviews, live circulation usage history, recommendations (e.g., "others who borrowed this book, also borrowed ..."), RSS tables of contents for journals, live librarian support (i.e., integrated into the OPAC), and durable links (PURLs) to specific items.

Several other ideas for improving the OPAC came up at that code4lib conference as well, including (and this is just a sample): a recommendations-generating engine at the California Digital Library based on circulation statistics (Colleen Whitney); an open-source OPAC cobbled together out of Amazon's API, WordPress, COinS, and del.icio.us tags (Casey Bisson); and bookmarklets that pair paperback with hardcover ISBNs, effectively collapsing a difference that, to OPAC users, never really makes a difference (Jeffrey Young).

LC's Bibliographic Enrichment Advisory Team (BEAT), OCLC's Wiki-D, and Syndetic Solutions are likewise finding ways to enhance OPAC capabilities. Using different techniques they are adding features like tables of contents, new title RSS feeds, cover art, book reviews, user recommendations, sample passages of texts, live help, and social bookmarking. A shared goal is to exploit control numbers (e.g., ISBNs) and other unique identifiers to link pre-existing metadata from libraries, vendors, publishers, wikis, online databases, and even ordinary users. One speaker captured this in a motto: "Don't catalog; resolve."
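
The "resolve" idea rests on a small amount of identifier plumbing. As one illustration (a sketch of the general technique in Python, not any particular team's code), the snippet below normalizes an ISBN-10 to its ISBN-13 form so that records from different suppliers can be joined on a single key; the two metadata sources shown are invented.

    def isbn10_to_isbn13(isbn10):
        """Normalize an ISBN-10 to ISBN-13: prefix 978, recompute the check digit."""
        digits = "978" + isbn10.replace("-", "")[:9]
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
        return digits + str((10 - total % 10) % 10)

    # Invented metadata from two 'upstream' sources, keyed on different ISBN forms.
    library_record = {"isbn": "0-306-40615-2", "title": "Sample Monograph"}
    vendor_extras = {"9780306406157": {"cover_art": "http://example.org/cover.jpg"}}

    # Resolve both to one key, then enrich the catalog record with vendor data.
    key = isbn10_to_isbn13(library_record["isbn"])
    enriched = {**library_record, **vendor_extras.get(key, {})}
    print(enriched)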

RDA Forging Ahead

Development of the new code, Resource Description and Access (RDA), continues at a furious pace. It may not seem to be moving that fast from the outside--considering that planning for what was originally called "AACR3" began in 2004, and publication isn't expected until 2008--but for anyone involved in the deliberations, I can assure you that the workload is massive and the deadlines are palpable. One major change, recently announced by the Joint Steering Committee (JSC), is that the published manual will include two large sections rather than the three that were originally planned. Part A will include Resource Description (formerly Part I) and Relationships (formerly Part II), while Part B will include Access Point Control (formerly Part III). According to the JSC, "the integration of parts I and II into a single part will align RDA with the standards used in other resource description communities."

There have been other noteworthy RDA developments as well. On April 10, 2006 JSC announced a joint initiative with EDItEUR to harmonize the way RDA and ONIX (ONline Information eXchange) record metadata on form and content. According to the JSC announcement, "The objective is to develop a framework for categorizing resources in all media that will support the needs of both libraries and the publishing industry and will facilitate the transfer and use of resource description data across the two communities." For additional details, you can visit the JSC Website at http://www.collectionscanada.ca/jsc/rda.html.

Vendor Cataloging and Shelf-Ready Books

The question of outsourcing is never far from the thoughts of library administrators. Some embrace it more eagerly than others, but it's hard to deny that economies of scale are sometimes more easily obtained outside one's own institution. The Library of Congress signed a contract with its Italian book vendor Casalini Libri to catalog the approximately 4,000 Italian-language items LC purchases each year. LC librarians have trained Casalini staff to create core-level records and even authority records (aside from SARs, presumably) where necessary.

Other libraries are also taking advantage of vendor-supplied cataloging. "Shelf-ready" services—where the vendor not only maintains purchasing profiles and publisher relations, but also performs the cataloging and adds spine labels, bookplates, and magnetic security strips—seem to be growing in popularity as well. With shelf-ready services, all the purchasing library needs to do is load the records into its local database and shelve the corresponding volumes in the stacks. In theory, vendors should be able to catalog and process widely-held titles at lower cost than could a single library on its own. Like cooperative cataloging at its best, it makes economic sense to have the cataloging and physical processing done once and for all 'upstream' and then the finished product distributed to all the customers 'downstream'. There is another motive as well—perhaps more political than economic: with technical services facing stagnant or shrinking budgets, outsourcing makes it possible for vital services to be paid for out of collection development or other funds. This frees up technical services staff to do what it does best, namely, catalog uniquely held, foreign language, or otherwise unusual but important items (i.e., the 'long tail').

Grabbing Hold of the Long Tail

One of the reasons that cataloging (whether in-house or outsourced) is so expensive is that many complex headings take a long time to assemble and then are used only once. The last time I checked, for example, the heading "September 11 Terrorist Attacks, 2001 -- Moral and ethical aspects" occurred only once throughout WorldCat's 65 million records. The large number of such single-frequency items in a database (i.e., the outliers in a frequency distribution curve) is known as the "long tail" problem. In the retail book and music industries, the long tail has historically been lopped off, because cataloging and stocking items of interest to only one in a million customers was prohibitively expensive.
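
Measuring the tail is straightforward. Here is a minimal sketch (in Python, over an invented sample of records) that counts how often each subject heading occurs and reports the share of headings used exactly once: the long tail in miniature.

    from collections import Counter

    # Invented records, each carrying its assigned subject headings.
    records = [
        ["United States -- History"],
        ["United States -- History", "Presidents -- United States"],
        ["September 11 Terrorist Attacks, 2001 -- Moral and ethical aspects"],
    ]

    heading_counts = Counter(h for rec in records for h in rec)
    singletons = [h for h, n in heading_counts.items() if n == 1]

    # The fraction of distinct headings that were assembled for a single record.
    print(len(singletons) / len(heading_counts))  # 0.666... in this tiny sample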

The Internet has changed this completely. Due to the online aggregation of producers, consumers, and content, and the instantaneous networked global marketplace that now prevails on the Web, just-in-case stockpiles of inventory have given way to just-in-time production and delivery. In many cases there is no inventory. One of my favorite examples of this trend is cafepress.com. Any individual or business can get a cafepress.com account, upload a logo they've designed, and create the illusion that mugs, t-shirts, baseball caps, and refrigerator magnets (for example) emblazoned with that logo are stockpiled in a warehouse somewhere. In fact, trinkets are created on the fly each time a customer places an order, and then shipped straight from the factory.

Libraries are of course different from retailers (be they of the dot-com or bricks-and-mortar type). Librarians have long tried to follow Ranganathan's dictum that every book has its reader, i.e., even the most obscure, never-circulating titles at the thinnest end of the long tail. This is why Thomas Mann may cringe when he hears someone try to impose profit-oriented business models on research collections and services. In a research library, it is precisely the long tail of obscurity that, like biodiversity in a rainforest, holds the greatest promise for new discoveries. It is certainly expensive to catalog, house, and preserve such items, but, to the extent that it can afford to do so, the research community has always considered it worthwhile.

Lorcan Dempsey reflects on the long tail problem in his April 2006 article in D-Lib Magazine. Looking at the Google Five research libraries (i.e., Stanford, Harvard, Michigan, NYPL, and the Oxford Bodleian, all of which participate in the Google Book Search scanning program), OCLC research found that only 10% of an individual library's titles account for 90% of its circulation. This is the 'sweet spot' of the collection, if you will, i.e., the part that would turn a profit if one were to charge for checking out books. The other 90% of the collection accounts for the remaining 10% of circulation, and this more or less constitutes the long tail. In a similar vein, OCLC found that even though 60% of the aggregate Google Five collections are held by only one of the five libraries, ILL transactions account for only 4.7% of system-wide circulation. Both of these findings, I would suggest, simply confirm what is commonly known, namely that many library items rarely if ever circulate.

Dempsey makes the following inference, though, which strikes me as odd. He writes, "These numbers suggest that many items in a specific collection may be underused" [emphasis added]. In other words, he thinks we are not doing a good enough job exposing our readers to the long tail. It seems to me, though, that, while Ranganathan may be correct that every book has its reader, it may also be true that this reader only comes by to visit once every thousand years.

In other words, the uneven rates of circulation could simply reflect the vast extent of our collections and the selective interests of our researchers. And this brings up an important point: as long as libraries continue to acquire foreign language materials, ephemera, unpublished manuscripts, mixed-media collections, and so on, it is unlikely that professional catalogers will disappear from the scene. The metadata standards they follow might not be AACR and MARC, but catalogers with bibliographic, language, and subject expertise will continue to be in demand. The question that administrators need to ask, perhaps, is to what extent unique collections, e.g., the 60% of Google Five titles held by only one library, should be treated as a priority. The answer will to a large extent determine the number of information specialists that need to be kept on staff.

In any event, Dempsey's conclusions seem reasonable enough: libraries should do a better job of gathering intelligence (e.g., usage statistics and market research), reduce service fragmentation (a goal served, I presume, by the recently announced RLG-OCLC merger), improve cost recovery (reducing transaction costs and operational friction along the lines of PayPal), and pursue what is perhaps the single most useful thing OCLC has to offer the library profession: "new services that operate at the network level, above the level of individual libraries."

RLG to "Merge" with OCLC

RLG and OCLC have announced a merger, effective (assuming RLG members approve) July 1, 2006. Some observers have suggested that the deal looks more like an acquisition than a merger, given that RLG will become "RLG Programs" and report to OCLC Vice President for Research Lorcan Dempsey, but in any event it is clear that the two utilities had been converging in terms of service and were beginning to overlap. For example, once OCLC Connexion began to support the JACKPHY scripts (i.e., Japanese, Arabic, Chinese, Korean, Persian, Hebrew, and Yiddish) and then even Cyrillic, Tamil, and Thai, RLG lost one of its most distinctive competitive edges, namely its support for non-Roman script cataloging.

Now that the merger appears inevitable, OCLC is promising to find ways to emulate RLG's record clustering technique, which, after non-Roman script support, was probably the feature RLG catalogers were most concerned about losing. For those who don't know: record clustering allows original catalogers to preserve local modifications to their bibliographic records, even when 'duplicate' records for the same manifestation are already in the database. This is especially important for rare book catalogers, who need to capture copy-specific details (e.g., provenance, signatures, binding technique) in addition to those that describe the whole edition. Clustering also allows copy catalogers to identify specific libraries' records, and even specific catalogers' initials, before deciding which record to import locally.
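
As a rough illustration of the idea (a generic sketch in Python, not RLG's actual data model), the snippet below groups invented records for the same manifestation under one cluster key while keeping each institution's copy-specific record intact.

    from collections import defaultdict

    # Invented records: two institutions describe the same 1623 edition,
    # each preserving its own copy-specific notes.
    records = [
        {"cluster_key": "folio-1623", "inst": "YUS", "note": "Binding: calf, 17th c."},
        {"cluster_key": "folio-1623", "inst": "NIC", "note": "Provenance: Smith bequest"},
        {"cluster_key": "quarto-1608", "inst": "YUS", "note": ""},
    ]

    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["cluster_key"]].append(rec)

    # A copy cataloger can inspect every institution's version before importing one.
    for rec in clusters["folio-1623"]:
        print(rec["inst"], "-", rec["note"])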

Building an International Catalog

Given the 400 or so languages represented in our union catalogs, the adoption of the Unicode character-encoding standard has been a major leap forward in access. Not only do OCLC and RLG support Unicode scripts in bibliographic records, but LC has begun experimentally adding non-Roman scripts to authority records as well. Moreover, LC CPSO chief Barbara Tillett is promoting a "Virtual International Authority File" to link controlled vocabularies and multiple-script records in a unified virtual database.

ALCTS has appointed a Task Force on Non-English Access to determine ways to optimize the international use and exchange of bibliographic records. One expected outcome for the task force is a recommendation that the Digital Library Federation support font development. This is an important issue because, even though virtually all the world's scripts are encoded in the Unicode standard, there does not yet exist a single font that can render every character of every script as a graphical representation. So, for example, if the default font is Arial Unicode MS and a reader happens to be looking at a Burmese-script record, the Burmese characters will display incorrectly, since Microsoft never created "glyphs," or graphic representations, for that particular script. Alternatively, if the user switches to a dedicated Burmese Unicode font, the record data will display correctly, but the catalog interface will become unreadable.
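
The glyph-coverage problem is easy to test for. Here is a minimal sketch using Python's fontTools library (the font file path is a placeholder): it reads a font's character-to-glyph map and reports what fraction of the Burmese Unicode block (U+1000 through U+104F) the font can actually render.

    from fontTools.ttLib import TTFont  # pip install fonttools

    def block_coverage(font_path, start, end):
        """Fraction of the codepoint range [start, end] mapped to a glyph."""
        cmap = TTFont(font_path)["cmap"].getBestCmap()  # best Unicode cmap subtable
        block = range(start, end + 1)
        return sum(cp in cmap for cp in block) / len(block)

    # Placeholder path; the core Burmese block runs U+1000 through U+104F.
    print(block_coverage("SomeUnicodeFont.ttf", 0x1000, 0x104F))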

Building a Universal Library

Spurred on by Google's audacious Book Search initiative, in which the vast holdings of five research libraries are being digitally scanned and processed by optical character recognition (OCR) for full-text online retrieval, the old dream of a single library containing all the world's knowledge is being revived with enthusiasm. Comparisons are made with the 500,000 papyrus scrolls held (and eventually lost) at the ancient Library of Alexandria, perhaps the closest humanity has ever come to achieving this Promethean objective.

Kevin Kelly, "Senior Maverick" at Wired magazine, gave his take on the matter in his cover story for the May 14, 2006 New York Times Magazine. He estimates that 32 million books, 750 million articles and essays, 25 million songs, 500 million images, 500,000 movies, 3 million videos, TV shows, and short films, and 100 billion public Web pages have been 'published' over the course of human history. The amount of computer memory needed to store all this information is apparently 50 petabytes. "Today you need a building about the size of a small-town library to house 50 petabytes," Kelly points out. "With tomorrow's technology, it will all fit on your iPod."
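
A back-of-envelope check makes the figure plausible. The average item sizes below are my own assumptions, not Kelly's, but with any reasonable choices the total lands within an order of magnitude of his 50 petabytes, with video and the Web dominating.

    PB = 10**15  # one petabyte, in bytes

    # (count, assumed average size in bytes) -- the sizes are illustrative guesses.
    media = {
        "books":     (32e6,  1e6),   # ~1 MB of text per book
        "articles":  (750e6, 3e5),
        "songs":     (25e6,  5e6),
        "images":    (500e6, 1e6),
        "movies":    (5e5,   4e9),   # feature films dominate per item
        "videos":    (3e6,   1e9),
        "web pages": (100e9, 1e5),
    }

    total = sum(count * size for count, size in media.values())
    print(round(total / PB, 1), "petabytes")
    # about 16 PB with these guesses -- the same order as Kelly's 50 PB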

Kelly is particularly interested in the way our perception of books will change once they are enmeshed, Wikipedia-like, in a deep web of connection with other books and readers. In this changed world, he writes, "each word in each book is cross-linked, clustered, cited, extracted, indexed, analyzed, annotated, remixed, reassembled, and woven deeper into the culture than ever before. In the new world of books, every bit informs another; every page reads all the other pages."

Is it reasonable to expect that this dream will come true? Granted, it may be possible some day, technically speaking, to squeeze the world's combined cultural heritage into a 50-petabyte iPod. But potential economic, legal, and even aesthetic barriers should not be underestimated. Leaving aside graphic images, videos, sound recordings, and other media for the moment, and just looking at printed texts: imagine how much it would cost to develop OCR for all the world's scripts and typefaces. OCR has been developed (albeit imperfectly) for the most widely spoken languages, but many others (e.g., Burmese; see "Building an International Catalog" above) remain to be addressed. Furthermore, imagine indexing billions of digital objects in all their different encodings and formats, and then planning and executing data migrations to refresh the data and avoid format obsolescence.

Aside from the economics, there is the matter of negotiating licenses and fighting lawsuits. As MIT library director Ann Wolpert has pointed out (in "Google at the Gate," American Libraries, March 2005, p. 42), between 62% and 88% of the 30 million works that have been published and copyrighted in the U.S. since 1790 are still covered by copyright protection. That's as many as roughly 26 million texts that might be tied up. Google would need to negotiate licenses and/or pay royalties to each of these copyright holders in order to gain legal access. So even when every extant book has been digitized, there will still be millions that Google is forced to keep offline, except for sample pages and perhaps tables of contents. In the meantime, the Association of American Publishers (AAP) and the Authors Guild are already suing Google for intellectual property theft.

Perhaps the Open Content Alliance (OCA) will have better luck preventing panic among publishers and authors. As mentioned in my previous column, RLG decided to supply bibliographic records to the OCA-based Open Library Project, a non-commercial alternative to Google Print. In her May 28 open letter to the New York Times (responding to the Kelly piece), AAP president Pat Schroeder asserts "that Microsoft, Yahoo and others in the Open Content Alliance are also in the business of connecting words and ideas [i.e., in addition to Google] -- the difference is that they first obtain permission to copy works under copyright."

In any event, why dwell on mass digitization in a cataloging news column? Because ultra-fast scanners, optical character recognition, 50-petabyte memory devices, XML interoperability, smart search engines, and the like are precisely what pose the greatest challenges to cataloging as we know it. As Kelly points out, "When books are deeply linked, you'll be able to click on the title in any bibliography or any footnote and find the actual book referred to in the footnote." In this environment, the collocating power of the traditional catalog loses some of its luster. That is, unless the catalog assimilates some of those very same technologies.

People

100 1_ Avram, Henriette D., 1920-2006

Henriette D. Avram, the inventor of the Machine Readable Cataloging (MARC) metadata format, died on April 22, 2006. According to the Washington Post, Avram worked as a National Security Agency programmer before joining the Library of Congress in 1965. In 1968 she completed her design for the MARC format, which LC then implemented in 1970. Margalit Fox noted in the New York Times that, in order to figure out algorithms for converting 3x5 index cards into database records, "Mrs. Avram also had to enter the mind of the library cataloger, a profession whose arcane knowledge—involving deep philosophical questions about taxonomy, interconnectedness and the nature of similarity and difference—was guarded like a priestly ritual." Moreover, notes Fox, she "helped transform the gentle art of librarianship into the sleek new field of information science." Avram's children were impressed as well: I remember reading in American Libraries (October 1989, p. 855) that they referred to MARC affectionately as "Mother Avram's Remarkable Contribution."

Avram's work has been a pillar of cooperative cataloging and networked library systems for 35 years. Now, as we move from MARC toward an XML architecture, it is good to remember that we've been through this before, and that, with visionary leaders to guide us (and there are many), libraries will continue to flourish in the new environment.

New England Technical Services Librarians (NETSL) Award

This next piece is a shout-out to my friend and Yale colleague Matthew Beacom, winner of the 2006 NETSL award for Excellence in Technical Services. According to the official announcement, Beacom is recognized for his "deep and lasting contribution to the development of international cataloging standards." For those of you who've seen him in action as an advocate for excellence in cataloging (through JDC, CCDA, and elsewhere), the announcement should come as no surprise. Congratulations, Matthew!

Other News

Emerging Metadata Registry

The National Science Digital Library (NSDL) Metadata Initiative will allow agencies to register metadata schemas (element/property sets) and schemes (controlled vocabularies) in support of increased metadata interoperability. Based on the open-source Dublin Core Metadata Initiative (DCMI) registry application, the NSDL registry exploits emerging Semantic Web tools such as the Resource Description Framework (RDF) and the Simple Knowledge Organization System (SKOS). The mission of the registry is to facilitate controlled vocabularies, crosswalks, and interoperability among all participating projects and data providers. Items from beyond the NSDL and NSF, including orphan schemes and schemas (i.e., those still in use but lacking current institutional support), are to be included as well.
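
To give a flavor of what expressing a scheme in SKOS looks like, here is a minimal sketch using Python's rdflib library (which ships an SKOS namespace); the vocabulary, concept, and namespace URI are all invented for illustration and have nothing to do with the actual NSDL registry contents.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, SKOS

    EX = Namespace("http://example.org/vocab/")  # invented namespace

    g = Graph()
    g.add((EX.instruments, RDF.type, SKOS.ConceptScheme))

    # One concept in the (invented) controlled vocabulary.
    g.add((EX.seismometer, RDF.type, SKOS.Concept))
    g.add((EX.seismometer, SKOS.prefLabel, Literal("Seismometer", lang="en")))
    g.add((EX.seismometer, SKOS.inScheme, EX.instruments))

    print(g.serialize(format="turtle"))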

NISO

Here are some recent developments at the National Information Standards Organization (NISO): Pat Harris resigned as NISO executive director effective November 15, 2005, and Pat Stevens was appointed interim head. Roy Tennant published his report, "NISO Standards Development Process: Review & Recommendations," on December 15, 2005. The fourth edition of NISO Z39.19, now called Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, was published in January 2006. And NISO is sponsoring the "Standardized Usage Statistics Harvesting Initiative," or SUSHI, designed to support machine-to-machine exchange of resource usage statistics, particularly as tracked within Electronic Resource Management (ERM) systems.


 

