May 09, 2006

Open Computation

Coincidentally, a draft of an article by Cliff Lynch that started bouncing around the net yesterday points to some very intriguing ways of dealing with the problems so cogently identified by Mark D in his comments on my post yesterday. Lynch suggests that we are approaching the day when we can apply data mining and analysis techniques to large sets of articles in such a way that we can begin to automate the kinds of meta-analyses and knowledge discovery that are now so terribly labor-intensive.

I raised this possibility some months ago during one of the NLM Long-Range Planning sessions, but Lynch (of course) describes the opportunity far more cogently and in more detail than I could.  He deals directly with the conundrum that Mark D raises -- we are overwhelmed with articles and reports and seem to have less and less time to actually make sense of them.   Now we may have an opportunity to move past the focus on individual articles to developing systems that can do the kind of synthesis that we really need.

Lynch's article also addresses one of the questions that I've heard from publishers regarding the NIH public access program -- namely, why does NIH want to have all of these articles in a single repository?  Why isn't it sufficient to establish effective linkages to the publishers' sites?  Who cares where the articles actually reside? 

I started to get a glimmer of the answer to this during Elias Zerhouni's talk at the AAMC meeting last fall. If access to individual articles were the only issue, then linking would be all that we need. But Zerhouni is after something more than that. In an article published earlier this spring in Health Affairs, he points out that "we have no place where the integration of information can be used as a powerful hypothesis generator as well as a powerful way of understanding change." Lynch maintains that in order to start to develop systems that can do this kind of integration, centralized research databanks are more efficient than distributed ones.

This does not mean that a repository like PubMed Central should be the ONLY place where such articles reside.  Depending on the purposes, it is likely to be useful to have multiple such repositories, perhaps organized along discipline lines.  Lynch suggests that one of the benefits of more open access is that it will be easier to create such repositories, but that's clearly not essential, if publishers are willing to see the benefits of such repositories and rethink how they manage the control of the content that they own.

When I spoke to the Elsevier managers I said that a company that relies on selling individual articles will not survive this transition that we're now in.  Mark D's comments point to the same thing.  Open access aside, we can't make good use of all of the information that is at our fingertips now.  Developing the kinds of open computation tools that Lynch envisions will have a far greater impact on the development of real knowledge than the elimination of subscription barriers to individual articles ever could.

Comments

This vision of actually using the literature in new ways just isn't going to happen until Open Access becomes a reality. It's instructive to note that the Public Library of Science itself stemmed from the frustration of scientists who were trying to do text-mining of the research literature to make precisely the sorts of connections discussed above.

It didn't happen then because the publishers wouldn't play ball. I haven't seen much to demonstrate that they're significantly more enlightened now.

I'd like to concur with Scott that Mark D. is right: meta-analyses are not the holy grail for all of this. Nobody really knows where "open computation" will lead, which is both exciting and somewhat unsettling.

Ed Sperr also has a point, although perhaps to the chagrin of Mark D. Lynch spends a good deal of time talking about how "rights clearing" could be a major impediment to the computational future he envisions. But he may be overstating the case--GenBank has existed for many years now, on the premise that the raw data should be accessible regardless of how you feel about access to the final article.

One thing that caught my eye in Lynch's article is the notion of easier access to "negative data," which is usually not reported in traditional literature. Like everyone else, scientists want to put their best foot forward. But this often leaves out critical parts of the story. Grey literature sleuths can ferret out this negative data if it's really important, but usually it is just lost. It's exciting to think that this may be changing.

Marcus, I am chagrined. It is not that I oppose the idea of a comprehensive database. In fact I have supported and worked towards that objective for years. I served on the LOCKSS advisory board until I moved to Australia.

However I am very much opposed to an NIH comprehensive central database for many good reasons.

I fear that a single database run by a single government would make the academic community vulnerable by introducing a number of risks:

1) Political manipulation of the data - do any of you truly feel comfortable with the Bush Administration controlling the sum of our knowledge?
2) Data corruption. Databases degrade over time. A central repository is more vulnerable to data corruption.
3) The danger of natural and manmade disasters (earthquakes, floods, terrorist attacks...).

Books have one big advantage over electronic formats: books are not easily transported. Consequently, many libraries in many locations were required to purchase the same book. This simple fact led to a natural organic system that defended against the above-stated risks.

My concerns regarding the NIH center on the fact that its activities will allow the library community to forget its most important role, a role that has existed since the very first library. In my view, librarians are the guardians of knowledge. You guard knowledge by collecting and storing books and journals. I fear that if the NIH takes up this role, the library community at large will no longer feel the need to continue this important task.

I am against a central repository run by a single government entity. I am in favor of a knowledge net where the data is stored in many locations in many different countries, involves the participation of private and public institutions, and is controlled by a consortium of libraries with no single library in complete control. I would prefer that the 100 or so leading research institutions from around the globe band together and manage this system. I prefer that the NIH remain out of it completely. However, if they must be involved, I feel that they should be just one of many in charge of the program.
