First meeting of the RDA Interest Group – Long tail of Research data 17.9.13, Washington

It was standing room only: over 50 RDA delegates crammed into a breakout group for the first meeting of the ‘long-tail of research data’ interest group. Kathleen Shearer, COAR Executive Director and the coordinator of the group, moderated the two-hour session which, as she highlighted, was to concentrate on the plight of the datasets that had no natural forum for discussion in the other RDA groups. These datasets originate mostly in the institution, are often multi-disciplinary in nature and often go unnoticed. While they can be the remit of the library to manage, the data fall through the cracks in terms of structured disciplinary support. So what support do libraries need to tackle data of such heterogeneous nature? Where can we begin?

Wolfram Horstmann, from the Bodleian Library at Oxford, gave us a backdrop to ‘what is the long-tail?’. The sources he quoted[1] all pointed to the untapped potential value of smaller datasets – they can even be breeding grounds for new science and new discoveries. There might be big projects creating big volumes of data, but there are also countless projects creating smaller datasets that are equally vital to research. The task ahead of us now is to provide a place where those datasets can be preserved and made more discoverable. And that could be the library’s role as a data custodian.


Some salient points from the 2-hour varied and stimulating discussion are as follows:

  • There are different challenges in ‘big data’ management than in managing the long tail. For example, long-tail data can be more resource intensive at the level of the individual dataset: metadata, rights/access management, appraisal.
  • As a group we must be aware that some disciplinary repositories do indeed cater for the long-tail. But this set of disciplines is so far limited.
  • Librarians need disciplinary-specific backgrounds to be able to appropriately manage the data. Being involved at the start of the life-cycle is highly desirable.
  • Storage costs are a factor: large data sets can be too costly for institutional data repositories to manage.
  • What researchers want: Ease of deposit, visibility of their data, trustworthy institution
  • While the value proposition is not always clear, most data repositories are happy to support all researchers who are willing to contribute their data

What are the common characteristics of the data that fall through the cracks? How and when should we step in and intervene?


What we reached a consensus on:

  • Let’s be proactive! We don’t want data to go the way of scholarly literature. Let’s try to build infrastructures for data now and be active in terms of collecting it before it gets sold back to us at a price.
  • We can’t do it all! Data that underpins publications is one approach for selection and appraisal. We can’t aim at collecting and managing all data.
  • Lower the threshold for deposit! At the same time we should look at the ease of access to the data infrastructure and above all lower the barriers for deposit.
  • Learning from the others! We can learn a lot from discipline-based initiatives (e.g. discovery mechanisms, analytical tools, support).
  • Let’s find it! On a practical level, some delegates in the room simply wanted to be able to point users to appropriate datasets existing in other data repositories. So the metadata should be descriptive (at best) at dataset level to facilitate discovery.

So what’s next for the long-tail?

We need a bit of time to digest the points raised and work out which objectives are achievable. We also want to ensure that we don’t take on work that is already being done by others. Potential areas of further work are:

Advocacy, identifying costs and funding models, gathering evidence to better understand the long tail, facilitation of discovery, and collecting current practices and lessons learned.
We will soon have a list of work areas for managing the long tail of research data. As a coordinator of this group, I will be sure to keep you informed of our progress.

So – watch this space – follow our wiki and please get involved!

Kathleen Shearer, Najla Rettberg,  Birgit Schmidt


[1] Chris Anderson (2004)

Lorcan Dempsey (2006) applied this to libraries in a D-Lib piece: not just research data, but knowledge resources.

P. Bryan Heidorn (2008), ‘Shedding Light on the Dark Data in the Long Tail of Science’, Library Trends, Fall 2008: the long tail as applied to research data.