One provocation for Big Data

I’ve started thinking a lot about Big Data and what it could mean for museums at a time when, as Danah Boyd and Kate Crawford write, “The era of Big Data has begun.”

The two have put forward an excellent and provocative paper, titled Six Provocations for Big Data, about some of the weaknesses and problems associated with the use of Big Data. Chief amongst these is the idea that Big Data is changing the very way we research. They write:

Big Data not only refers to very large data sets and the tools and procedures used to manipulate and analyze them, but also to a computational turn in thought and research (Burkholder 1992). Just as Ford changed the way we made cars – and then transformed work itself – Big Data has emerged a system of knowledge that is already changing the objects of knowledge, while also having the power to inform how we understand human networks and community…

We would argue that Big Data creates a radical shift in how we think about research. Commenting on computational social science, Lazer et al argue that it offers ‘the capacity to collect and analyze data with an unprecedented breadth and depth and scale’ (2009, p. 722). But it is not just a matter of scale. Neither is enough to consider it in terms of proximity, or what Moretti (2007) refers to as distant or close analysis of texts. Rather, it is a profound change at the levels of epistemology and ethics. It reframes key questions about the constitution of knowledge, the processes of research, how we should engage with information, and the nature and the categorization of reality. Just as du Gay and Pryke note that ‘accounting tools…do not simply aid the measurement of economic activity, they shape the reality they measure’ (2002, pp. 12-13), so Big Data stakes out new terrains of objects, methods of knowing, and definitions of social life.

This is merely one of the fascinating propositions the two put forward, as they argue for a serious interrogation of the way Big Data will shape research, probing problems with both the nature of the data and the way it is used in analysis.

It is a very interesting paper, and one that discusses a very real issue that I think museums will increasingly have to confront in the coming years, vested as we are in “the nature and the categorization of reality.” Museum collection databases are a significant cultural resource – and a knowledge asset in their own right. To date, however, they have rarely been treated as such. Museum collection data is still generally considered secondary to the object itself as an interpretive tool. It merely supports the object.

However, as we move further into this new era – an era when data can be related, mined and aggregated with new fluidity, and when the value of data for knowledge production increases – museums need to address this issue. We need to think about the quality of our data, and how we want people to be able to access and use it. We need to ask who should manage and take care of our data, and what data should be included. If it has the potential to be as valuable to society as our objects (maybe even more so?), then surely it needs to be taken care of with the same level of priority.

In my recent post on whether museums should still be treating the physical space as the main one, Mia Ridge made the following comment:

And to play devil’s advocate… there are probably lots of people who can do more interesting things with museum content online than your average museum can currently manage. That might be because of resourcing or recruitment issues, a lack of imagination, because the organisation doesn’t know how to value or get excited about online content, whatever… but maybe if they’re not going to do digital well, then museums should just open up their data and let other people get on with creating the next wave of museums online.

This too raises interesting issues for museums about how best to make their data available for others to use, because effective data modelling is often complex. As Daniel W. Rasmus writes in his article Why Big Data Won’t Make You Smart, Rich, Or Pretty:

Combining models full of nuance and obscurity increases complexity. Organizations that plan complex uses of Big Data and the algorithms that analyze the data need to think about continuity and succession planning in order to maintain the accuracy and relevance of their models over time, and they need to be very cautious about the time it will take to integrate, and the value of results achieved, from data and models that border on the cryptic.

So, if Big Data is becoming increasingly important in research and the constitution of knowledge, and yet museums are not themselves necessarily likely to be the ones using it internally (assuming that our expertise lies elsewhere), how can we think about continuity and succession planning for our data, to ensure it is useful for other researchers? Is this something we can even achieve?

The Linked Open Data movement is obviously going to be a part of this, but I wonder how much further we need to go. Surely the notion of moving from object-based knowledge to knowledge that integrates Big Data starts to change the very core of how museums function as knowledge institutions? And if it does, what does that mean? Is it even possible for museums to tackle this without knowing what an anticipated end result might be? Or is this something that is too complex to be dealt with for all but a very few institutions (if any)? And if so, do we just withdraw from what some believe will be the fifth wave in the technology revolution?
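To make the Linked Open Data idea a little more concrete, here is a minimal sketch, in Python, of how a single collection record might be wrapped as a JSON-LD document. Everything here is hypothetical – the record fields, the identifier URI, and the vocabulary mappings are invented for illustration; a real project would map to a shared vocabulary such as CIDOC-CRM or schema.org, chosen deliberately.

```python
import json

def record_to_jsonld(record):
    """Wrap a plain collection record as a JSON-LD document.

    The @context maps local field names to shared vocabulary URIs
    (schema.org terms used here purely as an example), so that other
    systems can interpret the fields without knowing our database.
    """
    return {
        "@context": {
            "name": "http://schema.org/name",
            "dateCreated": "http://schema.org/dateCreated",
            "material": "http://schema.org/material",
        },
        "@id": record["uri"],  # a stable, dereferenceable identifier
        "@type": "http://schema.org/CreativeWork",
        "name": record["title"],
        "dateCreated": record["date"],
        "material": record["medium"],
    }

# A hypothetical collection record, as it might sit in a museum database
record = {
    "uri": "http://example.org/collection/1234",
    "title": "Ceremonial mask",
    "date": "c. 1880",
    "medium": "wood, pigment",
}

doc = record_to_jsonld(record)
print(json.dumps(doc, indent=2))
```

The point of the exercise is that once the record carries its own context, it stops being a row that only our own software understands and becomes something another researcher’s tools can link to and aggregate.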

This zippy little article shows what 100 million calls to 311 revealed about New York. What patterns could emerge if we could analyse information about our collections on such a scale? Would it become feasible to see both the trees and the forest of the museum collection – the objects, and the large-scale contexts in which they exist? Could utilising museum collections data in this way recomplexify museum objects and collections, adding new layers of meaning and reconnecting them back to the wider world of information?
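As a toy illustration of the kind of aggregation imagined here: the records below are invented, and a real analysis would run over millions of rows, but even a simple count by acquisition decade starts to show the forest rather than the trees.

```python
from collections import Counter

# Hypothetical collection records: (object type, year acquired).
# In practice these would come from a collection database export.
records = [
    ("ceramic", 1912), ("ceramic", 1913), ("textile", 1958),
    ("ceramic", 1911), ("photograph", 1959), ("photograph", 1961),
]

# Aggregate by (type, decade) to surface collecting patterns over time
by_decade = Counter((kind, (year // 10) * 10) for kind, year in records)

for (kind, decade), count in sorted(by_decade.items(), key=lambda kv: kv[0][1]):
    print(f"{decade}s: {count} {kind}(s)")
```

Trivial as it is, this is the shape of the question: not “what is this object?” but “what was this institution doing, decade by decade?” – a context no single record reveals.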

I have no answers here. These are still ideas in sketch, and there is much more to be discussed as my ideas evolve on this subject. But I think it is something we should be talking about.

9 thoughts on “One provocation for Big Data”

  1. The benefits of ‘big data’ probably only accrue to the biggest museums – the rest will get eaten. Aaron Straup Cope spoke of this future in his MW2011 paper.

    Also, as we’ve seen from the recent MegaUpload affair, where your data lives is immensely political and fragile.

  2. There is a long history in medical research of trying to use data that was not necessarily collected for the analysis at hand. The actual number of results that have any real value approaches zero. It is very hard to understand the variables of importance when they are coded through an imperfect interface, as the inversion will always be lossy and out of focus. Since most strong effects are known by mechanism or cause, we are talking about trying to pick up very subtle signals using very imperfect tools. I am not holding my breath that any of the world transformation is going to occur, save the ability of tech guys to fool the resource allocation class of big companies and academic institutions. Can we first walk and show feasibility before we change the fabric of the universe?

    1. This is a really good and important point. Any data is only ever as useful as its interpretation, and there are very real complicating issues here. David Weinberger, in his newly published Too Big To Know, writes (p. 39):

      The massive increase in the amount of information available makes it easier than ever for things to go wrong. We have so many facts at such ready disposal that they lose their ability to nail conclusions down, because there are always other facts supporting other interpretations.

      This is particularly problematic for museums, given that our job is one of interpretation. There is a nice section in the Cope paper that Seb linked to that seems relevant here.

      All organizations, sooner or later, struggle with the task of marshaling that oral tradition into a more rigid framework that aims to capture the essence of the history, but in a controlled and easily repeatable fashion. Paramount in many of these systems is the idea of complex search and database facilities to answer the multitude of questions that may exist.

      The problem with this scenario is that stories evolve and databases don’t (or when they do, not nearly fast enough). Rather than the systems adapting to the needs of the users, what ends up happening is a kind of intellectual body-modification in the service of the framework. This often leads to a perverse language of expertise geared towards the needs of a database that only a few people may have mastered, and without any of the underlying richness of the stories first told.

      Similarly, one of the comments that Rasmus makes in his article is about algorithms and a lack of theory (emphasis mine).

      It is not only algorithms that can go wrong when a theory proves incorrect or the assumptions underlying the algorithm change. There are places where no theory exists at any level of consensus to be meaningful. The impact of education (and the effectiveness of various approaches), how innovation works, or what triggers a fad are examples of behaviors for which little valid theory exists–it’s not that plenty of opinion about various approaches or models is lacking, but that a theory, in the scientific sense, is nonexistent. For Big Data that means a number of things, first and foremost, that if you don’t have a working theory, you probably don’t know what data you need to test any hypotheses you may posit. It also means that data scientists can’t create a model because no reliable underlying logic exists that can be encoded into a model.

      So of course we are talking about imperfect tools at this point (and maybe always will be). And maybe this isn’t something that museums should be concerned about. Maybe no world transformation will occur, and maybe the way knowledge is produced will not alter in a very real sense. But if it does, is it something that we should be thinking about? Of course, the answer might be no. But if it’s yes, what would that mean for museum business?

  3. This is a larger societal issue really. We’re all swimming, maybe drowning, in data. A big part of the problem is that we’ve been collecting data for a very long time, but there was never an easy way to see it all before. Now we’re starting to see it, for the first time really, and we’re seeing connections in things that we never noticed were connected before. We’re finding patterns where we never expected to find patterns before. We’re basically learning a whole hell of a lot about our world, our systems and ourselves in the process.

    Or, at least, we think we are. It turns out that reality is way more complicated than we really felt it to be, and we’re not really sure if we’re learning things or just drawing false conclusions through a sort of collective apophenia. I think that so far, we’ve only just learned to cope with truly big data rather than really understand it.

    Think of when writing was first invented. Someone looked at those cuneiform marks on clay tablets and scratched their head in wonder that anybody else could make sense of it. Someone must have been overwhelmed by the tallies of crop yields, the emergence of time-sensitive grain-backed currencies and the ability to communicate (one-way at least) with the dead. Someone surely must have been terrified by it all.

    Millennia later, literacy is a prerequisite for a successful existence and many people are fully literate in multiple natural and/or machine languages. Somebody invented this thing called a computer, and it’s really poorly named because it doesn’t just calculate like some glorified abacus. It’s a device for storing and executing procedural memory.

    Think about that. It’s like having a set of instructions for assembling an IKEA chair that follow themselves and assemble the chair for you. We’ve had devices that do things before, but they always did one thing, not a lot of different things that are only limited by the imagination (or time) of the programmer. Now we have a device that allows us to not only record any set of instructions, but execute them as well. And in some cases, they’re starting to program themselves (look up genetic algorithms).

    It used to be that we could only read the words of the dead, hear their recorded voices and see images of their faces. Now they can actually keep doing things through the procedural artifacts they’ve left behind. The OS you are using now was contributed to by at least one person who is no longer with us, but the instructions they left behind keep executing.

    This is bigger than writing and it’s far more confusing for us right now. Most people still haven’t really wrapped their heads around what is happening. It’s not about the data. The data has always been there. Much of the data has been collected before (and we have tons of old data, more than most people realize). The increased amount of data we now have is just one product of this revolution. What’s changed is the process.

    We can’t comprehend the data. We’re simply not capable of it. Our brains have not yet evolved to a point where we, as humans, can even try to do that successfully, even with advanced statistics and Big Computation helping us. Our knowledge, that which we actually possess, is physiologically limited right now. We’re already used to augmenting our knowledge with knowledge that’s stored somewhere else, in books, recordings and databases. But all of that was interpreted by a person at one point or another. What we have to get used to now is the idea of extending our knowledge with knowledge that has been interpreted outside of ourselves, outside of any self. And we have to devise a set of processes to filter, appraise and assess that externally (or prosthetically) created knowledge.

    What we should be doing is critically studying our processes, improving them and sharing them. We need to share them in this new procedural language (or set of languages really) that is available to us now. If we can figure out the processes, then we can set the machines to churning on the procedural code to comprehend the data for us.

    This is no small feat though. Really analyzing process is something that is relatively new for humanity. But it’s the processes that count. It’s not the data we should be thinking about. It’s the processes of our field. If we can crack that nut (and it’s a tough one), the data will take care of itself.
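The comment above mentions, in passing, programs that start to program themselves, pointing to genetic algorithms. A minimal sketch of the idea, with entirely toy parameters and an invented fitness function (count the 1s in a bit string), looks like this: candidate solutions are bred, mutated, and selected by fitness, so the “instructions” improve without anyone writing the improvement by hand.

```python
import random

random.seed(0)  # fixed seed so the toy run is repeatable

LENGTH, POP, GENERATIONS = 20, 30, 60

def fitness(bits):
    """Toy fitness: how many 1s the bit string contains."""
    return sum(bits)

def crossover(a, b):
    """Single-point crossover: splice a prefix of one parent onto the other."""
    cut = random.randrange(1, LENGTH)
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.02):
    """Flip each bit with a small probability."""
    return [1 - b if random.random() < rate else b for b in bits]

# Start from a random population of bit strings
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]

for _ in range(GENERATIONS):
    # Selection: keep the fitter half as parents (unchanged)
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]
    # Breed replacements for the weaker half
    children = [
        mutate(crossover(random.choice(parents), random.choice(parents)))
        for _ in range(POP - len(parents))
    ]
    population = parents + children

best = max(population, key=fitness)
print("best fitness:", fitness(best), "out of", LENGTH)
```

Nothing in the loop knows what a “good” bit string looks like beyond the fitness score, yet the population climbs steadily toward all 1s; that indirectness is both the appeal and, as the comments above suggest, the interpretive hazard.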
