How I Work Open: Ben Marwick

Ben Marwick
Associate Professor

How would you describe your research?

I do Paleolithic archaeology, mainly in Southeast Asia. This includes looking at the movement of modern humans into that area and how they contributed to later populations. I’m also interested in how they adapted to environmental problems – for example, human behavioral ecology and how they used technology to adapt to changes in the landscape.

What kind of open work do you do?

While part of my archaeology work consists of traditional activities like excavation, fieldwork, and surveying, another part is computer-based. Much of my effort is focused on making the computational work done in the lab open and transparent, because in the field what we’re doing is automatically open. Anyone can come by and see what we’re doing. But in the lab, it’s much harder to engage the public and colleagues in that work.

So I focus my efforts in three areas. The first is sharing the data – things like measurements we collect with instruments, or observations we make. The second area is making my code and methods open. For me, it’s always possible to have open code, because I’m not going to patent or copyright anything. But open data is not always something I can achieve. I was just working with an Aboriginal community who asked me not to share the data I gathered. It contained the locations of a bunch of archaeological sites, and they were concerned that those sites would be vulnerable to theft. It would have been unethical to violate the agreement I have with the community and say I don’t care about their concerns. But I make my data open whenever possible.

The third area of focus is ensuring my publications are freely accessible. This can be achieved in two ways. For some researchers a good option is to publish in open access journals like PLOS ONE. But in archaeology there are incredibly expensive fees for making your work open in a hybrid model. And I still want to publish in high-impact places and journals that are associated with quality work, without being limited to a small number of open access journals. To solve this problem, I publish in whichever journal I like, but I also put a preprint in an open access repository. I primarily use the non-commercial service SocArXiv, which also gives me a DOI. I feel like I’ve fulfilled my commitment to open access and made my work freely available, but I save myself the $3,000 per paper. I feel like this is a sustainable and practical solution, and I encourage my students and collaborators to do the same.

What are the methods and tools you are using to make your work open?

We make a point of sharing the rawest data feasible in open, trustworthy data repositories – places like Figshare, Zenodo, the Open Science Framework, and UW ResearchWorks. We make sure to get a DOI for the data, so that it’s easy to refer to in the papers we write. We primarily use those repositories because they’re free, but there are also two archaeology-specific options: the repository tDAR, and a data publisher called Open Context. As an archaeologist I think in terms of long timeframes: what does it mean for data to be open in 100 years? In 1000? So I usually try to present the data in the simplest format possible. Even if I collect it in Excel, I’ll deposit it as a CSV file because I know that format will be usable long into the future.
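As a rough illustration of the "simplest format possible" idea, here is a minimal sketch of writing tabular records to plain CSV text and reading them back to confirm nothing is lost. It is written in Python purely for illustration (the work described in this interview is done in R), and the records, column names, and helper functions are hypothetical, not taken from any real data set.

```python
import csv
import io

# Hypothetical artifact measurements; the column names are illustrative only.
records = [
    {"artifact_id": "A001", "length_mm": 42.5, "material": "quartzite"},
    {"artifact_id": "A002", "length_mm": 17.0, "material": "chert"},
]


def to_csv_text(rows):
    """Serialize a list of dicts to CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv_text(text):
    """Parse CSV text back into a list of dicts (all values come back as strings)."""
    return list(csv.DictReader(io.StringIO(text)))


csv_text = to_csv_text(records)
round_trip = from_csv_text(csv_text)
print(csv_text)
```

Because CSV stores everything as text, numeric columns come back as strings after the round trip, which is one reason a deposit usually benefits from a short description of each column alongside the file.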

For code, I use R, the programming language. I made a big commitment to it early on in thinking about how to become open, because I saw that the principles and methods of R were optimized around openness. Because of my work with R, I was recently invited to Berlin to lead a summer session on reproducibility in archaeology, where we developed a package researchers can use to streamline the process of working reproducibly. rrtools is a package of reproducible research tools that captures a bunch of the key principles of working reproducibly. It compresses what may have been a day’s worth of fussing and setup into two or three minutes, so we can focus on analysis and writing.

I also pay attention to my licensing. When I put preprints on SocArXiv, I use Creative Commons Attribution (CC-BY). I use the MIT license for code, and CC0 for data. In rrtools, licenses are baked in – it includes a template for a readme.txt file that declares those licenses, which you can then change. This helps to ensure I get credit for my work, and makes it clear to other researchers that they can reuse my materials.

How did you get started working openly?

As archaeologists we study artifacts, and so I’m interested in the first artifacts of open science, which we see during the Enlightenment. Robert Boyle’s vacuum pump is a great example: he made these pumps, but then he also wrote an exquisite account of how someone else could make one, and how to use one in such a way that they could get the same results as he did with his experiments. I thought that’s how science should be – specifying in enough detail so that anyone could reproduce what you’ve done. Boyle wrote it in longhand, but the modern equivalent is to script your process in a programming language that shows all the decisions you’ve made, from gathering the raw data to your final analysis.

For my PhD thesis I was working with a limited amount of data, so I decided to try to apply statistics and quantitative methods to make things more interesting. I was using S and S-PLUS, a commercial product, and I thought: how can I do this more cheaply? At that point R didn’t have the wider community and lacked many of the functions it has now. I downloaded it, but I couldn’t make sense of it – I needed a lower entry point. I tried again about once a year until RStudio came out, and then I could make sense of it. I also had to wait for R and Stack Overflow to develop so that most of my questions were answered. Those developments helped me to become a more efficient R user.

There was one major paper where I made a big commitment, as a personal experiment, to invest the time to learn how to make all my work open – code, data, and text – with entirely executable manuscripts. It worked out quite well and I wrote commentary and papers about what I’d done. But that first experience tripled the amount of time I spent on the paper. I thought to myself that I have to make this much more efficient if I’m going to do it and expect my colleagues to do it. I wanted it to be sustainable, and our rrtools package is a big step forward in that direction.

What barriers have you faced in trying to work openly?

Most archaeologists don’t see the open and reproducible approach as important. Or at least, appreciation of concerns about openness is not standard across the discipline, and participation is uneven and patchy. There are small groups of true believers, but among my peers and more senior colleagues, there’s not much interest. Since I got tenure I’m less vulnerable to this lack of awareness, and feel like my efforts around openness have been vindicated. I try to integrate those values into my work as a peer reviewer and grant reviewer, and ask for things like code or data for others’ work. But if the editor is unaware of these concerns, they think I’m being too demanding.

The time it takes to share and make one’s work open can also be a barrier, but for me at this point it doesn’t take more time because I’ve learned to be more efficient. I try to teach my students, grad students, and peers, but they can be reluctant because the culture of science is very resistant to change.

Why do you think open scholarship is important?

Archeologists dig up stuff that isn’t ours personally – the artifacts belong to a group of humans or a cultural group. So we have the patrimony of various Indigenous and local groups to consider, and part of our duty is to make it accessible to those groups and to humanity generally. What people know about their past affects their sense of purpose and how they perceive their identity. So a core part of working openly is to find out something about the past and communicate it, so that people have more context for making decisions about who they are and information that might inform how they behave in the present and future.

Because I’m sharing my data freely, my international colleagues know they can always work on it themselves, teach with it, or share it as a resource. In other situations, local collaborators may never see the researcher’s data sets. Working in a community that values openness helps reduce some of the inequality between the West and the Global South and East. In communities that have far fewer resources, many researchers use pirated tools, or tools not in their language. But open source software is more accessible – they can be sure to use the latest version, and they have communities in their own language and can support each other.

Reproducibility and openness are the original cornerstones of science. Researchers making claims ought to make them as transparently and openly as possible. This enables others to decide for themselves if the claims are reliable. Scripts are a narrative of our scientific work; they are reproducible, transparent, and make claims accessible to others. Science is indistinguishable from religion otherwise – you have no way of objectively checking the claims.

What opportunities does working openly offer that traditional scholarship does not?

For me, being on the vanguard in terms of reproducibility has career benefits – citations, invitations to interesting meetings and workshops. People ask my advice and I have some influence on the direction of the discipline. This will fade eventually as being open becomes more normal, but what will persist are the broader advantages of efficiencies, synthesis, and new types of analyses being possible. Many of us have this vision of a global/regional synthesis of data. Ecologists already do this. Many of us are looking forward to the time when all data sets can be connected, and questions can be answered that now take years to work through.

It lets you sleep better at night when your analysis is documented, and you don’t wake up asking ‘how did I do this, did I do it right? Was this a bad choice?’ If you have a concern you can go back to the code and rerun it, and decide if it was a good choice or not. We’ve seen the damage irreproducible claims can do to the public credibility of some other disciplines, such as social psychology. Unless a discipline adopts the principles of openness, it’s at risk of this kind of credibility crisis. The tripod supporting a scientific claim needs to be openly accessible: data, methods, and publications. I hope in the future we’ll be at the point where people will look back on this as a dark age when a claim was judged only by the contents of a journal article.

