Reproducibility has always been an important component of good research, and more and more journals are now requesting the full code used for data handling and analysis to accompany manuscripts. But the big question is: how reproducible is our code, really? How long does a script we’ve spent hours working on last? Does code have a “shelf life”?
To start to answer these questions, on the 9th of June 2021 Dr Thomas Cornulier, a research fellow in applied statistics, and the Aberdeen Study Group, a PGR-led group for skill-sharing and coding, hosted an open-access hackathon: Reprothon 2021. The aim was to collect data on the reproducibility of R code available on one of the biggest coding and programming forums, Stack Overflow.
Hundreds of thousands of posts from 2008 to March 2021 went through a filtering and structured sampling process, ready for all to access and start testing verified (i.e., it worked at the time) R code. The event featured 23 amazing contributors based in Aberdeen, Edinburgh, Dundee, Reading, Austria and France. We all worked through code used in anything from plotting bar charts to more complex time-series analyses. Together, in 1.5 hours, we tested 134 pieces of code.
Credit to everyone who contributed on the day (not all pictured): Alexandra Jebb, Annette Raffan, Auriel Sumner-Hempel, Aurore Ponchon, Camilla Negri, David Fisher, Eilidh Fummey, Heather Ritchie-Parker, Hongjie Zhao, Katherine August, Laura Mackenzie, Lucy Henshall, Marcela Espinaze, Maria Kamouyiaros, Max Tschol, Rosie Baillie, Sania Wadud, Susan Kenyon, Tamsin Woodman, Thomas Cornulier, Virginia Iorio, Yanlin Liu, Zhibin Wen
Luckily for all of us with a vault of “dusty” R scripts, the majority of them passed, but almost 12% still failed despite having been verified, highlighting that code really doesn’t last forever (it’s not just my bad scripting skills). So, what can we do to help minimize this? Here’s a short list of some good habits to have that I put together from the discussions on the day:
- Always note down what versions you’re using
- Include reproducible example data (even if it’s just a subset) with example command line output
- Call in your values rather than writing them out manually in your code; this makes it transferable between datasets and projects
- Annotate your code (for real this time)
- Use relative file paths
- Refer to your packages for your functions using “::” (“package_name::function”)
- If you’re generating random values, make sure you set a seed (a specified starting value) when you can!
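To make a few of these habits concrete, here is a minimal sketch in base R. It is only an illustration, not code from the Reprothon itself: the seed value, the column names and the `output/example_data.csv` path are all made up for the example.

```r
# Note down what versions you're using: the R version string, plus
# sessionInfo(), which also lists the versions of loaded packages.
r_version <- R.version.string
session <- utils::sessionInfo()

# Set a seed so randomly generated values are the same on every run.
# (2021 is an arbitrary choice for this example.)
set.seed(2021)

# A small reproducible example dataset (hypothetical column names).
# Functions are called as package_name::function so it is clear where
# each one comes from.
example_data <- data.frame(
  group = rep(c("a", "b"), each = 5),
  value = stats::rnorm(10)
)

# Call in your values from the data instead of typing them out by hand.
n_rows <- nrow(example_data)
group_means <- stats::aggregate(value ~ group, data = example_data, FUN = mean)

# Use a relative file path, built with file.path(), rather than an
# absolute path that only exists on one computer.
out_file <- file.path("output", "example_data.csv")  # hypothetical location
dir.create(dirname(out_file), showWarnings = FALSE)
utils::write.csv(example_data, out_file, row.names = FALSE)

# Read the example data back in, as someone else running the script would.
reloaded <- utils::read.csv(out_file)
```

Printing `session` (or saving it to a text file alongside the script) gives anyone re-running the code later a record of what it was originally run with.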
It is impressive how many posts were tested within such a small timeframe, and there’s still more that this data can offer (are there specific packages or analyses that “age” faster?). For me, however, this hackathon was a perfect demonstration of how locally organised, online events make for an open and accessible way to discuss, share ideas and collaborate with people across institutions; people you’d probably never get a chance to meet in person, all working towards a common goal.
If you missed the live event, the Reprothon is still ongoing! All information and data are fully available on the Aberdeen Study Group website. So, if you are interested in joining the next live event on the 11th of May 2022, or in contributing on your own time and keeping the data going with a group of friends, colleagues, relatives and/or pets, it’s all available for you!