For many years, users at institutions subscribing to Google’s suite of tools enjoyed effectively unlimited data storage within Google Drive. Seriously, there was a time when the only hard limit was 1 terabyte(!) per file. For researchers needing to store large amounts of data, this was a wondrous benefit. Data were even encrypted both at rest and in transit, so storage was relatively secure.
Recently, Google changed their terms, and the salad days of unlimited storage are, sadly, behind us. For some researchers, this may be of little consequence, especially if they have small amounts of data or their data are, e.g., mostly text-based. Many researchers, however, rely on larger formats, such as photos and videos, and the output from various scientific workflows can be enormous. Most funding agencies also mandate keeping all data necessary to recreate entire workflows and making those data reasonably accessible, sometimes for five or more years, so the loss of functionally unlimited storage can be a real challenge.
Furthermore, all major funding agencies require a data management plan, which forces researchers to consider every aspect of their data: inputs, outputs, storage, sharing, and so on. As such, it is imperative to plan for data needs long before research begins.
While a complete discussion of alternatives is impossible, since each project often has unique requirements, here are some thoughts:
- Researchers using Swarthmore’s HPC cluster, or related national supercomputing resources such as ACCESS systems, may be able to store data on Strelka. Currently we have several hundred terabytes available, and within several months we hope to offer close to 1.2 petabytes. While not intended as permanent, long-term “cold” storage, it can be used for ongoing projects over their reasonable lifetime. Plus, we hope soon to implement Globus as a data transfer service, making data ingress/egress much easier (a sketch of what a Globus transfer can look like appears after this list).
- For longer-term storage, or for projects needing a more customized solution, cloud providers (such as AWS) or even on-premises servers may be options (the second sketch after this list illustrates the cloud route).
- Many disciplines maintain research data repositories well suited to the needs of their communities; these often offer a compelling way to make datasets accessible once a project is complete.
- Finally, Google Drive does remain a potential option for research that does not require a significant amount of space.
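To give a flavor of what a Globus transfer can look like once the service is in place, here is a minimal sketch using the globus-sdk Python package. This is not an official Swarthmore workflow; the client ID, endpoint UUIDs, and paths are hypothetical placeholders you would replace with your own.

```python
# A minimal sketch of a Globus transfer using the globus-sdk Python package.
# The client ID, endpoint UUIDs, and paths below are placeholders, not real values.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"      # registered at developers.globus.org
SRC_ENDPOINT = "SOURCE-ENDPOINT-UUID"        # e.g., a personal or campus endpoint
DST_ENDPOINT = "DESTINATION-ENDPOINT-UUID"   # e.g., a hypothetical Strelka endpoint

# Interactive login: visit the printed URL, authenticate, and paste the code back.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Auth code: ").strip())
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

# Build and submit a recursive directory transfer.
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)
tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT, label="Project data")
tdata.add_item("/local/project_data/", "/strelka/project_data/", recursive=True)
task = tc.submit_transfer(tdata)
print("Submitted transfer; task ID:", task["task_id"])
```

Globus then handles retries, integrity checking, and notification when the transfer completes, which is much of why it is attractive for moving large datasets.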
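Similarly, for the cloud-provider route, here is a minimal sketch using the boto3 Python package that pushes an archive into one of S3’s low-cost “cold” storage tiers. It assumes AWS credentials are already configured; the bucket, key, and local file names are hypothetical.

```python
# A minimal sketch of archiving a file to AWS S3 using boto3.
# The bucket name, key, and local path are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are already configured

s3.upload_file(
    Filename="results/simulation_output.tar.gz",    # local file to archive
    Bucket="my-lab-research-archive",               # an existing S3 bucket
    Key="project-x/2024/simulation_output.tar.gz",  # object path in the bucket
    # DEEP_ARCHIVE is S3's lowest-cost tier, intended for data that must be
    # retained (e.g., for a funder's retention period) but is rarely retrieved.
    ExtraArgs={"StorageClass": "DEEP_ARCHIVE"},
)
```

Keep in mind that retrieval from archival tiers is slow and incurs fees, so the choice of storage class should follow from your data management plan, not cost alone.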
Given potential challenges related to storage space, backups, security, access requirements, sharing, and so forth, planning how to manage data throughout all aspects of a research project cannot be an afterthought. Swarthmore has wonderful resources to assist, within ITS, the Libraries, Sponsored Programs, the Office of Research Integrity and Engagement, the Provost’s Office, and more. For further assistance, please reach out to Jason Simms (jsimms1@swarthmore.edu), who can help to coordinate any resources needed to manage your research data.
And, stay tuned to this space! The topic of managing research data, and a discussion of helpful resources, is much too large for one post. Over the coming months, I will be posting additional information about various facets of this critical challenge.