Making Mission Data Available in the Cloud

A. Smith (arfon[at]stsci.edu), I. Momcheva, and J. Peek

Over the past 18 months, the Data Management Division (home to the engineering and science teams at MAST) and the Data Science Mission Office have been working to give the astronomical community access to mission data in the cloud.

As part of the AWS Public Dataset Program, data from Hubble (ACS, COS, STIS, WFC3, & WFPC2), TESS (calibrated and uncalibrated full frame images, two-minute cadence target pixel and light curve files), and Kepler (light curves, target pixel files and full frame images) are now available next to the vast cloud computing resources of Amazon Web Services.

Foundational steps in enabling data science with astronomical data

Many of the most promising data science techniques, such as Deep Learning, require simultaneous access to large quantities of data, large amounts of compute (both CPU and GPU), and programmatic interfaces (APIs) for interacting with the data. Our work over the past 18 months has been about addressing all three of these points:

Make the data 'highly available' and place it next to substantial computational resources: By staging public data for Hubble, TESS, and Kepler in the cloud, we're making it possible for anyone in the community to access hundreds of terabytes of mission data in a high-performance computational environment. Previously, astronomers wishing to analyze large volumes of mission data had to download the data and find somewhere with sufficient storage and compute to work on it. Now any astronomer can rent a supercomputer by the hour (https://medium.com/descarteslabs-team/thunder-from-the-cloud-40-000-cores-running-in-concert-on-aws-bf1610679978) for their science.

Make sure there's a good programmatic interface for accessing the data: Bulk access to data is only useful if you can script your analyses. The team at MAST has developed an open-source module as part of the Astroquery project which is 'cloud aware' and can serve data from the cloud on demand (https://astroquery.readthedocs.io/en/latest/mast/mast.html).

Make it possible for people to secure resources to work with the data: While cloud computing makes it possible to access a variety of computational resources in a flexible, pay-as-you-go model, these services cost real money. That's why, as of Cycle 26, there is a new category of HST archival proposal, "Legacy Archival Cloud Computation Studies," designed to support astronomers wishing to use the Hubble data in the cloud.
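
To make the second point concrete, here is a minimal sketch of how the 'cloud aware' Astroquery module might be used to find where HST products live on AWS. The target name, instrument, and product limit are illustrative assumptions, and the helper names find_cloud_uris and split_s3_uri are hypothetical; only the Observations calls come from Astroquery itself. Check the Astroquery documentation linked above for current behavior before relying on it.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) pair."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def find_cloud_uris(objectname="M83", instrument="ACS/WFC", max_products=5):
    """Return S3 URIs for a few HST science products (needs network access).

    The target, instrument, and product limit are illustrative defaults.
    astroquery is imported lazily so split_s3_uri works without it.
    """
    from astroquery.mast import Observations

    # Ask astroquery to resolve products to their AWS copies.
    Observations.enable_cloud_dataset(provider="AWS")

    obs = Observations.query_criteria(
        objectname=objectname,
        obs_collection="HST",
        instrument_name=instrument,
    )
    if len(obs) == 0:
        return []

    # Look only at the first matching observation to keep the query small.
    products = Observations.get_product_list(obs[:1])
    science = Observations.filter_products(products, productType="SCIENCE")

    # Entries can be None when a product has no cloud copy; drop those.
    uris = Observations.get_cloud_uris(science[:max_products])
    return [u for u in uris if u]


# Example (requires network access and the astroquery package):
# for uri in find_cloud_uris():
#     print(split_s3_uri(uri))
```

The (bucket, key) pairs can then be handed to whichever S3 client you prefer, ideally from a machine running in the same AWS region as the data so that nothing has to leave the cloud.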

Figure 1: Volume of HST data accessed per month after launching the HST public dataset on Amazon Web Services.

Wrapping up

Whether you’re looking to process large volumes of mission data, train a deep learning model to analyze Hubble images, or hunt for exoplanets in TESS and Kepler data, we think that making these datasets available in the cloud is a first step toward new, more sophisticated analyses of archival data.

If you have any feedback on this new initiative, please email archive@stsci.edu; we'd love to hear from you!