CADC The Canadian Astronomy Data Centre
Herzberg Institute of Astrophysics

The Hubble Space Telescope Archive at CADC

The CADC's HST cache system

HST Cache

The CADC started the Hubble Space Telescope Project in March 1986. The center became operational in mid-1987, using only standard catalogue data because the launch of the Hubble telescope had been delayed.

Early on, to distinguish itself from the main archive center, the CADC started implementing important value-added features to complement what was already available.

One of the CADC's important contributions was the first implementation of the concept of "recalibration-on-the-fly" (hereafter OTFR/OTFC). In addition to saving precious storage space at the data provider's end, it meant that the user received the desired science images calibrated with the latest software and the latest calibration files.

Although very useful, the OTFR/OTFC system suffers from one major problem: time. The recalibration process is quite time-consuming. For individual requests this is not a big issue, since users can easily wait a few hours to get their recalibrated data. The limitation becomes a serious problem when one tries to use a large set of images for a data-mining project: any process that accesses a large number of images has to wait a considerable amount of time.

To ease these large operations, and to serve the Virtual Observatory in a more timely manner, the CADC has started creating an HST science cache in which ALL files are saved. From the raw images to the drizzled ACS associations, all HST images are directly and immediately available. This system is called the HST cache.

When requesting data through the regular CADC HST interface, users now receive their desired images on their computer immediately, served from the CADC's HST cache.

What is the HST Cache?

The cache is an envelope around HST archive file production. It is a set of database tables and software agents that ensure that all science pipeline products are preprocessed and readily available from local storage at all times. This includes mechanisms to discover newly observed datasets and insert them, and to automatically reprocess datasets that benefit from updates to reference files, available meta-data, and general processing software upgrades.

Why do we need a cache?

Since 2002, all data from active instruments has been produced from scratch, triggered by user requests. The reasoning behind the On The Fly Reprocessing (OTFR) and On The Fly Calibration (OTFC) pipelines was that they would guarantee that archive users always got their data equipped with the newest set of meta-data and calibrated according to the best methods available. This was a clear advantage over the previous system, in which the raw data was produced centrally at the STScI and delivered to the partner sites, essentially freezing that data in time. Another advantage was that the system conserved storage space, as only the Hubble Space Telescope telemetry files and a few smaller auxiliary files needed to be stored, an important consideration when data was stored on optical disks in jukeboxes.

With the advent of cheap mass storage in the form of hard-disk arrays, this aspect became less important, and a number of drawbacks of the on-the-fly paradigm became apparent over time. Live processing of data requires that support be available at all times to resolve errors and bugs in the pipeline, an inevitable task for a system this complex with such a heterogeneous set of input data. Another drawback is processing speed: producing a dataset can take from several minutes to hours, which might not be an issue for the patient astronomer but makes it impossible to expose the data through synchronous VO protocols. Next-level efforts such as data mining, meta-data harvesting, and the production of high-level data products are also enormously difficult in the on-the-fly world.

The advantages of the HST Cache are:

  • Faster access speed
  • Shields users from processing errors
  • Direct programmatic & VO protocol access to the data
  • Makes the archive less prone to overall system breakdowns
  • Allows site interoperability and redundancy
  • Less maintenance in the long run
  • Allows harvesting of meta-data and data-mining

Programmatic access to the data

Users can now request data from the CADC directly from their local computer, using either the CADC's programmatic interface (proxy) to its data collection or the new CADC interface. One small hurdle is that the user MUST know the filename associated with a given scientific image. The CADC is trying to simplify those filenames, and a list of the most important ones is available here.

It is important to mention that all HST data is stored within the HSTCA archive while all the HLA files are in the HLADR2 archive.

As an example, suppose you want to get the final drizzled ACS image of the observation J8MJ95080:

http://www.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/getData?archive=HSTCA&file_id=J8MJ95080_DRZ
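The same request can be built from a script. The following is a minimal Python sketch, assuming the getData endpoint accepts the archive and file_id query parameters exactly as they appear in the example URL. The helper names build_getdata_url and download are hypothetical and are not part of any CADC software.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Base URL taken from the example request above.
GETDATA_URL = "http://www.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/getData"


def build_getdata_url(file_id, archive="HSTCA"):
    """Return the getData URL for one file in the given archive.

    Use archive="HSTCA" for HST data and archive="HLADR2" for HLA files.
    """
    query = urlencode({"archive": archive, "file_id": file_id})
    return f"{GETDATA_URL}?{query}"


def download(file_id, destination, archive="HSTCA"):
    """Fetch one file and save it locally (hypothetical helper, needs network access)."""
    urlretrieve(build_getdata_url(file_id, archive), destination)
    return destination


# Rebuild the URL of the drizzled ACS example without contacting the server.
print(build_getdata_url("J8MJ95080_DRZ"))
```

Calling download("J8MJ95080_DRZ", "J8MJ95080_DRZ.fits") would then save the file locally; the local filename and its extension are an assumption here, since the format returned by the server is not stated above.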

One hidden problem behind the cache concept is the refresh rate. The CADC still believes that the OTFR/OTFC process is the best way to ensure the best science data, so the images in the cache will have to be refreshed quite often, using a clever trigger mechanism. Establishing those triggers is quite a challenge because the HST calibration system is rather complex. To know when a given image has to be refreshed, one has to take into account software releases, calibration file versions, and so on. It is fair to say that, very often, after a few years the software and calibration sequence of a given instrument become quite stable, which should reduce the refresh rate of the cache.
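To make the idea of a refresh trigger concrete, here is a small Python sketch. It is purely illustrative and does not describe the CADC's actual mechanism: the ProcessingState record, its fields, and the refresh_reasons function are hypothetical, and a real trigger would have to consider more inputs than the two shown.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingState:
    """Inputs that determine how a dataset was, or would now be, calibrated (hypothetical)."""
    software_version: str
    reference_files: dict = field(default_factory=dict)  # file name -> version


def refresh_reasons(cached, current):
    """Return the reasons a cached dataset is stale; an empty list means it is up to date."""
    reasons = []
    if cached.software_version != current.software_version:
        reasons.append(
            f"software {cached.software_version} -> {current.software_version}"
        )
    # Compare every reference file the current pipeline would use.
    for name, version in current.reference_files.items():
        old = cached.reference_files.get(name)
        if old != version:
            reasons.append(f"reference file {name}: {old} -> {version}")
    return reasons


# Example with made-up version labels: only the bias reference file has changed.
cached = ProcessingState("1.0", {"bias": "a", "flat": "c"})
current = ProcessingState("1.0", {"bias": "b", "flat": "c"})
print(refresh_reasons(cached, current))
```

A dataset would be queued for reprocessing only when the returned list is non-empty, so a stable instrument pipeline naturally produces few refreshes.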
