The CADC's HST cache system
The CADC started the Hubble Space Telescope Project in March 1986. The
center became operational in mid-1987, using standard catalogue data
only because of the launch delay of the Hubble telescope.
Early on, to distinguish itself from the main archive center, the CADC
began implementing value-added features to complement what was already
available.
One of the CADC's important contributions was the first implementation
of the concept of "recalibration-on-the-fly" (hereafter OTFR/OTFC).
In addition to saving precious space at the data provider's end, it
allowed users to receive the desired science images calibrated with
the latest software and the latest calibration files.
Although very useful, the OTFR/OTFC system suffers from one major
problem: time. The recalibration process is quite time-consuming.
For individual requests this is not a big issue, since users can
easily wait a few hours to get their recalibrated data. The time
limitation becomes a serious problem when one tries to use a large
set of images for a data mining project. In this case, any process
that accesses a large number of images has to wait a considerable
amount of time.
To ease these large operations, and to serve the Virtual Observatory
in a more timely manner, the CADC has started creating an HST science
cache in which ALL files are saved. From the raw images to the
drizzled ACS associations, all HST images are directly and immediately
available. This system is called the HST cache.
When requesting data through the CADC's regular HST interface, users
now receive their desired images immediately on their computer,
served from the CADC's HST cache.
The cache is an envelope around HST archive file production. It
is a set of database tables and software agents that ensure that
all science pipeline products are preprocessed and readily available
from local storage at all times. This includes mechanisms to
discover newly observed datasets to insert, and automatic reprocessing
of datasets that benefit from updates to reference files, available
meta-data, and general processing software upgrades.
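The two jobs described above, inserting newly observed datasets and reprocessing datasets affected by an update, can be pictured as one pass of a cache agent. The following is only a minimal sketch under stated assumptions: the function name `plan_cache_pass`, its arguments, and the dataset names in the usage note are hypothetical and do not reflect the CADC's actual tables or agents.

```python
def plan_cache_pass(observed_ids, cached_ids, stale_ids):
    """Plan one pass of a hypothetical cache agent.

    observed_ids: dataset names known to have been observed
    cached_ids:   dataset names already present in the cache
    stale_ids:    dataset names affected by an update (reference files,
                  meta-data, or processing software)
    Returns (to_insert, to_reprocess) as sorted lists.
    """
    observed = set(observed_ids)
    cached = set(cached_ids)
    # Newly observed datasets are those not yet in the cache.
    to_insert = sorted(observed - cached)
    # Only datasets already in the cache can need reprocessing;
    # anything else is handled by the insert step.
    to_reprocess = sorted(set(stale_ids) & cached)
    return to_insert, to_reprocess
```

For instance, `plan_cache_pass(["A", "B", "C"], ["A", "B"], ["B"])` separates the dataset to insert (`C`) from the one to reprocess (`B`). A real agent would read these sets from the database tables rather than from in-memory lists.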
Since 2002, all data from active instruments have been produced from
scratch, triggered by user requests. The reasoning behind the On The
Fly Reprocessing (OTFR) and On The Fly Calibration (OTFC) pipelines was
that they would guarantee that archive users always received their data
equipped with the newest set of meta-data and calibrated according to the
best methods available. This was a clear advantage over the previous system,
where the raw data was produced centrally at the STScI and delivered
to the partner sites, essentially freezing that data in time. Another
advantage of the system was that it conserved storage space, as only the
Hubble Space Telescope telemetry files and a few smaller auxiliary files
needed to be stored, an important resource consideration when data was
stored on optical disks in jukeboxes.
With the advent of cheap mass storage in the form of hard-disk arrays, this
aspect became less important, and a number of other drawbacks of the
on-the-fly paradigm became apparent over time. Live processing of
data requires that support be available at all times to resolve errors
and bugs in the pipeline, an inevitable task when a system becomes this
complex and takes such a heterogeneous set of data as input. Another
drawback is processing speed: producing a dataset could take from
several minutes to hours, which might not be an issue for the patient
astronomer but makes it impossible to expose the data through synchronous
VO protocols. Next-level efforts such as data mining, meta-data harvesting,
and production of high-level data products are also enormously difficult
in the on-the-fly world.
The advantages of the HST Cache are:
- Faster access speed
- Shields users from processing errors
- Direct programmatic & VO protocol access to the data
- Makes the archive less prone to overall system breakdowns
- Allows site interoperability and redundancy
- Less maintenance in the long run
- Allows harvesting of meta-data and data-mining
Programmatic access to the data
Users can now request data from the CADC directly from their local
computer, using either the CADC's programmatic interface (proxy) to its
data collection or the new CADC interface. One small hurdle is that the
user MUST know the filename associated with a given scientific image.
The CADC is trying to simplify those filenames, and a list of the most
important ones is available here.
It is important to note that all HST data are stored in the HSTCA archive,
while all the HLA files are in the HLADR2 archive.
As an example, suppose you want to get the final drizzled ACS image of
the observation J8MJ95080:
http://www.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/getData?archive=HSTCA&file_id=J8MJ95080_DRZ
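For scripted access, the same request can be assembled in code. The sketch below only builds the URL; it assumes that the `getData` endpoint takes exactly the two query parameters visible in the example above (`archive` and `file_id`). The helper name `build_getdata_url` is hypothetical and not part of any CADC library.

```python
from urllib.parse import urlencode

# Base URL copied from the example above.
GETDATA_URL = "http://www.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/getData"


def build_getdata_url(file_id: str, archive: str = "HSTCA") -> str:
    """Return the getData URL for one file in the given archive.

    archive is "HSTCA" for HST data or "HLADR2" for HLA files,
    as described in the text.
    """
    if not file_id:
        raise ValueError("file_id must be a non-empty string")
    # urlencode escapes any characters that are unsafe in a query string.
    query = urlencode({"archive": archive, "file_id": file_id})
    return f"{GETDATA_URL}?{query}"


if __name__ == "__main__":
    print(build_getdata_url("J8MJ95080_DRZ"))
```

Calling `build_getdata_url("J8MJ95080_DRZ")` reproduces the URL shown above. Whether the server answers requests in this form today is not something the sketch can guarantee. The resulting string can be passed to any HTTP client to fetch the file.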
One hidden problem behind the cache concept is the refresh
rate. The CADC still believes that the OTFR/OTFC process is the best
way to ensure the best science data, so the images in the cache
will have to be refreshed quite often using a clever trigger
mechanism. Establishing those triggers is quite a challenge, because
the HST calibration system is rather complex. To know when a given
image has to be refreshed, one has to take into account software
releases, calibration file versions, etc. It is fair to say that, very
often, after a few years, a given instrument's software and calibration
sequence are quite stable. This should minimize the refresh rate of the
cache.
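One simple way to picture such a trigger is to record, for each cached image, the software release and calibration files used to produce it, and to compare that record with what the pipeline would use today. This is a minimal sketch, not the CADC's actual mechanism: the `ProcessingState` class, its fields, and the file names in the usage note are all hypothetical, and a real trigger would have to track more inputs than these two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingState:
    """Inputs that determine a calibrated product (hypothetical fields)."""
    software_version: str
    calibration_files: frozenset  # names/versions of calibration files


def refresh_reasons(cached: ProcessingState, current: ProcessingState) -> list:
    """Return the reasons, if any, why a cached image should be refreshed."""
    reasons = []
    if cached.software_version != current.software_version:
        reasons.append("software release")
    if cached.calibration_files != current.calibration_files:
        reasons.append("calibration files")
    return reasons


def needs_refresh(cached: ProcessingState, current: ProcessingState) -> bool:
    """True when at least one trigger condition is met."""
    return bool(refresh_reasons(cached, current))
```

With a cached state of `ProcessingState("1.0", frozenset({"bias_v1", "flat_v1"}))` and a current state that differs only in using `flat_v2`, `refresh_reasons` reports the calibration files as the sole reason. Once an instrument's software and calibration files stop changing, the comparison returns an empty list and no refresh is scheduled, which matches the expectation of a low refresh rate for stable instruments.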