DaCHS Reference Documentation¶
Author: Markus Demleitner
Email: gavo@ari.uni-heidelberg.de
Date: 2023-05-12
Copyright: Waived under CC-0
Contents
- DaCHS Reference Documentation
- Resource Descriptor Element Reference
- Active Tags
- Grammars Available
- Cores Available
- Predefined Macros
- Mixins
- Triggers
- Renderers Available
- Predefined Procedures
- Predefined Streams
- Configuration Reference
- Code in DaCHS
- Data Descriptors
- Metadata
- Display Hints
- Data Model Annotation
- DaCHS’ Service Interface
- Writing Custom Cores
- Regression Testing
- Datalink and SODA
- Product Previews
- Custom UWSes
- Custom Pages
- Manufacturing Spectra
- Echelle Spectra
- Adapting Obscore
- Writing Custom Grammars
- Scripting
- ReStructuredText
- The DaCHS API
- System Tables
Resource Descriptor Element Reference¶
The following (XML) elements are defined for resource descriptors. Some elements are polymorphous (Grammars, Cores). See below for a reference on the respective real elements known to the software.
Each element description gives a general introduction to the element’s use (complain if it’s too technical; it’s not unlikely that it is since these texts are actually the defining classes’ docstrings).
Within RDs, element properties that can (but need not) be written in XML attributes, i.e., as a single string, are called “atomic”. Their types are given in parentheses after the attribute name along with a default value.
In general, items defaulted to Undefined are mandatory. Failing to give a value will result in an error at RD parse time.
Within RD XML documents, you can (almost always) give atomic children either as an XML attribute (att="abc") or as a child element (<att>abc</att>). Some of the “atomic” attributes actually contain lists of items. For those, you should normally write multiple child elements (<att>val1</att><att>val2</att>), although sometimes it’s allowed to mash together the individual list items using a variety of separators.
Here are some short words about the types you may encounter, together with valid literals:
- boolean – these allow quite a number of literals; use True and False or yes and no and stick to your choice.
- unicode string – there may be additional syntactical limitations on those. See the explanation.
- integer – only decimal integer literals are allowed.
- id reference – these are references to items within XML documents; all elements within RDs can have an id attribute, which can then be used as an id reference. Additionally, you can reference elements in different RDs using <rd-id>#<id>. Note that DaCHS does not support forward references (i.e., references to items lexically behind the referencing element).
- list of id references – lists of id references. The values could be mashed together with commas, but prefer multiple child elements.
There are also “Dict-like” attributes. These are built from XML like:
<d key="ab">val1</d>
<d key="cd">val2</d>
In addition to key, other (possibly more descriptive) attributes for the key within these mappings may also be allowed. In special circumstances (in particular with properties) it may be useful to add to a value:
<property key="brokencols">ab,cd</property>
<property key="brokencols" cumulative="True">,x</property>
will leave ab,cd,x in brokencols.
Many elements can also have “structure children”. These correspond to compound things with attributes and possibly children of their own. The name given at the start of each description is irrelevant to the pure user; it’s the attribute name you’d use when you have the corresponding python objects. For authoring XML, you use the name in the following link; thus, the phrase “colRefs (contains Element columnRef…” means you’d write <columnRef...>.
Here are some guidelines as to the naming of the attributes:
- Attributes giving keys into dictionaries or similar (e.g., column names) should always be named key.
- Attributes giving references to some source of events or data should always be named source, never “src” or similar.
- Attributes referencing generic things should always be called ref; of course, references to specific things like tables or services should indicate in their names what they are supposed to reference.
Also note that examples for the usage of almost everything mentioned here can be found in the GAVO datacenter element reference.
Active Tags¶
The following tags are “active”, which means that they do not directly contribute to the RD being parsed. Instead they define, replay, or edit streams of elements.
Grammars Available¶
The following elements are all grammar related. All grammar elements can occur in data descriptors.
Cores Available¶
The following elements are related to cores. All cores can only occur top-level, i.e., as direct children of resource descriptors. Cores are only useful with an id to make them referenceable from services using that core.
Predefined Macros¶
Macro expansions in DaCHS start with a backslash, arguments are given in curly braces. What macros are available depends on the element doing the expansion; regrettably, not all strings are expanded, and at this point it’s not usually documented which are and which are not (though we hope DaCHS typically behaves “as expected”). If this bites you, complain to the authors and we promise we’ll give fixing this a higher priority.
Mixins¶
Mixins ensure a certain functionality on a table. Typically, this is used to provide certain guaranteed fields to particular cores. For many mixins, there are predefined procedures (both rowmaker applys and grammar rowfilters) that should be used in grammars and/or rowmakers feeding the tables mixing in a given mixin.
Note that when a piece of metadata in a mixin gets in your way, you can selectively override attributes of columns and params by copying and changing them. For instance, if your mixin example#m gives you a column flux, and you need to change its unit, you would say:
<table mixin="example#m">
<column original="flux" unit="mJy"/>
</table>
Triggers¶
In DaCHS, triggers are conditions on rows – either the raw rows emitted by grammars if they are used within grammars, or the rows about to be shipped to a table if they are used within tables. Triggers may be used recursively, i.e., triggers may contain more triggers. Child triggers are normally or-ed together.
Currently, there is one useful top-level trigger, the `element ignoreOn`_. If an ignoreOn is triggered, the respective row is silently dropped (actually, ignoreOn has a bail attribute that allows you to raise an error if the trigger is pulled; this is mainly for debugging).
The following triggers are defined:
Renderers Available¶
The following renderers are available for allowing on services and for URL creation. The parameter style is relevant when adapting condDescs or table-based cores to renderers:
- With clear, parameters are just handed through
- With form, suitable parameters are turned into vizier-like expressions
- With pql, suitable parameters are turned into their PQL counterparts, letting you specify ranges and such.
Unchecked renderers can be applied to any service and need not be explicitly allowed by the service.
Predefined Procedures¶
Procedures available for rowmaker/parmaker apply¶
Procedures available for grammar rowfilters¶
Procedures available for datalink cores¶
Predefined Streams¶
Streams are recorded RD elements that can be replayed into resource descriptors using the FEED active tag. They do, however, support macro expansion; if macros are expanded, you need to give them values in the FEED element (as attributes). What attributes are required should be mentioned in the following descriptions for those predefined streams within DaCHS that are intended for developer consumption.
Streams for building conditions on dbCores¶
Other Streams¶
Configuration Reference¶
DaCHS’ basic configuration is done through one or more INI-style files, which are parsed using Python’s configparser module; in case of doubt on features and syntax, refer to the Python reference documentation of the Python version you are using.
Some of the more common items are discussed in the tutorial (tutorial.html#configuration-settings).
DaCHS first looks for the configuration in /etc/gavo.rc, and on production sites, it is recommended to only use this file to avoid surprises. For special situations and development systems, the following extra features are available:

- DaCHS will also pick up configuration from .gavorc in the calling user’s home directory. Configuration there is merged with what is in the main configuration. You can use this to, for instance, run a second, development-type server on a production instance (obviously, you will have to run it on a different port), or for configuring extra debugging features, or whatever.
- You can override the location of /etc/gavo.rc using the GAVOSETTINGS environment variable. This is intended mainly for when everything DaCHS-related should be in a single, non-OS directory, such as when the DaCHS binaries sit in a container that should be readily exchangeable. Another usage is when, on a development system, you want to be able to run DaCHS using the resources of the production system (but that is, obviously, a bit dangerous if the DaCHS versions on development and production are significantly different).
- You can even override the location of the per-user configuration (i.e., ~/.gavorc) using GAVOCUSTOM. This is mainly for quickly switching between configurations on development systems and thus probably irrelevant for operators.
The configuration items available are, by section:
Code in DaCHS¶
This section contains a few general points for python code embedded in DaCHS. Most of the material applies to procedure definitions (`Element apply`_, `Element dataFormatter`_, `Element dataFunction`_, `Element descriptorGenerator`_, `Element iterator`_, `Element metaMaker`_, `Element processEarly`_, `Element processLate`_, `Element pargetter`_, `Element phraseMaker`_, `Element regTest`_, `Element rowfilter`_, `Element sourceFields`_) as well as to other pieces of code, such as in `Element customGrammar`_, `Element customCore`_, or Custom Pages.
More information on what names DaCHS makes available to your code can be found in Functions Available For Row Makers and The DaCHS API.
Importing Modules¶
To keep the various resources as separate from each other as possible, DaCHS does not manipulate Python’s import path. However, one frequently wants to have library-like modules providing common functionality or configuration in a resdir (the conventional place for these would be in res/).

To import these, use api.loadPythonModule(path). Path, here, is the full path to the file containing the python code, but without the .py. When you have the RD, the conventional pattern is:

mymod, _ = api.loadPythonModule(rd.getAbsPath("res/mymod"))

instead of import mymod.
As you can see, loadPythonModule returns a tuple; you are typically only interested in the first element.
Note in particular that for modules loaded in this way, the usual rule that you can just import modules next to you does not apply. To import a module “next to” yours without having to go through the RD, use the special form:
siblingmod, _ = api.loadPythonModule("siblingmod", relativeTo=__file__)
instead of import siblingmod. This will take the directory part of what’s in relativeTo (here, the module’s own path) and make a full path out of the first argument to pull the module from there.
Database Queries¶
You often need to query the database from DaCHS-related code. To get connections, use DaCHS’ connection pool, and to make sure you return the connections to the pool when done, only use them through context managers. Depending on what you need to do, there are four pools you might be interested in:
- getTableConn – a sane default, lets you look at all tables but not write (anything). This is what DaCHS uses to run queries derived from normal DAL requests.
- getUntrustedConn – a connection that has the privileges of a remote user coming in through TAP.
- getWritableAdminConn – a connection that lets you read and write just about everywhere. When using this, you must explicitly commit the connection, or your changes will be lost.
- getAdminConn – like getWritableAdminConn, but will automatically be committed as it is returned to the pool (unless an exception happened, in which case the connection will be closed without committing it).
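For instance, here is a minimal sketch of using the writable admin pool; the table name sch.my_table is made up, and the explicit commit is what the description above asks for:

from gavo import base

with base.getWritableAdminConn() as conn:
    # Changes on a writable admin connection must be committed explicitly,
    # or they are lost when the connection is returned to the pool.
    conn.execute("UPDATE sch.my_table SET flag=0")
    conn.commit()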
With DaCHS’ connections, you usually will not obtain cursors but directly use one of the query(q, args={}, timeout=None) or queryToDicts(q, args={}, timeout=None, caseFixer=None) methods.

The query q has psycopg2-style placeholders (... WHERE mag<%(maglim)s) which are filled, again psycopg2-like, from the args dictionary. timeout is given in seconds.
Both return an iterator; for query, this yields a tuple per row, for queryToDicts, dictionaries whose keys are the lowercased column names (or something hard to predict for expressions in the select clause not sporting an AS). In case you need mixed-case keys (we recommend you avoid that), you can pass in a caseFixer dictionary that maps the lowercased names to their mixed-case versions.
Note that the queries passed will not be executed unless you start consuming the iterator returned. Hence, only use these methods for statements that actually return rows; use execute for statements that do not return anything.
In sum, the typical database query in the vicinity of DaCHS would look like this:
count_sum = 0
with base.getTableConn() as conn:
    for row in conn.queryToDicts("SELECT * FROM sch.my_table"):
        count_sum += row["count"]
        print(row)
or, for queries not returning anything:
with base.getWritableAdminConn() as conn:
    conn.execute("DROP TABLE obsolete_table")
In case you wondered: Yes, in the past we have experimented with abstracting away the SQL. And we found it doesn’t make the code any more robust, just a lot harder to figure out.
Data Descriptors¶
Most basic information on data descriptors is contained in tutorial.html. The material here just covers some advanced topics.
Updating Data Descriptors¶
By default, dachs imp will try to drop all tables made by the data descriptors selected. For “growing” data, that is suboptimal, since typically just a few new datasets need to be added to the table, and re-ingesting everything else is just a waste of time and CPU.

To accommodate such situations, DaCHS lets you add an updating="True" attribute to a data element; updating DDs will create tables that do not exist but will not drop existing ones.
Using fromdb on ignoreSources¶
Updating DDs will still run like normal DDs and thus import everything matching the DD’s sources. Thus, after the second import you would have duplicate records for sources that existed during the first import. To avoid that, you (usually) need to ignore existing sources (see `Element ignoreSources`_). In the typical case, where a dataset’s accref is just the inputs-relative path to the dataset’s source, that is easily accomplished through the fromdb attribute of ignoreSources; its value is a database query that returns the inputs-relative paths of sources to ignore.
Hence, unless you are playing games with the accrefs (in which case you are probably smart enough to figure out how to adapt the pattern), the following specification will import exactly those FITS files within the data subdirectory of the resdir that haven’t been ingested into the mydata table during the last run, either because they were not there or because they were skipped during an import with -c:
<data id="import" updating="true">
<sources pattern="data/*.fits">
<ignoreSources fromdb="select accref from \schema.mydata"/>
</sources>
<fitsProdGrammar>
<rowfilter procDef="//products#define">
<bind key="table">"\schema.mydata"</bind>
</rowfilter>
</fitsProdGrammar>
<make table="mydata">
<!-- your rowmaker here -->
</make>
</data>
Note that fromdb can be combined with fromfiles and pattern; whatever is specified in the latter two will always be ignored.
To completely re-import such a table – for instance after a table schema change or because the whole data collection has been re-processed – just run dachs drop on the DD and run import as usual.
It is probably a good idea to occasionally run dachs imp -I on tables updated in this way to optimise the indices (a REINDEX <tablename> in a database shell will do, too).
Using fromdbUpdating on ignoreSources¶
Sometimes reprocessing happens quite frequently to a small subset of the datasets in a resource. In that case, it would again be a waste to tear down the entire thing just to update a handful of records.
For such situations, there is the fromdbUpdating attribute of ignoreSources. As with fromdb, this contains a database query, but in addition to the accref, this query has to return a timestamp. A source is then only ignored if this timestamp is not newer than the disk file’s. If that timestamp is the mtime of the file in the original import, the net effect is that files that have been modified since that import will be re-ingested.
There is a catch, though: You need to make sure that the record ingested previously is removed from the table. Typically, you can do that by defining accref as a primary key (if that’s not possible because you are generating multiple records with the same accref, there is nothing wrong with using a compound primary key). This will, on an attempted overwrite, cause an IntegrityError, and you can configure DaCHS to turn this into an overwrite using the table’s forceUnique and dupePolicy attributes.
The following snippet illustrates the technique:
<table id="withdate" mixin="//products#table" onDisk="True"
primary="accref"
forceUnique="True"
dupePolicy="overwrite">
<column name="mtime" type="timestamp"
ucd="time;meta.file"
tablehead="Timestamp"
description="Modification date of the source file."/>
<!-- your other columns -->
</table>
<data id="import" updating="True">
<sources pattern="data/*.fits">
<ignoreSources
fromdbUpdating="select accref, mtime from \schema.withdate"/>
</sources>
<fitsProdGrammar>
<rowfilter procDef="//products#define">
<bind key="table">"\schema.withdate"</bind>
</rowfilter>
</fitsProdGrammar>
<make table="withdate">
<rowmaker>
<map key="mtime">datetime.datetime.utcfromtimestamp(
os.path.getmtime(\fullPath))</map>
<!-- other rowmaker rules -->
</rowmaker>
</make>
</data>
Again, this can be combined with the other attributes of ignoreSources; in effect, whatever is ignored through them is treated as if its modification date were in the future.
Harvesting from Remote Databases¶
When using `Element odbcGrammar`_, identifying what is already ingested clearly cannot use sources. Instead, you will have to pick added or modified records in some other way. Realistically, you should keep some monotonically increasing value on both sides. Ideally, it would be a transaction id, because for these, it’s clear whether or not something has actually been transferred. In a pinch, a unix timestamp will do, too.
Particular care should be taken when harvesting from databases to avoid
duplicate rows when a re-harvest fetches the “same” record twice for
whatever reason. You should probably designate a primary key and
specify a dupePolicy like this:
<table id="mirror" onDisk="True"
primary="obid"
dupePolicy="overwrite">
In case some other table holds foreign keys into your table, it is wise to think hard about whether the dupePolicy should really be dropOld (cf. Element table).
Your data element will use an odbc grammar with a computed query (available in DaCHS newer than 2.5). The example in Element makeQuery shows the basics. The important part is to consider the case when the local table does not exist yet (as it will on the original import). Dealing with this as shown in the example lets you use the same data element for imports and updates.
Metadata¶
Various elements support the setting of metadata through meta elements. Metadata is used for conveying RMI-style metadata used in the VO registry. See [RMI] for an overview of those. We use the keys given in RMI, but there are some extensions discussed in RMI-style Metadata.
The other big use of meta information is for feeding templates. Those “local” keys should all start with an underscore. You are basically free to use those as you like and fetch them from your custom templates. The predefined templates already have some meta items built in, discussed in Template Metadata.
So, metadata is a key-value mapping. Keys may be compound like in RMI, i.e., they may consist of period-separated atoms, like publisher.address.email. There may be multiple items for each meta key.
Defining Metadata¶
In RDs, there are two ways to define metadata: Meta elements and meta streams; the latter are also used in defaultmeta.txt.
Meta Elements¶
These look like normal XML elements and have a mandatory name attribute, a meta key relative to the element’s root. The text content is taken as the meta value; child meta elements are legal.
An optional attribute for all meta elements is format (see Meta Formats).
Typed meta elements can have further attributes; these usually can also be given as meta children with the same name.
Usually, metadata is additive; add a key twice and you will have a sequence of two meta values. To remove previous content, prefix the meta name with a bang (!). Here is an example:
<resource>
<!-- a simple piece of metadata -->
<meta name="title">A Meta example</meta>
<!-- repeat a meta thing for a sequence (caution: not everything
is repeatable in all output formats) -->
<meta name="subject">Examples</meta>
<meta name="subject">DaCHS</meta>
<!-- Hierarchical meta can be set nested -->
<meta name="creator">
<meta name="name">Nations, U.N.</meta>
<meta name="logo">http://un.org/logo.png</meta>
</meta>
<meta name="creator">
<meta name="name">Neumann, A.E.</meta>
</meta>
<!-- @format lets you specify extra markup; make sure you
have consistent initial indentation. -->
<meta name=description" format="rst">
This resource is used in the `DaCHS reference docs`_
.. _DaCHS reference Docs: http://docs.g-vo.org/DaCHS
</meta>
<!-- you can contract "deeper" trees in paths -->
<meta name="contact.email">gavo@ari.uni-heidelberg.de</meta>
<!-- typed meta elements can have additional attributes -->
<meta name="uses" ivoId="ivo://org.gavo.dc/DaCHS"
>DaCHS server software</meta>
<!-- To overwrite a key set before, prefix the name with a bang. -->
<meta name="!title">An improved Meta example</meta>
</resource>
The resulting meta structure is like this:
+-- title
| +---- "An improved Meta example
|
+-- subject
| +---- "Examples
| +---- "DaCHS
|
+-- creator
| +----- name
| | +---- "Nations, U.N.
| +----- logo
| | +---- "http://un.org/logo.png
+-- creator
| +----- name
| +---- "Neumann, A.E.
|
+-- description
| +----- [formatted text, "This resource..."]
|
+-- contact
| +----- email
| +----- "gavo@ari.uni-heidelberg.de
|
+-- uses
+----- "DaCHS server software
+----- ivoId
+----- "ivo://org.gavo.dc/DaCHS
Stream Metadata¶
In several places, most notably in the defaultmeta.txt file and in meta elements without a name attribute, you can give metadata as a “meta stream”. This is just a sequence of lines containing pairs of <meta key> and <meta value>.
In addition, there are comments, empty lines, continuations, forced overwriting, and format selection.
Continuation lines work by ending a line with a backslash. The following line separator and all blanks and tabs following it are then ignored. Thus, the following two meta keys end up having identical values:
meta1: A contin\
uation line needs \
a blank if you wan\
t one.
meta2: A continuation line needs a blank if you want one.
Note that whitespace behind a backslash prevents it from being a continuation character. That is, admittedly, a bit of a trap.
Other than their use as continuation characters, backslashes have no special meaning within meta streams as such. Within meta elements, however, macros are expanded after continuation line processing if the meta parent knows how to expand macros. This lets you write things like:
<meta>
creationDate: \metaString{authority.creationDate}
managingOrg:ivo://\getConfig{ivoa}{authority}
</meta>
Comments and empty lines are easy: Empty lines are allowed, and a comment is a line with a hash (#) as its first non-whitespace character. Both constructs are ignored, and you can even continue comments (though you should not).
When you repeat a key, metadata is added. Hence:
subject: active-galaxies
subject: meteorites
will lead to two keywords in the subject meta. Sometimes you instead want to overwrite what is already there, in particular for meta that needs to be unique:
!title: A revised title
will make sure that there is just a single title meta, and it is “A revised title”.
Finally, stream meta has format=plain by default. To select raw or reStructuredText format, prefix the value with raw: or rst:, respectively:
_sidebarlocal: raw:<div class="sidebarnote">\
<a href="/adql">Try ADQL</a> to query our data.</div>
description: rst:This is *nice*, **important** data.
Meta inheritance¶
When you query an element for metadata, it first sees if it has this metadata. If that is not the case, it will ask its meta parent. This usually is the embedding element. It will again delegate the request to its parent, if it exists. If there is no parent, configured defaults are examined. These are taken from rootDir/etc/defaultmeta, where they are given as colon-separated key-value pairs, e.g.,
publisher: The GAVO DC team
publisherID: ivo://org.gavo.dc
contact.name: GAVO Data Center Team
contact.address: Moenchhofstrasse 12-14, D-69120 Heidelberg
contact.email: gavo@ari.uni-heidelberg.de
contact.telephone: ++49 6221 54 1837
creator.name: GAVO Data Center
creator.logo: http://vo.ari.uni-heidelberg.de/docs/GavoTiny.png
The effect is that you can give global titles, descriptions, etc. in the RD but override them in services, tables, etc. The configured defaults let you specify meta items that are probably constant for everything in your data center, though of course you can override these in your RD elements, too.
In HTML templates, missing meta usually is not an error. The corresponding elements are just left empty. In registry documents, missing meta may be an error.
Meta formats¶
Metadata must work in registry records as well as in HTML pages and possibly in other places. Thus, it should ideally be given in formats that can be sensibly transformed into the various formats.
DaCHS knows four input formats:
- literal
- The textual content of the element will not be touched. In HTML, it will end up in a div block of class literalmeta.
- plain
- The textual content of the element will be whitespace-normalized, i.e., whitespace will be stripped from the start and the end, runs of blanks and tabs are replaced by a single blank, and empty lines translate into paragraphs. In HTML, these blocks come in plainmeta div elements.
- rst
- The textual content of the element is interpreted as ReStructuredText. When requested as plain text, the ReStructuredText itself is returned, in HTML, the standard docutils rendering is returned.
- raw
- The textual content of the element is not touched. It will be embedded into HTML directly. You can use this, probably together with CDATA sections, to embed HTML – the other formats should not contain anything special to HTML (i.e., they should be PCDATA in XML lingo). While the software does not enforce this, raw content should not be used with RMI-type metadata. Only use it for items that will not be rendered outside of HTML templates.
Macros in Meta Elements¶
Macros will be expanded in meta items using the embedding element as macro processors (i.e., you can use the macros defined by this element).
Typed Meta Elements¶
While generally the DC software does not care what you put into meta items and views them all as strings, certain keys are treated specially. The following meta keys trigger some special behaviour:
Additionally, there is creator, which is really special (at least for now). When you set creator to a string, the string will be split at semicolons, and for each substring a creator item with the respective name is generated. This may sound complicated but really does about what you would expect when you write:
<meta name="creator">Last, J.; First, B.; Middle, I.</meta>
Metadata in Standard Renderers¶
Certain meta keys have a data center-internal interpretation, used in renderers or writers of certain formats. These keys should always start with an underscore. Among those are:
_intro: used by the standard HTML template for explanatory text above the search form.
_bottominfo: used by the standard HTML template for explanatory text below the search form.
_related: used in the standard HTML template for links to related services. As listed above, this is a link, i.e., you can give a title attribute.
_longdoc: used by the service info renderer for an explanatory piece of text of arbitrary length. This will usually be in ReStructuredText, and we recommend having the whole meta body in a CDATA section.
_news: news on the service. See above at Typed Meta Elements.
_warning: used by both the VOTable and the HTML table renderer. The content is rendered as some kind of warning. Unfortunately, there is no standard how to do this in VOTables. There is no telling if the info elements generated will show anywhere.
_noresultwarning: displayed by the default response template instead of an empty table (use it for things like “No Foobar data for your query”).
_type: on Data instances, used by the VOTable writer to set the …
superseded: in RDs or services, marks them as superseded, which generally makes them inaccessible. The body of this meta should provide pointers to where the new version(s) of the resources might be found (cf. tutorial.html#deleting-resources).
_plotOptions: typically set on services, this lets you configure the initial appearance of the javascript-based quick plot. The value must be a javascript dictionary literal (like …
RMI-Style Metadata¶
For services (and other things) that are registered in the Registry, you must give certain metadata items (and you can give more), where we take their keys from [RMI]. We provide an explanatory leaflet for data providers. The most common keys – used by the registry interface and in part by HTML and VOTable renderers – include:
title: this should in general be given separately on the resource, each table, and each service. In simple cases, though, you may get by with just one global title on the resource and rely on metadata inheritance.
shortName: a string that should indicate what the service is in 16 characters or less.
creationDate: Use ISO format with time, UTC only, like this: 2007-10-04T12:00:00Z
_dataUpdated: The timestamp of the last successful dachs imp, again in DALI/ISO format.
_metadataUpdated: Timestamp when the metadata was last updated. On RDs, that’s the timestamp of the RD source, on published things, it’s the timestamp of the last dachs pub.
subject: A subject keyword. By VOResource 1.1, these should be taken from http://www.ivoa.net/rdf/uat
rights: freetext copyright notice. See tutorial.html#Licensing for details.
rights.rightsURI: machine-readable license URI. There’s no formal list of acknowledged URIs yet, but the common CC URIs are a good start.
source: bibcodes will be expanded to ADS links here.
referenceURL: again, a link, so you can give a title for presentation purposes. If you give no referenceURL, the service’s info page will be used.
creator.name: this should be the name of the “author” of the data set. If you set this, you may want to override creator.logo as well. For persons, always use the form “Last, F.I.”; this saves all components the unsolvable problem of telling first from last names, and sorting these strings will naturally yield the sequence people expect. Also, if you have multiple creators, better just set creator as discussed in Typed Meta Elements.
type: one of Other, Archive, Bibliography, Catalog, Journal, Library, Simulation, Survey, Transformation, Education, Outreach, EPOResource, Animation, Artwork, Background, BasicData, Historical, Photographic, Press, Organisation, Project, Registry – it’s optional and we doubt its usefulness. You may repeat the content type if you need to; see also [RMI], sect. 3.3.
contentLevel: addressee(s) of the data: Research, Amateur, General
facility: no IVOA ids are supported here yet, but probably this should change.
coverage: see the special section
service-specific metadata (for SIA, SCS, etc.): see the documentation of the respective cores.
utype: tables (and possibly other items) can have utypes to signify their role in specific data models. For tables, this utype gets exported to the tap_schema.
identifier: this is the IVOID of the resource, usually generated by DaCHS. Do not override this unless you know what you are doing (which at least means you know how to make DaCHS declare an authority and claim it). If you do override the identifier of a service that’s already published, make sure you run dachs admin makeDeletedRecord <previous identifier> (before or after the dachs pub on the resource), or the registries will have two copies of your record, one of which will not be updated any more; and that would suck for Registry users.
mirrorURL: add these on publication to declare mirrors for a service. Only do so if you actually manage the other service. If you list the service’s own accessURL here, it will be filtered from this registry record; this is so you can use the same RD on the primary site and the mirror.
_example: A DALI example. See tutorials.html#writing-examples
moreExamples: A URI for an additional DALI examples document. These get translated to DALI continuation-s.
tableset: A DaCHS reference to a table to include in a registry tableset (new in 2.7.3). This currently will only be interpreted in document-typed resources.
While you can set any of these in etc/defaultmeta.txt, the following items are usually set there:
- publisher
- publisherID
- contact.name
- contact.address
- contact.email
- contact.telephone
Coverage Metadata¶
Coverage metadata lets clients get a quick idea of where in space, time, and electromagnetic spectrum the data within a resource is. Obviously, this information is particularly important for resource discovery in registries.
Not all resources have coverages on all axes; a service validator, say, probably has no physical coverage at all, and a theoretical spectral service may just have meaningful spectral coverage.
There are two meta keys pertinent to coverage metadata:
- coverage.waveband – one of Radio, Millimeter, Infrared, Optical, UV, EUV, X-ray, Gamma-ray; you can have multiple waveband specifications. As this information is quite regularly used in discovery, you should make sure to define it if applicable.
- coverage.regionOfRegard – in essence, the “pixel size” of the service in degrees. If, for example, your service gives data on a lattice of sampling points, the typical distance of such points should be given here. You will not usually specify this unless your “pixel size” is significantly larger than about an arcsec.

The legacy coverage.profile meta key should not be used any more.
To give proper, numeric STC coverage, use the `Element coverage`_.
It has three children, one each for the spatial, spectral, and temporal axes. For spectral and temporal, just add as many intervals as necessary. Do not worry about gaps in the temporal coverage: it is not necessary that the coverage is “tight”; as long as there is a reasonable expectation that data could be there, it’s fine to declare coverage. Hence, for ground-based observations, there is no need to exclude intervals of daylight, bad weather, or even maintenance downtime.
Intervals are given as in VOTable tabledata, i.e., as two floating point numbers separated by whitespace. There are no (half-) open intervals – just use insanely small or large numbers if you really think you need them.
For spatial coverage, a single spatial element should be given. It has to contain a MOC in ASCII serialisation. Recent versions of Aladin can generate those, or you can write SQL queries to have them computed by sufficiently new versions of pgsphere. Most typically, you will use updater elements to fill spatial coverage (see below).
A complete coverage element would thus look like this:
<coverage>
<spectral>3.8e-07 5.2e-07</spectral>
<temporal>18867 27155</temporal>
<spatial>
4/2068
5/8263,8268-8269,8271,8280,8323,8326,8329,9376,9378
6/33045-33047,33049,33051,33069,33080-33081,33083,33104-33106,
33112,33124-33126,33128-33130,33287,33289,33291,33297-33299,
33313,33315,33323-33326,33328-33330,37416,37418,37536
</spatial>
</coverage>
In general, computing coverage is a tedious task. Hence, DaCHS has rules to compute it for many common cases (SSAP, SIAP, Obscore, catalogs with usable UCDs). Because coverage calculations can run for a long time, they are not performed online. Instead, DaCHS updates coverage elements when the operator runs dachs limits. In the simplest case, operators add:
<coverage>
<updater sourceTable="data"/>
<spectral/>
<temporal/>
<spatial/>
</coverage>
into an RD with a table named data. Currently, this must be lexically below the table element; if this is not eventually fixed to allow placing the coverage element near the rest of the metadata at the top of the RD, complain fiercely.
Operators then run dachs limits q (assuming the RD is called q.rd), and DaCHS will fill out the three coverage elements (in case you want to fix them: the heuristics it uses to do that are in gavo.user.info).
In this construction, DaCHS will overwrite any previous content in the coverage child elements. If you want to fill out some coverage items manually and have DaCHS only compute, say, the spatial coverage, don’t give the sourceTable attribute (which essentially says: “grab as much coverage from the referenced table as you can”) but rather the specialised spaceTable. This is particularly useful if you want to annotate ”holes” in your temporal coverage. For instance, if your resource contains two fairly separate campaigns (which DaCHS does not currently realise automatically):
<coverage>
<updater spaceTable="main"/>
<spatial/>
<temporal>45201 45409</temporal>
<temporal>54888 55056</temporal>
</coverage>
Due to limitations of pgsphere, DaCHS does not currently take into account the size of the items in a database table. While that is probably all right for spectra and catalogs, for images this might lose significant coverage, as DaCHS only uses the centers of the images and just marks the containing healpix of the selected MOC order. The default MOC order is 6 (a resolution of about a degree). Until we properly deal with polygons, make sure to increase the MOC order to at least the order of magnitude of the images in an image service, like this:
<coverage>
<updater sourceTable="main" mocOrder="4"/>
<spatial/>
</coverage>
If you know your resource only contains relatively few but compact patches, you may also want to increase mocOrder (spatial resolution doubles when you increase mocOrder by one).
Display Hints¶
Display hints use an open vocabulary. As you add value formatters, you can evaluate any display hint you like. Display hints understood by the built-in value formatters include:
- displayUnit
- use the value of this hint as the unit to display a value in.
- spectralUnit
- as displayUnit, except that non-linear transformations between length, frequency, and energy are also supported, assuming they refer to electromagnetic radiation (otherwise it makes absolutely no sense to convert Joule to meter, say).
- nopreview
- if this key is present with any value, no HTML code to generate previews when mousing over a link will be generated.
- sepChar
- a separation character for sexagesimal displays and the like.
- sf
- “Significant figures” – length of the mantissa for this column. Will probably be replaced by a column attribute analogous to what VOTable does.
- type
- a key that gives hints on what to do with the column. Values currently understood include:
- bar
- display a numeric value as a bar of length value pixels.
- bibcode
- display the value as a link to an ADS bibcode query.
- checkmark
- in HTML tables, render this column as empty or checkmark depending on whether the value is false or true to python.
- humanDate
- display a timestamp value or a real number in either yr (julian year), d (JD, or MJD if DaCHS guesses it’s mjd; that’s unfortunately arcane still), or s (unix timestamp) as an ISO string.
- humanDay
- display a timestamp or date value as an ISO string without time.
- humanTime
- display values as h:m:s.
- keephtml
- lets you include raw HTML. In VOTables, tags are removed.
- product
- treats the value as a product key and expands it to a URL for the product (i.e., typically image). This is defined in protocols.products. This display hint is also used by, e.g., the tar format to identify which columns should contribute to the tar file.
- dms
- format a float as degree, minutes, seconds.
- simbadlink
- formats a column consisting of alpha and delta as a link to query simbad. You can add a coneMins displayHint to specify the search radius.
- hms
- force formatting of this column as a time (usually for RA).
- url
- makes value a link in HTML tables. The anchor text will be the last element of the path part of the URL, or, if given, the value of the anchorText property of the column (which is for cases when you want a constant text like “Details”). If you need more control over the anchor text, use an outputField with a formatter.
- imageURL
- makes value the src of an image. Add width to force a certain image size.
- noxml
- if ‘true’ (exactly like this), do not include this column in VOTables.
Note that not every combination of display hints is correctly interpreted. The interpretation is greedy, and only one formatter at a time attempts to interpret display hints.
Data Model Annotation¶
In the VO, data models are used when simple, more or less linear annotation methods like UCDs do not provide sufficient expressive power. Or well, they should be used. As of early 2017, things are, admittedly, still a mess.
DaCHS lets you annotate your data in dm elements; the annotation will then be turned into standard VOTable annotation (when that’s defined). Sometimes, the structured references provided by the DM annotation are useful elsewhere, too – the first actual use of this framework was the GeoJSON serialisation discussed below.

We first discuss SIL, then its use in actual data models. At least skim over the next section – it sucks to discover the SIL grammar by trial and error.

Old-style STC annotation is not discussed here. If you still want to do it (and for now, you have to if you want any STC annotation – sigh), check out the terse discussion in the tutorial.
Annotation Using SIL¶
Data model annotation in DaCHS is done using SIL, the Simple Instance Language. It essentially resembles JSON, but all delimiters not really necessary for our use case have been dropped, and type annotation has been added.
The elements of SIL are:
- The NULL literal, __NULL__. Attributes that are set to this literal are elided. This is mostly useful in connection with data model annotation within mixins.

- Atomic Values. For SIL, everything is a string (it’s a problem of DM validation to decide otherwise). When your string consists exclusively of alphanumerics and [._-], you can just write it in SIL. Otherwise, you must use double quotes; as in SQL, write two double quotes to include a literal double quote. So, valid literals in SIL are:

  2.3e-3
  red
  "white and blue"
  """Yes,"" the computer said."
  "could write (type:foo) {bar. baz} here" (elements of SIL are protected in quoted literals)

  Invalid literals include:

  http://www.g-vo.org (: and / may not occur in literals)
  red, white and blue (no blanks and commas)
  22" (no single quotes)

- Plain Identifiers. These are C-like identifiers (a letter or an underscore optionally followed by letters, numbers, or underscores).

- Comments. SIL comments are classical C-style comments (/*...*/). They don’t nest yet, but they probably will at some point, so don’t write /* within a comment.

- Object annotation. This is like a dictionary; only plain identifiers are allowed as keys. So, an object looks like this:

  { foo: bar
    longer: "This is a value with blanks in it" }

  Note again that no commas or quotes around the keys are necessary (or even allowed).

- Sequences. This is like a list. Members can be atomic or objects, but they have to be homogeneous (SIL doesn’t enforce this by grammatical means, though). Here is an object with two sequences:

  { seq1: [3 4 5 "You thought these were numbers? They're strings!"]
    seq2: [
      { seq_index: 0 value: 3.3}
      { seq_index: 1 value: 0.3} ] }

- References. The point of SIL is to say things about column and param instances. Both of them (and other dm instances, tables, and in principle anything else in RDs) can be referenced from within SIL. A reference starts with an @ and is then a normal DaCHS cross identifier (columns and params within a table can be referenced by name only; columns take precedence on name clashes). If you use odd characters in your RD names or in-RD identifiers, think again: only [._/#-] are allowed in such references. Here is an object with some valid references:

  { long: @raj2000 /* a column in the enclosing table */
    lat: @dej2000
    system: @//systems#icrs /* could be a dm instance in a DaCHS-global RD; this does *not* exist yet */
    source: @supercat/q#main /* perhaps a table in another RD */ }

- Casting. You can (and sometimes have to) give explicit types in the SIL annotation. Types look like C-style casts. The root of a SIL annotation must always have a cast; that allows DaCHS to figure out what it is, which is essential for validation (and possibly inference of defaults and such). You can cast both single objects and sequences. Here’s an example that actually validates for DaCHS’ SIL (which the examples above wouldn’t because they’re missing the root annotation):

  (testdm:testclass) { /* cast on root: mandatory */
    attr1 { /* no cast here; DaCHS can infer attr1's type if necessary */
      attr2: val
    }
    seq: (testdm:otherclass)[ /* Sequence cast: */
      {attr1: a} /* all of these are now treated as testdm:otherclass */
      {attr1: b}
      {attr1: c}]}
Photometry annotation¶
To produce photcal groups as per the 2020 Timeseries Note and perhaps later specs, use an annotation like this:
<dm>
(phot:PhotCal) {
filterIdentifier: "Gaia/G"
zeroPointFlux: \zeroPointFlux
magnitudeSystem: Vega
effectiveWavelength: \effectiveWavelength
value: @phot
}
</dm>
– where phot would be the column containing the photometry.
When you are using the //timeseries#phot-0 mixin – as you most likely are when you want such a group – you would only give it explicitly when you have multiple photometry systems in one light curve (which you should avoid). For the first photometry column, this declaration is done by the mixin.
GeoJSON annotation¶
To produce GeoJSON output (as supported by DaCHS’ TAP implementation), DaCHS needs to know what the “geometry“ in the sense of GeoJSON is. Furthermore, DaCHS keeps supporting declaring reference systems in the crs attribute, as the planetology community uses it.
The root class of the geojson DM is geojson:FeatureCollection. It has up to two attributes (crs and feature), closely following the GeoJSON structure itself. The geometry is defined in feature’s geometry attribute. All columns not used for geometry will end up in GeoJSON properties.
So, a complete GeoJSON annotation, in this case for an EPN-TAP table, could look like this:
<table>
<dm>
(geojson:FeatureCollection){
crs: (geojson:CRS) {
type: name
properties: (geojson:CRSProperties) {
name: "urn:x-invented:titan"}}}}
feature: {
geometry: {
type: sepsimplex
c1min: @c1min
c2min: @c2min
c1max: @c1max
c2max: @c2max }}}
</dm>
<mixin
spatial_frame_type="body"/>
</table>
Yes, the use of type attributes is a bit of an abomination, but we wanted the structure to follow GeoJSON in spirit.
The crs attribute could also be of type link, in which case the properties would have attributes href and type; we’re not aware of any applications of this in planetology, though. crs is optional (but standards-compliant GeoJSON clients will interpret your coordinates as WGS84 on Earth if you leave it out).
For geometry, several values for type are defined by DaCHS, depending on how the GeoJSON geometry should be constructed from the table. Currently defined types include (complain if you need something else, it’s not hard to add):
- sepcoo – this is for a spherical point with separate columns for the two axes. This needs latitude and longitude attributes, like this:

  <dm>
    (geojson:FeatureCollection){
      feature: {
        geometry: {
          type: sepcoo
          latitude: @lat
          longitude: @long }}}
  </dm>

- seppoly – this constructs a spherical polygon out of column references. These have the form c_n_m, where m is 1 or 2, and n is counted from 1 up to the number of points. DaCHS will stop collecting points as soon as it doesn’t find an expected key. If you find yourself using this, check your data model. An example:

  <dm>
    (geojson:FeatureCollection){
      feature: {
        geometry: {
          type: seppoly /* a triangle of some kind */
          c1_1: @rb0 c1_2: @rb1
          c2_1: @lb0 c2_2: @lb1
          c3_1: @t0 c3_2: @t1 }}}
  </dm>

- sepsimplex – this constructs a spherical box-like thing from minimum and maximum values. It has c[12](min|max) keys as in EPN-TAP. As a matter of fact, a fairly typical annotation for EPN-TAP would be:

  <dm>
    (geojson:FeatureCollection){
      feature: {
        geometry: {
          type: sepsimplex
          c1min: @c1min
          c2min: @c2min
          c1max: @c1max
          c2max: @c2max }}}
  </dm>

- geometry – this constructs a geometry from a pgsphere column. Since GeoJSON doesn’t have circles, only spoint and spoly columns can be used. They are referenced from the value key. For instance, obscore and friends could use:

  <dm>
    (geojson:FeatureCollection) {
      feature: {
        geometry: {
          type: geometry
          value: @s_region }}}
  </dm>
DaCHS’ Service Interface¶
Even though normal users should rarely be confronted with too many of the technical details of request processing in DaCHS, it helps to have a rough comprehension in order to understand several user-visible details.
In DaCHS’ architecture, a service is essentially a combination of a core and a renderer. The core is what actually does the query or the computation, the renderer adapts inputs and outputs to what a protocol or interface expects. While a service always has exactly one core (could be a nullCore, though), it can support more than one renderer, although the parameters in all renderers are, within reason, about the same.
However, parameters on a form interface will typically be interpreted differently from a VO interface on the same core. For instance, ranges on the form interface are written as 1 .. 3 (VizieR compliance), on an SSA 1.x interface as 1/3 (“PQL” prototype), and on a datalink dlget interface as “1 2” (DALI 1.1 style). The extreme of what probably still makes sense is the cone search core that replaces SCS’s RA, DEC, and SR with an entirely different set of parameters perhaps better suited for interactive, browser-based usage.
Cores communicate their input interface by defining an input table, which is essentially a sequence of input keys, which in turn essentially work like params: in particular, they have all the standard metadata like units, ucds, etc. Input tables, contrary to what their name might suggest, have no rows. They can hold metadata, though, which is sometimes convenient to pass data between parameter parsers and the core.
When a request comes in, the service first determines the renderer responsible. It then requests an inputTable for that renderer from the core. The core, in turn, will map each inputKey in its inputTable through a renderer adaptor as returned from svcs.inputdef.getRendererAdaptor; this inspects the renderer.parameterStyle, which must be taken from the svcs.inputdef._RENDERER_ADAPTORS’ keys (currently form, pql, dali). inputKeys have to have the adaptToRenderer property set to True to have them adapted. Most automatically generated inputKeys have that; where you manually define inputKeys, you would have to set the property manually if you want that behaviour (and know that you want it; outside of table-based cores, it is unlikely that you do).
Core Args¶
The input table, together with the raw arguments coming from the client, is then used to build a svcs.CoreArgs instance, which in turn takes the set of input keys to build a context grammar. The core args have the underlying input table (with the input keys for the metadata) in the inputTD attribute, the parsed arguments in the dictionary args.

For each input key, args maps its name to a value; context grammars are case-semisensitive, meaning that case in the HTTP parameter names is in general ignored, but if a parameter name matching case is found, it is preferred. Yes, ugly, but unfortunately the VO has started with case-insensitive parameter names. Sigh.
The values in args are a bit tricky:
- each raw parameter given must parse with a single inputKey’s parse. For instance, if an inputKey is a real[2], it will be parsed as a float array.
- if no raw parameter is given for an input key, its value will be None.
- when an inputKey specifies multiplicity=”multiple”, the non-None value in the core args is a list. Each list item is something that came out of the inputKey’s parser (i.e., it could be another list for array-valued parameters).
- when an inputKey specifies multiplicity=”single”, the value in the core args is a single value of whatever inputKey parses (or None for missing parameters). This is even true when a parameter has been given multiple times; while currently, the last parameter will win, we don’t guarantee that.
- when an inputKey specifies multiplicity=”force-single”, DaCHS works as in the single case, except that multiple specification will lead to an error.
- when an inputKey does not specify multiplicity, DaCHS will infer the desired multiplicity from various hints; essentially, enumerated parameters (values/options given in some way) have multiplicity multiple, everything else multiplicity single. It is wise not to rely on this behaviour.
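To make these rules concrete, here is a small, purely illustrative sketch with made-up parameter names showing what a core could find in its args dictionary:

# Illustrative values only; "band", "maglim", and "mjd" are hypothetical input keys.
example_args = {
    "band": ["J", "K"],  # multiplicity="multiple", parameter given twice
    "maglim": 14.5,      # multiplicity="single", given once
    "mjd": None,         # parameter not given at all
}
assert example_args["mjd"] is None      # missing parameters map to None
assert len(example_args["band"]) == 2   # multiple values arrive as a list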
These rules are independent of the type of core and hold for pythonCores or whatever just as for the normal, table-based cores. For these (and they are what users are mostly concerned with), special rules and shortcuts apply, though.
Table-based cores¶
Conddescs and input keys: Defining the input parameters¶
You will usually deal with cores querying database tables – dbCore, ssapCore, etc. For these, you will not normally define an inputTable, as it is being generated by the software from condDescs.
To create simple constraints, just buildFrom the columns queried:
<condDesc buildFrom="myColumn"/>
(the names are resolved in the core’s queried table). DaCHS will automatically adapt the concrete parameter style to the renderer – in the web interface, there are VizieR-like expressions; in protocol interfaces, you get fields understanding expressions, either as in SSAP (for the pql parameter style) or as defined in DALI (the dali parameter style).
This will generate query fields that work against data as stored in the database, with some exceptions (columns containing MJDs will, for example, be turned into VizieR-like date expressions for web forms).
Since in HTML forms, astronomers often ask for odd units and then want to input them, too, DaCHS will also honor the displayUnit display hint for forms. For instance, if you wrote:
<table id="ex1">
<column name="minDist"
unit="deg"
displayHint="displayUnit=arcsec"/>
...
<dbCore queriedTable="ex1">
<condDesc buildFrom="minDist"/>
...
then the form renderer would declare the minDist column to take its values in arcsecs and do the necessary conversions, while minDist would properly work with degrees in SCS or TAP.
For object lists and similar, it is frequently desirable to give the possible values (unless there are too many of those; these will be translated to option lists in forms and to metadata items for protocol services and hence be user visible). In this case, you need to change the input key itself. You can do this by deriving the input key from the column and assign it to a condDesc, like this:
<condDesc>
<inputKey original="source">
<values fromdb="source from plc.data"/>
</inputKey>
</condDesc>
Use the showItems="n" attribute of input keys to determine how many items in the selector are shown at one time.
If you want your service to fail if a parameter is not given, declare the condDesc as required:
<condDesc buildFrom="myColumn" required="True"/>
(you can also declare an individual inputKey
as required
).
If, on the other hand, you want DaCHS to fill in a default if the user provides no value, give a default to the input key using the values child:
<condDesc>
<inputKey original="radius">
<values default="0.5"/>
</inputKey>
</condDesc>
Sometimes a parameter shouldn’t be defaulted in a protocol request
(perhaps to satisfy an external contract), while the web interface
should pre-fill a sensible choice. In that case, use the
defaultForForm
property:
<condDesc>
<inputKey original="radius">
<property key="defaultForForm">0.5</property>
</inputKey>
</condDesc>
DaCHS will also interpret min
and max
attributes on the input
keys (and the columns they are generated from) to generate input hints;
that’s a good way to fight the horror vacui users have when there’s an
input box and they have no idea what to put there. The best way to deal
with this, however, is to not change the input keys but the columns
themselves, as in:
<table id="ex1">
<column name="mjd" type="double precision"
...>
<values min="" max=""/>
...
<dbCore queriedTable="ex1">
<condDesc buildFrom="mjd"/>
You will typically leave min and max empty and run:
dachs limits q#ex1
when the table contents change; this will make DaCHS update the values in the RD itself.
Phrasemakers: Making custom queries¶
CondDescs will generate SQL adapted to the type of their input keys;
as you can imagine, for cases like the VizieR expressions, that is
not done in a couple of lines. However, there are times when you need
custom behaviour. You can then give your conddescs a phraseMaker
, a
piece of python code generating a query and adding parameters:
<condDesc>
<inputKey original="confirmed" multiplicity="single">
<property name="adaptToRenderer">False</property>
</inputKey>
<phraseMaker>
<code>
if inPars.get(inputKeys[0].name, False):
yield "confirmed IS NOT NULL"
</code>
</phraseMaker>
</condDesc>
PhraseMakers work like other code embedded in RDs (and thus may have
setup). inPars
gives a dictionary of the input parameters as parsed
by the inputDD according to multiplicity. inputKeys
contains a
sequence of the conddesc’s inputKeys. By using their names as above,
your code will not break if the parameters are renamed.
It is usually a good idea to set the property adaptToRenderer
to
False in such cases – you generally don’t want DaCHS to use its standard
rules for input key adaptation as discussed above because that will
typically change what ends up in inPars
and hence break your code
for some renderers.
Note again that parameters not given will have the value None
throughout. They will be present in inPars, though, so do not try
things like "myName" in inPars
– that’s always true.
Phrase makers must yield zero or more SQL fragments; multiple SQL
fragments are joined in conjunctions (i.e., end up in ANDed conditions
in the WHERE clause). If you need to OR your fragments, you’ll
have to do that yourself. Use base.joinOperatorExpr(operator,
operands) to construct such ORs robustly.
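For instance, a phrase maker that should match rows satisfying either of two conditions could combine its fragments before yielding them; here is a sketch (the column names main_id and alt_id are invented):

# inside a phraseMaker's code element
key = base.getSQLKey("objName", inPars[inputKeys[0].name], outPars)
fragments = [
    "main_id=%%(%s)s"%key,
    "alt_id=%%(%s)s"%key]
yield base.joinOperatorExpr("OR", fragments)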
Since you are dealing with raw SQL here, never include material from
inPars
directly in the query strings you return – this would immediately
let people do SQL injections at least when the input key’s type is
text or similar. Instead, use the getSQLKey
function as in this example:
<condDesc>
<inputKey original="hdwl" multiplicity="single"/>
<phraseMaker>
<code>
ik = inputKeys[0]
destRE = "^%s\\.[0-9]*$"%inPars[ik.name]
yield "%s ~ (%%(%s)s)"%(ik.name,
base.getSQLKey("destRE", destRE, outPars))
</code>
</phraseMaker>
</condDesc>
getSQLKey
takes a suggested name, a value and a dictionary, which
within phrase makers always is outPars
. It will enter value with the
suggested name as key into outPars
or change the suggested name if there
is a name clash. The generated name will be returned, and that is what
is entered in the SQL statement.
The outPars
dictionary is shared between all conddescs entering into
a query. Hence, if you do anything with it except passing it to
base.getSQLKey
, you’re voiding your entire warranty.
Here’s how to define a condDesc doing a full text search in a column:
<condDesc>
<inputKey original="source" description="Words from the catalog
description, e.g., author names or title words.">
<property name="adaptToRenderer">False</property>
</inputKey>
<phraseMaker>
<code>
yield ("to_tsvector('english', source)"
" @@ plainto_tsquery('english', %%(%s)s)")%(
base.getSQLKey("source", inPars["source"], outPars))
</code>
</phraseMaker>
</condDesc>
Incidentally, this would go with an index definition like:
<index columns="source" method="gin"
>to_tsvector('english', source)</index>
Grouping Input Keys¶
For special effects, you can group inputKeys. This will make them show up under a common label and in a single line in HTML forms. Other renderers currently don’t do anything with the groups.
Here’s an example for a simple range selector:
<condDesc>
<inputKey name="el" type="text" tablehead="Element"/>
<inputKey name="mfmin" tablehead="Min. Mass Fraction \item">
<property name="cssClass">a_min</property>
</inputKey>
<inputKey name="mfmax" tablehead="Max. Mass Fraction \item">
<property name="cssClass">a_max</property>
</inputKey>
<group name="mf">
<description>Mass fraction of an element. You may leave out
either upper or lower bound.</description>
<property name="label">Mass Fraction between...</property>
<property name="style">compact</property>
</group>
</condDesc>
You will probably want to style the result of this effort using the
service
element’s customCSS
property, maybe like this:
<service...>
<property name="customCSS">
input.a_min {width: 5em}
input.a_max {width: 5em}
input.formkey_min {width: 6em!important}
input.formkey_max {width: 6em!important}
span.a_min:before { content:" between "; }
span.a_max:before { content:" and "; }
tr.mflegend td {
padding-top: 0.5ex;
padding-bottom: 0.5ex;
border-bottom: 1px solid black;
}
</property>
</service>
See also the entries on multi-line input, selecting input fields with a widget, and customizing generated SCS conditions in DaCHS’ howto document.
Output tables¶
When determining what columns to include in a response from a table-based core, DaCHS follows relatively complicated rules because displays in the browser and almost anywhere else are subject to somewhat different constraints. In the following, when we talk about “VOTable”, we refer to all tabular formats produced by DaCHS (FITS binary, CSV, TSV…).
The column selection is influenced by:
Verbosity. This is controlled by the VERB parameter (1..3) or, preferably, verbosity (1..30). Only columns with verbLevel not exceeding verbosity (or, if not given, VERB*10) are included in the result set. This, in particular, means that columns with verbLevel larger than 30 are never automatically included in output tables (but they can be manually selected for HTML using _ADDITEM).
Output Format. While VOTable takes the core’s output table and applies the verbosity filter, HTML uses the service’s output table as the basis from which to filter columns. On the other hand, in HTML output the core output table is used to create the list of potential additional columns.
votableRespectsOutputTable. This is a property on services that makes DaCHS use the service’s output table even when generating VOTable output if it is set to True. Write:
<property name="votableRespectsOutputTable">True</property>
in your service element to enable this behaviour.
_ADDITEM. This parameter (used by DaCHS’ web interface) lets users select columns not selected by the current settings or the service’s output table. _ADDITEM is ignored in VOTable unless in HTML mode (which is used in transferring web results via SAMP).
noxml. Columns can be furnished with a displayHint="noxml=true", and they will never be included in VOTable output; use this when you use complex formatters to produce HTML displays.
_SET. DaCHS supports “column sets”, for instance, to let users select certain kinds of coordinates. See apfs/res/apfs_new.rd for an example. Essentially, when defining an output table, each output field gets a sets attribute (default: no set; use ALL to have the column included in all outputs). Then, add a _SET service parameter (use values to declare the available sets). Note that the _SET parameter changes VOTable column selection to votableRespectsOutputTable mode as discussed above. Services that use column sets should therefore set the property manually for consistency whether or not clients actually pass _SET.
Sorry for this mess; all this had, and by and large still has, good reasons.
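If it helps, here is a rough model of just the verbosity rule in Python; this is an illustration, not DaCHS’ actual code:

def selectByVerbosity(columns, verbosity):
    # columns are assumed to carry a verbLevel attribute as in the RD
    return [col for col in columns if col.verbLevel<=verbosity]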
The Vanity Map¶
DaCHS’ URL scheme leads to somewhat clunky URLs that, in particular, reflect the file system underneath. While this doesn’t matter to the VO registry, it is possibly unwelcome when publishing URLs outside of the VO. To overcome it, you can define “vanity names”, single path elements that are mapped to paths.
These mappings are read from the file $GAVO_ROOT/etc/vanitynames.txt
.
The file contains lines of the format:
<target> <key> [<option>]
Target is a path that must not include nevowRoot and must not start with a slash (unless you’re going for very special effects).
Key normally is a single path element (i.e., a string without a slash). If this path element is found in the first segment, it is replaced with the segments in target.
<option>
can only be !redirect
or empty right now.
If it is !redirect
, <key>
may be a path fragment (as opposed to
a single path element); leading and trailing slashes are ignored. If
the entire query path matches this key, a redirect to this key is
generated. This is intended to let you shut down services and introduce
replacements. If the incoming URL contains a query, it will be appended
to the replacement URL. Thus, even stored queries or forms can
potentially work across such a redirect.
You can also (ab)use the redirect option to give vanity names, but since the target will show up in the browser address line, normal maps are highly preferred. The only time normal maps don’t work for this is when the resource directory is identical to the vanity name (you’ll get an endless loop then), so you should avoid that situation.
Empty lines and #-on-a-line-comments are allowed in the input.
As an example, here’s the vanity map that DaCHS had built in as of version 2.1:
__system__/products/p/get getproduct
__system__/products/p/dlasync datalinkuws
__system__/services/registry/pubreg.xml oai.xml
__system__/services/overview/external odoc
__system__/dc_tables/show/tablenote tablenote
__system__/dc_tables/show/tableinfo tableinfo
__system__/services/overview/admin seffe
__system__/services/overview/rdinfo browse
__system__/tap/run/tap tap
__system__/adql/query/form adql !redirect
__system__/run/genrd genrd
Note again that <key>
must be a single path element only.
Writing Custom Cores¶
While DaCHS provides cores for many common operations – in particular, database queries –, there are of course services that need to do things not covered by the shipped cores. A common case is wrapping external binaries.
Many such cases still follow the basic premise of services: GET or POST
parameters in, something table-like out. You should then use custom
cores, which then still let you use normal DaCHS renderers (in
particular form
and api
/sync
). When that doesn’t cut it,
you’ll need to use a custom renderer.
While a custom core is defined in a separate module – this also helps debugging since you can run it outside of DaCHS –, there’s also the python core that keeps the custom code inside of the RD. This is very similar; Python Cores instead of Custom Cores explains the differences.
The following exposition is derived from the times service in the GAVO data center, a service wrapping some FORTRAN code wrapping SOFA (yes, we’re aware that we could nowadays use SOFA directly through astropy; that’s not the point here). Check out the sources at http://svn.ari.uni-heidelberg.de/svn/gavo/hdinputs/apfs; the RD is times.rd.
Defining a Custom Core¶
In an RD, a custom core is very typically just written with a reference to a defining module:
<customCore module="res/timescore"/>
The path is relative to the resdir, and you don’t include the module’s extension (DaCHS uses normal python module resolution, except for temporarily extending the search path with the enclosing directory). You can, in principle, declare the core’s interface in that element, but that’s typically not a good idea (see below).
The above declaration means you will find the core itself in
res/timescore.py
.
Ideally, you’ll just use the DaCHS API in the core, since we try fairly hard to keep that api constant. The timescore doesn’t quite follow that rule because it wants to expand VizieR expressions, which normal services probably won’t do.
DaCHS expects the custom core under the name Core
. Thus, the
centerpiece of the module is:
from gavo import api
class Core(api.Core):
The core needs an InputTable and an OutputTable like all cores. You could define it in the resource descriptor like this:
<customCore id="createCore" module="bin/create">
<inputTable>
<inputKey .../>
</inputTable>
<outputTable>
<column name="itemsAdded" type="integer" tablehead="Items added"/>
</outputTable>
</customCore>
It’s preferable to define at least the input in the code, though, since
it’s more likely to be kept in sync with the code in that case.
Embedding the definitions is done using the class attribute
inputTableXML
:
class Core(core.Core):
inputTableXML = """
<inputTable>
<inputKey name="ut1" type="vexpr-date" multiplicity="single"
tablehead="UT1"
description="Date and time (UT1)" ucd="time.epoch;meta.main"/>
<inputKey name="interval" type="integer" multiplicity="single"
tablehead="Interval"
unit="s" ucd="time.interval"
description="Interval between two sets of computed values"
>3600</inputKey>
</inputTable>
"""
There is also outputTableXML
, which you should use if you were to
compute stuff in some lines of Python, since then the fields are
directly defined by the core itself.
However, the case of timescore is fairly typical: There is some,
essentially external, resource that produces something that needs to be
parsed. In that case, it’s a better idea to define the parsing logic in
a normal RD data
item. Its table then is the output table of the
core. In the times example, the output of timescompute
is described
by the build_result
data item in times.rd
:
<table id="times">
<column name="ut1" type="timestamp" tablehead="UT1"
ucd="time.epoch;meta.main" verbLevel="1"
description="Time and date (UT1)" displayHint="type=humanDate"/>
<column name="gmst" type="time" tablehead="GMST"
verbLevel="1" description="Greenwich mean sidereal time"
xtype="adql:TIMESTAMP" displayHint="type=humanTime,sf=4"/>
<column name="gast" type="time" tablehead="GAST"
verbLevel="1" description="Greenwich apparent sidereal time"
xtype="adql:TIMESTAMP" displayHint="type=humanTime,sf=4"/>
<column name="era" type="double precision" tablehead="ERA"
verbLevel="1" description="Earth rotation angle"
displayHint="type=dms,sf=3" unit="deg"/>
</table>
<data id="build_result" auto="False">
<reGrammar>
<names>ut1,gmst,gast,era</names>
</reGrammar>
<make table="times">
<rowmaker>
<map dest="gmst">parseWithNull(@gmst, parseTime, "None")</map>
...
</rowmaker>
</make>
</data>
So, the core needs to say “my output table has the structure of #times”.
As usual with DaCHS structures, you should not override the
constructor, as it is defined by a metaclass. Instead, Cores call,
immediately after the XML parse (technically, as the first thing of
their completeElement
method), a method called initialize
. This
is where you should set the output table. For the times core, this
looks like this:
def initialize(self):
self.outputTable = api.OutputTableDef.fromTableDef(
self.rd.getById("times"), None)
Of course, you are not limited to setting the output table there; as
initialize
is only called once while parsing, this is also a good
place to perform expensive, one-time operations like reading and parsing
larger external resources.
Giving the Core Functionality¶
To have the core do something, you have to override the run method, which has to have the following signature:
run(service, inputTable, queryMeta) -> stuff
The stuff returned will usually be a Table or Data instance (that need not match the outputTable definition – the latter is targeted at the registry and possibly applications like output field selection). The standard renderers also accept a pair of mime type and a string containing some data and will deliver this as-is. With custom renderers, you could return basically anything you want.
Services come up with some idea of the schema of the table they want to return and adapt tables coming out of the core to this. Sometimes, you want to suppress this behaviour, e.g., because the service’s ideas are off. In that case, set a noPostprocess attribute on the table to any value (the TAP core does this, for instance).
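As a minimal sketch (not taken from an actual service), a run method that bypasses tables entirely might read:

def run(self, service, inputTable, queryMeta):
    # a (mime type, payload) pair is delivered by the standard
    # renderers as-is; a Table/Data instance would instead be adapted
    # by the service unless it has a noPostprocess attribute.
    return "text/plain", "Hello from a custom core\n"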
In service
you get the service using the core; this may make a
difference since different services can use the same core and could
control details of its operations through properties, their output
table, or anything else.
The inputTable
argument is the CoreArgs
instance discussed in
Core Args. Essentially, you’ll usually use its args
attribute,
a dictionary mapping the keys defined by your input table to values or
lists of them.
The queryMeta
argument is discussed in Database Options.
In the times example, the parameter interpretation is done in an extra function (which helps testability when there’s a bit more complex things going on):
def computeDates(args):
"""yields datetimes at which to compute times from the ut1/interval
inputs in coreArgs args.
"""
interval = args["interval"] or 3600
if args["ut1"] is None:
yield datetime.datetime.utcnow()
return
try:
expr = vizierexprs.parseDateExpr(args["ut1"])
if expr.operator in set([',', '=']):
for c in expr.children:
yield c
elif expr.operator=='..':
for c in expandDates(expr.children[0],
expr.children[1], interval):
yield c
elif expr.operator=="+/-":
d0, wiggle = expr.children[0], datetime.timedelta(
expr.children[1])
for c in expandDates(d0-wiggle, d0+wiggle):
yield c
else:
raise api.ValidationError("This sort of date expression"
" does not make sense for this service", colName="ut1")
except base.ParseException as msg:
raise api.ValidationError(
"Invalid date expression (at %s)."%msg.loc,
colName="ut1")
While the details of the parameter parsing and expansion don’t really
matter, note how exceptions are mapped to a ValidationError and give
a colName
– this lets the form renderer display error messages next
to the inputs that caused the failure.
The next thing timescore does is build some input, which in this case is fairly trivial:
input = "\n".join(utils.formatISODT(date) for date in dates)+"\n"
If your input is more complex or you need input files or similar, you want to be a bit more careful. In particular, do not change the working directory (using the utils.sandbox context manager amounts to the same thing); this may confuse the server, and in particular it will break the first time two requests are served simultaneously: the core runs within the main process, and that can only have one current directory.
Instead, in such situations, make a temporary directory and manually place your inputs in there. The spacecore (http://svn.ari.uni-heidelberg.de/svn/gavo/hdinputs/sp_ace/res/spacecore.py) shows what this could look like, including tearing the stuff down safely when done (the runSpace function).
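A bare-bones version of that pattern, using only the standard library (the file names and the backend path are placeholders), might look like this:

import os, shutil, subprocess, tempfile

def runBackend(inputText):
    # use a private work directory instead of changing the cwd
    workDir = tempfile.mkdtemp()
    try:
        with open(os.path.join(workDir, "input.txt"), "w") as f:
            f.write(inputText)
        # run the external program in that directory via cwd=
        subprocess.check_call(["/path/to/backend"], cwd=workDir)
        with open(os.path.join(workDir, "output.txt")) as f:
            return f.read()
    finally:
        shutil.rmtree(workDir)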
For the timescore, that is not necessary; you just run the wrapped program using standard subprocess functionality:
computer = service.rd.getAbsPath("bin/timescompute")
pipe = subprocess.Popen([computer],
stdin=subprocess.PIPE, stdout=subprocess.PIPE, close_fds=True,
cwd=os.path.dirname(computer))
data, errmsg = pipe.communicate(input)
if pipe.returncode:
raise api.ValidationError("The backend computing program failed"
" (exit code %s). Messages may be available as"
" hints."%pipe.returncode,
"ut1",
hint=errmsg)
Note that with today’s computers, you shouldn’t need to worry about streaming input or output until they are in the dozens of megabytes (in which case you should probably think hard about a custom UWS and keep the files in the job’s working directories).
To turn the program’s output into a table, you use the data item defined in the RD:
return api.makeData(
self.rd.getById("build_result"),
forceSource=StringIO(data))
When the core defines the data itself, you would skip makeData
.
Just directly produce the rowdicts and make the output table directly
from the rows:
rows = [{"foo": 3*i, "bar": 8*i} for i in range(30)]
return api.TableForDef(self.outputTable, rows=rows)
Database Options¶
The standard DB cores receive a “table widget” on form generation,
including sort and limit options. To make the Form renderer output this
for your core as well, define a method wantsTableWidget()
and return
True
from it.
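In your core class, that can be as simple as:

def wantsTableWidget(self):
    # ask the form renderer to show the sort/limit ("table widget") controls
    return True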
The queryMeta
that your run
method receives has a dbLimit key.
It contains the user selection or, as a fallback, the global
db/defaultLimit value. These values are integers.
So, if you order a table widget, you should do something like:
cursor.execute("SELECT .... LIMIT %(queryLimit)s",
{"queryLimit": queryMeta["dbLimit"],...})
In general, you should warn people if the query limit was reached; a simple way to do that is:
if len(res)==queryLimit:
res.addMeta("_warning", "The query limit was reached. Increase it"
" to retrieve more matches. Note that unsorted truncated queries"
" are not reproducible (i.e., might return a different result set"
" at a later time).")
where res would be your result table. _warning metadata is displayed in both HTML and VOTable output, though of course VOTable tools will not usually display it.
Python Cores instead of Custom Cores¶
If you only have a couple of lines of python, you don’t have to have a
separate module. Instead, use a python core. In it, you essentially
have the run
method as discussed in Giving the Core Functionality
in a standard procApp
. The advantage is that interface and
implementation is nicely bundled together. The following example should
illustrate the use of such python cores; note that rsc
already is
in the procApp’s namespace:
<pythonCore>
<inputTable>
<inputKey name="opre" description="Operand, real part"
required="True"/>
<inputKey name="opim" description="Operand, imaginary part"
required="True"/>
<inputKey name="powers" description="Powers to compute"
type="integer" multiplicity="multiple"/>
</inputTable>
<outputTable>
<outputField name="re" description="Result, real part"/>
<outputField name="im" description="Result, imaginary part"/>
<outputField name="log"
description="real part of logarithm of result"/>
</outputTable>
<coreProc>
<setup imports="cmath"/>
<code>
powers = inputTable.args["powers"]
if not powers:
powers = [1,2]
op = complex(inputTable.args["opre"],
inputTable.args["opim"])
rows = []
for p in powers:
val = op**p
rows.append({
"re": val.real,
"im": val.imag,
"log": cmath.log(val).real})
return api.TableForDef(self.outputTable, rows=rows)
</code>
</coreProc>
</pythonCore>
Regression Testing¶
Introduction¶
Things break – perhaps because someone foolishly dropped a database table, because something happened in your upstream, because you changed something or even because we changed the API (if that’s not mentioned in Changes, we owe you a beverage of your choice). Given that, having regression tests that you can easily run will really help your peace of mind.
Therefore, DaCHS contains a framework for embedding regression tests in resource descriptors. Before we tell you how these work, some words of advice, as writing useful regression tests is an art as much as engineering.
Don’t overdo it. There’s little point in checking all kinds of functionality that only uses DaCHS code – we’re running our tests before committing into the repository, and of course before making a release. If the services just use condDescs with buildFrom and one of the standard renderers, there’s little point in testing beyond a request that tells you the database table is still there and contains something resembling the data that should be there.
Don’t be over-confident. Just because it seems trivial doesn’t mean it cannot fail. Whatever code there is in the service processing of your RD, be it phrase makers, output field formatters, custom render or data functions, not to mention custom renderers and cores, deserves regression testing.
Be specific. In choosing the queries you test against, try to find something that won’t change when data is added to your service, when you add input keys, or when doing similar maintenance-like things. Change will happen, and it’s annoying to have to fix the regression test every time the output might legitimately change. This helps with the next point.
Be pedantic. Do not accept failing regression tests, even if you think you know why they’re failing. The real trick with useful testing is to keep “normal” output minimal. If you have to “manually” ignore diagnostics, you’re doing it wrong. Also, sometimes tests may fail “just once”. That’s usually a sign of a race condition, and you should really try to figure out what’s going on.
Make it fail first. It’s surprisingly easy to write no-op tests that run but won’t fail when the assertion you think you’re making is no longer true. So, when developing a test, assert something wrong first, make sure there’s some diagnostics, and only then assert what you really expect.
Be terse. While in unit tests it’s good to test for maximally specific properties so failing unit tests lead you on the right track as fast as possible, in regression tests there’s nothing wrong with plastering a number of assertions into one test. Regression tests actually make requests to a web server, and these are comparatively expensive. The important thing here is that regression testing is fast enough to let you run them every time you make a change.
Writing Regression Tests¶
DaCHS’ regression testing framework is organized a bit along the lines of python’s unittest and its predecessors, with some differences due to the different scope.
So, tests are grouped into suites, where each suite is contained in a
regSuite element. These have a (currently unused) title and a boolean
attribute sequential
intended for when the tests contained must be
executed in the sequence specified and not in parallel. It defaults to
false, which means the requests are made in random order and in
parallel, which speeds up the test runs and, in particular, will help
uncover race conditions.
On the other hand, if you’re testing some sort of interaction across
requests (e.g., make an upload, see if it’s there, remove it again),
this wouldn’t work, and you must set sequential=”True”. Keep these
sequential suites as short as possible. In tests within such suites
(and only there), you can pass information from one test to the
following one by adding attributes to self.followUp
(which are
available as attributes of self in the next test). If you need to
manipulate the next URL, it’s at self.followUp.url.content_
. For the
common case of a redirect to the url in the location header (or a child
thereof), there’s the pointNextToLocation(child="")
method of
regression tests. In the tests that are manipulated like this, the URL
given in the RD should conventionally be overridden in the previous
test
. Of course, additional parameters, httpMethods, etc, are still
applied in the manipulated url element.
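As a sketch, the code children of two consecutive tests in such a suite might look like this (the status code and the string asserted are, of course, specific to your service):

# code of the first test; its url posted something and got a redirect
self.assertHTTPStatus(303)
# make the next test request whatever the Location header points at
self.pointNextToLocation()

# code of the following test; its url in the RD is only a placeholder
# that the previous test overrides
self.assertHasStrings("PENDING")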
Regression suites contain tests, represented in regTest elements.
These are procDefs (just like, e.g., a rowmaker’s apply
), so you can
have setup code, and you could have a library of parametrizable regTests
procDefs that you’d then turn into regTests by setting their parameters.
We’ve not found that terribly useful so far, though.
You must give them a title
, which is used when reporting problems
with them. Otherwise, the crucial children of these are url
and, as
always with procDefs, code
.
Here are some hints on development:
- Give the test you’re just developing an id; at the GAVO DC, we’re
usually using cur; that way, we run variations of
dachs test rdId#cur
, and only the test in question is run. - After defining the url, just put an
assert False
into the test code. Then rundachs test -Devidence.xml rdId#cur
or similar. Then investigateevidence.xml
(possibly after piping throughxmlstarlet fo
) for stable and strong indicators that things are working. - If you get a BadCode for a test you’re just writing, the message may
not always be terribly helpful. To see what’s actually bugging
python, run
dachs --debug test ...
and check dcInfos.
RegTest URLs¶
The url element encapsulates all aspects of building the request. In the simplest case, you just can have a simple URL, in which case it works as an attribute, like this:
<regTest title="example" url="svc/form">
...
URLs without a scheme and a leading slash are interpreted relative to the RD’s root URL, so you’d usually just give the service id and the renderer to be applied. You can also specify root-relative and fully specified URLs as described in the documentation of the url element.
White space in URLs is removed, which lets you break long URLs as convenient.
You could have GET parameters in this URL, but that’s inconvenient due to both XML and HTTP escaping. So, if you want to pass parameters, just give them as attributes to the element:
<regTest title="example">
<url RA="10" DEC="-42.3" SR="1" parSet="form">svc/form</url>
The parSet=form
here sets up things such that processing for the
form renderer is performed – our form library nevow formal has some
hidden parameters that you don’t want to repeat in every URL.
To easily translate URLs taken from a browser’s address bar or the form
renderer’s result link, you can run dachs totesturl
and paste the
URLs there. Note that totesturl fails for values with embedded quotes,
takes only the first value of repeated parameters and is an over-quick
hack all around. Patches are gratefully accepted.
The url
element hence accepts arbitrary attributes, which can be a
trap if you think you’ve given values to url’s private attributes and
mistyped their names. If uploads or authentication don’t seem to
happen, check if your attribute ended up in the URL (which is
displayed with the failure message) and fix the attribute name;
most private url attributes start with http
. If you really need
to pass a parameter named like one of url’s private attributes, pass it
in the URL if you can. If you can’t because you’re posting, spank us.
After that, we’ll work out something not too abominable.
If you have services requiring authentication, use url’s httpAuthKey
attribute. We’ve introduced this to avoid having credentials in the RD,
which, after all, should reside in a version control system which may
be (and in the case of GAVO’s data center is) public. The attribute’s
value is a key into the file ~/.gavo/test.creds
, which contains, line
by line, this key, a username and a password, e.g.:
svc1 testuser notASecret
svc2 regtest NotASecretEither
A test using this would look like this:
<regTest title="Authenticated user can see the light">
<url httpAuthKey="svc1">svc1/qp/light.txt</url>
<code>
self.assertHTTPStatus(200)
</code>
</regTest>
By default, a test will perform a GET request. To change this, set
the httpMethod
attribute. That’s particularly important with
uploads (which must be POSTed).
For uploads, the url element offers two facilities. You can set a
request payload from a file using the postPayload
attribute (the
path is interpreted relative to the resource directory), but it’s much
more common to do a file upload like browsers do them. Use the
httpUpload
element for this, as in:
<url> <httpUpload name="UPLOAD"
fileName="remote.txt">a,b,c</httpUpload> svc1/async </url>
(which will work as if the user had selected a file remote.txt containing “a,b,c” in a browser with a file element named UPLOAD), or as in:
<url>
<httpUpload name="UPLOAD" fileName="remote.vot"
source="res/sample.regtest"/>
svc1/async
</url>
(which will upload the file referenced in source
, giving the remote
server the filename remote.vot
). The fileName
attribute is
optional.
Finally, you can pass arbitrary HTTP headers using the httpHeader
element. This has an attribute key
; the header’s value is taken
from the element content, like this:
<url postPayload="res/testData.regtest" httpMethod="POST">
<httpHeader key="content-type">image/jpeg</httpHeader>
>upload/custom</url>
RegTest Tests¶
Since regression tests are just procDefs, the actual assertions are
contained in the code
child of the regTest
. The code in there
sees the test itself in self, and it can access
self.data
(the response content as a byte string),self.headers
(a sequence of header name, value pairs; note that you should match the names case-insensitively here),self.status
(the HTTP response code),self.requestTime
(the time the URI request actually took; this will be None for failed requests), and- the URL actually retrieved in
self.url.httpURL
.
Incidentally, that last name is right; the regression framework only supports http, and it’s not terribly likely that we’ll change that.
You should probably only access those attributes in a pinch and instead use the pre-defined assertions, which are methods on the test objects as in pyunit – conventional assertions are clearer to read and less likely to break if fixes to the regression test API become necessary. If you still want to have custom tests, raise AssertionErrors to indicate a failure.
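If you do need a hand-rolled check, it would use those attributes directly, for instance (the strings tested here are placeholders):

# prefer the built-in assertions; this just shows the raw attributes
if self.status!=200:
    raise AssertionError("Unexpected HTTP status %s"%self.status)
if b"TABLEDATA" not in self.data:
    raise AssertionError("Response does not look like a VOTable")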
Here’s a list of assertion methods defined right now:
All of these are methods, so you would actually write
self.assertHasStrings('a', 'b', 'c')
in your test code (rather than
pass self explicitly).
When writing tests, you can, in addition, use assertions from python’s
unittest TestCases (e.g., assertEqual and friends). This is provided in
particular for use to check values in VOTables coming back from services
together with the getFirstVOTableRow
method.
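For instance (the column name and value are placeholders for whatever your service returns):

row = self.getFirstVOTableRow()
self.assertEqual(row["objectName"], "M 31")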
Also please note that, like all procDef’s bodies, the test code is macro-expanded by DaCHS. This means that every backslash that should be seen by python needs to be escaped itself (i.e., doubled). An escaped backslash in python thus is four backslashes in the RD.
Finally, here’s a piece of .vimrc
that inserts a regTest
skeleton if you type ge in command mode (preferably at the start of
a line; you may need to fix the indentation if you’re not indenting with
tabs). We’ve thrown in a column skeleton on gn as well:
augroup rd
au!
autocmd BufRead,BufNewFile *.rd set ts=2 tw=79
au BufNewFile,BufRead *.rd map gn i<tab><tab><lt>column name="" type=""<CR><tab>unit="" ucd=""<CR>tablehead=""<CR>description=""<CR>verbLevel=""/><CR><ESC>5kf"a
au BufNewFile,BufRead *.rd map ge i<tab><tab><lt>regTest title=""><CR><tab><lt>url><lt>/url><CR><lt>code><CR><lt>/code><CR><BS><lt>/regTest><ESC>4k
augroup END
Running Tests¶
The first mode to run the regression tests is through dachs val
. If
you give it a -t
flag, it will collect regression tests from all the
RDs it touches and run them. It will then output a brief report listing
the RDs that had failed tests for closer inspection.
It is recommended to run something like:
dachs val -tv ALL
before committing changes into your inputs repository. That way, regressions should be caught.
The tests are run against the server described through the
[web]serverURL
config item. In the recommended setup, this would be
a server started on your own development machine, which then would
actually test the changes you made.
There is also a dedicated gavo sub-command test
for executing the
tests. This is what you should be using for developing tests or
investigating failures flagged with dachs val
. On its command line,
you can give one of an RD id, a cross-RD reference to a test suite,
or a cross-rd reference to an individual test. For example,
dachs test res1/q
dachs test res2/q#suite1
dachs test res2/q#test45
would run all the tests given in the RD res1/q
, the tests in
the regSuite with the id
suite1 in res2/q
, and a test with
id="test45
in res2/q
, respectively.
To traverse inputs and run tests from all RDs found there, as well as tests from the built-in RDs, run:
dachs test ALL
dachs test
by default has a very terse output. To see which tests
are failing and what they gave as reasons, run it with the ‘-v’ option.
To debug failing regression tests (or maybe to come up with good things to test for), use ‘-d’, which dumps the server response of failing tests to stdout.
In the recommended setup with a production server and a development
machine sharing a checkout of the same inputs, you can exercise
the production server from the development machine by giving the -u
option with what your production server has in its [web]serverURL
configuration item. So,
dachs test -u http://production.example.com ALL
is what might help your night’s sleep.
Examples¶
Here are some examples how these constructs can be used. First, a simple test for string presence (which is often preferred even when checking XML, as it’s less likely to break on schema changes; these usually count as noise in regression testing). Also note how we have escaped embedded XML fragments; an alternative to this shown below is making the code a CDATA section:
<regTest title="Info page looks ok"
url="siap/info">
<code>
self.assertHasStrings("SIAP Query", "siap.xml", "form",
"Other services", "SIZE</td>", "Verb. Level")
</code>
</regTest>
The next is a test with a “rooted” URL that’s spanning lines, has embedded parameters (not recommended), plus an assertion on binary data:
<regTest title="NV Maidanak product delivery"
url="/getproduct/maidanak/data/Q2237p0305/Johnson_R/
red_kk050001.fits.gz?siap=true">
<code>
self.assertHasStrings('\\x1f\\x8b\\x08\\x08')
</code>
</regTest>
This is how parameters should be passed into the request:
<regTest title="NV Maidanak SIAP returns accref.">
<url POS="340.12,3.3586" SIZE="0.1" INTERSECT="OVERLAPS"
_TDENC="True" _DBOPTIONS_LIMIT="10">siap/siap.xml</url>
<code>
self.assertHasStrings('<TD>AZT 22')
</code>
</regTest>
Here’s an example for a test with URL parameters and xpath assertions:
<regTest title="NV Maidanak SIAP metadata query"
url="siap/siap.xml?FORMAT=METADATA">
<code>
self.assertXpath("//v1:FIELD[@name='wcs_cdmatrix']", {
"datatype": "double",
"ucd": "VOX:WCS_CDMatrix",
"arraysize": "*",
"unit": "deg/pix"})
self.assertXpath("//v1:INFO[@name='QUERY_STATUS']", {
"value": "OK",
None: "OK",})
self.assertXpath("//v1:PARAM[@name='INPUT:POS']", {
"datatype": "char",
"ucd": "pos.eq",
"unit": "deg"})
</code>
</regTest>
The following is a fairly complex example for a stateful suite doing inline uploads (and simple tests):
<regSuite title="GAVO roster publication cycle" sequential="True">
<regTest title="Complete record yields some credible output">
<url httpAuthKey="gvo" parSet="form" httpMethod="POST">
<httpUpload name="inFile" fileName="testing_ignore.rd"
><![CDATA[
<resource schema="gvo">
<meta name="description">x</meta>
<meta name="title">A test service</meta>
<meta name="creationDate">2010-04-26T11:45:00</meta>
<meta name="subject">Testing</meta>
<meta name="referenceURL">http://foo.bar</meta>
<nullCore id="null"/>
<service id="run" core="null" allowed="external">
<meta name="shortName">u</meta>
<publish render="external" sets="gavo">
<meta name="accessURL">http://foo/bar</meta>
</publish></service></resource>
]]></httpUpload>upload/form</url>
<code><![CDATA[
self.assertHasStrings("#Published</th><td>1</td>")
]]></code>
</regTest>
<regTest title="Publication leaves traces on GAVO list" url="list/custom">
<code>
self.assertHasStrings(
'"/gvo/data/testing_ignore/run/external">A test service')
</code>
</regTest>
<regTest title="Unpublication yields some credible output">
<url httpAuthKey="gvo" parSet="form" httpMethod="POST">
<httpUpload name="inFile" fileName="testing_ignore.rd"
><![CDATA[
<resource schema="gvo">
<meta name="description">x</meta>
<meta name="title">A test service</meta>
<meta name="creationDate">2010-04-26T11:45:00</meta>
<meta name="subject">Testing</meta>
<meta name="referenceURL">http://foo.bar</meta>
<service id="run" allowed="external">
<nullCore/>
<meta name="shortName">u</meta></service></resource>
]]></httpUpload>upload/form</url>
<code><![CDATA[
self.assertHasStrings("#Published</th><td>0</td>")
]]></code>
</regTest>
<regTest title="Unpublication leaves traces on GAVO list"
url="list/custom">
<code>
self.assertLacksStrings(
'"/gvo/data/testing_ignore/run/external">A test service')
</code>
</regTest>
</regSuite>
If you still run SOAP services, here’s one way to test them:
<regTest id="soaptest" title="APFS SOAP returns something reasonable">
<url postPayload="res/soapRequest.regtest" httpMethod="POST">
<httpHeader key="SOAPAction">'"useService"'</httpHeader>
<httpHeader key="content-type">text/xml</httpHeader
>qall/soap/go</url>
<code>
self.assertHasStrings(
'="xsd:date">2008-02-03Z</tns:isodate>',
'<tns:raCio xsi:type="xsd:double">25.35')
</code>
</regTest>
– here, res/soapRequest.regtest
would contain the request body that
you could, for example, extract from a tcpdump log.
Datalink and SODA¶
[Datalink] is an IVOA protocol that allows associating various products and artifacts with a data set id. Think the association of error or mask maps, progenitor datasets, or processed data products, with a data set.
It also lets you associate data processing services with datasets, which allows on-the-fly generation of cutouts, format conversions or recalibrations; a particular set of parameters for working with certain kinds of cubes is described in a standard called [SODA] (Server-side Operations for Data Access). Hence, we sometimes call the processing part of datalink SODA.
In DaCHS, Datalink is implemented by the dlmeta
renderer, SODA by
the dlget
renderer. In all but fairly exotic cases, both renderers
are used on the same service. While in DaCHS, you cannot use SODA
without Datalink, there are perfectly sensible datalink services without
SODA. In the following, we first treat the generation of “normal”
datalinks and discuss processing services later.
A central term for datalink is the pubDID, or publisher DID. This is an identifier assigned (essentially) by you that points to a concrete dataset. In DaCHS, datalink services always use pubDIDs as the values of the datalink ID parameter.
Unless you arrange things differently (for which you should have good reasons), the pubDIDs used by DaCHS are formed as:
<authority>/~?<accref>
where the accref usually is the inputsDir-relative path to the file. If
you use datalinks of that form, you should at some point run dachs pub
//products
; this will register the products deliverer as
<authority>/~
, which means that pubDIDs of this form are compliant
with [IVOA Identifiers]_.
When developing datalink services, it sometimes is useful to access
datalink services directly, in particular because they don’t usually
have a useful web interface. Armed with the knowledge about the
structure of DaCHS standard PubDIDs, you can easily build the URLs and
parameters. For instance, to retrieve the datalink document for
mlqso/data/FBQ0951_data.fits
on the server dc.g-vo.org using the
datalink renderer on the mlqso/q/d
service, you’d write:
curl -FID=ivo://org.gavo.dc/~?mlqso/data/slits/FBQ0951_data.fits \
http://dc.g-vo.org/mlqso/q/d/dlmeta | xmlstarlet fo
(of course, xmlstarlet
isn’t actually necessary, and you can use
wget
if you want, but you get the idea). Going on, you could pull
out what parameters are mentioned somewhat like this:
curl -s -FID=ivo://org.gavo.dc/~?mlqso/data/slits/FBQ0951_data.fits \
http://dc.g-vo.org/mlqso/q/d/dlmeta | \
xmlstarlet sel -N v=http://www.ivoa.net/xml/VOTable/v1.3 -T \
-t -m "//v:PARAM" -v "@name" -nl
In the remainder of this section, we first discuss the generation of datalinks and processing services “by example”, which should do for a basic use of the facilities. We continue with a somewhat more in-depth look at the processing of a SODA request, after which we look more closely at the various elements that make up Datalink/SODA services.
Integrating Datalink Services¶
You generally declare datalink services on the table(s) that contain the
identifiers the datalink service accepts. For that, you include
two pieces of metadata: The identifier of the datalink service (which
can be a cross-RD id with a hash; use the
_associatedDatalinkService.serviceId
meta key) and the column name
within the table (use the _associatedDatalinkService.idColumn
meta key).
Both items will only be checked at run time, and broken links will be
reported as warnings. If the following doesn’t give you the datalink
resources in results involving the tables, be sure to check the
dcInfos
log file.
The following example is a table that contains two sorts of identifiers
that are understood by two different datalink services; one, dlsvc
within the same RD, works on values in the accref
column, the other,
taken from a (hypothetical) doires/q
RD, would work on the doi
column:
<table id="datasets" onDisk="True">
<meta name="_associatedDatalinkService">
<meta name="serviceId">dlsvc</meta>
<meta name="idColumn">accref</meta>
</meta>
<meta name="_associatedDatalinkService">
<meta name="serviceId">doires/q#doidl</meta>
<meta name="idColumn">doi</meta>
</meta>
<column name="accref" type="text".../>
<column name="doi" type="text".../>
</table>
<service id="dlsvc" allowed="dlmeta,dlget">
<meta name="dlget.description">A service for
slicing and dicing.</meta>
...
</service>
Note that forward references, which are generally not allowed in DaCHS,
are possible in serviceId
and idColumn
.
An older way to associate datalink services with tables is to give
certain services (most notably, SSA ones) a datalink
property. This
is deprecated now. If you see it in examples, please tell us so we can
fix it.
Sometimes it makes sense to directly have datalinks in table columns. To help clients notice that that is what they are, declare a target type by adding:
<property name="targetType"
>application/x-votable+xml;content=datalink</property>
<property name="targetTitle">Datalink</property>
into the element (column
or outputField
) content. In VOTables,
this will be turned into LINK elements.
Making Datalinks¶
A dataset frequently has associated data, like error or weight maps, derived data, or pieces of provenance. Datalink lets you tie these together algorithmically, using a specialised core (see `element DatalinkCore`_) and `the dlmeta renderer`_.
To produce datalinks, the datalink core must be furnished with
- exactly one descriptor generator (you can let DaCHS fall back to a default),
- one or more meta makers, generating related links.
Here is an example, adapted from boydende/q:
<datalinkCore>
<descriptorGenerator procDef="//soda#fits_genDesc"/>
<metaMaker semantics="#isMetadataFor">
<code>
basename = descriptor.accref.split("/")[-1].split(".")[0]
envPath = "data/static/envelopes/{0}.jpg".format(basename)
yield descriptor.makeLinkFromFile(
envPath,
description="Scan of the plate envelope")
</code>
</metaMaker>
</datalinkCore>
A descriptor generator – in the example, one that has additional
functionality for FITS files, although the default
(`//soda#fromStandardPubDID`_) would work here, too – is passed
the pubDID and returns an instance of datalink.ProductDescriptor
(or
a derived class). If a descriptor generator returns
None, the datalink request will be rejected with a 404.
Whatever is returned by the descriptor generator is then available as
descriptor
to the remaining datalink procs (in this case, the meta
makers). The columns of the product table (see `dc.products`_)
are available as attributes of this object. In addition,
subclasses of datalink.ProductDescriptor
may add more attributes; the
fits_genDesc
used in the example, for instance, provides a hdr
attribute containing the primary header as given by pyfits.
The descriptor is then passed, in turn, to all meta makers given. These
must yield
LinkDef
instances that describe additional data
products; a single meta maker may yield zero or more of these. You
generally should not construct LinkDefs yourself, as there are
convenience methods doing that on the descriptor which prevent some
common errors.
These methods are descriptor.makeLink
(for when you have external
links) and descriptor.makeLinkFromFile
(for when you link to files
published through a static renderer by DaCHS itself). These take some
common arguments:
semantics
– you must give something here (this is a lie, but it’s a benevolent one). Take the value for that attribute from the controlled vocabulary at http://www.ivoa.net/rdf/datalink/core (and complain to the semantics working group if you can’t find an appropriate term in there). Keep a leading hash on these words for datalink. In a pinch, you can pass a semantics argument tomakeLink
, but it is much better if you don’t and instead define thesemantics
attribute on the metaMaker element (because then any error message that may result will have the right semantics).description
– human-readable information about what the link references. Feel free to be verbose here and consider that many consumers of your datalink document will not be familiar with your particular instrument or pipeline at all.contentType
– a media type. Make sure it is consistent with what the server actually returns.contentLength
– the (approximate) size of the resource at accessURL, in bytes (not formakeLinkFromFile
, which takes it from the file system)localSemantics
– some string that identifies the particular kind of link in some opaque and only locally meaningful way. This is intended for clients that always want to show the same sort of link when people browse multiple datasets from a single service. When there are multiple links for a single piece of semantics, localSemantics can be used to tell apart the various items. If you only have one link per semantics, or if all links for a certain piece of semantics are equivalent, do not touch this.contentQualifier
– this can be an identifier of what sort of data is at the other end of the link, usually used to distinguish between spectra, time series, images, etc. Like semantics, there is a default vocabulary for these, http://www.ivoa.net/rdf/product-type. To state that the link points to an image, you would thus put#image
into content qualifier. Clients are intended to use this information to route the links to whatever SAMP clients can be expected to handle them (version 2.8+).
Except for semantics
, all of these are optional and must be passed
in as keyword arguments.
makeLinkFromFile
additionally accepts a positional argument
containing a local file name; makeLink
instead has a URL in a
string.
When returning link definitions, the tricky part mostly is to come up
with the URLs. Use the makeAbsoluteURL
rowmaker function to make
them from relative URLs; the rest just depends on your URL scheme. An
example could look like this:
<metaMaker semantics="#error">
<code>
yield descriptor.makeLink(
makeAbsoluteURL("get/"+descriptor.accref[:-5]+".err.fits"),
contentType="image/fits",
description="Errors for this dataset")
</code>
</metaMaker>
<metaMaker semantics="#progenitor">
<code>
yield descriptor.makeLink(
"http://foo.bar/raw/"+descriptor.accref.split("/")[-1],
contentType="image/fits",
description="Un-flatfielded, uncalibrated source data")
</code>
</metaMaker>
makeLinkFromFile
will create NotFoundFault
error links if the
file does not exist, thus alerting the user (and possibly you) that an
expected file was not there. When missing files are expectable and
should not cause diagnostics, pass a suppressMissing=True
to
makeLinkFromFile
.
To make this work, DaCHS will have to know how the file can be accessed from the web to be able to produce the link. The recommended pattern is shown in the example: the datalink service itself is used to deliver the static, non-product files. This is effected by declaring the service embedding the core somewhat like this:
<service id="dl" allowed="dlget,dlmeta,static">
<property name="staticData">data/static</property>
<datalinkCore .../>
</service>
Note that, of course, exposing a directory via the static renderer like this bypasses any access restrictions (e.g., embargoes) on the respective data. So, do not do this with your primary data if you want to enforce access control. Also, there currently is no way to control the media types returned in this way except by editing the system mime.types information. Let us know if that is a problem for you.
A LinkDef
for the product itself (semantics #this
) and, if
defined in the product table, a preview (semantics #preview
) is
automatically added by DaCHS unless a suppressAutoLinks
attribute is
set on the descriptor (you can set that in a meta maker or the
descriptor generator).
Defining Processing Services¶
In DaCHS data processing services (“SODA services”) use the same datalink cores as the datalink services, and they share the same descriptor. A datalink core does data processing when used by `the dlget renderer`_.
To enable data processing, datalink cores additionally need data
functions (see `element dataFunction`_) and up to one data formatter
(see `element dataFormatter`_). The first data function must add a
data
attribute to the descriptor and thus plays a somewhat special
role.
Processing services also use meta makers, but instead of links, these yield parameter definitions in the form of InputKeys (they are used by the datalink services, too, because the datalink documents contain the metadata of the processing services). So, typically, a given piece of SODA functionality comes as a pair of a meta maker and a data function, which then normally are combined in a STREAM (cf. Datalink-related Streams).
Processing services usually are a good deal more stereotypical than
metadata generation; it is actually beneficial if different services
have identical behaviour to facilitate the creation of interoperable
clients. SODA itself essentially enumerates what in DaCHS are
pre-defined meta makers and data functions. So, most of the time data
processing will just re-use STREAMs and procDefs from the //soda
RD.
The two most common cases are cutouts over FITS cubes and over spectra.
Processing services are referenced from the links table. In DaCHS, the description column in the links table is set from the service’s description meta. This falls back to the resource’s description meta, which is almost never what you want. So, make sure you include something reasonably concise into the service element like this:
<meta name="description">Slicing and dicing the images from the
wonderous survey</meta>
Datalink services identify themselves as supporting some standard. Whenever DaCHS sees a dlget, it will declare the service a SODA service; this is harmless as long as you don’t define SODA parameters that do something different from what SODA says they should do. Still, if you have to, you can override the standardID meta to declare support of a different standard, or write:
<meta name="standardID"/>
to entirely suppress the declaration of a standard identifier.
FITS/SODA processing¶
In the first case, the core would look like this piece extracted from
the dl
service in califa/q3:
<datalinkCore>
<descriptorGenerator procDef="//soda#fits_genDesc"
name="genFITSDesc">
<bind key="accrefPrefix">'califa/datadr3'</bind>
<bind key="descClass">DLFITSProductDescriptor</bind>
</descriptorGenerator>
<FEED source="//soda#fits_standardDLFuncs" spectralAxis="3"/>
</datalinkCore>
Here, we use the `//soda#fits_genDesc`_ descriptor generator with a
DLFITSProductDescriptor because CALIFA DR3 stores datalink URLs rather
than actual file paths in the product table. You would leave the
descClass
parameter out when your products are the FITS files
themselves.
Giving an accrefPrefix
to anything using the product table to get
accrefs (`//soda#fromStandardPubDID`_ is another example for these)
usually is a good idea. If you don’t give
it, users can apply the datalink service to any dataset you publish,
which might lead to information leaks and hard-to-understand error
messages on the user side. accrefPrefix
is simply a string that the
accref of the product being processed must match. Since in the usual
setup, the accref is the inputsDir-relative path of the file, you’re
usually fine if you just give the path to the directory containing the
products in question.
The `//soda#fits_standardDLFuncs`_ STREAM arranges for all general FITS processing functions to be pulled in; these encompass the SODA parameters where applicable (at the time of this writing, there is no support for TIME and POL yet, but if you have such data, we’ll be glad to add it), and some additional ones.
If you need extended functionality, it is a good idea to start from this
STREAM. Copy it from dachs adm dumpDF //soda
and hack from there.
SDM processing¶
The other very common sort of SODA-like processing is for spectra. A
sketch for these from the sdl
service in flashheros/q:
<datalinkCore>
<descriptorGenerator procDef="//soda#sdm_genDesc">
<bind key="ssaTD">"\rdId#data"</bind>
</descriptorGenerator>
<dataFunction procDef="//soda#sdm_genData">
<bind key="builder">"\rdId#build_sdm_data"</bind>
</dataFunction>
<FEED source="//soda#sdm_plainfluxcalib"/>
<FEED source="//soda#sdm_cutout"/>
<FEED source="//soda#sdm_format"/>
</datalinkCore>
Here, the descriptor generator will in general be `//soda#sdm_genDesc`_.
It builds a special descriptor that contains the full metadata from an
associated SSA row, which is why you need to give the id of the SSA
table in the ssaTD
parameter. Since pubDIDs will only be resolved
within this table, no accrefPrefix
is necessary or supported.
The first data function for spectra usually will be
`//soda#sdm_genData`_. This will read the entire spectrum into memory
using a data item, the id of which is given in the builder
parameter. This has to build an SDM-compliant spectrum. Some examples
of how to do this can be found in cdfspect/q.rd (reading
from half-broken FITS files), c8spect/q.rd (which shows how
to create spectra that don’t exist on disk as files),
pcslg/q.rd (which nicely uses WCSAxis for parsing spectra
that come as 1D-array, “IRAF-style”), or theossa/q.rd (which
pulls the source files from a remote server and caches them). For more on
generating SDM-compliant spectra, see SDM compliant tables.
For large spectra, reading the spectrum in its entirety may incur a significant CPU cost. When that becomes a problem for you, you’ll need to write different data functions, perhaps only parsing a header, and implement, e.g., cutouts directly in a subsequent data function.
The two next STREAMs pulled in are just combinations of data functions and meta makers, one for optionally re-calibrating the spectrum (right now, only maximum normalisation is supported), the other for providing a SODA-like cutout.
Finally, `//soda#sdm_format`_ pulls in a meta maker defining a FORMAT parameter (letting people order several formats including VOTable, FITS binary table, and CSV) and a formatter that interprets it.
Multiple processing services¶
If you yield InputKeys from meta makers, all of them will end up in a single processing service, and all data functions will contribute to that same processing service.
Sometimes, however, you want to have two different processing services in a single datalink document. In that case, define a second datalink service in DaCHS, usually with only a dlget renderer; that way, it will always be clear where the actual datalink information for data sets belonging to some collection can be obtained.
You can then yield a ProcLinkDef object (constructed with the current pubDID and the second datalink service) from a meta maker in the main datalink service. Make sure that there is a description meta in the dlget-only datalink service so users have a chance to figure out why it is there.
Since this is slightly subtle, here is a sketch of how this works:
<service id="dl" allowed="dlmeta,dlget">
<meta name="description">Data Collection's datalink service</meta>
<datalinkCore>
<descriptorGenerator id="gen" procDef="//datalink#fromtable">
<bind key="tableName">"my.table"</bind>
<bind key="idColumn">"prikey"</bind>
</descriptorGenerator>
<metaMaker>
<code>
# that's an argument for the built-in dlget service
yield MS(InputKey, name="arg1", type="text",
description="First service's argument 1")
</code>
</metaMaker>
<metaMaker>
<code>
# This links the second datalink service
yield ProcLinkDef(descriptor.pubDID, rd.getById("dl2"))
</code>
</metaMaker>
<!-- now give the main service its functionality -->
<dataFunction .../>
</datalinkCore>
</service>
<service id="dl2" allowed="dlget">
<meta name="description">Mogrification of the shlabudl</meta>
<meta name="standardID"/> <!-- stresses that this is higher magic -->
<datalinkCore>
<descriptorGenerator original="gen"/>
<metaMaker>
<code>
yield MS(InputKey, name="mogrification_level", type="real",
description="Second service's argument")
</code>
</metaMaker>
<!-- and again the meat of the service: -->
<dataFunction .../>
</datalinkCore>
</service>
General Notes on Processing Services¶
This section gives an overview of how data processing services are built and executed. You should read it if you want to write data processing functions; for just using them, don’t bother.
When a request for processed data comes in, the descriptor generator is used to make a product descriptor, and the input keys are adapted to the concrete dataset. This means that, contrary to normal DaCHS services, services with a Datalink core have a variable interface; in particular, the interface on the dlmeta renderer (essentially, just ID) is very different from the one on the dlget renderer (ID plus whatever the meta makers produce).
The input keys so produced are used to build a context grammar that
parses the request. If this succeeds, the data descriptor is passed to the
initial data function together with the arguments parsed. This must set
the data
attribute of the descriptor or raise a ValidationError
on the ID parameter; leaving data
as None
results in a 500 server
error. Descriptor.data could be an rsc.InMemoryTable
(e.g., in SDM
processing)
or a products.Products
instance, but as long as the other data
functions and the formatter agree on what it is, anything goes.
The remaining data functions can change the data in place or potentially
replace descriptor.data
. When writing code, be aware, though, that
a data function should only do something when the corresponding
parameter has actually been used. When you change descriptor.data
fundamentally, you’ll probably make the lives of further data functions
and the formatter a good deal harder.
Finally, the data enters the formatter, which actually generates the output, usually returning a pair of mime type and string to be delivered.
It is a design decision of the service creator which manipulations are done in the initial data function, which are in later filters, and which perhaps only in the formatter. The advantage of filters is that they are more flexible and can more easily be reused, while doing things in the data generator itself will usually be more efficient, sometimes much so (e.g., sums computed within the database rather than in a filter after all the data has had to pass through the database interface).
Descriptor Generators¶
Descriptor generators (see `element descriptorGenerator`_) are procedure
applications that, roughly, see a pubDID value and are expected to return a
datalink.ProductDescriptor
instance, or something derived from it.
Simple Product Descriptor Generators¶
In the end, this usually boils down to figuring out the value of accref
in the product table and using what’s there to construct the
descriptor. In the simplest case, the pubDID will be in DaCHS’
“standard” format (see the getStandardPubDID
rowmaker function or
the `macro standardPubDID`_), in
which case the default descriptor generator works and you don’t have to
specify anything. You could manually insert that default by saying:
<descriptorGenerator procDef="//soda#fromStandardPubDID"/>
This happens to be DaCHS’ default if no descriptor generator is given, but as said above that is suboptimal as no accrefPrefix constrains what the service will run on.
The easiest way to furnish your descriptors with additional information
is to grab that code (use dachs adm dumpDF //soda
) and just add
attributes to the ProductDescriptor
generated in this way.
The default ProductDescriptor
class exposes as attributes
all the columns from the products table. See `dc.products`_ for their
names and descriptions.
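For illustration, a minimal sketch of such an extended descriptor generator is given below. It is modelled on the code of `//soda#fromStandardPubDID`_ (check the output of dachs adm dumpDF //soda for the authoritative version); the names ProductDescriptor, getAccrefFromStandardPubDID, and the extra calibrationStatus attribute are assumptions for the purpose of this sketch:

<descriptorGenerator>
  <code>
    # sketch modelled on //soda#fromStandardPubDID; ProductDescriptor
    # and getAccrefFromStandardPubDID are assumed to be visible in the
    # procedure's namespace as in that predefined generator
    desc = ProductDescriptor.fromAccref(
      pubDID, getAccrefFromStandardPubDID(pubDID))
    # purely illustrative extra attribute for later procs to pick up
    desc.calibrationStatus = "raw"
    return desc
  </code>
</descriptorGenerator>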
Spectrum Product Descriptor Generators¶
A slightly more interesting example is provided by datalink for SSA,
where cutouts and the like are generated from spectra. The actual
definition is in //soda#sdm_genDesc
, but the gist of it is:
<procDef type="descriptorGenerator" id="sdm_genDesc">
<setup imports="gavo.api,gavo.protocols.ssap">
<par key="ssaTD" description="Full reference (like path/rdname#id)
to the SSA table the spectrum's PubDID can be found in."/>
<par key="descriptorClass" description="The SSA descriptor
class to use. You'll need to override this if the dc.products
path doesn't actually lead to the file (see
`custom generators <#custom-product-descriptor-generators>`_)."
late="True">ssap.SSADescriptor</par>
<code>
ssaTD = api.resolveCrossId(ssaTD, api.TableDef)
</code>
</setup>
<code>
with api.getTableConn() as conn:
ssaTable = api.TableForDef(ssaTD, connection=conn)
matchingRows = list(ssaTable.iterQuery(ssaTable.tableDef,
"ssa_pubdid=%(pubdid)s", {"pubdid": pubDID}))
if not matchingRows:
return DatalinkFault.NotFoundFault(pubDID,
"No spectrum with this pubDID known here")
# the relevant metadata for all rows with the same PubDID should
# be identical, and hence we can blindly take the first result.
return descriptorClass.fromSSARow(matchingRows[0],
ssaTable.getParamDict())
</code>
</procDef>
Here, we use ssa.SSADescriptor
, derived from ProductDescriptor
,
rather than monkeypatching the extra ssaRow
attribute the former
provides; being explicit here may help when debugging.
As usual, the descriptor generator encodes how to resolve a
pubDID to an accref, in this case using an SSA table.
If the product table just lists a datalink
URL, you will want to override the accessPath this comes up with.
See, for instance, pcslg/q for how to do this.
Incidentally, in this case you could stuff the entire code into the
main code
element, saving on the extra setup
element.
However, apart from a
minor speed benefit, keeping things like function or class definitions
in setup allows easier re-use of such definitions in procedure
applications and is therefore recommended.
FITS Product Descriptor Generators¶
For FITS files, you will usually just use `//soda#fits_genDesc`_,
defining the accrefPrefix
as discussed in FITS/SODA processing.
This will produce datalink.FITSProductDescriptor instances. As in the
SSA/SDM case, you may need different descriptor classes in special
situations. Since for large FITS files, just delivering datalink files
is a fairly compelling proposition, there is actually a predefined
descriptor class to use with datalink access paths,
DLFITSProductDescriptor; the dl
service in califa/q3
shows how to use it.
Non-Product Descriptor Generators¶
Sometimes, you want to produce datalinks for tables that do not manage products – most likely because all you have is URLs, but possibly also because there simply are no products in the first place. In that case, you probably want to use the `//datalink#fromtable`_ descriptor generator.
To use this, you have to pass the name of the table with the items you
want to link from (tableName
) and the column to match the identifier
against (idColumn
). The descriptor generator then does the database
query, makes sure exactly one row matched and, if so, puts the result
into the metadata
attribute of the descriptor.
A simple case is to associate some sort of preview with an EPN-TAP table row:
<service id="dl" allowed="dlmeta">
<meta name="title">SoHO EIT Synoptic maps datalink service</meta>
<datalinkCore>
<descriptorGenerator procDef="//datalink#fromtable">
<bind key="tableName">"\schema.epn_core"</bind>
<bind key="idColumn">"granule_uid"</bind>
</descriptorGenerator>
<metaMaker semantics="#preview">
<code>
yield descriptor.makeLink(
descriptor.metadata['thumbnail_url'],
description="Preview image",
contentType='image/jpeg')
</code>
</metaMaker>
</datalinkCore>
</service>
This service will accept any value from the granule_uid
column as
ID.
If the value in the ID column actually contains IVOA publisher DIDs,
you may want to also accept “relative” identifiers, for instance, just
dataset_15
instead of ivo://myauthority/~?/data/foo/dataset_15
.
In that case, bind the prefix to the descriptor generator’s
didPrefix
parameter, like this:
<descriptorGenerator procDef="//datalink#fromtable">
<bind key="tableName">"\schema.myssa"</bind>
<bind key="idColumn">"ssa_pubDID"</bind>
<bind key="didPrefix">"ivo://myauthority/~?/data/foo/"</bind>
</descriptorGenerator>
(but that’s really only convenience).
Since it cannot know about them, the fromtable
descriptor does not
automatically add the #this and #preview links (i.e., it sets
suppressAutoLinks
). I personally consider datalink documents
without #this as flakey, so if you can, add a #this link manually. In
the EPN-TAP case, an obvious choice would be:
<metaMaker semantics="#this">
<code>
yield descriptor.makeLink(
descriptor.metadata['access_url'],
description="The full dataset",
contentType="image/fits")
</code>
</metaMaker>
As a last point regarding this non-local use case, if you want to enable
the dlget
renderer here, there is DataFromURL
visible in
dataFunction
-s. In the simplest case, you can write something like
(continuing the EPN-TAP example):
<dataFunction>
<code>
descriptor.data = DataFromURL(
descriptor.metadata["access_url"])
</code>
</dataFunction>
When this gets rendered, the client will be redirected to whatever
access_url
points to.
Meta Makers¶
Link Definitions¶
The use of meta makers to produce link rows was already discussed in Making Datalinks.
Parameter Definitions¶
To define a datalink service’s processing capabilities, meta makers
yield input keys (InputKey
instances). The classes usually required
to build input keys (InputKey, Values, Option) are available to
the code as local names. As usual, DaCHS structs
should not be constructed directly but only using the MS
helper
(which is really an alias for base.makeStruct; it takes care that the
special postprocessing of DaCHS structures takes place).
You should make sure that the input keys have proper annotation as regards minima, maxima, or enumerated values; clients, in general, have no way to guess what is sensible here.
The limits can usually be obtained from the descriptor (which, again, is
available as descriptor
in the meta maker). For instance, the
FITS descriptor has a header
attribute describing the instance that
the core operates on, the SSA descriptor an attribute ssaRow
.
A meta maker that generates an extra cutout parameter for radio astronomers (note that this is of course a bad idea – unit adaptation should be done on the client side) could be:
<metaMaker>
<setup imports="gavo.utils.unitconv"/>
<code>
yield MS(InputKey, name="FREQ", unit="MHz", ucd="em.freq",
description="Spectral cutout interval",
type="double precision[2]" xtype="interval"
multiplicity="forced-single"
values=MS(Values,
min=1e-6*unitconv.LIGHT_C/(descriptor.ssaRow["ssa_specstart"],
max=1e-6*unitconv.LIGHT_C/descriptor.ssaRow["ssa_specend"]))
</code>
</metaMaker>
The SODA-compliant version of this is in the //soda#sdm_cutout
predefined stream.
The main point here is that you should follow section 4.3 of the [SODA]
spec, i.e., use interval-xtyped parameters. Also, unless you’re
actually prepared to handle multiply-specified parameter values, you
should use the forced-single
multiplicity, which makes DaCHS reject
requests that contain a parameter more than once.
An extra complication occurs when SODA descriptors are generated for DAL responses. Currently, this is only envisaged for SSA. There, the descriptor has an extra limits attribute that gives, for each eligible column, minimum and maximum values or a set of values for enumerated columns.
Similar (if possibly less useful) mechanisms are conceivable for, say,
partial obscore results or SIAv1. We suggest to keep the attribute name
of this sort of collective characterisation as limits
. DaCHS does
not implement anything of this kind right now, though.
Metadata Error Messages¶
Both descriptor generators and meta makers can return (or yield, in the case of meta makers) error messages instead of either a descriptor or a link definition. This allows more fine-tuned control over the messages generated than raising an exception.
Error messages are constructed using class functions of
DatalinkFault
, which is visible to both procedure types. The class
function names correspond to the message types defined in the datalink
spec and match the semantics given there:
- AuthenticationFault
- AuthorizationFault
- NotFoundFault
- UsageFault
- TransientFault
- FatalFault
- Fault
Thus, a descriptor generator could look like this:
<descriptorGenerator>
<code>
with base.getTableConn() as conn:
matchingRows = list(conn.queryToDicts(
"select physPath from schema.myTable where pub_did=%(pubDID)s",
locals()))
if not matchingRows:
return DatalinkFault.NotFoundFault(pubDID,
"No dataset with this pubDID known here")
return MyCustomDescriptor.fromFile(matchingRows[0]["physPath"])
</code>
</descriptorGenerator>
Where sensible, you should pass (as a keyword argument) semantics (as
for LinkDefs) to the DatalinkFault
’s constructor; this would
indicate what kind of link you wanted to create.
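As a hedged sketch of how this might look in a meta maker, continuing the EPN-TAP example from above (descriptor.metadata and the thumbnail_url column are carried over from there and remain assumptions):

<metaMaker semantics="#preview">
  <code>
    # yield a fault rather than a link when the row has no preview URL
    if not descriptor.metadata.get("thumbnail_url"):
      yield DatalinkFault.NotFoundFault(descriptor.pubDID,
        "No preview available for this dataset",
        semantics="#preview")
    else:
      yield descriptor.makeLink(
        descriptor.metadata["thumbnail_url"],
        description="Preview image",
        contentType="image/jpeg")
  </code>
</metaMaker>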
Data Functions¶
Data functions (see `element dataFunction`_) generate or manipulate
data. They see the descriptor and the arguments (as args
),
parsed according to
the input keys produced by the meta makers, where the descriptor’s
data
attribute is filled out by the first data function called (the
“initial data function”).
As described above, DaCHS does not enforce anything on the data
attribute other than that it’s not None after the first data function
has run. It is the RD author’s responsibility to make sure that all
data functions in a given datalink core agree on what data
is.
All code in a request for processed data is also passed the input parameters as processed by the context grammar. Hence, the code can rely on whatever contract is implicit in the context grammar, but not more. In particular, a datalink core has no way of knowing which data function expects which parameters. If no value for a parameter was provided on input, the corresponding value is None, but a data function using it is still called.
An example for a generating data function is `//soda#generateProduct`_, which may be convenient when the manipulations operate on plain local files; it basically looks like this:
<dataFunction>
<code>
descriptor.data = products.getProductForRAccref(descriptor.accref)
</code>
</dataFunction>
(the actual implementation lets you require certain mime types and is therefore a bit more complicated).
You could do whatever you want, however. The following would work perfectly if you make your data functions handle lists of dicts:
<dataFunction>
<setup imports="random"/>
<code>
descriptor.data = [{"pix": i, "val": random.random()}
for i in range(20000)]
</code>
</dataFunction>
It wouldn’t be hard to come up with a formatter that turns this into a nice VOTable.
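For instance, a formatter matching the list-of-dicts data above could simply serialise it by hand; this sketch produces CSV rather than a VOTable so it stays independent of DaCHS’ table-building machinery:

<dataFormatter>
  <code>
    # hand-rolled CSV serialisation of the {"pix":..., "val":...}
    # rows produced by the data function above
    import csv, io
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["pix", "val"])
    writer.writeheader()
    writer.writerows(descriptor.data)
    return "text/csv", out.getvalue().encode("utf-8")
  </code>
</dataFormatter>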
Filtering data functions should always come with a meta maker declaring their parameters. As an example, continuing the frequency cutout example above, consider this:
<dataFunction>
<setup imports="gavo.utils.unitconv"/>
<code>
if not args.get("FREQ"):
return
lam_min, lam_max = (unitconv.LIGHT_C/(args["FREQ"][1]*1e6),
unitconv.LIGHT_C/(args["FREQ"][0]*1e6))
from gavo.protocols import sdm
sdm.mangle_cutout(
descriptor.data.getPrimaryTable(),
lam_min, lam_max)
</code>
</dataFunction>
(Ignoring for the moment troubles with half-open intervals).
There are situations in which a data function must shortcut, mostly
because it is doing something other than just “pushing on”
descriptor.data. Examples include preview producers or a data function
that should produce the FITS header only.
For cases like this, data functions can
raise one of DeliverNow
(which means descriptor.data
must be
something servable, see Data Formatters and causes that to be
immediately served) or FormatNow
(which immediately goes to the data
formatter; this is less useful).
Here’s an example for DeliverNow
; a similar thing is contained in the
STREAM //soda#fits_genKindPar
:
<dataFunction>
<setup imports="gavo.utils.fitstools"/>
<code>
if args["KIND"]=="HEADER":
descriptor.data = ("application/fits-header",
fitstools.serializeHeader(descriptor.data[0].header))
raise DeliverNow()
</code>
</dataFunction>
When writing data functions, you should raise
soda.EmptyData()
when a cutout results in empty data (e.g.,
because the cutout limits are out of range). If you don’t, users of
your service might become angry with you when they have to click away
many empty windows (say).
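A minimal sketch of such a guard, assuming (as in the SDM examples above) that descriptor.data is an in-memory data item:

<dataFunction>
  <setup imports="gavo.protocols.soda"/>
  <code>
    # refuse to hand out an empty cutout result
    if not descriptor.data.getPrimaryTable().rows:
      raise soda.EmptyData()
  </code>
</dataFunction>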
For further examples of data functions, see the //soda
RD coming
with the distribution. If you write some, please consider whether they
might be interesting for other DaCHS users, too, and submit them for
inclusion into //soda.
Data Formatters¶
Data formatters (see `element dataFormatter`_) take a descriptor’s data
attribute and build something servable out of it. Datalink cores do
not absolutely need one; the default is to return descriptor.data
(the //soda#trivialFormatter
, which might be fine if that data is
servable itself).
What is servable? The easiest thing to come up with is a pair of
content type and data in byte strings; if descriptor.data
is a Table
or Data instance, the following could work:
<dataFormatter>
<code>
from gavo import formats
return "text/plain", formats.getAsText(descriptor.data)
</code>
</dataFormatter>
Another example is an excerpt from //soda#sdm_format
:
<dataFormatter>
<code>
from gavo.protocols import sdm
if len(descriptor.data.getPrimaryTable().rows)==0:
raise base.ValidationError("Spectrum is empty.", "(various)")
return sdm.formatSDMData(descriptor.data, args["FORMAT"])
</code>
</dataFormatter>
(this goes together with a metaMaker for an input key describing FORMAT).
An alternative is to return something that has a renderHTTP(ctx)
method that works in nevow. This is true for the Product instances that
//soda#generateProduct
generates, for example. You can also
write something yourself by inheriting from
protocols.products.ProductBase and overriding its iterData method.
If you don’t inherit from ProductBase, be aware that this renderHTTP runs in the main server loop. If it blocks, the server blocks, so make sure that this doesn’t happen. The conventional way would be to return, from the renderHTTP method, some twisted producer. Non-Product nevow resources will also not work with asynchronous datalink at this point.
Embedded Datalink Descriptors¶
For certain renderers (currently, only ssap.xml, but we might do it for
SIAP, too), DaCHS will add a direct SODA block if there’s an
_associatedDatalinkService
meta on the table it serves from and that
datalink service has a dlget capability. Here’s how the datalink
declarations could look in such a case:
<RESOURCE name="links" type="meta" utype="adhoc:service">
<DESCRIPTION>...</DESCRIPTION>
<GROUP name="inputParams">
<PARAM arraysize="*" datatype="char" name="ID" ref="ssa_pubDID"
ucd="meta.id;meta.main" value=""/>
</GROUP>
<PARAM arraysize="*" datatype="char" name="standardID"
value="ivo://ivoa.net/std/DataLink#links-1.0"/>
<PARAM arraysize="*" datatype="char" name="accessURL"
value="http://localhost:8080/gaia/q2/tsdl/dlmeta"/>
</RESOURCE>
<RESOURCE ID="proc_svc" name="proc_svc" type="meta" utype="adhoc:service">
<DESCRIPTION>...</DESCRIPTION>
<GROUP name="inputParams">
<PARAM arraysize="*" datatype="char" name="ID" ref="ssa_pubDID"
ucd="meta.id;meta.main" value="">
<DESCRIPTION>The publisher DID of the dataset of interest</DESCRIPTION>
</PARAM>
<PARAM arraysize="*" datatype="char" name="BANDPASS" value="">
<DESCRIPTION>Gaia bandpass to generate the time series
for.</DESCRIPTION>
<VALUES>
<OPTION name="G" value="G"/>
<OPTION name="BP" value="BP"/>
<OPTION name="RP" value="RP"/>
</VALUES>
</PARAM>
</GROUP>
<PARAM arraysize="*" datatype="char" name="accessURL"
ucd="meta.ref.url" value="http://localhost:8080/gaia/q2/tsdl/dlget"/>
<PARAM arraysize="*" datatype="char" name="standardID"
value="ivo://ivoa.net/std/SODA#sync-1.0"/>
</RESOURCE>
– the first block declares where to obtain full datalink documents by publisher DID from.
The second block lets clients take a shortcut and call a processing service directly, without first retrieving the datalink document; it is essentially an anonymised version of the processing declaration from the datalink block.
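For reference, this behaviour is triggered by the _associatedDatalinkService meta on the table the service queries; a typical declaration looks roughly like the following (the table id and column name are illustrative here):

<table id="data" onDisk="True">
  ...
  <meta name="_associatedDatalinkService">
    <meta name="serviceId">dl</meta>
    <meta name="idColumn">ssa_pubDID</meta>
  </meta>
</table>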
To generate these, DaCHS also calls the dlmeta procs, but with pubDID set to None. Whenever you need a concrete pubDID in a dlmeta proc used with SSA, you should therefore add something like:
if descriptor.pubDID is None:
return
Also note that in these cases, a special descriptor type is being used
rather than whatever you put into your descriptor generator, and hence
you can’t use any special attributes you defined there. On the other
hand, you’ll have a limits
attribute with a dictionary giving ranges
of values within the concrete (SSA) result. This should be used to
build Values
objects tailored to the specific result.
All this is admittedly painful; the shortcut SODA blocks that cause all that pain can probably count as a classic case of premature optimisation.
Registry Matters¶
You can publish the metadata generating endpoint on your service by
saying <publish render="dlmeta" sets="ivo_managed"/>
. However, that
is not recommended, as it clutters the registry with services that
are not really usable after discovery.
Datalink services will, however, appear as capabilities of services that publish tables that have associated datalink services.
While it might be a good idea to provide some _example
meta for all
datalink services, when you register them, you really should provide one in any
case so validators can pick up IDs and parameters to use when validating
your service. Here is an example, taken from califa/q3:
CALIFA cubes can be cut out along RA, DEC, and spectral axes.
CIRCLE and POLYGON cutouts yield bounding boxes. Also note that the
coverage of CALIFA cubes is hexagonal in space. This explains
the empty area when cutting out :genparam:`CIRCLE(225.5202 1.8486 0.001)`
:genparam:`BAND(366e-9 370e-9)` on
:dl-id:`ivo://org.gavo.dc/~?califa/datadr3/V1200/UGC9661.V1200.rscube.fits`.
Essentially, an identifier to use is given as the dl-id
interpreted
text role, whereas processing parameters are given as DALI genparams.
In DaCHS, they are written as the parameter name and its value in
parentheses.
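In the RD, such text goes into an _example meta on the datalink service, roughly like this (the title is illustrative):

<meta name="_example" title="Cutting out a CALIFA cube">
  CALIFA cubes can be cut out along RA, DEC, and spectral axes.
  [... reStructuredText as above ...]
</meta>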
Datalinks as Product URLs¶
In particular for larger datasets like cubes, it is rude to put the entire dataset into an obscore table. Although obscore gives expected download sizes, clients nevertheless do not usually expect to have to retrieve several gigabytes or even terabytes of data when dereferencing an obscore access URL.
While you could define additional datalink URLs and use these in Obscore – this is what lswscans/res/positions does, and there’s a piece of text on this in the tutorial –, you should in general use datalinks as product URLs throughout with datasets larger than a couple of Megabytes. c8spect/q shows how to do that with completely virtual data, califa/q3 and pcslg/q are examples for what to do with FITS cubes or spectra.
This way, of course, without a datalink-enabled client people might be locked out from the dataset entirely. On the other hand, DaCHS comes with a stylesheet that enables datalink operation from a common web browser, so that’s perhaps not too bad.
Aladin likes it when columns containing datalink URLs are marked up.
DaCHS has two properties that let you add that markup, targetType and
targetTitle. On a standalone datalink column that you just add to an
output table, this could look like this (the datalink service would have
an id of “dl” here; this also assumes you have a column named
pub_did
):
<outputField name="datalink" type="text" id="datalink_output"
ucd="meta.ref.url"
select="'\getConfig{web}{serverURL}/\rdId/dl/dlmeta?ID='
|| gavo_urlescape(pub_did)"
tablehead="DL"
description="URL of a datalink document for this dataset."
displayHint="type=url" verbLevel="1">
<property name="targetType"
>application/x-votable+xml;content=datalink</property>
<property name="targetTitle">Datalink</property>
</outputField>
When your product link is a datalink, you have to amend the accref
column in your main table. This stereotypically looks like this:
<column original="accref">
<property name="targetType"
>application/x-votable+xml;content=datalink</property>
<property name="targetTitle">Datalink</property>
</column>
To have datalinks rather than the plain dataset as what the accref points to, you need to change what DaCHS thinks of your dataset; this is what the `//products#define`_ rowfilter in your grammar is for:
<fitsProdGrammar qnd="True">
<rowfilter procDef="//products#define">
<bind key="path">\dlMetaURI{dl}</bind>
<bind key="mime">'application/x-votable+xml;content=datalink'</bind>
<bind key="fize">10000</bind>
[...]
</rowfilter>
[...]
</fitsProdGrammar>
This includes the estimate that the datalink document will have about 10k octets; in that region, there is no need to be precise. Note that the argument to the `macro dlMetaURI`_ is the id of the datalink service; DaCHS has no way to work that out by itself.
When you do this, you must use a datalink-aware descriptor generator in
SODA.
When you use the recommended setup, where the accref is the
inputsDir-relative path to the main file, and you’re dealing with FITS,
you can use the DLFITSProductDescriptor
class. Thus, the base
functionality of a FITS cutout service with datalink products would be:
<service id="dl" allowed="dlget,dlmeta">
<meta name="title">My Cutout Service</meta>
<datalinkCore>
<descriptorGenerator procDef="//soda#fits_genDesc"
name="genFITSDesc">
<bind key="accrefPrefix">'mysvcs/data'</bind>
<bind key="descClass">DLFITSProductDescriptor</bind>
</descriptorGenerator>
<FEED source="//soda#fits_standardDLFuncs"/>
</datalinkCore>
</service>
When not using FITS, you will need to change the descriptor generator’s computation of the local file path yourself, as done, e.g., in pcslg/q.
SDM compliant tables¶
A common use for datalink cores in DaCHS is for server-side generation and processing of spectra as discussed in SDM processing. This almost invariably involves defining tables compliant with the spectral data model and filling them.
The builder
parameter of `//soda#sdm_genData`_ expects a reference
to an SDM compliant data
element. To define it, you first need to
define an instance table. The columns that are in there depend on your
data. In the simplest case, the //ssap#sdm-instance
mixin is
sufficient and adds the columns flux
and spectral
. Here’s how
you’d add flux errors if you needed to:
<table id="instance" onDisk="False">
<mixin ssaTable="slitspectra"
spectralDescription="Wavelength"
fluxDescription="Flux"
>//ssap#sdm-instance</mixin>
<column name="fluxerror"
ucd="stat.error;phot.flux.density;em.wl"
unit="m"
description="Estimate for error in flux based on the procedure
discussed at referenceURL"/>
</table>
What’s referenced in `//soda#sdm_genData`_ is a data
element that
builds this table. Here’s one that fills the table from the database:
<data id="get_slitcomponent">
<!-- datamaker to pull spectra values out of the database -->
<embeddedGrammar>
<iterator>
<code>
obsId = self.sourceToken["accref"].split("/")[-1]
with base.getTableConn() as conn:
# the table name in the query below is illustrative
for row in conn.queryToDicts(
"SELECT lambda AS spectral, flux, error AS fluxerror"
" FROM \schema.slitcomponents"
" WHERE obsId=%(obsid)s ORDER BY lambda",
{"obsid": obsId}):
yield row
</code>
</iterator>
</embeddedGrammar>
<make table="instance">
<parmaker>
<apply procDef="//ssap#feedSSAToSDM"/>
</parmaker>
</make>
</data>
– obviously, you can just as well fill it from a file (e.g., cdfspect/q, which also shows what to do when the metadata that comes with the files is broken).
The parmaker
with the //ssap#feedSSAToSDM
call is generic, i.e.,
you won’t usually need any more tricks here.
Product Previews¶
DaCHS has built-in machinery to generate previews from normal, 2D FITS and JPEG files, where these are versions of the original dataset scaled to be about 200 pixels in width, delivered as JPEG files. These previews are shown on mousing over product links in the web interface, and they turn up as preview links in datalink interfaces. This also generates previews for cutouts.
For any other sort of data, DaCHS does not automatically generate
previews. To still provide previews – which is highly recommended –
there is a framework allowing you to compute and serve out custom
previews. This is based on the preview
and preview_mime
columns
which are usually set using parameters in //products#define
.
You could use external previews by having http (or ftp) URLs, which could look like this:
<rowfilter procDef="//products#define">
...
<bind key="preview">("http://example.org/previews/"
+"/".join(\inputRelativePath.split("/")[2:]))</bind>
<bind key="preview_mime">"image/jpeg"/bind>
</rowfilter>
(this takes away two path elements from the relative paths, which
typically reproduces an external hierarchy). If you need to do more
complex manipulations, you can have a custom rowfilter, maybe like
this if you have both FITS files (for which you want DaCHS’ default
behaviour selected with AUTO
) and .complex
files with some
external preview:
<rowfilter name="make_preview_paths">
<code>
srcName = os.path.basename(rowIter.sourceToken)
if srcName.endswith(".fits"):
row["preview"] = 'AUTO'
row["preview_mime"] = None
else:
row["preview"] = ('http://example.com/previews'
+os.path.splitext(srcName)[0]+"-preview.jpeg")
row["preview_mime"] = 'image/jpeg'
yield row
</code>
</rowfilter>
<rowfilter procDef="//products#define">
...
<bind key="preview">@preview</bind>
<bind key="preview_mime">@preview_mime</bind>
</rowfilter>
Precomputed Previews¶
More commonly, however, you’ll have local previews. If they already exist, use a static renderer and enter full local URLs as above.
If you don’t have pre-computed previews, let DaCHS handle them for you. You need to do three things:
define where the preview files are. This happens via a
previewDir
property on the importing data descriptor, like this:

<data id="import">
  <property key="previewDir">previews</property>
  ...
say that the previews are standard DaCHS generated in the
//products#define
rowfilter. The main thing you have to decide here is the MIME type of the previews you’re generating. You will usually use either the `macro standardPreviewPath`_ (preferable when you have less than a couple of thousand products) or the `macro splitPreviewPath`_ to fill the preview path, but you can really enter whatever paths are convenient for you here:

<rowfilter procDef="//products#define">
  <bind key="table">"\schema.data"</bind>
  <bind key="mime">"image/fits"</bind>
  <bind key="preview_mime">"image/jpeg"</bind>
  <bind key="preview">\standardPreviewPath</bind>
</rowfilter>
actually compute the previews. This is usually not defined in the RD but rather using DaCHS’ processing framework. Precomputing previews in the processor documentation covers this in more detail; the upshot is that this can be as simple as:
from gavo.helpers import processing

class PreviewMaker(processing.SpectralPreviewMaker):
    sdmId = "build_sdm_data"

if __name__=="__main__":
    processing.procmain(PreviewMaker, "flashheros/q", "import")
Previews on the Fly¶
When you keep the data you want to preview in the database – as is sensible for shortish spectra or time series – it hurts to create files for what otherwise would be neatly in arrays in the database, much more so since such collections are often large, and thus the overwhelming majority of generated files would probably never be retrieved.
So, we would much rather generate the images on the fly. DaCHS can do this, too. The major ingredient is `the qp renderer`_, which lets you write a service taking a single argument from the URL path. This keeps preview URLs tidy.
Here is an example for a preview generating service reading spectral and flux points from the database:
<service id="preview" allowed="qp">
<meta name="title">DFBS spectra preview maker"</meta>
<property name="queryField">specid</property>
<pythonCore>
<inputTable>
<inputKey name="specid" type="text" required="True"
description="ID of the spectrum to produce a preview for"/>
</inputTable>
<coreProc>
<setup imports="gavo.helpers.processing.SpectralPreviewMaker,
gavo.svcs"/>
<code>
with base.getTableConn() as conn:
res = list(conn.query("SELECT spectral, flux"
" FROM \schema.spectra"
" WHERE specid=%(specid)s",
inputTable.args))
if not res:
raise svcs.UnknownURI("No such spectrum known here")
return ("image/png", SpectralPreviewMaker.get2DPlot(
zip(res[0][0], res[0][1]), linear=True))
</code>
</coreProc>
</pythonCore>
</service>
Essentially, we define a service that sticks the rest of a query path
pointing to it into a field specid
in the input table. For
instance, when the query path coming in is
myspecs/q/preview/qp/foo/bar
and the thing sits in the RD
myspecs/q
, then specid
will be foo/bar
.
The python core fetches this specid and does a database query to pull the spectral and flux points out of the database table; if there is an accref in the table, it is probably a good idea to just use that for what specid does here.
Finally, we use the SpectralPreviewMaker mentioned above; this has a static method doing a 2D plot of (x,y) tuples (see the source if you have to) and returning a PNG as a string.
When a core returns a 2-tuple, most DaCHS renderers will interpret the first element as a media type and the second as a byte string to deliver; qp certainly does, and so the last line simply ensures the data is handed back to the client as an image/png.
What’s left to do is tell DaCHS where to find the previews. That you’ll
do in the products#define
rowfilter. In all likelihood, you’ll be
building some artificial accref in such cases. Right now, you will have
to repeat such expressions when declaring the URL at which the preview
is found, perhaps like this:
<rowfilter procDef="//products#define">
<bind key="table">"\schema.spectra"</bind>
<bind key="accref">"\rdId/%s-%s"%(@plate, @objectid[5:])</bind>
<bind key="path">[...]</bind>
<bind key="preview_mime">"image/png"</bind>
<bind key="preview">makeAbsoluteURL("\rdId/preview/qp/%s-%s"%(
@plate, @objectid[5:]))</bind>
</rowfilter>
Custom UWSes¶
Universal Worker Systems (UWSes) allow the asynchronous operation of services, i.e., the server runs a job on behalf of the user without the need for a persistent connection.
DaCHS supports async operations of TAP and datalink out of the box. If you want to run async services defined by your own code, there are a few things to keep in mind.
(1) You’ll need to prepare your database to keep track of your custom jobs (just once):
dachs imp //uws enable_useruws
(2) You’ll have to allow the uws.xml
renderer on the service in
question.
(3) Things running within a UWS are fairly hard to debug in DaCHS right
now. Until we have good ideas on how to make these things a bit more
accessible, it’s a good idea to at least for debugging also allow
synchronous renderers, for instance, form
or api
. If something
goes wrong, you can do a sync query that then drops you in a debugger
in the usual manner (see the debugging chapter in the tutorial).
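Putting points (2) and (3) together, the service element might look roughly like this (the id and title are illustrative):

<service id="calc" allowed="uws.xml,api">
  <meta name="title">My asynchronous computation service</meta>
  <!-- core and input table as usual -->
</service>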
(4) For now, the usual queryMeta is not pushed into the uws handler (there’s no good reason for that). We do, however, transport on DALI-type RESPONSEFORMAT. To enable that on automatic results (see below), say:
<inputKey name="responseformat" description="Preferred
output format" type="text"/>
in your input table.
(5) All UWS parameters are lowercased and only available in lowercased form to server-side code. To allow cores to run in both sync and async without further worries, just have lowercase-only parameters.
(6) As usual, the core may return either a pair of (media type, content)
or a data item, which then becomes a UWS result named result
with
the proper media type. You can also return None (which will make the
core incompatible with most other renderers). That may be a smart thing
to do if you’re producing multiple files to be returned through UWS. To
do that, there’s a job
attribute on the inputTable that has an
addResult(source, mediatype, name)
method. Source can be a string
(in which case the string will be the result) or a file open for reading
(in which case the result will be the file’s content). Input tables
of course don’t have that attribute unless they come from the uws
renderer. Hence, a typical pattern to use this would be:
if hasattr(inputTable, "job"):
with inputTable.job.getWritable() as wjob:
wjob.addResult("Hello World.\\n", "text/plain", "aux.txt")
or, to take the results from a file that’s already on-disk:
if hasattr(inputTable, "job"):
with inputTable.job.getWritable() as wjob:
with open("other-result.txt") as src:
wjob.addResult(src, "text/plain", "output.txt")
Right now, there’s no facility for writing directly to UWS result files. Ask if you need that.
(7) UWS lets you add arbitrary files using standard DALI-style uploads.
This is enabled if there are file
-typed inputKeys in the service’s
input table. These inputKeys are otherwise ignored right now.
See [DALI] for details on how these inputs work. To create an
inline upload from a python client (e.g., to write a test), it’s most
convenient to use the requests package, like this:
import requests
requests.post("http://localhost:8080/data/cores/pc/uws.xml/D2hFEJ/parameters",
{"UPLOAD": "stuff,param:upl"},
files = {"upl": open("zw.py")})
From within your core, use the file name (the name of the input key) and pull the file from the UWS working directory:
with open(os.path.join(inputTable.job.getWD(), "mykey")) as f:
...
Hint on debugging: dachs uwsrun
doesn’t check the state the job is
in, it will just try to execute it anyway. So, if your job went into
error and you want to investigate why, just take its id and execute
something like:
dachs --traceback uwsrun i1ypYX
Custom Pages¶
While DaCHS isn’t actually intended to be an all-purpose server for web applications, sometimes you want to have some gadget for the browser that doesn’t need VO protocols. For that, there is customPage, which is essentially a bare-bones nevow page. Hence, all (admittedly sparse) nevow documentation applies. Nevertheless, here are some hints on how to write a custom page.
First, in the RD, define a service allowing a custom page. These normally have a null core (the customPage renderer will ignore it either way):
<service id="ui" allowed="custom"
customPage="res/registration.py">
<meta name="shortName">DOI registration</meta>
<meta name="title">VOiDOI DOI registration web service</meta>
<nullCore/>
</service>
The python module referred to in customPage must define a MainPage
nevow resource. The recommended pattern is like this:
from nevow import tags as T
from gavo import web
from gavo.imp import formal
class MainPage(
formal.ResourceMixin,
web.CustomTemplateMixin,
web.ServiceBasedPage):
name = "custom"
customTemplate = "res/registration.html"
workItems = None
@classmethod
def isBrowseable(cls, service):
return True
def form_ivoid(self, ctx, data={}):
form = formal.Form()
form.addField("ivoid", formal.String(required=True), label="IVOID",
description="An IVOID for a registred VO resource"),
form.addAction(self.submitAction, label="Next")
return form
def render_workItems(self, ctx, data):
if self.workItems:
return ctx.tag[T.li[[m for m in self.workItems]]]
return ""
def submitAction(self, ctx, form, data):
self.workItems = ["Working on %s"%data["ivoid"]]
return self
The formal.ResourceMixin
lets you define and interpret forms. The
web.ServiceBasedPage
does all the interfacing to the DaCHS (e.g.,
credential checking and the like). The web.CustomTemplateMixin
lets
you get your template from a DaCHS template (cf. templating guide)
from a resdir-relative directory given in the customTemplate
attribute. For widely distributed code, you should additionally provide
some embedded stan fallback in the defaultDocFactory
attribute – of
course, you can also give the template in stan in the first place.
On form_ivoid
and submitAction
see below.
This template could, for this service, look like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:n="http://nevow.com/ns/nevow/0.1">
<head>
<title>VOiDOI: Registration</title>
<n:invisible n:render="commonhead"/>
</head>
<body n:render="withsidebar">
<h1>VOiDOI: Register your VO resource</h1>
<ul n:render="workItems"/>
<p>VOiDOI lets you obtain DOIs for registered VO services.</p>
<p>In the form below, enter the IVOID of the resource you want a DOI for.
If the resource is known to our registry but has no DOI yet, the registered
contact will be sent an e-mail to confirm DOI creation.</p>
<n:invisible n:render="form ivoid"/>
</body>
</html>
Most of the details are explained in the templating guide. The
exception is the form ivoid
. This makes the
formal.ResourceMixin
call the form_ivoid
in MainPage
and put
in whatever HTML/stan that returns. If nevow detects that the request
already results from filling out the form, it will execute what you
registered in addAction
– in this case, it’s the submitAction
method.
Important: anything you do within addAction
runs within the
(cooperative) server thread. If it blocks or performs a long
computation, the server is blocked. You will therefore want to do
non-trivial things either using asynchronous patterns or using
deferToThread
. The latter is less desirable but also easier, so
here’s what this looks like:
def submitAction(self, ctx, form, data):
return threads.deferToThread(
runRegistrationFor, data["ivoid"]
).addCallback(self._renderResponse
).addErrback(self._renderErrors)
def _renderResponse(self, result):
# do something to render a success message (or return Redirect)
return self
def _renderErrors(self, failure):
# do something to render an error message, e.g., from
# failure.getErrorMessage()
return self
The embedding RD is available in the custom pages’s global namespace as
RD
. Thus, the standard pattern for creating a read only table is:
with api.getTableConn() as conn:
    table = api.TableForDef(RD.getById("my_table"), connection=conn)
If you need write access, you would write:
with api.getWritableAdminConn() as conn:
table = api.TableForDef(RD.getById("my_table"), connection=conn)
The RD
attribute is not available during module import. This is
a bit annoying if you want to load resources from an RD-dependent place;
this, in particular, applies to importing dependent modules. To provide
a workaround, DaCHS calls a method initModule(**kwargs)
after
loading the module. You should accept arbitrary keyword arguments here
so your code doesn’t fail if we find we want to give initModule
some
further information.
The common case of importing a module from some RD-dependent place thus becomes:
from gavo import utils
def initModule(**kwargs):
global oai2datacite
modName = RD.getAbsPath("doitransfrom/oai2datacite")
oai2datacite, _ = utils.loadPythonModule(modName)
Manufacturing Spectra¶
TODO: Update this for Datalink
Making SDM Tables¶
Compared to images, the formats situation with spectra is a mess. Therefore, in all likelihood, you will need some sort of conversion service to VOTables compliant to the spectral data model. DaCHS has a facility built in to support you with doing this on the fly, which means you only need to keep a single set of files around while letting users obtain the data in some format convenient to them. The tutorial contains examples on how to generate metadata records for such additional formats.
First, you will have to define the “instance table”, i.e., a table definition that will contain a DC-internal representation of the spectrum according to the data model. There’s a mixin for that:
<table id="spectrum">
<mixin ssaTable="hcdtest">//ssap#sdm-instance</mixin>
</table>
In addition to adding lots and lots of params, the mixin also defines
two columns, spectral
and flux
; these have units and ucds as
taken from the SSA metadata. You can add additional columns (e.g., a
flux error depending on the spectral coordinate) as required.
The actual spectral instances can be built by sdmCores and delivered through DaCHS’ product interface.
sdmCores, while potentially useful with common services, are intended to
be used by the product renderer for dcc product table paths. They
contain a data item that must yield a primary table that is basically
sdm compliant. Most of this is done by the //ssap#feedSSAToSDM
apply
proc, but obviously you need to yield the spectral/flux pairs (plus
potentially more stuff like errors, etc, if your spectrum table has more
columns). This comes from the data item’s grammar, which probably must
always be an embedded grammar, since its sourceToken is an SSA row in a
dictionary. Here’s an example:
<sdmCore queriedTable="hcdtest" id="mksdm">
<data id="getdata">
<embeddedGrammar>
<iterator>
<code>
labels = ("spectral", "flux")
relPath = self.sourceToken["accref"].split("?")[-1]
with self.grammar.rd.openRes(relPath) as inF:
for ln in inF:
yield dict(zip(labels,ln.split()))
</code>
</iterator>
</embeddedGrammar>
<make table="spectrum">
<parmaker>
<apply procDef="//ssap#feedSSAToSDM"/>
</parmaker>
</make>
</data>
</sdmCore>
Note: spectral, flux, and possibly further items coming out of the
iterator must be in the units promised by the SSA metadata
(fluxSI, spectralSI). Declarations to this effect are generated by the
//ssap#sdm-instance
mixin for the spectral and flux columns.
The sdmCores are always combined with the sdm renderer. It passes an
accref into the core that gets turned into a row from the queried table;
this must be an “ssa” table (i.e., right now something that mixes in
//ssap#hcd
). This row is the input to the embedded data descriptor.
Hence, this has no sources element, and you must have either a custom
or embedded grammar to deal with this input.
Echelle Spectra¶
Echelle spectrographs “fold” a spectrum into several orders which may be
delivered in several independent mappings from spectral to flux
coordinate. In this split form, they pose some extra problems, dealt
with in an extra system RD, //echelle
. For merged Echelle spectra,
just use the standard SSA framework.
Table¶
Echelle spectra have additional metadata that should end up in their SSA metadata table – these are things like the number of orders, the minimum and maximum (Echelle) order, and the like. To pull these columns into your metadata table, use the ssacols stream, for example like this:
<table id="ordersmeta" onDisk="True" adql="True">
<meta name="description">SSA metadata for split-order
Flash/Heros Echelle spectra</meta>
<mixin
[...]
statSpectError="0.05"
spectralResolution="2.5e-11"
>//ssap#hcd</mixin>
<mixin
calibLevel="1">//obscore#publishSSAPMIXC</mixin>
<column name="localKey" type="text"
ucd="meta.id"
tablehead="Key"
description="Local observation key."
verbLevel="1"/>
<STREAM source="//echelle#ssacols"/>
</table>
Adapting Obscore¶
You may want extra, locally-defined columns in your obscore tables. To
support this, there are three hooks in obscore that you can exploit.
The hooks are in userconfig.rd
(see Userconfig RD in
the operator’s guide for where it is and how to get started with it).
It helps to have a brief look at the //obscore
RD (e.g., using
dachs admin dumpDF //obscore
) to get an idea what these hooks do.
Within the template userconfig.rd
, there are already three STREAMs
with ids starting with obscore.; these are referenced from within the
system //obscore
RD. Here’s a somewhat more elaborate example:
<STREAM id="obscore-extracolumns">
<column name="fill_factor"
description="Fill factor of the SED"
verbLevel="20"/>
</STREAM>
<STREAM id="obscore-extrapars">
<mixinPar name="fillFactor"
description="The SED's fill factor">NULL</mixinPar>
</STREAM>
<STREAM id="obscore-extraevents">
<property name="obscoreClause" cumulate="True">
,
CAST(\\\\fillFactor AS real) AS fill_factor
</property>
</STREAM>
(to be on the safe side: there need to be four backslashes in front of fillFactor; this is just a backslash doubly-escaped. Sorry about this).
The way this is used in an actual mixin would be like this:
<table id="specs" onDisk="True">
<mixin ...>//ssap#hcd</mixin>
<mixin
... (all the usual parameters)
fillFactor="0.3">//obscore#publishSSAPMIXC</mixin>
</table>
What’s going on here? Well, obscore-extracolumns
is easy – this
material is directly inserted into the definition of the obscore view
(see the table with id ObsCore
within the //obscore
RD). You
could abuse it to insert other stuff than columns but probably should
not.
The tricky part is obscore-extraevents
. This goes into the
//obscore#_publishCommon
STREAM and ends up in all the publish
mixins in obscore. Again, you could insert mixinPars and similar at
this point, but the only thing you really must do is add lines to the
big SQL fragment in the obscoreClause
property that the mixin leaves
in the table. This is what is made into the table’s contribution to the
big obscore union. Just follow the example above and, in particular,
always CAST to the type you have in the metadata, since individual tables
might have NULLs in the values, and you do not want misguided attempts
by postgres to do type inference then.
If you actually must know why you need to double-escape fillFactor and
what the magic with the cumulate="True"
is, ask.
Finally, obscore-extrapars
directly goes into a core component of
obscore, one that all the various publish mixins there use. Hence, all
of them grow your functionality. That is also why it is important to
give defaults (i.e., element content) to all mixinPars you give in this
way – without them, all those other publish mixins would fail unless
their applications in the RDs were fixed.
If you change %#obscore-extracolumns
, all the statement fragments
contributed by the obscore-published tables need to be fixed. To spare
you the effort of touching a potentially sizeable number of RDs, there’s
a data element in //obscore that does that for you; so, after every
change just run:
dachs imp //obscore refreshAfterSchemaUpdate
This may fail if you didn’t clean up properly after deleting a resource that once contributed to ivoa.obscore. In that case you’ll see an error message like:
*** Error: table u'whatever.main' could not be located in dc_tables
In that case, just tell DaCHS to forget the offending table:
dachs purge whatever.main
Another problem can arise when a table once was published to obscore but now no longer is while still existing. DaCHS in that case will still have an entry for the table in ivoa._obscoresources, which results in an error like:
Table definition of whatever.main> has no property 'obscoreClause' set
The fastest way to fix this situation is to drop the offending line in the database manually:
psql gavo -c "delete from ivoa._obscoresources where tablename='whatever.main'"
Writing Custom Grammars¶
Note
Before DaCHS 2.6.2, you had to import CustomRowIterator
from
gavo.grammars.customgrammar
rather than gavo.api
.
A custom grammar simply is a python module located within a resource
directory defining a row iterator class derived from
gavo.api.CustomRowIterator
. This class must be called
RowIterator
. You want to override the _iterRows
method. It will have
to yield row dictionaries, i.e., dictionaries mapping string keys to
something (preferably strings, but you will usually get away with
returning complete values even without fancy rowmakers).
So, a custom grammar module could look like this:
from gavo.api import CustomRowIterator
class RowIterator(CustomRowIterator):
def _iterRows(self):
for i in range(int(self.sourceToken)):
yield {'index': i, 'square': i**2}
This would be used with a data
element like:
<sources><item>4</item><item>40</item></sources>
<customGrammar module="res/sillygrammar"/>
– self.sourceToken
simply contains
whatever the sources
element produces.
One RowIterator
will be constructed for each item.
Do not override magic methods, since you may lose row filters, sourceFields, and the like if you do. An exception is the constructor. If you must, you can override it, but you must call the parent constructor, like this:
class RowIterator(CustomRowIterator):
def __init__(self, grammar, sourceToken, sourceRow=None):
CustomRowIterator.__init__(self, grammar, sourceToken, sourceRow)
<your code>
In practice (i.e., with <sources pattern="*"/>
) self.sourceToken
will often be a file name. When you call makeData
manually and pass a
forceSource
argument, its value will show up in self.sourceToken
instead.
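A hedged sketch of such a manual call (the RD id and the data descriptor id are illustrative):

from gavo import api

# load the RD containing the customGrammar and run its import
# data descriptor on a single, in-memory source
rd = api.getRD("myres/q")
data = api.makeData(rd.getById("import"), forceSource="4")
print(data.getPrimaryTable().rows)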
Also look into EmbeddedGrammar
, which may be a more convenient way to
achieve the same thing.
A fairly complex example of a custom grammar is a provisional Skyglow grammar.
Locators¶
It is highly recommended to keep track of the current position so DaCHS
can give more useful error messages. When an error occurs, DaCHS will
call the iterator’s getLocator method. This returns an arbitrary
string; obviously, it is a good idea if that string leads users
somewhere close to where the problem showed up. Here is a custom
grammar reading space-separated key-value pairs from a file:
class RowIterator(CustomRowIterator):
    def _iterRows(self):
        self.lineNumber = 0
        with open(self.sourceToken) as f:
            for self.lineNumber, line in enumerate(f):
                yield dict(zip(["key", "value"], line.split(" ", 1)))

    def getLocator(self):
        return f"line {self.lineNumber}"
Note that getLocator
does not include the source file name; that will
be inserted into the error message by DaCHS.
Debugging outside of DaCHS¶
For development, it may be convenient to execute your custom grammar as a python module. To enable that, just append a:
if __name__=="__main__":
    import sys
    from gavo.api import CustomGrammar
    ri = RowIterator(CustomGrammar(None), sys.argv[1])
    for row in ri:
        print(row)
to your module. You can then run things like:
python res/mygrammar.py data/inhabitedplanet.fits
and see the rows as they’re generated.
Data Packs¶
A row iterator is instantiated for each source processed. Thus,
you should usually not perform expensive operations in the constructor
unless they depend on sourceToken. Instead, define a function
makeDataPack in the module. Whatever is returned by this function is
available as self.grammar.dataPack in the row iterator.
The function receives the customGrammar instance as an argument. This
means you can access the resource descriptor and properties of the
grammar. As an example of how this could be used, consider this RD
fragment:
<table id="defTable">
  ...
</table>
<customGrammar module="res/grammar">
  <property name="targetTable">defTable</property>
</customGrammar>
Then you could have the following in res/grammar.py:
def makeDataPack(grammar):
    return grammar.rd.getById(grammar.getProperty("targetTable"))
and access the table in the row iterator.
If you want to do Debugging outside of DaCHS in custom grammars that require data packs, you need to be a bit more careful when you construct your custom grammar, as it will need a proper RD as its parent. This means you will have to hard-code your RD id, perhaps like this:
if __name__=="__main__":
    import sys
    from gavo import api
    grammar = api.CustomGrammar(api.getRD("MYRES/q"))
    ri = RowIterator(grammar, sys.argv[1])
    ...
Dispatching Grammars¶
With normal grammars, all rows are fed to all rowmakers of all makes
within a data object. The rowmakers can then decide to not process a
given row by raising IgnoreThisRow
or using the trigger mechanism.
However, when filling complex data models with potentially dozens of
tables, this becomes highly inefficient.
When you write your own grammars, you can do better. Instead of just
yielding a row from _iterRows, you yield a pair of a role (as
specified in the role attribute of a make element) and the row.
The machinery will then pass the row only to the feeder for the table in
the corresponding make.
Currently, the only way to define such a dispatching grammar is to use a
custom grammar or an embedded grammar. For these, just change your
_iterRows and say isDispatching="True" in the customGrammar
element. If you implement getParameters, you can return either
pairs of role and row or just the row; in the latter case, the row will
be broadcast to all parmakers.
Special care needs to be taken when a dispatching grammar parses products, because the product table is fed by a special make inserted from the products mixin. This make of course doesn’t see the rows you are yielding from your dispatching grammar. This means that without further action, your files will not end up in the product table at all. In turn, getproducts will return 404s instead of your products.
To fix this, you need to explicitly yield the rows destined for the
products table with a products role from within your grammar. Where
the grammar yields rows for the table with metadata (i.e., rows that
actually contain the fields with prodtblAccref, prodtblPath, etc.),
yield them to the products table, too, like this: yield ("products", newRow).
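Putting this together, a dispatching row iterator might look roughly like this sketch; parseRecords and buildRow are made-up helpers, and the role names must match the role attributes of your makes:

class RowIterator(CustomRowIterator):
    def _iterRows(self):
        # parseRecords and buildRow are hypothetical helpers standing
        # in for your actual parsing code
        for rec in parseRecords(self.sourceToken):
            newRow = buildRow(rec)
            # only the make with role="meta" sees this row
            yield ("meta", newRow)
            # the products table has its own make, so feed it explicitly
            yield ("products", newRow)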
Scripting¶
As much as it is desirable to describe tables in a declarative manner, there are quite a few cases in which some imperative code helps a lot during table building or teardown. Resource descriptors let you embed such imperative code using script elements. These are children of the make elements since they are exclusively executed when actually importing into a table.
Currently, you can enter scripts in SQL and python, which may be called at various phases during the import.
SQL scripts¶
In SQL scripts, you separate statements with semicolons. Note that no statements in an SQL script may fail since that will invalidate the transaction. Use the AC_SQL language to simply ignore failures.
You can use table macros in the SQL scripts to parametrize them; the
most useful among those probably is \qName
containing the fully
qualified name of the table being processed.
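For instance, a (fairly harmless) post-creation script using the macro might read like this; the ANALYZE statement is just a stand-in for whatever you actually need to run:

<script type="postCreation" lang="SQL" name="analyze new table">
  ANALYZE \qName
</script>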
You cannot easily produce output from SQL scripts. If you want to give
user feedback in long-running scripts, use RAISE NOTICE
in
procedures or, outside of procedures:
do $$BEGIN raise notice 'My message'; END$$;
Python scripts¶
Python scripts can be indented by a constant amount.
The table object currently processed is accessible as table. In
particular, you can use this to issue queries using
table.connection.execute(query, arguments) (parallel to dbapi.execute)
and to delete rows using table.deleteMatching(condition, pars). The
current RD is accessible as table.tableDef.rd, so you can access items
from the RD as table.tableDef.rd.getById("some_id"), and the
recommended way to read stuff from the resource directory is
table.tableDef.rd.openRes("res/some_file").
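As a sketch, a pre-import script removing previously ingested placeholder rows could look like this; the column quality and the flag value are invented for illustration:

<script type="preImport" lang="python" name="clear placeholder rows">
  # "quality" is a hypothetical column; adapt condition and pars
  table.deleteMatching("quality=%(badFlag)s", {"badFlag": -1})
</script>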
Some types of scripts may have additional names available. Currently:
- newSource and sourceDone have the name sourceToken; this is the sourceToken as passed to the grammar; usually, that’s the file name that’s parsed from, but other constellations are possible.
- sourceDone has feeder – that is the DaCHS-internal glue to filling tables. The main use of this is that you can call its flush() method, followed by a table.commit() (see the sketch after this list). This may be interesting in updating grammars where you preserve what’s already imported. Note, however, that this may come with a noticeable performance penalty.
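A minimal sketch of that pattern (only do this if you really want a commit after every source):

<script type="sourceDone" lang="python" name="commit after each source">
  # make rows ingested so far permanent; a crash in a later source
  # then no longer rolls back earlier ones
  feeder.flush()
  table.commit()
</script>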
Script types¶
The type of a script corresponds to the event triggering its execution. The following types are defined right now:
- preImport – before anything is written to the table
- preIndex – before the indices on the table are built
- preCreation – immediately before the table DDL is executed
- postCreation – after the table (incl. indices) is finished
- afterMeta – after metadata has been written or updated (these are executed by dachs imp -m, too)
- beforeDrop – when the table is about to be dropped
- newSource – every time a new source is started
- sourceDone – every time a source has been processed
Note that preImport, preIndex, and postCreation scripts are not executed
when the make’s table is being updated, in particular in data items with
updating="True". The only way to run scripts in such circumstances
is to use newSource and sourceDone scripts.
Examples¶
This snippet sets a flag when importing some source (in this case, that’s an RD, so we can access sourceToken.sourceId):
<script type="newSource" lang="python" id="markDeleted">
  table.connection.execute("UPDATE %s SET deleted=True"
    " WHERE sourceRD=%%(sourceRD)s"%table.tableDef.getQName(),
    {"sourceRD": sourceToken.sourceId})
</script>
This is a hacked way of ensuring some sort of referential integrity: When a table containing “products” is dropped, the corresponding entries in the products table are deleted:
<script type="beforeDrop" lang="SQL" name="clean product table">
DELETE FROM products WHERE sourceTable='\qName'
</script>
Note that this is actually quite hazardous because if the table is dropped in any way not using the make element in the RD, this will not be executed. It’s usually much smarter to tell the database to do the housekeeping. Rules are typically set in postCreation scripts:
<script type="postCreation" lang="SQL">
CREATE OR REPLACE RULE cleanupProducts AS
ON DELETE TO \qName DO ALSO
DELETE FROM products WHERE key=OLD.accref
</script>
The decision whether such arrangements are made before the import, before the indexing, or after the table is finished needs to be based on the script’s purpose.
Another use for scripts is SQL function definition:
<script type="postCreation" lang="SQL" name="Define USNOB matcher">
CREATE OR REPLACE FUNCTION usnob_getmatch(alpha double precision,
delta double precision, windowSecs float
) RETURNS SETOF usnob.data AS $$
DECLARE
rec RECORD;
BEGIN
FOR rec IN (SELECT * FROM usnob.data WHERE
q3c_join(alpha, delta, raj2000, dej2000, windowSecs/3600.))
LOOP
RETURN NEXT rec;
END LOOP;
END;
$$ LANGUAGE plpgsql;
</script>
You can also load data, most usefully in preIndex scripts (although a preImport script would work as well here):
<script type="preIndex" lang="SQL" name="create USNOB-PPMX crossmatch">
SET work_mem=1000000;
INSERT INTO usnob.ppmxcross (
SELECT q3c_ang2ipix(raj2000, dej2000) AS ipix, p.localid
FROM
ppmx.data AS p,
usnob.data AS u
WHERE q3c_join(p.alphaFloat, p.deltaFloat,
u.raj2000, u.dej2000, 1.5/3600.))
</script>
ReStructuredText¶
Text needing some amount of markup within DaCHS is almost always input as ReStructuredText (RST). The source versions of the DaCHS documentation give examples for such markup, and DaCHS users should at least briefly skim the ReStructuredText primer.
DaCHS contains some RST extensions. Those specifically targeted at writing DALI-compliant examples are discussed with the examples renderer.
Generally useful extensions include:
- bibcode
This text role formats the argument as a link into ADS when rendered as HTML. For technical reasons, this currently ignores the configured ADS mirror and always uses the Heidelberg one. Complain if this bugs you. To use it, you’d write:
See also :bibcode:`2011AJ....142....3H`.
Extensions for writing DaCHS-related documentation include:
- dachsdoc – A text role generating a link into the current DaCHS documentation. The argument is the relative path, e.g., :dachsdoc:`opguide.html#userconfig-rd`.
- dachsref – A text role generating a link into the reference documentation. The argument is a section header within the reference documentation, e.g., :dachsref:`//epntap2#populate-2_0` or :dachsref:`the form renderer`.
- samplerd – A text role generating a link to an RD used by the GAVO data center (exhibiting some feature). The argument is the relative path to the RD (or, really, anything else in the VCS), e.g., :samplerd:`ppmxl/q.rd`.
(if you add anything here, please also amend the document source’s README).
The DaCHS API¶
User extension code (e.g., custom cores, custom grammars, processors) for DaCHS should only use DaCHS functions from its api as described below. We will try to keep it stable and will at any rate warn in the release notes if we change it. For various reasons, the api module also contains a few sub-modules. These, and in particular their contents, are not part of the API.
Note that this “api” is not what is in the namespace of rowmakers, rowfilters, and similar in-RD procedures. We do not, at this point, recommend importing the api there. If you do it anyway, we’d appreciate it if you told us.
Before using non-API DaCHS functions, please inquire on the dachs-support mailing list (cf. http://docs.g-vo.org/DaCHS).
To access DaCHS API functions, say:
from gavo import api
(perhaps adding an as dachsapi
if there is a risk of confusion) and
reference symbols with the explicit module name (i.e., api.makeData
rather than picking individual names) in order to help others understand
what you’ve written.
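For instance, a small import driver script could look like this sketch; the RD id myres/q and the data element id import are placeholders you would have to adapt:

from gavo import api

# load a resource descriptor and build one of its data elements;
# "myres/q" and "import" are placeholder identifiers
rd = api.getRD("myres/q")
api.makeData(rd.getById("import"))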
In this chapter, we first give the functions that code in row makers see and then document the api available to extension code.
Functions Available for Row Makers¶
In principle, you can use arbitrary python expressions in var, map and
proc elements of row makers. In particular, the namespace in which
these expressions are executed contains math, os, re, time, datetime,
and urllib.parse (for urllib.parse.quote, in particular)
modules as well as gavo.base, gavo.utils, and gavo.coords; in addition,
there’s NaN (which simply is float('nan')).
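For illustration, map elements in a rowmaker might use these modules like this (@mag, @dist, and @obs_date are made-up keys coming from the grammar):

<!-- a sketch: @mag, @dist, and @obs_date are hypothetical grammar keys -->
<map key="absmag">float(@mag)-5*(math.log10(float(@dist))-1)</map>
<map key="obs_year">datetime.datetime.strptime(@obs_date, "%Y-%m-%d").year</map>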
However, much of the time you will get by using the following functions that are immediately accessible in the namespace:
API function reference¶
System Tables¶
DaCHS uses a number of tables to manage services and implement protocols. Operators should not normally be concerned with them, but sometimes having a glimpse into them helps with debugging.
If you find yourself wanting to change these tables’ content, please post to dachs-support first describing what you’re trying to do. There should really be commands that do what you want, and it’s relatively easy to introduce subtle problems by manipulating system tables without going through those.
Having said that, here’s a list of the system tables together with brief
descriptions of their role and the columns contained. Note that your
installation might not have all of these; some only appear after a
dachs imp of the RD they are defined in – which, of course, you should
only do if you know you want to enable the functionality provided.
The documentation given here is extracted from the resource descriptors,
which, again, you can read in source using dachs admin dumpDF //<rd-name>.