In this document we provide a brief introduction to the use of Structured Query Language (SQL) for accessing data in the WFCAM Science Archive (WSA). Sections included in this document are a Primer aimed at users new both to the WSA and to SQL and a Reference which should be of use more experienced and returning users. Readers wanting a fuller introduction to SQL should consult an online tutorial or one of the legion of SQL books available: O'Reilly's SQL in a nutshell is a good introduction. The document concludes with some examples of cross-querying WFCAM and external survey data, which we expect to be a common usage of the archive.

The WSA and its prototype, the SuperCOSMOS Science Archive (SSA) are large databases so, for test purposes, we have produced the "Personal SSA" (PSSA), which is a small subset of the SSA, containing solely the data in the region of the sky with 184 < RA (deg) < 186 and -1.25 < Dec. (deg) < 1.25, which is the same area of sky as the "Personal SkyServer" produced for the Early Data Release (EDR)of the Sloan Digital Sky Survey (SDSS). The PSSA may be downloaded from here (as a .zip file with installation instructions included) or can be queried using a web interface of the same form as for the full SSA. Similarly, the initial release of the WSA (which contains a limited amount of commissioning data) should be used if you are unfamiliar with SQL but wish to exercise queries on that database. WSA and SSA users are strongly encouraged to use these small subsets for developing and debugging queries: with databases as large as the WSA and full SSA it can take a long time to find out that the query you wrote does not do what you intended!

This WSA Cookbook is based on the SSA version. Currently, the latter provides a more comprehensive introduction to SQL and in particular gives real-world examples of astronomy application queries that were used in the construction of the SSA. Novice users are strongly recommended to familiarise themselves with the PSSA and the information given in the SSA Cookbook before attempting real science with the WSA.

2. Primer

2.1. Relational databases

The WSA is a relational database, which means that it stores data in tables composed of rows and columns. Each row comprises the information stored for one data entry – e.g.. a celestial object in the case of the WSA – and there is one column for each of the attributes recorded for that entry – e.g. RA, Dec, ellipticity, etc. The different tables comprising a database may be linked (or related), if they each have columns representing the same data value (integrity constraints can be included in the table definitions which ensure consistency between two related tables, e.g. by preventing the deletion of only one of a pair of rows in different tables thus linked). For ease of use, it is possible to define virtual tables - called views - which are subsets of the data in one or more tables and which can be queried using the same syntax as ordinary tables (which are sometimes called base tables, to distinguish them from these virtual tables). In addition to tables and views, the major constituents of a relational database are indexes (the database community prefer that spelling to "indices"), which can speed up the identification of records which satisfy the particular condition expressed in a query, and various stored procedures and functions which extend the range of operations which can be performed on data held in the tables. The collection of definitions of columns, tables, views, indexes, stored procedures and functions in a database is called its schema.

2.2. The WSA as a relational database

The WSA schema is described in detail elsewhere, but we recap here the basic features which we shall use later. For each WFCAM survey programme in the WSA (e.g. the UKIDSS Large Area, Galactic Plane or Galactic Clusters Surveys etc.) two major tables called, for example, lasDetection and lasSource are available. The columns in *Detection are basically the attributes derived by running the pipeline image analyser over single passband frames, and these single-frame detections are then merged into multi-epoch, multi-colour records for individual celestial objects, which are stored in *Source. In addition to these major tables, there are also a number of metadata tables, which store ancillary information describing the processes involved in obtaining WFCAM imaging data, and which enable the provenance of data values in *Source and *Detection to be traced all the way back to a given image. To aid spatial matching of objects within the WSA and between the WSA and external catalogue datasets (such as the 2MASS Point Source Catalogue and SDSS Data Release catalogues), there are also "neighbour" and "cross-neighbour" tables which record pairs of sources within 10 to 30 arcsec of one another, depending on the catalogues joined.

2.3. Structured Query Language (SQL)

SQL is the standard language for accessing and manipulating data stored in a relational database. In fact, several versions of the SQL standard exist, and most database management systens (DBMSs) actually support a subset of standard SQL, with some vendor-specific additions. The WSA is currently implemented in Microsoft's SQL Server 2000 DBMS, so WSA users will employ its SQL dialect (known as Transact-SQL, or T-SQL), although we have tried to restrict the use of vendor-specific features to a minimum. A fuller reference on T-SQL dialect than presented here is available online here.

The first thing to understand about SQL is that it is a set-based language, not a procedural language, like Fortran or C. A user submitting an SQL query to a relational database is defining the set of properties of the records to be returned from the database, not specifying the list of operations which will lead to their delivery; this latter is the responsibility of the DBMS engine, which will decide the best way to execute a given query from a set of possible execution plans. Many database vendors are adding procedural capabilities to the SQL dialects they support, and these constitute one of the main areas of difference between those dialects. These extensions will not be discussed here, as we shall concentrate on the basics of standard SQL.

2.4. The basic structure of a WSA SQL statement

For security reasons, the WSA does not allow users to execute queries which affect the basic structure and contents of the database, only those which can extract data from it. In SQL terms, this means that only SELECT statements are allowed: N.B. in this tutorial we write all SQL keywords in upper case italics and some column names in mixed case, both for clarity, although the WSA's SQL dialect is case insensitive by default. There are three basic classes of SELECT statement:

2.4.1 Projections

A projection is the retrieval of a set of full columns from a table. To retrieve the nominal RAs and Decs of the centres of all UKIDSS Large Area Survey fields in the WSA, one would type:

SELECT ra, dec FROM lasMergeLog

where lasMergeLog is the name of the WSA table which records information about LAS fields, and ra and dec are the names of the relevant columns in that table.

2.4.2 Selections

A selection is the retrieval of the data values in particular columns for those rows in a table which satisfy certain critieria. So, if one were interested only in fields whose nominal centres lie in a 1 degree strip on the celestial equator, the appropriate SQL query would be:

SELECT ra, dec
FROM lasMergeLog
WHERE dec BETWEEN -0.5 AND +0.5

In this example the SQL statement has been split into three lines to emphasise the SELECT…FROM…WHERE syntax, but this is still one SQL statement. The SQL Query Form in the WSA interface ignores the whitespace at the end of each line of text and generates a single query string from valid multi-line text like this. (Note that this means that users should not used double hyphens to indicate comments in multi-line queries, since this will result in all text after the first comment being ignored.)
Multiple constraints can be included in the WHERE clause of a selection, so, for example, the query above could be rewritten as:

SELECT ra, dec FROM lasMergeLog WHERE (dec >= -0.5) AND (dec <= +0.5)

while the centres of all other fields could be selected using the following statement:

SELECT ra, dec FROM lasMergeLog WHERE (dec < -0.5) OR (dec > +0.5)

The parentheses in these examples have been included for clarity – they are only required when needed to avoid ambiguity, and when necessary to over-rule the standard order of precedence amongst operators, outlined in Section 3.3.6 (users should note that the accidental omission of the WHERE clause from a selection turns it not into an invalid query, but into the projection of the columns contained in its SELECT clause, which, for tables as large as the *Source and *Detection tables of the WSA will return a lot of data). Note the units of spherical coordinates in the WSA: these are almost always decimal degrees for both RA and Dec (a notable exception to this rule is attribute raBase in table Multiframe, where the units are hours for consistency with the FITS files ingested from the processing centre. You can easily convert units in selections or WHERE clauses: e.g. SELECT raBase*15.0 ..., or WHERE (ra/15.0) > 23.0 ... etc.

Note that users should check which attributes in which tables have been indexed in the WSA, since the performance of queries that can make use of them should be significantly better than for those which do not: this information is presented in the Schema Browser as highlighted rows in the lists of table attributes.

2.4.3 Joins

A join is the retrieval of data entries from one or more tables in a database matched under some criterion. Extending our example above, a user may be interested in the dates on which the Y band images in a strip were taken. The Multiframe table in the WSA has an attribute called mjdObs, which records the Modified Julian Date of the observation. The *MergeLog and Multiframe tables are linked by having the common attribute multiframeID, which is a unique identifier for each FITS file ingested into the archive. The SQL query retrieving the desired dates here would be:

SELECT mjdObs, ra, dec
FROM lasMergeLog, Multiframe
WHERE (dec BETWEEN -0.5 AND +0.5)
AND (ymfID = multiframeID) AND (ymfID > 0)

In this query, records in the Multiframe table and the lasMergeLog table are being joined on condition that their values for the multiframeID attribute are equal. Furthermore, we select only those rows for which there is a Y multiframe identifier, i.e. we exclude any rows where there is no Y multiframe in the merged frame set. Such a situation is indicated by a default value for the attribute (see Section 2.5.). Note also that in general, FITS files produced by the pipeline are multi-extension files containing a paw-print of 4 individual images, and fields in the merge logs are made up from sets of individual images in different passbands and at different epochs. The observation dates of the four paw-print images will be nearly the same, so the results set should contain sets of four repeats of the observation dates matched with the different centres of the device images that comprise the multiframe.

The AS keyword can be used to rename the attributes in the SELECT clause so that their names in the output result set differs from those by which they are known in the database table. For example, a user thinking that the column names referred to in the query above are a little cryptic could rewrite it as follows:

SELECT mjdObs AS yModifiedJulianDate, ra AS fieldCentreRA, dec AS fieldCentreDec
FROM lasMergeLog, Multiframe
WHERE (dec BETWEEN -0.5 AND +0.5)
AND (ymfID = multiframeID) AND (ymfID > 0)

and the columns returned in the output result set will be headed yModifiedJulianDate, fieldCentreRA, fieldCentreDec. This useful facility can be misused by the unwary or the very stupid. For example, it would be possible to have a query which started "SELECT ra AS dec, dec AS ra", which could confuse the subsequent analysis of the extracted result set. In the particular case of extracting data in VOTable format from the WSA, the assignment of Unified Content Descriptors (UCDs) to columns - i.e. setting the values of the ucd attributes to <FIELD> elements in the metadata section at the start of the VOTable document - is performed on the basis of the column name in the output result set, so a user swapping ra and dec, as in the example above, would end up with the wrong UCD for both those two columns, causing problems were that VOTable to be used in conjunction with other Virtual Observatory tools reliant on UCDs. For the moment, users wishing to have UCDs included in VOTable output from the WSA must not rename table columns, while, more generally, it is left to the user to ensure that any renaming they do perform is sensible (note: UCDs in the SSA and WSA have been initially assigned according to the UCD1 definitions).

It is also possible to write join queries in a way that makes it more explicit that they are joining the two tables, i.e. the example above becomes:

SELECT mjdObs, ra, dec
FROM Multiframe
JOIN lasMergeLog ON (ymfID = multiframeID)
WHERE (dec BETWEEN -0.5 AND +0.5) AND (ymfID > 0)

This is an inner join, meaning it only returns the (mjdObs,ra,dec) triplets for matched rows, but there are other types of join, which return different combinations of data (see the SSA Cookbook for more details).

2.4.4 Subqueries

The SQL Server dialect of SQL allows the construction of nested SELECT statements, in which the WHERE clause includes a subquery which is itself a valid SELECT statement. For example, the join on lasMergeLog and Multiframe in the previous Section could be rewritten in the following way:

SELECT mjdObs
FROM Multiframe
WHERE multiframeID IN
(SELECT ymfID FROM lasMergeLog WHERE (dec BETWEEN -0.5 AND +0.5) AND (ymfID > 0))

Note that the ra and dec attributes from the lasMergeLog table have now been removed from the SELECT clause of the main query. This is because the lasMergeLog table is only present in the subquery, and its attributes are no longer accessible to the main query. The one exception to this is its ymfID attribute, but that is only accessible by value - i.e. the subquery generates a list of ymfID values and matches between this and the multiframeID column of the Multiframe table are made by use of the IN operator. Note also that now the results set only contains the distinct values of the observation dates, i.e. we get one observation date per multiframe, as opposed to previously where the results sets contained repeated observation dates for the distinct device image centres.

This query could also be written using a second logical operator, ANY, as follows:

SELECT mjdObs
FROM Multiframe
WHERE multiframeID = ANY
(SELECT ymfID FROM lasMergeLog WHERE (dec BETWEEN -0.5 AND +0.5) AND (ymfID > 0))

where the ANY operator is used to match rows in the Multiframe table with any of the rows in the output result set from the subquery which have the same ymfID value. Note that in both these subqueries formulations the list of attributes in the SELECT clause of the subquery must be consistent with the rest of the WHERE clause in the main query, since they are to be matched by value.

These last two queries illustrate the equivalence of IN and = ANY, but care must be taken if the logical operator NOT is added to the outer query, so that it seeks matches with the complement of the set for which matches are sought in the original query. The operator which is equivalent to NOT IN is not < > ANY, as one might initially expect, but rather < > ALL - where ALL is another logical operator, which evaluates to TRUE if all of a set of comparisons are TRUE - which is clear when one thinks through the logic of that query, but perhaps not immediately apparent.

The most common use for subqueries is to express complicated joins in a simpler fashion. Up to 32 levels of nesting are allowed, in principle, although memory restrictions may prevent that from being achieved in practice. To evaluate a complicated join as a nested series of queries would often be much slower, but the database engine should recognise the equivalence of the two forms of the query and recast it in the most efficient form as part of its query optimisation process. So, there should be no difference in the performance of queries submitted as complicated joins or as nested subqueries, and the latter are to be prefered if they help ensure that the query that is executed really is the one that the user wanted to run.

2.5. Null and default values in the WSA

Our discussion to this point has implicitly assumed that the values of the attributes corresponding to each column in every row in a database table are known. This need not always be the case, as a simple example from the WSA illustrates. The *Source tables in the WSA merges information about detections made in the individual frame set images. A very blue object may well not be detected in a K band image, so what should be written for such an object in the column of the *Source table which records, say, the ellipicity of the K band detection? One answer would be a null value, which is a special type of entry to be included in a table if the value of a given attribute is not known (or is indeterminate or is not applicable) for a particular row. In designing the WSA we have decided not to use nulls in these cases, but to define default values for use in these situations instead: e.g. in the example above, we would set the K band ellipticity in the *Source table of a celestial object undetected in that band to be a recognisably meaningless value; in the case of floating point numbers, -0.9999995e9. Nulls and default values are semantically different: the query processor in a database management system (DBMS) recognises that a value marked as null is unknown or indeterminate and will not include it in, say, the computation of the mean value of all entries in a column, while, to the query processor, a default value is like any other value, and will include it, unless the user explicitly excludes it - e.g. by computing the mean magnitude only for those objects with magnitudes brighter than -0.9999995e9, in this case. Other default values in the WSA include -99999999 for 4- and 8-byte integer attributes, -9999 for 2-byte integers, and 0 for 1-byte (unsigned) integers. The schema browser generally indicates the default value for many of the table attributes, and they have all been chosen to lie well beyond the range of legitimate values found in the WSA, so it is simple to exclude them:

SELECT AVG(j_1AperMag3)
FROM lasSource
WHERE j_1AperMag3 > 0.0

As a result of defining default values for some columns, there have to be dummy rows in some tables (i.e. rows for which every attribute takes the appropriate default value). The reason for this is illustrated by consideration of the *MergeLog and Multiframe tables in the WSA. If one of the multiframe UIDs (e.g. the Y image attribute, ymfID) in one of the *MergeLog tables has a default value because that particular frame set does not yet have a Y-band image, and if the attribute ymfID references the attribute multiframeID in table Multiframe, then table Multiframe needs an entire row of defaults for multiframeID=-99999999 in order to maintain the referential integrity of the database.

It is the responsibility of the user to ensure that the presence of default values and dummy rows will not corrupt the results of queries run on the WSA, but our decision to use them, rather than nulls greatly simplifies the logic involved in all queries run on the database. The inclusion of null values for an attribute means that an expression involving it can evaluate to TRUE, FALSE or UNKNOWN, and we believe that the simplification for users in avoiding this three-valued logic greatly outweighs the burden of having to remember that default values exist for some columns in the WSA.

3. Reference: additional options in SELECT statements

Previous sections have described the basic SELECT…FROM…WHERE… structure of an SQL query. This is the basic syntax to be employed for querying the WSA, but there are some additional options in the SELECT clause which users may find useful. Once again, a fuller reference than is presented below is available online here.

3.1 Aggregate Functions

SQL offers a number of useful aggregate functions, which can be used for deriving summary information on columns or selections from them.

3.1.1 Mathematical aggregate functions

The meanings of those mathematical aggregate functions which apply only to numeric columns are very straightforward: AVG, MAX, MIN, SUM, STDEV, STDEVP, VAR, and VARP return, respectively, the mean, maximum, minimum, sum, standard deviation, population standard variation, variance and population variance of all values in the specified column. They can be used in conjunction with a WHERE clause or not, i.e.

SELECT max(dec) FROM lasMergeLog

will return the maximum value for the dec column found in the lasMergeLog table, while

SELECT max(dec) FROM lasMergeLog WHERE (ra BETWEEN 180 AND 360)

returns the maximum value found for fields within one hemisphere of the sky.

N.B. Strictly speaking, these functions only apply to non-null values found within the particular column. As discussed above, the WSA contains no null values by design, but uses default values instead. These will not be ignored in the computation of aggregate functions. For example, the SQL query

SELECT min(mjdObs) FROM Multiframe

returns the value -0.9999995e9, which is clearly nonsensical astronomically, and just illustrates the unintentional selection of the dummy row in the Multiframe table.

3.1.2 COUNT

There are additional aggregate functions which can be run on columns of all types. The most useful of these is COUNT, which can be used in a number of ways, as shown in the following example. The programmeID attribute in the ProgrammeFrame table identifies the multiframes which belong to that programme in the archive: since the WSA combines data from several programmes (e.g. the UKIDSS subsurveys), there are several different values of programmeID. To find out how many, one could use the following query:

SELECT COUNT (DISTINCT programmeID) FROM ProgrammeFrame

In practice, this is best discovered from querying the WSA's Programme table. Note that the inclusion of the DISTINCT keyword means that only the number of distinct programmeID values was returned. If it had been omitted, then the returned value would be the total number of entries in the programmeID column, which is simply equal to the number of rows in the table, since there are no null values for programmeID in the ProgrammeFrame table.
The total number of rows in the table can be also obtained using the following syntax:

SELECT COUNT(*) FROM ProgrammeFrame

Using COUNT(*) like this is a very good way of ensuring that a query is sensible before getting data returned by running it, and users are strongly encouraged to use this method to develop and debug SQL queries before running them on the full WSA. The performance overhead involved in this two-query process is not as high as it might seem, since, depending on how heavily the database is being used, some of the data from the COUNT(*) query will still be in cache when the query is run a second time to extract the desired attributes. Users should note that COUNT should be replaced by COUNT_BIG if there is likely to be more than 2x10⁹ records to be counted: this circumvents an SQL Server bug.

3.1.3 TOP

An interesting function specific to SQL Server’s SQL dialect is TOP, which is illustrated as follows. The query

SELECT TOP 10 mjdObs FROM Multiframe

would return the MJDs for ten rows in the Multiframe table. This will generally not be the ten highest (i.e. most recent) values in the table; remember that SQL is a set-based language, so a query yields the set of rows satisfying the criteria specified in the query, but with, by default, no particular ordering within that set. The ten highest area values can be obtained using TOP, however, with the addition an ORDER BY clause, which does impose an ordering of the rows in the result set: i.e.:

SELECT TOP 10 mjdObs FROM Multiframe ORDER BY mjdObs DESC

Note the presence of the DESC (for descending) keyword in the ORDER BY clause, which is required because the default behviour is for that clause to list rows in ascending order.

3.1.4 GROUP BY and HAVING

The GROUP BY clause allows aggregate functions to return more than a single value. For example, the user running the "SELECT COUNT(DISTINCT programmeID) FROM ProgrammeFrame" query above might want to know how many WSA multiframes are associated with each programme. That information is returned by the following query:

SELECT programmeID, COUNT(*) FROM ProgrammeFrame GROUP BY programmeID

where the rows in ProgrammeFrame are grouped by their programmeID value and separate counts are made of the number of rows in each group.

3.2 Mathematical Functions

The following mathematical functions are supported by SQL Server's SQL dialect.

ABS(x): returns the absolute value of x.

ACOS(x): returns the angle, in radians and in the range 0 to π, whose cosine is x.

ASIN(x): returns the angle, in radians and in the range -π/2 to π/2, whose sine is x.

ATAN(x): returns the angle, in radians and in the range -π/2 to π/2, whose tangent is x.

ATN2(y,x): returns the angle, in radians, whose tangent is y/x.

CEILING(x): returns the smallest integer greater than, or equal to, x.

COS(x): returns the cosine of angle x in radians.

COT(x): returns the cotangent of angle x in radians.

DEGREES(x): returns the angle in degrees corresponding to angle x in radians.

EXP(x): returns the exponential of x.

FLOOR(x): returns the largest integer less than or equal to x.

LOG(x): returns the natural logarithm of x.

LOG10(x): returns the base 10 logarithm of x.

PI(): returns the value of π.

POWER(x,y): returns the value of x raised to the power y.

RADIANS(x): returns the angle in radians corresponding to angle x in degrees.

RAND(x): returns a random number between 0 and 1, using an optional integer seed x if supplied.

ROUND(x,n,m): returns the value of x rounded to a specified length or precision. If n is a positive integer, then x is rounded to the number of decimal places specified by n. If n is a negative integer, then x is rounded n places to the left of the decimal point. ROUND can also be used to truncate x, through use of the optional argument m. If m is a positive integer, then x is truncated to the precision specified by n, rather than rounded to it.

SIGN(x): returns the value 1, 0, -1, depending whether x is positive, zero or negative, respectively.

SIN(x): returns the sine of the angle x in radians.

SQUARE(x): returns the square of x.

SQRT(x): returns the positive square root of x.

TAN(x): returns the tangent of angle x in radians.

Arithmetic functions (such as ABS, CEILING, DEGREES, FLOOR, POWER, RADIANS, and SIGN) return a value having the same data type as the input value, while trigonometric functions and others (such as EXP, LOG, LOG10, SQUARE, and SQRT), cast their input values to float and return a float value; this probably is of no concern to the average WSA user. All mathematical functions, except for RAND, are deterministic functions - i.e. they return the same results each time they are called with a specific set of input values - and RAND is deterministic only when a seed parameter is specified.

3.3 Operators

An operator in SQL is a symbol specifying an action that is performed on one or more expressions. For the present purposes, their major use is to provide greater flexibility in the possible forms of WHERE clauses of queries and in the columns of the result sets they can produce, which need not simply be columns drawn from the table(s) being queried. There are several classes of operator to consider.

3.3.1 Arithmetic operators

SQL Server's SQL dialect supports five arithmetic operators. The four basic ones - Addition (+), Subtraction (-), Multiplication (*) and Division (/) - plus the Modulo operation, (%), which returns the remainder of dividing one integer by another, and is used in the format "dividend%divisor". The use of the four basic arithmetic operators follows straightforwardly from their definitions.

3.3.2 Comparison operators

The Boolean comparison operators are used most frequently to filter rows via the WHERE clause of a SQL query. The most simple comparison operators ( <, >, =) were used above without introduction, but there a total of nine comparison operators which can be applied to pairs of expressions in the SQL Server dialect of SQL: = (Equal to); > (Greater than); < (Less than); >= (Greater than or equal to); <= (Less than or equal to); <>(Not equal to); != (Not equal to); !> (Not greater than); and !< (Not less than).

3.3.3 Logical operators

In a similar fashion, we have used a number of the logical operators (e.g. AND, BETWEEN, etc) above without introduction, but the following is the full list of logical operators supported by SQL Server:

ALL: TRUE if all of a set of comparisons are TRUE.
AND: TRUE if both of two Boolean expressions are TRUE.
ANY: TRUE if at least one of a set of comparisons is TRUE.
BETWEEN: TRUE if the operand is within a specified range.
EXISTS: TRUE if a subquery contains any rows.
IN: TRUE if the operand is equal to one of a list of expressions.
LIKE: TRUE if the operand matches a pattern.
NOT: Reverses the value of any other Boolean expression.
OR: TRUE if either of a pair of Boolean expressions is TRUE.
SOME: TRUE if at least one of a set of comparisons is TRUE.

The use of some of these operators (e.g. AND, BETWEEN, OR) has been illustrated above, but it is worth making a few comments on the remaining ones.

The LIKE operator is used for pattern matching. This is most commonly used for string matching. For example, a user interested in knowing how many UKIDSS LAS multiframes are present in the WSA could issue the query:

SELECT COUNT(*) FROM Multiframe WHERE project LIKE 'U/UKIDSS/LAS%'

Note the use of the percentage sign wildcard character which matches all the UKIRT OMP project strings defined for the UKIDSS LAS. Several other wildcard characters can be used in conjunction with LIKE; these are described further in the SSA Cookbook, as are further examples of logical operators.

3.3.4 String concatenation operator

The plus sign, +, is used as a string concatenation operator in the SQL Server dialect of SQL. This is most likely to be of use to WSA users in the formatting of result sets - i.e. in the definition of SELECT clauses.

3.3.5 Unary operators

The SQL Server dialect of SQL defines three unary operators - i.e. operators which have only one operand - although none of these are likely to be of much use to most WSA users. The first of these is the positive unary operator, +, which is used principally in SQL statements not allowed by the WSA query interface. The second, -, is the negative unary operator, which returns the negative value of a numeric expression, as shown in the following query:

SELECT -MIN(decBase) FROM Multiframe WHERE MultiframeID > 0

The final unary operator, ~, the Bitwise NOT operator, converts each bit in a numeric expression of integer data type into its 1s complement (i.e. 0s are changed to 1 and vice versa). This might be of use in queries involving the quality flags in the *Source and *Detection tables.

3.3.6 Operator precedence

The operators described in this subsection have the following descending levels of precedence:

+ (Positive Unary Operator), - (Negative Unitary Operator), ~ (Bitwise NOT)
* (Multiply), / (Division), % (Modulo)
+ (Add), + (Concatenate), - (Subtract), & (Bitwise AND)
= , > , < , > = ,< = , < > , ! = , ! > , ! < (Comparison operators)
^ (Bitwise Exclusive OR), | (Bitwise OR)
NOT
AND
ALL, ANY, BETWEEN, IN, LIKE, OR, SOME

When two operators in an expression have the same operator precedence level, they are evaluated left to right based on their position in the expression.

4. How to query cross-matched data in the archive

Archive catalogue tables are automatically cross-matched to themselves and to a number of external survey catalogue datasets held locally, e.g. SDSS Data Releases; FIRST, IRAS, ROSAT and 2MASS catalogues; and legacy photographic catalogues like USNO-B (for a complete list, click on "Browser" on a navigation bar or see below). Rather than prejoining any two datasets to create a static, merged table of what are assumed to be associated sources on the basis of a fixed joining algorithm, and choosing a subset of what are assumed to be useful attributes from each of the two catalogue tables to propagate into that merged table, the philosophy is to create a table of pointers between any two matched datasets. This means, for example, that any externally catalogued source co-incident or nearby an archive source is readily available, out to some predefined maximum angular search radius, and all sorts of science usages are possible because the exact matching criteria (e.g. maximum allowable displacement, consistency in morphological parameters or classification, or even the number of possible external matches for a given archive source) can be tuned at query time by simple expressions in SQL. Furthermore, all attributes of both datasets are available to querying, because no decision has been made as to which attributes to propagate into a static merged set. The flip side to this flexibility is the rather opaque syntax for querying cross-matched data in SQL: instead of querying a single merged table, e.g. SELECT * FROM MergedTable WHERE ..., in general you have to query three tables: the two cross-matched tables and the table of pointers, e.g. SELECT * FROM Table1, ExternalDB..Table2, CrossNeighboursTable2 WHERE ..., using the joining techniques discussed above to select out the rows that you require. Some real-world examples best illustrate how to query cross-matched data in the WSA, but first we discuss the naming of WSA objects (databases, tables and attributes) that pertain to catalogue joining.

4.1 Naming convention for tables and attributes

In WSA parlance, a table of pointers that associates sources within one table (i.e. that provides a list of all nearby sources for each source within one table) is called a neighbour table, while a table of pointers that associates a set of externally catalogued sources to a table of WFCAM sources is called a cross-neighbour table. Cross-neighbour tables are likely to be used the most, but neighbour tables have their uses (for example, you may wish to check the internal consistency of WFCAM photometry and astrometry by selecting a set of measurements of the same sources in overlap regions of adjacent frame sets - this is most easily done using a neighbour table).

A list of all available neighbour/cross-neighbour tables is most easily obtained as follows:

SELECT neighbourTable FROM RequiredNeighbours

The naming convention is simple: neighbour tables are named by appending the word Neighbours to the source table that they point to, e.g. lasSource has a neighbour table called lasSourceNeighbours; while cross-neighbour tables are named by concatenating the two associated table names with an X, e.g. the cross-neighbour table between the UKIDSS LAS source table and the 2MASS point source catalogue is called lasSourceXtwomass_psc. Within these tables, there will always be the following three attributes: masterObjID, which points to the source for which neighbours have been created (the central source of the neighbourhood, if you like); slaveObjID, which points to the neighbouring sources (sources lying within the neighbourhood); and finally distanceMins which is the angular separation between the central source and the neighbour, in units of arcminutes. Depending on the external catalogue joined, there may be other attributes in cross-neighbour tables (use the Browser to examine these).

At the time of writing, external catalogue databases held within the WSA consist of:

External survey Catalogue product SQL database name Table name Description

Sloan Digital Sky Survey EDR SkyServerV3 PhotoObj SDSS Early Data Release (as released in the SkyServer RDBMS implementation, not the original SX OODBMS)

DR7 (5,3,2) BestDR7 (5,3,2) PhotoObj SDSS Data Release 7 (5,3,2)

2-Micron All-Sky Survey Point Source Catalogue TWOMASS twomass_psc 2MASS point source catalogue

Extended Source Catalogue TWOMASS twomass_xsc 2MASS point source catalogue

ROSAT All Sky Survey Bright Source Catalogue ROSAT rosat_bsc ROSAT all-sky survey bright source catalogue

Faint Source Catalogue ROSAT rosat_fsc ROSAT all-sky survey faint source catalogue

IRAS PSC IRAS iras_psc Infra-Red Astronomical Satellite point source catalogue

VLA-FIRST FIRST catalogue FIRST firstSource Faint Images of the Radio Sky at Twenty cm from the VLA survey

SSA Southern hemisphere survey SSA Source SuperCOSMOS Sky Survey as ingested into the SuperCOSMOS Science Archive

Millenium Galaxy Catalogue MGC MGC mgcDetection 37.5 square degree, medium-deep B-band imaging survey

NRAO VLA Sky Survey NVSS NVSS nvssSource A 1.4 GHz continuum survey covering the entire sky north of -40 deg declination.

Alternatively, you can use the WSA Browser to look at all the neighbour tables, their attributes and also the external databases that are available in the WSA, or use the following SQL query:

SELECT surveyName, databaseName, extTableName, description
FROM ExternalSurvey AS s, ExternalSurveyTable AS t
WHERE s.surveyID=t.surveyID AND s.surveyID>0

4.2 Selecting nearby cross-matches

Suppose a user wishes to select the identifiers and co-ordinates of all Faint Source Catalogue sources from the ROSAT All-Sky Survey that are within 6 arcsec of a UKIDSS LAS source. The SQL to do this is as follows:

SELECT las.sourceID, las.ra, las.dec, fsc.seqNo, fsc.ra, fsc.dec
FROM lasSource AS las, ROSAT..rosat_fsc AS fsc, lasSourceXrosat_fsc AS x
WHERE x.masterObjID=las.sourceID AND x.slaveObjID=fsc.seqNo AND x.distanceMins<0.1

Note: i) the use of table aliases las and fsc which are a convenient short-hand; ii) the use of attribute prefixes (e.g. las.ra) to distinguish between attributes in different tables that happen to have the same name; iii) the table joining condition in the WHERE clause, which selects associated rows (if you omit this condition, you will get all rows of each table joined to all other rows, i.e. an extremely large and useless dataset!); and finally iv) the specification of a maximum radius of 0.1 arcmin (=6 arcsec) for this query, where the maximum available for ROSAT joins is 30 arcsec. You can see the maximum joining radii (in arcsec) available to you in each neighbour/cross-neighbour table by issuing the following query:

SELECT neighbourTable, joinCriterion*3600.0 as maxRadius
FROM RequiredNeighbours

When selecting nearby cross-matches, users should note that one or more than one rows may result for each master object, since there may be more than one slave neighbour within the neighbourhood defined by the maximum join criterion specified. If you want the nearest match only, then the next section explains how to do this.

4.3 Selecting the nearest cross-match

Suppose a user wishes to select the infrared and optical point source photometry for a sample (say the first 50 rows for speed) in the UKIDSS LAS and SDSS DR2 overlap, taking the nearest SDSS source to each WFCAM source as being the most likely match, where the maximum positional tolerance is set to 2 arcsec. The following query will do the job:

SELECT TOP 50 las.ra, las.dec, yAperMag3, j_1AperMag3, hAperMag3, kAperMag3, psfMag_u, psfMag_g, psfMag_r, psfMag_i, psfMag_z
FROM lasSource AS las, BestDR2..PhotoObj AS dr2, lasSourceXDR2PhotoObj AS x
WHERE masterObjID=las.sourceID AND slaveObjID=dr2.ObjID AND distanceMins<0.033333 AND sdssPrimary=1 AND distanceMins IN (
SELECT MIN(distanceMins)
FROM lasSourceXDR2PhotoObj
WHERE masterObjID=x.masterObjID AND sdssPrimary=1)

Note the following: i) here, we have used table aliases as attribute prefixes only where necessary (i.e. where attributes are not unambiguously identified by their names alone); ii) we select "primary" objects only from the SDSS, as we are interested in a data set with no duplicates; iii) a subquery is used to select the closest neighbour in each case, and the main query selects this nearest object if it is within the specified 2 arcsec maximum radial tolerance.

4.4 Selecting neighbours within one table

For WFCAM survey tables that have neighbour tables defined, you can examine the neighbours of each source in the table. For example, suppose we want to check the internal consistency of WSA photometry and astrometry by selecting a sample of duplicate source measurements from overlap regions between adjacent frame sets in the UKIDSS LAS. The following query will do the job:

SELECT (s1.ra-s2.ra)*3600.0*COS(RADIANS(s1.dec)) as deltaRA,
               (s1.dec-s2.dec)*3600.0 as deltaDec,
               (s1.yAperMag3-s2.yAperMag3) as dy,
               (s1.j_1AperMag3-s2.j_1AperMag3) as dj1,
               (s1.hAperMag3-s2.hAperMag3) as dh,
               (s1.kAperMag3-s2.kAperMag3) as dk
FROM lasSource AS s1, lasSource AS s2, lasSourceNeighbours AS x
WHERE s1.yAperMag3 > 0 AND s2.yAperMag3 > 0 AND s1.j_1AperMag3 > 0 AND s2.j_1AperMag3 > 0
               AND s1.hAperMag3 > 0 AND s2.hAperMag3 > 0 AND s1.kAperMag3 > 0 AND s2.kAperMag3 > 0
               AND masterObjID=s1.sourceID AND slaveObjID=s2.sourceID AND distanceMins<0.01 AND distanceMins IN (
SELECT MIN(distanceMins)
FROM lasSourceNeighbours
WHERE masterObjID=x.masterObjID)

Note that in this case, a stringent pairing tolerance of 0.6 arcsec has been set (this corresponds roughly to a 3-sigma chop given the typical positional errors in WFCAM astrometry). Also, we have used the useful feature of user-defined computed columns to create a tabular dataset of attributes that are most useful to us (i.e. differences in positions in arcsec and differences in magnitudes in each available passband).

5. Refining sample selection using quality bit flags

Within the detection and source tables there are quality bit flags, labelled ppErrBits in the detection tables and *ppErrBits in the source tables, where * is the combined filter name and pass number e.g. for lasSource: yppErrBits, j_1ppErrBits, j_2ppErrBits, hppErrBits and kppErrBits. These attributes encode the quality issues associated with a given detection such that detections subject to more serious quality issues have higher values. For a detection with no known quality issues the attribute's value is zero. The bit flag value encodes up to 32 distinct quality issues, which are divided into four different classes according to severity. This allows you to select a sample of sources where the quality is better than a certain decimal threshold value corresponding to each warning classification listed in this table:

Warning Class	Threshold
Information	256
Warning	65536
Important	16777216
Severe

So, to select a sample that consists of sources with no known quality issues or just minor, informative, warnings we would add ppErrBits < 256 to the WHERE clause of our SQL query. This is partly how the *PointSource views generate a good quality view of the *Source tables for each programme. The reliable*PointSource view selects only those sources with no known quality issues. To consider the quality information from every passband in your selection, you can merge the source table *ppErrBits attributes as: (kppErrBits | hppErrBits | j_1ppErrBits) < 256 for as many passbands as exist in that programme's source table. As an example of such source selections in practice, suppose we want to produce a colour magnitude plot for galaxies in the DXS source table which have no important quality issues:

SELECT kPetroMag, jmkExt
FROM dxsSource WHERE mergedClass=1 AND kPetroMag>-9.99995e+8 AND jmkExt>-9.99995e+8
AND (jppErrBits | kppErrBits) < 65536

Initially one would wish to select just those sources with no known quality issues, and then widen the search to include more and more potentially spurious sources. To further refine our samples we can select or deselect certain quality issues according to their associated bit mask, the full list of which may be found in the glossary entries for the ppErrBits attributes. To select sources marked with a certain quality issue we would add ppErrBits & bitMask != 0 to the WHERE clause of our SQL query, where bitMask is the bit mask value (decimal or hexadecimal) assigned to that quality issue as listed in the ppErrBits glossary entry. Conversely, to exclude a certain quality issue we would add ppErrBits & bitMask = 0 to the WHERE clause of our SQL query.

For example, having initially excluded all sources with important quality issues, we might wish to include a sample that are possibly contaminated by cross-talk artefacts. To do this we just extend the SQL query as follows:

SELECT kPetroMag,jmkExt
FROM dxsSource WHERE mergedClass=1 AND kPetroMag>-9.99995e+8 AND jmkExt>-9.99995e+8
AND ((jppErrBits | kppErrBits) < 65536 OR (jppErrBits | kppErrBits) & 0x00080000 != 0)

where the 0x00080000 is the bit mask, in hexadecimal notation, that represents the "possibly contaminated by cross-talk artefacts" quality issue.

Finally, to select a sample of galaxies that do not lie within a dither offset of the stacked J-band frame boundary (and so contains complete data), but can be affected by any other quality issue:

SELECT kPetroMag,jmkExt
FROM dxsSource WHERE mergedClass=1 AND kPetroMag>-9.99995e+8 AND jmkExt>-9.99995e+8
AND jppErrBits & 0x00400000 = 0

where the 0x00400000 is the bit mask, in hexadecimal notation, that represents the "source within a dither offset of the stacked frame boundary" quality issue.

Link to a fully detailed guide to the design of the quality bit flag attribute, ppErrBits, together with the complete list of quality issues with full descriptions and associated bit masks.

6. Refining sample selection for filter coverage

When making selections from the survey merged source tables *Source, users are reminded to think carefully about their requirements as regards filter coverage in the sample. Because releases of survey data are made prior to full completion of the surveys (obviously to expedite science exploitation), selections will be incomplete in various subtle ways depending on exactly how the sample predicates are specified. Prior to Data Release 2, two survey databases were created for each, for example UKIDSSDR1 and UKIDSSDR1PLUS. The former contained all frames, frame sets associated using those frames, and merged sources within those frame sets where full filter coverage was available at the release point; the later was a superset of the former and consisted of all available frames, frame sets and merged sources. For example, for the LAS, in UKIDSSDR1 the table lasMergeLog consists of frame sets having coverage in all of Y, J (i.e. first epoch, J_1), H and K, and the table lasSource contains sources merged within those frame sets. That is not to say that all sources would necessarily be measured in all wavebands, since objects of extreme and/or unusual colour may not be detected in one or more filters. However, clearly in the "non-plus" database (e.g. UKIDSSDR1), for any source not detected in one or more of the filters specified for a given programme (LAS, GCS etc.), valid and scientifically useful conclusions may be drawn as to the colour of the source. This is most certainly not the case in the "plus" (e.g. UKIDSSDR1PLUS) database, however, because absence in one or more of the specified filters may simply be because the data are missing. For example, the Hyades cluster in the GCS in UKIDSSDR1PLUS appears as a large number of K-only merged source records in gcsSource simply because the ZYJH data (by design we hasten to add) had not been observed at the time of the release.

From Data Release 2 onwards, the situation is different owing to disk space constraints (and other administrative overheads) in curating two distinct, similar sized copies of the release databases. Now, only the "plus" database is created, but views are defined within that DB to provide the same kind of functionality as provided by the subset *MergeLog and *Source tables in the "non-plus" database as was. Consider the LAS: two new views are available, called lasYJHKmergeLog and lasYJHKsource. The defining SQL for each is relatively simple, but for convenience these are nonethless provided for users to easily reproduce equivalent sample selections that previously would have been made in the "non-plus" DB. If you wish to do the equivalent of a SELECT from UKIDSSDR1 in DR2, simply query lasYJHKsource. Consider the SQL defining the view lasYJHKsource:

SELECT s.*
FROM lasSource s, lasYJHKmergeLog l
WHERE s.frameSetID = l.frameSetID

is a join between lasSource and another view, lasYJHKmergelog on the common referencing key frameSetID which uniquely identifies each frame set. The SQL defining this second view is:

SELECT *
FROM lasMergeLog
WHERE ymfID > 0 AND j_1mfID > 0 AND hmfID > 0 AND kmfID > 0

where this simply selects all frame sets that have coverage in all of YJHK; consequently the join query between this and lasSource returns sources from frame sets having complete filter coverage - exactly what would have resulted from SELECT * FROM lasSource in a "non-plus" DB. One could achieve the same results in SQL without recourse to the view as follows:

SELECT s.*
FROM lasSource s, lasMergeLog l
WHERE s.frameSetID = l.frameSetID AND j_1mfID > 0 AND hmfID > 0 AND kmfID > 0

We reiterate once again the distinction between selecting sources that have measurements in all filters, and sources contained within frame sets that have full filter coverage; the former could be obtained straightforwardly from:

SELECT *
FROM lasSource
WHERE yObjID > 0 AND j_1ObjID > 0 AND hObjID > 0 AND kObjID > 0

and excludes sources (for example) that are too faint in one or more of the passbands YJHK, as well excluding those sources from frame sets that have incomplete passband coverage. It depends on the exact nature of the science application which of the selections is the correct one to choose.

7. Using multi-epoch data: light curves etc

7.1 Selecting variable stars

SELECT v.sourceID
FROM dxsSource AS s,dxsVariability AS v
WHERE s.sourceID=v.sourceID AND s.mergedClass=-1 AND v.variableClass=1

This will select the stars which have been classified as variable using the current algorithm in the Schema Browser glossary.

Other selections using the statistics calculated in the Variability table can be used, e.g. selecting bright objects with high rms in J and K and large absolute skew in each band.

SELECT v.sourceID
FROM dxsSource AS s,dxsVariability AS v,dxsVarFrameSetInfo AS i
WHERE s.sourceID=v.sourceID AND s.mergedClass=-1 AND v.frameSetID=i.frameSetID AND v.kMeanMag<(i.kexpML-3.) AND v.kMeanMag>0. AND v.jMeanMag<(i.jexpML-3.) AND v.jMeanMag>0. AND v.jMagRms>(5.*v.jExpRms) AND v.kMagRms>(5.*v.kExpRms) AND abs(v.jskewness)>10. AND abs(v.kskewness)>10.

7.2 Producing light curves

SELECT m.mjdObs,d.aperMag3,d.aperMag3Err,d.ppErrBits,d.seqNum,x.flag
FROM dxsSourceXDetectionBestMatch AS x,dxsDetection AS d,Multiframe AS m
WHERE x.sourceID=446677514563 AND x.multiframeID=d.multiframeID AND x.extNum=d.extNum AND x.seqNum=d.seqNum AND x.multiframeID=m.multiframeID AND d.filterID=5 ORDER BY m.mjdObs

This gives the necessary data for the k-band light curve for sourceID=446677514563 in the DXS. Objects with d.ppErrBits>0 are flagged and objects with d.seqNum<0 are non-detections. Objects with x.flag=1 are also assigned to another source and those with x.flag=2 are non-detections within one dither offset of the frame edge.

7.3 Selecting variable light curves

SELECT m.mjdObs,d.aperMag3,v.kexpRms,d.ppErrBits,d.seqNum,x.flag
FROM dxsSourceXDetectionBestMatch AS x,dxsDetection AS d,Multiframe AS m,dxsVariability AS v
WHERE x.multiframeID=d.multiframeID AND x.extNum=d.extNum AND x.seqNum=d.seqNum AND x.multiframeID=m.multiframeID AND d.filterID=5 AND v.sourceID=x.sourceID AND x.sourceID IN
(SELECT TOP 1 v.sourceID
FROM dxsVariability AS v,dxsSource AS s, dxsVarFrameSetInfo AS i
WHERE v.sourceID=s.sourceID AND v.frameSetID=i.frameSetID AND v.kMeanMag>0 AND v.kMeanMag<(i.kExpML-3.) AND v.jMeanMag>0 AND v.jMeanMag<(i.jExpML-3.) AND v.kSkewness>2. AND s.mergedClass IN (-1,-2) AND v.knGoodObs>10 AND v.variableClass=1) ORDER BY m.mjdObs

This gives the necessary data for the k-band light curve for the first source selected which meets the following criteria: 02.; star-like; number of k-band observations > 10; overall classification is a variable. In this case the error bars are the expected rms of a non-variable object with the mean magnitude of this source.

7.4 Selecting light curves from correlated filter programmes

SELECT l.meanMjdObs,ss.kaperMag3,v.kexpRms,ss.kppErrBits,ss.kseqNum,ss.synSeqNum,x.flag
FROM calSourceXSynopticSourceBestMatch AS x,calSynopticSource AS ss,calSynopticMergeLog AS l,calVariability AS v
WHERE x.synframeSetID=ss.synframeSetID AND x.synSeqNum=ss.synSeqNum AND x.synFrameSetID=l.synFrameSetID AND v.sourceID=x.sourceID AND x.sourceID IN
(SELECT TOP 1 v.sourceID
FROM calVariability AS v,calSource AS s, calVarFrameSetInfo AS i
WHERE v.sourceID=s.sourceID AND v.frameSetID=i.frameSetID AND v.kMeanMag>0 AND v.kMeanMag<(i.kExpML-3.) AND v.hMeanMag>0 AND v.hMeanMag<(i.hExpML-3.) AND v.kSkewness>2. AND s.mergedClass IN (-1,-2) AND v.knGoodObs>80 AND v.variableClass=1 AND v.hkiWS>1000. AND jhiWS>1000.) ORDER BY l.meanMjdObs

Some programmes, such as the standard star (CAL) observe all filters over a short duration to get colours that are correlated and are independent of variability. In these data sets, we put the colour information at each epoch into a SynopticSource table and match this to the Source table. This selection is the same as 7.4, except the J-band selections have been replaced by H-band selections and we have included a selection on the strength of the correlation between the H and K bands: hkiWS. The magnitudes come from the calSynopticSource table, which is the table of matched filters at a specific epoch, specified in the calSynopticMergeLog table, by the meanMjdObs attribute. The matching is done using the synFrameSetID and synSeqNum, but users must be aware that a synoptic frameset in SynopticMergeLog may not include all filters, if some are deprecated. The flag is also more complicated, see the Glossary for details.

7.5 Selecting colour-light curves from correlated filter programmes

SELECT l.meanMjdObs,ss.hmkPnt,ss.kppErrBits,ss.kseqNum,ss.synSeqNum,x.flag
FROM calSourceXSynopticSourceBestMatch AS x,calSynopticSource AS ss,calSynopticMergeLog AS l,calVariability AS v
WHERE x.synframeSetID=ss.synframeSetID AND x.synSeqNum=ss.synSeqNum AND x.synFrameSetID=l.synFrameSetID AND v.sourceID=x.sourceID AND x.sourceID IN
(SELECT TOP 1 v.sourceID
FROM calVariability AS v,calSource AS s, calVarFrameSetInfo AS i
WHERE v.sourceID=s.sourceID AND v.frameSetID=i.frameSetID AND v.kMeanMag>0 AND v.kMeanMag<(i.kExpML-3.) AND v.hMeanMag>0 AND v.hMeanMag<(i.hExpML-3.) AND v.kSkewness>2. AND s.mergedClass IN (-1,-2) AND v.knGoodObs>10 AND v.variableClass=1) AND v.hkiWS>10000. ORDER BY l.meanMjdObs

This is the same as 7.5, but the magnitude has been replaced with the H-K colour, so in this case the user can see how the colour varies over time.

WFAU, Institute for Astronomy,
Royal Observatory, Blackford Hill
Edinburgh, EH9 3HJ, UK

wsa-support@roe.ac.uk
16/2/2016

An Introductory SQL Tutorial for WSA Users

Contents

1. Introduction