I'm exploring various design options for a query language for elephant. This has implications for how class indicies are created and used and there is quite a variety of ways to specify, optimize and execute queries.
We would like to build off of existing examples of query access to persistent object databases. SymbolicsStatice is one of the original systems created for the Symbolics Lisp Machine. Modern Lisp OODBs include AllegroCache? and Rucksack?, two different approaches to the problem.
Queries in relation and object models are typically performed in terms of lists of constraints on relations/classes over tuples/instances. Objects have more expressiveness than the relation model due in large part to direct object references and inheritance and this needs to be accommodated in the query language implementation.
Query languages typically parse into an intermediate form which is a variation on a RelationalAlgebra. There are a huge number of variations, some of which are discussed under that topic.
I'm working with a syntax similar to the SymbolicsStatice for-each. A query might look something like the following:
;; Example query:
;;
;; All managers making over 100k with subordinates on projects
;; started between july 5th and november 1st who are either
;; in "Marketing" or "Administration"
(query ((mgr person))
(:with (emp person) (proj project))
(:declare (type (department person) department))
(:where
(between (start-date proj) (convert-date "July 5th 1996") (convert-date "November 1st 1996"))
(= (project emp) proj)
(member (name (department emp)) '("Marketing" "Administration"))
(= (supervisor emp) mgr)
(>= (slot salary mgr) 100000)))
- says to identify a set of objects (or object tuples) suitable for mapping or querying that constraints the universe of possible objects to the class 'person' under the following constraints.
- with establishes a correspondence between a label and a class type. This is a placeholder and is critical if we want to identify instances that have specific relations with other instances rather than just filtering on an instance's properties. This is essentially the object database version of Join.
- Each constraint expresses the constraint type in list application form. The arguments are references or values.
- References describe the slot or other attribute of a class member. References can be nested? Because we don't have typed schemas, we cannot always optimize queries using nested references without hints from the user - and that is onerous.
- Values are anything that can be equated or ordered (typically strings and numbers, but could also be objects (equalp obj1 obj2), etc. Do we test the serialized form or the deserialized form during query processing?
One challenge with the semantics of this query form is what happens when parameters are unbound? Is the query recompiled each time? Does it depend on what variables are unbound?
I would also like to see users able to add their own constraints; lisp functions that evaluate a predicate over an object.
Kinds of constraints:
- Slot-value constraints (the string or number value of a slot)
- Indirect slot-value constraints (ref to object & slot)
- Slot-join constraints (obj reference or shared slot value)
- Ordering constraints on query
- Heirarchy constraints (has-class, subclass-of)
To support basic query optimizations prior to query planning we can target the RelationalAlgebra as an intermediate format. This will help motivate the final selection of surface syntax. Query planning is a relatively simple matter of ordering the relation operations and choosing execution options (indexed, scan, merge-join, etc) for each query type to optimize the path cost. Costs are the hard part, so for now we'll guess and later perhaps we can use the query engine to annotated indices with histogram information to improve the query planning. Depends on whether we're getting users that require complex and expensive queries.
Some other notes:
;; Simple loop queries to start ;; - index or iteration? ;; Dynamic programming of repeated iterations over same sets ;; if set large enough; avoids repeated serializer overhead ;; Query plans and query optimization ;; Shared indices for class heirarchy ;; Fast oid-based set joins & sorts? ;; Aggregation operators (count, max, min, etc)
