body { margin:0; padding:0; font-family:times; font-size:9pt; } p { margin-top:5; margin-bottom:5; font-weight:plain; font-family:serif; font-size:9pt; } h1 { margin-top:5; font-size:16pt; font-weight:bold; text-align:center; font-family:sans-serif; } h2 { margin-top:20; font-size:10pt; font-weight:bold; text-align:left; font-family:sans-serif; } h3 { margin-top:20; font-size:9pt; font-weight:bold; text-align:left; font-family:sans-serif; } pre { color:#993333; font-size:9pt; } code { color:#993333; } .section { text-align:justify; padding:5; } .section .console { background-color:#003300; color:#22FF22; padding:5; } .note { font-size:8pt; font-style:italic; color:#000000; padding-left:5; padding-right:5; padding-top:5; padding-bottom:5; background-color:#CCCCCC }
For years there has only been one query langauge that most software developers were aware of - SQL. The Structured Query Language.
Despite unsuitability, object database vendors were forced to invent SQL-like query languages to allow "familiar" access to data.
Recently the RDF and TopicMap worlds have fallen in with this trend
for totally inappropriate query languages.
For all its faults one positive side-effect has emerged from the popularity of XML.
And that is an appreciation that it does not fit well with the relational model. In fact its fits so badly that
it is now commercially acceptable to store and access XML data from non-relational
databases.
The XPath query language is also becoming increasingly accepted as a better and
more convenient way to access XML data.
This is a very interesting development that Microsoft has followed up with the specification of
its own OPath language, for querying of its ObjectSpaces "environment".
Let's not get too excited by either of these languages. XPath is the usual committee
designed cludge that does a reasonable job (can anyone explain how the weird translate
function got into such a simple language?), while OPath is a solution to a
specific problem rather than a well thought out approach to a more general problem domain.
Both tho' are interesting systems to examine and to see how they are used within their environments.
As part of a more general analysis, I wondered what an "XPath-like" language might look like for a relational database?
Let us consider first a restriction that it would only return complete rows, so:
SELECT * FROM CUSTOMER.
Would be the same as:
/CUSTOMER
and..
SELECT * FROM CUSTOMER WHERE NAME='John'.
equivalent to:
/CUSTOMER[@NAME='John']
As soon as you get into navigation (the equivalent of inner/outer joins etc) you need
something equivalent to the id() extension to XPath I discuss
below. In RPath I suggest an explicit join() function, consider the
following:
/CUSTOMER[@NAME='John']/JOIN(ORDER.CUSTOMER, @ID)
Would return the ORDER rows for which the CUSTOMER column
has the value from the ID column in the CUSTOMER record matching
the column NAME with the value 'John' ... phew!
If we now wanted to pick out the associated ORDERITEM rows, the total
query would look something like this:
/CUSTOMER[@NAME='John'] /JOIN(ORDER.CUSTOMER, @ID) /JOIN(ORDERITEM.ORDER, @ID)
Which I think is an interesting idea at least.
SQL queries are often thought powerful (and convenient) because they can package up
name-value pairs from the joined tables. They do not always simply return a
single complete row.
The implicit context of the nested query can be used to define a function I will call
COLLECT - (note this syntax is all in UPPERCASE to make it seem more
SQL COMPATIBLE). Consider the following:
/CUSTOMER[@NAME='John']/COLLECT(ADDRESS) /JOIN(ORDER.CUSTOMER, @ID)/COLLECT(TOTAL, @PRICE)
The result of such a query would be a list of "objects", one for each ORDER
that each referenced the PRICE of each ORDER using the name
TOTAL and the ADDRESS of the customer using the name ADDRESS.
Hopefully, you should be able to appreciate how beautifully scalable and understandable this approach is. I'll pursue this a little further:
We'll simply steal a few SQL examples off the web :-)
SELECT User.LastName, User.FirstName, User.PCType, PC.Disk FROM User INNER JOIN PC ON User.PCType = PC.PCTypeName WHERE User.Dept="Development";
...becomes
User[@Dept='Development]/COLLECT(*) /JOIN(PC.PCTypeName, @PCTYPE)/COLLECT(Disk)
while
SELECT DISTINCT User.FirstName, User.LastName, Software.ProductName FROM (UserSoftware INNER JOIN User ON UserSoftware.EmpNo = User.EmpNo) INNER JOIN Software ON UserSoftware.ProductID = Software.ProductID;
...becomes
User/COLLECT(*) /JOIN(UserSoftware.EmpNo, @EmpNo) /JOIN(Software.ProductID, @ProductID)/COLLECT(ProductName)
To combine the two queries is slightly more problemattical.
SELECT DISTINCT User.FirstName, User.LastName, Software.ProductName, PC.DISK FROM ((UserSoftware INNER JOIN User ON UserSoftware.EmpNo = User.EmpNo) INNER JOIN Software ON UserSoftware.ProductID = Software.ProductID) INNER JOIN PC ON User.PCType = PC.PCTypeName;
There is a temptation to backtrack and state that this complexity is only required to overcome the inherent constraints of relational constructs. That if you query "real objects" that values can be retrieved directly without such contrived approaches.
These sentiments certainly have my sympathy. However, it is also technically interesting to consider an approach to solve the problem proposed :-)
The query pattern is a bit like following down some track, picking up some data, then backtracking to a previous position and following another path.
So, let's see what that looks like, following the path explicitly:
User/COLLECT(FirstName, LastName) /SAVE/JOIN(PC.PCTypeName, @PCType)/COLLECT(PCDISK)/RESTORE /JOIN(Software.ProductID, @ProductID)/COLLECT(ProductName)
Note that the SAVE and RESTORE would simply
save and restore the current ROW focus. The query still proceeds
as for a normal join, just that other values can be "picked up" on the way.
I think that's pretty neat!
These examples only address what are called INNER JOINS.
OUTER JOINS are joins that do not follow such a simple
nested structure, but which allow the collection of values that are
not "fully" satisfied.
This is best understood by considering the "control flow" of the result set iteration. A possible approach is to represent this by some kind of "delimitter function", hence:
SELECT DISTINCT User.FirstName, User.LastName, Software.ProductName FROM (User LEFT JOIN UserSoftware ON User.EmpNo = UserSoftware.EmpNo) LEFT JOIN Software ON UserSoftware.ProductID = Software.ProductID;
..might me specified by
User/COLLECT(FirstName, LastName)/OUTER/ /JOIN(UserSoftware.EmpNo, @EmpNo) /JOIN(Software.ProductID, @ProductID)/COLLECT(ProductName)
It seems pretty clear to me that this approach is going to work pretty well. This isn't surprising since "semantically" this idea maps directly onto a query plan, whereas the first task of any SQL query engine is to attempt to generate just such a plan. So... if the plan can be specified directly and in a simple manner, why bother with SQL?
Okay, I'm getting bored now, it's not my intention to create a better relational query language - though that doesn't seem to have proved too tough - I want a better Object query language.
The "data" navigation features of XPath and OPath can be applied
effectively to the Generic Persistent Object Model. I will define a syntax -
GPath - to provide a convenient mechanism to navigate/query the objects held
within.
id() done rightXPath can be considered a "navigational" syntax. But it is heavily oriented
towards mapping onto the hierarchical XML structures. ("Heavily" is an
understatement)
But...
XML is only really "hierarchical" in the same way that a relational database
has only "flat" tables.
Relational databases work because "foreign keys" reference the contents of one table from another.
In the same way XML structures can reference each other using "references":
<topicmap>
<topic id="conceptA">
<basename>
<basenamestring>Concept A</basenamestring>
</basename>
</topic>
<topic id="someInstance">
<occurrence>
<instanceof>
<topicref xlink:href='#conceptA'/>
</instanceof>
</occurrence
</topic>
</topicmap>
In the above snippet from an XTM topic map a reference is made from one
node to another using the "xlink:href" attribute.
Here is perhaps a more meaningful example:
<company name="Northwind">
<customer id="CUST001" name="A Customer"/>
<supplier id="SUPPLY001" name="A Supplier/>
<product id="PROD001" name="A Widget" unitPrice="1.50" supplier="#SUPPLY001"/>
<order id="ORD001" customer="#CUST001">
<order-line product="#PROD001" quantity="2"/>
</order>
</company>
How will XPath help to retrieve data from this XML?
The answer is in the use of the XPath id() function. The
id() function is the XPath equivalent of the getElementById
W3C DOM API function - with the additional option that a space delimited list of "identities"
can be provided.
It is usually used as follows:
doc.queryXPath("id('someId')/children[someTest(@someAttr)]);
... but syntactically correct of course.
Consider though that the id() function could be passed the value of an
attribute:
doc.queryXPath("id('ORD001')/order-line/id(@product)");
Would then return the set of "product nodes" referenced from the order lines of a specific order. I think this is worth more consideration.
GPath is a query language to navigate an object system of associated objects:
In GPO "global" queries do not make sense. There will be no equivalent to the
"closed world" queries of XPath that seek to match some "sub-tree" pattern.
The "sub-tree" XPath patterns work because a DOM is a strict
hierarchy. Of course this is complete nonsense really. XML elements will have
attribute values that are used to reference other nodes, exactly the same way as foreign keys
do in a relational database. But what the hell, the XPath committee liked the look of
something navigating and returning XML nodes from the DOM tree
so that's what we got.
In GPO there is no implicit containment. Objects simply have relationshiops with
other objects, in either one-one, one-many and many-to-many
associations.
Suppose a GPO references another object using a "parent" property:
child1.set("parent", p);
child2.set("parent", p);
The set the "children" (for want of a better description) could be retrieved by:
p.getLinkSet("parent");
We could consider a "query" that would be evaluated to do something similar:
p.query("parent");
...except that we would define it to return an Iterator as a more generic
and scalable return value than a LinkSet.
In GPO we can use the Striterator patterns to restrict inclusion
Striterator iter = p.getLinkSet("parent").iterator();
final String tstnme = "John";
iter.addFilter(new Filter() {
protected boolean isValid(Object obj) {
return tstnme.equals(((GPOMap) obj).get("name"));
}
} );
...and we could expect the same result from the following query:
p.query("parent[@name='John']");
In XPath there is a concept of "attribute nodes". This idea does not
apply to GPO, objects simply have "values" to named properties.
Considering the XPath enhancement using the id() function,
we can simplify by simply "resolving" the value of the property:
p.query("parent[@name='John']/@company/@manager");
This is a simple translation for the GPOMap properties:
/@property is translated as gpo.get(property) that will resolve
a single reference, while:
/property is translated as gpo.getLinkSet(property) to return
multiple values.
Those who have read more widely around the Generic Persistent Object Model will
be aware of the Striterator values that are used to reference sets of objects.
The query method will return a Striterator also. In fact the
query evaluation is a direct match onto the Striterator patterns as we have
already seen with the examples above.
By providing an additional static evalQuery method on the GPathQuery
class it is straightforward to combine queries:
Striteratior res = p.query("parent[(= @name 'John')]/@company/@manager");
would have the same result as:
Striteratior res = p.query("parent[(= @name 'John')]");
GPath.evalQuery(res, "@company/@manager");
Furthermore, these queries could be combined seamlessly with direct java
code using the standard Striterator filters, for example:
Striterator res = p.query("parent");
final String tst = "John";
res.addFilter(new Filter() {
protected boolead isValid(Object obj) {
return tst.equals(((IGPOMap) obj).get("name"));
}
} );
GPath.evalQuery(res, "@company/@manager");
The implication from this is that while simple queries can be supported by the
standard GPath syntax, more complex queries can be programmed in java
and integrated using the partial query mechanism.
At the simplest level an s-expression syntax may not seem such a good idea:
children[@attr > 0]
Looks nicer than:
children[(> @attr 0)]
The advantage is that s-expression interpretation is unambiguous and amenable
to extension in a simple way.
In the same way that the GPO s-expression formula primitives can
be simply added, so can new functions be defined for GPath. There may be an argument
about whether a query language should be extendable, but it would seem rather churlish to
actively prevent its extendability. At present, such an approach enhances the possibility of
scalable incremental development.
The reason why s-expressions are unambiguous is simple, everything is either
a function or a primitive value, a function is always prefixed by a '(' character
and the first token of the function call is the name of the function. The
function is then responsible for interpreting the values of its arguments. It is just so simple it
must be right. Which is why of all computer language syntax it is the lisp
s-expression syntax that has remained unchanged since 1957.
(and (> @attr 2) (<= @attr 10))
as opposed to:
(@attr > 2) and (@attr <= 10)
The use of s-expressions makes more sense as the integration with the GPO
formulas is understood. The GPath query is first converted to a native
s-expression representation before compiling:
parent[(= @name 'John')]/@company/@manager
becomes...
(query (expand parent (= (resolve name) 'John')) (resolve company) (resolve manager))
This s-expression is then compiled as a GPO formula as
any other formula. It is just that the query primitive will return
a Striterator object when evaluated.
The query primitive also does not cache the result and is not subject
to dataflow updating. Each time a result is requested a new set of values is returned.
In order to unify the formula approach with the query mechanism the primitive expressions
such as expand and resolve must be able to be inserted into
the Striterator structures. Similarly, boolean formula primitives
such as = and >= now derive from an abstract EVAL_BOOL
class that implements the IFilter interface and allows them to be integrated
to provide the predicate evaluation to determine set inclusion.
This approach ensures that the computation defined by the query maps onto the
Striterator patterns in a direct way. The GPath query is simply
a simpler way to describe the navigation required.
When resolving GPath property names these will by default simply be interpreted
as the value supplied. However, a query language is intended to be readable, so long cumbersome
string expressions will cause unnecessary confusion. Each object could therefore provide
a mapping from the "Meta" name and the "physical" name.
For example, the property name "Order" for an OrderItem instance may map to
the "real" property name "orderitem/order".
The translation method would be implemented by any MetaSpecSrc class, an
Alchemist generated OrderItem class may have a method that would
access a static HashMap:
public String mapPropertyName(String property) {
String res = s_propertyMap.get(property);
if (res != null) {
return super.mapPropertyName(property);
} else {
return res;
}
}
This will allow the implementation details to be abstracted further.
Following is a snapshot of a jython session interactively demonstrating
the query language:
>>> from cutthecrap.gpo import GPOMap;
>>> from cutthecrap.gpo.client import OMClient;
>>> from cutthecrap.gpq import GPathQuery;
>>>
>>> client = OMClient();
>>> om = client.getObjectManager();
>>> sc = GPOMap(om);
>>> sc.set("name", "school");
>>>
>>> def addChild(nme) :
... c = GPOMap(om);
... c.set("name", nme);
... c.set("school", sc);
...
>>> addChild("Richard");
>>> addChild("Nicola");
>>> addChild("Samuel");
>>> addChild("Naomi");
>>>
>>> def show(q) :
... while q.hasNext() :
... print q.next();
...
>>>
>>> q1 = sc.query("school[(= 'N' (substr @name 0 1))]/@name");
>>> show(q1);
Nicola
Naomi
>>> q2 = sc.query("school/(str 'the child''s name is ' @name)");
>>> show(q2)
the child's name is Richard
the child's name is Nicola
the child's name is Samuel
the child's name is Naomi
>>>
The example interactions demonstrate the "predicate" syntax for testing
set inclusion as well as integration with the GPO formula
s-expressions.
The GPath query language defines an efficient and highly functional
solution to querying the Generic Object Model.
The analysis that has led to GPath has illuminated some specific deficiences
in XPath that if fixed would provide a significant enhancement.
The thought experiment RPath as a possible query mechanism for relational databases
looks quite interesting and worthy of future investigation.
Yes, it is provided as part of the Cut The Crap Software full distribution available here.