GPath - A Generic Object Query Mechanism

Author: Martyn Cutcher

email martyn@cutthecrap.biz

website www.cutthecrap.biz

For years there has only been one query langauge that most software developers were aware of - SQL. The Structured Query Language.

Despite unsuitability, object database vendors were forced to invent SQL-like query languages to allow "familiar" access to data.

Recently the RDF and TopicMap worlds have fallen in with this trend for totally inappropriate query languages.

XML - XPath

For all its faults one positive side-effect has emerged from the popularity of XML. And that is an appreciation that it does not fit well with the relational model. In fact its fits so badly that it is now commercially acceptable to store and access XML data from non-relational databases.

The XPath query language is also becoming increasingly accepted as a better and more convenient way to access XML data.

This is a very interesting development that Microsoft has followed up with the specification of its own OPath language, for querying of its ObjectSpaces "environment".

Cut The Crap

Let's not get too excited by either of these languages. XPath is the usual committee designed cludge that does a reasonable job (can anyone explain how the weird translate function got into such a simple language?), while OPath is a solution to a specific problem rather than a well thought out approach to a more general problem domain.

Both tho' are interesting systems to examine and to see how they are used within their environments.

RPath?

As part of a more general analysis, I wondered what an "XPath-like" language might look like for a relational database?

Let us consider first a restriction that it would only return complete rows, so:

SELECT * FROM CUSTOMER.

Would be the same as:

/CUSTOMER

and..

SELECT * FROM CUSTOMER WHERE NAME='John'.

equivalent to:

/CUSTOMER[@NAME='John']

As soon as you get into navigation (the equivalent of inner/outer joins etc) you need something equivalent to the id() extension to XPath I discuss below. In RPath I suggest an explicit join() function, consider the following:

/CUSTOMER[@NAME='John']/JOIN(ORDER.CUSTOMER, @ID)

Would return the ORDER rows for which the CUSTOMER column has the value from the ID column in the CUSTOMER record matching the column NAME with the value 'John' ... phew!

If we now wanted to pick out the associated ORDERITEM rows, the total query would look something like this:

/CUSTOMER[@NAME='John']
  /JOIN(ORDER.CUSTOMER, @ID)
  /JOIN(ORDERITEM.ORDER, @ID)

Which I think is an interesting idea at least.

Result Sets

SQL queries are often thought powerful (and convenient) because they can package up name-value pairs from the joined tables. They do not always simply return a single complete row.

The implicit context of the nested query can be used to define a function I will call COLLECT - (note this syntax is all in UPPERCASE to make it seem more SQL COMPATIBLE). Consider the following:

/CUSTOMER[@NAME='John']/COLLECT(ADDRESS)
  /JOIN(ORDER.CUSTOMER, @ID)/COLLECT(TOTAL, @PRICE)

The result of such a query would be a list of "objects", one for each ORDER that each referenced the PRICE of each ORDER using the name TOTAL and the ADDRESS of the customer using the name ADDRESS.

Hopefully, you should be able to appreciate how beautifully scalable and understandable this approach is. I'll pursue this a little further:

Some RPath examples

We'll simply steal a few SQL examples off the web :-)

SELECT User.LastName, User.FirstName, User.PCType, PC.Disk
FROM User INNER JOIN PC ON User.PCType = PC.PCTypeName
WHERE User.Dept="Development";

...becomes

User[@Dept='Development]/COLLECT(*)
  /JOIN(PC.PCTypeName, @PCTYPE)/COLLECT(Disk)

while

SELECT DISTINCT User.FirstName, User.LastName, Software.ProductName
FROM (UserSoftware INNER JOIN User ON UserSoftware.EmpNo = User.EmpNo) 
INNER JOIN Software ON UserSoftware.ProductID = Software.ProductID;

...becomes

User/COLLECT(*)
  /JOIN(UserSoftware.EmpNo, @EmpNo)
  /JOIN(Software.ProductID, @ProductID)/COLLECT(ProductName)

To combine the two queries is slightly more problemattical.

SELECT DISTINCT User.FirstName, User.LastName, Software.ProductName, PC.DISK
FROM ((UserSoftware INNER JOIN User ON UserSoftware.EmpNo = User.EmpNo) 
INNER JOIN Software ON UserSoftware.ProductID = Software.ProductID) 
INNER JOIN PC ON User.PCType = PC.PCTypeName;

There is a temptation to backtrack and state that this complexity is only required to overcome the inherent constraints of relational constructs. That if you query "real objects" that values can be retrieved directly without such contrived approaches.

These sentiments certainly have my sympathy. However, it is also technically interesting to consider an approach to solve the problem proposed :-)

The query pattern is a bit like following down some track, picking up some data, then backtracking to a previous position and following another path.

So, let's see what that looks like, following the path explicitly:

User/COLLECT(FirstName, LastName)
  /SAVE/JOIN(PC.PCTypeName, @PCType)/COLLECT(PCDISK)/RESTORE
  /JOIN(Software.ProductID, @ProductID)/COLLECT(ProductName)

Note that the SAVE and RESTORE would simply save and restore the current ROW focus. The query still proceeds as for a normal join, just that other values can be "picked up" on the way.

I think that's pretty neat!

OUTER JOINS

These examples only address what are called INNER JOINS. OUTER JOINS are joins that do not follow such a simple nested structure, but which allow the collection of values that are not "fully" satisfied.

This is best understood by considering the "control flow" of the result set iteration. A possible approach is to represent this by some kind of "delimitter function", hence:

SELECT DISTINCT User.FirstName, User.LastName, Software.ProductName
FROM (User LEFT JOIN UserSoftware ON User.EmpNo = UserSoftware.EmpNo) 
LEFT JOIN Software ON UserSoftware.ProductID = Software.ProductID;

..might me specified by

User/COLLECT(FirstName, LastName)/OUTER/
  /JOIN(UserSoftware.EmpNo, @EmpNo)
  /JOIN(Software.ProductID, @ProductID)/COLLECT(ProductName)

It seems pretty clear to me that this approach is going to work pretty well. This isn't surprising since "semantically" this idea maps directly onto a query plan, whereas the first task of any SQL query engine is to attempt to generate just such a plan. So... if the plan can be specified directly and in a simple manner, why bother with SQL?

Boring

Okay, I'm getting bored now, it's not my intention to create a better relational query language - though that doesn't seem to have proved too tough - I want a better Object query language.

The "data" navigation features of XPath and OPath can be applied effectively to the Generic Persistent Object Model. I will define a syntax - GPath - to provide a convenient mechanism to navigate/query the objects held within.

First Step, XPath Enhancement - id() done right

XPath can be considered a "navigational" syntax. But it is heavily oriented towards mapping onto the hierarchical XML structures. ("Heavily" is an understatement)

But...

XML is only really "hierarchical" in the same way that a relational database has only "flat" tables.

Relational databases work because "foreign keys" reference the contents of one table from another.

In the same way XML structures can reference each other using "references":

<topicmap>
  <topic id="conceptA">
   <basename>
    <basenamestring>Concept A</basenamestring>
   </basename>
  </topic>
  
  <topic id="someInstance">
   <occurrence>
    <instanceof>
     <topicref xlink:href='#conceptA'/>
    </instanceof>
   </occurrence
  </topic>
</topicmap>

In the above snippet from an XTM topic map a reference is made from one node to another using the "xlink:href" attribute.

Here is perhaps a more meaningful example:

<company name="Northwind">
  <customer id="CUST001" name="A Customer"/>
  
  <supplier id="SUPPLY001" name="A Supplier/>
  
  <product id="PROD001" name="A Widget" unitPrice="1.50" supplier="#SUPPLY001"/>
  
  <order id="ORD001" customer="#CUST001">
    <order-line product="#PROD001" quantity="2"/>
  </order>
</company>

How will XPath help to retrieve data from this XML?

The answer is in the use of the XPath id() function. The id() function is the XPath equivalent of the getElementById W3C DOM API function - with the additional option that a space delimited list of "identities" can be provided.

It is usually used as follows:

doc.queryXPath("id('someId')/children[someTest(@someAttr)]);

... but syntactically correct of course.

Consider though that the id() function could be passed the value of an attribute:

doc.queryXPath("id('ORD001')/order-line/id(@product)");

Would then return the set of "product nodes" referenced from the order lines of a specific order. I think this is worth more consideration.

GPath

GPath is a query language to navigate an object system of associated objects:

Always A Focussed Object

In GPO "global" queries do not make sense. There will be no equivalent to the "closed world" queries of XPath that seek to match some "sub-tree" pattern.

The "sub-tree" XPath patterns work because a DOM is a strict hierarchy. Of course this is complete nonsense really. XML elements will have attribute values that are used to reference other nodes, exactly the same way as foreign keys do in a relational database. But what the hell, the XPath committee liked the look of something navigating and returning XML nodes from the DOM tree so that's what we got.

In GPO there is no implicit containment. Objects simply have relationshiops with other objects, in either one-one, one-many and many-to-many associations.

Simple Example

Suppose a GPO references another object using a "parent" property:

child1.set("parent", p);
child2.set("parent", p);

The set the "children" (for want of a better description) could be retrieved by:

p.getLinkSet("parent");

We could consider a "query" that would be evaluated to do something similar:

p.query("parent");

...except that we would define it to return an Iterator as a more generic and scalable return value than a LinkSet.

In GPO we can use the Striterator patterns to restrict inclusion

Striterator iter = p.getLinkSet("parent").iterator();

final String tstnme = "John";
iter.addFilter(new Filter() {
  protected boolean isValid(Object obj) {
    return tstnme.equals(((GPOMap) obj).get("name"));
  }
} );

...and we could expect the same result from the following query:

p.query("parent[@name='John']");

In XPath there is a concept of "attribute nodes". This idea does not apply to GPO, objects simply have "values" to named properties.

Considering the XPath enhancement using the id() function, we can simplify by simply "resolving" the value of the property:

p.query("parent[@name='John']/@company/@manager");

This is a simple translation for the GPOMap properties:

/@property is translated as gpo.get(property) that will resolve a single reference, while:

/property is translated as gpo.getLinkSet(property) to return multiple values.

Seamless Partial Queries

Those who have read more widely around the Generic Persistent Object Model will be aware of the Striterator values that are used to reference sets of objects.

The query method will return a Striterator also. In fact the query evaluation is a direct match onto the Striterator patterns as we have already seen with the examples above.

By providing an additional static evalQuery method on the GPathQuery class it is straightforward to combine queries:

Striteratior res = p.query("parent[(= @name 'John')]/@company/@manager");

would have the same result as:

Striteratior res = p.query("parent[(= @name 'John')]");

GPath.evalQuery(res, "@company/@manager");

Furthermore, these queries could be combined seamlessly with direct java code using the standard Striterator filters, for example:

Striterator res = p.query("parent");

final String tst = "John";
res.addFilter(new Filter() {
  protected boolead isValid(Object obj) {
    return tst.equals(((IGPOMap) obj).get("name"));
  }
} );

GPath.evalQuery(res, "@company/@manager");

The implication from this is that while simple queries can be supported by the standard GPath syntax, more complex queries can be programmed in java and integrated using the partial query mechanism.

S-expression Functions

At the simplest level an s-expression syntax may not seem such a good idea:

children[@attr > 0]

Looks nicer than:

children[(> @attr 0)]

The advantage is that s-expression interpretation is unambiguous and amenable to extension in a simple way.

In the same way that the GPO s-expression formula primitives can be simply added, so can new functions be defined for GPath. There may be an argument about whether a query language should be extendable, but it would seem rather churlish to actively prevent its extendability. At present, such an approach enhances the possibility of scalable incremental development.

The reason why s-expressions are unambiguous is simple, everything is either a function or a primitive value, a function is always prefixed by a '(' character and the first token of the function call is the name of the function. The function is then responsible for interpreting the values of its arguments. It is just so simple it must be right. Which is why of all computer language syntax it is the lisp s-expression syntax that has remained unchanged since 1957.

(and (> @attr 2) (<= @attr 10))

as opposed to:

(@attr > 2) and (@attr <= 10)

Native Syntax

The use of s-expressions makes more sense as the integration with the GPO formulas is understood. The GPath query is first converted to a native s-expression representation before compiling:

parent[(= @name 'John')]/@company/@manager

becomes...

(query (expand parent (= (resolve name) 'John')) (resolve company) (resolve manager))

This s-expression is then compiled as a GPO formula as any other formula. It is just that the query primitive will return a Striterator object when evaluated.

The query primitive also does not cache the result and is not subject to dataflow updating. Each time a result is requested a new set of values is returned.

Integration with Striterators

In order to unify the formula approach with the query mechanism the primitive expressions such as expand and resolve must be able to be inserted into the Striterator structures. Similarly, boolean formula primitives such as = and >= now derive from an abstract EVAL_BOOL class that implements the IFilter interface and allows them to be integrated to provide the predicate evaluation to determine set inclusion.

This approach ensures that the computation defined by the query maps onto the Striterator patterns in a direct way. The GPath query is simply a simpler way to describe the navigation required.

Alchemist and MetaSpec integration

When resolving GPath property names these will by default simply be interpreted as the value supplied. However, a query language is intended to be readable, so long cumbersome string expressions will cause unnecessary confusion. Each object could therefore provide a mapping from the "Meta" name and the "physical" name.

For example, the property name "Order" for an OrderItem instance may map to the "real" property name "orderitem/order".

The translation method would be implemented by any MetaSpecSrc class, an Alchemist generated OrderItem class may have a method that would access a static HashMap:

public String mapPropertyName(String property) {
  String res = s_propertyMap.get(property);
	
  if (res != null) {
    return super.mapPropertyName(property);
  } else {
    return res;
  }
}

This will allow the implementation details to be abstracted further.

Example Jython Session

Following is a snapshot of a jython session interactively demonstrating the query language:

>>> from cutthecrap.gpo import GPOMap;
>>> from cutthecrap.gpo.client import OMClient;
>>> from cutthecrap.gpq import GPathQuery;
>>>
>>> client = OMClient();
>>> om = client.getObjectManager();
>>> sc = GPOMap(om);
>>> sc.set("name", "school");
>>>
>>> def addChild(nme) :
...     c = GPOMap(om);
...     c.set("name", nme);
...     c.set("school", sc);
...
>>> addChild("Richard");
>>> addChild("Nicola");
>>> addChild("Samuel");
>>> addChild("Naomi");
>>>
>>> def show(q) :
...   while q.hasNext() :
...     print q.next();
...
>>>
>>> q1 = sc.query("school[(= 'N' (substr @name 0 1))]/@name");
>>> show(q1);
Nicola
Naomi
>>> q2 = sc.query("school/(str 'the child''s name is ' @name)");
>>> show(q2)
the child's name is Richard
the child's name is Nicola
the child's name is Samuel
the child's name is Naomi
>>>

The example interactions demonstrate the "predicate" syntax for testing set inclusion as well as integration with the GPO formula s-expressions.

Summary

The GPath query language defines an efficient and highly functional solution to querying the Generic Object Model.

The analysis that has led to GPath has illuminated some specific deficiences in XPath that if fixed would provide a significant enhancement.

The thought experiment RPath as a possible query mechanism for relational databases looks quite interesting and worthy of future investigation.

Is GPath Available Now?

Yes, it is provided as part of the Cut The Crap Software full distribution available here.