Cut The Crap: A Personal History

Author: Martyn Cutcher

email martyn@cutthecrap.biz

website www.cutthecrap.biz

Background is important when trying to understand something. Both your own and an appreciation for that of the thing you are trying to understand. Cut The Crap Software includes a number of key technologies, from the Generic Persistent Object model that provides not only Persistence but also a Generic Object Model (sic) to The Alchemist that builds on GPO to generate a working system.

- Here is the story of its development:

When You're Right You're Right

"Okay, I want you all to take off your watches and put them in your desks .... will you all standup ... now, I want each of you to sit down when you think that one minute has passed, starting from .... NOW!"

I was ten years old. I was quite musical and good at keeping time. I imagined a clock ticking, and counted the seconds pass...

After around 20 seconds kids started to sit down. By the time I'd counted to 40 I was standing alone and there were a few sniggers. I sat down when I reached 60 to the general amusement of everyone.

"You may be interested to know that Martyn sat down at exactly one minute"

I took a valuable lesson from this, and I've always remembered it:

When you're right you're right - no matter who or how many people disagree with you!

Early Computing Experience

My first real experience of computer programming was as a systems programmer on a "retro" mini computer System 25. It can be described as "retro" because it was developed specifically to simulate the old Singer System 10 computer - but with more memory and faster processor etc.

The System 25 was programmed using a rather odd assembler language. Modern programmers would be horrified to discover the restrictions under which such machines were programmed, but it is surprising what can be achieved if you don't know any better.

My first "Cut The Crap" moment came when I was drafted on to a multi-user word processing project. I had observed the team working for several months and had been impressed by the amount of flow charts that had been produced to design the system. I had never drawn a proper flow chart myself, with all those different symbols.

I was told to take over a specific component - Form Design. Apparently it just needed "finishing off". I must have been pretty naive at the time to believe this line. I got the source code listings and started having a look (at that time you rarely viewed code online). It all seemed very complicated as I traced the program flow through the different blocks of code.

After about two hours I was still confused, so I approached the guy who had produced the code:

"I'm sorry Trev, but I can't find any code that does anything, it just seems to jump around testing different state variables"

"Yep, that's right, I've coded all the program flow but not any of the actions"

I checked through the code again. After a few minutes I saw what was happening. The "state variables" were nothing to do with the document state, or editing state, or anything really useful. It was to do with handling the "multi-user" events and returning the program to the right state to process the next user event. "In module#1, goto module#1, in mode#23, goto bit responsible for mode#23, in state#12, goto bit handling state#12, in substate#6, goto bit handling substate#6" ...and so on ...and on..

Bloody Hell! There is no way that this project could ever work if such complexity was needed just to handle a keyboard click. It was certainly more complex than I could handle. And what the hell was the rest of the team of experienced programmers doing going along with this crap?

About a week later, I showed the team leader three lines of code - THREE LINES OF CODE - that achieved what the previous pages of junk attempted to approximate to, simply to be able to remember where a specific user was when the last event had been processed.

Cut The Crap - Just Solve The Problem

The Impedance Mismatch: How Computers Are Programmed

Do you want to know what most computer programs are doing? Pretty much nothing you are bothered about most of the time to be sure.

They are spending most of their time moving data from one format to another.

Before someone coined the term "Information Technology", a more accurate term was used - "Data Processing". This describes what is happening most of the time, except that the "processing" bit is a bit misleading. Really, most programs simply convert data from one format to another. This is mainly because the data must be stored on some "permanent" media, such as tape or disk, but that for a program to "process" the data, it must be read into the computer memory so that other instructions can access it.

But "reading into the computer memory" doesn't quite tell the story. The problem is that it is not only "where" the data is, but what "format" it is in. Most data has to be converted from an external representation to an internal one that computer programs can manipulate more simply.

So, you change some external value:

  1. read data
  2. convert data
  3. modify data
  4. convert data
  5. write data

As data becomes more complex, so the conversion - transformation - process becomes more significant. This is the so called "impedance mismatch". And this is really what most programmers spend 90% of their time doing, even tho' probably 90% of them don't even realise it.

Maybe now could be the moment to mention XML - if that doesn't mean anything to you don't worry ...in fact be grateful.

It's Getting Worse Not Better

Surely not! Surely we always make things simpler. Nope, sorry to disappoint you but nowadays there are now more boundaries that data needs to cross, and more opportunities to screw things up.

How so? Well, GUI's for a start. Graphical applications need not only to transform data so that it can be stored and retrieved, but also to transform it so it can be displayed graphically.

With Web Applications the boundaries keep multiplying. Data must be retrieved from a data source, packaged up and processed as HTML. Any validation is split between parameterized HTML controls, javascript that executes in a browser and further processing when the data is returned.

To configure and define so called "Web Services" is a mass of definition files and custom "objects" that aid the migration of the data. Sure, there are tools available that can take away some of the grunt work, but fundamentally its very complicated.

None of this stuff really smells that good!

So What Can Be Done?

The way that we solve problems is that we develop abstractions.

The way that we solve complex problems is to create good abstractions.

The way to solve really complex problems is to develop really good abstractions!

So... when we discover that the abstractions (models) we have developed are really difficult to use, what should we do? Change the model!!

Its not really that hard is it?

Well apparently it is. Because people get really attached to the models they use to understand things. Take religion for example. Actually "religion" is a good word because it seems to be accepted that people should not get too "religious" about their ideas. What most people mean tho' is that "other" people should not be "religious" about "their" ideas.

Case History: A Client-Server Project

Several years ago I was involved with an ambitious project to automate a large organisation, integrating production, finance, planning etc.

The basic architecture was what is known as a "thick" client-server model, with client GUI application communicating using a defined transaction interface with a VAX-based server.

When I arrived on the project, I was horrified at the complexity that the programmers were dealing with. What horrified me even more tho' was the acceptance of the methodologies used by all members of the team. No-one seemed to be aware of how much effort they were wasting with the approaches they were using.

I was not concerned with the basic model, but simply the lower-level mechanisms that were implemented to enable the client-server implementation. Without going into details, believe me, it was just horrible.

But these mechanisms had been developed so painfully, that no-one would hear anything against them. I'm sorry (well not really), but when you're right you're right! So I persisted. They were paying me for my experience, so it would have been morally wrong not to give them the benefit of it.

Eventually, I was able to demonstrate to them simple mechanisms that both hugely simplified the communication mechanisms, and then to go a couple of stages further and develop a "good model" that genuinely supported the complex problems they were trying to solve.

But still it took a long time to convince people that it wasn't some kind of trick.

I remember the team leader looking amazed at a colleague's code when he finally understood just how much simpler the tasks had become.

"No really, that is all I have to do!"

This project was quite significant for me, because it was the first time that I had developed commercially a "persistent object model". Of course, I never used such a phrase within the project. It would have scared the crap out of them. But I was able to explain the separate elements of what I did (in terms of caching data and such) and they could see for themselves the benefits.

So What Is A Persistent Object Model?

Glad you asked. Essentially it is, or should be, a model that removes the "impedance mismatch" between program memory and external data storage.

Say again?

Okay, some people like to talk about programming "tiers" or "layers". You have the "data access" tier, or the "business object" tier. But personnally I don't like these approaches, this is because they seem concerned with abstracting the "process" of data transformation rather than addressing the more fundamental problem of data "identity".

Stay with me for a while here, and I'll try to explain.

Let's suppose that in our system there is a Customer. Customers have names, addresses, telephone numbers and such. But Customers do not just exist by themselves, they place Orders. So in our system there are references between Orders and Customers. More specifically, we might say that every Order is for one Customer and that each Customer may have made many Orders.

Now, in a "tiered" model we would probably use a Data tier object such as CustomerData to retrieve... customer data - seems to make sense. A Business tier object would know how to retrieve OrderData objects associated with a specified Customer.

But....

If we go around the circle a few times we will find that multiple CustomerData objects may be retrieved that refer to the same Customer. Is this a bad thing? Yes. Why? Because it is that's why, and if you don't see it then there will be little I can say to persuade you otherwise.

In a "true" persistent object model, identity is maintained. The in-memory object should be a cached representation of the external data, and should keep the external representation in sync when it is updated. This has a number of advantages, one being performance since we do not need to refetch an object we already have, but the principle reason is that identity is maintained.

Furthermore, in-memory objects should implement behaviour commonly found in so-called "Business tier" objects, namely object navigation.

I want to be able to say:

cust = system.getCustomer("John Brown");
orders = cust.getOrders();
while orders.hasNext():
  order = orders.next();
  print cust.getName(), "placed order", order.getReference();

A good persistent object model, should solve the problem of "impedance mismatch" and provide the programmer direct access to the data they require.

How Do You Build A Persistent Object Model?

We are going to develop a relatively dumb persistent object system, but hopefully you'll get the point.

Let's start with a relational database. We have our Customer Table and Order Table.

I could get a little religeous myself here over the kind of Primary Key that should be defined, but it's not "that" important for this example.

We will define a "root" system object we can call TheSystem, this will provide two routines:

Customer getCustomer(custId);
Order getOrder(ordId);

TheSystem must implement a table that refers to all the current "in-memory" persistent objects. It does not matter precisely how this is arranged, but the point is that when it is requested an object, it must check first to see if it already exists. If it does then it should return that one, otherwise it should retrieve it from store and store a reference in its table before returning it.

The other little point is that the Customer and Order objects should know about TheSystem, this means that it is easy for the following routine to be written for Order.

Customer getCustomer() {
  return TheSystem.getCustomer(myCustomer);
}

There are other details like "how do you release objects from the table?", and "how do you deal with sets of values?", but I'll leave those for you to ponder.

The Next Generation

After implementing several commercial persistent object models that mapped to relational databases, I was increasingly frustrated with the constraints I was facing. These constraints stopped me building the kind of models I felt I needed to solve really complex problems. Remember I said you need "really good models" to solve really complex problems.

Privately I worked on developing a C++ system that implemented a powerful generic persistent data model. I used elements of this system in several commercial retail projects. The core features were a native binary file storage system to which the objects were saved, and a persistent object referencing type. This meant that an object did not need to know anything about the type of object it referenced in order to be able to retrieve it later on.

Enter Java

My first java project involved persistent knowledge representation. The idea was to build a persistent system that mapped to a relational database, and allowed convenient navigation of knowledge structures. That was fine in itself, but I was keen to convert my C++ system to see how much slower java was. It was just an exercise, I had no expectation that the system would be usable. It didn't take long, and I was hugely surprised at the performance I got from java. Wow! At last a system with automatic memory management that I may be allowed to use commercially! It may be sad, but I was really excited.

After a few weeks I could demonstrate that my native java persistent object system had far better performance than one based on a relational database.

We continued developing java systems for a couple of years and then the team folded.

At this point I began developing the Cut The Crap Software.

Cut The Crap: Just Solve The Problem

I was determined to do this right. I knew I would make mistakes and the acceptance of this made it easier to change my tack, sometimes throwing large amounts of work away. I had not failed, I had learnt something. Failure would have been continuing knowing that it was wrong.

Firstly I wanted to address the file storage mechanism. This would become the IStore component currently available.

It appeared to me that there were two things that were not right with my initial approaches, and I just "knew" that they were related. One was the need to buffer data before allocating it storage, and the second was the mechanism to store large data elements, so called BLOBs (Binary Large Objects).

The problem was solved by understanding what was really happening. I would stream data to a buffer, find out how large the buffer was, and then allocate the external storage, and copy the data.

stream = createStream();
stream.write(data);
address = store.allocate(stream.getBuffer());

I realised that I could change this pattern as follows:

stream = store.allocate();
stream.write(data);
address = stream.save();

This may appear to be a minor thing, but it means that I need not worry about BLOBS fragmenting the store, since I can allocate them incrementally from chunks of smaller blocks. And anyway, when it's right it's right!

While working on the IStore package I decided to provide a second implementation. This would be the WORMStore. The WORMStore is so-called because it is a Write Once Read Many store. No data is ever overwritten. This has a number of advantages, the main one being that it is possible to access previous store states, but also it has an extremely low management overhead since, of course, there is very little management required.

The standard RWStore is also highly functional and scalable. The addressing scheme has a dual purpose, firstly to support constant time allocation and re-allocation, and secondly to support the addressing of huge datastores. Furthermore, the allocation control information is organised in such a way that the storefile can be grown efficiently - there is no need to pre-allocate a large file.

Both stores implement efficient transaction control. The RWStore also provides IO optimisations in the form of buffered and sorted writes to mitigate against the locality of reference issues with random access updates.

Once I was happy with the IStore implementations I turned to the core package, GPO the Generic Persistent Object model.

My previous experience gave me a good appreciation of what I wanted to achieve but there were several problems to be solved. None seemed like they should be too hard tho'. I just had to stay focussed.

The Generic Persistent Object Model

One of the problems I wanted to solve was that of Referential Integrity. This meant that I wanted to ensure that there could never be a situation when one object referred to another that no longer (or never did) exist.

This problem is endemic in data systems worldwide and is generally accepted as something to avoid but manage when it happens. I don't know what the statistics are but I wouldn't mind betting that a very large percentage of database corruptions are caused by such integrity problems.

But in GPO I could not allow such problems to occur. This was not just an aim, it was a requirement, because such a model relies on internal consistency.

The initial approach was to maintain special dependency structures. So if one object referenced another, dependency references were added to help the "target" object find all the "referers". When "lists" of object were considered things got quite complex, but still manageable.

Something wasn't right tho'. I seemed to have so many of these bloody references around it was difficult to tell what was a set of object references and what was a set of dependency references!

The more I looked at it it all seemed crazy. Sure it worked, but it was too complicated and there seemed to be tons of duplication. There had to be a better model!

The moment when I saw the right approach I just laughed out loud. I love throwing code away, and I had quite a lot of junk now.

They were the same thing! The list association structures and the dependency structures did the same thing, they defined a one-to-many object association.

It took a few hours to completely re-implement the structures. Now here was something dead simple, and dead right. You define a one-to-many association between two objects by the "many" each referencing the "one" with the same property.

The "dependency" links are now just the "set" links, here is an example:

many1.set("parent", one);
many2.set("parent", one);
many3.set("parent", one);
one.getLinkSet("parent").size();
3

Okay, don't worry about the names like getLinkSet, the point is that the one object "knows" about all the objects that refer to it using the "parent" property.

So should the one object be removed, it is able to find all those objects that refer to it, and get them to clear their reference.

Basic GPO Structure

Let's see if I can explain the GPO pattern in a few lines.

Firstly, GPO structures are heavily based on a common computer structure called a Linked List. A Linked List is a list defined by linking one member to the next. A Double Linked List is one where there is also a link to the previous member.

A GPO property, like "name", can be simple data, like a string "A String" or a number 1234. But it can also be a reference to another GPO object.

Let's suppose that there is a property "parent" and it references a GPO we can call parentGPO, then the property value is a special type that has these kind of references:

target: parentGPO;
previous: prevGPO;
next: nextGPO;

This means that prevGPO and nextGPO both point to parentGPO with the same property.

In parentGPO is another structure. Let's suppose that there are only three GPOs that point at it, there will be a LinkSet structure like this:

property:"parent";
head: prevGPO;
tail: nextGPO;
count: 3;

Using these structures, parentGPO is able to find all the GPOs that point to it using the property "parent".

Set Identity

This structure has a number of qualities:

It has identity because it is uniquely the property of an identifiable object (parentGPO).

It is scalable in space because no matter how large the set becomes it is represented with simple links between each object.

It is scalable in time because no matter how large the set becomes it is the same cost to add or remove members, find the size of the set and to retrieve each member.

Indexing

The linking structure can be thought of as a highly efficient index into a set of objects.

If you know about relational databases you might like to consider how tables are joined using queries to discover these kind of relations, what kind of indexes you would use and how scalable and efficient they are.

Iterators

An important feature of GPO is that a LinkSet will return an Iterator that allows each object to be retrieved:

gpos = parentGPO.getLinkSet("parent").iterator();

while gpos.hasNext():
  agpo = gpos.next();
  print agpo.get("name");

To be more accurate, the LinkSet will return something called a Striterator. This object allows filters to be applied to the objects to incrementally transform the set, for example:

names = parentGPO.getLinkSet("parent").iterator().resolve("name");

while names.hasNext():
  print names.next();

Lookup: Classification

Although navigating around the object properties and sets is a large part of what is required when using a model, it is also frequently necessary to be able to find some object based on some property value. For example a car registration number.

To enable this, the GPO model defines an object called a Classifier.

A Classifier has a lot in common with a database index, but there are some significant differences.

It is possible to register a classifier like this:

registerClassifier("parent", "name");

This means that whenever a LinkSet is created by one GPO referencing another using the "parent" property, a Classifier will be created that will "classify" the set using the values of the "name" properties of each member.

Read that again slowly until you get it....

You can now use the classifier like this:

ls = parentGPO.getLinkSet("parent");
nclss = ls.getClassifier("name");

aGPO = nclss.getValue("Freddy");

and if there is a GPO in the LinkSet with the value "Freddy" set to the property "name", it will be returned.

Classifiers are different to database indexes in a number of ways:

You can get hold of them rather than being hidden and only accessible to special query processors. You can ask things like "How many members exist within a certain range, like getRangeCount("A", "C"), or you can find the index offset of a specific value, or the value of a specific offset. This kind of functionality makes it possible to build GUI interfaces that navigate large sets of objects based on any classified properties.

They are owned by an identifiable object, meaning that the Classifiers operate over object associations. So, if you have ten thousand parents and one million children, there will be ten thousand classifiers, each with on average one hundred entries.

Listeners

In order to be solve the problem of providing Classifiers I needed to address the problem of changing property values, and adding and removal from a LinkSet, consider the following:

aGPO.set("name", "Richard");
aGPO.set("parent", parentOne);
aGPO.set("parent", parentTwo);
aGPO.set("name", "Nicola");

It should be clear that both the LinkSet membership is being changed and then also the "name" property value. Both these actions will require management of the Classifier objects.

In order to solve this problem, I developed two Listener interfaces. An interface is a protocol, or set of behaviours, that enables different objects to communicate (exchange information) in an agreed manner.

The two interfaces were a SetListener and a PropertyListener.

A SetListener would "listen" to changes to set membership, and a PropertyListener "listen" to changes in property values.

Using these approaches I could arrange that the Classifiers were always kept in sync with changes to the GPO values.

A lot of work went into ensuring that these "listener" structures were also scalable. The obvious approach worked well, but it was clear that certain patterns of use could create problems where the listener structures became extremely large and unscalable.

This problem was solved by introducing more linking structures that self-balanced between the objects.

Formulas

A goal for many years had been an approach that could unify the spreadsheet and the "conventional" programming models.

When desiging and implementing the Classifiers and Listener features, it seemed to me that I was very close to the kind of structures that would allow "spreadsheet-type" formula definitions. The essence behind a spreadsheet is the triggering of computations across both sets and individual values. I was already doing this for the Classifier, so it should be possible to generalize the solution and support formulas.

I thought about this for many months before finally addressing it. Initially I incrementally modified the Classifier and Listener patterns so that they served a more general purpose. Then I defined a new interface - Computation. A Computation could be set to a property value but when the property value was requested, rather than returning the Computation object, it would return the value computed instead.

I tested a few different Computation objects, and defined different Listeners so that their values would be re-computed.

I integrated these patterns with the Classifier patterns so that the values of Computation objects could be classified.

It all worked, but it was too complicated. So I left it for a few months.

When I returned to the problem it was suddenly much clearer. First of all I defined a simple syntax that would let me define formulas, rather than having to write java code. But the main trick was the way that the formulas were created.

Rather than writing an interpreter that parsed and processed the formula I used an approach I had seen elsewhere with a system called Kawa. Kawa is a java based implementation of the language Scheme. Now the approach used in Kawa to parse the expression and generate a set of java objects that could then be evaluated. Maybe a simple example:

(+ 2 3)

Now one might write an interpreter that simply evaluated the expression:

eval("(+ 2 3)");

But behind the scenes the Kawa approach would end up with this:

eval = new Plus(2, 3);

return eval.eval();

Now, the point about this is that we end up with a computation object, and objects can implement many protocols, AND OBJECTS HAVE IDENTITY.

So, with this realisation I set about building some formula objects.

After a few iterations you could say things like:

aGPO.define("whoAmI", "(str 'My name is ' (-> name))");

aGPO.set("name", "Martyn");
aGPO.get("whoAmI");
'My name is Martyn'
aGPO.set("name", "Jackie");
aGPO.get("whoAmI");
'My name is Jackie'

Believe me, I was excited by this. So far the world has not realised how significant a breakthrough this is .... one day maybe

The reason I was excited was that I knew how it worked. I knew that I had designed it to be scalable, that it did not matter how many formulas depended on another, that it was really easy to extend, that.....

Hangon, maybe you didn't get that point there, "formulas depended on another". Yep, you can have one formula provide values to another, and they get calculated just like in a spreadsheet, except that its not a spreadsheet, its a persistent object model that might have millions of objects in it.

When it's right it's right!

Semi-Mechanical

As I developed more models using GPO I became aware of two things.

The first was that this was pretty simple and effective. It really did work, was easy to use and performed reliably.

The second was that it seemed pretty mechanical. I'd think of some model, and directly I was able to write the code to implement it. When pushed I could implement very complex systems in a few days that would normally be expected to take a team weeks if not months.

The simplicity of the GPO model was revealing patterns of use that I had not noticed before. But now that I had noticed them I realised that this could be taken further.

The Alchemist

The Alchemist is a system generator. I've worked on a number of systems where different code generators were used, but The Alchemist is more ambitious than a simple code generator.

The Alchemist will take a relatively simple description and generate a working system ...and it's not a toy. This is no joke.

A really neat thing about The Alchemist is that because it generates pretty much everything, the description you need to provide is a lot simpler than say, a UML Model. The reason is that a UML model needs to provide details for an implementor to create the system, while The Alchemist uses a number of simple rules to generate pattern-based code.

The data model used by The Alchemist is non-trivial, so it should not surprise you to learn that it uses GPO. In fact, the latest version of The Alchemist uses a model that it generated itself (sic).

MetaData

A key feature of the system generated by The Alchemist is that the GPOs generated provide support for a MetaData interface. This means that they provide information about themselves like "When you want to display me, use this property value", and "I own a set of objects and can be used to create them", and stuff like that.

This MetaData interface can be used to dynamically create interfaces to the system. At present there are two approaches used. One is a Web Application based on a servlet using JSPs, and another is a standalone application built using InterActor - a GUI package part of Cut The Crap Software and based on Java 2D.

Type Restriction and Validation

One aspect of the MetaData worthy of mention is the simple but effective idea of "type restriction".

Suppose you have some property called "email", the java type of the property will be java.lang.String. But most strings like "this is a string" is not a valid email.

The model description used by The Alchemist allows the definition of a "restriction", such as "length:5-20", which would mean that the String value must have a length of between 5 and 20 characters.

In the case of "email", you can just state:

restriction="email"

But nothing magik is happening here, The Alchemist has a "restrictions" dictionary that will recognise "email" and convert to a more specific restriction in this case:

restriction="regex:^[-_a-z0-9.]+@[_a-z0-9.]+[a-z]{2,}$"

This restriction specification is available via the MetaData to help an application validate user input. Both the Web Application and standalone application use restrictions to provide "as you type" validation.

Gui Hints

Other kinds of hints can be provided in the model description to help in the interface. For example guihint="rows:5" would indicate that the property value might be a long string and a multirow edit box should be used.

All these annotations are on the model description. These are then propagated to the MetaData available from each object. So you define the model, generate the system, try the application, then adjust the model and try again.

It is genuine "Model-based Rapid Application Development" if you want to give it a label.

Conclusion

The Cut The Crap Software project is in some sense complete. There is always more to do, and much that can be improved. But it is complete for two reasons...

..and the second reason is that "when you're right you're right", and nothing more need be done.

Feedback - Feedback - Feedback

You would probably not believe how difficult it is to get feedback on ideas and products. Please, let me know your thoughts, suggestions, anything really, even the harshest comment will be appreciated simply because you could be bothered. Email martyn@cutthecrap.biz to register your interest.

Links

www.cutthecrap.biz - The main Cut The Crap website.

downloads - the main download page.

Cut The Crap Software - A central page referencing the different packages.

Whitepapers - A number of technology and tutorial whitepapers on Cut The Crap technology.

KAWA - The homepage of the GNU KAWA project.