Tuesday 28 October 2008

ISWC08 Monday p.m.

I've had lunch and coffee and I'm ready for more. This afternoon promises the most relevant sessions so far. Will I sink or swim?

"Moduralisation and Explanation" a.k.a. "Understanding and repairing inferences"
Matthew Horridge & Uli Sattler

The customised version of Protege 4 used in this session is available at the tutorial web site, along with a couple of example ontologies to go with it. Specifically, the customised Protege 4 comes with some plug-ins that are part of the TONES project.

Uli starts by leading us in gently. There are lots of ontologies out there & if one contains concepts that you want to model you should reuse them. This is sound engineering practice & makes perfect sense. However, you might want to reuse only a part of an ontology, so how do you go about extracting the useful bit? Well, you can probably identify the concepts that you want to reuse, but they in turn will depend on other concepts within the borrowed ontology & you'll need to import those too - pulling in everything your chosen terms depend on is called 'coverage'. The bad news is that guaranteeing coverage is, in general, undecidable; the good news is that there is a syntactic approximation (phew) that guarantees coverage, though not a module of minimal size.
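To get my head round what 'coverage' is asking for, here's a naive little Python sketch. It is emphatically not the locality-based extraction the TONES tools actually use, and the toy axiom format and terms are entirely made up - it just shows the idea of chasing down everything your chosen terms depend on:

# Naive sketch of "extract the part of an ontology you want to reuse".
# NOT the locality-based module extraction the TONES plug-ins use; it just
# transitively pulls in every axiom mentioning a term you depend on.

# A toy ontology: each axiom is (defined_term, {terms it refers to}).
axioms = [
    ("Wine",   {"Drink", "madeFrom", "Grape"}),
    ("Grape",  {"Fruit"}),
    ("Fruit",  {"Plant"}),
    ("Beer",   {"Drink", "madeFrom", "Barley"}),
    ("Barley", {"Plant"}),
]

def extract_module(wanted):
    """Collect axioms until the signature is closed under dependencies."""
    signature = set(wanted)
    module = []
    changed = True
    while changed:
        changed = False
        for subject, refs in axioms:
            if subject in signature and (subject, refs) not in module:
                module.append((subject, refs))
                signature |= refs      # the borrowed terms we now also need
                changed = True
    return module, signature

module, signature = extract_module({"Wine"})
print(signature)   # Wine plus everything it (transitively) depends on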

Unfortunately (again) you can't guarantee the 'safety' of the imported terms. There may be clashes or inconsistencies with your ontology and you'll need to check these manually. This is related to "Conservative extensions" (whatever they may be) and is the subject of ongoing research.

Now Matthew takes over, and we get into some gritty inference stuff. Using a reasoner plugin he identifies a number of Root & Derived Unsatisfiable classes in the ontology. These are generally 'Bad Things' (TM) - after all, what's the point of a class that nothing CAN belong to? (Note this is different from an empty class, which is satisfiable but has no members.) The plugin lists the unsatisfiable classes and shows which are Root classes - these need to be addressed first, because fixing them may well fix the unsatisfiability of the Derived classes.
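To make the Root/Derived distinction concrete (for myself, mostly), here's a toy Python sketch. Caveat: the plug-in decides this more carefully, via the justifications discussed below, and the class names here are invented - I'm simply treating a class as 'Derived' if one of its asserted ancestors is already unsatisfiable:

# Toy classification of unsatisfiable classes into Root and Derived.
# Simplification: Derived = "has an ancestor that is itself unsatisfiable".

superclasses = {      # asserted class hierarchy: class -> direct parents
    "MeatyVegetableTopping": ["MeatTopping", "VegetableTopping"],
    "SpicyMeatyVegetableTopping": ["MeatyVegetableTopping"],
    "CheeseTopping": ["PizzaTopping"],
}
unsatisfiable = {"MeatyVegetableTopping", "SpicyMeatyVegetableTopping"}

def ancestors(cls):
    seen = set()
    stack = list(superclasses.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(superclasses.get(parent, []))
    return seen

for cls in sorted(unsatisfiable):
    kind = "Derived" if ancestors(cls) & unsatisfiable else "Root"
    print(cls, "->", kind)
# MeatyVegetableTopping -> Root   (fix this one first)
# SpicyMeatyVegetableTopping -> Derived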

In a large ontology (or any ontology) it can be hard for a mere human to determine _why_ a class is unsatisfiable. Lucky for us, reasoners can be induced to tell us. What they can give us is 'justifications', which are "minimal subsets of an ontology that are sufficient for a given entailment (inference) to hold". They are a sort of explanation of how an inference was reached (and are also known as MUPS or MinAs). Each justification is made up of one or more axioms, and the same axiom may occur in more than one justification. For a given inference (such as the unsatisfiability of a class) there may be multiple justifications. To 'repair' an unsatisfiability all you need to do is delete a single axiom from each of the justifications. Of course, nothing is that simple - how do you know which axiom(s) to delete? Well, you need to analyse the axioms and decide which ones represent logical errors.
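The 'delete one axiom from each justification' idea is essentially a hitting set problem, so here's a brute-force Python toy (the axiom labels and justifications are made up) that enumerates the smallest possible repairs:

# Toy "repair" of an unsatisfiable class: delete at least one axiom from
# each justification. Minimal repairs are minimal hitting sets.
from itertools import combinations

justifications = [        # hypothetical justifications, as sets of axioms
    {"A1", "A2"},
    {"A2", "A3"},
    {"A4"},
]

def minimal_repairs(justifications):
    all_axioms = sorted(set().union(*justifications))
    for size in range(1, len(all_axioms) + 1):
        hits = [set(combo) for combo in combinations(all_axioms, size)
                if all(set(combo) & j for j in justifications)]
        if hits:
            return hits    # the smallest sets that hit every justification
    return []

print(minimal_repairs(justifications))   # [{'A2', 'A4'}]

The brute force won't scale, of course, but it makes the point: which of those candidate deletions is the *right* one is still a human decision.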

So far, so good. Now things begin to get a bit hairy.

Some justifications include 'superfluous' axioms - the same inference would have been reached if those axioms didn't exist. This might point to problems in the ontology, and there's a plugin to help find these. The superfluous nature of these justifications can be concealed by both 'Internal' and 'External' masking, which I'm not even going to try and explain. Then there are 'Fine-grained' justifications that come in two flavours: 'Laconic' and 'Precise'. The plug-in lets you look at the 'Laconic' ones, which have no superfluous axioms and in which all axioms are as weak as possible. 'Precise' justifications are a subset of 'Laconic' justifications and are primarily geared towards repair.

Well, I'm glad that's all clear.


"Data Integration through Ontologies"
Diego Calvanese & Giuseppe De Giacomo

Still to come....

ISWC08 Monday a.m.

Another day, another tutorial. It's raining, but deep in the bowels of the Conference Hall there are no windows... just slide after slide....

Today the tutorial is "Reasoning for Ontology Engineering". It's a big topic and even with my little experience of the subject I know that there are many issues whenever the theory comes face to face with the real world. The tutorial outline is divided into 4 parts: Introduction, Bottom-Up Approach to Designing Ontologies, Understanding and Repairing Inferences and Data Integration through Ontologies. Here goes (and when I fail to make sense you can check out the slides).

"Introduction" - Ralf Moller
Ralf's Introduction is theoretical in nature. He introduces the formal notation of a Description Logic language, ALCQ, which, though I'm sure it's useful to some, seems superfluous for my grasp of the problem. He describes a top-down approach to ontology design that seems entirely sensible to a seasoned OO designer like myself. It's all phrased in formal terms - TBox, ABox, Generalized Concept Inclusions (GCI), Grounded Conjunctive Queries (GCQ) - but the intention is clear _despite_ that :-). To paraphrase (and probably distort): the TBox contains the 'type system' while the ABox contains the instances.

OK, I'll quote from the presentation:
"A TBox is a set of generalised concept inclusions" - translation: it's a class hierarchy
"An interpretation satisfies a GCI, C is subsumed by D, if all members of C are also members of D" - translation: C is a subclass of D
"An interpretation is model of a TBox if it satisfies all GCIs in the TBox" - translation: a model must conform to the type system
"A concept, C, is satisfiable w.r.t. a TBox, T, if there exists a model, I of T, such that the set C w.r.t. I that is not empty" - translation: is there any way that there can be members of the set.
etc.
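Here's my toy Python rendering of those definitions, with an entirely made-up interpretation, just to convince myself the translations hold water:

# An interpretation assigns each concept name a set of individuals; a GCI
# "C subsumed by D" is satisfied when that containment actually holds.

interpretation = {               # hypothetical domain of three individuals
    "Student": {"alice", "bob"},
    "Person":  {"alice", "bob", "carol"},
    "Unicorn": set(),            # empty in this interpretation
}

tbox = [("Student", "Person")]   # TBox: Student is subsumed by Person

def satisfies_gci(interp, c, d):
    return interp[c] <= interp[d]     # all members of C are members of D

def is_model(interp, tbox):
    return all(satisfies_gci(interp, c, d) for c, d in tbox)

print(is_model(interpretation, tbox))     # True - it conforms to the 'type system'
# Satisfiability of Unicorn asks whether SOME model of the TBox gives it a
# member; in this particular model it happens to be empty:
print(bool(interpretation["Unicorn"]))    # False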

Ralf covers the Unique Name and Open World assumptions and demonstrates how standard reasoning (based on RDFS & OWL) can be used to validate design decisions. He shows how Concept membership can be defined using restrictions (something that takes the OO head a while to get used to), and how that can be the basis of GCQs. GCQs can be reduced to (standard) instance tests (i.e. simple triple queries), but non-trivial optimization techniques are required, such as those implemented by RacerPro. The number of individuals that can be efficiently queried (using sound & complete reasoning) has increased by orders of magnitude over the past 5 years (from 100 then to 10,000 or 100,000 today).
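For my own benefit, here's roughly what a grounded conjunctive query looks like once it's down at the level of triple patterns, sketched as SPARQL over rdflib. The ex: terms are invented, and rdflib obviously isn't doing any of the DL reasoning that RacerPro does - this is just the shape of the query:

# A grounded conjunctive query is, roughly, a conjunction of triple patterns.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:iswc2008 a ex:Conference ; ex:hasTutorial ex:reasoningTutorial .
ex:reasoningTutorial a ex:Tutorial ; ex:presenter ex:ralf .
""", format="turtle")

query = """
PREFIX ex: <http://example.org/>
SELECT ?conf ?who WHERE {
    ?conf a ex:Conference .
    ?conf ex:hasTutorial ?tut .
    ?tut  ex:presenter ?who .
}
"""
for conf, who in g.query(query):
    print(conf, who)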


"A Bottom-Up Approach to Designing Ontologies" - Anni-Yasmin Turhan
Anni has a cold but she delivers an interesting session. It has been shown that Domain Experts (as opposed to Knowledge Engineers) don't necessarily have a sufficient grasp of OWL (or Description Logics in general) to be able to design ontologies. Instead the team have developed a methodology by which the Domain Expert creates instances/individuals in the ABox. A Racer Pro/Porter plugin called Sonic implements Most Specific Concept (MSC) and Least Common Subsumer (LCS) functionality.

The idea is that the Domain Expert creates instances (in the ABox), and then uses the MSC algorithm to automatically generate a Concept (class) in the TBox that closely matches each instance - this turns properties into restrictions, for example. They then select several of these MSCs, which they 'know' describe 'similar' instances, and use the LCS algorithm to generate a Concept (class) that subsumes all of the MSCs as closely as possible. The tool allows the user to edit/rename the generated Concepts to remove incorrect or unnecessary assertions.
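Here's a back-of-the-envelope Python sketch of the idea, treating concept descriptions as plain sets of 'features'. The real algorithms work on structured ALE descriptions, and the individuals below are invented:

# Toy bottom-up sketch: individuals become concept descriptions (a crude MSC),
# and the LCS of several descriptions keeps only what they have in common.

individuals = {
    "rex":   {"Animal", "hasLeg", "barks"},
    "felix": {"Animal", "hasLeg", "meows"},
    "nemo":  {"Animal", "hasFin"},
}

def msc(individual):
    """Most Specific Concept: the description that best fits this individual."""
    return frozenset(individuals[individual])

def lcs(*concepts):
    """Least Common Subsumer: the most specific description subsuming them all."""
    common = set(concepts[0])
    for c in concepts[1:]:
        common &= set(c)
    return frozenset(common)

pet_like = lcs(msc("rex"), msc("felix"))
print(pet_like)                    # {'Animal', 'hasLeg'} - order may vary
print(lcs(pet_like, msc("nemo")))  # {'Animal'}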

Most Specific Concept:
- the input individual is an instance
- the output is the best-fitting concept description for the input
- available for 'unfoldable' TBoxes
- only appropriate for acyclic ABoxes (but for ALE Description Logics you can compute k-approximations)
So, the first two bullets above make perfect sense, but the next 2 require some formal knowledge that I don't really have, so here's what I've dug up.

Unfoldable: For a given TBox, T:
- All axioms are definitional (subclass or equivalence relationships)
- Axioms in T are unique (it's more complicated than this, but enough is enough)
- T is acyclic
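Those conditions are at least checkable in toy form. Here's a little Python sketch where a 'definition' is just the set of names it mentions (the concept names are made up, and uniqueness comes for free from using a dict):

# Toy check of the acyclicity part of "unfoldable": following definitions
# must never lead back to the name being defined.

definitions = {          # concept name -> names used in its definition
    "Parent":      {"Person", "hasChild"},
    "GrandParent": {"Parent", "hasChild"},
    "Person":      set(),
}

def is_acyclic(defs):
    def expands_to(name, seen):
        for used in defs.get(name, set()):
            if used in seen or not expands_to(used, seen | {used}):
                return False
        return True
    return all(expands_to(name, {name}) for name in defs)

print(is_acyclic(definitions))               # True
definitions["Person"] = {"GrandParent"}      # introduce a cycle
print(is_acyclic(definitions))               # False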

Description Logic naming conventions
- AL: Attributive language, a base language which allows: Atomic negation, concept intersection, universal restrictions, limited existential quantification
- E: Full existential qualification

k-approximation: an approximation algorithm. I can't even paraphrase this one, so check it out for yourself.

Least Common Subsumer:
- the output is a Concept (class) that subsumes (is superclass of) each of the input Concepts
- the output is the best-fitting Concept description for the input Concepts
- available for 'unfoldable' TBoxes
- not available for logics more expressive than ALEN (N: Cardinality restrictions)

Are you still here? Fine.

If you have a populated TBox, but with a flat hierarchy, you can use LCS to deepen your class hierarchy by applying it to 'similar' sibling Concepts. The reason you might want to do this is that a large number of siblings makes it harder to navigate & query the data.
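Continuing the toy set-based view from above, the workflow would look something like this (the siblings and the 'similarity' test are entirely made up):

# Deepening a flat hierarchy: pick 'similar' siblings, compute their LCS,
# and insert that as a new intermediate parent.

siblings = {
    "Collie":   frozenset({"Animal", "barks", "hasTail"}),
    "Poodle":   frozenset({"Animal", "barks", "hasTail"}),
    "Goldfish": frozenset({"Animal", "hasFin"}),
}

def lcs(concepts):
    return frozenset(set.intersection(*(set(c) for c in concepts)))

similar = [name for name, desc in siblings.items()
           if len(desc & siblings["Collie"]) >= 2]     # crude similarity test
print(similar, "->", lcs([siblings[name] for name in similar]))
# ['Collie', 'Poodle'] -> their LCS becomes the new intermediate concept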

Now, there are many ontologies that use more expressive DLs, such as ALC (C: Complex concept negation). The upshot appears to be that if you apply LCS in this context you end up with a lengthy disjunction (union) which is not very useful. This can be handled by either an approximation-based or customization-based approach.

Approximation-based: eliminate disjunction from input concepts (while preserving as much information as possible) and then compute the LCS. This amounts to a translation from the more expressive DL to a less expressive DL.
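At the level of my toy feature-sets, eliminating the disjunction amounts to keeping only what all the disjuncts share. The real ALC-to-ALE approximation works on structured descriptions and does rather more than this, but the flavour is:

# Toy approximation: drop a disjunction by keeping only what all of the
# disjuncts have in common (information is lost, but no union remains).

disjunction = [                      # "RedWine OR WhiteWine", as feature sets
    {"Wine", "Drink", "isRed"},
    {"Wine", "Drink", "isWhite"},
]

def approximate(disjuncts):
    return set.intersection(*disjuncts)

print(approximate(disjunction))      # {'Wine', 'Drink'}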

Customization-based: import the more expressive ontology into your ontology as 'background' terminology, and refine it using terms from a less expressive DL. You refer to concept names from the background ontology, but there's no feasible way to actually use this! So, the proposal is to try using "subsumption-based common subsumer".

I was well lost on this last point, so it was lucky that it was lunchtime.

Sunday 26 October 2008

ISWC08 - Sunday p.m.

International Semantic Web Conference 2008
Tutorial - Introduction to the Semantic Web

Sunday 26th October - Afternoon

"Semantic Interoperability" - Jerome Euzenat & Natasha Noy
Much of the promise of the Semantic Web stems from the claim that we will be able to query heterogeneous data sources. For this to work the ontologies from the data sources need to be 'aligned', and this is not simple. It can be partially automated, but requires human intervention both to identify missed matches and to confirm the correctness of automated matching. Once the ontologies are aligned a transformation between them can be automatically generated. The transformation may translate queries/triples, or simply create new assertions that make axioms in the data sources equivalent. Combining different techniques can help, though this then introduces the need to aggregate, filter and trim results. In tests precision/recall vary from under 50% to over 80% if ontologies are 'relatively similar'. Best performers in the Ontology Alignment Evaluation Initiative (OAEI) tests are Falcon and RiMOM.
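As a reality check on why alignment needs a human in the loop, here's the crudest possible matcher sketched in Python - purely name-based, with invented class names. Notice that it pairs Paper/Paper and ConferenceEvent/Conference but completely misses Author/Writer; real matchers combine string, structural and instance-based techniques, and a human still reviews the output:

# The simplest possible ontology matcher: pair class names that look alike.
from difflib import SequenceMatcher

onto_a = ["Author", "Paper", "ConferenceEvent"]
onto_b = ["Writer", "Article", "Conference", "Paper"]

def candidate_alignment(a_terms, b_terms, threshold=0.6):
    pairs = []
    for a in a_terms:
        for b in b_terms:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return sorted(pairs, key=lambda p: -p[2])

for a, b, score in candidate_alignment(onto_a, onto_b):
    print(f"{a} ~ {b}  ({score})")   # a human still has to confirm these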

"Semantic Web Services" - John Dominingue & David Martin
We start with an overview of Web Services today. They are syntactic, most integration tasks need to be carried out by developers and they can have problems scaling. The goal of Semantic Web Services is to "automate all aspects of application development through reuse". The idea is that we can build clients that can analyse a query and choreograph/orchestrate the interaction with one or more web services to provide potential solutions. David and John describe OWL-S and WSMO respectively. Both provide mechanisms that extend the description of web services with added semantics, to make the mediation between client and service more automatable. David describes Fujitsu's Task Computing and John covers eMerge, a system developed to assist with emergency situations in Essex. They then describe SAWSDL, the W3C recommendation that adds 3 extensibility attributes to traditional WSDL: modelReference, liftingSchemaMapping and loweringSchemaMapping.

"Linked Data - the Dark Side of the Semantic Web" - Jim Hendler
Darth Vader reared his head at the beginning of this session, but was quickly dispelled. Jim is talking about the unseen side of the Semantic Web, the ability to link data dynamically. An example he gives is the (largely theoretical) wine chooser application that downloads the menu for the restaurant you are eating at, prompts you to pick the dishes that you and your companions are choosing and, based upon your (& their) preferences, the downloaded wine list and some online service that matches foods to wine characteristics, recommends what wine(s) to choose. I'm not entirely convinced - my heuristic of pick a colour and don't choose any whose price makes you perspire seems to work fine. He also talked about deployed websites such as Twine, LiveJournal, Freebase and DBpedia and described the huge online RDF resource at the W3C SWEO Community Project LinkingOpenData. A brief discussion of Semantic Gridding/Seeded Tagging followed, along with the assertion that many of the larger commercial companies were entering the Web 3.0 arena in the belief that it is only a matter of time before it provides winners to join Google (Web 1.0) and Facebook (Web 2.0) in the Hall of Fame. This is exemplified by Microsoft's acquisition of PowerSet earlier this year.

"Using the Semantic Web" - Mathieu d'Aquin
Mathieu was (if this is possible) even more excited than Jim. He had been hoping to demonstrate several applications that make use of Semantic Web APIs, but the connectivity at the conference centre is pretty poor. Consequently we had to make do with static images, but he was still convincing. He is the developer of Watson, an online Semantic Web query service (http://watson.kmi.open.ac.uk) that enables you to quickly search all marked-up data on the web to discover relevant resources. You can then issue SPARQL queries against the resource, or use some of the precanned API calls, such as subclass/superclass. He showed how he quickly knocked up a search engine, Wahoo (derived from Yahoo), that used Watson to populate a sidebar with specialisations/generalisations of the search terms entered. There are other services out there too, such as OpenCalais SemanticProxy and Hakia. Also worth a look, apparently, is the Talis platform, which will actually store your data for you!
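The workflow, as I understand it, is 'discover a resource, then query it' - something like this rdflib sketch, where the document URL is made up and stands in for whatever a service like Watson hands back:

# Rough sketch: fetch a discovered RDF document and run SPARQL over it locally.
from rdflib import Graph

doc_url = "http://example.org/some-discovered-ontology.rdf"   # hypothetical

g = Graph()
g.parse(doc_url)                  # fetch and parse the RDF document

subclass_query = """
SELECT ?narrower WHERE {
    ?narrower <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?broader .
}
"""
for row in g.query(subclass_query):
    print(row.narrower)           # specialisations, Wahoo-sidebar style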

ISWC08 - Sunday a.m.

International Semantic Web Conference 2008
Tutorial - Introduction to the Semantic Web

Sunday 26th October - Morning

It's Sunday, the clocks have just changed and the sun is shining. I've survived my first night in a small room at the Hotel Barbarossa that usually houses chain smokers - my throat is sore and I've got a long day ahead of me (and the prospect of the same stale cigarette smell for the next 4 nights - the hotel is full and they have no non-smoking rooms).

I'm fairly new to Semantic Web technology, which is why I booked into the introduction tutorial, but I've read "Semantic Web for the Working Ontologist" and have been working with RDF(S)/OWL for a few months now, so I didn't want a simple re-hash of online documentation. I needn't have worried. The day is split into 9 sessions from 10 speakers and paints a wide canvas of what's going on and where we might be heading.

The session is opened by Jim Hendler with his "Introduction to the Introduction to the Semantic Web". Jim is an engaging presenter, with a style that reminded me a little of Jim Coplien (must be that first name :-). His main theme is that there are 2 different interpretations of the Semantic Web, one originating in the AI community (heavyweight ontologies, processing time less relevant than correctness/completeness of answers) and the other from the internet/web community (lightweight ontologies, speed of response much more important than correctness/completeness). These views, he claims, are not irreconcilable, but there remains a large space between the two that still needs to be explored.

Sean Bechhofer takes over with an "Introduction to OWL". In 45 minutes he's never going to cover RDF, RDFS and OWL in any depth, but he does sketch out the landscape all the way from motivating the need for semantics through inference, OWL dialects and DL. He even manages to fit in brief comments on Necessary/Sufficient Conditions, the Open World Assumption and the Unique Naming Assumption. Not surprisingly he doesn't get through all 67 slides and skips SKOS and OWL2... I don't know about him, but I needed that coffee.

Half an hour later we hear from Asun Gómez-Pérez about "Ontology Engineering Methodologies". She covered the NeOn methodology for developing ontologies, which defines a process and some process artefacts to assist ontology developers. There's an ontology workflow, with 9 different scenarios defined, a card based system for capturing relevant data and some guidelines about how to reuse existing resources. I found Asun hard to follow, due to a combination of her accent, the speed with which she went through the material and some problems with the microphone. However, I think that I'll be looking closer at the NeOn project when I get back home.

Asun was followed by Aldo Gangemi's "Ontology Design". He covered the motivation for ontologies in general, contrasted them with controlled terminologies and reiterated that ontology design is about the non-trivial task of matching a solution (ontology) to the problem at hand. His main interest appears to be discovering and documenting ontology design patterns, which he divides into categories such as 'Logical Patterns' and 'Content Patterns'. A website has been launched as a repository of ontology patterns at http://www.ontologydesignpatterns.org . He then introduced the concept of unit testing ontologies, which was something I hadn't seen elsewhere, and illustrated it by demonstrating some fundamental errors in the conference's own ontology, such as missing inverse properties, missing disjunctions and missing property transitivity.

We still haven't reached lunchtime! Fabio Ciravegna takes us through "Technologies for Capturing, Sharing and Reusing Knowledge". There are a lot of resources in the world; some of them are text-only, but most are not, which makes automated markup really hard. However, the amount of data out there precludes manual-only annotation in most situations. Fabio talks about hybrid methods of annotating text documents, where the system learns how to do markup by tracking a human annotating a sample of the documents to be marked up. There are simpler automated markup techniques, such as Named Entity Recognition, Terminology Recognition or Known Name Recognition, which can have precision/recall of up to 80%-95%. However, when more useful (and complex) techniques are attempted, such as capturing links between elements in a document, precision/recall hit a plateau of around 60%/70% in 1998 and have stayed there. And that's without trying to automate annotation of multimedia content. Once the data is annotated it becomes easier to share it, but there are still hurdles to overcome - there's a lot of research about searching/querying: keyword, semantic or hybrid. Again, hybrid appears to win, but even merging the results from a hybrid search is tricky. Look at k-now.co.uk, a spinoff from Sheffield University, which has been highly rated by Rolls-Royce, no less.
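Just to anchor the simplest end of that spectrum, here's what 'Known Name Recognition' boils down to: a gazetteer lookup, sketched in Python (the gazetteer entries are mine, and real systems layer machine learning on top of this):

# Toy Known Name Recognition: annotate text by matching a list of known names.
import re

gazetteer = {
    "Rolls-Royce": "Organisation",
    "Sheffield": "Location",
    "Fabio Ciravegna": "Person",
}

def annotate(text):
    annotations = []
    for name, entity_type in gazetteer.items():
        for match in re.finditer(re.escape(name), text):
            annotations.append((match.start(), match.end(), name, entity_type))
    return sorted(annotations)

sample = "Fabio Ciravegna described work at Sheffield rated by Rolls-Royce."
for start, end, name, entity_type in annotate(sample):
    print(f"{start:3d}-{end:3d}  {entity_type:12s}  {name}")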

My blood sugar level dipped too low during Fabio's talk, and looking back over the 100+ slides I wonder why I can only remember a fraction of them. Did he skip them or was I dreaming of lunch?

Tuesday 6 May 2008

RichEd20.dll

Recently, I have been dealing with a number of defects that all come home to roost with Microsoft's RichEd20.dll - or more accurately the variation between versions of the DLL.

The first issue was spotted when copying an OLE object from Word and pasting it into a Rich Edit control hosted by my client's application. It worked fine when Office 2000 was installed, but the formatting was broken once Office 2007 was installed. (The pasted object was rendered in two parts - an embedded object and a static metafile.)

And bizarrely, copying and pasting an embedded image from one Rich Edit control to another crashes when using the RichEd20.dll that ships with XP, but works fine when using the version that ships with Office 2007.

In both cases the presence of some text on the clipboard, as well as the OLE object, significantly altered the behaviour of the paste operation.

Joyfully I searched the web, but didn't find much help, except for this blog entry from Murray Sargent that lists the various RichEd20.dll versions and where they may be found.

So, if you experience weird formatting (or worse) when pasting an OLE object into a rich edit control, check out what version(s) of RichEd20.dll you've got installed.
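If you have Python and pywin32 handy, something like this will tell you what you're dealing with. The paths below are guesses based on a typical XP/Office 2007 machine - adjust them for your installation:

# List the versions of RichEd20.dll found in a couple of likely locations.
import os
import win32api

candidates = [                 # hypothetical paths - adjust for your machine
    r"C:\Windows\System32\riched20.dll",
    r"C:\Program Files\Common Files\Microsoft Shared\OFFICE12\riched20.dll",
]

def file_version(path):
    info = win32api.GetFileVersionInfo(path, "\\")
    ms, ls = info["FileVersionMS"], info["FileVersionLS"]
    return f"{ms >> 16}.{ms & 0xFFFF}.{ls >> 16}.{ls & 0xFFFF}"

for path in candidates:
    if os.path.exists(path):
        print(path, "->", file_version(path))
    else:
        print(path, "-> not found")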