Abstract (or algebraic) Data Types (ADTs) are a well-known modeling technique for specifying data structures in an algebraic style.
Such specifications are useful because they allow formal reasoning about software at the specification level. That opens the possibility of automated verification and code generation, for instance as we find in the field of program transformation (refactoring, modernization, etc.). However, ADTs are more naturally associated with a declarative programming style (functional (FP) and logic programming (LP)) than with an object-oriented one (OOP). A typical ADT for a tree data structure could look as follows, using some hypothetical formalism (we keep the model deliberately small):
adt TreeADT<Data>{
  import Integer, Boolean, String;
  signature:
    sort Tree;
    oper Empty : -> Tree;
    oper Node : Data, Tree, Tree -> Tree;
    oper isEmpty, isNode : Tree -> Boolean;
    oper mirror : Tree -> Tree;
    oper toString : Tree -> String;
  semantics:
    isEmpty(Empty()) = true;
    isEmpty(Node(_,_,_)) = false;
    mirror(Empty()) = Empty();
    mirror(Node(i, l, r)) = Node(i, mirror(r), mirror(l));
}
As we see in this example, ADTs strongly exploit rules and pattern-matching as a concise way to specify functional requirements. The signature part is the interface of the type (also called a sort in this jargon), given by means of operators; the semantics part indicates which equalities are invariant under the operations for every implementation of this specification. The semantics also lets us recognize some special kinds of operations. For instance, Empty and Node are free, in the sense that no equation reduces them. They are called constructors.
If we read the equations from left to right, we can consider them declarations of functions whose formal parameters are allowed to be complex structures. We also use wildcards ('_') to discard those parts of the parameters that are irrelevant in context. A conventional declaration without patterns would probably require more code and consequently obscure the original intention expressed by the ADT. In general, we would need destructors (getters), that is, functions to access the parts of an ADT value. Hence, the operation mirror might look as follows:
Tree mirror(Tree t){
  if(t.isEmpty()) return t;
  return new Node( t.getInfo(),
                   t.getRight().mirror(),
                   t.getLeft().mirror() );
}
Thus, some clarity usually gets lost when generating OOP code from ADTs, among other things because navigation (term destruction, decomposition) needs to be explicit in the implementation.
Rules and pattern matching have not yet become standard in mainstream OOP languages for processing general data objects. The use of regular expressions over strings has been quite common, as has pattern matching over semi-structured data, especially for querying and transforming XML; but such processing tends to happen via APIs and external tools, outside the language proper. Hence, the need for rules does exist in very practical applications, and during the last years several proposals have been developed to extend OOP languages with rules and pattern matching. Some interesting attempts to integrate ADTs with OOP worth mentioning are Tom (http://tom.loria.fr/) and Scala (http://www.scala-lang.org/). An interesting place to look, with quite illustrative comparative examples, is http://langexplr.blogspot.com/.
We will start in this blog a sequence of posts studying these two models of integrating rules, pattern matching and OOP in some detail. We begin with a possible version of the ADT above using Scala. In this case the code generation has to be done manually by the implementer, but the modeling language and the implementation language are the same, which can be advantageous. Scala explicitly supports (generalized) ADTs, so we practically obtain a direct translation.
object TreeADT{
  // Base sort: the defaults model the Empty case
  trait Tree{
    def isEmpty : Boolean = false
    override def toString() = ""
    def mirror : Tree = this
    def height : Int = 0
  }
  // Constructor carrying data: a case class enables pattern matching
  case class Node(info:Any, left:Tree, right:Tree) extends Tree{
    override def toString() =
      "(" + info + left.toString() + right.toString() + ")"
  }
  // Constructor without data: a singleton object
  case object Empty extends Tree{
    override def isEmpty = true
  }
  def isEmpty(t : Tree) : Boolean =
    t match {
      case Empty => true
      case _     => false
    }
  def mirror(t : Tree) : Tree =
    t match {
      case Empty         => t
      case Node(i, l, r) => Node(i, mirror(r), mirror(l))
    }
}
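Just to see the object in action, here is a minimal usage sketch (assuming the definitions above), for instance in the REPL or a worksheet:
import TreeADT._
// a small tree with three nodes
val t: Tree = Node(1, Node(2, Empty, Empty), Node(3, Empty, Empty))
println(t)           // prints (1(2)(3))
println(isEmpty(t))  // false
println(mirror(t))   // prints (1(3)(2)), children swapped at every level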
By Migration 2.0 (coined in the context of McAfee's Enterprise 2.0 vision) we want to consider additional ways of addressing Automated Software Modernization in a Web 2.0 context, and why or how this should (or should not) deserve some particular attention. As a complement to a previous post, we have developed a presentation (albeit in Spanish) which may help to clarify (or not) some of the ideas around the notion. Especially relevant is a user-centric notion of migration and modernization as a sort of "SLATES-ification" of legacy code, where legacy needs to be understood inside the 2.0 context. We will revisit and refine this idea in a forthcoming post.
You will find the slides of the presentation here.
Let us elaborate some loose ideas on a fresher notion of "software modernization". In what follows, we review the notion of software migration as a modernization path, and we want to look at it through the lens of other contextual elements and events in the IT world that have evolved during the last 2-3 years. Such elements give us reasons to identify forces and opportunities for innovation and possible new developments. Let us assume that, so far, we have focused mostly on an implementation-oriented modernization, namely one mainly concerned with the programming languages, frameworks, platforms, architectures and the like that have prevailed during the last years.
From a technical point of view, we are aware that this implementation-based notion is completely valid and will exist as such for a while, because still-useful legacy systems are forced to evolve at the implementation level while retaining as much as possible of their original functionality and corresponding business value, at a reasonable cost. However, we are also aware that the environment where information systems (ISs) serve and survive is evolving so strongly that implementation details probably remain that important only within the traditional IT/IS level and vision. What kind of environment and forces are putting pressure on that vision? Is there an opportunity there for migration?
One important phenomenon is definitely the Web 2.0. In analogy to how the Internet forced Intranets, Andrew McAfee has recently coined the Enterprise 2.0 concept, which embodies the well-known effects of the Web 2.0, as a ubiquitous trend and as a social movement, and how those might be pushing inside the IT enterprise nowadays; and, as a direct consequence, the kind of tools employees might be willing to use and needing in their regular work environment, where new information requirements are born faster than they can be incorporated as new features into the traditional IS platform. Whether so-called Web 2.0 tools will improve employee productivity is probably an open and debatable question, no doubt about it. We still remember, not too long ago, how e-mail and Internet access at work were considered a source of disturbance per se.
It is also true, however, that being able to search, analyze, discover, tag, publish, share and trace knowledge at the rhythm of the business and of one's own personal information needs has become more important than ever. Traditionally designed ISs could be becoming a contributing factor to a sort of impedance mismatch between the huge flexibility and freedom that people currently encounter on the Web (even in their private personal milieu) and a rigid traditional IS platform at the workplace. And, we emphasize, this concern arises independently of whether or not such an IS platform is "modern" at the implementation level, which is a different dimension of the matter.
As Dion Hinchcliffe puts it, Enterprise 2.0 is a cultural catalyst (as Web 2.0 is being); we believe and interpret it as a realistic vision where, sooner rather than later, ISs will be judged in terms of McAfee's SLATES criteria, and this will entail a rather stronger force leading to modernization at a higher level than the technical one, because such criteria are more closely related to social, common-sense, better-understood forces not so directly tied to technical issues.
If such a vision is accepted as a legitimate opportunity, we might then be looking for spaces for innovation now, as we consider moving to the next levels of automation-supported migration. In such a case we have to consider migration as a path enabling ISs (not just legacy ones) to evolve toward Enterprise 2.0-like modernization as a well-defined strategy.
We speculate that software and data migration are frequently urged by IT infrastructure modernization needs (emergencies would perhaps be more precise to say) at the enterprise level, which are often closely related to platform obsolescence and the associated maintenance costs. A possibly even more appreciable benefit might be the integration of heterogeneous information systems and databases: the ability to easily connect information sources and to create access ports that facilitate the extraction and derivation of knowledge from such sources, helping at the management and decision-support level in a more natural and easy way. In other words, the real value of modernization would be to support BI-like strata, pointing, for sure, to a better CRM platform among others. We feel migration is probably not perceived as a transitory goal but rather as a means, so the added value might lie outside, possibly hidden with respect to the whole migration effort per se. It seems to me that the vertical road leading from software migration and modernization up to the BI end-user is perceived as too long, one that is hard to appreciate at first sight at the management level. My question to myself would be whether that has to be so. Shouldn't migration tools, and their message, be positioned more directly as enablers for the BI layer? And if the answer is yes, we should be asking ourselves how such migration tools can be adapted to close the gap.
I was taking a look at some sources showing trends in BI for 2007, especially related to legacy code. I found the following quotes from a report by Knightsbridge Solutions LLC (Trends in Business Intelligence for 2007) interesting:
Trend #5
Service-Oriented Architecture: Information Management is Critical to Success (of BI)
The buzz around service-oriented architecture (SOA) continues, with some organizations viewing SOA
as the solution to a wide range of business and technology problems, from improving enterprise agility
to deriving more value from legacy systems (I think we are witnesses of that).
In the BI arena, SOA has great potential for delivering
enhanced capabilities to users. An SOA-enabled BI infrastructure could provide seamless access to
both batch and real-time data integrated across operational and analytical sources. SOA also presents
opportunities for innovation in areas such as real-time data collection and real-time analytic services.
However, companies that approach SOA without a strong information management methodology will
have difficulty achieving the benefits they seek. When implementing SOA on a large scale, companies will
face the same barriers they do in large BI integration projects. For example, some early adopters of SOA
found that semantic incompatibilities in the data between disparate systems are a significant impediment
to SOA adoption. These organizations are discovering that master data management and data governance
efforts must precede SOA adoption to provide a “common language” that supports integration. SOA has
the potential to deliver benefits in BI and many other areas, but not without a solid information management
foundation.
And also:
Trend #8
Influence of Large Vendors: Market Consolidation Expected in 2007
Speculation abounded in 2006 regarding potential acquisitions of pure-play vendors in the BI space.
Business Objects, Cognos, Hyperion, and Informatica were seen as potential acquisition targets with
likely acquirers being enterprise software and infrastructure vendors like IBM, Microsoft, Oracle,
and SAP. Large vendors have moved aggressively into the BI space, building their capabilities through
both acquisitions and internal development (as evidenced by Hewlett-Packard’s recent acquisition
of Knightsbridge).
Strongly consistent with this:
http://www.intelligententerprise.com/channels/bi/showArticle.jhtml;jsessionid=ADNU0C0TP01BYQSNDLRSKH0CJUNN2JVN?articleID=199501674
This is also interesting to see in more detail:
http://www.spagoworld.org/ecm/faces/public/guest/home/solutions/spago
"Naja", let's see.
Sometimes negative experiences may turn out to be positive in the long run, something we should all hope to believe whenever things do not happen the way we expected. Though already old, and maybe questionable and controversial, I must admit this speech made me reflect a lot; I found it very rich in philosophical material, and it changed me somehow. If you haven't watched it yet, you might find it interesting:
Video: http://www.youtube.com/watch?v=D1R-jKKp3NA
Text: http://news-service.stanford.edu/news/2005/june15/jobs-061505.html
Let us gently start meeting PaXQuAL, a prototype language that we are shaping and adjusting for the emergent purpose of symbolically expressing simple analysis and transformation tasks over Web pages, all of this circumscribed to the context of refactoring, as we have been doing in previous posts.
And as we have also done before, sometimes we just want to digress a little from practical and realistic issues, just to expose some theoretical ideas we find somehow interesting (and probably nobody else does). I can only promise that no Greek letter will be used, at all (in part because that font is not allowed by the publishing tool, I confess).
Anybody (if any) still reading this post is allowed to right away express the by now classical: "Yet another language? How many do we count up to by now?" The claim is justified because everybody knows that every problem in Computer Science is solved by proposing a (the) new one. Now it is my turn, why not. It's a free world. For the interested reader, a technical paper with further details will hopefully be available at this site, soon.
Actually PaXQuAL (Path-based Transformation and Querying Attribute Language is its full name; it is pronounced "Pascual") is not that new and different from many other languages developed by real researchers in academia and industry. We wanted to imagine a language for querying and transforming structured data (e.g. XML, HTML), and of that sort we have many available, as we know. What new material can be proposed in this field by someone like us? Actually, what we really want is to operationally relate CSS with a special sort of theoretical, weird artifact we had been exploring some years ago, which we may dare to call object-oriented rewrite systems, or term-rewriting systems (TRS) with extra variables and state (the result of some work developed jointly with actual researchers some years ago). Considering TRS in this case is natural because CSS is indeed a kind of TRS, and that field has a rich offering of tools for useful automated reasoning. And we can find them useful here, we guess.
The question that pushed us back to the old days is: given an interesting, simple and practical language like CSS, what kind of object-oriented rewriting logic can be used to describe its operational semantics? You may not believe it, but this is a very important issue if we are interested in reasoning about CSS and HTML for refactoring purposes, among others. And we are, aren't we?
CSS is rule-based, includes path-based pattern matching and is feature (semantically attributed) equipped, which all together yields a nice combination. CSS can be considered "destructive" because it only adds or changes (styling) attributes of tags in place, while the remaining "proper content" does not get destructively rewritten. For that reason it is not generative (in contrast to XSLT and XQuery). And that leads to an interesting paradigm. For instance, the following is a typical simple CSS rule for setting some properties of every tag of the kind body.
body {
  font-family: Arial, Helvetica, sans-serif;
  background-color: #423f43;
  text-align: center;
}
Of course, more explicit rules like this one can be declared, but furthermore an inheritance (cascading) mechanism implicitly allows attributes to be pushed down or synthesized, as we know from attribute grammars.
That is all nice, but we feel we have to be original, and we want to propose the crazy idea of using something similar to CSS for purposes beyond setting style attributes, for instance for expressing classification rules that allow recognizing patterns like the ones we explained in previous posts: for instance, that a table is actually a sort of layout object, a navigation bar or a menu, among others. Hence, we would have a human-readable querying and transformation language for Web pages, a sort of CSS superset (keeping CSS as a metaphor, which we think might be a good idea).
Let us for now just show some examples (we note that the concrete syntax of PaXQuAL is not yet definitive). For instance, we may want to eliminate the bgcolor attribute of any table having it, because it is considered deprecated in XHTML. We use the symbol ":-" for denoting execution of the query/transformation, as in Prolog.
:- table[bgcolor]{bgcolor:null;}
We may want to add a special semantic attribute to every table hanging directly from a body, indicating it may be a surface object for some later processing. We first must statically declare a kind of table, "sTable", accepting a surface attribute, because we are attempting to use static typing as much as possible ("Yes, I am still a typeoholic").
@:- sTable::table[surface:boolean;]{surface:false}
The symbol "@:-" is like ":-" but operates at the terminological level. And then we have the rule for classifying any table instance hanging directly from the body tag:
:- body sTable{surface:true;}
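Since PaXQuAL is only sketched here, the following is merely an illustrative Scala approximation (the type Tag and both function names are our own invention, not PaXQuAL syntax) of how the two rules above could be read as rewriting over a simple HTML term representation:
// A minimal HTML term: a tag name, its attributes and its children.
case class Tag(name: String, attrs: Map[String, String], children: List[Tag])

// ":- table[bgcolor]{bgcolor:null;}" -- drop the deprecated bgcolor attribute everywhere.
def dropBgcolor(t: Tag): Tag = t match {
  case Tag("table", attrs, kids) if attrs.contains("bgcolor") =>
    Tag("table", attrs - "bgcolor", kids.map(dropBgcolor))
  case Tag(n, a, kids) => Tag(n, a, kids.map(dropBgcolor))
}

// ":- body sTable{surface:true;}" -- mark tables hanging directly from body
// (the surface attribute is kept as a string here, for simplicity).
def markSurface(t: Tag): Tag = t match {
  case Tag("body", a, kids) =>
    Tag("body", a, kids.map {
      case Tag("table", ta, tk) => Tag("table", ta + ("surface" -> "true"), tk)
      case other                => markSurface(other)
    })
  case Tag(n, a, kids) => Tag(n, a, kids.map(markSurface))
}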
Many more "interesting" issues and features still need to be introduced; we will do that in forthcoming posts. Hence, stay tuned.
By taking a look at Web pages, we may expect to discover that some patterns of semantics are encoded using very few sets of HTML, let us say, "combinators" (HTML parts); this may be due to the lack of abstraction capabilities which is inherent to HTML alone. We compared this situation to the Noisy-Channel model in a previous post, where we presented some interesting figures and data illustrating the claim. Let us continue our journey by showing further instances of this phenomenon, whose formal analysis is crucial for intelligent refactoring tools of the kind we have been trying to introduce by means of this series of posts. In other words, let us look at other forms of "HTML noise". As a word of warning, we recall that the data comes from a particular sample of crawled pages, obtained in the way we explained before.
For this post, we are experimenting with tables that are potentially used as page layouts or page structure. For those kinds of tables, we want to study the table shape or page surface, not the specific content; we may think of that as a way to filter potential candidates for further, deeper semantic analysis. (We briefly recall that our sample contains 819 pages and about 5000 table instances, roughly speaking.)
The exercise is simple: we postulate an intuitive definition of a table as a surface and see how well it is supported by the data in our sample.
Let us try our shallow analysis by classifying a table as a page-layout candidate if its container is the page body tag, possibly followed by a chain of div tags (assuming such div tags are intended to organize or format the table), and it has at least two rows and at least two columns (two columns being the most interesting case; we consider it as a base); a sketch of this heuristic as code is given below.
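A rough sketch in Scala, over a deliberately simplified DOM-like Elem type of our own (not the real parser output):
// A minimal DOM-like element: a tag name and its children.
case class Elem(name: String, children: List[Elem])

// Rows of a table, looking through an optional <tbody> wrapper.
def rows(table: Elem): List[Elem] =
  table.children.filter(_.name == "tr") ++
    table.children.filter(_.name == "tbody").flatMap(_.children.filter(_.name == "tr"))

// Tables reachable from <body> through a chain of <div>s only,
// with at least two rows and at least two columns somewhere.
def layoutCandidates(body: Elem): List[Elem] = {
  def walk(e: Elem): List[Elem] = e.children.flatMap {
    case t @ Elem("table", _) =>
      val cols = rows(t).map(_.children.count(_.name == "td"))
      if (cols.size >= 2 && cols.exists(_ >= 2)) List(t) else Nil
    case d @ Elem("div", _) => walk(d)
    case _                  => Nil
  }
  walk(body)
}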
Such a pattern definition sounds reasonable in appearance; however, we will see that its empirical support is not as high as one may expect, at least in our sample.
We find 261 such candidates; they represent 31% of all pages, which is a quite interesting amount; however, it is unexpectedly small, because one may guess there should be at least one per page. Among these 261, we have 83 where the table hangs directly from the body tag (32% of the candidates; 10% of the whole sample). As a matter of fact, these 83 tables present irregular patterns, although we often find 2 columns (65%), with a high variance. For instance, we may find a pattern of the form 6.2.2.2.2.2.2.2, where we use our convention of showing a table of n rows as a sequence of n numbers, each of them being the number of columns (in the example, 8 rows, the first of them with 6 columns and the rest having 2 columns); the sketch after this paragraph shows how this signature can be computed. But even worse, we find the irregular pattern 2.2.7.2.7.7.6.5.5.4.4.5.2.3.2.7.2.7. And talking about irregularity, let us take a look at this interesting one: 19.2.7.4.6.2.2.2.2.2.2.2.2.2.2.5.7.2.2.2.2.4.4.2, whatever it means.
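For reference, this row/column signature is straightforward to compute; a small sketch reusing the Elem type of the previous snippet (again our own simplification, ignoring <tbody> wrappers here):
// One number per <tr>: the count of its <td> cells, joined with dots,
// e.g. "6.2.2.2.2.2.2.2" for the table mentioned above.
def shape(table: Elem): String =
  table.children
    .filter(_.name == "tr")
    .map(_.children.count(_.name == "td"))
    .mkString(".")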
With this simple analysis, we may learn that, perhaps, some intuitive definitions do not occur as frequently as we might expect in our sample. Actually, after looking in detail at some of the irregular cases, a sound conclusion might be that we need first to pre-classify some parts of the page before using general patterns like the one we tried directly. In other words, we see that some noise needs to be filtered out for such a kind of pattern.
In a forthcoming post, we will continue studying this kind of pattern and its support.
We continue our regular series of posts on refactoring of Web pages based on semantic approaches; we invite the interested new reader to take a look at the previous contributions to get a general picture of our intentions.
In this particular and brief post, we just want to present and describe some simple but interesting empirical data related to the structural (syntactic) content of a given sample of pages we have been analyzing during the last days. The results are part of a white paper we are currently preparing; it will be available at this site shortly.
We may remember from our first post that we want to recover semantics from structure, given particular clues and patterns we usually come across when analyzing pages. The approach is simpler to describe than to put into practice: once semantics has somehow been detected, refactoring steps can be applied at certain places in the page and, by doing so, some expected benefits can be gained.
However, syntactic structure is the result of encoding some specific semantics and intentions on a web page using HTML elements and functionality; the HTML language is (expressively speaking) rather limited (with too much emphasis on presentation issues, for instance), and some common programming "bad practices" increase the complexity of recovering semantics mainly from syntactic content as input. HTML being quite declarative, such complexity can make the discovery problem quite challenging in a pragmatic context, indeed. That is our more general goal; however, we do not want to go that far in this post. We just want to keep this perspective in mind and give the reader some insight and data to think about. We will elaborate more on recovery in forthcoming posts.
As usual in the NLP field, it is interesting to use the so-called Noisy-Channel model as a point of reference and analogy. We may think of the initial semantics as the input message to the channel (the programmer); the web page is the output message. The programmer uses syntactic rules to encode semantics during coding, adding more or less noisy elements. Different encoding forms normally exist; the noise can be greater when too much structure is engaged to express some piece of the message.
A typical example of noisy encoding is the use of tables for handling style, presentation or layout purposes, beyond the hypothetically primary intention of that kind of table element: simply to be an arrangement of data. More complex software maintenance and sometimes lower performance may be consequences of too much noise, among other matters.
Let us take a look at some data concerning questions like: how much noise is there in a page? What kind of noise? What kind of regular encodings can be found?
As a warning, we do not claim anything about statistical significance, because our sample is clearly too small and was based on biased selection criteria. Our results are very preliminary in general. However, we feel they may be sound and believable, in some way consistent with the noisy model.
Our "corpus" consists of 834 pages which were crawled starting, for convenience, at a given root page in Costa Rica, namely http://www.casapres.go.cr/. The size depended on a predetermined maximal quantity of 1000 nodes to visit; we never took more than 50 of the paths pointed to in a page, and we preferred visiting homepages to avoid traps.
Let us see some descriptive profile of the data. Due to current limitations of the publishing tool, we are not presenting charts complementing the raw numbers.
Just 108 kinds of tags were detected, and we have 523,016 instances of them in the corpus. That means, very roughly, 6 kinds of tags per page and 627 instances per page. We feel that suggests the use of the same tags for saying probably different things (we note that many pages are homepages by choice).
The top 10 tags are: pure text, a, td, tr, br, div, li, img, p and font (by absolute frequency). Together, text, a (anchor) and img correspond to more than 60% of all instances. Hence, 60% of page content is some form of data.
We notice that table accounts for 1% and td for 8.5% of all instances, against 42% for text and 15% for anchors. On average, we have 7 tables per page, 54 tds per page and 6 tds per table, roughly speaking.
Likewise, we saw just 198 attributes and 545,585 instances of attributes. The most popular ones are: href, shape, colspan, rowspan, class, width, clear and height, which is relatively consistent with the observed tag frequency (e.g. href for anchor, colspan and rowspan for td).
We pay special attention to tables in the following lines. Our corpus has 5,501 tables. It is worth mentioning that 65% of them are children of a td, in other words nested into another table. Hence, a high proportion of nesting, which suggests complexity in table design. We see that 77% of the data (text, a, img) in the sample is dominated by tds (most of the data is table-dominated). In the case of anchors, 33% of them are td-dominated, which may suggest tables being used as navigation bars or similar semantic devices in an apparently very interesting proportion.
We decided to explore semantic patterns on tables a little more exactly. For instance, we chose tables of nx1 dimension (n rows, 1 column), which are good candidates for navigation bars. A simple analysis shows that 618 tables (11%) have such a shape. The internal shape may differ, which is quite interesting. For instance, we see a 5x1 table where all tds are anchors. We denote that by a sequence of 1s and 0s, where 1 means the corresponding td contains an anchor (a link to some URL): in this case '1.1.1.1.1' is the sequence (the sketch below shows how such a pattern can be computed). But another table of the same 5x1 size presents the pattern '1.0.1.0.1'. This same pattern occurs several times, for instance in a 50x1 table. Another case is this: '0.0.0.0.1.1.1.1.1.0', maybe suggesting that some links are not available. We mention that 212 patterns are 1x1, which would be a kind of navigation button. We will present a more elaborate analysis of these table patterns in a following post.
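As referenced above, here is a rough sketch of how such an anchor-presence pattern could be computed, again over a simplified DOM-like type of our own (not the actual tooling):
// A minimal DOM-like element, as in earlier sketches.
case class Elem(name: String, children: List[Elem])

// One digit per <tr>: 1 if some <td> of the row directly contains an <a>, else 0,
// e.g. "1.0.1.0.1" for the 5x1 table mentioned above.
def anchorPattern(table: Elem): String =
  table.children
    .filter(_.name == "tr")
    .map { tr =>
      val hasAnchor = tr.children
        .filter(_.name == "td")
        .exists(_.children.exists(_.name == "a"))   // direct children only, for brevity
      if (hasAnchor) "1" else "0"
    }
    .mkString(".")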
To finish, we notice that 875 tables (16%) are not regular: some rows have different sizes. Some of them are very unusual, like this 28x8 table, where each number in the following sequence denotes the number of tds in the row: 4.4.4.6.8.8.7.2.8.4.4.6.6.6.6.5.4.5.5.5.5.5.5.5.5.5.5.1.
Noisy, isn't it?
How do we appropriately represent Web pages for semantic analysis, and particularly for refactoring purposes? In a previous post at this same place, we introduced some of the main principles of what we have called semantically driven rational refactoring of Web pages, as a general framework of symbolic reasoning. In this post, we want to develop a preliminary but formally appealing refinement concerning page representation; we assume our previous post has been read, and we put ourselves in its context: the development of automated tools pointing to the task of refactoring as a goal.
We claim that many valuable derivations, of both conceptual and implementation nature and interest, can be obtained by following a semantic approach to the handling of web pages; we believe looking in more detail at such derivations may provide us with more evidence to sustain our claim on semantics in the context of refactoring. In this post, we want to consider one very specific case of conceptual derivation with respect to the representation model. Specifically, we want to present some intuitive criteria for theoretically analyzing a symbolic interpretation of the popular k-means clustering scheme, which is considered useful for the domain of web pages in more general contexts.
Let us first briefly recall why this is relevant to refactoring, to set up a general context. Refactoring is often related to code understanding; hence, building a useful semantic classification of data is, in general, a first step in the right direction. In our case, web pages are our data sources, and we want to be able to induce semantic patterns inside this kind of software element, in a similar way as search engines do to classify pages (however, we expect that more specific semantic criteria will probably be involved in our refactoring case, where, for instance, procedural and implementation features are especially relevant; we will expose such criteria in future posts).
In any case, it is natural to consider clustering of pages according to some given semantic criteria as an important piece in our refactoring context: clustering is in essence about (dis)similarity between data, and is consequently very important in relation to common refactoring operations, especially those aiming at code redundancy. For instance, two similar code snippets in some set of web pages may become clustered together, allowing in this way the abstraction and refactoring of a "web control" or similar reusable element.
As several authors have done in other contexts, we may use a symbolic representation model of a page; in a general fashion, pages can naturally be represented by means of so-called symbolic objects. For instance, an initial approximation of a page (in an XML notation) can be something of the form given below (where relevant page attributes (or features) like "title" and "content" can be directly extracted from the HTML):
<rawPage
title="Artinsoft home page"
content="ArtinSoft, the industry leader in legacy application migration and software upgrade technologies, provides the simple path to .NET and Java from a broad range of languages and platforms, such as Informix 4GL and Visual Basic."
.../>
Because this representation is not, as such, effortless for calculation purposes, attribute values need to be scaled from plain text into a more structured basis, for clustering and other similar purposes. One way is to translate each attribute value into numerical frequencies (e.g. a histogram) according to the relative use of normalized keywords taken from a given vocabulary (like "software", "legacy", "java", etc.) by using standard NLP techniques. After scaling, our page would probably be something similar to the following:
<scaledPage
title ="artinsoft:0.95; software:0.00; java:0.00; "
content="artinsoft:0.93; software:0.90; java:0.75; " .../>
The attribute group values denote keywords and their relative frequency in the page (the data is hypothetical in this simple example), as usual. Notice that the attribute group is another form of symbolic object whose values are already numbers. In fact, we may build another, completely flat symbolic object where attribute values are all scalar, in an obvious way:
<flatPage
title.artinsoft="0.95" title.software="0.00" title.java="0.00"
content.artinsoft="0.93" content.software="0.90" content.java="0.75" .../>
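For concreteness, one possible Scala rendering (our own, merely illustrative) of such a flattened object keeps every scaled feature as a scalar keyed by group and keyword; the keywords and numbers are just the hypothetical ones from the example:
// A flat symbolic object: every feature is a scalar keyed by "group.keyword".
case class FlatPage(features: Map[String, Double])

val page = FlatPage(Map(
  "title.artinsoft"   -> 0.95, "title.software"   -> 0.00, "title.java"   -> 0.00,
  "content.artinsoft" -> 0.93, "content.software" -> 0.90, "content.java" -> 0.75
))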
Now pages that are in this way related to some common keyword set can be easily compared using an appropriate numerical distance. Of course, more complex and sophisticated representations are feasible, but numerically scaling the symbolic objects remains an important need at some final stage of the calculation. However, and this is a key point, many important aspects of the computational process can be modeled independently of such a final numerical representation (leaving aside the eventual explosion of low-level attributes, or the poorly dense, long vectors of flat representations, which are not the point here).
Let us develop more on that by considering the k-means clustering scheme, but without committing to any specific numerically valued scale, at least not too early. That is, we want to keep the essence of the algorithm so that it works for "pure" symbolic objects. In a simplified way, a (hard) k-means procedure works as follows:
1. Given are a finite set of objects D and an integer parameter k.
2. Choose k objects from D as the initial group of cluster leaders or centroids.
3. Assign each object in D to the group that has the closest centroid.
4. When all objects in D have been assigned, recalculate the k centroids.
5. Repeat steps 3 and 4 until the centroids no longer change (move).
Essential parameters to the algorithm are 1) the distance function (of step 3) used to compute the closest centroid, and 2) some form of "object combination" operation used to produce the centroids (step 4); in the numerical case, "combination" computes the mean vector of the vectors in the cluster. A generic sketch is given below.
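A compact sketch of this loop in Scala, kept abstract over the object type D so that the distance and the combine operation remain the only commitments, could look as follows (all names here are our own; for simplicity the initial leaders are just the first k objects, and a maximum number of iterations is added as a safety net):
// Abstract hard k-means: D is any object type; dist and combine are the
// two essential parameters discussed above (illustrative sketch only).
def kMeans[D](objects: List[D], k: Int,
              dist: (D, D) => Double,
              combine: List[D] => D,
              maxIters: Int = 100): List[D] = {
  var centroids = objects.take(k)          // step 2 (simplified choice of leaders)
  var done = false
  var iters = 0
  while (!done && iters < maxIters) {
    // step 3: assign every object to its closest centroid
    val clusters: Map[D, List[D]] =
      objects.groupBy(o => centroids.minBy(c => dist(o, c)))
    // step 4: recompute each centroid by combining its cluster
    val next = centroids.map(c => clusters.get(c).map(combine).getOrElse(c))
    done = next == centroids               // step 5: stop when nothing moved
    centroids = next
    iters += 1
  }
  centroids
}
For plain numeric vectors, dist would be the Euclidean distance and combine the component-wise mean, recovering the classical algorithm.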
Such a combination needs to be "consistent" with the distance to assure eventual termination (not necessarily global optimality). We aim to state a formal criterion of consistency in an abstract way. Fundamental for assuring termination is that every assignment of an object X to a new cluster occurs at a lower distance to the new cluster leader than the distance to X's previous cluster leaders. In other words, the assignment of X must be a really new one. The combine operation after assignment (step 4) must guarantee that, in every old cluster where X was, the leaders remain more distant after future recalculations. So we need abstract definitions of distance and object combination that satisfy this invariant, which we call consistency.
Proposing a distance is usually easier (in analogy to the vector model): objects are supposed to belong to some class, and the feature values belong to some space or domain provided with a metric. For instance, an attribute like "title" takes values over a "String" domain, where several more or less useful numerical metrics can be defined. For more structured domains, built up from metric spaces, we may homomorphically compose the individual metrics of the parts to build the metric of the composed domain. For a very simplified example, if we have two "page" objects X and Y of the following form:
X= <page>
<title>DXT</title>
<content>DXC</content>
</page>
Y= <page>
<title>DYT</title>
<content>DYC</content>
</page>
where DXT and DYT are possibly complex elements representing symbolic objects over a metric domain DT provided with distance dist_DT, and similarly DXC and DYC come from a metric space DC with metric dist_DC, then we may induce a metric dist_page for the domain "page" as a linear combination (where weights for the dimensions could also be considered):
dist_page(X, Y) = dist_DT(DXT, DYT) + dist_DC(DXC, DYC)
Using this simple illustration, we see that we may proceed in a similar fashion for the combine_page operation, assuming operations combine_DT and combine_DC. In such a case, we have:
combine_page(X,Y)= <page>
<title>combine_DT(DXT,DYT)</title>
<content>combine_DC(DXC,DYC)</content>
</page>
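In code, the homomorphic composition can be expressed by parameterizing the page-level operations with the component operations; a sketch with our own names, where the metrics and combinations of the component domains (dist_DT, dist_DC, combine_DT, combine_DC) are assumed to be given:
// "page" objects built from two metric domains DT and DC; the page-level
// distance and combination are obtained component-wise, as in the formulas above.
case class Page[DT, DC](title: DT, content: DC)

def distPage[DT, DC](distDT: (DT, DT) => Double, distDC: (DC, DC) => Double)
                    (x: Page[DT, DC], y: Page[DT, DC]): Double =
  distDT(x.title, y.title) + distDC(x.content, y.content)   // weights could be added here

def combinePage[DT, DC](combDT: (DT, DT) => DT, combDC: (DC, DC) => DC)
                       (x: Page[DT, DC], y: Page[DT, DC]): Page[DT, DC] =
  Page(combDT(x.title, y.title), combDC(x.content, y.content))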
It is straightforward to formulate the general case; the principle is simple enough. It is worth mentioning that "combine" is clearly a kind of abstract unification operation; in general, we claim it has to be consistent with the "distance" function (at every domain); further requirements rule this operation, in particular that the domain needs to be combination-complete or combination-dense.
Proving consistency strongly relies on abstract properties like the following:
dist_D(X, Y) >= dist_D(combine_D(X,Y), Y)
In other words, "combine" actually brings objects closer according to the distance. Fortunately, by using homomorphic compositions of consistent domains, we recover consistency in the composed domain. At the scalar level, for instance with numbers, using the average as the "combination" operation easily assures consistency.
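At the scalar level this is easy to check; a throwaway sketch with the absolute difference as distance and the average as combination (our own toy example, not part of the framework):
// |x - y| >= |avg(x, y) - y| always holds, since the average lies halfway
// between x and y; a quick sanity check over random pairs.
def dist(x: Double, y: Double): Double    = math.abs(x - y)
def combine(x: Double, y: Double): Double = (x + y) / 2.0

val ok = (1 to 1000).forall { _ =>
  val x = scala.util.Random.nextDouble() * 100
  val y = scala.util.Random.nextDouble() * 100
  dist(x, y) >= dist(combine(x, y), y)
}
println(ok)   // expected: true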