By taking a look at Web pages, we may expect to discover that
some semantic patterns are encoded using a very small set of HTML “combinators”, let us say
(HTML parts); this may be due to the lack of abstraction capabilities
inherent to HTML alone. In a previous post we compared this situation to the noisy-channel model
and presented some figures and data illustrating the claim.
Let us continue our journey with further instances of this phenomenon, whose formal
analysis is crucial for intelligent refactoring tools of the kind we have been introducing
throughout this series of posts. In other words, let us get to know other forms of “HTML noise”. As a word of warning, we recall
that the data comes from a particular sample of crawled pages, gathered in the way we explained before.
For this post, we experiment
with tables that are potentially used for page layout or page structure. For
such tables, we want to study the table shape or page surface, not the specific content; we may think of this as a
way to filter potential candidates for deeper semantic analysis. (We briefly
recall that our sample contains 819 pages and roughly 5000 table instances.)
The exercise is simple: we
postulate an intuitive definition of a table as surface and see how well it is
supported by the data in our sample.
Let us try our shallow analysis
by classifying a table as a page layout candidate
if its container is the page body
tag, possibly followed by a chain of div
tags (assuming such div tags are intended to organize or format the
table), and it has at least two rows and at least two columns (two columns is the
most interesting case; we take it as the base case).
Such a pattern definition sounds
reasonable; however, we will see that its empirical support is
not as high as one might expect, at least in our sample.
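
As a rough sketch of how such a definition could be checked mechanically, the
following Python code uses the standard html.parser module. It is only an
illustration, not the tool behind the figures in this post. The names
TableShapeParser, find_tables and is_layout_candidate are hypothetical. We also
assume that "columns" means the number of cells in the widest row, that colspan
is ignored, and that omitted end tags are not repaired the way a browser would.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TableShapeParser(HTMLParser):
    """Records, for every table, its ancestor path and the cells per row."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # names of the currently open elements
        self.open_tables = []  # tables being read, innermost last
        self.tables = []       # all table records, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "table":
            record = {"path": list(self.open_tags), "rows": []}
            self.tables.append(record)
            self.open_tables.append(record)
        elif tag == "tr" and self.open_tables:
            # A new row always belongs to the innermost open table.
            self.open_tables[-1]["rows"].append(0)
        elif tag in ("td", "th") and self.open_tables and self.open_tables[-1]["rows"]:
            self.open_tables[-1]["rows"][-1] += 1
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag, ignore it
        # Pop up to the matching start tag, tolerating unclosed children.
        while True:
            popped = self.open_tags.pop()
            if popped == "table":
                self.open_tables.pop()
            if popped == tag:
                break


def find_tables(html):
    parser = TableShapeParser()
    parser.feed(html)
    parser.close()
    return parser.tables


def is_layout_candidate(table):
    """body, then only div tags, then the table; >= 2 rows and >= 2 columns."""
    path, rows = table["path"], table["rows"]
    if "body" not in path:
        return False
    below_body = path[path.index("body") + 1:]
    return (all(name == "div" for name in below_body)
            and len(rows) >= 2
            and max(rows) >= 2)  # assumption: columns = cells in the widest row


SAMPLE_PAGE = """
<html><body>
<div><div>
<table>
  <tr><td>menu</td><td>content</td></tr>
  <tr><td>footer</td><td><table><tr><td>inner</td></tr></table></td></tr>
</table>
</div></div>
</body></html>
"""

for table in find_tables(SAMPLE_PAGE):
    print(table["rows"], is_layout_candidate(table))
```

In this made-up page, the outer table sits under body and a chain of two div
tags and has two rows of two cells, so it qualifies; the inner table is nested
inside a cell and has a single row, so it does not.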
We find 261 such candidates;
they represent 31% of all pages, which is quite an interesting amount; however,
it is unexpectedly small because one might guess there should be at least one per
page. Among these 261, there are 83 where the table hangs directly from the
body tag (32% of the candidates; 10% of the whole sample). As a matter of fact,
these 83 tables present irregular patterns, although we often find 2 columns (65%),
with a high variance. For instance, we may find a pattern of the form 6.2.2.2.2.2.2.2,
where we use our convention of showing a table of n rows as a sequence of n
numbers, each being the number of columns in the corresponding row (in the example, 8 rows, the first
with 6 columns and the rest with 2 columns). Even worse, we find the
irregular pattern 2.2.7.2.7.7.6.5.5.4.4.5.2.3.2.7.2.7.
And speaking of irregularity, let us take a look at this interesting one: 19.2.7.4.6.2.2.2.2.2.2.2.2.2.2.5.7.2.2.2.2.4.4.2,
whatever it means.
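
To make the shape convention concrete, here is a small Python sketch that
converts between a list of per-row column counts and the dotted notation, and
that computes one possible regularity measure: the most frequent row width and
the share of rows having it. The function names are hypothetical, and this
measure is our own illustration; it is not necessarily how the percentages
quoted above were obtained.

```python
from collections import Counter


def shape_signature(rows):
    """[6, 2, 2] -> '6.2.2' (one number per row: its column count)."""
    return ".".join(str(n) for n in rows)


def parse_signature(signature):
    """'6.2.2' -> [6, 2, 2]"""
    return [int(part) for part in signature.split(".")]


def dominant_width(rows):
    """Most frequent column count and the fraction of rows that have it."""
    width, count = Counter(rows).most_common(1)[0]
    return width, count / len(rows)


first = parse_signature("6.2.2.2.2.2.2.2")
print(len(first), dominant_width(first))  # 8 (2, 0.875)

second = parse_signature("2.2.7.2.7.7.6.5.5.4.4.5.2.3.2.7.2.7")
print(len(second), dominant_width(second))
```

For the first pattern, 7 of the 8 rows have 2 columns, so the table is almost
regular; for the second one, only a third of its 18 rows share the most
frequent width, which matches the intuition that it is far noisier.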
With this simple analysis, we
learn that some intuitive definitions may occur less frequently than one might
expect in our sample. Indeed, after looking at some of the irregular
cases in detail, a sound conclusion might be that we first need to pre-classify some
parts of the page before using general patterns like the one we tried directly.
In other words, some noise needs to be filtered out for this kind
of pattern.
In a forthcoming post, we will continue
studying this kind of pattern and its support.