We continue our regular series of posts on refactoring Web pages using semantic approaches; we invite new readers to take a look at the previous contributions to get a general picture of our intentions.
In this brief post, we just want to present and describe some simple but interesting empirical data on the structural (syntactic) content of a sample of pages we have been analyzing over the last few days. The results are part of a white paper we are currently preparing; it will be available on this site shortly.
Recall from our first post that we want to recover semantics from structure, given particular clues and patterns we usually come across when analyzing pages. The approach is simpler to describe than to put into practice: once semantics have somehow been detected, refactoring steps can be applied at certain places in the page and, by doing so, some expected benefits can be gained.
However, syntactic structure is the result of encoding some specific semantics and intentions on a web page using HTML elements and functionality. The HTML language is rather limited in expressive terms (it puts too much emphasis on presentation issues, for instance), and some common programming “bad practices” increase the complexity of recovering semantics mainly from syntactic content as input. And since HTML is quite declarative, such complexity can make the discovery problem quite challenging in practice. That is our more general goal; however, we do not want to go that far in this post. We just want to keep this perspective in mind and give the reader some insight and data to think about. We will elaborate on recovery in forthcoming posts.
As is usual in the NLP field, it is interesting to use the so-called noisy-channel model as a point of reference and analogy. We may think of the initial semantics as the input message to the channel (the programmer); the web page is the output message. The programmer uses syntactic rules to encode semantics while coding, adding more or less noisy elements. Different encoding forms normally exist; noise can be greater when too much structure is used to express some piece of the message.
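In the standard noisy-channel formulation, recovery amounts to choosing the most probable source message given the observed output. Transposed to our analogy, and only as a sketch, let s stand for the intended semantics and w for the observed web page (both symbols are notation assumed here, not quantities we estimate in this post):

```latex
\hat{s} \;=\; \arg\max_{s} P(s \mid w) \;=\; \arg\max_{s} P(w \mid s)\, P(s)
```

Here P(w | s) plays the role of the channel, that is, how a programmer tends to encode a given intention, and P(s) expresses which intentions are plausible in the first place.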
A typical example of noisy encoding is the use of tables for style, presentation or layout purposes, beyond the hypothetically primary intention of the table element: to be an arrangement of data. Complex software maintenance and sometimes lower performance may be consequences of too much noise, among other matters.
Let us take a look at some data concerning questions like: how much noise is there in a page? What kind of noise? What kinds of regular encodings can be found?
As a warning, we do not claim anything about statistical significance because our sample is clearly too small and was based on biased selection criteria. Our results are very preliminary, in general. However, we feel they may be sound and believable, and in some way consistent with the noisy-channel model.
Our “corpus” consists of 834 pages which were crawled starting, for convenience, at a given root page in Costa Rica, namely: http://www.casapres.go.cr/. The size depended on a predetermined maximum of 1000 nodes to visit; we never followed more than 50 of the links found in a page, and we preferred visiting homepages to avoid traps.
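The two crawl limits can be sketched as follows. This is a minimal illustration and not our actual crawler: the breadth-first order, the function name `crawl` and the `get_links` callback (which would fetch a page and return its outgoing links) are assumptions made for the example, and the preference for homepages is not modeled.

```python
from collections import deque
from typing import Callable, Iterable

MAX_NODES = 1000  # cap on visited pages, as stated in the post
MAX_LINKS = 50    # at most this many outgoing links followed per page


def crawl(root: str,
          get_links: Callable[[str], Iterable[str]],
          max_nodes: int = MAX_NODES,
          max_links: int = MAX_LINKS) -> list[str]:
    """Visit pages breadth-first from root, bounded in size and fan-out.

    get_links is a hypothetical callback returning the links of a page;
    breadth-first order is an assumption, the post does not specify it.
    """
    visited: list[str] = []
    seen = {root}
    queue = deque([root])
    while queue and len(visited) < max_nodes:
        url = queue.popleft()
        visited.append(url)
        # Only the first max_links links of each page are considered.
        for link in list(get_links(url))[:max_links]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

With a toy link graph such as `{"a": ["b", "c"], "b": ["d"]}`, calling `crawl("a", lambda u: graph.get(u, []))` visits each reachable page once, which is also why a corpus can end up smaller than the node cap.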
Let us see a descriptive profile of the data. Because of current limitations of the publishing tool, we are not presenting charts to complement the raw numbers.
Just 108 kinds of tags were detected, and we have 523,016 instances of them in the corpus. That means, very roughly, 6 kinds of tags per page and 627 instances per page. We feel that suggests the use of the same tags for saying probably different things (we remark that many pages are homepages by choice).
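Counts of this kind can be gathered with a simple event-based parser. The following is a minimal sketch using Python's standard `html.parser`, not the tool we actually used; the class name `TagCounter`, the function `count_tags` and the `#text` label for pure text nodes are names assumed for the example.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts element start tags and non-empty text nodes."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; self-closing tags also land here.
        self.counts[tag] += 1

    def handle_data(self, data):
        # "#text" is our label for pure text; whitespace-only runs are skipped.
        if data.strip():
            self.counts["#text"] += 1


def count_tags(html: str) -> Counter:
    """Return a Counter of tag kinds (plus '#text') for one page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

Summing such counters over all pages gives the corpus totals, and `len(counter)` gives the number of distinct kinds. The per-page average of instances is then a plain division: 523,016 / 834 is about 627.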
The top 10 tags are: pure text, a, td, tr, br, div, li, img, p and font (by absolute frequency). Together, text, a (anchor) and img account for more than 60% of all instances. Hence more than 60% of page content is some form of data.
We notice that ‘table’ is 1% and td 8.5% of all instances, against 42% for text and 15% for anchors. On average, we have 7 tables per page and 54 tds per page, or 6 tds per table, roughly speaking.
Likewise, we saw just 198 attributes and 545,585 instances of attributes. Among the 10 most popular are: href, shape, colspan, rowspan, class, width, clear and height, which is relatively consistent with the observed tag frequency (e.g. href for anchor, colspan and rowspan for td).
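The consistency between attributes and tags can be checked by counting each attribute together with the tag it appears on. This is a hedged sketch along the same lines as a tag count; `AttributeCounter` and `count_attributes` are hypothetical names, not part of our tooling.

```python
from collections import Counter
from html.parser import HTMLParser


class AttributeCounter(HTMLParser):
    """Counts attribute names, overall and per (tag, attribute) pair."""

    def __init__(self):
        super().__init__()
        self.attributes = Counter()
        self.by_tag = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this start tag.
        for name, _value in attrs:
            self.attributes[name] += 1
            self.by_tag[(tag, name)] += 1


def count_attributes(html: str):
    """Return (attribute counts, (tag, attribute) counts) for one page."""
    parser = AttributeCounter()
    parser.feed(html)
    parser.close()
    return parser.attributes, parser.by_tag
```

A pair such as `("td", "colspan")` or `("a", "href")` ranking high in `by_tag` is the kind of agreement with tag frequency mentioned above.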
We pay special attention to tables in the following lines. Our corpus has 5501 tables. It is worth mentioning that 65% of them are children of a td; in other words, nested inside another table. Hence there is a high proportion of nesting, which suggests complexity in table design. We see that 77% of data (text, a, img) in the sample is dominated by tds (most of the data is table-dominated). In the case of anchors, 33% of them are td-dominated, which may suggest tables being used as navigation bars or similar semantic devices in an apparently very interesting proportion.
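Nesting and td-domination can both be measured by keeping a stack of the currently open table and td elements while parsing. The sketch below is one possible way to do it and makes several assumptions: the name `TableStats` is hypothetical, "child of td" is approximated as "the innermost open table or td element is a td", th cells are ignored, and an unclosed td is considered closed by the next td or tr.

```python
from html.parser import HTMLParser


class TableStats(HTMLParser):
    """Tracks open table/td elements to measure nesting and td-domination."""

    DATA_TAGS = {"a", "img"}

    def __init__(self):
        super().__init__()
        self.open = []           # stack of currently open "table"/"td" tags
        self.tables = 0
        self.nested_tables = 0   # tables opened directly inside an open td
        self.data = 0            # text nodes, anchors and images
        self.data_in_td = 0      # those with some td ancestor

    def _close_cell(self):
        # </td> is optional in HTML, so a new cell or row closes the last one.
        if self.open and self.open[-1] == "td":
            self.open.pop()

    def _count_data(self):
        self.data += 1
        if "td" in self.open:
            self.data_in_td += 1

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1
            if self.open and self.open[-1] == "td":
                self.nested_tables += 1
            self.open.append("table")
        elif tag in ("td", "tr"):
            self._close_cell()
            if tag == "td":
                self.open.append("td")
        elif tag in self.DATA_TAGS:
            self._count_data()

    def handle_endtag(self, tag):
        if tag in ("td", "tr"):
            self._close_cell()
        elif tag == "table":
            # Drop everything up to and including the matching table.
            while self.open and self.open.pop() != "table":
                pass

    def handle_data(self, data):
        if data.strip():
            self._count_data()
```

After feeding a page, `nested_tables / tables` and `data_in_td / data` give the two proportions discussed above for that page; restricting `DATA_TAGS` to `{"a"}` and dropping text would give the anchor-only figure.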
We decided to explore semantic patterns in tables a little more precisely. For instance, we chose tables of nx1 dimension (n rows, 1 column), which are good candidates for navigation bars. A simple analysis shows that 618 tables (11%) have such a shape. The content pattern may vary, which is quite interesting. For instance, we see a 5x1 table where all tds contain anchors. We denote that by a sequence of 1s and 0s, where 1 means the corresponding td contains an anchor (a link to some URL): in this case ‘1.1.1.1.1’ is the sequence. But another table of the same 5x1 size presents the pattern ‘1.0.1.0.1’. This same alternating pattern occurs several times, for instance in a 50x1 table. Another case is this: ‘0.0.0.0.1.1.1.1.1.0’, maybe suggesting that some links are not available. We mention that 212 patterns are 1x1, which would be a kind of navigation button. We will present a more elaborate analysis of these table patterns in the following post.
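The 0/1 notation can be produced mechanically. The following sketch assumes its input is the HTML of a single table without nested tables, and treats an `a` element with a non-empty `href` as an anchor; `CellScanner` and `anchor_pattern` are names invented for the example.

```python
from html.parser import HTMLParser


class CellScanner(HTMLParser):
    """Collects, per row, one 0/1 flag per td: does the cell hold an anchor?"""

    def __init__(self):
        super().__init__()
        self.rows = []        # list of rows; each row is a list of 0/1 flags
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self.in_cell = False
        elif tag == "td":
            if not self.rows:
                self.rows.append([])  # tolerate a td with no explicit tr
            self.rows[-1].append(0)
            self.in_cell = True
        elif tag == "a" and self.in_cell and dict(attrs).get("href"):
            self.rows[-1][-1] = 1

    def handle_endtag(self, tag):
        if tag in ("td", "tr"):
            self.in_cell = False


def anchor_pattern(table_html: str):
    """Return the dotted 0/1 pattern of an n x 1 table, or None otherwise."""
    scanner = CellScanner()
    scanner.feed(table_html)
    scanner.close()
    if not scanner.rows or any(len(row) != 1 for row in scanner.rows):
        return None
    return ".".join(str(row[0]) for row in scanner.rows)
```

Grouping tables by the returned string (for instance with a `Counter`) is then enough to see how often a pattern such as ‘1.0.1.0.1’ recurs.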
To finish, we notice that 875 tables (16%) are not regular: some rows have different sizes. Some of them are very unusual, as in this 28x8 table, where each number in the following sequence denotes the number of tds in the corresponding row: 4.4.4.6.8.8.7.2.8.4.4.6.6.6.6.5.4.5.5.5.5.5.5.5.5.5.5.1.
Noisy, isn’t it?
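For readers who want to check, the shape and the irregularity of that last table follow directly from its row-size sequence. A small sketch, with function names chosen for the example and with colspan deliberately ignored (we only count tds per row):

```python
def parse_row_sizes(sequence: str) -> list[int]:
    """Turn a dotted sequence such as '4.4.6' into a list of td counts."""
    return [int(part) for part in sequence.split(".")]


def table_shape(sizes: list[int]) -> tuple[int, int]:
    """Shape as (number of rows, size of the widest row)."""
    return len(sizes), max(sizes)


def is_regular(sizes: list[int]) -> bool:
    """A table is regular when every row has the same number of tds."""
    return len(set(sizes)) <= 1


# Row-size sequence quoted in the post for the unusual table.
ROW_SIZES = parse_row_sizes(
    "4.4.4.6.8.8.7.2.8.4.4.6.6.6.6.5.4.5.5.5.5.5.5.5.5.5.5.1"
)
```

Here `table_shape(ROW_SIZES)` gives 28 rows with a widest row of 8 tds, which is what we mean by a 28x8 table, and `is_regular(ROW_SIZES)` is false.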