Semantics from Structural Parts of Web Pages: some figures and patterns

1. February 2007 09:55 by CarlosLoria in General  //  Tags:   //   Comments (0)

We continue our regular series of posts talking about refactoring of Web Pages based on semantic approaches; we invite the interested new reader to take a look at the previous contributions to get a general picture of our intentions.

In this particular and brief post, we just want to present and describe some simple but interesting empirical data which are related with the structural (syntactic) content of some given muster of pages we have been analyzing during the last days. The results are part of a white page we are preparing, currently; it will be available at this site in short time.

We may remember from our first post that we may want to recover semantics from structure given particular clues and patterns we usually may come across when analyzing pages. The approach is simpler to describe than to put into practice: Once semantics could be somehow detected, refactoring steps can be applied on some places at the page and, by doing so, some expected benefits can be gained.

However, syntactic structure is the result of encoding some specific semantics and intentions on a web page using HTML elements and functionality; the HTML language is (expressively speaking) rather limited (where too much emphasis on presentation issues is the case, for instance) and some common programming “bad practices” increase the complexity of recovering semantics mainly based on syntactic content as input. And being HTML quite declarative, such complexity can make the discovering problem quite challenging in a pragmatic context, indeed. That is our more general goal, however, we do not want to go that far in this post, we just want to keep this perspective in mind and give the reader some insight and data to think about it. We will be elaborating more on recovering in forthcoming posts.

As usual in NLP field, it is interesting to use the so-called Noisy-Channel model as point of reference and analogy. We may think of the initial semantics as the input message to the channel (the programmer); the web page is the output message. The programmer uses syntactic rules to encode semantics during coding adding more or less noisy elements. Different encodings forms do normally exist, noisy can be greater when too much structure is engaged for expressing some piece of the message.

A typical example of noisy encoding is the use of tables for handling style, presentation or layout purposes beyond the hypothetically primary intention of such kind of table element: just to be an arrange of data. Complex software maintenance and sometimes lower performance may be a consequence of too much noise, among others matters.

Let us take a look at some data concerning questions like: how much noise in page? What kind of noise? What kind of regular encodings could be found?

As a warning, we do not claim anything on statistical significance because our muster is clearly too small and was based on biased selection criteria. Our results are very preliminary, in general. However, we feel they may be sound and believable, in some way consistent with the noisy model.

Our “corpus” comes from of 834 pages which were crawled starting for convenience at a given root page in Costa Rica, namely: http://www.casapres.go.cr/. The size depended of a predetermined maximal quantity of 1000 nodes to visit; we never took more than 50 paths of those pointed in a page and we rather preferred visiting homepages to avoid traps.

Let us see some descriptive profile of the data. For current limitations of the publishing tool, we are not presenting some charts complementing the raw numbers.

Just 108 kinds of tags were detected and we have 523.016 instances of them in corpus. That means, very roughly, 6 kinds of tags per page, 627 instances per page. We feel that suggests the use of the same tags for saying probably different things (we remark that many pages are homepages for choice).

The top 10 of tags are: pure text, a, td, tr, br, div, li, img, p and font (according to absolute frequency). Together text, a (anchor) and img correspond to more than 60% all instances. Hence 60% of pages are some form of data.

We notice that ‘table’ is 1% and td 8.5% of all instances, against 42% from text, 15% from anchors. In average, we have 7 tables per page and 54 tds per page, 6 td per table, roughly speaking.

Likewise we just saw 198 attributes and 545.585 instances of attributes. The 10 most popular are: href, shape, colspan, rowspan, class, width, clear and height, which is relatively consistent with the observed tag frequency (egg. href for anchor, colspan and rowspan for td).

We pay some special attention to tables in the following lines. Our corpus has 5501 tables. It is worth to mention that 65% of them are children of td; in other words nested into another table. Hence a high proportion of nesting which suggests complexity in table design. We see that 77% of data (text, a, img) in muster are dominated by tds (most of the data is table dominated). In the case of anchors, 33% of them are td-dominated, what may suggest tables being used as navigational bars or similar semantic devices in an apparently very interesting proportion.

We decided to explore semantic pattern on tables a little bit more exactly. For instance, we choose tables of nx1 dimension (n rows, 1 column) which are good candidates for navigational bars. A simple analysis shows that 618 tables (11%) have such a shape. The shape may be different which is quite interesting. For instance, we see a 5x1 table where all td are anchors. We denote that but a sequence of 1 and 0, where 1 means the corresponding td contains an anchor (a link to some url): in this case ‘1.1.1.1.1’ is the sequence. But another table of the same 5x1 size presents the pattern ‘1.0.1.0.1’. This same pattern occurs several times for instance in 50x1 table. Another case is this:0.0.0.0.1.1.1.1.1.0’ maybe suggesting that some links are not available. We mention that 212 patterns are 1x1, which would be a kind of navigation button. We will present more elaborated analysis of this table patterns in the following post.

To finish, we notice that 875 tables (16%) are not regular: some rows have different size. Some of them are very unusual like in this 28x8 table, where each number in following sequence denotes the size id tds of the row: 4.4.4.6.8.8.7.2.8.4.4.6.6.6.6.5.4.5.5.5.5.5.5.5.5.5.5.1.

Noisy, isn’t it?