View the limitations of Web clipping along with detailed
explanations.
Incorrect or unexpected results on the text clipping page
When creating a Web clipping portlet
using the Text clipping option, you may encounter incorrect
or unexpected results on the text clipping page that shows the candidate
portions of the document to retain. This can happen because "text"
clipping identifies the portions of the document to retain by operating
on the content of the document at the byte level without interpreting
or imposing any structure upon the document content. For this reason, extreme
care must be taken when choosing the start and end strings. A
detailed explanation of the limitations and dangers of text clipping
follows.
Text clipping uses the following process:
- The start string, end string, and content are converted to their
UTF-8 byte representations.
- The byte representation of the content is searched for literal
occurrences of the start string byte sequence, followed by the end
string byte sequence.
- For each occurrence, all bytes between the start and end sequences
are extracted and converted back into a UTF-8 string.
This mechanism provides the Web Clipping portlet author with
a non-structural approach to document clipping. However, the original
document content is always HTML. HTML is inherently a structured content
type, and the structure is defined by special byte sequences (tags)
that have a particular meaning. Extracting arbitrary slices of that
sequence without regard for the special byte sequences is dangerous
for the following reasons:
- Any given byte sequence may begin or end within the middle of
one of the special byte sequences that define document structure.
- The HTML document structure is hierarchical; that is, certain
document structures depend on parent structures to define their interpretation.
Extracting arbitrary sequences is a lossy process that may cause
a loss of meaning by taking a child structure out of context from
its parent structure.
The resulting "clipped" content may alter the semantics of
the HTML document and is highly likely to cause unexpected
or incorrect results when the document is rendered by a user-agent.
Consider the following example. Suppose an HTML document contains
the following content:
<A HREF="http://www.ibm.com">Go to IBM</A>
<A HREF="http://www.lotus.com">Go To Lotus</A>
If text clipping is used to clip this document with the start string
"ibm" and the end string "lotus", retaining the start and end strings,
the following sequence would be clipped:
ibm.com">Go to IBM</A>
<A HREF="http://www.lotus
In this example, the start of the first anchor tag (the "<A" and
the beginning of its HREF attribute) has been lost. The effect
of this differs depending on the user-agent that receives
the content. In all likelihood, the "</A>" will
be thrown away and the text before it will be interpreted as text
(not structural markup), in which case the ">" before
the word "Go" is invalid character data because it is not escaped.
The following "<A HREF..." may or
may not be closed automatically, but it will almost certainly cause
problems in any user-agent.
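The text clipping process described earlier can be sketched in Python as follows. This is an illustrative approximation, not the portlet's actual implementation: the function name clip_text, the keep_delimiters parameter, and the choice to resume searching after each end string are all assumptions made for the sketch.

```python
from typing import List


def clip_text(content: str, start: str, end: str,
              keep_delimiters: bool = True) -> List[str]:
    """Return every slice of content found between start and end.

    Hypothetical helper: it works on raw UTF-8 bytes and knows nothing
    about HTML tags, which is exactly why the result can be malformed.
    """
    if not start or not end:
        raise ValueError("start and end strings must be non-empty")

    # Step 1: convert everything to UTF-8 bytes.
    data = content.encode("utf-8")
    start_bytes = start.encode("utf-8")
    end_bytes = end.encode("utf-8")

    clips = []
    position = 0
    while True:
        # Step 2: find a literal start sequence followed by an end sequence.
        start_index = data.find(start_bytes, position)
        if start_index == -1:
            break
        end_index = data.find(end_bytes, start_index + len(start_bytes))
        if end_index == -1:
            break

        # Step 3: extract the bytes and convert them back to a string.
        if keep_delimiters:
            chunk = data[start_index:end_index + len(end_bytes)]
        else:
            chunk = data[start_index + len(start_bytes):end_index]
        clips.append(chunk.decode("utf-8"))

        # Assumption: the search resumes after the end string just matched.
        position = end_index + len(end_bytes)
    return clips


html = ('<A HREF="http://www.ibm.com">Go to IBM</A>\n'
        '<A HREF="http://www.lotus.com">Go To Lotus</A>')
print(clip_text(html, "ibm", "lotus")[0])
```

Run against the two anchors from the example, the sketch prints the same broken fragment shown above, beginning in the middle of the first HREF value and ending in the middle of the second.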
More on text clipping limitations and restrictions
The Text clipping option enables you
to select the content between specific text strings that are in the
HTML document. Content between these strings is kept, and all other
content is discarded. However, as with all clipping types (including Keep
All Content), before the content you intend to clip is pulled
for editing, the HTML and BODY tags in the original HTML document
are removed, and the HEAD tag and the entire contents of the HEAD
section are removed to prepare the document for display within a portal.
The implication of this concerning text clipping is that the HTML,
BODY, and HEAD tags, along with the entire contents of the HEAD section,
will not be available for use within the starting or ending text strings
used to perform text clipping. For example, specifying a starting
text string of <HTML> and an ending text string of </HTML>
will yield no matching pieces of text. However, the required end result
can be easily achieved using either the HTML clipping option
or the Keep All Content option.
Tip: If you would like to clip an entire page, use either the HTML
clipping option or the Keep All Content option. To specify clipping
options, click Advanced options, then click Modify clipping type.
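The effect of this preparation step on text clipping can be illustrated with a short Python sketch. The function name prepare_for_portal and the regular expressions are assumptions chosen for illustration; they only approximate the described behavior and ignore any other processing the product performs.

```python
import re


def prepare_for_portal(html: str) -> str:
    """Hypothetical approximation of the preparation step described above."""
    # Remove the HEAD tag together with the entire contents of the HEAD section.
    html = re.sub(r"<head\b[^>]*>.*?</head\s*>", "", html,
                  flags=re.IGNORECASE | re.DOTALL)
    # Remove the HTML and BODY tags themselves but keep what they enclose.
    html = re.sub(r"</?(?:html|body)\b[^>]*>", "", html, flags=re.IGNORECASE)
    return html


page = "<HTML><HEAD><TITLE>Demo</TITLE></HEAD><BODY><P>Hello</P></BODY></HTML>"
prepared = prepare_for_portal(page)
print(prepared)
# A start or end string such as </HTML> can no longer match anything.
print("</HTML>" in prepared)
```

Because the search strings are matched against the prepared content, any start or end string that relies on the removed tags finds nothing.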
Portlets created during installation of the Web Clipping WAR file
When the Web Clipping WAR file
is installed, two associated portlets appear in the list of available
portlets: Web Portlet HTML Template and Web Clipping Editor. Only
the Web Clipping Editor portlet can be added to a page.
- Web Portlet HTML Template is used as a template for new
portlets created by the Web Clipping Editor and cannot be added
to a page. Adding the Web Portlet HTML Template to a page will result
in an error.
- Web Clipping Editor is the GUI for creating and editing
Web clipping portlets and can be successfully added to a page.
Clipping sites that contain JavaScript
The use of JavaScript within Web-based content is widespread.
JavaScript is used for two primary reasons:
- JavaScript can be used within Web-based content to make the page
interactive. This is done through the use of a simple event response
paradigm that is well known among user interface developers.
- JavaScript can be used to generate dynamic content, that is, to
generate content "on the fly". This procedure is executed client-side,
that is, within the user-agent. It is executed after the content
has been retrieved from the server and returned to the user-agent
but before the document is rendered. This can be very useful
in generating content that is dependent on the user-agent or client
environment. You might say that HTML alone is "environmentally challenged"
and JavaScript provides one solution to this problem.
Ideally, a given Web page would act within a Web clipping
portlet just like it does in a stand-alone browser. In this respect,
Web clipping is a sort of "Portlet Web Browser" and can be considered
a unique user-agent that has unique restrictions and characteristics
with respect to display and interaction mechanisms, especially with
JavaScript.
In versions of WebSphere Portal prior to version
5.0, Web clipping portlets did not have any special functionality
to deal with JavaScript. In Version 5.0 of the portal, functionality
has been added to help enhance Web clipping portlets containing JavaScript
as follows:
- All JavaScript on the source Web page is retained. You
can have the JavaScript removed by using the Remove all JavaScript security
option. However, this option might not remove all of the JavaScript
that has been included in the page. It removes only implicit event
handler attributes and script nodes. It also resets javascript: URLs that
are present in nodes as src, action, or href attributes
to empty quotation marks ( "" ).
- JavaScript within the HEAD of a document will be relocated to
the BODY of the document prior to any other children of the BODY.
Important: No other modifications to JavaScript will be made automatically.
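The three transformations attributed to the Remove all JavaScript security option can be approximated with the Python standard library. This is a rough sketch, not the product's implementation: the class and function names are hypothetical, treating every attribute whose name starts with "on" as an event handler is a heuristic, and the simple re-serialization drops comments and DOCTYPE declarations. It does not model the relocation of HEAD scripts into the BODY.

```python
from html import escape
from html.parser import HTMLParser

URL_ATTRIBUTES = {"src", "action", "href"}


class JavaScriptStripper(HTMLParser):
    """Re-serializes HTML while dropping the JavaScript described above."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.output = []
        self.script_depth = 0  # greater than 0 while inside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_depth += 1  # drop script nodes entirely
            return
        if self.script_depth:
            return
        parts = [tag]
        for name, value in attrs:
            if name.startswith("on"):
                continue  # heuristic: onload, onclick, and similar handlers
            if value is None:
                parts.append(name)
                continue
            if (name in URL_ATTRIBUTES
                    and value.strip().lower().startswith("javascript:")):
                value = ""  # reset javascript: URLs to empty quotation marks
            parts.append(f'{name}="{escape(value, quote=True)}"')
        self.output.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        if tag == "script":
            self.script_depth = max(0, self.script_depth - 1)
        elif not self.script_depth:
            self.output.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.script_depth:
            self.output.append(data)

    def handle_entityref(self, name):
        if not self.script_depth:
            self.output.append(f"&{name};")

    def handle_charref(self, name):
        if not self.script_depth:
            self.output.append(f"&#{name};")


def strip_javascript(html_text: str) -> str:
    parser = JavaScriptStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.output)


print(strip_javascript(
    '<p onclick="x()">Hi<script>alert(1)</script>'
    '<a href="javascript:go()">L</a></p>'
))
```

Anything outside these three cases, such as script text generated at run time, passes through untouched, which is consistent with the caveat that the option might not remove all JavaScript.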
These enhancements provide support for a large number of pages
containing JavaScript. However, some pages might still not
function properly. In particular, the following restrictions
still apply:
Runtime restrictions:
- JavaScript that uses relative URLs will break because URLs within
JavaScript (relative or not) are not modified during URL rewriting.
- JavaScript that depends on a specific hierarchy of a page structure
using the DHTML models provided by various browsers may act unexpectedly
depending on the situation.
- JavaScript that depends on specific browser functionality may
not work within other browsers (for example, Netscape 6 functionality
vs. Internet Explorer 6 functionality); it may act unexpectedly or not
run at all.
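The relative URL restriction can be illustrated with Python's standard URL resolution. The sketch assumes that a relative URL left untouched inside a script is resolved by the browser against the address of the portal page rather than the original site; the source page address and the script path are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page that was clipped.
ORIGINAL_PAGE = "http://www.acme.com/products/index.html"
# Portal address in the format used elsewhere in this document.
PORTAL_PAGE = "http://www.ibm.com:10039/wps/portal"
# Hypothetical relative URL used inside a script on the clipped page.
RELATIVE_URL = "scripts/menu.js"

intended = urljoin(ORIGINAL_PAGE, RELATIVE_URL)
actual = urljoin(PORTAL_PAGE, RELATIVE_URL)

print(intended)  # http://www.acme.com/products/scripts/menu.js
print(actual)    # http://www.ibm.com:10039/wps/scripts/menu.js
```

The second address points at the portal server, where the script resource most likely does not exist.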
HTML Clipping restrictions:
- All <SCRIPT> blocks defined by the HTML <SCRIPT> element
and JavaScript within the <HEAD> element of the HTML document
being clipped will be removed.
- All JavaScript, including all event handlers and embedded JavaScript,
is removed prior to displaying the HTML page being clipped. This
means that for those scripts that generate content, the content will
not be displayed in the HTML clipping editor and therefore cannot
be clipped.
- For the same reason, <SCRIPT> blocks that are located within
the <BODY> element of the HTML document cannot be individually
retained. They may be retained implicitly if an element that contains
the <SCRIPT> element within the document hierarchy is selected
to be retained. For example, if the <BODY> directly contains
a <SCRIPT> element child that generates some content, and the <BODY>
element is not selected to be retained, the <SCRIPT> will be lost. However,
if a <TD> element within the document contains a <SCRIPT>
element that generates content for the <TD>, and the <TD>
element is selected to be retained, the <SCRIPT> is retained
as well (unless the Remove all JavaScript security option is in effect).
- JavaScript within HTML implicit event handlers (such as onLoad,
onMouseOver, and onKeyDown) will only be retained if the element which
defines the attribute is retained.
- JavaScript embedded within HREF attributes (using the javascript:
prefix or &{...} syntax) will only be retained if the element
which defines the attribute is retained.
Tip: In general, avoid using the HTML Clipping type with pages that
contain JavaScript. Instead, use the Keep All Content clipping type
to integrate these types of pages.
Double-byte character set limitation
If a Web page you are trying to clip does not declare a charset or declares
a charset that is not supported by the Web Clipping Editor, the
Web Clipping Editor defaults to the ISO-8859 charset. In this case,
double-byte character set characters may not be displayed correctly.
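The following Python sketch shows why such a fallback garbles double-byte text. The function name decode_page, the set of supported charsets, and the use of ISO-8859-1 as the concrete fallback are assumptions; the document only says "ISO-8859".

```python
from typing import Optional, Set

FALLBACK_CHARSET = "iso-8859-1"  # assumption: the text only names "ISO-8859"


def decode_page(raw: bytes, declared: Optional[str],
                supported: Set[str]) -> str:
    """Hypothetical decoder that falls back when the charset is unusable."""
    charset = declared.lower() if declared else None
    if charset is None or charset not in supported:
        charset = FALLBACK_CHARSET
    return raw.decode(charset, errors="replace")


supported = {"utf-8", "shift_jis"}  # hypothetical list of supported charsets
text = "日本語"
raw = text.encode("shift_jis")

print(decode_page(raw, "Shift_JIS", supported) == text)  # charset declared
print(decode_page(raw, None, supported) == text)         # charset missing
```

With the charset declared, the text round-trips. Without it, every byte of each double-byte character is decoded as a separate single-byte character, so the result no longer matches the original text.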
HTML clipping limitations and restrictions
You might notice that at times it is difficult, if not impossible, to
clip some Web pages using HTML clipping. In fact, for various technical
reasons, there are certain elements within Web content that cannot
be clipped using HTML clipping. This section explores some of the
well-known limitations and restrictions of HTML clipping.
No content appearing in the Web clipping portlet
Due to the limitations of the HTML parser used by
the Web Clipping Editor, certain pages with excessively malformed
HTML cannot be fixed for proper display. In such cases, unexpected
results may occur or no content may appear within the Web clipping
portlet.
<FRAME> elements
The HTML FRAME support consists of the following:
- Enablement of all existing HTML-based Web clipping portlets to
navigate to pages containing FRAME or FRAMESET elements. That is,
if you have any existing Web clipping portlets and somewhere within
the content of those portlets is a link to a page that contains FRAMESET
elements, the link can now be traversed and the content displayed
and navigated.
- Creation of new HTML-based Web clipping portlets against pages
that contain FRAMESET or FRAME elements using "Keep All Content" mode
only.
- Both inter-FRAME and intra-FRAME navigation are supported on pages
with FRAMEs, just as in a desktop browser.
Note the following restrictions concerning HTML FRAME support:
- FRAME tags that include the onload and onunload attributes
for executing JavaScript functions are not preserved when converting
the FRAME to a table cell. There is no support for those attributes
on table cell (<TD>) elements.
- The "Keep All Content" mode is required during creation of
new Web clipping portlets directly referencing pages containing FRAMEs.
Web clipping portlets can be created from content containing FRAMEs
only if the "Keep All Content" mode is used as opposed to the "HTML
Clipping" mode. You can continue to create Web clipping portlets that
indirectly reference pages containing FRAMEs or FRAMESETs through
a link; however, you cannot clip those pages as you can clip
referenced pages that do not contain FRAMEs. As a result of this
restriction, FRAME navigation is not supported in the editor (on
the Finish page or HTML clipping page).
- Links in the created portlets cannot be followed if the linked pages
contain embedded FRAMEs. For new portlets that contain FRAMEs
and new or existing portlets that indirectly reference pages with
FRAMEs, the links in that portlet can be navigated as usual. However,
indeterminate results will occur if the link targets also contain FRAMEs,
that is, if they reference a page that contains new FRAMESET or FRAME
elements.
- FRAME support is not provided for non-HTML user agents.
Currently, the successful presentation and navigation of pages containing
FRAMEs will work only for portlets that are targeted to be used with
HTML-based user agents that support HTML conforming to the HTML 3.2
or a later specification. Web clipping portlets that encounter pages
with FRAMESET or FRAME elements cannot be viewed from mobile devices
or non-HTML devices.
iFrames
The Web Clipping portlet
was modified to be able to perform the equivalent function of an older
WebPage/IFRAME portlet. As with the older portlets, the following
set of restrictions apply when the Web Clipping portlet is configured
to display content embedded in an iFrame and is configured to allow
the browser/user-agent to access referenced resources directly.
- The portlet will not authenticate when viewed from Internet Explorer
with the MS04-004 Cumulative Security Update applied. The MS04-004
update disables the ability to use URLs of the format http://username:password@www.acme.com/login.jsp.
This format was disabled for security purposes, as the user ID and
password appear in cleartext within the URL and can be easily compromised.
- The portal server and the server hosting the site for which authentication
is required must be within the same top-level domain (for example,
acme.com). Due to security restrictions, browsers/user-agents will
not accept cookies for server a.b.c if they are sent by server x.y.z
or any other server outside the domain b.c. Doing so would allow potential
spoof attacks to gain access to authenticated content on server a.b.c
without its explicit consent. For this reason, the portal server on
which the portlet resides and the server(s) hosting the site which
requires authentication must be within the same top-level domain.
- User-agents must access the portlet using a fully qualified domain
name when the portlet uses FORM-based authentication. When end-users access
the portlet from a browser, they must use the fully qualified domain
name in the address bar, for example, http://www.ibm.com:10039/wps/portal.
Access without the fully qualified domain name will not work because
browsers restrict cookies from domains other than that of the server
from which the response originates.
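The domain-related restrictions above can be expressed as a simple pre-check. This Python sketch is only a rough model: the function names are hypothetical, comparing the last two labels of each host name is an assumption, and real browsers apply more elaborate cookie rules.

```python
def parent_domain(host: str, labels: int = 2) -> str:
    """Return the last `labels` labels of a host name, for example acme.com."""
    return ".".join(host.lower().rstrip(".").split(".")[-labels:])


def is_fully_qualified(host: str) -> bool:
    """Treat a host name as fully qualified if it has more than one label."""
    return "." in host.strip(".")


def can_share_cookies(portal_host: str, site_host: str) -> bool:
    """Hypothetical check: both hosts qualified and in the same parent domain."""
    if not (is_fully_qualified(portal_host) and is_fully_qualified(site_host)):
        return False
    return parent_domain(portal_host) == parent_domain(site_host)


# Hypothetical host names under the acme.com example domain.
print(can_share_cookies("portal.acme.com", "apps.acme.com"))    # True
print(can_share_cookies("portal.acme.com", "www.example.org"))  # False
print(can_share_cookies("portal", "apps.acme.com"))             # False
```

A check like this could help diagnose a failed login before looking at the portlet configuration itself.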
Element selection issues
As you work with HTML clipping, you might have difficulties selecting
the page elements you want to clip. This is most noticeable, for example,
when you want to clip an entire table, but that table does not have
a border or any other visual elements that you can use to select it.
Instead, you are forced to select all the columns within the table
individually and end up with the correct data but the wrong format
(not grouped together in the original table).
The only workaround is trial and error: clicking or selecting different
areas of the rendered output to clip and then previewing the contents
to see if you achieved the required result. This process is made easier
by using the preview function that lets you view the results of each
selection attempt without having to go through the process of adding
the Web clipping portlet to a page and examining its contents.
The HTML clipping tool allows you to select one element and then toggle
elements contained within that element, making it appear as though
you can keep all the content of a selected item, except for one or
two of the elements that it contains. As useful as this would be, you
cannot do it using the selection method described previously. Instead,
if you want to keep the entire contents of a single element except
for a few of its sub-elements, you have to individually select just
the sub-elements you want to keep. For example, if you want to keep
all the contents of a <TABLE> except for the contents of one <TD>
element, you cannot select the <TABLE> element and then select
the <TD> that you do not want. Instead, you must select all
the <TD>s that you do want. Unfortunately, as mentioned before,
because you are forced to select the <TD> elements individually,
the data will not be grouped as it was in the original table.