Channel: The Finer Stuff » development

Parse HTML with Java and the jsoup Library

Sometimes one of your applications has to embed HTML content, but as we all know, displaying HTML without escaping it first can be dangerous.
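To make the "escape everything" baseline concrete, here is a minimal sketch using only the Java standard library. The class name HtmlEscaper and the method escapeHtml are hypothetical names chosen for this illustration; they are not part of jsoup or of the original post. It replaces the five characters that matter in HTML text and attribute values with entities:

```java
// Hypothetical helper for illustration only; not part of jsoup.
final class HtmlEscaper {
    private HtmlEscaper() {}

    // Replaces &, <, >, " and ' with HTML entities so the input renders as plain text.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String untrusted = "<script>alert('x')</script>";
        // prints: &lt;script&gt;alert(&#39;x&#39;)&lt;/script&gt;
        System.out.println(escapeHtml(untrusted));
    }
}
```

Escaping like this is safe, but it throws away all formatting, which is exactly the trade-off the next paragraph is about.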

On the other hand, you might still want to allow a basic set of HTML tags to keep at least some of the formatting delivered with the original HTML content. In addition to that, think about fixing broken HTML tags: if tags are broken inside snippets you want to display within your application, they can break the whole layout!

Of course you could write all this functionality yourself from scratch, but IMHO there is a better solution…

After some research around the internet and on Stack Overflow, I stumbled upon jsoup, a Java library providing all the good stuff you need when handling HTML content in Java.

For example, I used jsoup to clean/escape HTML content for security reasons. The library ships with several predefined whitelists you can use out of the box (find the documentation here: http://jsoup.org/apidocs/org/jsoup/safety/Whitelist.html).

A simple example to apply one of the provided whitelists:

String cleanedHtml = Jsoup.clean(inputHtml, Whitelist.basicWithImages());
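As a rough sketch, the one-liner above can be wrapped in a small runnable program. It assumes jsoup is on the classpath in a version that still ships org.jsoup.safety.Whitelist (newer jsoup releases rename this class to Safelist). The class name BasicCleanDemo, the method cleanBasic, and the sample input are made up for this illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical demo class; assumes a jsoup version that still provides Whitelist
// (newer releases call the same class Safelist).
final class BasicCleanDemo {
    private BasicCleanDemo() {}

    // Cleans untrusted HTML with the predefined basicWithImages whitelist.
    static String cleanBasic(String inputHtml) {
        return Jsoup.clean(inputHtml, Whitelist.basicWithImages());
    }

    public static void main(String[] args) {
        // The <b> and <p> tags are never closed, and the <script> tag is not whitelisted.
        String inputHtml = "<p>Hello <b>world<script>alert(1)</script>";
        System.out.println(cleanBasic(inputHtml));
    }
}
```

Because jsoup parses the input into a document tree before serializing it again, the unclosed tags should come back properly closed, and the script element should be gone from the result.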

But what if you want to allow more than just the tags included in "basicWithImages"? No problem, just add the tags to the whitelist like this:

Whitelist whitelist = Whitelist.basicWithImages();
whitelist.addTags("h1", "h2", "h3", "h4", "h5", "h6", "div");
whitelist.addTags("table", "tbody", "td", "tfoot", "th", "thead", "tr");
String cleanedHtml = Jsoup.clean(inputHtml, whitelist);
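To see what the extra tags change, the following sketch cleans the same input once with the stock whitelist and once with the extended one. As before, it assumes a jsoup version that still provides the Whitelist class (Safelist in newer releases). ExtendedCleanDemo, extendedWhitelist, and the sample input are hypothetical names for this illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical demo class; assumes a jsoup version that still provides Whitelist
// (newer releases call the same class Safelist).
final class ExtendedCleanDemo {
    private ExtendedCleanDemo() {}

    // Builds a fresh whitelist: basicWithImages plus headings, div, and table tags.
    // addTags returns the whitelist itself, so the calls can be chained.
    static Whitelist extendedWhitelist() {
        return Whitelist.basicWithImages()
                .addTags("h1", "h2", "h3", "h4", "h5", "h6", "div")
                .addTags("table", "tbody", "td", "tfoot", "th", "thead", "tr");
    }

    public static void main(String[] args) {
        String inputHtml = "<div><h2>Title</h2><b>bold</b></div>";

        // With the stock whitelist, <div> and <h2> are stripped but their text is kept.
        System.out.println(Jsoup.clean(inputHtml, Whitelist.basicWithImages()));

        // With the extended whitelist, <div> and <h2> survive the cleaning.
        System.out.println(Jsoup.clean(inputHtml, extendedWhitelist()));
    }
}
```

Building the whitelist in a factory method gives each caller its own instance, which avoids accidentally sharing one mutable whitelist across unrelated parts of an application.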

This is just a small example of what you can do with jsoup. There are many more use cases for this library. To learn more about jsoup, visit the home page or have a look at the cookbook.

jsoup is open source under the MIT license; the source code can be found on GitHub.

