Java Tools for Web Page Retrieval and Text Extraction
The Jakarta Commons HttpClient library can be used to retrieve web pages efficiently. Listing One shows code that gets the content of a web page.

Java has built-in support for parsing HTML: the HTMLEditorKit.ParserCallback class in javax.swing.text.html receives the text of a document as it is parsed, which makes text extraction straightforward. The Jericho HTML Parser (jerichohtml.sourceforge.net/doc/index.html) is another option. Regular expression parsing can be done with Java's java.util.regex.Pattern and Matcher classes. If full Perl 5 syntax support is required, the Jakarta ORO classes can be used.

Retrieving multiple pages concurrently is greatly simplified by the java.util.concurrent package and HttpClient's MultiThreadedHttpConnectionManager class. Java 6 introduced additional features that help with the rest of the implementation: results are stored using Java 6's built-in JAXB support, and the TableRowSorter and RowFilter classes are used to sort and filter the table of results.
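To make the built-in HTML parsing concrete, here is a minimal sketch of text extraction with HTMLEditorKit.ParserCallback. The class name HtmlTextExtractor and the method extractText are hypothetical names chosen for illustration, and joining text runs with a single space is an assumption rather than anything the application is known to do.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the character data the Swing HTML parser reports.
class HtmlTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    // Called by the parser for each run of text found between tags.
    @Override
    public void handleText(char[] data, int pos) {
        if (text.length() > 0) {
            text.append(' '); // assumption: separate runs with one space
        }
        text.append(data);
    }

    // Parses the HTML string and returns the text the parser reported.
    static String extractText(String html) {
        HtmlTextExtractor callback = new HtmlTextExtractor();
        try (Reader reader = new StringReader(html)) {
            // true: ignore charset declarations, since the input is already decoded
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return callback.text.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The callback can also override handleStartTag and handleEndTag if the extraction needs to know which element a piece of text came from.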
Last of all, the Substance Look and Feel (https://substance.dev.java.net) is used to give the application a polished appearance.
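Returning to the regular-expression option mentioned earlier, the following sketch shows the Pattern and Matcher classes pulling values out of raw HTML. Extracting href attributes is only an assumed use case, and the class name LinkExtractor and the pattern itself are illustrative; a pattern this simple will miss unquoted attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds quoted href values with java.util.regex.
class LinkExtractor {
    // Group 1 is the opening quote, group 2 the URL; \1 requires the same closing quote.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) { // advance to each successive match
            links.add(matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a> <A HREF='/b'>B</A>";
        System.out.println(extractLinks(html)); // prints [http://example.com/a, /b]
    }
}
```

Compiling the pattern once into a static field avoids recompiling it for every page, which matters when many pages are processed.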
Listing One

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

// This method belongs inside a class that defines a "logger" field.
public static String getContent(HttpClient httpClient, String url) {
    logger.info("getting content from url: " + url);
    GetMethod get = new GetMethod(url);
    get.setFollowRedirects(true);
    // impersonate browser
    get.setRequestHeader("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)");
    get.setRequestHeader("Accept", "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/xaml+xml, application/vnd.ms-xpsdocument, application/x-ms-xbap, application/x-msapplication, */*");
    get.setRequestHeader("Accept-Language", "en-us");
    get.setRequestHeader("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
    try {
        int statusCode = httpClient.executeMethod(get);
        logger.info("http status code: " + statusCode);
        InputStream is = get.getResponseBodyAsStream();
        if (is == null) {
            return null; // the response had no body
        }
        // copy the response body into memory
        ByteArrayOutputStream baos = new ByteArrayOutputStream(2048);
        byte[] buff = new byte[2048];
        int bytesRead = 0;
        while ((bytesRead = is.read(buff)) != -1) {
            baos.write(buff, 0, bytesRead);
        }
        is.close();
        // decode with the charset from the Content-Type header, not the platform default
        String text = baos.toString(get.getResponseCharSet());
        logger.debug("raw html=" + text);
        return text;
    } catch (Exception ex) {
        logger.error("failed to get content from url: " + url, ex);
    } finally {
        // always return the connection to the connection manager
        get.releaseConnection();
    }
    return null;
}
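The concurrent retrieval described earlier can be sketched with java.util.concurrent alone. This is a minimal illustration, not the application's actual code: the class ConcurrentFetcher, the method fetchAll, and the fetcher parameter are hypothetical names, and the fetcher is a stand-in so the example runs without a network. In the real application the fetcher would presumably call a method like getContent from Listing One, with an HttpClient built on MultiThreadedHttpConnectionManager shared across the threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs one fetch task per URL on a fixed-size thread pool.
class ConcurrentFetcher {
    static Map<String, String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetcher.apply(url));
            }
            // invokeAll waits for all tasks and returns futures in task order
            List<Future<String>> futures = pool.invokeAll(tasks);
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                results.put(urls.get(i), futures.get(i).get());
            }
            return results;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException("interrupted while fetching", ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException("a fetch task failed", ex.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        // Stand-in fetcher: returns a fake page instead of making an HTTP request.
        Map<String, String> pages = fetchAll(urls, url -> "<html>" + url + "</html>", 2);
        System.out.println(pages);
    }
}
```

A fixed-size pool bounds the number of simultaneous requests, which keeps the load on the target site predictable.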