Tarcísio Fischer: February 2010

POST 1:
I'm not going to reinvent the wheel, so I searched the internet for existing web crawlers.
The site below has a list of some open-source crawlers.

http://java-source.net/open-source/crawlers

---

POST 2:
Well, it looks like I won't have to reinvent the wheel... I'll actually have to INVENT it!
I couldn't find any crawler that is just a library I can build on top of :(
But, following the recommendations of this guy http://lembra.wordpress.com/tag/crawler/
I can use this thing called httpunit (http://sourceforge.net/projects/httpunit/files/httpunit/1.7/httpunit-1.7.zip/download)
to grab the information from the sites... Let's see how it works...

---

POST 3:
At first, HttpUnit turned out to be really good! In just a few lines of code it was able to tell me
how many links a given page has.

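import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebLink;
import com.meterware.httpunit.WebRequest;
import com.meterware.httpunit.WebResponse;

public class CountLinks {
    public static void main(String[] args) {
        try {
            // create the conversation object which will maintain state for us
            WebConversation wc = new WebConversation();

            // obtain the main page of the meterware web site
            String url = "http://www.meterware.com";
            WebRequest request = new GetMethodWebRequest(url);
            WebResponse response = wc.getResponse(request);

            // find the first link whose text contains "HttpUnit" and follow it
            WebLink httpunitLink = response.getFirstMatchingLink(WebLink.MATCH_CONTAINED_TEXT, "HttpUnit");
            if (httpunitLink == null) {
                System.err.println("No link containing 'HttpUnit' found on " + url);
                return;
            }
            response = httpunitLink.click();

            // print out the number of links on the HttpUnit main page
            System.out.println("The HttpUnit main page '" + url + "' contains " + response.getLinks().length + " links");
        } catch (Exception e) {
            System.err.println("Exception: " + e);
        }
    }
}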

Although it did take almost 3 seconds to do that... But that's fine...
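To see where those 3 seconds go, a small timing wrapper could help. This is only a sketch: Stopwatch and timed are hypothetical names (not part of HttpUnit), the sleep is a stand-in for the real request, and the lambda syntax assumes Java 8 or later.

```java
import java.util.concurrent.Callable;

final class Stopwatch {
    /** Runs the task, prints how long it took, and returns the task's result. */
    static <T> T timed(String label, Callable<T> task) throws Exception {
        long start = System.nanoTime();
        try {
            return task.call();
        } finally {
            // report elapsed wall-clock time even if the task threw
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(label + " took " + elapsedMs + " ms");
        }
    }

    public static void main(String[] args) throws Exception {
        // stand-in task; in the code above this would be wc.getResponse(request)
        String result = timed("dummy task", () -> {
            Thread.sleep(50);
            return "done";
        });
        System.out.println(result);
    }
}
```

In the example above, the request could be wrapped as Stopwatch.timed("GET " + url, () -> wc.getResponse(request)) to check whether the time goes into the download itself or into the link click.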

---

POST 4:
Well, it looks like things aren't going my way... I tried to start a crawler for the Terra site. Either I'm not doing it right or HttpUnit
isn't that good after all. It throws an error when I try to run this code:

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebRequest;
import com.meterware.httpunit.WebResponse;

public class TerraLinkCount {
    public static void main(String[] args) {
        try {
            // create the conversation object which will maintain state for us
            WebConversation wc = new WebConversation();

            // obtain the Terra home page
            String url = "http://www.terra.com.br";
            WebRequest request = new GetMethodWebRequest(url);
            WebResponse response = wc.getResponse(request);

            // print how many links the page contains
            System.out.println(response.getLinks().length);
        } catch (Exception e) {
            System.err.println("Exception: " + e);
        }
    }
}

---

POST 5:
Well, I think I found the explanation here: http://httpunit.sourceforge.net/doc/faq.html#norhino
HttpUnit doesn't come with this library (Rhino) in the package, so I think I'll have to download it separately.
Rhino download link: http://www.mozilla.org/rhino/download.html
Now it's just a matter of putting the js.jar file on the classpath and testing...
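Before running the crawler again, it may be worth checking whether js.jar really ended up on the classpath. The sketch below is an assumption-laden helper, not part of HttpUnit: RhinoCheck and isOnClasspath are hypothetical names, and it relies on org.mozilla.javascript.Context being Rhino's entry-point class.

```java
final class RhinoCheck {
    // Rhino's entry-point class (assumed to be what js.jar provides)
    static final String RHINO_CONTEXT = "org.mozilla.javascript.Context";

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, RhinoCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Rhino on classpath: " + isOnClasspath(RHINO_CONTEXT));
        // printing the effective classpath helps spot a missing or misspelled jar path
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

If this prints false, the jar is not being picked up, and any remaining error is a classpath problem rather than something in the crawler code.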

---

POST 6:
It didn't work.
But this link http://httpunit.sourceforge.net/doc/faq.html#disable%20scripting shed some light:

HttpUnitOptions.setScriptingEnabled( false );

and now the code works correctly:

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.HttpUnitOptions;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebLink;
import com.meterware.httpunit.WebRequest;
import com.meterware.httpunit.WebResponse;

public class TerraLinks {
    public static void main(String[] args) {
        try {
            // skip JavaScript processing, so pages with scripts can be loaded
            HttpUnitOptions.setScriptingEnabled(false);
            // create the conversation object which will maintain state for us
            WebConversation wc = new WebConversation();

            // obtain the Terra home page
            String url = "http://www.terra.com.br";
            WebRequest request = new GetMethodWebRequest(url);
            WebResponse response = wc.getResponse(request);

            // fetch the link array once, then print the URL of every link
            WebLink[] links = response.getLinks();
            for (WebLink link : links) {
                System.out.println(link.getURLString());
            }
        } catch (Exception e) {
            System.err.println("Exception: " + e);
        }
    }
}

It prints every link on the Terra home page :D

---

POST 7:
OK, now that I know this is possible, I want to grab some real data.
Let's go to the iMasters site. I want to get all the articles on the site related to JAVA.
First, let's grab any one article:

I noticed that the article content lives in the div with id strConteudo and the title is in the title tag itself. So:

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.HTMLElement;
import com.meterware.httpunit.HttpUnitOptions;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebRequest;
import com.meterware.httpunit.WebResponse;

public class IMastersArticle {
    public static void main(String[] args) {
        try {
            // skip JavaScript processing, so pages with scripts can be loaded
            HttpUnitOptions.setScriptingEnabled(false);
            // create the conversation object which will maintain state for us
            WebConversation wc = new WebConversation();

            // obtain one article page
            String url = "http://imasters.uol.com.br/artigo/15674/java/jpa_com_jboss_tools_no_eclipse/";
            WebRequest request = new GetMethodWebRequest(url);
            WebResponse response = wc.getResponse(request);

            // the article title is the page title
            System.out.println(response.getTitle());

            // the article body is the div whose id is "strConteudo"
            HTMLElement[] divs = response.getElementsByTagName("div");
            for (HTMLElement div : divs) {
                // constant first, so a div without an id cannot cause a NullPointerException
                if ("strConteudo".equals(div.getID())) {
                    System.out.println(div.getText());
                }
            }
        } catch (Exception e) {
            System.err.println("Exception: " + e);
        }
    }
}

It worked correctly :D
The next step: starting from the front page (http://imasters.uol.com.br), it has to give me a list of links.
After getting the list of links, it goes link by link looking for the word JAVA in the LINK, in the TITLE and in the BODY of the article.
If it finds it, I want the article returned; otherwise, not :)
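The filtering part of that plan can be sketched without touching the network. Everything here is an assumption for illustration: JavaArticleFilter and Article are hypothetical names, the match is case-insensitive, and an article counts as a hit if the keyword shows up in any one of the three fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class JavaArticleFilter {
    /** Hypothetical holder for what the crawler would collect for each page. */
    static final class Article {
        final String link;
        final String title;
        final String body;

        Article(String link, String title, String body) {
            this.link = link;
            this.title = title;
            this.body = body;
        }
    }

    /** Case-insensitive substring check that tolerates missing (null) text. */
    static boolean containsKeyword(String text, String keyword) {
        return text != null
                && text.toLowerCase(Locale.ROOT).contains(keyword.toLowerCase(Locale.ROOT));
    }

    /** True if the keyword appears in the link, the title, or the body. */
    static boolean matches(Article article, String keyword) {
        return containsKeyword(article.link, keyword)
                || containsKeyword(article.title, keyword)
                || containsKeyword(article.body, keyword);
    }

    /** Returns the matching articles, in their original order. */
    static List<Article> filter(List<Article> articles, String keyword) {
        List<Article> hits = new ArrayList<>();
        for (Article article : articles) {
            if (matches(article, keyword)) {
                hits.add(article);
            }
        }
        return hits;
    }
}
```

A call such as JavaArticleFilter.filter(articles, "JAVA") would then return only the hits. Note that a plain substring match also catches "JavaScript", so a stricter word-boundary check may be needed later.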


JAVA | WEB-CRAWLERS DAY 1