前言
HttpClient 是Apache Jakarta Common 下的子项目,可以用来提供高效的、最新的、功能丰富的支持 HTTP 协议的客户端编程工具包,并且它支持 HTTP 协议最新的版本和建议。
以下列出的是 HttpClient 提供的主要的功能,要知道更多详细的功能可以参见 HttpClient 的官网:
(1)实现了所有 HTTP 的方法(GET,POST,PUT,HEAD 等)
(2)支持自动转向
(3)支持 HTTPS 协议
(4)支持代理服务器等
HttpClient抓取页面HTML
步骤:创建maven项目,引入依赖,编写一个Java类使用HttpClient实现抓取
依赖引入
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.8</version>
</dependency>
java类实现
我创建的java类名为 HttpClientTest
1.生成httpclient,相当于该打开一个浏览器
2.创建get请求,相当于在浏览器地址栏输入 网址
3.设置header模拟浏览器,
4.可以使用代理ip,因为本例是测试,没有频繁的请求所以就没用
5.执行get请求
6.获取响应内容
7.关闭
package httpclient_learn;
import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.HttpClientUtils;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
public class HttpClientTest {
public static void main(String[] args) {
//1.生成httpclient,相当于该打开一个浏览器
CloseableHttpClient httpClient = HttpClients.createDefault();
CloseableHttpResponse response = null;
//2.创建get请求,相当于在浏览器地址栏输入 网址
HttpGet request = new HttpGet("https://www.tuicool.com/");
request.setHeader("User-Agent","Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36");
// HttpHost proxy = new HttpHost("117.157.197.18", 3128);
// RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
// request.setConfig(config);
try {
//3.执行get请求,相当于在输入地址栏后敲回车键
response = httpClient.execute(request);
//4.判断响应状态为200,进行处理
if(response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
//5.获取响应内容
HttpEntity httpEntity = response.getEntity();
String filename = "E:\\Desktop\\test.html";
String html = EntityUtils.toString(httpEntity, "utf-8");
Files.write(Paths.get(filename),html.getBytes(StandardCharsets.UTF_8));
} else {
//如果返回状态不是200,比如404(页面不存在)等,根据情况做处理,这里略
System.out.println("返回状态不是200");
System.out.println(EntityUtils.toString(response.getEntity(), "utf-8"));
}
} catch (ClientProtocolException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
//6.关闭
HttpClientUtils.closeQuietly(response);
HttpClientUtils.closeQuietly(httpClient);
}
}
}
获取的html数据
jsoup 处理获取的html数据
获取所有连接地址和名称
package httpclient_learn;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.File;
import java.io.IOException;
public class HandleData {
public static void main(String[] args) throws IOException {
File input = new File("E:\\Desktop\\test.html");
Document doc = Jsoup.parse(input, "UTF-8");
Element content = doc.getElementById("header");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
String linkHref = link.attr("href");
String linkText = link.text();
System.out.println(linkHref+ ":" +linkText);
}
}
}