Java crawler refused by the server with a 403 error: study notes

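This article covers ways to resolve the 403 Forbidden error a crawler hits when fetching web pages: modifying the HTTP request headers to mimic a browser, and lowering the request rate.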


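I used HttpClient to scrape the content of a fixed set of pages from one website. The method below follows the basic pattern HttpClient recommends for requesting page content, yet on the first run the server immediately responded with 403 Forbidden.

The code that fetches the response content from a given URL is as follows: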

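import java.io.IOException;
import java.util.Date;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public final static String getByString(String url) throws Exception {
    CloseableHttpClient httpclient = HttpClients.createDefault();

    try {
        // Plain GET request with no extra headers
        HttpGet httpget = new HttpGet(url);
        ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

            @Override
            public String handleResponse(
                    final HttpResponse response) throws ClientProtocolException, IOException {
                int status = response.getStatusLine().getStatusCode();
                if (status >= 200 && status < 300) {
                    // 2xx: return the body as a string (null if there is no entity)
                    HttpEntity entity = response.getEntity();
                    System.out.println(status);
                    return entity != null ? EntityUtils.toString(entity) : null;
                } else {
                    // Any other status: log the status and the time, then quit
                    System.out.println(status);
                    Date date = new Date();
                    System.out.println(date);
                    System.exit(0);
                    throw new ClientProtocolException("Unexpected response status: " + status);
                }
            }
        };
        String responseBody = httpclient.execute(httpget, responseHandler);
        return responseBody;
    } finally {
        httpclient.close();
    }
}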


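A browser can access the site while the method above cannot, so I tried adding header fields to httpget so that, from the server's point of view, the request looks more like a user visiting directly.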

public final static String getByString(String url) throws Exception {
    CloseableHttpClient httpclient = HttpClients.createDefault();

    try {
        HttpGet httpget = new HttpGet(url);
        // Headers that a browser would normally send
        httpget.addHeader("Accept", "text/html");
        httpget.addHeader("Accept-Charset", "utf-8");
        httpget.addHeader("Accept-Encoding", "gzip");
        httpget.addHeader("Accept-Language", "en-US,en");
        httpget.addHeader("User-Agent",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
        ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

            @Override
            public String handleResponse(
                    final HttpResponse response) throws ClientProtocolException, IOException {
                int status = response.getStatusLine().getStatusCode();
                if (status >= 200 && status < 300) {
                    HttpEntity entity = response.getEntity();
                    System.out.println(status);
                    return entity != null ? EntityUtils.toString(entity) : null;
                } else {
                    System.out.println(status);
                    Date date = new Date();
                    System.out.println(date);
                    System.exit(0);
                    throw new ClientProtocolException("Unexpected response status: " + status);
                }
            }
        };
        String responseBody = httpclient.execute(httpget, responseHandler);
        return responseBody;
    } finally {
        httpclient.close();
    }
}

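On the next run the server no longer refused access. However, after a large number of pages had been requested in a short time, the server refused access again, still with a 403 error. At that point, visiting the site in a browser also showed a 403 error. When this happens, the only remedy is to reconnect the broadband connection, or to use a VPN or switch proxies, so that the IP address changes.

Since the server had presumably blocked the local machine's IP address, I tried lowering the request rate by adding a sleep() call to the code, so that the program waits for a while after each request. And because a refusal by the server cannot be resolved from within the program, I added

System.exit(0);
to the branch that handles an abnormal response status, so that the program exits immediately.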

public final static String getByString(String url) throws Exception {
    CloseableHttpClient httpclient = HttpClients.createDefault();

    try {
        HttpGet httpget = new HttpGet(url);
        httpget.addHeader("Accept", "text/html");
        httpget.addHeader("Accept-Charset", "utf-8");
        httpget.addHeader("Accept-Encoding", "gzip");
        httpget.addHeader("Accept-Language", "en-US,en");
        httpget.addHeader("User-Agent",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
        ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

            @Override
            public String handleResponse(
                    final HttpResponse response) throws ClientProtocolException, IOException {
                int status = response.getStatusLine().getStatusCode();
                if (status >= 200 && status < 300) {
                    HttpEntity entity = response.getEntity();
                    System.out.println(status);
                    return entity != null ? EntityUtils.toString(entity) : null;
                } else {
                    // Refused by the server: log the status and the time, then exit
                    System.out.println(status);
                    Date date = new Date();
                    System.out.println(date);
                    System.exit(0);
                    throw new ClientProtocolException("Unexpected response status: " + status);
                }
            }
        };
        String responseBody = httpclient.execute(httpget, responseHandler);
        // Wait after each request to lower the request rate (argument is in milliseconds)
        Thread.sleep(200);
        return responseBody;
    } finally {
        httpclient.close();
    }
}
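One side effect of calling System.exit(0) inside the response handler is that the finally block never runs and the process reports a successful exit code. A possible alternative, sketched here under assumptions, is to let the exception propagate and stop the crawl loop in the caller. The names BlockedException, StopOnBlock and crawl are hypothetical, and the fetcher parameter merely stands in for a method like getByString so that the loop can be exercised without a network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical exception type: signals a non-2xx status such as 403. */
class BlockedException extends RuntimeException {
    final int status;

    BlockedException(int status) {
        super("Unexpected response status: " + status);
        this.status = status;
    }
}

class StopOnBlock {
    /**
     * Fetches the URLs in order and stops at the first BlockedException.
     * Pages fetched before the block are kept and returned.
     * The fetcher is an assumption: any function from URL to page content.
     */
    static List<String> crawl(List<String> urls, Function<String, String> fetcher) {
        List<String> pages = new ArrayList<>();
        for (String url : urls) {
            try {
                pages.add(fetcher.apply(url));
            } catch (BlockedException e) {
                // Further requests from this IP would presumably fail too, so stop here
                System.err.println("Stopping at " + url + ": " + e.getMessage());
                break;
            }
        }
        return pages;
    }
}
```

With this shape the handler would only throw, the finally block would still close the client, and the caller decides whether to end the process and with which exit code.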

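Different websites may use different thresholds when judging the interval between requests. For the site crawled here, with sleep set to 2 seconds (an argument of 2000, since Thread.sleep takes milliseconds), the crawler ran continuously for 12 hours without being refused with a 403.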
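Because the suitable interval differs from site to site, it may help to keep it in one configurable place rather than hard-coding it next to the request. The following is a minimal sketch of that idea, not part of the original code: RequestThrottle, delayNeeded, markSent and await are hypothetical names, and the 2000 ms in the usage note is simply the value reported above for this particular site.

```java
/** Hypothetical helper: enforces a minimum interval between consecutive requests. */
class RequestThrottle {
    private final long minIntervalMillis;
    private boolean sentBefore = false; // false until the first request is recorded
    private long lastRequestAt = 0L;    // time of the most recent request, in ms

    RequestThrottle(long minIntervalMillis) {
        if (minIntervalMillis < 0) {
            throw new IllegalArgumentException("interval must be >= 0");
        }
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Milliseconds still to wait at time nowMillis before the next request may go out. */
    synchronized long delayNeeded(long nowMillis) {
        if (!sentBefore) {
            return 0L; // nothing sent yet, no need to wait
        }
        long elapsed = nowMillis - lastRequestAt;
        return Math.max(0L, minIntervalMillis - elapsed);
    }

    /** Records that a request was sent at nowMillis. */
    synchronized void markSent(long nowMillis) {
        sentBefore = true;
        lastRequestAt = nowMillis;
    }

    /** Blocks until the minimum interval has passed, then records the send time. */
    synchronized void await() throws InterruptedException {
        long delay = delayNeeded(System.currentTimeMillis());
        if (delay > 0) {
            Thread.sleep(delay);
        }
        markSent(System.currentTimeMillis());
    }
}
```

A crawler loop could create one `new RequestThrottle(2000)` per site and call `await()` before each call to getByString. Unlike a fixed sleep after every request, this only waits for the part of the interval that has not already elapsed while the previous page was being processed.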