Scraping novel titles and chapter text with Java and the WebMagic framework

Latest recommended article published 2025-06-09 16:18:07

License: CC 4.0 BY-SA

Tags:

#webmagic #java #novel-crawler


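This article shows how to scrape novel content with the WebMagic framework. It covers the workflow and a few tips for crawling a single novel and for crawling every novel in a category, and gives the implementation code and the run results.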


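Scraping a static web page without WebMagic is tedious: you have to open the network connection yourself and then pick apart the page source. With the WebMagic framework, extracting information from a page becomes much easier.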
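For comparison, the manual approach might look roughly like the sketch below. It is not code from the original article: it assumes Java 11+ (for `java.net.http.HttpClient`), and the class `ManualFetch` and its method names are made up for illustration. The regular expression is only a stand-in for real HTML parsing and would break on nested or malformed markup.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: fetch a page by hand and pull out its first <h1>.
class ManualFetch {
    // Matches the first <h1>...</h1> pair, across line breaks, ignoring case.
    private static final Pattern H1 =
            Pattern.compile("<h1[^>]*>(.*?)</h1>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Step 1: open the connection and download the page source.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Step 2: analyse the source; returns null when the page has no <h1>.
    static String firstHeading(String html) {
        Matcher m = H1.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) throws Exception {
        // Pass a URL to fetch a real page; otherwise a small inline sample is used.
        String html = args.length > 0
                ? fetch(args[0])
                : "<div class=\"bookname\"><h1> Chapter 1 </h1></div>";
        System.out.println(firstHeading(html));
    }
}
```

Even this small version has to handle the request, the response body and the parsing separately, and it does not yet follow links, retry, or throttle requests. Those are the parts WebMagic takes over.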

1. The novel scraped here is 圣墟 (Sheng Xu). The code below crawls that single novel:

import java.util.List;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class MyProcess implements PageProcessor{
	
	// Site-level crawl settings: encoding, delay between requests, retry count, and so on
	private Site site = Site.me().setRetryTimes(3).setSleepTime(3000);
	
	public Site getSite() {
		return site;
	}

	// process() is the core hook for custom crawl logic; the extraction rules go here
	public void process(Page page) {
		// Collect the chapter links found on the page
		List<String> links = page.getHtml().links().regex("/43_43821/\\d+\\.html").all();
		// Queue them for crawling with addTargetRequests
		page.addTargetRequests(links);
		// Store the extracted values with putField
		page.putField("title", page.getHtml().xpath("//div[@class='bookname']/h1").toString());
		// Leading "//" added so this expression has the same form as the title one
		page.putField("content", page.getHtml().xpath("//div[@id='content']").toString());
	}
	
	public static void main(String[] args){
		Spider.create(new MyProcess()).addUrl("http://www.biqudu.com/43_43821")
//		.addPipeline(new JsonFilePipeline("f://test"))	// save the results as JSON files on local disk
        	.addPipeline(new ConsolePipeline())	// print the results to the console
        	.run();
	}
}

The output is shown in the figure below:

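There is one drawback: the chapters come out in the order in which their links appear in the page source, which is not necessarily first chapter to last. The extracted content also still carries its HTML tags. To strip the tags and output only the text, append /text() to the XPath expression.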

For example:

page.putField("title", page.getHtml().xpath("//div[@class='bookname']/h1/text()"));

After this change, the output looks like this:
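As for the ordering drawback mentioned above, one possible workaround is to sort the chapter links by the numeric id in their URL, either before passing them to addTargetRequests or after the results have been stored. The sketch below is not from the original article. It assumes that a larger chapter id means a later chapter, which the site does not guarantee. `ChapterOrder` and its methods are hypothetical names, and the chapter ids in the sample are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: order chapter URLs by the numeric id in ".../<id>.html".
class ChapterOrder {
    // Same URL shape as the chapter regex used in the article, with the id captured.
    private static final Pattern CHAPTER_ID = Pattern.compile("/\\d+_\\d+/(\\d+)\\.html");

    // Returns the chapter id, or Long.MAX_VALUE so that non-chapter URLs sort last.
    static long chapterId(String url) {
        Matcher m = CHAPTER_ID.matcher(url);
        return m.find() ? Long.parseLong(m.group(1)) : Long.MAX_VALUE;
    }

    // Returns a sorted copy; the input list is left untouched.
    static List<String> sortByChapterId(List<String> urls) {
        List<String> sorted = new ArrayList<>(urls);
        sorted.sort(Comparator.comparingLong(ChapterOrder::chapterId));
        return sorted;
    }

    public static void main(String[] args) {
        // Sample links in "page source" order; the ids are invented for the demo.
        List<String> links = List.of(
                "/43_43821/2520350.html",
                "/43_43821/2520338.html",
                "/43_43821/2520345.html");
        sortByChapterId(links).forEach(System.out::println);
    }
}
```

Sorting the links only fixes the order in which they are queued. If the spider runs with several threads, pages can still finish out of order, so sorting the stored results may be the more dependable option.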

2. Code for scraping every novel in one category (the category scraped here is xuanhuan, i.e. fantasy, novels):

import java.util.List;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;


public class MyProcessor implements PageProcessor{
	// Category listing page, e.g. /xuanhuanxiaoshuo/
	public static final String FIRST_URL = "http://www\\.biqudu\\.com/\\w+";
	// Index page of one novel, e.g. /43_43821/
	public static final String HELP_URL = "/\\d+_\\d+/";
	// Chapter page of a novel
	public static final String TARGET_URL = "/\\d+_\\d+/\\d+\\.html";
	private Site site = Site.me().setRetryTimes(3).setSleepTime(3000);
	
	public Site getSite() {
		return site;
	}


	public void process(Page page) {
		// Test the most specific pattern first: the patterns match substrings,
		// so FIRST_URL and HELP_URL would also match a chapter URL
		if(page.getUrl().regex(TARGET_URL).match()){
			// Chapter page: novel title, chapter title and chapter text
			page.putField("NovelTitle", page.getHtml().xpath("//div[@class='con_top']/a[2]/text()"));
			page.putField("ContentTitle", page.getHtml().xpath("//div[@class='bookname']/h1/text()"));
			page.putField("content", page.getHtml().xpath("//div[@id='content']/text()"));
		}else if(page.getUrl().regex(HELP_URL).match()){
			// Novel index page: queue every chapter and record the novel's details
			List<String> links = page.getHtml().links().regex(TARGET_URL).all();
			page.addTargetRequests(links);
			page.putField("noveltitle",page.getHtml().xpath("//div[@id='info']/h1/text()"));
			// Each field needs its own key; "infoLine1" and "infoLine3" are placeholder names
			page.putField("infoLine1",page.getHtml().xpath("//div[@id='info']/p[1]/text()"));
			page.putField("infoLine3",page.getHtml().xpath("//div[@id='info']/p[3]/text()"));
			page.putField("info",page.getHtml().xpath("//div[@id='intro']/p[1]/text()"));
		}else if(page.getUrl().regex(FIRST_URL).match()){
			// Category page: queue the index page of every novel listed on it
			List<String> urls = page.getHtml().links().regex(HELP_URL).all();
			page.addTargetRequests(urls);
			// Nothing to extract here, so keep this page out of the pipeline
			page.setSkip(true);
		}
	}
	
	public static void main(String[] args){
		Spider.create(new MyProcessor()).addUrl("http://www.biqudu.com/xuanhuanxiaoshuo/")
		.addPipeline(new ConsolePipeline())
		.run();
	}
	
}
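The three URL patterns in this listing overlap, which is easy to miss. `\w` matches digits and underscores, so the category pattern also matches novel and chapter URLs, and the novel-index pattern is a prefix of the chapter pattern. The sketch below is not part of the original article. It uses plain `java.util.regex` with substring matching (`Matcher.find()`), which is assumed to mirror how the processor's URL checks behave, and shows one way to tell the three page types apart by testing the most specific pattern first. `UrlKind` and `classify` are hypothetical names, and the chapter id in the sample URL is made up.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: decide which kind of page a URL points to.
class UrlKind {
    enum Kind { CHAPTER, NOVEL_INDEX, CATEGORY, OTHER }

    // Same expressions as FIRST_URL, HELP_URL and TARGET_URL in the listing above.
    static final Pattern CATEGORY = Pattern.compile("http://www\\.biqudu\\.com/\\w+");
    static final Pattern NOVEL_INDEX = Pattern.compile("/\\d+_\\d+/");
    static final Pattern CHAPTER = Pattern.compile("/\\d+_\\d+/\\d+\\.html");

    // Checks run from most specific to least specific because the patterns overlap.
    static Kind classify(String url) {
        if (CHAPTER.matcher(url).find()) return Kind.CHAPTER;
        if (NOVEL_INDEX.matcher(url).find()) return Kind.NOVEL_INDEX;
        if (CATEGORY.matcher(url).find()) return Kind.CATEGORY;
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        List<String> samples = List.of(
                "http://www.biqudu.com/xuanhuanxiaoshuo/",
                "http://www.biqudu.com/43_43821/",
                "http://www.biqudu.com/43_43821/2520338.html"); // invented chapter id
        for (String url : samples) {
            System.out.println(classify(url) + "  " + url);
        }
    }
}
```

Testing the category pattern first would send every page on the site into the category branch, which is why the order of the checks matters.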