Parsing Web Pages with XPath and Extracting the Content You Need

There are many HTML parsers; the most commonly used ones are HtmlAgilityPack and SgmlReader (http://sourceforge.net/projects/dekiwiki/files/SgmlReader/).


This article uses HtmlAgilityPack.

Download: http://htmlagilitypack.codeplex.com
Alternatively, install it with NuGet. To install HtmlAgilityPack, run the following command in the Package Manager Console:
PM> Install-Package HtmlAgilityPack

The official site also provides HAP Explorer, a tool that generates XPath expressions automatically (http://htmlagilitypack.codeplex.com/downloads/get/120936).

For XPath expression syntax and a tutorial, see: http://blog.csdn.net/huwei2003/article/details/48338165
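To get a feel for how HtmlAgilityPack and XPath work together before the scraping code below, here is a minimal, self-contained sketch; the HTML fragment and the XPath in it are made up purely for illustration:

using System;
using HtmlAgilityPack;

class XPathQuickDemo
{
    static void Main()
    {
        // A made-up HTML fragment, used only to demonstrate the API
        var html = "<html><body><div class='news'><a href='/a1'>First</a><a href='/a2'>Second</a></div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select every <a> element inside the div whose class is 'news'
        var links = doc.DocumentNode.SelectNodes("//div[@class='news']/a");
        if (links != null)   // SelectNodes returns null when nothing matches
        {
            foreach (var link in links)
            {
                // InnerText is the visible text; GetAttributeValue reads the href attribute
                Console.WriteLine(link.InnerText + " -> " + link.GetAttributeValue("href", ""));
            }
        }
    }
}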

There are several ways to obtain the HTML:

1. Use the HttpWebRequest class to simulate a login and fetch the page content.

2. Use a third-party control to simulate the login, as shown next.

Simulating the login with AutoTesterLib:

Download AutoTesterLib.dll from http://code.google.com/p/autotester/downloads/list
and add a reference to AutoTesterLib.dll to your project.

The code is as follows:

    protected void Page_Load(object sender, EventArgs e)
    {
        string url = "http://rencai.baidu.com/user/login.jsp";
        TestSession ts = new HTMLTestSession();
        ts.Browser.Start(url);

        // You can locate a control by any attribute, e.g. name=..., type=password, etc.,
        // with multiple conditions separated by semicolons.
        ts.Objects.TextBox("name=username").Input("your username");
        ts.Objects.TextBox("name=password").Input("your password");
        ts.Objects.Button("type=submit").Click();
    }

3. Analyzing the page content

Usage: first add a reference to the HtmlAgilityPack DLL and import its namespace with using HtmlAgilityPack;


The function that extracts content by an XPath expression:

        /// <summary>
        /// Gets the content matched by an XPath expression as a string
        /// </summary>
        /// <param name="content">The HTML content to extract from</param>
        /// <param name="xpath">The XPath expression</param>
        /// <param name="separ">Separator inserted between matched nodes</param>
        /// <returns>The extracted content</returns>
        public static string GetStrByXPath(string content, string xpath, string separ)
        {
            var doc1 = new HtmlDocument();
            doc1.LoadHtml(content);
            var repeatNodes = doc1.DocumentNode.SelectNodes(xpath);
            var text = "";
            // SelectNodes returns null when nothing matches
            if (repeatNodes == null)
            {
                return text;
            }
            // Loop over the matched nodes
            foreach (var node in repeatNodes)
            {
                //text += node.InnerText + separ;
                text += node.InnerHtml + separ;
            }
            return text;
        }


For example, the XPath //div[@itemprop='articleBody'] extracts the body of the news article at http://news.sohu.com/20151029/n424555111.shtml.
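As a rough sketch (assuming the GetStrByXPath helper above lives in a StrHelperUtil class, as in the longer example below, and using the HttpHelper class attached at the end of this article), the call would look like this:

    var html = HttpHelper.GetContent("http://news.sohu.com/20151029/n424555111.shtml", Encoding.UTF8);
    var body = StrHelperUtil.GetStrByXPath(html, "//div[@itemprop='articleBody']", "");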

A longer example:

        public async Task<News> NewsGathering(string newsUrl)
        {
            // Fetch the full page content
            var strContent = await HttpHelper.GetContentAsync(newsUrl, Encoding.UTF8);
            var title = "";
            var content = "";
            var pubTime = DateTime.Now;
            try
            {
                // Extract the title and the publication time
                title = StrHelperUtil.GetStrByXPath(strContent, "//h1[@class='art_title_h1']", "");
                var strTime = StrHelperUtil.GetStrByXPath(strContent, "//time", "");
                strTime = StrHelperUtil.FormatHTML(strTime).Replace("\n", "").Replace("\\n", "");
                if (strTime.Length > 18)
                {
                    strTime = strTime.Substring(0, 17);
                    DateTime.TryParse(strTime, out pubTime);
                }
                // Extract the article body
                content = StrHelperUtil.GetStrByXPath(strContent, "//section[@data-sudaclick='articleContent']", "");
                // Use regular expressions to strip unwanted fragments
                content = Regex.Replace(content, "<h1 class=\"art_title_h1\">[\\s\\S]*?</h1>", "");
                content = Regex.Replace(content, "<time>[\\s\\S]*?</time>", "");
                content = Regex.Replace(content, "<aside>[\\s\\S]*?</aside>", "");
                content = Regex.Replace(content, "<script type=\"comos/ver\">[\\s\\S]*?</script>", "");
                content = Regex.Replace(content, "<section class=\"M_attitude\" data-pl=\"attitude\" data-sudaclick=\"attitude\">[\\s\\S]*?</section>", "");
                content = Regex.Replace(content, "<!--.*?-->", "");

                var news = new News
                {
                    Content = content,
                    Title = title,
                    PubDate = pubTime,
                    From = newsUrl
                };
                return news;
            }
            catch (Exception)
            {
                return null;
            }
        }

Explanation:

content = Regex.Replace(content, "<time>[\\s\\S]*?</time>", ""); uses a regular expression to strip every substring that starts with <time> and ends with </time>, tags included.

[\\s\\S]*? is a lazy (non-greedy) match. Given
<time>2015-9-10 12:25</time>afkjldkfkljkl<time>2015-10-11 11:25</time>
it matches only <time>2015-9-10 12:25</time> and <time>2015-10-11 11:25</time>.
With [\\s\\S]* (dropping the trailing ?) the match is greedy instead and matches the whole string <time>2015-9-10 12:25</time>afkjldkfkljkl<time>2015-10-11 11:25</time>, as the short demo below shows.
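A minimal, self-contained demo of the difference (the input string is just the example above):

using System;
using System.Text.RegularExpressions;

class GreedyVsLazyDemo
{
    static void Main()
    {
        var input = "<time>2015-9-10 12:25</time>afkjldkfkljkl<time>2015-10-11 11:25</time>";

        // Lazy: each <time>...</time> pair is removed separately, so the text between them survives
        Console.WriteLine(Regex.Replace(input, "<time>[\\s\\S]*?</time>", ""));
        // Prints: afkjldkfkljkl

        // Greedy: the single match runs from the first <time> to the last </time>, removing everything
        Console.WriteLine(Regex.Replace(input, "<time>[\\s\\S]*</time>", ""));
        // Prints an empty line
    }
}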

Attached below is the class used to fetch web page content (some of its methods are redundant):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using System.Web;

namespace Net.Tools
{
    public class HttpHelper
    {
        /// <summary>
        /// Gets the content of the specified web page
        /// </summary>
        /// <param name="strUrl">Page URL</param>
        /// <param name="encoder">Page encoding</param>
        /// <returns>string</returns>
        public static string GetContent(string strUrl, Encoding encoder)
        {
            string strMsg = string.Empty;
            try
            {
                CookieContainer cc = new CookieContainer();
                //WebRequest request = WebRequest.Create(strUrl);
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(strUrl);

                //set request args
                request.Method = "Get";
                request.CookieContainer = cc;
                request.KeepAlive = true;

                //request.ContentType = "application/x-www-form-urlencoded; charset=UTF-8";
                request.ContentType = "text/html";

                // Pretend to be the Chrome desktop browser
                request.UserAgent =
                    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36";
                //request.Referer = strUrl;
                //request.Headers.Add("x-requested-with:XMLHttpRequest");
                request.Headers.Add("x-requested-with:com.android.browser");
                request.Headers.Add(HttpRequestHeader.AcceptLanguage, "zh-CN,zh;q=0.8,en;q=0.6,nl;q=0.4,zh-TW;q=0.2");
                //request.ContentLength = postdataByte.Length;  text/html; charset=utf-8
                request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
                request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip |
                                                 DecompressionMethods.None;
                // Follow redirects; the result will be the final page after redirection
                request.AllowAutoRedirect = true;

                request.Headers.Add("Accept-Encoding", "gzip, deflate");
                if (request.Method == "POST")
                {
                    (request as HttpWebRequest).ContentType = "application/x-www-form-urlencoded";
                }

                HttpWebResponse response = (HttpWebResponse)request.GetResponse();

                //StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312"));
                StreamReader reader = new StreamReader(response.GetResponseStream(), encoder);

                strMsg = reader.ReadToEnd();
                // \0 is the null character (a string terminator in C); strip it
                strMsg = strMsg.Replace("\0", "");
                reader.Close();
                reader.Dispose();
                response.Close();
            }
            catch
            {
            }
            return strMsg;
        }

        public static async Task<string> GetContentAsync(string strUrl, Encoding encoder)
        {
            var strMsg = await Task.Run<string>(() =>
            {
                try
                {
                    #region

                    CookieContainer cc = new CookieContainer();
                    //WebRequest request = WebRequest.Create(strUrl);
                    HttpWebRequest request = (HttpWebRequest) WebRequest.Create(strUrl);

                    //set request args
                    request.Method = "Get";
                    request.CookieContainer = cc;
                    request.KeepAlive = true;

                    //request.ContentType = "application/x-www-form-urlencoded; charset=UTF-8";
                    request.ContentType = "text/html";

                    // Pretend to be the Chrome desktop browser
                    request.UserAgent =
                        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36";
                    //request.Referer = strUrl;
                    //request.Headers.Add("x-requested-with:XMLHttpRequest");
                    request.Headers.Add("x-requested-with:com.android.browser");

                    request.Headers.Add(HttpRequestHeader.AcceptLanguage, "zh-CN,zh;q=0.8,en;q=0.6,nl;q=0.4,zh-TW;q=0.2");
                    //request.ContentLength = postdataByte.Length;  text/html; charset=utf-8
                    request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
                    request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip |
                                                     DecompressionMethods.None;
                    // Follow redirects; the result will be the final page after redirection
                    request.AllowAutoRedirect = true;

                    request.Headers.Add("Accept-Encoding", "gzip, deflate");
                    if (request.Method == "POST")
                    {
                        (request as HttpWebRequest).ContentType = "application/x-www-form-urlencoded";
                    }

                    HttpWebResponse response = (HttpWebResponse) request.GetResponse();

                    //StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312"));
                    StreamReader reader = new StreamReader(response.GetResponseStream(), encoder);

                    var strcontent = reader.ReadToEnd();
                    // \0 is the null character (a string terminator in C); strip it
                    strcontent = strcontent.Replace("\0", "");
                    reader.Close();
                    reader.Dispose();
                    response.Close();
                    return strcontent;

                    #endregion
                }
                catch (Exception ex)
                {
                    return "";
                }
            });

            return strMsg;
        }
        public static async Task<string> GetContentByMobileAgentAsync(string strUrl, Encoding encoder)
        {
            var strMsg = await Task.Run<string>(() =>
            {
                try
                {
                    #region

                    CookieContainer cc = new CookieContainer();
                    //WebRequest request = WebRequest.Create(strUrl);
                    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(strUrl);

                    //set request args
                    request.Method = "Get";
                    request.CookieContainer = cc;
                    request.KeepAlive = true;

                    //request.ContentType = "application/x-www-form-urlencoded; charset=UTF-8";
                    request.ContentType = "text/html";

                    // Pretend to be a mobile (Android) browser
                    request.UserAgent =
                        "CoolPad8750_CMCC_TD/1.0 Linux/3.4.5 Android/4.2.1 Release/06.31.2013 Browser/1.0 Profile/MIDP-1.0 Configuration/CLDC-1.0";
                    //request.Referer = strUrl;
                    //request.Headers.Add("x-requested-with:XMLHttpRequest");
                    request.Headers.Add("x-requested-with:com.android.browser");

                    request.Headers.Add(HttpRequestHeader.AcceptLanguage, "zh-CN,zh;q=0.8,en;q=0.6,nl;q=0.4,zh-TW;q=0.2");
                    //request.ContentLength = postdataByte.Length;  text/html; charset=utf-8
                    request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
                    request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip |
                                                     DecompressionMethods.None;
                    // Follow redirects; the result will be the final page after redirection
                    request.AllowAutoRedirect = true;

                    request.Headers.Add("Accept-Encoding", "gzip, deflate");
                    if (request.Method == "POST")
                    {
                        (request as HttpWebRequest).ContentType = "application/x-www-form-urlencoded";
                    }

                    HttpWebResponse response = (HttpWebResponse)request.GetResponse();

                    //StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312"));
                    StreamReader reader = new StreamReader(response.GetResponseStream(), encoder);

                    var strcontent = reader.ReadToEnd();
                    // \0 is the null character (a string terminator in C); strip it
                    strcontent = strcontent.Replace("\0", "");
                    reader.Close();
                    reader.Dispose();
                    response.Close();
                    return strcontent;

                    #endregion
                }
                catch (Exception ex)
                {
                    return "";
                }
            });

            return strMsg;
        }
        public static string GetContent(string strUrl, Encoding encoder, CookieContainer cc)
        {
            string strMsg = string.Empty;
            try
            {
                //CookieContainer cc = new CookieContainer();
                //WebRequest request = WebRequest.Create(strUrl);
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(strUrl);

                //set request args
                request.Method = "Get";
                request.CookieContainer = cc;
                request.KeepAlive = true;

                //request.ContentType = "application/x-www-form-urlencoded; charset=UTF-8";
                request.ContentType = "text/html";

                // Pretend to be the Chrome desktop browser
                request.UserAgent =
                    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36";
                //request.Referer = strUrl;
                request.Headers.Add("x-requested-with:XMLHttpRequest");
                request.Headers.Add(HttpRequestHeader.AcceptLanguage, "zh-CN,zh;q=0.8,en;q=0.6,nl;q=0.4,zh-TW;q=0.2");
                //request.ContentLength = postdataByte.Length;  text/html; charset=utf-8
                request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
                request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip |
                                                 DecompressionMethods.None;
                // Follow redirects; the result will be the final page after redirection
                request.AllowAutoRedirect = true;

                request.Headers.Add("Accept-Encoding", "gzip, deflate");
                if (request.Method == "POST")
                {
                    (request as HttpWebRequest).ContentType = "application/x-www-form-urlencoded";
                }

                HttpWebResponse response = (HttpWebResponse)request.GetResponse();

                //StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312"));
                StreamReader reader = new StreamReader(response.GetResponseStream(), encoder);

                strMsg = reader.ReadToEnd();
                // \0 is the null character (a string terminator in C); strip it
                strMsg = strMsg.Replace("\0", "");
                reader.Close();
                reader.Dispose();
                response.Close();
            }
            catch
            {
            }
            return strMsg;
        }
        /// <summary>
        /// Gets the content of the specified web page
        /// </summary>
        /// <param name="strUrl">Page URL</param>
        /// <param name="encoder">Page encoding; pass null to auto-detect the encoding from the page</param>
        /// <returns>string</returns>
        public static string GetContent2(string strUrl, Encoding encoder)
        {
            string strMsg = string.Empty;
            CookieContainer cc = new CookieContainer();
            //WebRequest request = WebRequest.Create(strUrl);
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(strUrl);

            //set request args
            request.Method = "Get";
            request.CookieContainer = cc;
            request.KeepAlive = true;
            //request.ContentType = "application/x-www-form-urlencoded; charset=UTF-8";
            request.ContentType = "text/html";
            //request.UserAgent = "Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1";
            request.UserAgent =
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36";
            request.Referer = strUrl;
            request.Headers.Add("x-requested-with:XMLHttpRequest");
            request.Headers.Add(HttpRequestHeader.AcceptLanguage, "zh-CN,zh;q=0.8,en;q=0.6,nl;q=0.4,zh-TW;q=0.2");
            //request.ContentLength = postdataByte.Length;  text/html; charset=utf-8
            request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
            request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip |
                                             DecompressionMethods.None;
            // Follow redirects; the result will be the final page after redirection
            request.AllowAutoRedirect = true;

            request.Headers.Add("Accept-Encoding", "gzip, deflate");
            if (request.Method == "POST")
            {
                (request as HttpWebRequest).ContentType = "application/x-www-form-urlencoded";
            }

            #region Fetch the data

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                // No encoding was supplied, so detect it from the page itself
                if (encoder == null)
                {
                    MemoryStream _stream = new MemoryStream();
                    response.GetResponseStream().CopyTo(_stream, 10240);
                    byte[] RawResponse = _stream.ToArray();
                    string temp = Encoding.Default.GetString(RawResponse, 0, RawResponse.Length);
                    //<meta(.*?)charset([\s]?)=[^>](.*?)>
                    Match meta = Regex.Match(temp, "<meta([^<]*)charset=([^<]*)[\"']",
                        RegexOptions.IgnoreCase | RegexOptions.Multiline);
                    string charter = (meta.Groups.Count > 2) ? meta.Groups[2].Value : string.Empty;
                    charter = charter.Replace("\"", string.Empty).Replace("'", string.Empty).Replace(";", string.Empty);
                    if (charter.Length > 0)
                    {
                        encoder = Encoding.GetEncoding(charter);
                    }
                    else
                    {
                        if (string.IsNullOrEmpty(response.CharacterSet))
                        {
                            encoder = Encoding.UTF8;
                        }
                        else
                        {
                            encoder = Encoding.GetEncoding(response.CharacterSet);
                        }
                    }
                    strMsg = encoder.GetString(RawResponse);
                }
                else
                {
                    // Read the stream with the caller-supplied encoding
                    using (StreamReader reader = new StreamReader(response.GetResponseStream(), encoder))
                    {
                        strMsg = reader.ReadToEnd();
                    }
                }
            }

            #endregion

            return strMsg.Replace("\0", "");
        }

        public static string PostLogin(string postData, string requestUrlString, ref CookieContainer cookie)
        {
            ASCIIEncoding encoding = new ASCIIEncoding();
            byte[] data = encoding.GetBytes(postData);
            // Send the POST request to the server
            HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(requestUrlString);
            myRequest.Method = "POST";
            myRequest.ContentType = "application/x-www-form-urlencoded";
            myRequest.ContentLength = data.Length;
            myRequest.CookieContainer = new CookieContainer();
            Stream newStream = myRequest.GetRequestStream();
            newStream.Write(data, 0, data.Length);
            newStream.Close();
            // Read the response and return it to the caller (UI / application)
            HttpWebResponse myResponse = (HttpWebResponse)myRequest.GetResponse();
            cookie.Add(myResponse.Cookies);
            StreamReader reader = new StreamReader(myResponse.GetResponseStream(), Encoding.UTF8);
            return reader.ReadToEnd();
        }
    }
}
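A hypothetical usage sketch of HttpHelper (the URLs and form field names here are placeholders, not taken from a real site): log in once with PostLogin, keep the cookies that come back, then pass the same CookieContainer to the cookie-aware GetContent overload so the next request is treated as logged in.

using System;
using System.Net;
using System.Text;
using Net.Tools;

class HttpHelperDemo
{
    static void Main()
    {
        // Placeholder URL and form fields, for illustration only
        var cookies = new CookieContainer();
        HttpHelper.PostLogin("username=demo&password=secret",
                             "http://example.com/login", ref cookies);

        // Reuse the cookies so this request carries the login session
        var pageHtml = HttpHelper.GetContent("http://example.com/protected/page",
                                             Encoding.UTF8, cookies);
        Console.WriteLine(pageHtml.Length);
    }
}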


 
 







### Parsing Web Page Content with XPath and Saving the Result as an HTML File

To parse web page content with XPath and store the result as an HTML file, follow these steps:

#### 1. Install the required library

To process HTML and XML data, the `lxml` library is commonly used; it supports parsing and extraction with XPath expressions[^3].

Install it with pip:

```bash
pip install lxml
```

---

#### 2. Fetch the page content

First, fetch the HTML of the target page, for example by sending an HTTP request with Python's `requests` library.

Example:

```python
import requests

url = 'https://example.com'  # replace with the target URL
response = requests.get(url)
if response.status_code == 200:
    html_content = response.text
else:
    raise Exception(f"Failed to retrieve the page: {response.status_code}")
```

---

#### 3. Create a parseable object

Convert the fetched HTML into a data structure that XPath queries can run against, using `lxml.etree.HTML()`[^1].

Example:

```python
from lxml import etree

# Convert the HTML text into a parseable element tree
tree = etree.HTML(html_content)
```

---

#### 4. Extract the data with XPath

With a suitable XPath expression you can precisely locate the nodes or attributes you need. Some common XPath examples:

- **Extract all links**:

  ```xpath
  //a/@href
  ```

- **Find an element by ID**:

  ```xpath
  //*[@id="element_id"]
  ```

- **Find elements with a particular class name**:

  ```xpath
  //div[contains(@class, "classname")]
  ```

For instance, to extract the content of every cell in a table:

```python
table_data = tree.xpath('//table//td/text()')
print(table_data)
```

For more complex scenarios, expressions can also be built dynamically from variables[^4].

---

#### 5. Build the new HTML structure

If you only want to keep part of the original content, or replace certain fields, you can reassemble a new HTML fragment to suit your needs. The example below builds a simple list page from the extracted data:

```python
new_html = '<ul>\n'
for item in table_data[:5]:  # keep the first five items as a demo
    new_html += f'<li>{item.strip()}</li>\n'
new_html += '</ul>'
```

The result is a string containing the newly formatted layout.

---

#### 6. Save to a local file

Finally, write the new HTML out to disk using the built-in open function:

```python
output_file_path = './result.html'
with open(output_file_path, mode='w', encoding='utf-8') as file_obj:
    file_obj.write(new_html)
print(f'Saved successfully at path: {output_file_path}')
```

This completes the whole workflow, from the HTTP request to the saved HTML file.