Python: Collecting RSS Feed Content and Aggregating It into a JSON File

guaguastd

Published 2015-01-24 08:40:46

Category: PYTHON. Tags: data mining

#!/usr/bin/python 
# -*- coding: utf-8 -*-

'''
Created on 2015-1-24
@author: beyondzhou
@name: harvest_blog_data.py
'''

import os
import json
import feedparser
from html import cleanHtml

FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'

fp = feedparser.parse(FEED_URL)

# Report the number of entries in the feed, not the length of the first title
print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title':e.title, 'content':cleanHtml(e.content[0].value),
                            'link':e.links[0].href})

# Write the collected posts as JSON; the with-block closes the file even on error
with open(os.path.join(r"E:", "\\", "eclipse", "Web", "dfile", 'feed.json'), 'w') as f:
    f.write(json.dumps(blog_posts, indent=1))

print 'Wrote output file to %s' % (f.name, )

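# Contents of html.py, the helper module that the script above imports cleanHtml from.
# Strip HTML tags and convert HTML entities back to plain-text representations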
def cleanHtml(html):
    from nltk import clean_html
    from BeautifulSoup import BeautifulStoneSoup
    
    if html == "": return ""
    
    return BeautifulStoneSoup(clean_html(html), 
        convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

Output of running the script:

Fetched 33 entries from 'O'Reilly Radar - Insight, analysis, and research about emerging technologies'
Wrote output file to E:\eclipse\Web\dfile\feed.json
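
The listing above targets Python 2 and relies on `nltk.clean_html` and BeautifulSoup 3, neither of which is available in current releases. The same clean-and-serialize step can be sketched on Python 3 with only the standard library. This is an illustrative sketch rather than the original author's method: the names `clean_html`, `build_posts`, `dump_posts`, and `_TextExtractor` are hypothetical, and entries are assumed to be plain dicts with `title`, `content`, and `link` keys.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(markup):
    """Strip tags and decode entities, returning plain text."""
    if not markup:
        return ""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


def build_posts(entries):
    """Turn feed entries (assumed to be dicts here) into JSON-ready records."""
    return [
        {
            "title": e["title"],
            "content": clean_html(e["content"]),
            "link": e["link"],
        }
        for e in entries
    ]


def dump_posts(posts, path):
    """Write the records to a UTF-8 JSON file, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(posts, f, indent=1, ensure_ascii=False)


if __name__ == "__main__":
    sample = [{
        "title": "Example post",
        "content": "<p>Tom &amp; <b>Jerry</b></p>",
        "link": "http://example.com/a",
    }]
    print(json.dumps(build_posts(sample), indent=1))
```

To use this with a live feed, the dicts could be built from the attributes the original script already reads: `e.title`, `e.content[0].value`, and `e.links[0].href` for each `e` in `feedparser.parse(FEED_URL).entries`.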
