Datasets
Big Data
https://delicious.com/pskomoroch/dataset
http://konect.uni-koblenz.de/
Sogou Labs
http://www.sogou.com/labs/resources.html?v=1
Meteorological datasets
https://www.ncdc.noaa.gov/data-access/quick-links
Climate monitoring datasets
http://cdiac.ornl.gov/ftp/ndp026b
Machine Learning
Amazon Web Services data: http://aws.amazon.com/datasets
Airline data (2009 ASA challenge): http://stat-computing.org/dataexpo/2009/the-data.html
Australian weather: http://www.bom.gov.au/climate/dwo/
Causality Workbench: http://www.causality.inf.ethz.ch/repository.php
Kaggle competition data: https://www.kaggle.com/datasets
KDnuggets competition sites: www.kdnuggets.com/datasets/
Machine learning dataset repository: http://mldata.org/
Medicare data files: http://go.cms.gov/19xxPN4
Microsoft Research: http://research.microsoft.com/apps/dp/dl/downloads.aspx
Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
Additional song datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
RDataMining.com, data for the R and Data Mining e-book: http://www.rdatamining.com/data
Revolution Analytics collection: http://www.revolutionanalytics.com/subscriptions/datasets/
Social networks: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
53.5 billion clicks: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset
http://archive.ics.uci.edu/ml/
http://www.ics.uci.edu/~mlearn//MLRepository.htm
Machine learning sample databases
http://kdd.ics.uci.edu/
http://www.ics.uci.edu/~mlearn/MLRepository.html
A website on data mining for funds
http://www.gotofund.com/index.asp
Links to data generators
http://www.cse.cuhk.edu.hk/~kdd/data_collection.html
Cancer genes:
http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi
Financial data:
http://lisp.vse.cz/pkdd99/Challenge/chall.htm
Networks
Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/
Images
1. ImageNet
http://www.image-net.org/
Contains 14 million images.
2. Tiny Images Dataset
http://horatio.cs.nyu.edu/mit/tiny/data/index.html
Contains 80 million 32x32 images.
3. MirFlickr1M
http://press.liacs.nl/mirflickr/
A set of 1 million images from Flickr.
4. CoPhIR
http://cophir.isti.cnr.it/whatis.html
106 million images from Flickr.
5. SBU captioned photo dataset
http://dsl1.cewit.stonybrook.edu/~vicente/sbucaptions/
A set of 1 million images from Flickr.
6. Large-Scale Image Annotation using Visual Synset (ICCV 2011)
http://cpl.cc.gatech.edu/projects/VisualSynset/
Contains 200 million images.
7. NUS-WIDE
http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm
A set of 270,000 images from Flickr.
8. SUN dataset
http://people.csail.mit.edu/jxiao/SUN/
Contains 130,000 images.
9. MSRA-MM
http://research.microsoft.com/en-us/projects/msrammdata/
Contains 1 million images and 23,000 videos.
10. TRECVID
http://trecvid.nist.gov/
Volcanoes on Venus
7.3G stackoverflow.com-Posts.7z
573.1K stackoverflow.com-Tags.7z
153.0M stackoverflow.com-Users.7z
2.2G stackoverflow.com-Comments.7z
2014/07/07: Yahoo releases a very large Flickr dataset of 100 million images and videos
http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images-for
100+ interesting datasets
http://www.csdn.net/article/2014-06-06/2820111-100-Interesting-Data-Sets-for-Statistics
Personal homepages, research groups, and public dataset URLs related to image processing
http://blog.sciencenet.cn/blog-673472-759786.html
Public Domain Collections
Data360: http://www.data360.org/index.aspx
Datamob.org: http://datamob.org/datasets
Factual: http://www.factual.com/topics/browse
Freebase: http://www.freebase.com/
Google: http://www.google.com/publicdata/directory
infochimps: http://www.infochimps.com/
numbrary: http://numbrary.com/
Quora: http://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-pu...
RS Collection 100+ : http://rs.io/2014/05/29/list-of-data-sets.html
Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html (R)
SourceForge research data: http://www.nd.edu/ oss /数据/研究司
StatSci.org: http://www.statsci.org/datasets.html
UFO reports: http://www.nuforc.org/webreports.html
WikiLeaks 9/11 pager intercepts: http://911.wikileaks.org/files/index.html
Stats4Stem.org R datasets: http://www.stats4stem.org/data-sets.html (R)
Washington Post list: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html
Science
Agricultural experiments: http://www.inside-r.org/packages/cran/agridat/docs/agridat (R)
Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter
and ftp://ftp.cmdl.noaa.gov/
Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
Geo Spatial Data: http://geodacenter.asu.edu/datalist/
Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html
NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/(R)
Protein structure: http://www.infobiotic.net/PSPbenchmarks/
Public Gene Data: http://www.pubgene.org/
Stanford microarray data: http://smd.stanford.edu//
Social Sciences
General Social Survey: http://www3.norc.org/GSS+Website/
ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp
Pew Research: http://www.pewinternet.org/datasets/pages/2/
SNAP: http://snap.stanford.edu/data/index.html
UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
Upjohn Institute: http://www.upjohn.org/erdc/erdc.html
Time Series
Time Series Data Library: http://robjhyndman.com/TSDL/
http://www.stat.wisc.edu/~reinsel/bjr-data/
Universities
Carnegie Mellon University Enron email: http://www.cs.cmu.edu/~enron/
Carnegie Mellon University StatLib: http://lib.stat.cmu.edu/datasets/
KEEL repository: http://sci2s.ugr.es/keel/datasets.php
Carnegie Mellon University JASA data archive: http://lib.stat.cmu.edu/jasadata/
Ohio State University financial data: http://fisher.osu.edu/fin/osudata.htm
UC Berkeley: http://ucdata.berkeley.edu/
UCLA: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
UC Riverside time series: http://www.cs.ucr.edu/ / time_series_data /
University of Toronto: http://www.cs.toronto.edu/~delve/data/datasets.html
Information and Computer Science
UC Irvine
Internet-related datasets
1. Dataset for "Statistics and Social Network of YouTube Videos"
http://netsg.cs.sfu.ca/youtubedata/
2. 1998 World Cup Web Site Access Logs
http://ita.ee.lbl.gov/html/contrib/WorldCup.html
This dataset covers the 1998 World Cup period. Over the 92 days from 1998/04/26 to 1998/07/26, the site received 1,352,804,107 requests.
3. Page view statistics for Wikimedia projects
http://dammit.lt/wikistats/
4. AOL Search Query Logs - RP
http://www.researchpipeline.com/mediawiki/index.php?title=AOL_Search_Query_Logs
5. livedoor gourmet
http://blog.livedoor.jp/techblog/archives/65836960.html
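To give a feel for the scale of the 1998 World Cup access logs listed above, the quoted figures can be turned into average request rates. This is only a rough sketch: the variable names are illustrative, the only inputs are the request count and date range quoted in the list, and it assumes both end dates are included in the range.

```python
from datetime import date

# Figures quoted in the World Cup log description above.
total_requests = 1_352_804_107
first_day = date(1998, 4, 26)
last_day = date(1998, 7, 26)

# Assumption: both endpoints are part of the collection period, hence the +1.
days = (last_day - first_day).days + 1

# Simple averages; real traffic was certainly far burstier than this.
requests_per_day = total_requests / days
requests_per_second = requests_per_day / 86_400

print(f"days covered: {days}")
print(f"mean requests per day: {requests_per_day:,.0f}")
print(f"mean requests per second: {requests_per_second:.1f}")
```

The day count agrees with the 92 days stated in the description, and the averages work out to roughly 14.7 million requests per day.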
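Discrete sequence data
Multivariate data
- Census-Income Database
- COIL data
- Corel Image Features
- Forest CoverType
- Insurance Company Benchmark (COIL 2000)
- Internet Usage Data
- IPUMS Census Data
- KDD CUP 1998 Data
- KDD CUP 1999 Data
- 1990 US Census Data
Relational data
Spatio-temporal data
Text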
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
http://www.w3.org/TR/WD-logfile-960221.html
http://www.w3.org/Daemon/User/Config/Logging.html#AccessLog
http://www.w3.org/1998/11/05/WC-workshop/Papers/bala2.html
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
http://www.web-caching.com/traces-logs.html
http://www-2.cs.cmu.edu/webkb
http://www.cs.auc.dk/research/DP/tdb/TimeCenter/TimeCenterPublications/TR-75.pdf
http://www.cs.cornell.edu/projects/kddcup/index.html
Recommended dataset sources (websites, blogs)
http://kdd.ics.uci.edu/summary.data.type.html
http://www.mlnet.org/cgi-bin/mlnetois.pl/?File=datasets.html
http://lib.stat.cmu.edu/datasets/
http://fimi.cs.helsinki.fi/data/
1. Public Data Sets on Amazon Web Services (AWS)
http://aws.amazon.com/datasets
Since 2008, Amazon has provided developers with tens of terabytes of data for development.
2. Yahoo! Webscope
http://webscope.sandbox.yahoo.com/index.php
3. Konect is a collection of network datasets
http://konect.uni-koblenz.de/
4. Stanford Large Network Dataset Collection
http://snap.stanford.edu/data/index.html