How do you filter invalid UTF-8 characters in Java?

flysharkym

License: CC 4.0 BY-SA

Category: java. Tags: utf-8, java

Article link: https://blog.csdn.net/patrickyoung6625/article/details/12042727

This article describes a problem in Nutch 1.3 with handling UTF-8 non-character code points and gives the code that fixes it. The problem was resolved in version 1.4.

I have run into this error from Nutch/Solr a few times: Invalid UTF-8 character. It turns out that Nutch 1.3 has a bug, "Strip UTF-8 non-character codepoints", that was fixed in 1.4.

Bug link: https://issues.apache.org/jira/browse/NUTCH-1016

So I dug out the code Nutch uses to filter invalid UTF-8 characters to share with everyone. Straight to the code:

public static String stripNonCharCodepoints(String input) {
  StringBuilder retval = new StringBuilder();
  char ch;

  for (int i = 0; i < input.length(); i++) {
    ch = input.charAt(i);

    // Strip all non-characters http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
    // and non-printable control characters except tabulator, new line and carriage return
    if (ch % 0x10000 != 0xffff && // 0xffff - 0x10ffff range step 0x10000
        ch % 0x10000 != 0xfffe && // 0xfffe - 0x10fffe range
        (ch <= 0xfdd0 || ch >= 0xfdef) && // 0xfdd0 - 0xfdef
        (ch > 0x1F || ch == 0x9 || ch == 0xa || ch == 0xd)) {

      retval.append(ch);
    }
  }

  return retval.toString();
}
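
One thing worth noting about the method above: it tests UTF-16 `char` values one at a time, so `ch % 0x10000` is always just `ch`, and the non-characters outside the basic plane (U+1FFFE, U+1FFFF, ... U+10FFFF) are never seen as single values. Its range test also lets U+FDD0 and U+FDEF themselves through. If that matters for your data, a code-point-based variant could look roughly like the sketch below. This is my own illustration, not Nutch's code: the class name `CodePointFilter` and the helpers `isNonCharacter` and `isStrippedControl` are hypothetical, and I am assuming the goal is to drop all 66 Unicode non-characters plus the same control characters as above.

```java
public class CodePointFilter {

  // Hypothetical helper (not part of Nutch): true for the 66 Unicode non-characters.
  static boolean isNonCharacter(int cp) {
    return (cp & 0xFFFE) == 0xFFFE          // U+nFFFE and U+nFFFF in every plane
        || (cp >= 0xFDD0 && cp <= 0xFDEF);  // U+FDD0..U+FDEF, inclusive
  }

  // True for C0 control characters other than tab, line feed and carriage return.
  static boolean isStrippedControl(int cp) {
    return cp <= 0x1F && cp != 0x9 && cp != 0xA && cp != 0xD;
  }

  public static String stripNonCharCodePoints(String input) {
    StringBuilder retval = new StringBuilder(input.length());

    for (int i = 0; i < input.length(); ) {
      // codePointAt combines a valid surrogate pair into one code point
      int cp = input.codePointAt(i);
      i += Character.charCount(cp);

      if (!isNonCharacter(cp) && !isStrippedControl(cp)) {
        retval.appendCodePoint(cp);
      }
    }

    return retval.toString();
  }
}
```

Calling `CodePointFilter.stripNonCharCodePoints(text)` before sending a field to Solr would be the intended use. Unpaired surrogates are passed through unchanged in this sketch; whether to drop them as well depends on what the receiving side accepts.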