-
Notifications
You must be signed in to change notification settings - Fork 274
utf8 to utf16 conversion and utf16 to ascii #372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
utf8 to utf16 conversion and utf16 to ascii #372
Conversation
src/util/unicode.cpp
Outdated
@@ -256,8 +256,14 @@ const char **narrow_argv(int argc, const wchar_t **argv_wide) | |||
return argv_narrow; | |||
} | |||
|
|||
std::wstring utf8_to_utf16(const std::string& in) | |||
std::wstring utf8_to_utf16be(const std::string& in) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be acceptable to spell out big_endian (and, below, little_endian). Don't make others decipher names.
src/util/unicode.cpp
Outdated
|
||
std::wstring utf8_to_utf16le(const std::string& in) | ||
{ | ||
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t,0x10ffffUL,std::codecvt_mode::little_endian> > converter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some macro or named constant that can be used in place of the hex value? Else it would be nice to have some comment explaining why it's exactly this bit pattern.
Also, the line should be broken somewhere before the 80th (or even 70th) character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I couldn't find such a constant, this is in fact the default for this template, hopefully this commit af792fe makes it a bit clearer
src/util/unicode.cpp
Outdated
@@ -267,3 +267,22 @@ std::wstring utf8_to_utf16le(const std::string& in) | |||
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t,0x10ffffUL,std::codecvt_mode::little_endian> > converter; | |||
return converter.from_bytes(in); | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to the code: the commit message should be cleaned, we don't really care about that conflict anymore.
src/util/unicode.cpp
Outdated
@@ -267,3 +267,22 @@ std::wstring utf8_to_utf16le(const std::string& in) | |||
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t,0x10ffffUL,std::codecvt_mode::little_endian> > converter; | |||
return converter.from_bytes(in); | |||
} | |||
|
|||
std::string utf16le_to_ascii(const std::wstring& in) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function missing a declaration in unicode.h?
src/util/unicode.cpp
Outdated
std::locale loc; | ||
for(const auto c : in) | ||
{ | ||
if(c <= 255 && isprint(c,loc)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No whitespace around `<=' but do add whitespace after ","
src/util/unicode.cpp
Outdated
{ | ||
result+="\\u"; | ||
char hex[5]; | ||
sprintf(hex,"%04x",(wchar_t)c); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing whitespace after ","
5bc14cc
to
af792fe
Compare
May I suggest the test given at the end of this article:
https://mathiasbynens.be/notes/javascript-unicode
HTH
|
@martin-cs The following test is handled correctly by the solver:
Note however that the current solver is not able to check the same program for codePoint other than 0 (correspondence between code point and the actual index can be complicated in general). |
Cool; just thought that it might be a good test to add to find awkward
issues with multi-byte UTF16.
|
src/util/unicode.cpp
Outdated
std::wstring utf8_to_utf16_little_endian(const std::string& in) | ||
{ | ||
const std::codecvt_mode mode=std::codecvt_mode::little_endian; | ||
const unsigned long maxcode=0x10ffff; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for making this easier to read. Would it be possible to add some reference on where you got that magic number from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment and a reference to the c++ documentation: ccbe6f7
@romainbrenguier Is the test you mentioned already part of some commit series? If it isn't it would be nice if it were added as a "FUTURE" test to this pull request (and then changed to "CORE" once all parts of the string solver are merged). |
The following test is already marked as future in CBMC: https://github.com/diffblue/cbmc/blob/master/regression/strings/java_code_point/test_code_point.java and it illustrates the same feature. But I can still add the "💩Iñtërnâtiônàlizætiøn☃" test, should it be part of this pull request or another one? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to see the commits squashed into a single one, but otherwise it looks good to me!
Hmm, please hold off on merging this one: it seems further fixes for the travis integration of Clang are necessary in order to support codecvt. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a commit changing .travis.yml to set COMPILER="clang++-3.7 -stdlib=libc++" and ensure that $COMPILER is quoted?
36e32b7
to
c7a101a
Compare
@tautschnig I squashed the commits and added the modification to .travis.yml you asked for c7a101a , however I'm not sure the compilation with clang++ works |
@@ -28,16 +28,16 @@ matrix: | |||
- clang-3.7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems possibly that we'll need an explicit
- libc++-dev
line in here.
.travis.yml
Outdated
- os: osx | ||
compiler: gcc | ||
env: COMPILER=g++ | ||
- os: osx | ||
compiler: clang | ||
env: COMPILER=clang++ | ||
env: COMPILER="clang++ -stdlib=libc++" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That change isn't necessary as libc++ is the default for Mac OS.
9bfb4ac
to
a073371
Compare
If anyone has a Ubuntu system with Clang (3.7, ideally) at hand and could help debug the issues this would be much appreciated. At present, we are blocked as merging this work would break our travis set up. |
@tautschnig I tried the failing command on ubuntu with clang++-3.7 and reproduced the error gotten by travis but after installing libstd++-dev the error disappeared, so I don't know what's happening... |
578925e
to
82a7398
Compare
@romainbrenguier Can you include in this pull request a travis config that matches the steps you took to make things work? It seems that installing libc++-dev is the wrong library (you stated libstdc++-dev works), and probably the compiler flags need to match as well. Thanks!!! |
dbb69ab
to
82a7398
Compare
@tautschnig sorry I meant libc++-dev , the travis file already mention the library I installed |
Using |
f666501
to
ef75c36
Compare
Did you also try Visual Studio? If C++ library support for the conversion isn't working yet, we might have to put our own implementation in. |
Added two functions for utf8 to utf16 conversion function depending on whether we use little or big endian. Added a function utf16_little_endian_to_ascii to display nicely java strings as an ascii sequence.
ef75c36
to
6071c44
Compare
Note: Vojtech has asked me to look into the outstanding problems here. AFAIK they are:
|
@smowton In case you need confirmation: yes, I believe you have summarised the problems nicely. |
Confirming Romain's findings:
Explanations for the comments above:
|
#541 is a probable fix (waiting for Travis + AppVeyor) |
Fixed one other VS problem-- it doesn't have snprintf until VS 2015. Replaced with iostreams equivalent in #541. |
67c80fc Merge pull request diffblue#394 from diffblue/jeannie/LinkedHashMapIterators 6667484 Tests that the iteration order is correct. 38301ea Tests other methods and constructor in LinkedHashMap f795b3d Models other methods in LinkedHashMap. 25773a6 Tests entrySet(), keySet() and values() in LinkedHashMap ee8cfad Models keySet(), entrySet() and values() in LinkedHashMap. 02c8271 Merge pull request diffblue#393 from diffblue/jeannie/UpdateReadMeForSpec 81460d3 Update readme.md to include new style for specs 9efcce3 Merge pull request diffblue#396 from diffblue/antonia/clean-up-for-TG-1081 c902c03 Merge pull request diffblue#397 from diffblue/jeannie/ForgotAppendObjectDocs 772a977 Merge pull request diffblue#390 from diffblue/antonia/enable-fixed-tests 5bef2ff Merge pull request diffblue#398 from diffblue/antonia/ticket-references-bugfix ed1dca2 Merge pull request diffblue#395 from diffblue/allredj/disable-tests-failing-on-tg2717 c07841d Add more tests for String.getBytes(Charset) d568e47 Fix array index bug in String.getBytesUTF_16 2218407 Model String.getBytes(Charset) 9c3a8bc Clarify difference of String.getBytes from JDK cf1c23b Merge the two active scenarios in String.spec dd5d2d9 Remove support_v1 tag from String specs 5f9a7e1 Split String tests into Level 0 and Level 1+ f2877b9 Enable Class test that was blocked by TG-1081 034f3e0 Remove reference to TG-1081 from File model 43afde7 Force static initialiser for File model d95ff9e Remove reference to fixed bug from Date model 0807806 Remove references to fixed bugs from Arrays model cde4085 Remove references to fixed bugs from HashMap model d85fe5b Update RaceTimes references to TG-1404 and TG-1523 0e925e5 Update ticket number in HashMap.spec 03a5186 Enable TG-1404 tests fa051dd Delete ArrayList CustomType test file b0e853b Enable HashMap test previously blocked by TG-1877 44bfe0a Merge pull request diffblue#392 from diffblue/lajw/TG-2389-enable-tests 595dd5d Changes CProver helped methods in HashMap to protected. c52771c Merge pull request diffblue#385 from diffblue/jeannie/UpdateTestRunner de0abdc Enable tests fixed by recent test-gen fixes b64357a Remove ticket numbers from resolved bugfix tests 53eca00 Documents StringBuilder and StringBuffer append(Object) baec23f Merge pull request diffblue#389 from diffblue/antonia/enable-TG-2666-test b2f0258 Enable LinkedList test that was blocked by TG-2666 57e79e5 Add knownbug tests for TG-2717 80fa433 Merge pull request diffblue#387 from diffblue/forejtv/unsupportedcharsetexception 990129c Merge pull request diffblue#391 from diffblue/allredj/disable-html-report 97f32f6 Don't write to the Html report 4cb5996 Merge pull request diffblue#382 from diffblue/antonia/address-ArrayList-todos 6749702 Merge pull request diffblue#386 from diffblue/antonia/gauge-telemetry-off 546dfdc Move legacy style tests into main Gauge step 5be886f Mark UnsupportedCharsetException as untested 1c17838 Add regression test for side effects 8c836ab Add tests for ArrayLists w. (non-default) capacity 34a141d Address bugfix TODOs in ArrayList 563b631 Correct bug description in comment 3e7603d Merge pull request diffblue#383 from diffblue/antonia/reformat-HashSet-tests 5d4e013 [TG-2751] Added UnsupportedCharsetException 11b28fb Reformat HashSet.spec f77c3c4 Rename HashSet_L2.spec to HashSet_L0.spec a16a1e7 Move all HashSet Maven tests into HashSet.spec f21a2da Merge pull request diffblue#376 from diffblue/jeannie/LinkedHashMap 820c5f7 Merge pull request diffblue#380 from diffblue/jeannie/AppendObject af65f4d Tests java.lang.StringBuffer append(Object) a554517 Reformat tests in java.util.StringBuilder dd6d3f6 Models append(Object) for StringBuilder and StringBuffer. 8920399 Tests toString() methods on existing models where possible 647f4fe Tests toString() methods on existing models where possible 978273b Documents and implements toString() methods in existing models. 78020ee Documents java.util.LinkedHashMap fb0cf92 Tests java.util.LinkedHashMap 7a9df4e Models java.util.LinkedHashMap 5a8af60 Marks all methods as notModelled() for java.util.LinkedHashMap 8d6b149 Initial commit for java.util.LinkedHashMap 34b7c54 Merge pull request diffblue#359 from diffblue/forejtv/throwable-no-static e2230de Cleanup of unused (mostly static) variables 255013e Merge pull request diffblue#384 from diffblue/jeannie/DisableBoundedGenericHashMap f374d5f Merge pull request diffblue#381 from diffblue/justin/TG2600-Correction ce6328b Turn off Gauge telemetry on Travis 6ab2864 Updates TestRunner.java to mimic platform parameters. 8ee75a4 Disables a HashMap test that depends on bounded generic type. 3677c3a [TG-2600] correct a mistake in the L1RemoveLast test 4511813 Merge pull request diffblue#378 from diffblue/antonia/LinkedList-first-model 6bda7ea Add tests for LinkedList b5d4cbe Model LinkedList methods specified in TG-2600 aa6e90a Empty models for new classes 3c25555 Copy LinkedList and related classes from jdk 5e4d410 Merge pull request diffblue#373 from diffblue/romain/tests/activate-arrays-hashset-test#TG-1404 a0d9289 Activate some Arrays test 27d06b4 Activate tests for Hashset fixed by TG-1404 224962a Merge pull request diffblue#371 from diffblue/romain/tests/activate-after-fix-1404 ed26b28 Activate level 2 tests fixed by TG-1404 b59c1e9 Relabel known-bug for Level2 HashMap test 5c0041b Activate tests for HashMap.values fixed by TG-1404 2777872 Activate tests for HashMap.entrySet 50d3f2c Activate tests for HashMap.keySet fixed by TG-1404 6146215 Activate ArrayList test fixed by TG-1404 bc09fd7 Merge pull request diffblue#365 from diffblue/jeannie/getTimeZone 29cf7e0 Merge pull request diffblue#372 from diffblue/forejtv/bump-up-gauge-v bdd500a Change Gauge Java Maven Plugin to 0.6.6 190aa18 Tests java.util.TimeZone 170a992 Documents java.util.TimeZone in javadocs 9e52664 Models java.util.TimeZone constructor, getID, setID and getTimeZone. eac578e Tests sun.util.calendar.ZoneInfo b4be728 Documents sun.util.calendar.ZoneInfo in javadocs e5ffa98 Models the sun.util.calendar.ZoneInfo constructor and getTimeZone() 6d704ad Marks methods as notModelled() for sun.util.calendar.ZoneInfo d249adb Marks methods as notModelled() for java.util.TimeZone 7f71d8d Initial commit for sun.util.calendar.ZoneInfo. b53d9c1 Initial commit for java.util.TimeZone 056aad2 Merge pull request diffblue#368 from diffblue/allredj/fix-hashtable-spec 8848414 Merge pull request diffblue#370 from diffblue/allredj/stringbuffer-spec-small-correction 1d947e5 Small correction to StringBuffer spec file 42f1a93 Fix typo in hashtable spec file git-subtree-dir: benchmarks/LIBRARIES/models git-subtree-split: 67c80fcdcf82418b5e7099ae63dec3360b153f90
…rect-test-and-mark-xfail [SEC-202] Correct test and mark xfail
Adds functions to
util/unicode.cpp
to convert from utf8 strings to utf16, both for little and big endian encodings and a function to pretty print utf16 characters in ascii.These conversions are useful for the string refinement because utf16 is used by java.
Note that codecvt does not come automatically with old version of gcc, so I saw compilation problems with gcc-4.8 (Ubuntu 16.04), but none with gcc-6.