Skip to content

utf8 to utf16 conversion and utf16 to ascii #372

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

romainbrenguier
Copy link
Contributor

Adds functions to util/unicode.cpp to convert from utf8 strings to utf16, both for little and big endian encodings and a function to pretty print utf16 characters in ascii.
These conversions are useful for the string refinement because utf16 is used by java.
Note that codecvt does not come automatically with old version of gcc, so I saw compilation problems with gcc-4.8 (Ubuntu 16.04), but none with gcc-6.

@romainbrenguier romainbrenguier mentioned this pull request Dec 21, 2016
@@ -256,8 +256,14 @@ const char **narrow_argv(int argc, const wchar_t **argv_wide)
return argv_narrow;
}

std::wstring utf8_to_utf16(const std::string& in)
std::wstring utf8_to_utf16be(const std::string& in)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be acceptable to spell out big_endian (and, below, little_endian). Don't make others decipher names.


std::wstring utf8_to_utf16le(const std::string& in)
{
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t,0x10ffffUL,std::codecvt_mode::little_endian> > converter;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some macro or named constant that can be used in place of the hex value? Else it would be nice to have some comment explaining why it's exactly this bit pattern.

Also, the line should be broken somewhere before the 80th (or even 70th) character.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I couldn't find such a constant, this is in fact the default for this template, hopefully this commit af792fe makes it a bit clearer

@@ -267,3 +267,22 @@ std::wstring utf8_to_utf16le(const std::string& in)
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t,0x10ffffUL,std::codecvt_mode::little_endian> > converter;
return converter.from_bytes(in);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to the code: the commit message should be cleaned, we don't really care about that conflict anymore.

@@ -267,3 +267,22 @@ std::wstring utf8_to_utf16le(const std::string& in)
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t,0x10ffffUL,std::codecvt_mode::little_endian> > converter;
return converter.from_bytes(in);
}

std::string utf16le_to_ascii(const std::wstring& in)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function missing a declaration in unicode.h?

std::locale loc;
for(const auto c : in)
{
if(c <= 255 && isprint(c,loc))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No whitespace around `<=' but do add whitespace after ","

{
result+="\\u";
char hex[5];
sprintf(hex,"%04x",(wchar_t)c);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing whitespace after ","

@romainbrenguier romainbrenguier force-pushed the string-refine-unicode branch 3 times, most recently from 5bc14cc to af792fe Compare December 22, 2016 11:00
@martin-cs
Copy link
Collaborator

martin-cs commented Dec 22, 2016 via email

@romainbrenguier
Copy link
Contributor Author

@martin-cs The following test is handled correctly by the solver:

class test
{
    public static void main(String[] args)
    {
	String s="💩Iñtërnâtiônàlizætiøn☃";
	System.out.println(s.length());
	System.out.println(s.codePointAt(0));
	assert(s.length() == 23);
	assert(s.codePointAt(0) == 128169);
    }
}

Note however that the current solver is not able to check the same program for codePoint other than 0 (correspondence between code point and the actual index can be complicated in general).

@martin-cs
Copy link
Collaborator

martin-cs commented Jan 9, 2017 via email

std::wstring utf8_to_utf16_little_endian(const std::string& in)
{
const std::codecvt_mode mode=std::codecvt_mode::little_endian;
const unsigned long maxcode=0x10ffff;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for making this easier to read. Would it be possible to add some reference on where you got that magic number from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment and a reference to the c++ documentation: ccbe6f7

@tautschnig
Copy link
Collaborator

@romainbrenguier Is the test you mentioned already part of some commit series? If it isn't it would be nice if it were added as a "FUTURE" test to this pull request (and then changed to "CORE" once all parts of the string solver are merged).

@romainbrenguier
Copy link
Contributor Author

The following test is already marked as future in CBMC: https://github.com/diffblue/cbmc/blob/master/regression/strings/java_code_point/test_code_point.java and it illustrates the same feature. But I can still add the "💩Iñtërnâtiônàlizætiøn☃" test, should it be part of this pull request or another one?

Copy link
Collaborator

@tautschnig tautschnig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to see the commits squashed into a single one, but otherwise it looks good to me!

@tautschnig
Copy link
Collaborator

Hmm, please hold off on merging this one: it seems further fixes for the travis integration of Clang are necessary in order to support codecvt.

Copy link
Collaborator

@tautschnig tautschnig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a commit changing .travis.yml to set COMPILER="clang++-3.7 -stdlib=libc++" and ensure that $COMPILER is quoted?

@romainbrenguier romainbrenguier force-pushed the string-refine-unicode branch 4 times, most recently from 36e32b7 to c7a101a Compare January 11, 2017 19:56
@romainbrenguier
Copy link
Contributor Author

@tautschnig I squashed the commits and added the modification to .travis.yml you asked for c7a101a , however I'm not sure the compilation with clang++ works

@@ -28,16 +28,16 @@ matrix:
- clang-3.7
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems possibly that we'll need an explicit
- libc++-dev
line in here.

.travis.yml Outdated
- os: osx
compiler: gcc
env: COMPILER=g++
- os: osx
compiler: clang
env: COMPILER=clang++
env: COMPILER="clang++ -stdlib=libc++"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That change isn't necessary as libc++ is the default for Mac OS.

@tautschnig
Copy link
Collaborator

If anyone has a Ubuntu system with Clang (3.7, ideally) at hand and could help debug the issues this would be much appreciated. At present, we are blocked as merging this work would break our travis set up.

@romainbrenguier
Copy link
Contributor Author

@tautschnig I tried the failing command on ubuntu with clang++-3.7 and reproduced the error gotten by travis but after installing libstd++-dev the error disappeared, so I don't know what's happening...

@tautschnig
Copy link
Collaborator

@romainbrenguier Can you include in this pull request a travis config that matches the steps you took to make things work? It seems that installing libc++-dev is the wrong library (you stated libstdc++-dev works), and probably the compiler flags need to match as well. Thanks!!!

@romainbrenguier
Copy link
Contributor Author

@tautschnig sorry I meant libc++-dev , the travis file already mention the library I installed

@romainbrenguier
Copy link
Contributor Author

Using -stdlib=libstdc++ instead of -stdlib=libc++ makes the problem with cctype disappear put now there is a problem with codecvt

@kroening
Copy link
Member

Did you also try Visual Studio?

If C++ library support for the conversion isn't working yet, we might have to put our own implementation in.

smowton and others added 5 commits February 2, 2017 17:04
Added two functions for utf8 to utf16 conversion function depending on
whether we use little or big endian.
Added a function utf16_little_endian_to_ascii to display nicely java
strings as an ascii sequence.
@smowton
Copy link
Contributor

smowton commented Feb 10, 2017

Note: Vojtech has asked me to look into the outstanding problems here. AFAIK they are:

  1. figure out how and why this can be made to work with Clang (and specifically libc++, or libstd++ too?)
  2. Similarly the VS versions we target

@tautschnig
Copy link
Collaborator

@smowton In case you need confirmation: yes, I believe you have summarised the problems nicely.

@smowton
Copy link
Contributor

smowton commented Feb 13, 2017

Confirming Romain's findings:

  • G++ 5: success
  • Clang++-3.7 / libstdc++-4.9: fails (codecvt not implemented)
  • Clang++-3.7 / libstdc++-5: success. Thus the install for libstdc++-5-dev is necessary.
  • Clang++-3.7 / libc++: success on my local machine, but not easy to do under Travis (there is no libc++ package for Ubuntu 12.04, which Travis runs in its VMs per default. We could switch to 14.04 per https://docs.travis-ci.com/user/ci-environment/ if we wanted to CI this case). Looking at the SVN history for libc++ / codecvt this seems likely to work back to around 2013.
  • Visual Studio: success according to AppVeyor, and according to MSDN this should also work back to VS 2010 (according to the COMPILING doc we only support back to 2013 anyhow)

Explanations for the comments above:

  • Why does installing libstdc++-5-dev make a difference? Because Ubuntu 12.04 defaults to GCC and libstdc++ 4.9, which don't support codecvt. In the G++ Travis config we install g++-5 which pulls libstdc++-5; in the Clang case we must explicitly pull it in.
  • Why does libc++ fail with a cctype error? The attempt to install libc++ isn't working at all (the package doesn't exist), and that's simply the first stdlib include in the build.

@smowton
Copy link
Contributor

smowton commented Feb 13, 2017

#541 is a probable fix (waiting for Travis + AppVeyor)

@smowton
Copy link
Contributor

smowton commented Feb 13, 2017

Fixed one other VS problem-- it doesn't have snprintf until VS 2015. Replaced with iostreams equivalent in #541.

@forejtv forejtv closed this Feb 13, 2017
smowton pushed a commit to smowton/cbmc that referenced this pull request May 9, 2018
67c80fc Merge pull request diffblue#394 from diffblue/jeannie/LinkedHashMapIterators
6667484 Tests that the iteration order is correct.
38301ea Tests other methods and constructor in LinkedHashMap
f795b3d Models other methods in LinkedHashMap.
25773a6 Tests entrySet(), keySet() and values() in LinkedHashMap
ee8cfad Models keySet(), entrySet() and values() in LinkedHashMap.
02c8271 Merge pull request diffblue#393 from diffblue/jeannie/UpdateReadMeForSpec
81460d3 Update readme.md to include new style for specs
9efcce3 Merge pull request diffblue#396 from diffblue/antonia/clean-up-for-TG-1081
c902c03 Merge pull request diffblue#397 from diffblue/jeannie/ForgotAppendObjectDocs
772a977 Merge pull request diffblue#390 from diffblue/antonia/enable-fixed-tests
5bef2ff Merge pull request diffblue#398 from diffblue/antonia/ticket-references-bugfix
ed1dca2 Merge pull request diffblue#395 from diffblue/allredj/disable-tests-failing-on-tg2717
c07841d Add more tests for String.getBytes(Charset)
d568e47 Fix array index bug in String.getBytesUTF_16
2218407 Model String.getBytes(Charset)
9c3a8bc Clarify difference of String.getBytes from JDK
cf1c23b Merge the two active scenarios in String.spec
dd5d2d9 Remove support_v1 tag from String specs
5f9a7e1 Split String tests into Level 0 and Level 1+
f2877b9 Enable Class test that was blocked by TG-1081
034f3e0 Remove reference to TG-1081 from File model
43afde7 Force static initialiser for File model
d95ff9e Remove reference to fixed bug from Date model
0807806 Remove references to fixed bugs from Arrays model
cde4085 Remove references to fixed bugs from HashMap model
d85fe5b Update RaceTimes references to TG-1404 and TG-1523
0e925e5 Update ticket number in HashMap.spec
03a5186 Enable TG-1404 tests
fa051dd Delete ArrayList CustomType test file
b0e853b Enable HashMap test previously blocked by TG-1877
44bfe0a Merge pull request diffblue#392 from diffblue/lajw/TG-2389-enable-tests
595dd5d Changes CProver helped methods in HashMap to protected.
c52771c Merge pull request diffblue#385 from diffblue/jeannie/UpdateTestRunner
de0abdc Enable tests fixed by recent test-gen fixes
b64357a Remove ticket numbers from resolved bugfix tests
53eca00 Documents StringBuilder and StringBuffer append(Object)
baec23f Merge pull request diffblue#389 from diffblue/antonia/enable-TG-2666-test
b2f0258 Enable LinkedList test that was blocked by TG-2666
57e79e5 Add knownbug tests for TG-2717
80fa433 Merge pull request diffblue#387 from diffblue/forejtv/unsupportedcharsetexception
990129c Merge pull request diffblue#391 from diffblue/allredj/disable-html-report
97f32f6 Don't write to the Html report
4cb5996 Merge pull request diffblue#382 from diffblue/antonia/address-ArrayList-todos
6749702 Merge pull request diffblue#386 from diffblue/antonia/gauge-telemetry-off
546dfdc Move legacy style tests into main Gauge step
5be886f Mark UnsupportedCharsetException as untested
1c17838 Add regression test for side effects
8c836ab Add tests for ArrayLists w. (non-default) capacity
34a141d Address bugfix TODOs in ArrayList
563b631 Correct bug description in comment
3e7603d Merge pull request diffblue#383 from diffblue/antonia/reformat-HashSet-tests
5d4e013 [TG-2751] Added UnsupportedCharsetException
11b28fb Reformat HashSet.spec
f77c3c4 Rename HashSet_L2.spec to HashSet_L0.spec
a16a1e7 Move all HashSet Maven tests into HashSet.spec
f21a2da Merge pull request diffblue#376 from diffblue/jeannie/LinkedHashMap
820c5f7 Merge pull request diffblue#380 from diffblue/jeannie/AppendObject
af65f4d Tests java.lang.StringBuffer append(Object)
a554517 Reformat tests in java.util.StringBuilder
dd6d3f6 Models append(Object) for StringBuilder and StringBuffer.
8920399 Tests toString() methods on existing models where possible
647f4fe Tests toString() methods on existing models where possible
978273b Documents and implements toString() methods in existing models.
78020ee Documents java.util.LinkedHashMap
fb0cf92 Tests java.util.LinkedHashMap
7a9df4e Models java.util.LinkedHashMap
5a8af60 Marks all methods as notModelled() for java.util.LinkedHashMap
8d6b149 Initial commit for java.util.LinkedHashMap
34b7c54 Merge pull request diffblue#359 from diffblue/forejtv/throwable-no-static
e2230de Cleanup of unused (mostly static) variables
255013e Merge pull request diffblue#384 from diffblue/jeannie/DisableBoundedGenericHashMap
f374d5f Merge pull request diffblue#381 from diffblue/justin/TG2600-Correction
ce6328b Turn off Gauge telemetry on Travis
6ab2864 Updates TestRunner.java to mimic platform parameters.
8ee75a4 Disables a HashMap test that depends on bounded generic type.
3677c3a [TG-2600] correct a mistake in the L1RemoveLast test
4511813 Merge pull request diffblue#378 from diffblue/antonia/LinkedList-first-model
6bda7ea Add tests for LinkedList
b5d4cbe Model LinkedList methods specified in TG-2600
aa6e90a Empty models for new classes
3c25555 Copy LinkedList and related classes from jdk
5e4d410 Merge pull request diffblue#373 from diffblue/romain/tests/activate-arrays-hashset-test#TG-1404
a0d9289 Activate some Arrays test
27d06b4 Activate tests for Hashset fixed by TG-1404
224962a Merge pull request diffblue#371 from diffblue/romain/tests/activate-after-fix-1404
ed26b28 Activate level 2 tests fixed by TG-1404
b59c1e9 Relabel known-bug for Level2 HashMap test
5c0041b Activate tests for HashMap.values fixed by TG-1404
2777872 Activate tests for HashMap.entrySet
50d3f2c Activate tests for HashMap.keySet fixed by TG-1404
6146215 Activate ArrayList test fixed by TG-1404
bc09fd7 Merge pull request diffblue#365 from diffblue/jeannie/getTimeZone
29cf7e0 Merge pull request diffblue#372 from diffblue/forejtv/bump-up-gauge-v
bdd500a Change Gauge Java Maven Plugin to 0.6.6
190aa18 Tests java.util.TimeZone
170a992 Documents java.util.TimeZone in javadocs
9e52664 Models java.util.TimeZone constructor, getID, setID and getTimeZone.
eac578e Tests sun.util.calendar.ZoneInfo
b4be728 Documents sun.util.calendar.ZoneInfo in javadocs
e5ffa98 Models the sun.util.calendar.ZoneInfo constructor and getTimeZone()
6d704ad Marks methods as notModelled() for sun.util.calendar.ZoneInfo
d249adb Marks methods as notModelled() for java.util.TimeZone
7f71d8d Initial commit for sun.util.calendar.ZoneInfo.
b53d9c1 Initial commit for java.util.TimeZone
056aad2 Merge pull request diffblue#368 from diffblue/allredj/fix-hashtable-spec
8848414 Merge pull request diffblue#370 from diffblue/allredj/stringbuffer-spec-small-correction
1d947e5 Small correction to StringBuffer spec file
42f1a93 Fix typo in hashtable spec file

git-subtree-dir: benchmarks/LIBRARIES/models
git-subtree-split: 67c80fcdcf82418b5e7099ae63dec3360b153f90
smowton pushed a commit to smowton/cbmc that referenced this pull request May 9, 2018
…rect-test-and-mark-xfail

[SEC-202] Correct test and mark xfail
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants