android - Parsing ASCII characters with Erlang
I am confused about what parsing needs to be done, and at which end (client or server).
When I send an umlaut 'Ö' to ejabberd, it is received by ejabberd as <<"195, 150">>. Following this, I send it to the client as a push notification (via GCM/APNS, silently). There, the client builds the UTF-8 decoding on each numeral one by one (this is wrong).
I.e. 195 is first decoded to a gibberish character �, and so on. This reconstruction needs to identify whether 2 bytes are to be taken together, or 3, or more. This varies with the language of the letters (German here, e.g.).
How would the client identify which language it is going to reconstruct (the number of bytes to decode in one go)?
To add more,
lists:flatten(mochijson2:encode({struct,[{registration_ids,[Reg_id]},{data,[{message,Message},{type,Type},{enum,Enum},{groupid,Groupid},{groupname,Groupname},{sender,Sender_list},{receiver,Content_list}]},{time_to_live,2419200}]})).
produced JSON as:
"{\"registration_ids\":[\"apa91bgljnkhqzlqfep7mto9p1vu9s92_a0uizluhnhl4xdftaz_0hpd5sisb4jnrpi2d7_c8d_mbhut_k-t2bo_i_g3jt1kiqbgqkrfwb3gp1jegatromsfg4gajsekclzffijeeyow\"],\"data\":{\"message\":[104,105],\"type\":[71,82,79,85,80],\"enum\":2001,\"groupid\":[71,73,68],\"groupname\":[71,114,111,117,112,78,97,109,101],\"sender\":[49,64,100,101,118,108,97,98,47,115,100,115],\"receiver\":[97,115,97,115]},\"time_to_live\":2419200}"
where I had given "hi" as the message, and mochijson gave me the ASCII values [104,105].
The groupname field was given the value "GroupName", and the ASCII values are correct after JSON creation, i.e. 71,114,111,117,112,78,97,109,101. However, when I use http://www.unit-conversion.info/texttools/ascii/
it decodes to Ǎo��me, and not "GroupName". So, who should do the parsing? How should this be handled?
My reconstructed message is all gibberish when the ASCII values are reconstructed.
Thanks
The things to worry about here are manifold, and they have to do with both the desired encoding and the data structure. In Erlang, text is handled in one of the following ways:
- Lists of bytes ([0..255, ...])
  - This is what you get if you listen to a socket and the data is returned as a list.
  - The VM assumes no encoding. They're bytes, and mean little more.
  - The VM can interpret these as strings (say in io:format("~s~n", [List])). When this happens (with the ~s flag specifically), the VM assumes the encoding is latin-1 (ISO-8859-1).
- Lists of Unicode codepoints ([0..1114111, ...])
  - You may get these from files that are read as Unicode and as a list.
  - You can use them in output when you have a formatter such as io:format("~ts~n", [List]), where ~ts is like ~s but for Unicode.
  - Those lists represent codepoints as you see them in the Unicode standard, without any encoding (they are not UTF-x).
  - This can work in conjunction with latin-1 lists of characters because Unicode codepoints and latin-1 characters have the same sequence numbers up to 255.
- Binaries (<<0..255, ...>>)
  - This is what you get if you listen or read to/from anything under a binary format.
  - The VM can be told to assume many things:
    - they are sequences of bytes (0..255) without specific meaning (<<Bin/binary>>)
    - they are UTF-8 encoded sequences (<<Bin/utf-8>>)
    - they are UTF-16 encoded sequences (<<Bin/utf-16>>)
    - they are UTF-32 encoded sequences (<<Bin/utf-32>>)
  - io:format("~s~n", [Bin]) will still assume the sequence is a latin-1 sequence; io:format("~ts~n", [Bin]) will assume UTF-8 only.
- A mixed list of both Unicode lists and UTF-encoded binaries (known as iodata()), used exclusively for output.
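As a minimal sketch of the forms listed above, the following hypothetical module (the name text_forms and the function demo/0 are illustrative, not from the original answer) builds the same character, 'Ö' (U+00D6, codepoint 214), as a codepoint list, a UTF-8 binary, a plain byte list, and a latin-1 binary. It uses only unicode:characters_to_binary/3 and binary_to_list/1 from the standard library.

```erlang
-module(text_forms).
-export([demo/0]).

%% Returns one character, O-umlaut (U+00D6), in several of the
%% representations Erlang may hand you.
demo() ->
    %% List of Unicode codepoints. 214 is below 256, so the same
    %% list is also a valid latin-1 character list.
    Codepoints = [214],
    %% UTF-8 encoded binary: two bytes for this character.
    Utf8Bin = unicode:characters_to_binary(Codepoints, unicode, utf8),
    %% List of bytes: the same two bytes with no encoding attached.
    Utf8Bytes = binary_to_list(Utf8Bin),
    %% Latin-1 binary: a single byte.
    Latin1Bin = unicode:characters_to_binary(Codepoints, unicode, latin1),
    {Codepoints, Utf8Bin, Utf8Bytes, Latin1Bin}.
```

Calling text_forms:demo() gives {[214], <<195,150>>, [195,150], <<214>>}. The pair 195, 150 is exactly what the question observed on the ejabberd side: the UTF-8 bytes of a single character, not two separate characters.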
So, in a gist:
- lists of bytes
- lists of latin-1 characters
- lists of unicode codepoints
- binary of bytes
- utf-8 binary
- utf-16 binary
- utf-32 binary
- lists of many of these, concatenated for output
Also note: before R16, Erlang source files were latin-1 only. R16 added the option to have the compiler read your source file as UTF-8 by adding this header:
%% -*- coding: utf-8 -*-
Since 17.0, UTF-8 is the default source encoding.
The next factor is that JSON, by specification, assumes UTF-8 as the encoding of everything it holds. Furthermore, JSON libraries in Erlang tend to assume that a binary is a string, and that lists are JSON arrays.
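This means that if you want your output to be adequate, you must use UTF-8 encoded binaries to represent any JSON string. This is why the strings in the question, passed as lists, came out as arrays of numbers such as [104,105].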
If what you have is:
- a list of bytes that represents a UTF-encoded string, then use list_to_binary(List) to get the proper binary representation
- a list of codepoints, then use unicode:characters_to_binary(List, unicode, utf8) to get a UTF-8 encoded binary
- a binary representing a latin-1 string: unicode:characters_to_binary(Bin, latin1, utf8)
- a binary in any other UTF encoding: unicode:characters_to_binary(Bin, utf16 | utf32, utf8)
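The conversions above can be sketched as one small module. The module name to_utf8 and its function names are hypothetical; the calls inside are the standard list_to_binary/1 and unicode:characters_to_binary/3. Note that unicode:characters_to_binary/3 returns an {error, ...} or {incomplete, ...} tuple on bad input, which this sketch does not handle.

```erlang
-module(to_utf8).
-export([from_bytes/1, from_codepoints/1, from_latin1/1, from_utf16/1]).

%% A list of bytes that already holds UTF-8 data: just pack it.
from_bytes(Bytes) ->
    list_to_binary(Bytes).

%% A list of Unicode codepoints: encode it as UTF-8.
from_codepoints(Codepoints) ->
    unicode:characters_to_binary(Codepoints, unicode, utf8).

%% A latin-1 binary: re-encode it as UTF-8.
from_latin1(Bin) ->
    unicode:characters_to_binary(Bin, latin1, utf8).

%% A UTF-16 binary (the bare atom utf16 means big-endian):
%% re-encode it as UTF-8.
from_utf16(Bin) ->
    unicode:characters_to_binary(Bin, utf16, utf8).
```

For the 'Ö' from the question, to_utf8:from_bytes([195,150]), to_utf8:from_codepoints([214]), to_utf8:from_latin1(<<214>>) and to_utf8:from_utf16(<<0,214>>) all return <<195,150>>, which is the binary to hand to the JSON library.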
Take that UTF-8 binary, and send it to the JSON library. If the JSON library is correct and the client parses it properly, then it should be right.
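On the decoding side, the receiver does not need to know the language to decide how many bytes belong together: in UTF-8 the leading byte of each character encodes the length of its sequence, so a UTF-8 decoder handles this on its own. The sketch below shows this in Erlang for illustration only (the question's client is an Android app, which would use its platform's UTF-8 decoder instead); the module name utf8_decode and its functions are hypothetical.

```erlang
-module(utf8_decode).
-export([codepoints/1, first_char/1]).

%% Decode a whole UTF-8 binary into a list of Unicode codepoints,
%% instead of treating each byte as its own character.
codepoints(Utf8Bin) ->
    unicode:characters_to_list(Utf8Bin, utf8).

%% Take one character off the front of a UTF-8 binary. The utf8
%% segment type consumes 1 to 4 bytes, as the leading byte dictates.
first_char(<<Char/utf8, Rest/binary>>) ->
    {Char, Rest}.
```

Here utf8_decode:codepoints(<<104,105,195,150>>) returns [104,105,214], and utf8_decode:first_char(<<195,150,104>>) returns {214, <<104>>}: the two bytes 195 and 150 are consumed together as the single codepoint 214.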