android - Parsing ASCII characters with Erlang


I am confused about what parsing needs to be done, and at which end (client or server).

When I send an umlaut 'Ö' to ejabberd, it is received by ejabberd as the two bytes <<195,150>>.

Following this, I send it to the client as a push notification (via GCM/APNS, silently). There, the client builds the text by UTF-8 decoding each numeral one by one (this is wrong).

i.e. 195 is first decoded to a gibberish character � , and so on.

This reconstruction needs to identify whether 2 bytes should be taken together, or 3 or more. That varies with the language of the letters (German here, e.g.).

How would the client identify which language it is going to reconstruct (the number of bytes to decode in one go)?

To add more,

lists:flatten(mochijson2:encode({struct,[{registration_ids,[Reg_id]},{data,[{message,Message},{type,Type},{enum,Enum},{groupid,Groupid},{groupname,Groupname},{sender,Sender_list},{receiver,Content_list}]},{time_to_live,2419200}]})).

produces JSON as:

"{\"registration_ids\":[\"apa91bgljnkhqzlqfep7mto9p1vu9s92_a0uizluhnhl4xdftaz_0hpd5sisb4jnrpi2d7_c8d_mbhut_k-t2bo_i_g3jt1kiqbgqkrfwb3gp1jegatromsfg4gajsekclzffijeeyow\"],\"data\":{\"message\":[104,105],\"type\":[71,82,79,85,80],\"enum\":2001,\"groupid\":[71,73,68],\"groupname\":[71,114,111,117,112,78,97,109,101],\"sender\":[49,64,100,101,118,108,97,98,47,115,100,115],\"receiver\":[97,115,97,115]},\"time_to_live\":2419200}" 

where I had given "hi" as the message, and mochijson gave me the ASCII values [104,105].

The groupname field was given the value "GroupName", and the ASCII values are correct after JSON creation, i.e. 71,114,111,117,112,78,97,109,101

However, when I use http://www.unit-conversion.info/texttools/ascii/

it decodes to Ǎo��me, not "GroupName".

So, who should do the parsing? How should this be handled?

My reconstructed message is gibberish when the ASCII values are reassembled.

Thanks

The things to worry about here are manifold, and have to do with both the encoding desired and the data structure. In Erlang, text is handled in one of the following ways:

  1. Lists of bytes ([0..255, ...])
    • This is what you get if you listen on a socket and the data is returned as a list.
    • The VM assumes no encoding. They're bytes, and mean little more.
    • The VM can interpret these as strings (say in io:format("~s~n", [List])). When this happens (with the ~s flag specifically), the VM assumes the encoding is latin-1 (ISO-8859-1).
  2. Lists of Unicode codepoints ([0..1114111, ...]).
    • You may get these from files that are read as Unicode and as a list.
    • You can use them in output when you have a formatter such as io:format("~ts~n", [List]), where ~ts is like ~s but for Unicode.
    • Those lists represent codepoints as you see them in the Unicode standard, without any encoding (they are not UTF-x).
    • This can work in conjunction with latin-1 lists of characters, because Unicode codepoints and latin-1 characters have the same sequence numbers below 255.
  3. Binaries (<<0..255, ...>>)
    • This is what you get if you listen or read to/from something under a binary format.
    • The VM can be told to assume many things:
      1. They are sequences of bytes (0..255) without specific meaning (<<Bin/binary>>)
      2. They are UTF-8 encoded sequences (<<Bin/utf8>>)
      3. They are UTF-16 encoded sequences (<<Bin/utf16>>)
      4. They are UTF-32 encoded sequences (<<Bin/utf32>>)
    • io:format("~s~n", [Bin]) will still assume the sequence is a latin-1 sequence; io:format("~ts~n", [Bin]) will assume UTF-8 only.
  4. A mixed list of both Unicode lists and UTF-encoded binaries (known as iodata()), used exclusively for output.
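To make the difference between these forms concrete, here is a minimal sketch that builds the same character, 'Ö' (U+00D6), in each representation and formats it with both ~s and ~ts. The module name text_forms and the map keys are made up for illustration; io_lib:format/2 is used in place of io:format/2 only so that the result can be inspected as a list.

```erlang
-module(text_forms).
-export([forms/0]).

%% Returns 'Ö' (U+00D6) in several of the representations listed above.
%% The module name and map keys are hypothetical, chosen for this example.
forms() ->
    Codepoints = [214],                                  % list of Unicode codepoints
    Utf8Bin = unicode:characters_to_binary(Codepoints),  % UTF-8 encoded binary
    Utf8Bytes = binary_to_list(Utf8Bin),                 % list of bytes
    %% ~s treats every byte of the binary as one latin-1 character.
    Latin1View = lists:flatten(io_lib:format("~s", [Utf8Bin])),
    %% ~ts decodes the binary as UTF-8 before formatting it.
    Utf8View = lists:flatten(io_lib:format("~ts", [Utf8Bin])),
    #{codepoints => Codepoints,
      utf8_binary => Utf8Bin,
      utf8_bytes => Utf8Bytes,
      latin1_view => Latin1View,
      utf8_view => Utf8View}.
```

The latin1_view entry holds two characters and the utf8_view entry holds one, which is the same mismatch the client runs into when it decodes 195 and 150 separately.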

So, in gist:

  • lists of bytes
  • lists of latin-1 characters
  • lists of unicode codepoints
  • binary of bytes
  • utf-8 binary
  • utf-16 binary
  • utf-32 binary
  • lists of many of these, concatenated for output

Also note: until version 17.0, Erlang source files were latin-1 only. 17.0 added an option to have the compiler read your source file as Unicode by adding this header:

%% -*- coding: utf-8 -*- 

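The next factor is that JSON, by specification, assumes UTF-8 as the encoding for everything it has. Furthermore, JSON libraries in Erlang tend to assume that a binary is a string, and that lists are JSON arrays.

This means that if you want your output to be adequate, you must use UTF-8 encoded binaries to represent any JSON string. That is why the "hi" message in the question, passed as a list, came out as the array [104,105].

If what you have is: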

  • A list of bytes that represents a UTF-8 encoded string: list_to_binary(List) gives the proper binary representation
  • A list of codepoints: use unicode:characters_to_binary(List, unicode, utf8) to get a UTF-8 encoded binary
  • A binary representing a latin-1 string: unicode:characters_to_binary(Bin, latin1, utf8)
  • A binary in any other UTF encoding: unicode:characters_to_binary(Bin, utf16 | utf32, utf8)
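The four cases above can be sketched as one small module. The module and function names (to_utf8, from_utf8_bytes and so on) are made up for illustration; only list_to_binary/1 and unicode:characters_to_binary/3 come from Erlang itself. Note that utf16 on its own means big-endian UTF-16.

```erlang
-module(to_utf8).
-export([from_utf8_bytes/1, from_codepoints/1, from_latin1/1, from_utf16/1]).

%% All function names here are hypothetical; each one returns a UTF-8 binary.

%% A list of bytes that already holds UTF-8 encoded text: just pack it.
from_utf8_bytes(Bytes) ->
    list_to_binary(Bytes).

%% A list of Unicode codepoints: encode each codepoint as UTF-8.
from_codepoints(Codepoints) ->
    unicode:characters_to_binary(Codepoints, unicode, utf8).

%% A latin-1 binary: re-encode every byte as the matching codepoint in UTF-8.
from_latin1(Bin) ->
    unicode:characters_to_binary(Bin, latin1, utf8).

%% A UTF-16 (big-endian) binary: transcode to UTF-8.
from_utf16(Bin) ->
    unicode:characters_to_binary(Bin, utf16, utf8).
```

For 'Ö', all four inputs ([195,150], [214], <<214>> and <<0,214>> respectively) should end up as the same binary, <<195,150>>.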

Take that UTF-8 binary, and send it to the JSON library. If the JSON library is correct and the client parses it properly, then it should come out right.
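Applied to the push payload from the question, this could look roughly like the sketch below. It assumes the text fields arrive as lists of codepoints (or as UTF-8 binaries), trims the payload to a few fields, and uses made-up names (push_payload, payload/3, decode_text/1). The call to mochijson2:encode/1 is left in a comment because it is an external dependency.

```erlang
-module(push_payload).
-export([payload/3, decode_text/1]).

%% Hypothetical helper: accepts a list of codepoints or a UTF-8 binary.
%% Assumption: the input is NOT a list of raw UTF-8 bytes; that case
%% would need list_to_binary/1 instead, or it gets encoded twice.
to_utf8(Text) ->
    unicode:characters_to_binary(Text, unicode, utf8).

%% Builds the term to hand to the JSON library, with every text field
%% as a binary so that it is treated as a string, not an array of integers.
%% Intended use: mochijson2:encode(payload(RegId, Message, GroupName)).
payload(RegId, Message, GroupName) ->
    {struct, [{registration_ids, [to_utf8(RegId)]},
              {data, [{message, to_utf8(Message)},
                      {groupname, to_utf8(GroupName)}]},
              {time_to_live, 2419200}]}.

%% What the receiving side has to do, shown in Erlang for symmetry:
%% decode the whole byte sequence as UTF-8, never one byte at a time.
decode_text(Utf8Bin) ->
    unicode:characters_to_list(Utf8Bin, utf8).
```

The client never has to know the language of the text: a UTF-8 decoder works out how many bytes belong to each character from the bytes themselves, as long as it is given the complete sequence.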

