java - JavaCC and Unicode issue: why can't \u696d be handled in JavaCC although it belongs to the range "\u4e00"-"\u9fff"?


We're trying to use a JavaCC parser to parse source code in UTF-8 (the language is Japanese). In the JavaCC grammar, we have a declaration like:

    < #letter:
      [
        "\u0024",
        "\u0041"-"\u005a",
        "\u005f",
        "\u0061"-"\u007a",
        "\u00c0"-"\u00d6",
        "\u00d8"-"\u00f6",
        "\u00f8"-"\u00ff",
        "\u0100"-"\u1fff",
        "\u3040"-"\u318f",
        "\u3300"-"\u337f",
        "\u3400"-"\u3d2d",
        "\u4e00"-"\u9fff",
        "\uf900"-"\ufaff"
      ]
    >

If the parser meets the string "日建フェンス工業", it fails because of the 業 character. If we remove that character, it works as expected. The code of the 業 character is "\u696d", which, as you can see in the declaration, should belong to the range "\u4e00"-"\u9fff".

Any suggestions on this?

PS: if we rewrote the grammar using ANTLR, what would it look like?

Thanks very much.

There is nothing wrong with the token fragment, and nothing wrong with JavaCC. The problem lies elsewhere.
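The first half of that claim can be cross-checked in plain Java, without JavaCC at all. The following is a minimal sketch: the class name LetterRangeCheck and its helper methods are made up for illustration, and the ranges are simply copied from the #letter declaration in the question.

```java
class LetterRangeCheck {
    // Inclusive ranges copied from the #letter declaration in the question.
    static final int[][] RANGES = {
        {0x0024, 0x0024}, {0x0041, 0x005a}, {0x005f, 0x005f}, {0x0061, 0x007a},
        {0x00c0, 0x00d6}, {0x00d8, 0x00f6}, {0x00f8, 0x00ff}, {0x0100, 0x1fff},
        {0x3040, 0x318f}, {0x3300, 0x337f}, {0x3400, 0x3d2d}, {0x4e00, 0x9fff},
        {0xf900, 0xfaff}
    };

    // True if the character falls inside any of the ranges above.
    static boolean isLetter(char c) {
        for (int[] range : RANGES) {
            if (c >= range[0] && c <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // True if every character of the string is a letter by the same definition.
    static boolean allLetters(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isLetter(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The string from the question, written with escapes so that the
        // encoding of this source file cannot influence the result.
        String s = "\u65e5\u5efa\u30d5\u30a7\u30f3\u30b9\u5de5\u696d";
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            System.out.println(String.format("U+%04X", (int) c) + " letter? " + isLetter(c));
        }
    }
}
```

Every character of the string, including U+696D, falls inside one of the declared ranges, so the declaration itself is not what rejects 業.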

Here is a JavaCC specification made by copying and pasting the problem code into a JavaCC grammar.

    options {
      STATIC = true;
      DEBUG_TOKEN_MANAGER = true;
    }

    PARSER_BEGIN(mynewgrammar)
    package funnyunicode;
    import java.io.StringReader;

    public class mynewgrammar {
      public static void main(String args[]) throws ParseException {
        // With STATIC = true, the constructor initializes the static parser state.
        mynewgrammar parser = new mynewgrammar(new StringReader("日建フェンス工業"));
        mynewgrammar.go();
        System.out.println("ok.");
      }
    }
    PARSER_END(mynewgrammar)

    TOKEN : {
      < word : (<letter>)+ >
    | < #letter:
        [
          "\u0024",
          "\u0041"-"\u005a",
          "\u005f",
          "\u0061"-"\u007a",
          "\u00c0"-"\u00d6",
          "\u00d8"-"\u00f6",
          "\u00f8"-"\u00ff",
          "\u0100"-"\u1fff",
          "\u3040"-"\u318f",
          "\u3300"-"\u337f",
          "\u3400"-"\u3d2d",
          "\u4e00"-"\u9fff",
          "\uf900"-"\ufaff"
        ]
      >
    }

    // Accepts exactly one word token followed by end of input.
    void go() : { Token tk; }
    {
      tk=<word> <EOF>
    }

And here is the output of the resulting Java program:

    Current character : \u65e5 (26085) at line 1 column 1
       Starting NFA to match one of : { <word> }
    Current character : \u65e5 (26085) at line 1 column 1
       Currently matched the first 1 characters as a <word> token.
       Possible kinds of longer matches : { <word> }
    Current character : \u5efa (24314) at line 1 column 2
       Currently matched the first 2 characters as a <word> token.
       Possible kinds of longer matches : { <word> }
    Current character : \u30d5 (12501) at line 1 column 3
       Currently matched the first 3 characters as a <word> token.
       Possible kinds of longer matches : { <word> }
    Current character : \u30a7 (12455) at line 1 column 4
       Currently matched the first 4 characters as a <word> token.
       Possible kinds of longer matches : { <word> }
    Current character : \u30f3 (12531) at line 1 column 5
       Currently matched the first 5 characters as a <word> token.
       Possible kinds of longer matches : { <word> }
    Current character : \u30b9 (12473) at line 1 column 6
       Currently matched the first 6 characters as a <word> token.
       Possible kinds of longer matches : { <word> }
    Current character : \u5de5 (24037) at line 1 column 7
       Currently matched the first 7 characters as a <word> token.
       Possible kinds of longer matches : { <word> }
    Current character : \u696d (26989) at line 1 column 8
       Currently matched the first 8 characters as a <word> token.
       Possible kinds of longer matches : { <word> }
    ****** FOUND A <word> MATCH (\u65e5\u5efa\u30d5\u30a7\u30f3\u30b9\u5de5\u696d) ******

    Returning the <EOF> token.
    ok.

As you can see, the generated tokenizer has no trouble seeing \u696d as a letter.
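One place worth checking, offered only as an assumption since the question does not show how the input reaches the parser, is the charset used to decode the source file. The sketch below uses a hypothetical class DecodeCheck with a hypothetical decode helper. It decodes the UTF-8 bytes of 業 once as UTF-8 and once as ISO-8859-1, to show how a wrong charset turns one valid letter into three unrelated characters before the tokenizer ever sees it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeCheck {
    // Decodes the given bytes through a Reader that uses an explicit charset,
    // the same kind of Reader that could be handed to a generated parser.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Renders each char of the string as U+XXXX for easy comparison.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 bytes of the character that fails in the question.
        byte[] utf8 = "\u696d".getBytes(StandardCharsets.UTF_8);

        System.out.println("decoded as UTF-8:      " + codePoints(decode(utf8, StandardCharsets.UTF_8)));
        System.out.println("decoded as ISO-8859-1: " + codePoints(decode(utf8, StandardCharsets.ISO_8859_1)));
    }
}
```

If the real input comes from a file, wrapping the stream in an InputStreamReader with an explicit UTF-8 charset, instead of relying on the platform default, rules this cause in or out. Which charset actually applies in the asker's setup is not known from the question.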

