Java regex to match Vietnamese characters
I have to write a regex that restricts an input field so that it allows only Vietnamese characters, English characters, and digits. I know how to restrict the input to English characters ([a-zA-Z]) and digits ([0-9]), but I don't know how to restrict it to Vietnamese characters.
Can anyone give me a Java regex that matches Vietnamese characters?
Vietnamese characters look like: ể, ứ (Edit: I don't know all of them. Otherwise, I could use [a-list-of-chars], or maybe there is a range, like [a-d] instead of [abcd].)
Vietnamese alphabet
The intersection of the Vietnamese alphabet and the English alphabet (i.e. whatever is common to the 2 alphabets) is the English alphabet minus f, j, w, z.
In Vietnamese, a, e, i, o, u, y are considered vowels.
Apart from those, Vietnamese uses several other characters with diacritics. Below is the list in uppercase (the lowercase version has a 1-character-to-1-character mapping, unlike ß in German):
Consonant:
Đ: LATIN CAPITAL LETTER D WITH STROKE
Vowels:
Ă: LATIN CAPITAL LETTER A WITH BREVE
Â: LATIN CAPITAL LETTER A WITH CIRCUMFLEX
Ê: LATIN CAPITAL LETTER E WITH CIRCUMFLEX
Ô: LATIN CAPITAL LETTER O WITH CIRCUMFLEX
Ơ: LATIN CAPITAL LETTER O WITH HORN
Ư: LATIN CAPITAL LETTER U WITH HORN
Vietnamese has 6 tones. Except for the first tone, the other 5 tones are indicated by a diacritic on the vowel. The tonal diacritics are acute á, grave à, hook ả, tilde ã, and dot below ạ. Since there are (6 + 6) vowels times 5 tone diacritics, plus 6 vowels that already carry a diacritic in the first tone, there are 66 glyphs of vowels with diacritic(s).
Here is the list of all (67) consonants and vowels with diacritic(s):
Á À Ã Ả Ạ Ă Ắ Ằ Ẳ Ẵ Ặ Â Ấ Ầ Ẩ Ẫ Ậ Đ É È Ẻ Ẽ Ẹ Ê Ế Ề Ể Ễ Ệ Í Ì Ỉ Ĩ Ị Ô Ố Ồ Ổ Ỗ Ộ Ơ Ớ Ờ Ở Ỡ Ợ Ó Ò Õ Ỏ Ọ Ư Ứ Ừ Ử Ữ Ự Ú Ù Ủ Ũ Ụ Ý Ỳ Ỷ Ỹ Ỵ
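The count above can be checked with a small sketch. It is not part of the original answer: the class name VietnameseLetters and its helper are made up for illustration, and it assumes that NFC composition yields a single precomposed code point for every base vowel plus tone mark.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: rebuilds the 67 uppercase letters by composing
// base vowels with tone marks, to confirm the arithmetic in the text.
public class VietnameseLetters {
    static final String[] PLAIN_VOWELS = {"A", "E", "I", "O", "U", "Y"};
    // A breve, A circumflex, E circumflex, O circumflex, O horn, U horn
    static final String[] MARKED_VOWELS =
            {"\u0102", "\u00C2", "\u00CA", "\u00D4", "\u01A0", "\u01AF"};
    // Combining acute, grave, hook above, tilde, dot below
    static final String[] TONE_MARKS =
            {"\u0301", "\u0300", "\u0309", "\u0303", "\u0323"};

    static Set<String> diacriticLetters() {
        Set<String> out = new LinkedHashSet<>();
        out.add("\u0110"); // D with stroke, the only consonant
        for (String v : MARKED_VOWELS) {
            out.add(v); // first tone: no tone mark
        }
        for (String[] group : new String[][] {PLAIN_VOWELS, MARKED_VOWELS}) {
            for (String v : group) {
                for (String tone : TONE_MARKS) {
                    // NFC folds base + marks into one precomposed code point
                    out.add(Normalizer.normalize(v + tone, Normalizer.Form.NFC));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> letters = diacriticLetters();
        System.out.println(letters.size() + " letters: " + String.join(" ", letters));
    }
}
```

The set holds 1 consonant, 6 toneless vowels with a diacritic, and 12 × 5 toned vowels.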
These characters are spread across different Latin blocks in Unicode. I handpicked them from Character Map, and I had to be careful not to pick characters that are visually identical to the ones above. To be sure, you can print the names of the characters and check that each one is a Latin character rather than a Greek or Cyrillic one.
String vietnamese_diacritic_characters = "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ";
// Print the official Unicode name of each character
for (char c : vietnamese_diacritic_characters.toCharArray()) {
    System.out.println(c + ": " + Character.getName(c));
}
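As an alternative to reading the names by eye, the script of each code point can be checked programmatically. This is a minimal sketch, not part of the original answer. The class name ScriptCheck is made up, and the check only makes sense for precomposed characters, since combining marks on their own report the INHERITED script.

```java
// Hypothetical helper: verifies that every code point in a string
// belongs to the Latin script, to catch Greek or Cyrillic look-alikes.
public class ScriptCheck {
    static boolean allLatin(String s) {
        return s.codePoints()
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN);
    }

    public static void main(String[] args) {
        // U+0110 (D with stroke) and U+1EE2 (O with horn and dot below)
        System.out.println(allLatin("\u0110\u1EE2"));
        // U+0410 is CYRILLIC CAPITAL LETTER A, visually identical to Latin A
        System.out.println(allLatin("\u0410"));
    }
}
```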
Combining characters
Vietnamese input methods such as Unikey have 2 modes: single code point mode ("Unicode dựng sẵn") and combining mark mode ("Unicode tổ hợp").
As an example, for the same character ợ (U+1EE3), there are several ways to specify it:
As a single code point: ợ
As a combination of ơ (U+01A1) and COMBINING DOT BELOW (U+0323) (2 code points): ợ
As a combination of o, COMBINING HORN (U+031B), and COMBINING DOT BELOW (U+0323) (3 code points): ợ
You can re-create these characters in the console of your browser and check their lengths:
["\u1EE3", "\u01A1\u0323", "o\u031B\u0323"].forEach(function (e) {console.log(e.length);})
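The same check can be sketched in Java. The class name ThreeSpellings and its constants are made up for illustration. The strings are written with Unicode escapes so that an editor or input method cannot silently change their form.

```java
import java.util.stream.Collectors;

// Hypothetical demo: three spellings of the same visible character.
public class ThreeSpellings {
    static final String ONE = "\u1EE3";          // single precomposed code point
    static final String TWO = "\u01A1\u0323";    // o with horn + combining dot below
    static final String THREE = "o\u031B\u0323"; // o + combining horn + combining dot below

    // Formats each code point of s as U+XXXX, separated by spaces
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        for (String s : new String[] {ONE, TWO, THREE}) {
            System.out.println(s.length() + " char(s): " + codePoints(s));
        }
    }
}
```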
If you want to match all 3 variations above, you must list all possible combinations and permutations that specify the character, and you have to do this for all of the characters with diacritics listed above, in both uppercase and lowercase.
Easy enough?
Even if your reply is yes, the code will become an unmaintainable mess that no one can understand.
Canonical equivalence
Since there is more than one way to specify the same text ợ, without a transformation it is not possible to compare ợ (1 code point) and ợ (3 code points) as equal.
"\u1EE3".equals("o\u031B\u0323") --> false
The Unicode standard hence defines the 3 ways to specify ợ above as canonically equivalent, and defines methods to normalize strings for comparison purposes.
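A minimal sketch of such a comparison uses java.text.Normalizer. The class and method names are made up for illustration. Normalizing both sides to NFC is one reasonable choice here; NFD would work equally well as long as both sides use the same form.

```java
import java.text.Normalizer;

// Hypothetical helper: compares two strings up to canonical equivalence.
public class CanonicalCompare {
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u1EE3";      // 1 code point
        String decomposed = "o\u031B\u0323"; // 3 code points
        System.out.println(precomposed.equals(decomposed));          // plain comparison
        System.out.println(canonicallyEqual(precomposed, decomposed)); // normalized comparison
    }
}
```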
The reference implementation of the Pattern class (by Oracle, used on Windows and other platforms) has (partial) support for canonical equivalence matching through the Pattern.CANON_EQ mode. It is extremely buggy, to the point of being unusable, as seen in this and this bug report. At the time of writing, the bug has been present in every version since CANON_EQ was "supported", and it is not going to be fixed any time soon. However, it is not totally broken, and we can still make use of whatever the option offers.
Below is the construction of a Pattern matching the Vietnamese + English alphabet:
String vietnamese_diacritic_characters = "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ";
Pattern p = Pattern.compile("(?:[" + vietnamese_diacritic_characters + "]|[a-z])++",
        Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
The additional flags Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE are used to create a pattern that matches case-insensitively for Unicode characters. Pattern.CASE_INSENSITIVE alone makes the pattern match case-insensitively only for characters in the US-ASCII charset.
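The difference between the two flag combinations can be seen with a one-character pattern. This is a minimal sketch with made-up class and method names; it uses đ (U+0111) and Đ (U+0110), which lie outside US-ASCII.

```java
import java.util.regex.Pattern;

// Hypothetical demo: CASE_INSENSITIVE alone vs. combined with UNICODE_CASE.
public class CaseFlags {
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u0111"; // d with stroke, lowercase
        String upper = "\u0110"; // D with stroke, uppercase

        // ASCII letters fold with CASE_INSENSITIVE alone
        System.out.println(matches("d", "D", Pattern.CASE_INSENSITIVE));
        // Non-ASCII letters do not fold without UNICODE_CASE
        System.out.println(matches(lower, upper, Pattern.CASE_INSENSITIVE));
        // With UNICODE_CASE added, they do
        System.out.println(matches(lower, upper,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));
    }
}
```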
Note that the order of the characters in vietnamese_diacritic_characters is significant. I don't recommend changing the order of the characters unless you understand the implications.
The input should be normalized to canonical decomposition (NFD) or canonical composition (NFC) before matching is performed on it. This ensures that the combining marks are in canonical order.
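Since the input is normalized anyway, one possible variation for the input-field use case is to normalize to NFC and then match with a plain character class, without relying on Pattern.CANON_EQ at all. The sketch below is not the approach this answer takes. The class name, the constant names, and the inclusion of digits (taken from the question) are assumptions. It relies on every letter in the list having a precomposed form, so after NFC each letter is a single code point and the order of the list no longer matters.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical validator: NFC-normalize first, then use an ordinary character class.
public class NormalizedInputValidator {
    static final String DIACRITIC_LETTERS =
            "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ";

    // No CANON_EQ: the input is already in composed form when it is matched
    static final Pattern ALLOWED = Pattern.compile(
            "[a-z0-9" + DIACRITIC_LETTERS + "]+",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    static boolean isValid(String input) {
        String nfc = Normalizer.normalize(input, Normalizer.Form.NFC);
        return ALLOWED.matcher(nfc).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Vi\u1EC7t"));        // precomposed input
        System.out.println(isValid("Vie\u0323\u0302t")); // same word, decomposed input
        System.out.println(isValid("Vi\u1EC7t!"));       // punctuation is rejected
    }
}
```

Like the pattern in this answer, the class accepts the whole a-z range, including f, j, w, and z.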
Regardless of whether the input is preprocessed into canonical composition or canonical decomposition, the result looks the same. Running the code in the appendix should return visually identical results for the second and third outputs:
bạn chính là tác giả của wikipedia mọi người đều có thể biên tập bài ngay lập tức chỉ cần nhớ vài quy tắc có sẵn rất nhiều trang trợ giúp như tạo bài sửa bài hay tải ảnh bạn cũng đừng ngại đặt câu hỏi hiện chúng ta có bài viết và thành viên
bạn chính là tác giả của wikipedia mọi người đều có thể biên tập bài ngay lập tức chỉ cần nhớ vài quy tắc có sẵn rất nhiều trang trợ giúp như tạo bài sửa bài hay tải ảnh bạn cũng đừng ngại đặt câu hỏi hiện chúng ta có bài viết và thành viên
Here are some failed attempts, used to explain why the regex is constructed as shown above.
Attempt 1
String vietnamese_diacritic_characters = "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ";
Pattern p = Pattern.compile("[a-z" + vietnamese_diacritic_characters + "]++",
        Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
Why don't we include a-z in a single character class, instead of putting it in a separate character class and alternating it with the diacritic character class?
Nope, the result is broken when we try to match on the canonical decomposition of the input string. The diacritics are not matched at all.
ba n chi nh la ta c gia cu wikipedia mo ngu o đe u co bie n ta p ba ngay la p tu c chi ca n nho va quy ta c co sa n ra t nhie u trang tro giu p nhu ta o ba su ba hay ta nh ba n cu ng đu ng nga đa t ca u ho hie n chu ng ta co ba vie t va tha nh vie n
Attempt 2
String vietnamese_diacritic_characters = "ÁÀÃẢẠĂẮẰẲẴẶÂẤẦẨẪẬĐÉÈẺẼẸÊẾỀỂỄỆÍÌỈĨỊÓÒÕỎỌÔỐỒỔỖỘƠỚỜỞỠỢÚÙỦŨỤƯỨỪỬỮỰÝỲỶỸỴ";
Pattern p = Pattern.compile("(?:[" + vietnamese_diacritic_characters + "]|[a-z])++",
        Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
The diacritic characters are declared in a character class, so the code should behave the same when we change the order of the characters... right?
Nope, the results are broken when we try to match on the canonical decomposition of the input string.
bạn chính là tác giả của wikipedia mọi ngươ đê u có thê biên tạ p bài ngay lạ p tư c chỉ câ n nhơ vài quy tă c có să n râ t nhiê u trang trơ giúp như tạo bài sư bài hay tải ảnh bạn cũng đư ng ngại đạ t câu hỏi hiẹ n chúng ta có bài viê t và thành viên
The reference implementation (Oracle) implements the Pattern.CANON_EQ mode by picking out the characters in the expression that can be expanded into multiple characters under canonical decomposition and performing a textual transformation of the regex. Then the expression is compiled as per normal.
The first pass, which transforms the regex, doesn't parse the expression properly, so it exhibits crazy behavior even for simple matching, as seen in the bug reports above.
Fortunately, the Pattern class spits out the regex after transformation if there is an unmatched ( in the regex. Therefore, we can add a ( at the end to trigger a PatternSyntaxException and look at the transformed regex string.
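The trick can be sketched as follows. The class and helper names are made up for illustration. Whether getPattern() shows a transformed regex depends on how the JDK in use implements CANON_EQ, so the output on other JDK versions may simply echo the input.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper: appends an unmatched "(" so that compilation fails
// and the exception exposes the regex string the compiler was working on.
public class DumpTransformedRegex {
    static PatternSyntaxException compileBroken(String regex, int flags) {
        try {
            Pattern.compile(regex + "(", flags);
            return null; // not expected: the trailing "(" is never closed
        } catch (PatternSyntaxException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        // U+1EE2 is O with horn and dot below, one of the letters in the list
        PatternSyntaxException e = compileBroken("[\u1EE2]", Pattern.CANON_EQ);
        System.out.println(e.getDescription() + " near index " + e.getIndex());
        System.out.println(e.getPattern());
    }
}
```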
Let's mess with the solution regex above and see the regex string that enters the compilation step:
java.util.regex.patternsyntaxexception: unclosed grouping near index 596 (?:(?:[Đ]|ắ|Ắ|Ắ|ằ|Ằ|Ằ|ẳ|Ẳ|Ẳ|ẵ|Ẵ|Ẵ|ặ|Ặ|Ặ|ặ|Ặ|Ặ|ă|Ă|ấ|Ấ|Ấ|ầ|Ầ|Ầ|ẩ|Ẩ|Ẩ|ẫ|Ẫ|Ẫ|ậ|Ậ|Ậ|ậ|Ậ|Ậ|â|Â|á|Á|à|À|ã|Ã|ả|Ả|ạ|Ạ|ế|Ế|Ế|ề|Ề|Ề|ể|Ể|Ể|ễ|Ễ|Ễ|ệ|Ệ|Ệ|ệ|Ệ|Ệ|ê|Ê|é|É|è|È|ẻ|Ẻ|ẽ|Ẽ|ẹ|Ẹ|í|Í|ì|Ì|ỉ|Ỉ|ĩ|Ĩ|ị|Ị|ố|Ố|Ố|ồ|Ồ|Ồ|ổ|Ổ|Ổ|ỗ|Ỗ|Ỗ|ộ|Ộ|Ộ|ộ|Ộ|Ộ|ô|Ô|ớ|Ớ|Ớ|ớ|Ớ|Ớ|ờ|Ờ|Ờ|ờ|Ờ|Ờ|ở|Ở|Ở|ở|Ở|Ở|ỡ|Ỡ|Ỡ|ỡ|Ỡ|Ỡ|ợ|Ợ|Ợ|ợ|Ợ|Ợ|ơ|Ơ|ó|Ó|ò|Ò|õ|Õ|ỏ|Ỏ|ọ|Ọ|ứ|Ứ|Ứ|ứ|Ứ|Ứ|ừ|Ừ|Ừ|ừ|Ừ|Ừ|ử|Ử|Ử|ử|Ử|Ử|ữ|Ữ|Ữ|ữ|Ữ|Ữ|ự|Ự|Ự|ự|Ự|Ự|ư|Ư|ú|Ú|ù|Ù|ủ|Ủ|ũ|Ũ|ụ|Ụ|ý|Ý|ỳ|Ỳ|ỷ|Ỷ|ỹ|Ỹ|ỵ|Ỵ)|[a-z])++( ^
As we can see, the engine grabs the characters that can be expanded under canonical decomposition, takes them outside the character class, and builds an alternation.
It is still not clear what is happening with the same characters repeating in the alternation, so let's insert a space between every character:
( ? : ( ? : [ Đ ] | a ̆ ́ | Ă ́ | Ắ | a ̆ ̀ | Ă ̀ | Ằ | a ̆ ̉ | Ă ̉ | Ẳ | a ̆ ̃ | Ă ̃ | Ẵ | a ̣ ̆ | Ạ ̆ | Ặ | a ̆ ̣ | Ă ̣ | Ặ | a ̆ | Ă | a ̂ ́ | Â ́ | Ấ | a ̂ ̀ | Â ̀ | Ầ | a ̂ ̉ | Â ̉ | Ẩ | a ̂ ̃ | Â ̃ | Ẫ | a ̣ ̂ | Ạ ̂ | Ậ | a ̂ ̣ | Â ̣ | Ậ | a ̂ | Â | a ́ | Á | a ̀ | À | a ̃ | Ã | a ̉ | Ả | a ̣ | Ạ | e ̂ ́ | Ê ́ | Ế | e ̂ ̀ | Ê ̀ | Ề | e ̂ ̉ | Ê ̉ | Ể | e ̂ ̃ | Ê ̃ | Ễ | e ̣ ̂ | Ẹ ̂ | Ệ | e ̂ ̣ | Ê ̣ | Ệ | e ̂ | Ê | e ́ | É | e ̀ | È | e ̉ | Ẻ | e ̃ | Ẽ | e ̣ | Ẹ | i ́ | Í | i ̀ | Ì | i ̉ | Ỉ | i ̃ | Ĩ | i ̣ | Ị | o ̂ ́ | Ô ́ | Ố | o ̂ ̀ | Ô ̀ | Ồ | o ̂ ̉ | Ô ̉ | Ổ | o ̂ ̃ | Ô ̃ | Ỗ | o ̣ ̂ | Ọ ̂ | Ộ | o ̂ ̣ | Ô ̣ | Ộ | o ̂ | Ô | o ̛ ́ | Ơ ́ | Ớ | o ́ ̛ | Ó ̛ | Ớ | o ̛ ̀ | Ơ ̀ | Ờ | o ̀ ̛ | Ò ̛ | Ờ | o ̛ ̉ | Ơ ̉ | Ở | o ̉ ̛ | Ỏ ̛ | Ở | o ̛ ̃ | Ơ ̃ | Ỡ | o ̃ ̛ | Õ ̛ | Ỡ | o ̛ ̣ | Ơ ̣ | Ợ | o ̣ ̛ | Ọ ̛ | Ợ | o ̛ | Ơ | o ́ | Ó | o ̀ | Ò | o ̃ | Õ | o ̉ | Ỏ | o ̣ | Ọ | u ̛ ́ | Ư ́ | Ứ | u ́ ̛ | Ú ̛ | Ứ | u ̛ ̀ | Ư ̀ | Ừ | u ̀ ̛ | Ù ̛ | Ừ | u ̛ ̉ | Ư ̉ | Ử | u ̉ ̛ | Ủ ̛ | Ử | u ̛ ̃ | Ư ̃ | Ữ | u ̃ ̛ | Ũ ̛ | Ữ | u ̛ ̣ | Ư ̣ | Ự | u ̣ ̛ | Ụ ̛ | Ự | u ̛ | Ư | u ́ | Ú | u ̀ | Ù | u ̉ | Ủ | u ̃ | Ũ | u ̣ | Ụ | y ́ | Ý | y ̀ | Ỳ | y ̉ | Ỷ | y ̃ | Ỹ | y ̣ | Ỵ ) | [ a - z ] ) + + (
We can see that the bunches of repeating characters are not actually the same: they are different sequences that represent the same character.
With the same method, let's analyze the regex in attempt 2 to see why it fails.
java.util.regex.patternsyntaxexception: unclosed grouping near index 596 (?:(?:[Đ]|á|Á|à|À|ã|Ã|ả|Ả|ạ|Ạ|ă|Ă|ắ|Ắ|Ắ|ằ|Ằ|Ằ|ẳ|Ẳ|Ẳ|ẵ|Ẵ|Ẵ|ặ|Ặ|Ặ|ặ|Ặ|Ặ|â|Â|ấ|Ấ|Ấ|ầ|Ầ|Ầ|ẩ|Ẩ|Ẩ|ẫ|Ẫ|Ẫ|ậ|Ậ|Ậ|ậ|Ậ|Ậ|é|É|è|È|ẻ|Ẻ|ẽ|Ẽ|ẹ|Ẹ|ê|Ê|ế|Ế|Ế|ề|Ề|Ề|ể|Ể|Ể|ễ|Ễ|Ễ|ệ|Ệ|Ệ|ệ|Ệ|Ệ|í|Í|ì|Ì|ỉ|Ỉ|ĩ|Ĩ|ị|Ị|ó|Ó|ò|Ò|õ|Õ|ỏ|Ỏ|ọ|Ọ|ô|Ô|ố|Ố|Ố|ồ|Ồ|Ồ|ổ|Ổ|Ổ|ỗ|Ỗ|Ỗ|ộ|Ộ|Ộ|ộ|Ộ|Ộ|ơ|Ơ|ớ|Ớ|Ớ|ớ|Ớ|Ớ|ờ|Ờ|Ờ|ờ|Ờ|Ờ|ở|Ở|Ở|ở|Ở|Ở|ỡ|Ỡ|Ỡ|ỡ|Ỡ|Ỡ|ợ|Ợ|Ợ|ợ|Ợ|Ợ|ú|Ú|ù|Ù|ủ|Ủ|ũ|Ũ|ụ|Ụ|ư|Ư|ứ|Ứ|Ứ|ứ|Ứ|Ứ|ừ|Ừ|Ừ|ừ|Ừ|Ừ|ử|Ử|Ử|ử|Ử|Ử|ữ|Ữ|Ữ|ữ|Ữ|Ữ|ự|Ự|Ự|ự|Ự|Ự|ý|Ý|ỳ|Ỳ|ỷ|Ỷ|ỹ|Ỹ|ỵ|Ỵ)|[a-z])++( ^
Insert a space between every character:
( ? : ( ? : [ Đ ] | a ́ | Á | a ̀ | À | a ̃ | Ã | a ̉ | Ả | a ̣ | Ạ | a ̆ | Ă | a ̆ ́ | Ă ́ | Ắ | a ̆ ̀ | Ă ̀ | Ằ | a ̆ ̉ | Ă ̉ | Ẳ | a ̆ ̃ | Ă ̃ | Ẵ | a ̣ ̆ | Ạ ̆ | Ặ | a ̆ ̣ | Ă ̣ | Ặ | a ̂ | Â | a ̂ ́ | Â ́ | Ấ | a ̂ ̀ | Â ̀ | Ầ | a ̂ ̉ | Â ̉ | Ẩ | a ̂ ̃ | Â ̃ | Ẫ | a ̣ ̂ | Ạ ̂ | Ậ | a ̂ ̣ | Â ̣ | Ậ | e ́ | É | e ̀ | È | e ̉ | Ẻ | e ̃ | Ẽ | e ̣ | Ẹ | e ̂ | Ê | e ̂ ́ | Ê ́ | Ế | e ̂ ̀ | Ê ̀ | Ề | e ̂ ̉ | Ê ̉ | Ể | e ̂ ̃ | Ê ̃ | Ễ | e ̣ ̂ | Ẹ ̂ | Ệ | e ̂ ̣ | Ê ̣ | Ệ | i ́ | Í | i ̀ | Ì | i ̉ | Ỉ | i ̃ | Ĩ | i ̣ | Ị | o ́ | Ó | o ̀ | Ò | o ̃ | Õ | o ̉ | Ỏ | o ̣ | Ọ | o ̂ | Ô | o ̂ ́ | Ô ́ | Ố | o ̂ ̀ | Ô ̀ | Ồ | o ̂ ̉ | Ô ̉ | Ổ | o ̂ ̃ | Ô ̃ | Ỗ | o ̣ ̂ | Ọ ̂ | Ộ | o ̂ ̣ | Ô ̣ | Ộ | o ̛ | Ơ | o ̛ ́ | Ơ ́ | Ớ | o ́ ̛ | Ó ̛ | Ớ | o ̛ ̀ | Ơ ̀ | Ờ | o ̀ ̛ | Ò ̛ | Ờ | o ̛ ̉ | Ơ ̉ | Ở | o ̉ ̛ | Ỏ ̛ | Ở | o ̛ ̃ | Ơ ̃ | Ỡ | o ̃ ̛ | Õ ̛ | Ỡ | o ̛ ̣ | Ơ ̣ | Ợ | o ̣ ̛ | Ọ ̛ | Ợ | u ́ | Ú | u ̀ | Ù | u ̉ | Ủ | u ̃ | Ũ | u ̣ | Ụ | u ̛ | Ư | u ̛ ́ | Ư ́ | Ứ | u ́ ̛ | Ú ̛ | Ứ | u ̛ ̀ | Ư ̀ | Ừ | u ̀ ̛ | Ù ̛ | Ừ | u ̛ ̉ | Ư ̉ | Ử | u ̉ ̛ | Ủ ̛ | Ử | u ̛ ̃ | Ư ̃ | Ữ | u ̃ ̛ | Ũ ̛ | Ữ | u ̛ ̣ | Ư ̣ | Ự | u ̣ ̛ | Ụ ̛ | Ự | y ́ | Ý | y ̀ | Ỳ | y ̉ | Ỷ | y ̃ | Ỹ | y ̣ | Ỵ ) | [ a - z ] ) + + (
Notice that a ̂ | Â comes before a ̂ ̀ | Â ̀ | Ầ in the regex. This means that a ̂ will be tried first on the input ầ (a ̂ ̀), and the repetition will end when it fails to match the leftover combining mark in the next iteration.
Since the order of an alternation is important, as a general rule, between 2 strings where one string is a prefix of the other, the longer string should go first in the alternation. In our case, we need to place the characters with more diacritics before the characters with fewer or no diacritics.
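The prefix rule can be illustrated on its own, without CANON_EQ. This is a minimal sketch with made-up names: sorting the alternatives by descending length is one simple way to guarantee that no alternative is preceded by one of its own prefixes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: builds an alternation with longer literals first.
public class LongestFirstAlternation {
    static String alternation(List<String> literals) {
        List<String> sorted = new ArrayList<>(literals);
        // A prefix is always shorter than the string it prefixes,
        // so descending length puts every prefix after the longer string.
        sorted.sort(Comparator.comparingInt(String::length).reversed());
        StringJoiner joiner = new StringJoiner("|", "(?:", ")");
        for (String s : sorted) {
            joiner.add(Pattern.quote(s));
        }
        return joiner.toString();
    }

    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String shorter = "A\u0302";       // A + combining circumflex
        String longer = "A\u0302\u0300";  // A + combining circumflex + combining grave
        String wrongOrder = "(?:" + shorter + "|" + longer + ")++";
        String rightOrder = alternation(Arrays.asList(shorter, longer)) + "++";

        // Wrong order stops after the shorter alternative and leaves the grave unmatched
        System.out.println(firstMatch(wrongOrder, longer).length());
        // Longest-first order consumes the whole sequence
        System.out.println(firstMatch(rightOrder, longer).length());
    }
}
```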
The same problem occurs in attempt 1:
java.util.regex.patternsyntaxexception: unclosed grouping near index 589 (?:[a-zĐ]|ắ|Ắ|Ắ|ằ|Ằ|Ằ|ẳ|Ẳ|Ẳ|ẵ|Ẵ|Ẵ|ặ|Ặ|Ặ|ặ|Ặ|Ặ|ă|Ă|ấ|Ấ|Ấ|ầ|Ầ|Ầ|ẩ|Ẩ|Ẩ|ẫ|Ẫ|Ẫ|ậ|Ậ|Ậ|ậ|Ậ|Ậ|â|Â|á|Á|à|À|ã|Ã|ả|Ả|ạ|Ạ|ế|Ế|Ế|ề|Ề|Ề|ể|Ể|Ể|ễ|Ễ|Ễ|ệ|Ệ|Ệ|ệ|Ệ|Ệ|ê|Ê|é|É|è|È|ẻ|Ẻ|ẽ|Ẽ|ẹ|Ẹ|í|Í|ì|Ì|ỉ|Ỉ|ĩ|Ĩ|ị|Ị|ố|Ố|Ố|ồ|Ồ|Ồ|ổ|Ổ|Ổ|ỗ|Ỗ|Ỗ|ộ|Ộ|Ộ|ộ|Ộ|Ộ|ô|Ô|ớ|Ớ|Ớ|ớ|Ớ|Ớ|ờ|Ờ|Ờ|ờ|Ờ|Ờ|ở|Ở|Ở|ở|Ở|Ở|ỡ|Ỡ|Ỡ|ỡ|Ỡ|Ỡ|ợ|Ợ|Ợ|ợ|Ợ|Ợ|ơ|Ơ|ó|Ó|ò|Ò|õ|Õ|ỏ|Ỏ|ọ|Ọ|ứ|Ứ|Ứ|ứ|Ứ|Ứ|ừ|Ừ|Ừ|ừ|Ừ|Ừ|ử|Ử|Ử|ử|Ử|Ử|ữ|Ữ|Ữ|ữ|Ữ|Ữ|ự|Ự|Ự|ự|Ự|Ự|ư|Ư|ú|Ú|ù|Ù|ủ|Ủ|ũ|Ũ|ụ|Ụ|ý|Ý|ỳ|Ỳ|ỷ|Ỷ|ỹ|Ỹ|ỵ|Ỵ)++( ^
Since the alternations are placed after the original character class, the vowels in [a-z] will be tried first, which leads to the repetition terminating when it encounters a stray combining mark.
Below is the source code of the testing program.
Demo on ideone
import java.util.regex.*;
import java.text.*;

class Ideone {
    public static void main(String[] args) throws java.lang.Exception {
        String vietnamese_diacritic_characters = "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ";

        /* Uncomment to print the Unicode name of each character
        for (char c : vietnamese_diacritic_characters.toCharArray()) {
            System.out.println(c + ": " + Character.getName(c));
        }
        */

        String tests[] = new String[3];
        // tests[0]: the text as typed
        tests[0] = "bạn chính là tác giả của wikipedia!\n"
                + "mọi người đều có thể biên tập bài ngay lập tức, chỉ cần nhớ vài quy tắc."
                + "có sẵn rất nhiều trang trợ giúp như tạo bài, sửa bài hay tải ảnh."
                + "bạn cũng đừng ngại đặt câu hỏi.\n"
                + "hiện chúng ta có 1.109.446 bài viết và 406.782 thành viên.";
        // tests[1]: canonical decomposition
        tests[1] = Normalizer.normalize(tests[0], Normalizer.Form.NFD);

        /* Uncomment to dump the UTF-16 code units of the decomposed text
        for (char c : tests[1].toCharArray()) {
            System.out.printf("%04x ", (int) c);
        }
        */

        // tests[2]: canonical composition
        tests[2] = Normalizer.normalize(tests[0], Normalizer.Form.NFC);

        try {
            Pattern p = Pattern.compile("(?:[" + vietnamese_diacritic_characters + "]|[a-z])++",
                    Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

            // Print every match, one line per test string
            for (String t : tests) {
                Matcher m = p.matcher(t);
                while (m.find()) {
                    System.out.print(m.group() + " ");
                }
                System.out.println();
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}