Sunday 15 April 2012

windows - R: can't read Unicode text files even when specifying the encoding




I'm using R 3.1.1 on Windows 7 32-bit. I'm having a lot of problems reading some text files on which I want to perform textual analysis. According to Notepad++, the files are encoded as "UCS-2 Little Endian". (grepWin, a tool whose name says it all, says the file is "Unicode".)

The problem is that I can't seem to read the file even when specifying the encoding. (The characters are from the standard Spanish Latin set -ñáó- and should be handled easily by CP1252 or something like that.)

> Sys.getlocale()
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
> readLines("filename.txt")
[1] "ÿþe" "" "" "" "" ...
> readLines("filename.txt", encoding="UTF-8")
[1] "\xff\xfee" "" "" "" "" ...
> readLines("filename.txt", encoding="UCS2LE")
[1] "ÿþe" "" "" "" "" "" "" ...
> readLines("filename.txt", encoding="UCS2")
[1] "ÿþe" "" "" "" "" ...

Any ideas?

Thanks!!

Edit: the "UTF-16", "UTF-16LE" and "UTF-16BE" encodings fail similarly.

After reading the documentation more closely, I found the answer to my question.

The encoding parameter of readLines only applies to the input strings; it does not convert the file. The documentation says:

encoding to be assumed for input strings. It is used to mark character strings as known to be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the latter, specify the encoding as part of the connection con or via options(encoding=): see the examples. See also ‘Details’.

The proper way of reading a file with an uncommon encoding is, then,
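The difference between marking a string and re-encoding it can be seen on a single character. The following is only an illustrative sketch, not part of the original answer: the variable names are made up, and it uses base R's `Encoding<-` and `iconv` to show that marking leaves the bytes untouched while re-encoding rewrites them.

```r
# One byte, 0xF1, which is "ñ" in Latin-1 / CP1252.
x <- rawToChar(as.raw(0xf1))

# Marking: only tells R how to interpret the existing bytes.
marked <- x
Encoding(marked) <- "latin1"
marked_bytes <- charToRaw(marked)        # still the single byte f1

# Re-encoding: actually converts the bytes to another encoding.
reencoded <- iconv(x, from = "latin1", to = "UTF-8")
reencoded_bytes <- charToRaw(reencoded)  # two bytes in UTF-8

print(marked_bytes)
print(reencoded_bytes)
```

The encoding argument of readLines behaves like the marking step, which is why it cannot make sense of a UCS-2 file whose bytes were never converted.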

filetext <- readLines(con <- file("unicodefile.txt", encoding = "UCS-2LE"))
close(con)
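To try this without the original file, here is a minimal round-trip sketch. It assumes a temporary file and made-up sample text (neither comes from the question): it writes a small UCS-2LE file with a byte order mark, as Windows "Unicode" files usually have, and reads it back through a connection that declares the encoding.

```r
# Hypothetical sample data; "\u00f1\u00e1\u00f3" spells the Spanish letters n-tilde, a-acute, o-acute.
original <- c("\u00f1\u00e1\u00f3", "second line")
path <- tempfile(fileext = ".txt")

# Build the file contents as raw UCS-2LE bytes, with Windows line endings.
body <- paste0(paste(original, collapse = "\r\n"), "\r\n")
bytes <- iconv(body, from = "UTF-8", to = "UCS-2LE", toRaw = TRUE)[[1]]

# Prepend the little-endian byte order mark (FF FE) and write in binary mode.
writeBin(c(as.raw(c(0xff, 0xfe)), bytes), path)

# Read it back: the encoding is declared on the connection, not on readLines.
con <- file(path, encoding = "UCS-2LE")
text <- readLines(con)
close(con)

# Defensive step: drop a leading U+FEFF in case the BOM survives the conversion.
text[1] <- sub("^\ufeff", "", enc2utf8(text[1]))

print(text)
unlink(path)
```

Reading the same file with a plain readLines(path) should show the "ÿþ" prefix and empty strings from the question, since the bytes are then interpreted in the native encoding.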

