Windows - R: can't read Unicode text files even when specifying the encoding
I'm using R 3.1.1 on Windows 7 32-bit. I'm having a lot of problems reading some text files on which I want to perform textual analysis. According to Notepad++, the files are encoded as "UCS-2 Little Endian". (grepWin, a tool whose name says it all, says the files are "Unicode".)
The problem is that I can't seem to read the files even when specifying the encoding. (The characters are from the standard Spanish Latin set -ñáó- and should be handled by CP1252 or something like that.)
> Sys.getlocale()
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
> readLines("filename.txt")
[1] "ÿþe" "" "" "" "" ...
> readLines("filename.txt", encoding = "UTF-8")
[1] "\xff\xfee" "" "" "" "" ...
> readLines("filename.txt", encoding = "UCS2LE")
[1] "ÿþe" "" "" "" "" "" "" ...
> readLines("filename.txt", encoding = "UCS2")
[1] "ÿþe" "" "" "" "" ...
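As a quick check (a minimal sketch, assuming "filename.txt" is the file in question), the first raw bytes can be inspected with readBin; 0xFF 0xFE is the UTF-16/UCS-2 little-endian byte-order mark, which matches the "ÿþ" that shows up in the output above:

# Read the first few bytes of the file to look at the byte-order mark
bom <- readBin("filename.txt", what = "raw", n = 4)
bom
# "ff fe ..." would confirm UTF-16/UCS-2 little endian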
Any ideas?
Thanks!!
EDIT: the "UTF-16", "UTF-16LE" and "UTF-16BE" encodings fail similarly.
After reading the documentation more closely, I found the answer to my question.
The encoding parameter of readLines applies only to the input strings. The documentation says:
encoding to be assumed for input strings. It is used to mark character strings as known to be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the latter, specify the encoding as part of the connection con or via options(encoding=): see the examples. See also ‘Details’.
The proper way to read a file with an uncommon encoding is, then:
filetext <- readLines(con <- file("unicodefile.txt", encoding = "UCS-2LE"))
close(con)
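For the record, a slightly more explicit sketch of the same idea (file name and column layout are just placeholders here): the encoding has to be attached to the connection, or passed as fileEncoding to the higher-level readers, so the text is re-encoded while it is being read rather than merely marked afterwards.

# Read the whole file through a connection that re-encodes from UCS-2LE
con <- file("unicodefile.txt", encoding = "UCS-2LE")
filetext <- readLines(con)
close(con)

# Tabular data can use fileEncoding, which wraps the same mechanism
# (a tab-separated file with a header row is assumed here)
dat <- read.table("unicodefile.txt", fileEncoding = "UCS-2LE",
                  header = TRUE, sep = "\t", stringsAsFactors = FALSE)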