Tuesday 15 June 2010

Web scraping - Different results when reading from text file and web in R




Let's say the website www.exampleweb.com contains data like this:

... -3.7358293e+000 7.6062331e-001 6.0701401e+000 -1.6897975e+000 -2.1088811e+000 2.7172791e+000 -2.5477626e+000 ...

It is 1 column with 1000 rows. I'm obtaining the data from the website in two ways:

1. Reading it directly from the web:

con = url("http://www.exampleweb.com")
data_from_html <- readLines(con)
close(con)

Now I need to convert the data, because

> str(data_from_html)
 chr [1:1000] " -2.9735888e+000" " -1.4757566e+000" " 8.6980880e-001" " 4.9502553e+000" ...

So:

converted <- as.numeric(data_from_html)
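As a minimal sketch of what this conversion does, the strings shown in the str() output above can be converted on their own. The name `sample_lines` is hypothetical and holds only the four values visible in the question, not the full 1000 rows. The point is to check that as.numeric() copes with the leading space and the three-digit exponent without producing NA.

```r
# Hypothetical sample: the four strings visible in the str() output above
sample_lines <- c(" -2.9735888e+000", " -1.4757566e+000",
                  " 8.6980880e-001", " 4.9502553e+000")

# as.numeric() ignores leading whitespace and parses the exponent notation
sample_converted <- as.numeric(sample_lines)

# Print with more digits than the default 7 to inspect the stored values
print(sample_converted, digits = 8)

# TRUE here would mean a string failed to parse and became NA
print(anyNA(sample_converted))
```

If anyNA() reports FALSE, every string was parsed as a number, so the conversion step itself is unlikely to be where values change.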

2. Copying (Ctrl+A) the whole page, pasting it into a .txt file, and saving it as "my_data.txt":

data_from_txt <- read.table("my_data.txt")

Now, when I use

> summary(converted)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-16.2800  -1.5030  -0.0598  -0.1809   1.2220  13.0100

But on the other hand:

> summary(data_from_txt)
       V1
 Min.   :-16.2789
 1st Qu.: -1.5026
 Median : -0.0598
 Mean   : -0.1809
 3rd Qu.:  1.2217
 Max.   : 13.0112

I can't decide which one is better. My sense is that there is information loss in converting character to numeric, but I don't know how to prevent it. I checked the head/tail of these variables, and they have the same values:

> head(converted)
[1] -2.9735888 -1.4757566  0.8698088  4.9502553 -4.3059115  0.9745958
> tail(converted)
[1] -3.007217 -4.600345 -3.740255  2.579664 -2.233819 -1.028491
> head(data_from_txt)
          V1
1 -2.9735888
2 -1.4757566
3  0.8698088
4  4.9502553
5 -4.3059115
6  0.9745958
> tail(data_from_txt)
            V1
995  -3.007217
996  -4.600345
997  -3.740255
998   2.579664
999  -2.233819
1000 -1.028491
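Checking head and tail covers only twelve rows, so a fuller comparison may help. The sketch below assumes the two summaries differ only in how they round for display, which is worth verifying rather than taking for granted. It uses a hypothetical stand-in, `raw_text`, holding four values from the question, and parses it both ways: with as.numeric() and with read.table(). It then compares the stored numbers directly and asks summary() for more significant digits through its `digits` argument.

```r
# Hypothetical stand-in for the real data: four values from the question
raw_text <- c(" -2.9735888e+000", " -1.4757566e+000",
              " 8.6980880e-001", " 4.9502553e+000")

# Path 1: character lines converted with as.numeric()
converted <- as.numeric(raw_text)

# Path 2: the same text parsed by read.table(), as if it came from a .txt file
data_from_txt <- read.table(text = raw_text)

# Compare the stored numbers rather than their printed summaries
same_values <- isTRUE(all.equal(converted, data_from_txt$V1))
print(same_values)

# Request more significant digits so both summaries are shown alike
print(summary(converted, digits = 7))
print(summary(data_from_txt$V1, digits = 7))
```

On the real data, the same all.equal() call on `converted` and `data_from_txt$V1` would show whether the two import routes hold the same numbers. If they do, the differing summaries come from display rounding rather than from data loss.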

How do I deal with it? Does it mean I should never web scrape data? What if I, for some reason, can't create a .txt file? Maybe I just need to improve the method of data conversion?

r web-scraping text-files analytics data-loss
