Monday 15 August 2011

apache spark - EntityTooLarge error when uploading a 5G file to Amazon S3

The Amazon S3 file size limit is supposed to be 5 TB according to the announcement, but I am getting the following error when uploading a 5 GB file:

S3 PUT failed for '/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message:

<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>EntityTooLarge</Code>
  <Message>Your proposed upload exceeds the maximum allowed size</Message>
  <ProposedSize>5374138340</ProposedSize>
  ...
  <MaxSizeAllowed>5368709120</MaxSizeAllowed>
</Error>

This makes it seem like S3 only accepts 5 GB uploads. I am using Apache Spark SQL to write out a Parquet data set with the SchemaRDD.saveAsParquetFile method. The full stack trace is:

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooLarge</Code><Message>Your proposed upload exceeds the maximum allowed size</Message><ProposedSize>5374138340</ProposedSize><RequestId>20A38B479FFED879</RequestId><HostId>kxegspreq0ho7mm7dtcglin7vi7nqt3z6p2nbx1alulsezp6x5iu8kj6qm7whm56cij7udeenn4=</HostId><MaxSizeAllowed>5368709120</MaxSizeAllowed></Error>
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:82)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.fs.s3native.$Proxy10.storeFile(Unknown Source)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.close(NativeS3FileSystem.java:174)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
    at parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:321)
    at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)
    at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
    at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:305)
    at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
    at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    at org.apache.spark.scheduler.Task.run(Task.scala:54)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
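
For reference, the write that produces this trace looks roughly like the following. This is a minimal sketch of the Spark 1.1-era API described above; the bucket name, input path, and PageView schema are placeholders, not the actual job.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-write"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicitly converts an RDD of case classes to a SchemaRDD

    // Hypothetical schema standing in for the real pageview data.
    case class PageView(page: String, views: Long)

    val pageViews = sc.textFile("s3n://my-bucket/raw/pageviews")
      .map(_.split("\t"))
      .map(f => PageView(f(0), f(1).toLong))

    // Each output partition becomes one part-r-*.parquet file. On s3n each
    // file is written with a single PUT, so any partition over 5 GB fails.
    pageViews.saveAsParquetFile("s3n://my-bucket/mahler/parquet/pageview/all-2014-2000")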

Is the upload limit still 5 TB? If it is, why am I getting this error, and how do I work around it?

The object size is limited to 5 TB, but the size of a single upload is still 5 GB, as explained in the manual:

Depending on the size of the data you are uploading, Amazon S3 offers the following options:

Upload objects in a single operation—with a single PUT operation you can upload objects up to 5 GB in size.

Upload objects in parts—using the multipart upload API you can upload large objects, up to 5 TB.

http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
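
If you are writing through Hadoop's s3n filesystem, as the stack trace suggests, newer Hadoop versions can do the multipart upload for you. Here is a sketch of turning it on from Spark, assuming Hadoop 2.4+ and the property names introduced by HADOOP-9454; verify them against your Hadoop version:

    // Enable multipart uploads for s3n (Hadoop 2.4+, HADOOP-9454).
    // Property names are assumptions from that ticket; check your Hadoop docs.
    sc.hadoopConfiguration.set("fs.s3n.multipart.uploads.enabled", "true")
    // Upload in 64 MB parts (the value is in bytes).
    sc.hadoopConfiguration.set("fs.s3n.multipart.uploads.block.size", (64 * 1024 * 1024).toString)

Alternatively, you can keep every task's output under the 5 GB single-PUT limit by writing more, smaller files, e.g. pageViews.repartition(400).saveAsParquetFile(...), where 400 is a hypothetical partition count chosen so no single part file exceeds 5 GB.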

Once you complete a multipart upload, S3 validates and recombines the parts, and you then have a single object in S3, up to 5 TB in size, that can be downloaded as a single entity with a single HTTP GET request... but uploading is potentially much faster, even on files smaller than 5 GB, since you can upload the parts in parallel and retry any parts that didn't succeed on the first attempt.
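
Outside of Hadoop, the AWS SDK for Java handles this split automatically: TransferManager switches from a single PUT to a parallel multipart upload above a size threshold and retries failed parts. A minimal sketch, with placeholder bucket, key, and file names:

    import com.amazonaws.services.s3.AmazonS3Client
    import com.amazonaws.services.s3.transfer.TransferManager
    import java.io.File

    // TransferManager picks single PUT vs. multipart based on file size,
    // uploads parts in parallel, and retries parts that fail.
    val tm = new TransferManager(new AmazonS3Client())
    val upload = tm.upload("my-bucket", "big/part-r-222.parquet",
      new File("/data/part-r-222.parquet"))
    upload.waitForCompletion() // blocks until all parts are uploaded and combined
    tm.shutdownNow()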

amazon-s3 apache-spark jets3t parquet apache-spark-sql
