twitter/hadoop-lzo

lzo.index.tmp files not deleted

Opened this issue · 4 comments

We use the distributed LZO indexer on EMR (Hadoop version: 1.0.3), with files stored on Amazon S3.

Sometimes (observed twice so far) we have hit the following issue:

All lzo.index files are generated, but some of the lzo.index.tmp files are not deleted, and they cause problems when the data is processed with Pig. No exception or error is thrown during indexing, and the job is reported as successful.

We have not seen this in our self-hosted environment. It might be due to something EC2-specific. Do you have any theories about the root cause?

Meanwhile, we have noticed that these index.tmp files have disappeared. We
suspect this was an S3 eventual consistency issue: it took S3 a long time
(ca. 7 hours) to become consistent.

I see. Well, perhaps it would make sense to add a filter to the LZO input formats so that they ignore these temp files and you don't get an error. Feel free to send a pull request with such a change; we will be happy to take a look.

Excluding .tmp files is a good fix.
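
For illustration, here is a minimal plain-Java sketch of the filtering idea discussed above. It is not code from hadoop-lzo: the class name `LzoTempFileFilter`, the method names, and the sample file names are all hypothetical, and the `.lzo.index.tmp` suffix is assumed from the file names reported in this issue. In a real job, a similar predicate could presumably live inside an implementation of Hadoop's `org.apache.hadoop.fs.PathFilter` interface, but that wiring is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LzoTempFileFilter {
    // Assumption: leftover temp index files end with this suffix,
    // as in the file names reported in this issue.
    static final String TMP_SUFFIX = ".lzo.index.tmp";

    /** Returns true if the file should be kept, false for leftover temp index files. */
    static boolean accept(String fileName) {
        return !fileName.endsWith(TMP_SUFFIX);
    }

    /** Returns a new list containing only the accepted file names, in order. */
    static List<String> filter(List<String> fileNames) {
        List<String> kept = new ArrayList<>();
        for (String name : fileNames) {
            if (accept(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical directory listing with one stale temp index file.
        List<String> listing = Arrays.asList(
                "part-00000.lzo",
                "part-00000.lzo.index",
                "part-00001.lzo",
                "part-00001.lzo.index.tmp");
        System.out.println(filter(listing));
    }
}
```

The sketch only drops names ending in the temp suffix and leaves everything else alone, so regular `.lzo` and `.lzo.index` files are unaffected.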

There are other subtle issues with S3 caused by these delays, e.g. https://github.com/kevinweil/elephant-bird/issues/309