Trimmed files are uploaded without configured file extension #358

Closed
igorcalabria opened this issue Aug 8, 2017 · 1 comment

With a little investigation, I found that the files with missing extensions were always trimmed files. This snippet from com.pinterest.secor.uploader.Uploader#trim shows that the extension is only set when a compression codec is configured:

            String extension = "";
            if (mConfig.getCompressionCodec() != null && !mConfig.getCompressionCodec().isEmpty()) {
                codec = CompressionUtil.createCompressionCodec(mConfig.getCompressionCodec());
                extension = codec.getDefaultExtension();
            }
            reader = createReader(srcPath, codec);
            KeyValue keyVal;
            while ((keyVal = reader.next()) != null) {
                if (keyVal.getOffset() >= startOffset) {
                    if (writer == null) {
                        String localPrefix = mConfig.getLocalPath() + '/' +
                            IdUtil.getLocalMessageDir();
                        dstPath = new LogFilePath(localPrefix, srcPath.getTopic(),
                                                  srcPath.getPartitions(), srcPath.getGeneration(),
                                                  srcPath.getKafkaPartition(), startOffset,
                                                  extension);
                        writer = mFileRegistry.getOrCreateWriter(dstPath, codec);
                    }
                    writer.write(keyVal);
                    copiedMessages++;
                }
            }

This may seem like a harmless bug, but it affects anyone who uses a glob pattern to process files, for example s3a://secor-backups/some-topic/dt=2017-08-08/*.parquet. A trimmed file in that directory has no extension, so the pattern will miss it.
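To illustrate the glob problem, here is a minimal sketch using the JDK's own `PathMatcher`. It is only a stand-in: the Hadoop/S3A glob implementation is a different one, and I am assuming it treats `*.parquet` the same way for this simple case. The file names are made up for the example.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GlobMissDemo {
    // Returns the file names that match the glob, in input order.
    // Uses java.nio glob syntax as a stand-in for the Hadoop glob (assumption).
    static List<String> matching(String glob, List<String> fileNames) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matched = new ArrayList<>();
        for (String name : fileNames) {
            if (matcher.matches(Paths.get(name))) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Hypothetical file names: one regular upload, one trimmed file
        // that was written without the ".parquet" extension.
        List<String> files = Arrays.asList(
            "1_0_00000000000000000100.parquet",
            "1_0_00000000000000000150");
        // Only the first name is printed; the trimmed file is skipped.
        System.out.println(matching("*.parquet", files));
    }
}
```

The trimmed file is silently left out of the result, which is exactly what happens to a downstream job reading the directory with that pattern.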

The solution is pretty straightforward: set the extension by calling mConfig.getFileExtension(). I'll gladly submit a patch, but I need to know the desired behavior for extension names when compression is enabled. Do we keep both extensions (file.parquet.snappy) or just the compression extension (file.snappy)? My vote goes to keeping both, since that is the default with Spark and other big data tools.
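To make the two options concrete, here is a rough sketch of how the extension could be resolved. This is not the actual patch: `resolveExtension` and the `keepBoth` flag are hypothetical, and plain strings stand in for the values that `mConfig.getFileExtension()` and `codec.getDefaultExtension()` would return. I am assuming both include the leading dot.

```java
class TrimExtensionSketch {
    // Hypothetical helper, not part of Secor.
    // fileExtension:  stand-in for mConfig.getFileExtension(), e.g. ".parquet"
    // codecExtension: stand-in for codec.getDefaultExtension(), e.g. ".snappy";
    //                 null or empty when no compression codec is configured
    // keepBoth:       true -> "file.parquet.snappy", false -> "file.snappy"
    static String resolveExtension(String fileExtension, String codecExtension,
                                   boolean keepBoth) {
        String base = fileExtension == null ? "" : fileExtension;
        String codec = codecExtension == null ? "" : codecExtension;
        if (codec.isEmpty()) {
            // No codec configured: fall back to the configured file extension
            // instead of the empty string the current trim code ends up with.
            return base;
        }
        return keepBoth ? base + codec : codec;
    }

    public static void main(String[] args) {
        System.out.println("file" + resolveExtension(".parquet", "", true));
        System.out.println("file" + resolveExtension(".parquet", ".snappy", true));
        System.out.println("file" + resolveExtension(".parquet", ".snappy", false));
    }
}
```

Whichever option is chosen, the same rule would need to apply to the regular (non-trimmed) upload path so that both kinds of files match the same glob.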

Contributor
HenryCaiHaiying commented Aug 10, 2017 via email
