Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ID-1570 - Fix dice-where and update dce-id to use the latest version #112

Closed
wants to merge 6 commits into from

Conversation

snackk
Copy link
Member

@snackk snackk commented Aug 7, 2024

WHAT

Fix dice-where and update dce-id to use the newest dice-where version.


We call .md5() twice on the same StreamWithMD5Decorator object. This causes the second invocation to return the incorrect hash.

  • First call is newly added to the LocalFileAcceptor and S3FileAcceptor classes

  • Second call is already existing in S3Source.produce()

This causes dice-where to download a file successfully, but not return the correct DownloadExecutionResult when it’s finished processing. The error is triggered in the ID test DiceWhereDownloaderScheduleJobTest .

WHY

Recently dice-where was updated to add extra md5 checks prevent faulty downloads. We need to update id to use the newest version.

Sadly the above change also introduced a bug for services using dice-where as a library. This also needs to be fixed so that we can use the updated version in id.

AC

  • Dice-where is fixed so it no longer calls md5() twice on the same object.

  • dce-id testDiceWhereDownloaderScheduleJobTest passes with the updated version of dice-where

  • Updated dce-id can be deployed and successfully pulls in files from dce-shared.

Implementation Details

Usage

Creation: of to instantiate passing the original InputStream.
Reading: Because StreamWithMD5Decorator extends from InputStream, read is the same as it would on any InputStream.
Checksum: md5() method will return the md5 hash at any time as long as the original stream has been fully consumed.

Purpose

The current implementation allows for:

  • MD5 checksum calculation of an InputStream whilst allowing read from it - without changing the stream and updating the stream digest.
  • Access the MD5 checksum at any point after the stream has been consumed.

Drawbacks

StreamWithMD5Decorator buffers the entire stream in memory to be able to calculate the MD5 - This is not acceptable for large files.

Alternative PR ( WIP )

#113

Copy link

codecov bot commented Aug 8, 2024

Codecov Report

Attention: Patch coverage is 94.11765% with 1 line in your changes missing coverage. Please review.

Project coverage is 59.02%. Comparing base (6703b9d) to head (7a7a335).

Files Patch % Lines
...here/downloader/stream/StreamWithMD5Decorator.java 94.11% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master     #112      +/-   ##
============================================
+ Coverage     58.75%   59.02%   +0.26%     
- Complexity      370      374       +4     
============================================
  Files            97       97              
  Lines          1913     1923      +10     
  Branches        167      167              
============================================
+ Hits           1124     1135      +11     
+ Misses          730      729       -1     
  Partials         59       59              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@snackk snackk requested a review from lcardito August 8, 2024 15:34
public MD5Checksum md5() {
String hex = (new HexBinaryAdapter()).marshal(this.md5.digest());
return MD5Checksum.of(hex);
if (duplicatedStream.isEmpty()) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aren't we consuming stream on the constructor? is this possible to happen?
if not, instead of throwing an exception, wouldn't be possible to make sure we consume the stream here?

Copy link
Contributor

@julianhowarth julianhowarth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not sure what's going on in StreamWithMD5Decorator - either it needs documentation to explain it, or else we are performing operations in an odd order and/or unnecessarily.

Comment on lines -75 to -77
byte[] buffer = new byte[BUFFER];
while ((md5Is.read(buffer)) != -1) {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, what was this ever for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@julianhowarth the digest is incrementally added per each byte we read.
To be able to generate and compare the hash we need to consume every byte for the stream and then perform the MD5. Because we cannot re-use the same InputStream, what we're doing here is copying the stream ( therefore updating the digest for the original stream ) and provide a copied InputStream to be streamed elsewhere - that contains the exact md5 checksum of the original one.


private StreamWithMD5Decorator(DigestInputStream inputStream, MessageDigest md5) {
this.inputStream = inputStream;
private StreamWithMD5Decorator(DigestInputStream originalStream, MessageDigest md5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is lying and likely to cause issues. One of the major benefits of InputStreams is that they do not require the whole contents of the file (or other source) to be read into memory at any time. But this class subverts this as the passed stream is consumed and written to a temporary array - this is not what clients of this class would expect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we're indeed copying an InputStream for origin -
But if the stream is not read, I cannot generate an hash from it ( gives me d41d8cd98f00b204e9800998ecf8427e checksum for an empty input )
and if the stream is read it can't be re-used.
So this implementation tries to perform the md5 whilst letting you a InputStream to be used elsewhere

Comment on lines 22 to 23
private Optional<InputStream> duplicatedStream = Optional.empty();
private Optional<MD5Checksum> md5Checksum = Optional.empty();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this class is stateful but without any documentation to that effect. Would be much cleaner if the only state was within the inputstream itself.

}

@Override
public void close() throws IOException {
this.inputStream.close();
inputStream().close();
originalStream.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arguably, this is not ours to close, caller should be closing it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants