Consolidate unmarshalling and parsing in a single pass #5643

Open
wants to merge 11 commits into
base: master

Contributor

sugmanue commented Oct 4, 2024

Motivation and Context

Changes how JSON payloads are unmarshalled. Before this change, the process took two steps: first we parsed the JSON input into a JsonNode structure, which was then traversed to unmarshall it into an SdkPojo. After this change, a single pass builds the SdkPojo as we parse the input.

Changes

SdkPojo

The SdkPojo interface was changed to add a new method, sdkFieldNameToField(), that returns a Map<String, SdkField<?>>, which allows us to look up the field metadata for a given field name. We also changed the code generation to create this map in addition to the list of fields that we were creating before.
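
A minimal sketch of the idea, not the generated SDK code: FieldSketch is a hypothetical stand-in for SdkField, and keying the map by the wire name is an assumption of this example. It shows the map being built once from the existing list so that a lookup by name does not need to scan the list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for SdkField; the real class carries far more metadata.
final class FieldSketch {
    final String memberName;   // name of the member on the Java type
    final String locationName; // name as it appears in the JSON payload

    FieldSketch(String memberName, String locationName) {
        this.memberName = memberName;
        this.locationName = locationName;
    }
}

final class FieldLookupSketch {
    // The list of fields, as it was generated before this change.
    static final List<FieldSketch> FIELDS = Collections.unmodifiableList(Arrays.asList(
            new FieldSketch("tableName", "TableName"),
            new FieldSketch("itemCount", "ItemCount")));

    // The additional map, built once so lookups by name avoid scanning the list.
    // Keying by the wire name is an assumption of this sketch.
    static final Map<String, FieldSketch> NAME_TO_FIELD = index(FIELDS);

    static Map<String, FieldSketch> index(List<FieldSketch> fields) {
        Map<String, FieldSketch> byName = new HashMap<>();
        for (FieldSketch field : fields) {
            byName.put(field.locationName, field);
        }
        return Collections.unmodifiableMap(byName);
    }

    public static void main(String[] args) {
        System.out.println(NAME_TO_FIELD.get("ItemCount").memberName); // prints itemCount
        System.out.println(NAME_TO_FIELD.get("Unknown"));              // prints null
    }
}
```

An unknown name simply yields null, which lets a parser decide to skip the value instead of failing.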

JsonUnmarshallingParser

A new internal class, JsonUnmarshallingParser, was added; it is used by JsonProtocolUnmarshaller to unmarshall the input. This class creates a JSON parser and builds the expected Java types as the stream of tokens is read. Unmarshalling of simple types is still done using the registered unmarshallers.
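
To illustrate the single-pass approach, here is a rough sketch under simplifying assumptions. It is not the JsonUnmarshallingParser implementation: the Kind, Token, and Item types and the SETTERS map are hypothetical, the token stream is pre-built instead of coming from a real JSON parser, and only a flat object with scalar values is handled. The point is that the target object is populated while tokens are consumed, with no intermediate tree.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.BiConsumer;

final class SinglePassSketch {
    // Hypothetical token kinds; a real streaming parser exposes a richer set.
    enum Kind { START_OBJECT, FIELD_NAME, VALUE_STRING, VALUE_NUMBER, END_OBJECT }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Hypothetical target type standing in for a generated builder.
    static final class Item {
        String name;
        long count;
    }

    // Field metadata keyed by wire name: how to convert the raw text and set it.
    static final Map<String, BiConsumer<Item, String>> SETTERS = new HashMap<>();
    static {
        SETTERS.put("Name", (item, text) -> item.name = text);
        SETTERS.put("Count", (item, text) -> item.count = Long.parseLong(text));
    }

    // Consumes the token stream once and fills the target as tokens arrive.
    static Item unmarshall(Iterator<Token> tokens) {
        if (!tokens.hasNext() || tokens.next().kind != Kind.START_OBJECT) {
            throw new IllegalArgumentException("expected start of object");
        }
        Item item = new Item();
        while (tokens.hasNext()) {
            Token token = tokens.next();
            if (token.kind == Kind.END_OBJECT) {
                return item;
            }
            if (token.kind != Kind.FIELD_NAME || !tokens.hasNext()) {
                throw new IllegalArgumentException("expected field name followed by a value");
            }
            Token value = tokens.next();
            if (value.kind != Kind.VALUE_STRING && value.kind != Kind.VALUE_NUMBER) {
                throw new IllegalArgumentException("only scalar values are handled in this sketch");
            }
            BiConsumer<Item, String> setter = SETTERS.get(token.text);
            if (setter != null) {
                setter.accept(item, value.text); // unknown fields are skipped
            }
        }
        throw new IllegalArgumentException("unexpected end of input");
    }
}
```

The input is only known to be malformed once the stream runs out or an unexpected token shows up, which is why the error checks sit inside the loop.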

Benchmarks results

This change improves unmarshalling performance in general. In plain JSON, shapes with labeled simple types (as in members of structures) benefit the most, whereas collections of datapoints (e.g., lists of doubles) do not seem to benefit much, mainly because parsing the values from strings dominates the time.

V2DynamoDbAttributeValue

Below is the before result of running the V2DynamoDbAttributeValue.getItem benchmark

Benchmark                         (testItem)  Mode  Cnt      Score      Error  Units
V2DynamoDbAttributeValue.getItem        TINY  avgt    5    736.556 ±   48.044  ns/op
V2DynamoDbAttributeValue.getItem       SMALL  avgt    5   2609.546 ±   98.809  ns/op
V2DynamoDbAttributeValue.getItem        HUGE  avgt    5  20363.496 ± 1149.802  ns/op

And the after results,

Benchmark                         (testItem)  Mode  Cnt      Score     Error  Units
V2DynamoDbAttributeValue.getItem        TINY  avgt    5    402.055 ±   8.297  ns/op
V2DynamoDbAttributeValue.getItem       SMALL  avgt    5    919.009 ±  12.303  ns/op
V2DynamoDbAttributeValue.getItem        HUGE  avgt    5   6400.011 ± 146.718  ns/op

And the after V2 results, updated after enabling Jackson's fast float parsing. Notice how this change improves the performance, as there is no longer a need to create intermediate objects.

Benchmark                         (testItem)  Mode  Cnt     Score    Error  Units
V2DynamoDbAttributeValue.getItem        TINY  avgt    5   386.915 ± 25.105  ns/op
V2DynamoDbAttributeValue.getItem       SMALL  avgt    5   922.134 ±  3.386  ns/op
V2DynamoDbAttributeValue.getItem        HUGE  avgt    5  5948.790 ± 19.249  ns/op

The improvement, comparing before and after, is:

Benchmark                         (testItem)  Before      After      Improvement
V2DynamoDbAttributeValue.getItem        TINY    736.556    402.055    1.83x
V2DynamoDbAttributeValue.getItem       SMALL   2609.546    919.009    2.84x
V2DynamoDbAttributeValue.getItem        HUGE  20363.496   6400.011    3.18x

Edit: the improvement, comparing before and after V2, is:

Benchmark                         (testItem)  Before      After      Improvement
V2DynamoDbAttributeValue.getItem        TINY    736.556    386.915    1.90x
V2DynamoDbAttributeValue.getItem       SMALL   2609.546    922.134    2.82x
V2DynamoDbAttributeValue.getItem        HUGE  20363.496   5948.790    3.42x
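
For reference, the Improvement column is the ratio of the before score to the after score, both in ns/op, so higher is better. A small sketch of that arithmetic, using the before and after V2 scores reported above (the class and method names are only for illustration):

```java
import java.util.Locale;

final class SpeedupSketch {
    // Speedup is the average time before divided by the average time after.
    static double speedup(double beforeNsPerOp, double afterNsPerOp) {
        return beforeNsPerOp / afterNsPerOp;
    }

    public static void main(String[] args) {
        String[] items = {"TINY", "SMALL", "HUGE"};
        double[] before = {736.556, 2609.546, 20363.496};
        double[] afterV2 = {386.915, 922.134, 5948.790};
        for (int i = 0; i < items.length; i++) {
            // Locale.ROOT keeps the decimal separator a dot on any machine.
            System.out.println(String.format(Locale.ROOT, "%-5s %.2fx",
                    items[i], speedup(before[i], afterV2[i])));
        }
    }
}
```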

JsonMarshallerBenchmark

This change also shows a ~2x performance improvement in JsonMarshallerBenchmark.unmarshall for RPCv2, but not so much for AWS JSON. Below is the before result of the benchmark:

Benchmark                              (protocol)  (size)  Mode  Cnt      Score      Error  Units
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2   small  avgt    5   2053.930 ±  151.090  ns/op
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2  medium  avgt    5   3080.472 ±  262.710  ns/op
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2     big  avgt    5  13009.513 ± 1116.719  ns/op
JsonMarshallerBenchmark.unmarshall       aws-json   small  avgt    5   5259.660 ±   68.543  ns/op
JsonMarshallerBenchmark.unmarshall       aws-json  medium  avgt    5  10676.408 ±  113.583  ns/op
JsonMarshallerBenchmark.unmarshall       aws-json     big  avgt    5  43582.916 ±  316.564  ns/op

And the after results,

Benchmark                              (protocol)  (size)  Mode  Cnt      Score      Error  Units
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2   small  avgt    5   1096.103 ±   21.391  ns/op
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2  medium  avgt    5   1978.710 ±  145.267  ns/op
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2     big  avgt    5   7171.007 ±   32.614  ns/op
JsonMarshallerBenchmark.unmarshall       aws-json   small  avgt    5   5230.546 ±   80.274  ns/op
JsonMarshallerBenchmark.unmarshall       aws-json  medium  avgt    5  11179.496 ±  964.445  ns/op
JsonMarshallerBenchmark.unmarshall       aws-json     big  avgt    5  41770.648 ± 1723.839  ns/op

And the after V2 results, updated after enabling Jackson's fast float parsing. Notice how RPCv2 also improves, as there is no longer a need to create intermediate objects.

Benchmark                              (protocol)  (size)  Mode  Cnt      Score      Error  Units
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2   small  avgt    5    870.653 ±   17.654  ns/op
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2  medium  avgt    5   1361.450 ±    9.747  ns/op
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2     big  avgt    5   4614.679 ±   17.724  ns/op
JsonMarshallerBenchmark.unmarshall       aws-json   small  avgt    5   2792.786 ±   12.847  ns/op
JsonMarshallerBenchmark.unmarshall       aws-json  medium  avgt    5   5647.308 ±   56.101  ns/op
JsonMarshallerBenchmark.unmarshall       aws-json     big  avgt    5  22556.135 ± 1178.098  ns/op

The improvement, comparing before and after, is:

Benchmark                              (protocol)  (size)  Before     After      Improvement
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2   small   2053.930   1096.103  1.87x
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2  medium   3080.472   1978.710  1.56x
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2     big  13009.513   7171.007  1.81x
JsonMarshallerBenchmark.unmarshall       aws-json   small   5259.660   5230.546  1.01x
JsonMarshallerBenchmark.unmarshall       aws-json  medium  10676.408  11179.496  0.95x
JsonMarshallerBenchmark.unmarshall       aws-json     big  43582.916  41770.648  1.04x

Edit: the improvement, comparing before and after V2, is:

Benchmark                              (protocol)  (size)  Before     After      Improvement
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2   small   2053.930    870.653  2.35x
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2  medium   3080.472   1361.450  2.26x
JsonMarshallerBenchmark.unmarshall  smithy-rpc-v2     big  13009.513   4614.679  2.81x
JsonMarshallerBenchmark.unmarshall       aws-json   small   5259.660   2792.786  1.88x
JsonMarshallerBenchmark.unmarshall       aws-json  medium  10676.408   5647.308  1.89x
JsonMarshallerBenchmark.unmarshall       aws-json     big  43582.916  22556.135  1.93x

Modifications

Testing

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Checklist

  • I have read the CONTRIBUTING document
  • Local run of mvn install succeeds
  • My code follows the code style of this project
  • My change requires a change to the Javadoc documentation
  • I have updated the Javadoc documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed
  • I have added a changelog entry. Adding a new entry must be accomplished by running the scripts/new-change script and following the instructions. Commit the new file created by the script in .changes/next-release with your changes.
  • My change is to implement 1.11 parity feature and I have updated LaunchChangelog

License

  • I confirm that this pull request can be released under the Apache 2 license

@sugmanue force-pushed the sugmanue/unmarshalling-single-pass branch 8 times, most recently from b266dd7 to 7c8ea44, October 5, 2024 16:19
@sugmanue force-pushed the sugmanue/unmarshalling-single-pass branch from 7fd9d12 to 99923f8, October 7, 2024 22:54
@sugmanue marked this pull request as ready for review, October 7, 2024 23:04
@sugmanue requested a review from a team as a code owner, October 7, 2024 23:04
@@ -3065,5 +3113,10 @@ public AllTypesRequest build() {
        public List<SdkField<?>> sdkFields() {
            return SDK_FIELDS;
        }

        @Override
        public Map<String, SdkField<?>> sdkFieldNameToField() {
Contributor

Does it need to return a collection? Doesn't seem like we use any of the collection functionality anywhere. Can we just have a getByName type method, that returns the SdkField (or null)?

Contributor Author

Yes, this method has to return a map, as defined in the SdkPojo interface. While parsing, when we find a name, we use the map to figure out which field that name maps to. Both the data class and the builder implement SdkPojo.

Comment on lines +227 to +230
        unmarshallFromJson(sdkPojo, response.content().get());
        return unmarshallResponse(sdkPojo, response);
    }
    return unmarshallFromJson(sdkPojo, response.content().get());
Contributor

Are these get()s guaranteed to be safe here?

Contributor Author

At this point yes, otherwise hasJsonPayload above returns false and the first branch will be used.
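
The guard described here can be sketched as follows. This is only an illustration under assumptions: Response is a hypothetical stand-in for the HTTP response type, and the real hasJsonPayload check may look at more than the presence of content. It shows why calling get() is safe once the guard has passed.

```java
import java.util.Optional;

final class PayloadGuardSketch {
    // Hypothetical stand-in for the HTTP response; only the optional body matters here.
    static final class Response {
        private final Optional<String> content;

        Response(String body) {
            this.content = Optional.ofNullable(body);
        }

        Optional<String> content() {
            return content;
        }
    }

    // Assumed shape of the check: there is a payload only when content is present.
    static boolean hasJsonPayload(Response response) {
        return response.content().isPresent();
    }

    static String unmarshall(Response response) {
        if (!hasJsonPayload(response)) {
            return "empty"; // first branch: nothing to parse
        }
        // Safe: the guard above already established that content is present.
        return "parsed:" + response.content().get();
    }
}
```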

@@ -224,23 +217,111 @@ public T unmarshall(JsonUnmarshallerContext context,

public <TypeT extends SdkPojo> TypeT unmarshall(SdkPojo sdkPojo,
Contributor

dumb question: What is the reason for the branching here? i.e. why isn't this just a call to unmarshallResponse? I assume unmarshallFromJson is the fast-path here?

Contributor Author

Yes, that's the idea. Most of the complications here are to support restJson. RPC protocols are a lot simpler and do not need any of that, so we attempt to separate one from the other to avoid paying the cost for all.

Contributor Author

"Can we add more documentation to this class? Lot going on"

Sure, I will look into what can be improved.

"Can we add additional tests for some uncovered branches?"

I will remove some of the branches that are not used. If you look closely, most of the remaining ones are about error recovery from invalid JSON. For example, after an opening bracket we expect either a field name or a closing bracket; if that's not the case, the JSON is not valid, which we don't know until we finish consuming it. I will look into covering more than we do now.

Comment on lines +90 to +95
if (token == null) {
    return (SdkPojo) ((Buildable) pojo).build();
}
if (token == JsonToken.VALUE_NULL) {
    return null;
}
Contributor

Hmm intuitively, seems like these cases should return the same value. When would token be null?

Contributor Author

The difference is between an input of length 0 and an input with a literal value null.
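
A sketch of that distinction, with assumptions stated up front: firstToken is a hypothetical helper that mimics how a streaming parser reports its first token (Jackson's JsonParser.nextToken() returns null at end of input), and the string results stand in for the built pojo. It does not parse real JSON.

```java
final class NullTokenSketch {
    // Hypothetical token kinds; only the null literal is distinguished here.
    enum Token { VALUE_NULL, OTHER }

    // Returns null when there is nothing to read, mirroring end of input.
    static Token firstToken(String input) {
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        return trimmed.equals("null") ? Token.VALUE_NULL : Token.OTHER;
    }

    static String unmarshall(String input) {
        Token token = firstToken(input);
        if (token == null) {
            return "empty pojo"; // input of length 0: build the pojo with no fields set
        }
        if (token == Token.VALUE_NULL) {
            return null;         // literal null: there is no object at all
        }
        return "pojo";
    }

    public static void main(String[] args) {
        System.out.println(unmarshall(""));     // prints empty pojo
        System.out.println(unmarshall("null")); // prints null
    }
}
```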

Contributor

dagnir commented Oct 11, 2024

LGTM, can we run this in the canaries?

sonarcloud bot commented Oct 14, 2024

Quality Gate failed

Failed conditions
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarCloud

2 participants