imlib/filter: Vectorize morph() kernel. #2415
Open
kwagyeman wants to merge 2 commits into openmv:master
Depends on #2417.
Benchmark results here: https://docs.google.com/spreadsheets/d/1-FNVKCEr8-6UYs8MUm6wgsOt2c8ihJ2mg9QXKkG91os/edit?gid=452211341#gid=452211341
AE3 performance with Helium is 4.2x faster than the RT1062.
Otherwise, note that this PR reduces the performance of the morph kernel by 50% for grayscale 3x3 kernels; that is the cost of making the code generic and vectorizable. The previous code provided the best possible speed on M4/M7 architectures, but it could not be vectorized and only applied to 3x3 kernels. The new code offers vectorized processing for any kernel size.
Given the massive performance gain Helium has over the scalar code, this tradeoff makes sense.
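To illustrate what "generic and vectorizable" can look like, here is a minimal scalar sketch. It is not the code in this PR: `morph_row_sketch`, `clamp_int`, the int8 kernel weights, and the clamped border handling are all assumptions made for the example. The idea is to loop over kernel taps on the outside and over pixels on the inside, so each tap becomes one multiply-accumulate across a whole row, which is the kind of loop a vector unit can handle for any kernel size.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper: clamp v into [lo, hi] (used for replicated borders).
static inline int clamp_int(int v, int lo, int hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

// Hypothetical sketch, not the PR's implementation.
// Computes the weighted sum for output row y of a w x h grayscale image,
// using a (2*ksize+1) x (2*ksize+1) kernel of int8 weights.
// Results go into acc[0..w-1] as 16-bit accumulators.
static void morph_row_sketch(const uint8_t *src, int w, int h, int y,
                             const int8_t *krn, int ksize, int16_t *acc) {
    const int n = 2 * ksize + 1;
    memset(acc, 0, (size_t) w * sizeof(int16_t));

    for (int ky = 0; ky < n; ky++) {
        // Source row for this kernel row, clamped at the top/bottom edges.
        const uint8_t *row = src + clamp_int(y + ky - ksize, 0, h - 1) * w;

        for (int kx = 0; kx < n; kx++) {
            const int16_t k = krn[ky * n + kx];

            // Inner loop: one tap applied across the whole row. A vectorized
            // version would process several x positions per iteration; the
            // per-pixel clamp here is only to keep the sketch short.
            for (int x = 0; x < w; x++) {
                const int sx = clamp_int(x + kx - ksize, 0, w - 1);
                acc[x] = (int16_t) (acc[x] + k * row[sx]);
            }
        }
    }
}
```

Nothing in the loop structure depends on the kernel being 3x3, so the same function works for `ksize` of 1, 2, or larger. The price is that it cannot use 3x3-specific tricks, which is consistent with the slowdown described above.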
The mul/add arguments were dropped because they cannot be handled without complicating the default loop case. They can also easily overflow the 16-bit accumulators being used.
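A rough sketch of the overflow concern, under stated assumptions: the helpers `weighted_sum`, `scale_sum`, and `fits_int16` are hypothetical, and mul/add are treated as integers here purely to keep the arithmetic easy to follow. The sum is computed in 32 bits so the true value can be checked against the int16_t range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper: weighted sum over n kernel taps, computed in 32 bits
// so the true value is visible before any narrowing to 16 bits.
static int32_t weighted_sum(const uint8_t *pix, const int8_t *krn, int n) {
    assert(n > 0);
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t) pix[i] * krn[i];
    }
    return acc;
}

// Hypothetical post-scaling in the spirit of the dropped mul/add arguments
// (integer-valued here for simplicity; this is an assumption).
static int32_t scale_sum(int32_t sum, int32_t mul, int32_t add) {
    return sum * mul + add;
}

// True if v can be held in a 16-bit signed accumulator without overflow.
static bool fits_int16(int32_t v) {
    return v >= INT16_MIN && v <= INT16_MAX;
}
```

For a 3x3 kernel of ones over saturated pixels, the raw sum is 9 * 255 = 2295, which fits in an int16_t. Scaling that by a hypothetical mul of 16 gives 36720, which exceeds INT16_MAX (32767), so applying the scale inside a 16-bit accumulator would overflow.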