
[S#3543] Piece-wise linear compression of column groups first working prototype #2415 #2420

Open
mori49 wants to merge 25 commits into apache:main from mori49:main

Conversation

@mori49 mori49 commented Jan 29, 2026

No description provided.

@janniklinde janniklinde (Contributor) left a comment

Thank you for your first contribution @mori49, this is a good start. I left some comments in the code. You used segmented least squares, which is a fine approach (even though control over the actual loss is quite limited). One limiting factor is the O(n³) compression complexity, which is not viable for production compression. This particular approach can be optimized to O(n²), for example by precomputing prefix sums or an SSE matrix (please first address the smaller formatting issues and other code suggestions before attempting that optimization).
In general, we may think of a more lightweight and accurate method to preserve targetLoss as an upper bound (this will come after the first submission deadline).
Also, please avoid German comments or variable/method names in your contribution.
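To make the O(n²) suggestion concrete: a minimal sketch of the prefix-sum idea, not code from this PR. It assumes 1-D points (x_k = k, y_k = values[k]); the class and method names are illustrative only. With five prefix arrays built in O(n), the least-squares SSE of any segment [i, j] is an O(1) lookup, so evaluating all O(n²) candidate segments in the DP costs O(n²) overall instead of O(n³).

```java
// Sketch: O(n) precomputation of prefix sums so the least-squares SSE of
// any segment [i, j] can be evaluated in O(1), bringing segmented least
// squares from O(n^3) down to O(n^2). Names are illustrative.
public class SegmentSSE {
	private final double[] px, py, pxx, pxy, pyy; // prefix sums; index 0 = empty prefix

	public SegmentSSE(double[] y) {
		int n = y.length;
		px = new double[n + 1]; py = new double[n + 1];
		pxx = new double[n + 1]; pxy = new double[n + 1]; pyy = new double[n + 1];
		for (int k = 0; k < n; k++) {
			double x = k;
			px[k + 1] = px[k] + x;
			py[k + 1] = py[k] + y[k];
			pxx[k + 1] = pxx[k] + x * x;
			pxy[k + 1] = pxy[k] + x * y[k];
			pyy[k + 1] = pyy[k] + y[k] * y[k];
		}
	}

	/** Residual sum of squares of the best-fit line on points i..j (inclusive), in O(1). */
	public double sse(int i, int j) {
		double n = j - i + 1;
		double sx = px[j + 1] - px[i], sy = py[j + 1] - py[i];
		double sxx = pxx[j + 1] - pxx[i], sxy = pxy[j + 1] - pxy[i], syy = pyy[j + 1] - pyy[i];
		double cxx = sxx - sx * sx / n; // centered (co)variances, unnormalized
		double cxy = sxy - sx * sy / n;
		double cyy = syy - sy * sy / n;
		if (cxx == 0) // single point or degenerate x
			return Math.max(cyy, 0);
		return Math.max(cyy - cxy * cxy / cxx, 0); // clamp tiny negative round-off
	}
}
```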

Contributor:

Move this class from colgroup/scheme package to colgroup/.
In general, all methods that are currently unimplemented should throw new NotImplementedException()

Contributor:

This file should not be part of the PR. You can keep it locally but you should untrack it and not add it to your commits. You could use git rm --cached bin/systemds-standalone.sh.

Contributor:

It seems like you reformatted the file to revert the tabs -> spaces conversion, which is good. However, there are still many unnecessary changes. I would recommend that you revert this file to the original state of the repository and then only add the enum value PiecewiseLinear to CompressionType.

Comment on lines 1077 to 1250
Contributor:

To keep this file clean, I recommend that you create a new class called PiecewiseLinearUtils in the package functional. Your compressPiecewiseLinearFunctional(...) then just calls PiecewiseLinearUtils.compressSegmentedLeastSquares(...).

Contributor:

Here, please revert the file. Did you change anything in this file (except the tabs -> spaces conversion, which should be reverted)?

You might consider creating a variable double targetLoss and a method public CompressionSettingsBuilder setTargetLoss(double loss) {...}. If you then add targetLoss as a parameter of the CompressionSettings constructor, you can set the target loss directly via the CompressionSettingsBuilder.
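The builder suggestion could look roughly like this. The names setTargetLoss and targetLoss follow the reviewer's wording, but the surrounding classes are simplified stand-ins here, not the real SystemDS CompressionSettings/CompressionSettingsBuilder:

```java
// Sketch of the suggested builder extension (simplified stand-ins for the
// real CompressionSettings / CompressionSettingsBuilder classes).
public class CompressionSettingsBuilder {
	private double targetLoss = 0.0; // default: lossless behavior assumed

	public CompressionSettingsBuilder setTargetLoss(double loss) {
		this.targetLoss = loss;
		return this; // fluent style, matching the existing builder setters
	}

	public CompressionSettings create() {
		// pass the target loss through the settings constructor
		return new CompressionSettings(targetLoss);
	}

	public static class CompressionSettings {
		public final double targetLoss;

		public CompressionSettings(double targetLoss) {
			this.targetLoss = targetLoss;
		}
	}
}
```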

Contributor:

Remove the JUnit Jupiter assertions; they will cause the build to fail, as we don't use Jupiter.

Contributor:

There should be no underscores in method names.

Move this test file to test/component/compress/colgroup.

You have a lot of isolated tests (which also look autogenerated rather than handwritten). It would be nice to have more meaningful tests. Please remove some redundant ones and add tests on randomly generated data (with a fixed seed) where you create a ColGroupPiecewiseLinearCompressed and then call decompressToDenseBlock. Then compare the result to the original data and compute a loss (which should be no more than some upper bound).
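The suggested round-trip test could be sketched as follows. Since the real ColGroupPiecewiseLinearCompressed and decompressToDenseBlock are not available here, the compression step is a self-contained stand-in (one least-squares line per fixed-size block); only the test pattern itself — fixed seed, compress, decompress, compare against a loss bound — is what the review asks for:

```java
// Sketch of the suggested round-trip test: generate random data with a fixed
// seed, "compress" it piecewise-linearly (one least-squares line per fixed-size
// block, standing in for ColGroupPiecewiseLinearCompressed), reconstruct,
// and check that the loss stays under an upper bound.
public class PiecewiseRoundTripSketch {

	/** Fit y = a + b*x on each block of blockSize rows and reconstruct. */
	public static double[] compressDecompress(double[] y, int blockSize) {
		double[] out = new double[y.length];
		for (int start = 0; start < y.length; start += blockSize) {
			int end = Math.min(start + blockSize, y.length);
			double n = end - start, sx = 0, sy = 0, sxx = 0, sxy = 0;
			for (int k = start; k < end; k++) {
				sx += k; sy += y[k]; sxx += (double) k * k; sxy += k * y[k];
			}
			double denom = n * sxx - sx * sx;
			double b = denom == 0 ? 0 : (n * sxy - sx * sy) / denom; // slope
			double a = (sy - b * sx) / n;                            // intercept
			for (int k = start; k < end; k++)
				out[k] = a + b * k;
		}
		return out;
	}

	/** Mean squared error between original and reconstructed data. */
	public static double mse(double[] a, double[] b) {
		double s = 0;
		for (int k = 0; k < a.length; k++)
			s += (a[k] - b[k]) * (a[k] - b[k]);
		return s / a.length;
	}
}
```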

Comment on lines 139 to 142
Contributor:

Weird comment

Contributor:

You take the first column, which is fine for now, but in a finished implementation you would either repeat compression on every column or do a multidimensional regression, where you treat a 'row' of all indices as a vector.
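The first option (repeating the compression on every column) could be sketched like this; a single-segment linear fit per column stands in for the full piecewise scheme, and all names are illustrative rather than taken from the PR:

```java
// Sketch: instead of compressing only column 0, repeat the (here:
// single-segment) least-squares linear fit independently for every column
// of the group. Names are illustrative.
public class PerColumnLinearFit {

	/** fits[c] = {intercept, slope} of the least-squares line for column c over row index x. */
	public static double[][] fitColumns(double[][] data) {
		int nRows = data.length, nCols = data[0].length;
		double[][] fits = new double[nCols][2];
		for (int c = 0; c < nCols; c++) {
			double sx = 0, sy = 0, sxx = 0, sxy = 0;
			for (int r = 0; r < nRows; r++) {
				sx += r; sy += data[r][c]; sxx += (double) r * r; sxy += r * data[r][c];
			}
			double denom = nRows * sxx - sx * sx;
			double slope = denom == 0 ? 0 : (nRows * sxy - sx * sy) / denom;
			fits[c][1] = slope;
			fits[c][0] = (sy - slope * sx) / nRows; // intercept
		}
		return fits;
	}
}
```

The multidimensional alternative would instead regress each row, viewed as a vector in R^nCols, against the row index in one pass, which shares segment boundaries across all columns of the group.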

Contributor:

Avoid emojis. Also, they are usually a hint of LLM-generated code (which is strictly forbidden for your submissions).

Labels: None yet
Projects: Status: In Progress
2 participants