Group #

Principle #

DBPlusEngine’s grouping function can be divided into streaming grouping and in-memory grouping. Streaming grouping means loading data into memory for grouping calculation item by item by means of streaming query, streaming grouping requires that the SQL sort item must be consistent with the field and sort type (ASC or DESC) of the group item.

In-memory grouping is when the query result set is loaded into memory one at a time and then grouped into memory. In-memory grouping is suitable for scenarios where the fields and sort types (ASC or DESC) of the SQL sort items and group items cannot be aligned, and in-memory grouping takes up more memory.

Advice #

Ensure that the fields of the sorted and grouped items and the sort type (ASC or DESC) of SQL are the same in order to use streaming grouping and reduce memory consumption;

Note: If sorting by AB, grouping or sorting by BA may not be used to stream processing as the fields are inconsistent.

Example #

Take the following SQL as an example:

copySELECT name, SUM(score) FROM t_score GROUP BY name ORDER BY name;

Assuming that the table structure contains the candidate’s name (for simplicity, no duplicate names are considered) and score based on subject slicing, this SQL allows us to obtain the total score for each candidate.

In the case where the grouped items are identical to the sorted items, the data obtained is continuous and the total number of data required for the grouping exists in the data values pointed to by the current cursor of each data result set, so streaming merging can be used. This is shown in the diagram below.

group1

When DBPlusEngine performs the merge process, the logic is similar to a sorted merge. The diagram below shows how streaming grouping and merging work when making the next call.

group2

The diagram shows that when the first next call is made, t_score_java, which is at the top of the queue, is ejected from the queue, along with the rest of the result set with the same group value of “Jerry”. After the scores of all the students with the name “Jerry” have been obtained, they are added up, so that after the first next call, the result set is the sum of the scores of “Jerry”. At the same time, all the cursors in the result set are moved down to the next different data value from the data value “Jerry” and reordered according to the value to which the cursor in the result set currently points. As a result, the relevant data result set containing “John” in the second position by name is at the top of the queue.

For cases where the grouped items do not match the sorted items, as the data values associated with the grouping to be fetched are not contiguous, streaming merging cannot be used and all the result set data needs to be loaded into memory for grouping and aggregation. For example, if the total score of each candidate is obtained and sorted from highest to lowest by the following SQL:

copySELECT name, SUM(score) FROM t_score GROUP BY name ORDER BY score DESC;

Then the data taken out of each data result set is the same as the original data in the table structure in the top half of the sorted merge chart, and it is not possible to perform a streaming merge.

When the SQL contains only grouping statements, the sorting order may not necessarily be the same as the grouping order depending on the database implementation. However, the absence of a sort statement means that this SQL does not care about the sort order. Therefore, DBPlusEngine automatically adds sorting terms that are consistent with the grouping terms through a rewrite of SQL optimization, enabling it to convert from a memory-consuming in-memory grouping consolidation approach to a streaming grouping consolidation scheme.