June 28, 2013 at 4:24 am
Dear SQL Experts,
I need to housekeep a large table essentially composed of an ID column and a creation date (not timestamp) column. However, I don't have direct write permissions on the database, so I need to use a batch job program to do it. This program accepts the maximum retention days as a parameter (i.e., it keeps all data from today minus X days and deletes everything older).
StoreID; Date
123; 01/18/2013
124; 01/18/2013
125; 01/18/2013
126; 01/19/2013
127; 01/19/2013
128; 01/19/2013
...
The application server has a memory limitation, so I need to set the retention days parameter gradually. I ran some tests, which showed that the batch job can load and delete up to about 250,000 rows per run.
Therefore, I need to know how I could split the rows into groups of up to 250,000 rows each and know the last (maximum) date of each group. Could you please help me?
Thanks in advance.
Best Regards,
Darthurfle
June 28, 2013 at 4:26 am
The outcome I need is something like:
Date; Count
01/21/2013; 248,389
02/17/2013; 229,111
04/02/2013; 249,632
June 28, 2013 at 5:30 am
Well, it seems to me the NTILE() function could split your data into groups, but your expected results look like simple GROUP BY results.
Does something like this help? Note that NTILE takes the number of groups (not the group size), so against your bigger data set the NTILE(5) below would become roughly total rows / 250,000, per your notes.
SELECT '123' AS [StoreID],'01/18/2013' AS [Date] INTO #MySampleData UNION ALL
SELECT '124','01/18/2013' UNION ALL
SELECT '125','01/18/2013' UNION ALL
SELECT '126','01/19/2013' UNION ALL
SELECT '127','01/19/2013' UNION ALL
SELECT '128','01/19/2013'
SELECT NTILE(5) OVER(PARTITION BY [Date] ORDER BY [Date]) AS NT, * FROM #MySampleData
SELECT [Date],COUNT(*) FROM #MySampleData GROUP BY [Date]
Lowell
June 28, 2013 at 6:22 am
June 28, 2013 at 10:53 am
Hi Lowell,
Thank you for your help, but I had already tried to use NTILE, with no success. The problem is that it creates overlapping groups (as the creation field holds only the date, not a timestamp).
As a result, I get something like:
myGroup; MinDate; MaxDate; RowCount
1; 2012-02-11 00:00:00.000; 2012-06-14 00:00:00.000; 255450
2; 2012-06-14 00:00:00.000; 2012-07-12 00:00:00.000; 255450
3; 2012-07-12 00:00:00.000; 2012-08-10 00:00:00.000; 255450
4; 2012-08-10 00:00:00.000; 2012-09-11 00:00:00.000; 255450
5; 2012-09-11 00:00:00.000; 2012-10-04 00:00:00.000; 255450
6; 2012-10-04 00:00:00.000; 2012-10-18 00:00:00.000; 255450
In the above example, there are A records whose creation date is 2012-06-14 that NTILE classified as group 1, and B records with the same date that it classified as group 2. If I count the number of records <= "2012-06-14", I get more than 255,450 records (and I get a memory dump).
I somehow need to treat this as a kind of knapsack problem (https://en.wikipedia.org/wiki/Knapsack_problem) with multiple knapsacks, each with a maximum capacity of 256,000.
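Since the retention parameter can only cut on whole days, each group has to be a consecutive range of dates, so this particular knapsack seems to reduce to a greedy running-total pass over the daily counts. A throwaway T-SQL sketch of that idea (not the vendor's tool; #DailyCounts, #Groups and dbo.BigTable are made-up names, and #DailyCounts is assumed to hold one row per creation date with its record count):
-- Greedy grouping of daily counts into consecutive batches of at most @Cap rows.
-- #DailyCounts is assumed to be populated beforehand, e.g.:
-- INSERT INTO #DailyCounts SELECT [Date], COUNT(*) FROM dbo.BigTable GROUP BY [Date];
CREATE TABLE #DailyCounts ([Date] datetime NOT NULL PRIMARY KEY, Cnt int NOT NULL);
CREATE TABLE #Groups (GroupNo int, MaxDate datetime, GroupTotal int);

DECLARE @Cap int, @Running int, @GroupNo int, @Date datetime, @Cnt int, @PrevDate datetime;
SELECT @Cap = 250000, @Running = 0, @GroupNo = 1;

DECLARE c CURSOR LOCAL FAST_FORWARD FOR
    SELECT [Date], Cnt FROM #DailyCounts ORDER BY [Date];
OPEN c;
FETCH NEXT FROM c INTO @Date, @Cnt;
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @Running > 0 AND @Running + @Cnt > @Cap
    BEGIN
        -- close the current group before this date would overflow it
        INSERT INTO #Groups VALUES (@GroupNo, @PrevDate, @Running);
        SELECT @GroupNo = @GroupNo + 1, @Running = 0;
    END;
    -- note: a single date holding more than @Cap rows cannot be split by a date cutoff
    SELECT @Running = @Running + @Cnt, @PrevDate = @Date;
    FETCH NEXT FROM c INTO @Date, @Cnt;
END;
IF @Running > 0
    INSERT INTO #Groups VALUES (@GroupNo, @PrevDate, @Running);
CLOSE c;
DEALLOCATE c;

-- one row per housekeeping run: the last creation date it may cover and its row count
SELECT GroupNo, MaxDate, GroupTotal FROM #Groups ORDER BY GroupNo;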
June 28, 2013 at 11:11 am
Hi Bala,
As far as I could understand, my problem is the same as the one described in the challenge. Unfortunately, the last post is dated October 19, 2009 and no solution was provided :[
Guess I am in trouble.
Best Regards,
Darthurfle
June 28, 2013 at 11:24 am
OK, I guess I still don't understand the issue; I think you are overcomplicating things too much.
Correct me if I am wrong, but you need to delete ALL data that is older than X days, whether it's 2K rows or 2 million rows, right?
And you are introducing the 250K constraint because of an observed behavior, right? Anything roughly over that number of rows gets a memory error?
Why can you not just use a loop and delete in batches until it's done?
This is commonly done to prevent blocking access to an entire table, where locks get escalated to table locks because so many rows are deleted in one batch.
Keeping a smaller batch size might take only row- or page-level locks, resulting in less blocking.
the fix is usually something like this:
SET ROWCOUNT 50000;
WHILE 1 = 1
BEGIN
    DELETE FROM dbo.myTable
    WHERE SomeColumn < DATEADD(dd, -180, GETDATE()); -- i.e., your retention condition
    IF @@ROWCOUNT = 0
        BREAK;
END
SET ROWCOUNT 0;
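(For what it's worth, SET ROWCOUNT is deprecated for INSERT/UPDATE/DELETE from SQL Server 2005 onward; the equivalent modern form of the same loop uses DELETE TOP:)
WHILE 1 = 1
BEGIN
    -- delete at most 50,000 matching rows per iteration
    DELETE TOP (50000) FROM dbo.myTable
    WHERE SomeColumn < DATEADD(dd, -180, GETDATE());
    IF @@ROWCOUNT = 0
        BREAK;
END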
Lowell
June 28, 2013 at 11:53 am
Hi Lowell,
I am not trying to make the solution harder than it is; it is a real-world situation.
The table in question is a SAP ECC add-on "standard" table that has relationships with several other tables. Usually in the SAP world, the infrastructure team, the database team and even SAP Support do not allow us to make changes directly to tables (they have no foreign keys, and all consistency is enforced by the application layer). If it is already difficult to find information about the SAP ECC tables, imagine an add-on specific to the utilities industry.
Any change made directly to the table is not supported by SAP and violates the product warranty.
That's why I cannot simply delete the rows: we don't have the documentation of the relationships (intellectual property), and we don't have direct access to the tables.
The application code uses several ABAP macros, inversion-of-control design patterns, dynamic injection, lazy proxies, subscriber/listener patterns, etc., which make it hard to trace which tables are involved.
The developer, however, made a program that deletes everything older than X days. The problem is that it may be using some kind of home-grown persistence framework that loads each row (does the relational-to-object mapping) before triggering the object's delete method. That is where the memory constraint comes from.
I cannot simply set X to 1, because the application servers will not have enough memory to load all rows from that table (and the related ones) in order to delete them.
We need to start X at a high value and decrease it gradually. But the person (a data center SAP BASIS analyst) who is going to do that cannot set a new value every day (and it would take ages). That is why we would like to do it during a scheduled batch execution window (perhaps once a week), setting appropriate X values.
June 28, 2013 at 12:12 pm
darthurfle (6/28/2013)
OK, that helps a bit; except... what is X?
So you need to delete via the application's interface; that's fine, we all hit issues like this.
What, specifically, can you pass to the application then? When you say you cannot simply pass X=1, does that mean everything older than 1 day?
If you can do a batch file, it still implies you could do a loop, so why could you not pass it, incrementally, say 3650 (365 * 10), meaning everything older than 10 years, then 3620 (9 years 11 months), etc., so the deletes happen in smaller, bite-size pieces instead of flirting with the known memory crash limit of ~250K rows?
If the goal is to use the application to delete older rows, does it matter if it's done in 2,000 smaller batches instead of 5 big batches of almost 250K?
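Just to illustrate the idea, a throwaway sketch that generates such a schedule of retention values (the start, target and step values are assumptions; the step would have to be tuned so that no single run covers more than ~250K rows):
-- Hypothetical fixed-step schedule of retention-day values (X) to pass to the
-- housekeeping program, one program run per printed value.
DECLARE @X int, @Target int, @Step int;
SELECT @X = 3650,      -- first run: delete everything older than ~10 years
       @Target = 180,  -- assumed final retention
       @Step = 30;     -- assumed decrement per run

WHILE @X >= @Target
BEGIN
    PRINT @X;
    SET @X = @X - @Step;
END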
Lowell
June 28, 2013 at 1:37 pm
Hi Lowell,
Thank you for your ideas; you were right. X corresponds to the retention days. In addition to what I explained, I must say that our SAP environment runs on an Oracle database (to which I don't have direct access). However, I tried to do the grouping analysis using Microsoft Access, with no success due to its own limitations, and now I was hoping that, using the analytical functions of SQL Server (in a development environment we have for another system), I would be able to figure out a solution (i.e., the number of times I need to run the housekeeping and which retention number I should use on each execution).
Regarding passing the variable to the application, the problem is again this proprietary German add-on developed with indirect calls, polymorphism, lazy materialization patterns, etc. Usually we run all SAP programs using the Control-M software by BMC (a kind of enterprise bus orchestrator that has technology-specific client agents). In that case we are able to create program "variants" (sets of values for the fields or parameters of the program's setup screen). However, the add-on's housekeeping program (like many other programs of this add-on) actually uses a generic program (a "process starter") that calls a polymorphic object class whose behavior is determined by values set in a common user interface (the user must browse a component tree view to find it). The add-on uses a specific job scheduler developed by the SAP guys to control the start and execution of its jobs. The X variable in this case is a variable of the polymorphic class, set through a GUI. To sum up: we tried, but we are not able to prepare several program execution variants beforehand. The data center SAP BASIS analyst would need to manually stop the execution of the housekeeping task through the internal job scheduler, navigate to the housekeeping task definition, set the retention variable, and schedule it to run (we can only have one instance of the housekeeping task setting). Because of that, we would like to minimize how often he/she needs to do that.
Moreover, because the housekeeping is going to consume the application servers' resources and these tables could be locked (it is a transactional table), we need to run the housekeeping during the weekends (we cannot run it after working hours on working days). If you set a large enough number of records to be deleted, you know more or less how long it will take, so the BASIS analyst could rest or do something else for a couple of hours, or leave it running until it finishes (the next execution would be scheduled only the following week). However, if you set a small number of records to be deleted, he/she will need more weekends to complete the housekeeping, or each weekend he/she will need to check many times whether the task has finished and start the next one.
I will try to simply group the records by creation date and count the subtotals, then create a procedure that adds up the subtotal of each day while the running total stays below 250,000 and stores the last creation date of each group somewhere. I can't see another alternative here.
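For the record, once each group's last creation date is known, turning it into the program's retention-days parameter X is a one-line date diff. A hedged sketch (it assumes the program deletes rows strictly older than today minus X days; if the boundary is inclusive, drop the "- 1"):
-- Hypothetical conversion of a group's last creation date into the X parameter.
DECLARE @MaxDate datetime;
SET @MaxDate = '2012-06-14'; -- example cut date taken from one of the groups
SELECT DATEDIFF(dd, @MaxDate, GETDATE()) - 1 AS RetentionDaysX;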
Best Regards,
Darthurfle
June 28, 2013 at 4:01 pm
Hi everybody,
I did it by coding a VBA macro in Excel (using a first-fit knapsack algorithm). It was quite easy to do in code; I only had to get the total number of records for each day.
I think this kind of problem is more readily solved with procedural code.
Anyway, I appreciate your attention and help!
Best Regards,
Darthurfle