June 29, 2017 at 7:06 am
This is what is happening here:
We have lots of big data, table's size is often around billion and there is no integer id for primary key.
We need to get a small set of sample data (1%), current solution is for the tables with big int field, we do something like WHERE id%100<1
As lots of pending tables do not have integer field like id, we are stuck here now.
Can anyone share your solution here on how to get the needed sampling dataset?
Please note: it is not acceptable nor realistic to add new field to existing table.
June 29, 2017 at 7:26 am
The issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.
How about tablesample? This should only tocuh pages required to fulfill the sample request
http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx
June 29, 2017 at 7:32 am
parody - Thursday, June 29, 2017 7:26 AMThe issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.How about tablesample? This should only tocuh pages required to fulfill the sample request
http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx
Thank you for your input, unfortunately this is not MS SQL and there is no TABLESAMPLE function or similar one.
June 29, 2017 at 7:37 am
halifaxdal - Thursday, June 29, 2017 7:32 AMparody - Thursday, June 29, 2017 7:26 AMThe issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.How about tablesample? This should only tocuh pages required to fulfill the sample request
http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx
Thank you for your input, unfortunately this is not MS SQL and there is no TABLESAMPLE function or similar one.
Ohk, it's just you have posted under SQL Server 2012 - T- SQL seciton. What is the DBMS in question, else no one can guess at the syntax you are expecting to be supported?? Or do you actually mean Google Big Table, not simple a big table. Maybe i'm missing something.
June 29, 2017 at 7:51 am
halifaxdal - Thursday, June 29, 2017 7:32 AMparody - Thursday, June 29, 2017 7:26 AMThe issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.How about tablesample? This should only tocuh pages required to fulfill the sample request
http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx
Thank you for your input, unfortunately this is not MS SQL and there is no TABLESAMPLE function or similar one.
You're posting on the wrong website, then. If you're using a similar RDBMS, something like SELECT TOP 1 PERCENT might work for you. Otherwise, you're likely to get a quicker, clearer and more comprehensive answer on a discussion board relating to the product you're using.
John
June 29, 2017 at 8:07 am
What about using something like this?
SELECT TOP (1) PERCENT *
FROM MyTable;
Note that if you add an ORDER BY clause the performance will decrease.
June 29, 2017 at 8:14 am
Just to be pedantic, though, this may not give the type of sample desired - it will return in an efficient-first order which may or may not be represantive of the full contents depending on a number of factors (insert order, what is the cluster, partitioning, hardware etc)
The modulo example from the OP mimcks a sampled set across the full contents, hence the equivilent thought of tablesample (in tsql at least!)
June 29, 2017 at 8:22 am
I don't think anyone can tell you the best way (or even a viable approach) since it's not SQL Server and the database platform for the table is unknown.
Sue
June 29, 2017 at 9:18 am
halifaxdal - Thursday, June 29, 2017 7:32 AMparody - Thursday, June 29, 2017 7:26 AMThe issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.How about tablesample? This should only tocuh pages required to fulfill the sample request
http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx
Thank you for your input, unfortunately this is not MS SQL and there is no TABLESAMPLE function or similar one.
You have been around long enough to know better than keeping the facts to yourself, if you want an answer then you must at least post all the relevant information. For a start, what system are you working on?
😎
There are hundreds of regulars here which can answer your question, the only requirement is to properly ask and not leave us guessing!
June 29, 2017 at 9:57 am
Most typically in SQL Server I see this technique used:
SELECT TOP ...
FROM dbo.table_name
ORDER BY NEWID()
I presume that gives a roughly random sample. Of course it likely does require sorting the entire record set.
If you have an existing index that contains a unique set of columns, you could select just those columns in the "ORDER BY" query, allowing an index scan and sort rather than a full table scan and sort, then join back to the main table after that to get other column(s) from the table.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
June 29, 2017 at 10:20 am
I am so sorry that I forgot to mention that I am working in Cloudera and Netezza, I post my question here because I know this is the best forum for SQL and people like yourself are all the wonderful people and are willing to help, so I can borrow some idea from here and port it to my own environment.
My apology to all.
In this case, the query would be used in Impala (Cloudera)
The raw table has a size of 11978638511, for the query to do the sampling, the table's size is around 6000
I tried Luis suggestion of
SELECT TOP (1) PERCENT *
FROM MyTable;
Impala doesn't take it.
Thank you very much.
June 29, 2017 at 10:42 am
halifaxdal - Thursday, June 29, 2017 10:20 AMI am so sorry that I forgot to mention that I am working in Cloudera and Netezza, I post my question here because I know this is the best forum for SQL and people like yourself are all the wonderful people and are willing to help, so I can borrow some idea from here and port it to my own environment.My apology to all.
In this case, the query would be used in Impala (Cloudera)
The raw table has a size of 11978638511, for the query to do the sampling, the table's size is around 6000
I tried Luis suggestion of
SELECT TOP (1) PERCENT *
FROM MyTable;Impala doesn't take it.
Thank you very much.
I just did a quick search - it won't support top. An equivalent is the limit clause. But it also doesn't mean it's the best and there could be alternatives so you may want to google and get more information on using the limit clause. select * from MyTable
limit 6000;
Sue
June 29, 2017 at 10:56 am
Rereading your first post actually made me realise you are doing it just fine with modulo, it's just with some tables you don't have an integer column to use.
In that case can you not find some other suitable column with enough variation (timestamp or some random data) on which to calculate a checksum and then perform a modulo?
Alternatively a TOP with calculated instead of percentage and order by something random or a new_id type function?
Also FYI
https://issues.apache.org/jira/plugins/servlet/mobile#issue/IMPALA-1924
June 29, 2017 at 1:08 pm
Thank you. I am not sure if I ran into sort of "bug" in Impala, here is what I like to share here and I am waiting for Cloudera's reply:
As there is no integer column to use, I came up with the following query hoping to get 1% random subset:
SELECT
round(rand(unix_timestamp(now())) * 10,0)
,cast(round(rand(unix_timestamp(now())) * 10,0) as int)
,*
FROM fq_stage.OATS_GTC_ORDER
--where cast(round(rand(unix_timestamp(now())) * 10,0) as int) = 0
The second field is not necessary but added there for troubleshooting, the query with where clause is supposed to return me the 1% subset but actually nothing
Comment out the where clause will give all the dataset
Modify it to <=1,2,..... will give me all the dataset.
BTW:
rand(unix_timestamp(now())) generates value like: 0.2900070456275749
June 29, 2017 at 1:34 pm
ScottPletcher - Thursday, June 29, 2017 9:57 AMMost typically in SQL Server I see this technique used:SELECT TOP ...
FROM dbo.table_name
ORDER BY NEWID()I presume that gives a roughly random sample. Of course it likely does require sorting the entire record set.
If you have an existing index that contains a unique set of columns, you could select just those columns in the "ORDER BY" query, allowing an index scan and sort rather than a full table scan and sort, then join back to the main table after that to get other column(s) from the table.
How about adding that WITH ( NOLOCK) next to the table name
Viewing 15 posts - 1 through 15 (of 28 total)
You must be logged in to reply to this topic. Login to reply