What's the best way to get a sample set of a big table without primary key?

  • This is what is happening here:

    We have lots of big data, table's size is often around billion and there is no integer id for primary key.

    We need to get a small set of sample data (1%), current solution is for the tables with big int field, we do something like WHERE id%100<1 

    As lots of pending tables do not have integer field like id, we are stuck here now.

    Can anyone share your solution here on how to get the needed sampling dataset?

    Please note: it is not acceptable nor realistic to add new field to existing table.

  • The issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.

    How about tablesample?  This should only tocuh pages required to fulfill the sample request

    http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx

  • parody - Thursday, June 29, 2017 7:26 AM

    The issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.

    How about tablesample?  This should only tocuh pages required to fulfill the sample request

    http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx

    Thank you for your input, unfortunately this is not MS SQL and there is no TABLESAMPLE function or similar one.

  • halifaxdal - Thursday, June 29, 2017 7:32 AM

    parody - Thursday, June 29, 2017 7:26 AM

    The issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.

    How about tablesample?  This should only tocuh pages required to fulfill the sample request

    http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx

    Thank you for your input, unfortunately this is not MS SQL and there is no TABLESAMPLE function or similar one.

    Ohk, it's just you have posted under SQL Server 2012 - T- SQL seciton.  What is the DBMS in question, else no one can guess at the syntax you are expecting to be supported?? Or do you actually mean Google Big Table, not simple a big table.  Maybe i'm missing something.

  • halifaxdal - Thursday, June 29, 2017 7:32 AM

    parody - Thursday, June 29, 2017 7:26 AM

    The issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.

    How about tablesample?  This should only tocuh pages required to fulfill the sample request

    http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx

    Thank you for your input, unfortunately this is not MS SQL and there is no TABLESAMPLE function or similar one.

    You're posting on the wrong website, then.  If you're using a similar RDBMS, something like SELECT TOP 1 PERCENT might work for you.  Otherwise, you're likely to get a quicker, clearer and more comprehensive answer on a discussion board relating to the product you're using.

    John

  • What about using something like this?

    SELECT TOP (1) PERCENT *
    FROM MyTable;

    Note that if you add an ORDER BY clause the performance will decrease.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Just to be pedantic, though, this may not give the type of sample desired - it will return in an efficient-first order which may or may not be represantive of the full contents depending on a number of factors (insert order, what is the cluster, partitioning, hardware etc)

    The modulo example from the OP mimcks a sampled set across the full contents, hence the equivilent thought of tablesample (in tsql at least!)

  • I don't think anyone can tell you the best way (or even a viable approach) since it's not SQL Server and the database platform for the table is unknown.

    Sue

  • halifaxdal - Thursday, June 29, 2017 7:32 AM

    parody - Thursday, June 29, 2017 7:26 AM

    The issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.

    How about tablesample?  This should only tocuh pages required to fulfill the sample request

    http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx

    Thank you for your input, unfortunately this is not MS SQL and there is no TABLESAMPLE function or similar one.

    You have been around long enough to know better than keeping the facts to yourself, if you want an answer then you must at least post all the relevant information. For a start, what system are you working on? 
    😎

    There are hundreds of regulars here which can answer your question, the only requirement is to properly ask and not leave us guessing!

  • Most typically in SQL Server I see this technique used:

    SELECT TOP ...
    FROM dbo.table_name
    ORDER BY NEWID()

    I presume that gives a roughly random sample.  Of course it likely does require sorting the entire record set.

    If you have an existing index that contains a unique set of columns, you could select just those columns in the "ORDER BY" query, allowing an index scan and sort rather than a full table scan and sort, then join back to the main table after that to get other column(s) from the table.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • I am so sorry that I forgot to mention that I am working in Cloudera and Netezza, I post my question here because I know this is the best forum for SQL and people like yourself are all the wonderful people and are willing to help, so I can borrow some idea from here and port it to my own environment.

    My apology to all.

    In this case, the query would be used in Impala (Cloudera)

    The raw table  has a size of 11978638511, for the query to do the sampling, the table's size is around 6000

    I tried Luis suggestion of

    SELECT TOP (1) PERCENT *
    FROM MyTable;

    Impala doesn't take it.

    Thank you very much.

  • halifaxdal - Thursday, June 29, 2017 10:20 AM

    I am so sorry that I forgot to mention that I am working in Cloudera and Netezza, I post my question here because I know this is the best forum for SQL and people like yourself are all the wonderful people and are willing to help, so I can borrow some idea from here and port it to my own environment.

    My apology to all.

    In this case, the query would be used in Impala (Cloudera)

    The raw table  has a size of 11978638511, for the query to do the sampling, the table's size is around 6000

    I tried Luis suggestion of

    SELECT TOP (1) PERCENT *
    FROM MyTable;

    Impala doesn't take it.

    Thank you very much.

    I just did a quick search - it won't support top. An equivalent is the limit clause. But it also doesn't mean it's the best and there could be alternatives so you may want to google and get more information on using the limit clause.
    select * from MyTable
    limit 6000;

    Sue

  • Rereading your first post actually made me realise you are doing it just fine with modulo, it's just with some tables you don't have an integer column to use.

    In that case can you not find some other suitable column with enough variation (timestamp or some random data) on which to calculate a checksum and then perform a modulo?

    Alternatively a TOP with calculated instead of percentage and order by something random or a new_id type function?

    Also FYI

    https://issues.apache.org/jira/plugins/servlet/mobile#issue/IMPALA-1924

  • Thank you. I am not sure if I ran into sort of "bug" in Impala, here is what I like to share here and I am waiting for Cloudera's reply:
    As there is no integer column to use, I came up with the following query hoping to get 1% random subset:

    SELECT
    round(rand(unix_timestamp(now())) * 10,0)
    ,cast(round(rand(unix_timestamp(now())) * 10,0) as int)
    ,*
    FROM fq_stage.OATS_GTC_ORDER
    --where cast(round(rand(unix_timestamp(now())) * 10,0) as int) = 0

    The second field is not necessary but added there for troubleshooting, the query with where clause is supposed to return me the 1% subset but actually nothing
    Comment out the where clause will give all the dataset
    Modify it to <=1,2,..... will give me all the dataset.

    BTW:
    rand(unix_timestamp(now())) generates value like:  0.2900070456275749        

  • ScottPletcher - Thursday, June 29, 2017 9:57 AM

    Most typically in SQL Server I see this technique used:

    SELECT TOP ...
    FROM dbo.table_name
    ORDER BY NEWID()

    I presume that gives a roughly random sample.  Of course it likely does require sorting the entire record set.

    If you have an existing index that contains a unique set of columns, you could select just those columns in the "ORDER BY" query, allowing an index scan and sort rather than a full table scan and sort, then join back to the main table after that to get other column(s) from the table.

    How about adding that   WITH ( NOLOCK)   next to the table name

Viewing 15 posts - 1 through 15 (of 28 total)

You must be logged in to reply to this topic. Login to reply