T-SQL decoder ring for Google BigQuery?

  • Sorry if this is in the wrong forum... feel free to move it.

    I'm okay with T-SQL, but BigQuery boggles my mind. Is there a good reference comparing the two somewhere? I was looking for a little bit of something sort of like an intro and recipes for converting between the two.

    thanks!

     

  • Haven't seen that. I did run across this item on learning BigQuery

    https://codingisforlosers.com/learn-bigquery-sql/

  • Thanks,  Steve, I'll check it out. (Gotta say, I never expected to have to help my nephew figure his way through BigQuery, but here I am!)

    This book looks promising, but it's just about BigQuery...

    Data Science on the Google Cloud Platform: Implementing End-to-End Real-Time Data Pipelines: From Ingest to Machine Learning 

    Might be the best I'm going to do for the time being.

  • I haven't seen a comparison.

     

    that being said, if you're learning it, maybe you want to write one. I'd pay you, even for a series of articles as you learn the differences in the languages.

  • Be warned, I'm a rank beginner... I'm absolutely clueless about BigQuery, but that might be a good thing. I guess I might have to just make an account for BigQuery (BQ, because I'm lazy) and use data there. At the moment, I'm thinking of doing something like:

    1. create tables in BQ...
    2. Populating tables in BQ from parquet files (Something I know absolutely nothing about)
    3. similarities & differences I found between BQ SQL and T-SQL, like LIMIT n instead of TOP n - one *huge* thing that makes my life easier is that BQ supports common table expressions.
    4. Weird things / surprises I found.

    One thing I was hoping to figure out (which I may do with CTEs) was solving the "sanity check" question - Well, running a query against an enormous dataset can take 15 minutes, so I wanted a way to test on the smallest reasonable dataset I could. I haven't figured it out yet, but I was tempted to create 2 CTEs -- one for "today" and one for "yesterday" (except the way things are in Nephew's BQ dataset is that each one is in a separate table, and the table names change). So I'd do a TOP/LIMIT query for each and compare a thin horizontal "slice" of data from each just to see if things make sense or are working.

    part of the fun is that I have zero clue what to expect. The other problem is getting my head around Python and Pandas, because I think that's the sane way of dealing with the result sets from BQ. (And something like Notebooks)... because it would be self-documenting.

    Any thoughts?

    Pieter

  • It might be interesting to just see your impressions and experiments. I'd think you present this as "what I learned" rather than "this is what you should do"

  • Oh, I definitely wouldn't do that! It would be more like "Notes from the road: what I learned (and maybe didn't) using BigQuery"

  • Okay, just found one really interesting feature... described here: https://cloud.google.com/bigquery/docs/querying-wildcard-tables#limitations

    It makes sense if your data pipeline generates something like CSV files with a format of [standard file prefix]MMDDYY

    the part to keep in mind is that in BigQuery, you often upload a series of huge files/CSVs/whatever. In addition to filtering what's IN a table, in BQ, you can filter the tables that "participate" in the query, like this:

    SELECT
      max,
      ROUND((max-32)*5/9,1) celsius,
      mo,
      da,
      year
    FROM
      bigquery-public-data.noaa_gsod.gsod19*
    WHERE
      max != 9999.9 # code for missing data
      AND _TABLE_SUFFIX BETWEEN '29' and '35'
    ORDER BY
      max DESC

    It's like partitioned tables without the hassle of partitioning. It also means you can import entire files with the same structure into their own tables, and then combine them with _TABLE_SUFFIX instead of writing a messy UNION ALL query that would require dynamic T-SQL. If you're doing time-based analysis (like changes in value or whatever over time), you can query all the tables at once. (Super handy if you're testing a query ... you can just eliminate all but the first table by changing the _TABLE_SUFFIX filter so only one table is returned.)  Funky, but I can definitely see how this would be helpful if you're doing queries on a massive scale. I seriously think it's time for a book! Otherwise, I might spend too much time thinking in terms of how SQL Server does it, and the two database engines are vastly different.

    Once upon a time, I did cancer research, and their studies were all in separate databases/tables. It would have been pretty wild to do a query using something like _TABLE_SUFFIX (I wonder if there's a PREFIX version of that.) to get ALL the data from ALL the protocols of  the same type (like "lung", "central nervous system", etc) and just do a super quick summary of *all* of them at once, just to look for patterns.

    • This reply was modified 4 years, 6 months ago by  pietlinden.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply