When to use Python in SQL Server

  • I have a really stupid question.  Lots of articles explain how to use Python in SQL Server, but I have a more basic question... how do you know when to use Python in SQL Server?  Maybe part of my problem is that Python can do so many things with data that it's just mind-boggling... so where do you start?  Is there a good tutorial on the "when", with walkthroughs?  (I have a Pluralsight subscription... haven't seen an answer there yet... but maybe I just don't know what I'm looking for!)

    Thanks!

  • My opinion - I would use it when I need to do analytics on the data; similar to when you'd use R.

    My approach though is to use these tools outside of the SQL engine as I can do the processing on my local machine rather than putting extra strain on the SQL Server (R and Python operate in different memory space than SQL Server, so it can be easy to overload your server).

    If you have a beefy server and need to do some heavy lifting server side that would be too painful to do on your local box, then use Python inside SQL Server.

    Others may disagree with me, as there are benefits to having everything in one tool, but my opinion is that just because you can doesn't mean you should.  A SQL Agent job and xp_cmdshell can be used to schedule chkdsk, for example, but I would not recommend that to anyone as the tool for scheduling chkdsk.

    On the other hand, if doing the analytics on the SQL Server makes more sense than pushing it client side, then use Python on your SQL Server.  I would encourage you to set up a data warehouse for that and not do heavy analytics on a server hosting anything that is time sensitive with reads or writes as you are likely to hog resources while doing the analytics.  And you will want to make sure you configure SQL so that Python has enough resources to do the analytics you need...
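    For anyone who hasn't seen it, "Python inside SQL Server" means Machine Learning Services and sp_execute_external_script.  A minimal sketch - this assumes Machine Learning Services is installed and 'external scripts enabled' has been turned on via sp_configure, and dbo.Sales is a made-up table:

    ```sql
    -- Assumes: Machine Learning Services installed, and
    -- sp_configure 'external scripts enabled', 1 + RECONFIGURE already run.
    -- dbo.Sales is a hypothetical table with Region and Amount columns.
    EXEC sp_execute_external_script
        @language = N'Python',
        @script = N'
    import pandas as pd
    # InputDataSet arrives from the query below as a pandas DataFrame
    OutputDataSet = InputDataSet.groupby("Region", as_index=False)["Amount"].mean()
    ',
        @input_data_1 = N'SELECT Region, Amount FROM dbo.Sales',
        @input_data_1_name = N'InputDataSet'
    WITH RESULT SETS ((Region nvarchar(50), AvgAmount float));
    ```

    The Python runs in a separate process launched by the Launchpad service, which is exactly why the memory/resource concerns above apply.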

    The above is all just my opinion on what you should do.
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • I've only just started Python, but trying to use Python inside SQL Server is mindboggling to me. Seems a lot cleaner and clearer if you do it from outside... connect to SQL Server using pyodbc, run a stored procedure (well, for the sake of simplicity), and then do the analysis there.  I'll keep looking, but I was hoping for a somewhat introductory guide to it... Maybe it's in Wes McKinney or Bill Lubanovic's books...
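    The client-side route described above might look like this minimal sketch - the connection string and procedure name are made up, and only the row-shaping helper is pure Python so it can be tried without a server:

    ```python
    import statistics


    def summarize(rows, value_index):
        """Compute a simple summary (count, mean) from cursor-style row tuples."""
        values = [row[value_index] for row in rows]
        return {"count": len(values), "mean": statistics.fmean(values)}


    def fetch_and_summarize(conn_str, proc_name):
        """Run a stored procedure via pyodbc and summarize one numeric column.

        Requires a reachable SQL Server; conn_str and proc_name are hypothetical.
        """
        import pyodbc  # third-party: pip install pyodbc

        with pyodbc.connect(conn_str) as conn:
            cursor = conn.cursor()
            cursor.execute(f"EXEC {proc_name}")
            rows = cursor.fetchall()
        return summarize(rows, value_index=1)


    # Example with in-memory rows standing in for a result set:
    sample = [("Widget", 10.0), ("Gadget", 14.0)]
    print(summarize(sample, value_index=1))  # {'count': 2, 'mean': 12.0}
    ```

    From there the rows (or a pandas DataFrame built from them) can be handed to Seaborn or whatever analysis library you like, with no special server-side configuration needed.
    
    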

  • My opinion - I don't think there are that many use cases for using it.  Similar to R - you can access it externally, so does it need to be baked into SQL Server?

    The only benefits I can see are if you are writing some super complex stuff that uses a lot of CPU and memory and your server has some to spare but your poor little desktop doesn't.  OR your desktop does, but the end users who would be using the results would need beefier machines.

    One advantage to it is that any analytics you do should run in fairly consistent times, presuming the data doesn't change, no matter what machine you run it on - a cheap little laptop with 2 GB of RAM or a beefier developer machine with 64 GB of RAM - because all the heavy lifting is done on the SQL Server side.

    Another advantage is you could have a centralized stored procedure that does the calculations, which others could consume without needing an extra analytics tool installed on their machines.  Then they can call the stored procedure, have it do its magic on the back end, and get consistent results across machines and tools (SSRS, Excel, SSMS, Power BI, etc.).  That being said, I have no idea whether SSRS, Excel, Power BI, etc. support getting Python/R data from SQL Server, but I can't see why they would complain.


  • I can get Python to read my database tables / run my stored procedures using pyodbc from Python (outside SQL Server)... I guess Python is just super flexible - I could use it to filter data (if I needed to), including using joins etc.  So I might try the "easy" version first, get it working, and then move toward the more complex stuff.  To be a bit more concrete: write a stored procedure to get the data in the "easy" version, and then just manipulate it in Python - maybe just using Seaborn to graph my results.

    Then try to write something that does what the stored procedure does using different Python dataframe methods to join dataframes to do the same thing. Maybe it's just that there are so many choices to make, given the particular situation you're in.
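    To make the "do the join in Python instead" idea concrete, here is a tiny pure-Python sketch of an inner join over lists of dicts (pandas users would reach for DataFrame.merge(how="inner") instead; the table and column names are invented):

    ```python
    def inner_join(left, right, key):
        """Inner-join two lists of dicts on a shared key, like SQL INNER JOIN."""
        # Index the right side by key so the join is one pass per side
        index = {}
        for row in right:
            index.setdefault(row[key], []).append(row)
        joined = []
        for lrow in left:
            for rrow in index.get(lrow[key], []):
                joined.append({**lrow, **rrow})  # merge matching rows
        return joined


    orders = [{"cust_id": 1, "amount": 50}, {"cust_id": 2, "amount": 75}]
    customers = [{"cust_id": 1, "name": "Acme"}]
    print(inner_join(orders, customers, "cust_id"))
    # [{'cust_id': 1, 'amount': 50, 'name': 'Acme'}]
    ```

    Which makes the earlier point nicely: the database already does this, so duplicating it client side is mostly a learning exercise.
    
    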

  • My comment would be to not use Python to duplicate what can be done in SQL Server.  Only use it if you have something that SQL Server can't do well or can't do at all.  It's like cursors and other forms of RBAR, though.  There are a ton of people that will insist they need to use some form of RBAR simply because they don't know enough about T-SQL to do it in T-SQL.

    I'll also say that you use the ship you know how to steer but, if people spent as much time learning how to do something in T-SQL as they do in other "languages", they might not have to use those other "languages".  Of course, that won't help you "keep up with the cool kids".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I half agree with Jeff there.  Python may not be the best option for all use cases but, like cursors, there are cases where it is the best or only option.

    The part I disagree with is the second half.  Some things CAN be done in SQL, but can be done so much faster and/or easier in other languages.  Knowing other languages should never be seen as a fault.  Python in SQL can do some pretty cool things.

    I think the better debate would be: does Python need to be baked into SQL Server?  That question is a bit harder to tackle and find valid use cases for.  The only use cases I can think of are when the process being done in Python is better suited to run server side than client side due to duration or resource use.  In the end, it is another case of picking the right tool for the job.

    For example, if you want to find the mean of a large data set, Python may outperform SQL (depending on how you have things configured and which libraries you use in Python).  I have not tested this, but it would be an interesting article for SSC - does Python outperform SQL for analytics?  I expect it would vary greatly depending on configuration (i.e. does the server have enough memory for Python to play nice with SQL, or will one of the two eat it all up).  It is one of those things where it may make sense to test it in both SQL and in Python and see which performs better, which is easier to read, and which is easier to support - presuming you have the appropriate timeline to do that level of testing.
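    As a trivial illustration of the "test it both ways" suggestion, the Python half of a mean comparison can be as small as this (the SQL half would just be SELECT AVG(col); the data and timings here are only a sketch, not a benchmark):

    ```python
    import statistics
    import timeit

    data = list(range(1_000_000))

    # statistics.fmean is the fast float-based mean in the standard library
    mean_value = statistics.fmean(data)
    print(mean_value)  # 499999.5

    # Rough wall-clock timing - the kind of number you'd compare against
    # SET STATISTICS TIME output from the equivalent SELECT AVG(col)
    elapsed = timeit.timeit(lambda: statistics.fmean(data), number=5)
    print(f"5 runs: {elapsed:.3f}s")
    ```

    In practice pandas or NumPy would likely be used instead of the statistics module, but the shape of the comparison is the same.
    
    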

    Another advantage to Python in SQL that I can think of is for developers who are not that familiar with SQL, but know other languages such as Python.  Learning how to do it in SQL may make the code faster but harder to support.  If you understand Python better than TSQL, Python code will be more "readable" and thus more supportable.

    It is one thing to say your code is fast, but if nobody understands how it is working it means nobody can support the code.  Being able to maintain the code is my #1 priority when writing code.  Performance is a close second, but if I can't maintain it, it won't hit production.


  • Mr. Brian Gale wrote:

    Some things CAN be done in SQL, but can be done so much faster and/or easier in other languages.

    While that is certainly true, I find that it's a matter of people simply not knowing T-SQL or SQL Server.  It's just like cursors where people say that they're ok to use if there's no other way.  The problem is that people simply don't know that there's another usually much simpler way.

    Mr. Brian Gale wrote:

    ...but if nobody understands how it is working it means nobody can support the code.

    Oh my... how true THAT is.  But it's also true with the "Tower of Babble" that frequently results from people going to other methods simply because they don't know much about the tool that they really should know more about.

    In a previous company that I worked at, I had one guy take a complaint about me all the way to the CEO with the CTO in tow.  He wanted me to deploy a CLR to production for him because it was "absolutely imperative and critical to multiple tasks".  I told him no and I started to explain and he got all flustered and stormed out of the room with his final words being something to the effect of "We'll just see about that!".

    To make a much longer story shorter, his CLR was to do a simple modulo.  He never took the time to look up "modulo" in BOL.  He looked mighty stupid when I explained it to the CEO and CTO that his CLR could be replaced with a single "%" sign as well as the fact that he cut me off when I tried to explain why I wasn't going to let his CLR into production.

    Also, at the very same company, we had a system written in DTS (yeah... it was a long time ago) that took 45 minutes just to get ONE "Doubleclicknet" file ready for import.  It had worked correctly for years before I was brought in and still worked correctly.  Part of the reason why I was brought in was because they were having performance issues across the board because business was suddenly booming.  We had hundreds of such files to import and there wasn't enough time in the day to import them all even if we doubled that rate.

    The files weren't very long... usually just 30,000 rows or so.  But they were variable width and were frequently up to 800 columns wide.  Only the presence of the first 9 columns (the "key" columns) were predictable.  All of the other columns could show up in any position from file to file.  As if that wasn't enough, it was important that pairs of columns and, sometimes, quads of columns had to stay together.  The import they wanted to do would "normalize" the data so that the 9 key columns would always be present on a row and the other columns would be unpivoted as either pairs or quads and sent to their respective tables for pairs or quads.  Instead, it took 45 minutes just to get the file ready for import.

    The DTS package was a "wonder".  Like I said, it took 45 minutes just to get a file ready for import.  It would run a PERL script to do something.  It ran Active-X to make on the fly changes to the DTS package paths (disabling a path and enabling another for divergent and reconverging paths).  It used some form of C to find the files.  It had a VB module associated with it (I forget what that did).  And no one was left at the company that knew how it all worked.

    And they wrote it that way because of things like what you said...

    Mr. Brian Gale wrote:

    Another advantage to Python in SQL that I can think of is for developers who are not that familiar with SQL, but know other languages such as Python. Learning how to do it in SQL may make the code faster but harder to support. If you understand Python better than TSQL, Python code will be more "readable" and thus more supportable.

    Just replace the word "Python" with "R", "Perl", "Active-X", "C", ".net", "C#", "VB Module", or even "DTS" and you'll suddenly understand why companies need to get some Developers that actually do understand T-SQL very well.  Everything else is just an excuse to have overly complex and unnecessarily "distributed" code of many disparate types.

    Mr. Brian Gale wrote:

    Some things CAN be done in SQL, but can be done so much faster and/or easier in other languages.

    Again, very true but this is just like the problem with cursors and when they should be used.  So many people say "can be done so much faster and/or easier in other languages" but that's very frequently because they don't actually know enough about the language they should know because they ARE working with the database (or should be) and that's T-SQL.  Consider again the moroff that wanted to write a CLR to do a bloody Modulo in T-SQL.

    I listen to people say "To a hammer, everything is a nail".  Yep... and when you're trying to pound nails into your database, why the hell are you looking for a screwdriver or a paperclip or a shoehorn?  Use the bloody hammer! 😀

    I've also had people tell me "Well, Jeff... SQL Server isn't the center of the universe".

    My reply is "Cool!  Let's turn it off and see if you're right".

    My point is that I truly agree that T-SQL isn't the right tool for some things, but I find that a lot of people don't actually have the right knowledge about the tool to say what those things actually are.  Some say "Well, just because you can do something in SQL, doesn't mean you should".  The same holds true in reverse... "Just because you can do something outside of SQL, doesn't mean you should".

    I almost forgot... that 45 minute "Tower of Babel" run just to get one file ready for import?  Yeah... I rewrote it using T-SQL... it was much shorter and certainly easy to understand (that's what comments are for!).  It actually did the full imports, unpivots, and redistribution of data to the many required normalized tables, along with data validation and cleansing, at the rate of 8 files every 2 minutes - and it was done in SQL Server 2000 on a 32-bit box a long, long time ago.

    --Jeff Moden



  • Jeff, I think this may be a thing that we have to agree to disagree on.  I am not saying that SQL shouldn't be the first solution you look at for a problem, I am just saying that it shouldn't be the ONLY solution you look at.  I am also not saying that just because you KNOW Python, all of your SQL code should be done in Python.

    All I am trying to say is: make sure you are picking the right tool for the job.  When you are learning a new thing (like Python in SQL), I find it is VERY useful to find a SQL query that will give you the same results as your Python script.  This helps you determine whether your Python code is accurate or has a bug - or, alternately, whether your SQL code has a bug.

    Learning new skills and tools such as Python or PowerShell or whatever is never a bad thing.  Knowing the limitations of each tool is also beneficial.  For example, using Python in SQL you can load data from a flat file into memory, do analytics on it without needing to pull it into a table first, then dump the results into a temporary table and run some TSQL against the data.  Now, you can do this entirely in TSQL (uftReadfileAsTable), but maybe you need to do some manipulation on the data and you enjoy using regex.  If you want to use regex in SQL Server, you either need to import a CLR (which may or may not be allowed at your company) or rewrite the logic to not use regex (which may be complex or slow).  Regex in Python, on the other hand, is not a problem.
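    The regex point is easy to illustrate.  Here is a sketch of the kind of cleanup that is a one-liner with Python's re module but painful in plain T-SQL - the phone-number format is just an invented example:

    ```python
    import re


    def normalize_phone(raw):
        """Strip everything but digits, then format as NNN-NNN-NNNN if possible."""
        digits = re.sub(r"\D", "", raw)  # remove every non-digit character
        if len(digits) == 10:
            return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"
        return digits  # leave odd-length values alone for later review


    print(normalize_phone("(555) 123-4567"))    # 555-123-4567
    print(normalize_phone("555.123.4567 x89"))  # 555123456789
    ```

    Inside sp_execute_external_script that function could be applied to a column of the incoming InputDataSet before the results land in a table, with no CLR involved.
    
    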

    If time permits, I strongly recommend trying all of the solutions and tools that are available, that you can support, and that make sense.  I would not bother to spin up a MongoDB database to see if it performs better for my tasks, but my skill set and work requirements mean I am comfortable doing things in C#, PowerShell and TSQL, so I will try to pick the right tool for the job.

    I would never say that SQL is not capable of doing any sort of data processing and manipulation, but if you asked me to write up an application that end users would be using with a fancy GUI, TSQL is not going to be my "hammer".  SQL will likely hold some of the configuration and user customization data and any data that is collaboration data that needs to be accessible by multiple clients at the same time, but there is no way I'd build a GUI in SQL.  Also, if the GUI side of the application needed the ability to sort the data inside a data grid view object (for example), I am not going to bake that into the database side as that is clunky and messy and slow to do.  You will get much better performance building the sorting into the datagridview object.

    As for your comment about SQL Server being the center of the universe, the same could be said about the application server.  Let's turn that off and see how long before production comes calling because the application failed.  I would not say SQL Server is the "center of the universe"; it is more a cog in the system.  You pull out one cog, the whole thing goes down.

    I do agree that a lot of people don't have the right knowledge to know which tool is the right tool for the job.  The same thing applies to the hammer vs nail analogy.  If the only tool you have ever used or owned is a hammer, all problems look like nails.  But going into a hardware store and looking at all of the tools can be overwhelming, and knowing which tool is the right one for the job you have can be tricky.  Sometimes a circular saw is the way to go, sometimes a jigsaw is better.  Or maybe a table saw or a hand saw.  Can the hand saw do the same work as the table saw?  Yes, but it is a lot more work and will be much slower.  BUT if you have some branches you need to cut down from your tree and you bought the table saw, that is going to be incredibly difficult to do.  You need to know the problem definition and the problem scope before you start looking at the tools you need.  And that is what pietlinden was trying to do - they got a new tool (Python in SQL Server) and wanted to know a use case for it.

    One use case I can think of that would probably check all the boxes: you have an existing Python script that you are calling via xp_cmdshell in a SQL Agent job, and your company is implementing a new policy to turn off xp_cmdshell, so you need to find another way to run that Python script at step 3 of 10 in that SQL Agent job.  You could schedule a Windows Task Scheduler job to start when step 3 usually runs, doing a best guess with a WAITFOR in the SQL Agent job, but that is sloppy.  Having all of the code inside SQL may be the best "quick win" for the team - with plans either to migrate it in the future to full TSQL without Python, OR to keep it in Python if that makes the most sense.

     


  • This was removed by the editor as SPAM

  • Muskan wrote:

    Knowing when to use Python in SQL Server depends on the specific requirements and tasks you need to accomplish. Python can be used in various scenarios within SQL Server, including data manipulation, data analysis, machine learning, automation, and more. Here are some common use cases where Python can be beneficial:

    • Data manipulation and transformation: Python provides powerful libraries like pandas and NumPy, which can be used to perform complex data manipulations, transformations, and cleansing tasks on SQL Server data.
    • Data analysis and visualization: Python's libraries, such as matplotlib, seaborn, and plotly, enable you to analyze and visualize SQL Server data to gain insights and make data-driven decisions.
    • blah blah blah

    ChatGPT again.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin
