May 15, 2023 at 7:16 pm
SQL would be my first choice for cleaning data. While it's a powerful tool for querying and cleaning data, it may not be the best choice for performing complex language analysis tasks like sentiment analysis. I was reading about someone who was analysing Twitter data; they had to make a decision about the positivity of comments, and a statement like "yes, it was great" within a comment could be either genuine or sarcastic. Sentiment analysis involves determining the sentiment or emotion expressed in a piece of text, and it often requires natural language processing (NLP) techniques to analyse the context, sentiment words, phrases, and other linguistic features. So for this case Python or R would be a good choice to integrate with SQL for cleaning data.
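To make the limitation concrete, here is a toy Python sketch of lexicon-based scoring, with a hand-made word list invented for illustration; real sentiment analysis would use an NLP library such as NLTK or spaCy. Note that it scores "yes, it was great" as positive regardless of whether the comment is genuine or sarcastic, which is precisely the problem described above.

```python
# Toy lexicon-based sentiment scorer. A sketch only; real sentiment
# analysis would use an NLP library (e.g. NLTK's VADER or spaCy).
POSITIVE = {"great", "good", "love", "excellent", "yes"}
NEGATIVE = {"bad", "awful", "hate", "terrible", "no"}

def naive_sentiment(text: str) -> int:
    """Return positive-minus-negative word count; completely blind to sarcasm."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(naive_sentiment("yes, it was great"))  # scores +2...
print(naive_sentiment("that was awful"))     # ...and -1, with no notion of context
```

The sarcastic and genuine readings produce the same score, which is exactly why the context-aware NLP step belongs in Python or R rather than plain SQL.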
Totally agreed on all of that.
Still, there's a shedload of analysis that can be done in T-SQL (of course, you already know that so apologies for stating the obvious).
--Jeff Moden
Change is inevitable... Change for the better is not.
May 15, 2023 at 7:17 pm
Jeff Moden wrote:Thanks, Dennis... that was my exact first impression and for the very reason you stated... The word "but".
How about the rest of you good folks... did anyone else have a FIRST impression that differed from mine?
My first impression was the same as yours, Jeff. I know nothing of Python or R and I can do everything I need with SQL. Okay, a little SSRS for reports, but that's mainly SQL anyway. 100% agree with Phil's statement as well.
Aye... thank you for the feedback on that.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 15, 2023 at 7:27 pm
Jeff Moden wrote:Thanks, Dennis... that was my exact first impression and for the very reason you stated... The word "but".
How about the rest of you good folks... did anyone else have a FIRST impression that differed from mine?
First, second and third impressions are the same: "It is impossible to clean, manipulate, analyse and visualise data without knowing some R or Python."
The corollary being that SQL cannot do all of those things.
At a stretch, it could be argued that SQL cannot be used to visualise data and therefore the above statement is true, because one of {clean, manipulate, analyse, visualise} is false, even though {clean, manipulate, analyse} would all be true.
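That set logic can be made concrete in a couple of lines of Python (the truth values below are illustrative, matching the argument above):

```python
# The quoted claim treats {clean, manipulate, analyse, visualise} as a
# conjunction: SQL must do ALL four, or the "impossible without R/Python"
# statement stands on a technicality.
sql_can = {"clean": True, "manipulate": True, "analyse": True, "visualise": False}

claim_is_true = not all(sql_can.values())  # "impossible without R or Python"
print(claim_is_true)  # True, but only because of the single visualise member
```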
Aye. Thanks, Phil. I'll say that you can actually visualize data in SQL. Not all visualization needs to be in graphic form.
I also say that there are hybrid opportunities. For example, I saw one fellow post an article about Python where his code was "chunking" data in 100,000-row iterations. I asked why and he said to "increase performance". One of the many articles I have on the proverbial back burner is one about how to do linear regressions on millions of rows in a nasty fast manner. It would be interesting to compare Python with SQL Server on things like that.
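For what it's worth, a least-squares fit needs only a handful of running sums, so a single pass does the job; the sketch below shows that standard formulation in Python. It is one common approach, not necessarily the method the planned article describes, and the same five aggregates map directly onto SUM/COUNT in a set-based T-SQL version.

```python
# Single-pass least-squares slope and intercept via running sums: the same
# aggregates a set-based T-SQL version would compute with
# SUM(x), SUM(y), SUM(x*y), SUM(x*x), and COUNT(*).
def linear_regression(pairs):
    n = sx = sy = sxy = sxx = 0
    for x, y in pairs:  # one pass; no "chunking" needed for correctness
        n += 1
        sx += x
        sy += y
        sxy += x * y
        sxx += x * x
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# y = 2x + 1 exactly, so the fit should recover slope 2 and intercept 1
slope, intercept = linear_regression((x, 2 * x + 1) for x in range(1000))
print(slope, intercept)
```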
Another, which I'm actually going to teach this coming Saturday at the Ohio North SQL Saturday 2023, is how to do some pretty awesome, on-the-fly aggregates on 20 million rows out of 100 million rows (28 milliseconds on the first run and less on subsequent runs). It also doesn't take much longer on all 100 million rows.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 16, 2023 at 2:00 am
Okay, I know enough about Python to know that it is not nearly as fast as straight C, especially if you are doing it in MS Windows (ack hack cough puke). I need some relief; pass me a bottle of Linux or Unix. This is because MS Windows does not support multi-threading (which many Python programmers are not even aware of): it does support multi-processing, just not multi-threading. However, if you truly want to get high rates of data manipulation, you use straight C algorithms running on a Unix server. I know because I worked on the code that handles all the claims processing for Medicaid for the state of Florida, where they had to process nearly a billion claims every day and get them completed in a fraction of a day. The magnitude of that versus what most folks are doing is kind of mind-boggling, to say the least.
Still, if I am going to use code to process data manipulation with a need for speed, I am going to use a compiled language (C, C++, Java, etc.), not an interpreted one (Python).
May 16, 2023 at 2:41 am
Dennis Jensen wrote:Still, if I am going to use code to process data manipulation with a need for speed, I am going to use a compiled language (C, C++, Java, etc.), not an interpreted one (Python).
That's another tic I have... I love it when I tell someone that their SQL will be slow and they retort with something (fairly stupid, IMHO) like it's only ever going to have a small number of rows or will only run once a day, so performance doesn't matter. People just can't look past their one piece of code and understand that it's just one of thousands of pieces of code where someone made the same mistake in thinking. Now you have a server that's running two, ten, and sometimes hundreds of times slower, and there's no place to easily fix it because the poison has permeated the meat of the entire server.
Eh? Just throw hardware at it, right?
Lordy.
I have seen it quite a few times where adding extra hardware actually slowed things down because more things want to get at the same data but it's all slow.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 16, 2023 at 7:56 am
Dennis Jensen wrote:Okay, I know enough about Python to know that it is not nearly as fast as straight C, especially if you are doing it in MS Windows (ack hack cough puke). I need some relief; pass me a bottle of Linux or Unix. This is because MS Windows does not support multi-threading (which many Python programmers are not even aware of): it does support multi-processing, just not multi-threading. However, if you truly want to get high rates of data manipulation, you use straight C algorithms running on a Unix server. I know because I worked on the code that handles all the claims processing for Medicaid for the state of Florida, where they had to process nearly a billion claims every day and get them completed in a fraction of a day. The magnitude of that versus what most folks are doing is kind of mind-boggling, to say the least.
Still, if I am going to use code to process data manipulation with a need for speed, I am going to use a compiled language (C, C++, Java, etc.), not an interpreted one (Python).
To the best of my knowledge Windows does support multithreading; certainly in C# you can create multithreaded programs that only run on Windows.
Many popular Python libraries are actually built on top of optimised C or C++ code, and these are much faster than pure-Python implementations. So if most of your processing happens inside those libraries, I think the performance is comparable with compiled C applications.
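For what it's worth, Python's threading module starts real OS threads on Windows as well; what limits CPU-bound scaling is CPython's global interpreter lock, not the operating system. A minimal sketch:

```python
# Python's threading module creates genuine OS threads (on Windows too).
# The GIL serialises Python bytecode, so CPU-bound work doesn't scale
# across cores, but the threads themselves are real and run concurrently
# for I/O-bound work.
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:  # lock needed: += is not atomic across threads
            counter += 1

threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000: four real threads each added 10,000
```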
Nearly a billion claims every day seems a very high number for Florida Medicaid, considering the population of Florida is only about 22 million.
May 16, 2023 at 12:49 pm
Jonathan AC Roberts wrote:Nearly a billion claims every day seems a very high number for Florida Medicaid, considering the population of Florida is only about 22 million.
That sounds about right. The number of Medicaid recipients in Florida is very high. Without looking, I think they have the highest percentage of Medicaid recipients per capita of all the states.
A regular trip to the doctor's office results in multiple claims.
Side note on Florida Medicaid. In 1999, they did an audit of our billings. They determined that we had underbilled them $700k over three years, so the company got fined $300k. The issue was the module that generated the bills: the programmer who wrote the code did not consider the time portion of the datetime columns, so whatever date was the last date of the billing report was never included.
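That failure mode is easy to reproduce: comparing a datetime column against a date-only (midnight) upper bound silently excludes everything after midnight on the last day. A small Python sketch with made-up dates:

```python
# Classic datetime-range bug: an inclusive <= against a date-only upper
# bound drops every record with a non-midnight time on the last day.
from datetime import datetime

billing_records = [
    datetime(1999, 3, 30, 9, 15),   # inside the billing period
    datetime(1999, 3, 31, 14, 30),  # last day, afternoon: the records that got lost
]
period_end = datetime(1999, 3, 31)  # date-only, i.e. 1999-03-31 00:00:00

buggy   = [r for r in billing_records if r <= period_end]           # misses the 31st
correct = [r for r in billing_records if r < datetime(1999, 4, 1)]  # exclusive next-day bound

print(len(buggy), len(correct))  # 1 2
```

The exclusive "less than the next day" upper bound is the usual fix in T-SQL as well, since it works for any time precision.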
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
May 16, 2023 at 12:52 pm
Dennis Jensen wrote:Still, if I am going to use code to process data manipulation with a need for speed, I am going to use a compiled language (C, C++, Java, etc.), not an interpreted one (Python).
That's another tic I have... I love it when I tell someone that their SQL will be slow and they retort with something (fairly stupid, IMHO) like it's only ever going to have a small number of rows or will only run once a day, so performance doesn't matter. People just can't look past their one piece of code and understand that it's just one of thousands of pieces of code where someone made the same mistake in thinking. Now you have a server that's running two, ten, and sometimes hundreds of times slower, and there's no place to easily fix it because the poison has permeated the meat of the entire server.
Eh? Just throw hardware at it, right?
Lordy.
I have seen it quite a few times where adding extra hardware actually slowed things down because more things want to get at the same data but it's all slow.
To coin a cliché, if I had a penny for every time I have heard "It only runs once a week...", retirement would have occurred long ago.
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
May 16, 2023 at 4:08 pm
Jonathan AC Roberts wrote:Nearly a billion claims every day seems a very high number for Florida Medicaid, considering the population of Florida is only about 22 million.
That sounds about right. The number of Medicaid recipients in Florida is very high. Without looking, I think they have the highest percentage of Medicaid recipients per capita of all the states.
A regular trip to the doctor's office results in multiple claims.
Side note on Florida Medicaid. In 1999, they did an audit of our billings. They determined that we had underbilled them $700k over three years, so the company got fined $300k. The issue was the module that generated the bills: the programmer who wrote the code did not consider the time portion of the datetime columns, so whatever date was the last date of the billing report was never included.
But that's nearly 50 claims a day for each person in Florida.
May 16, 2023 at 4:38 pm
Michael L John wrote:Jonathan AC Roberts wrote:Nearly a billion claims every day seems a very high number for Florida Medicaid, considering the population of Florida is only about 22 million.
That sounds about right. The number of Medicaid recipients in Florida is very high. Without looking, I think they have the highest percentage of Medicaid recipients per capita of all the states.
A regular trip to the doctor's office results in multiple claims.
Side note on Florida Medicaid. In 1999, they did an audit of our billings. They determined that we had underbilled them $700k over three years, so the company got fined $300k. The issue was the module that generated the bills: the programmer who wrote the code did not consider the time portion of the datetime columns, so whatever date was the last date of the billing report was never included.
But that's nearly 50 claims a day for each person in Florida.
Yeah. After I posted this I did the math. It does not work. Now, each claim is made up of multiple details. That may be where the billion is coming from.
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
May 16, 2023 at 4:49 pm
Yes, Jonathan AC Roberts, and Python will let you do multi-threading as well, but have you ever looked into how MS Windows supports its so-called multi-threading, especially since Windows does not support forking? So what is multi-threading in MS Windows? It is a queue of requests operated on in a linear fashion, which by my understanding of multi-threading is not multi-threading at all: you do not have multiple threads but one single thread with multiple pieces of processing being run as it can get to them. If you have one thread doing a continuous stream, you have just blocked or negated your ability to do their so-called multi-threading within MS Windows. How do I know this? By having encountered it and then having eventually figured out how to implement multi-processing.
Yes, I know that nearly a billion claims seems rather high to me too, but that was the number I understood we were processing every day. And as Michael L John points out, a single office visit creates multiple claims and a single hospital visit creates a mass of claims. So yeah, about 50 claims per person (taking the full current population of Florida) still seems a bit high to me, but the numbers were the numbers; that is all I can say.
Now I cannot say whether that number was really the number of individual claims; maybe it was the number of processes, or it included the number of times an individual claim had to be processed. We did have several stages to processing a single claim, which had to be done in a sort of sequence, and each claim or batch of claims was not handled exactly the same within each stage. Still, I know some of the processing took a looong time to complete due to the sheer volume of claims being processed on a daily basis, and certain end reports, I think, actually took more than a day to produce for the same reason.

That being said, I am working with the VA now on their claims processing, and while it is not nearly the volume of Florida's Medicaid, it is still a rather daunting number of claims that need to be processed. They like us so much that they have been seriously talking about having us process all of the VA claims for the country, which we figure would mean about five times as many as we are processing now, and that means an issue if we cannot get some of our latency issues under control, something I have been working on since I got here.
May 16, 2023 at 8:03 pm
Some of the claims are still coming from this problem I bet:
https://www.cbsnews.com/news/medicare-fraud-a-60-billion-crime-23-10-2009/
May 17, 2023 at 10:29 pm
Whether you know SQL or not, do you have access to a database? There are plenty of devs that don't really have access to anything other than LocalDB (that they control), and so they might not think of SQL as the place to clean data, not because it can't, but because creating a database, loading data, etc. is hard. At least to them.
With a little R or Python, you can quickly load and analyze data, especially if you are just doing analysis quickly and not saving things. I would argue that you don't need to know it, but it can be helpful and easier.
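As an illustration of that quick, throwaway workflow, here is a stdlib-only Python sketch (pandas would make it shorter still); the data and column names are invented:

```python
# Quick throwaway analysis without standing up a database:
# load a CSV, drop rows with missing values, and summarise by group.
# (pandas would shorten this further; stdlib-only version shown.)
import csv
import io
import statistics

raw = """region,sales
north,100
south,
north,140
south,90
"""  # stand-in for open("sales.csv"); file name is hypothetical

rows = [r for r in csv.DictReader(io.StringIO(raw)) if r["sales"]]  # drop blanks

by_region = {}
for r in rows:
    by_region.setdefault(r["region"], []).append(int(r["sales"]))

for region, values in sorted(by_region.items()):
    print(region, statistics.mean(values))
```

Nothing here needs a server, a schema, or load scripts, which is the convenience being described; for repeatable pipelines on shared data, loading into a database still pays off.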
As far as loading billions of rows, what's the memory size needed? Might be a lot, might not be. Python could handle this for sure, if you have enough RAM on the machine. It's also much easier than C/C++ because you don't need to compile it. Is it faster? Probably not, though the way lots of people write C++ programs, any execution savings would be lost in the time to write and debug the code.
May 18, 2023 at 1:54 am
Well, Steve Jones, perhaps we will have to agree to disagree on that. I am not just a SQL developer; I have used numerous languages over the years, including C, C++, C#, Java, Python, and a lot more. I did not find Python to be any more difficult or any easier than the various other languages; it was simply just another language, and the language that those I was working for were using.

Currently, those I am working with use Java and SQL, and we try to offload the data processing to the Java side of things (a compiled C-based language) and the data collection and storage to the SQL side of things. This is a common approach for large databases, as working with that data needs to be done quickly. However, if you are working with a much smaller dataset, then T-SQL will work just fine for just about all of it. I draw this conclusion from working with many different sizes of native database languages: for instance, VBA in conjunction with MS Excel, perhaps the smallest database out there; VBA with MS Access; Python with MySQL, which does not support stored procedures; as well as numerous other combinations. The only time I would use Python, or any language for that matter, with a database is if that database does not support the various aspects that MS SQL Server or Oracle have incorporated into their databases, or if I am working with large-scale data and need a much faster processing language, at which point I would not select Python, as that would be a step backwards rather than forwards.
All I am trying to say with all of that is: be careful what you recommend to others, as they might go down the wrong rabbit hole based on that advice, and suggesting that someone use Python or R instead of the SQL Server natural environment is definitely advising someone to go down the wrong rabbit hole.
By the way has anyone else noticed that this discussion has gone way off topic??
May 18, 2023 at 2:37 am
By the way has anyone else noticed that this discussion has gone way off topic??
It became a "water cooler" thread very early in its existence. 😉 Most of us old timers simply refer to it as "The Thread". 😀
--Jeff Moden
Change is inevitable... Change for the better is not.