May 1, 2018 at 12:00 am
Comments posted to this topic are about the item Random Word Generation for Data Scrubbing
May 1, 2018 at 1:30 am
Thanks for this article John, interesting approach.
😎
One thought though, you might want to convert the scalar UDF's to inline functions, big difference in performance when handling large sets. I posted a test harness for similar functionality few years back and the results showed almost 10 times difference.
May 1, 2018 at 3:04 am
If considering doing something similar in production you might want to ensure your 'words' source doesn't include things like the F and C bombs 🙂
May 1, 2018 at 3:30 am
redgate have a tool for generating test data that looks quite good. I don't know how much it costs but it would save a lot of effort.
https://www.red-gate.com/hub/product-learning/sql-data-generator/how-to-generate-various-forms-of-realistic-data-for-testing-development-and-prototypes
May 1, 2018 at 6:00 am
Nice article. I have been working on a similar concept for our environments.
I did notice a disparity between you sql for the table and your paragraph describing it.
In the sql, you have the PK as the column TheWord. In the paragraph, it says the integer identity is the PK.
Bill Soranno
MCP, MCTS, MCITP DBA
Database Administrator
Winona State University
Maxwell 143
"Quality, like Success, is a Journey, not a Destination" - William Soranno '92
May 1, 2018 at 6:09 am
Great job, John. Can you share the link for the dictionary download that you used? I ask because most word lists seem to be rather unabridged when it comes to highly offensive words and know that you're rather diligent and might have found one that doesn't contain such words.
I also agree on the idea of using Inline Table Value Functions (iTVF) as an Inline Scalar Function (iSF) rather than using regular Scalar Functions for performance especially if you have large tables to anonymize.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 1, 2018 at 7:42 am
Hi. I screwed up! Emailing Steve Jones to replace the zip file with the sanitized version. The original list was from http://www-01.sil.org/linguistics/wordlists/english/
Thanks
John.
:Wow: :blush: (Red faced)
May 1, 2018 at 7:52 am
SQLBlimp - Tuesday, May 1, 2018 7:42 AMHi. I screwed up! Emailing Steve Jones to replace the zip file with the sanitized version. The original list was from http://www01.sil.org/linguistics/wordlists/englishThanksJohn. (Red faced)
The link http://www01.sil.org/linguistics/wordlists/english does not work but http://www.sil.org/linguistics/wordlists/english does, redirects to http://www-01.sil.org/linguistics/wordlists/english
😎
May 1, 2018 at 7:54 am
Eirikur Eiriksson - Tuesday, May 1, 2018 7:52 AMSQLBlimp - Tuesday, May 1, 2018 7:42 AMHi. I screwed up! Emailing Steve Jones to replace the zip file with the sanitized version. The original list was from http://www01.sil.org/linguistics/wordlists/englishThanksJohn. (Red faced)The link http://www01.sil.org/linguistics/wordlists/english does not work but http://www.sil.org/linguistics/wordlists/english does, redirects to http://www-01.sil.org/linguistics/wordlists/english
😎
Hi - I noticed and corrected. Still freaked out by the nasty list.
Thanks
John :blush:
May 1, 2018 at 7:56 am
Eirikur Eiriksson - Tuesday, May 1, 2018 1:30 AMThanks for this article John, interesting approach.
😎One thought though, you might want to convert the scalar UDF's to inline functions, big difference in performance when handling large sets. I posted a test harness for similar functionality few years back and the results showed almost 10 times difference.
Thanks! I will look into that today, and perhaps post revised functionality into this thread.
Thanks
John.
May 1, 2018 at 8:34 am
What a great article - hands down. Thank you. I too am working on something similar. I was using word-counts and Loreum Ipsum replacements. I think this approach is far better. I need to ensure that the dictionary is sanitized. I did get sort of lost in all the replies surrounding this and at the moment I am still not sure where to get a sanitized list.
May 1, 2018 at 9:22 am
The "redacted" zip file requires further work. For example, words beginning with "fellat" or "whor", not to mention the N-bomb.
--
Scott
May 1, 2018 at 9:48 am
SQLBlimp - Tuesday, May 1, 2018 7:54 AMEirikur Eiriksson - Tuesday, May 1, 2018 7:52 AMSQLBlimp - Tuesday, May 1, 2018 7:42 AMHi. I screwed up! Emailing Steve Jones to replace the zip file with the sanitized version. The original list was from http://www01.sil.org/linguistics/wordlists/englishThanksJohn. (Red faced)The link http://www01.sil.org/linguistics/wordlists/english does not work but http://www.sil.org/linguistics/wordlists/english does, redirects to http://www-01.sil.org/linguistics/wordlists/english
😎Hi - I noticed and corrected. Still freaked out by the nasty list.
Thanks
John :blush:
It's the same list that Princeton uses so don't feel bad. What would be really cool is if we all got together and submitted lists of words that need to be removed so that it would help everyone. Of course, we wouldn't just past them on the forum. Perhaps attaching readily consumable text files with the label of NSFW.txt would help. They could be either the words or words using % wildcards to make it easy do delete from an original list.
My first word would be "XML" :D:D:D
--Jeff Moden
Change is inevitable... Change for the better is not.
May 2, 2018 at 7:57 am
Jeff Moden - Tuesday, May 1, 2018 9:48 AMSQLBlimp - Tuesday, May 1, 2018 7:54 AMEirikur Eiriksson - Tuesday, May 1, 2018 7:52 AMSQLBlimp - Tuesday, May 1, 2018 7:42 AMHi. I screwed up! Emailing Steve Jones to replace the zip file with the sanitized version. The original list was from http://www01.sil.org/linguistics/wordlists/englishThanksJohn. (Red faced)The link http://www01.sil.org/linguistics/wordlists/english does not work but http://www.sil.org/linguistics/wordlists/english does, redirects to http://www-01.sil.org/linguistics/wordlists/english
😎Hi - I noticed and corrected. Still freaked out by the nasty list.
Thanks
John :blush:It's the same list that Princeton uses so don't feel bad. What would be really cool is if we all got together and submitted lists of words that need to be removed so that it would help everyone. Of course, we wouldn't just past them on the forum. Perhaps attaching readily consumable text files with the label of NSFW.txt would help. They could be either the words or words using % wildcards to make it easy do delete from an original list.
My first word would be "XML" :D:D:D
...followed by "JSON" 😀
😎
Viewing 14 posts - 1 through 13 (of 13 total)
You must be logged in to reply to this topic. Login to reply