Untouched Data

  • Comments posted to this topic are about the item Untouched Data

  • An interesting topic, but for me it has a spin-off that could be quite far reaching.

    Here in the UK, personal data is protected by the Data Protection Act, and one of the basic tenets of the act is that companies storing personal data should only do so for as long as the data remains relevant, useful and up to date. After this point, the companies have a legal obligation to delete the data, and the penalties are potentially pretty severe.

    So determining which data is no longer being actively used isn't just useful for archiving, it'd be good to know for legal compliance. I can certainly see a case could be built for SQL Server functionality to be extended to provide more detailed information in this area.

    Semper in excretia, suus solum profundum variat
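A side note on the point above about detecting data that is no longer actively used: SQL Server (2005 and later) already exposes some of this through the `sys.dm_db_index_usage_stats` DMV, which records the last seek/scan/lookup per index. A rough sketch of such a check follows; note the counters reset on every service restart, so this is indicative only, not proof that data is unused:

```sql
-- Sketch: list tables with no recorded user reads since the last
-- SQL Server restart. sys.dm_db_index_usage_stats resets on restart,
-- so treat this as a hint, not as compliance evidence.
SELECT  t.name AS table_name,
        MAX(us.last_user_seek)   AS last_seek,
        MAX(us.last_user_scan)   AS last_scan,
        MAX(us.last_user_lookup) AS last_lookup
FROM    sys.tables AS t
LEFT JOIN sys.dm_db_index_usage_stats AS us
       ON us.object_id = t.object_id
      AND us.database_id = DB_ID()
GROUP BY t.name
HAVING  MAX(us.last_user_seek)   IS NULL
    AND MAX(us.last_user_scan)   IS NULL
    AND MAX(us.last_user_lookup) IS NULL;
```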

  • While some data has to be deleted, other organisations (e.g. the NHS) have a requirement to store information for 20 years or more.

    The only real solution I have found that still allows immediate data access is the compressed data type in MySQL.

    Now if MS could provide something like that, with the bonus of data encryption, I would be very happy. :)

  • In the article you ask for medium-speed, high-capacity storage. Isn't this where the ability to hang SATA drives off a SAS controller is intended to be used?

  • Seems like there's a call for it... I can think of a lot of instances where that would come in handy.

  • There are systems out there that do give you options when it comes to storage, speed of access to the files and such: document imaging systems. These can be set up to use only optical disks for storage, providing slow access for the user when retrieving documents, or documents can be stored on high-speed magnetic media (SAN arrays, for instance), or a combination of both.

    In a typical system, documents are stored on optical disk in some jukebox back in the corner, and also cached on a SAN, NAS or other higher-end disk system. Documents that are accessed regularly remain in the cache as long as someone is pulling them up on a regular basis. Those that get written once and never accessed again are written to the cache, then purged after a set time interval if they haven't been accessed, but are still sitting out on an optical disk. This way, if a user needs to view a document that was put into the system months before, the system will pull the document from the jukebox while the user waits. Or, if the document needed is something that was stored recently, they will have an instant view of it because it is still in the cache. And all the database is doing is holding the pointers to where the documents are stored, and the search information for the user to locate them.

  • Here in the U.S. we have the Sarbanes-Oxley Act (law) that requires us to retain all data for years, or possibly even forever... e-mails too. This is according to our law department. I can see it growing to consume everything we have... what do you think?

  • Make partitions without Enterprise Edition by creating a view unioning tables on different filegroups. These tables would all have the same identity increment but different identity seeds, so they never contain one another's IDs and the key remains unique. Next you need INSTEAD OF triggers to insert the data into the primary table, and to update data in the correct table. This way you can have the older data on your slower drive's filegroup.
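The approach described above can be sketched roughly as follows (all table, column, and filegroup names here are hypothetical, and the trigger covers only the insert path; a real setup would also need update/delete handling):

```sql
-- Poor-man's partitioning: two tables on different filegroups, with
-- offset identity seeds and a shared increment so IDs never collide.
CREATE TABLE dbo.Orders_Current (
    OrderID   INT IDENTITY(1, 2) PRIMARY KEY,  -- odd IDs
    OrderDate DATETIME NOT NULL
) ON [FG_Fast];

CREATE TABLE dbo.Orders_Archive (
    OrderID   INT IDENTITY(2, 2) PRIMARY KEY,  -- even IDs
    OrderDate DATETIME NOT NULL
) ON [FG_Slow];
GO

-- The view unions both tables so applications see one logical table.
CREATE VIEW dbo.Orders AS
    SELECT OrderID, OrderDate FROM dbo.Orders_Current
    UNION ALL
    SELECT OrderID, OrderDate FROM dbo.Orders_Archive;
GO

-- INSTEAD OF trigger routes new rows to the "current" table.
CREATE TRIGGER dbo.Orders_Insert ON dbo.Orders
INSTEAD OF INSERT AS
BEGIN
    INSERT INTO dbo.Orders_Current (OrderDate)
    SELECT OrderDate FROM inserted;
END;
```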

  • Legal requirements for data storage and deletion certainly need to be considered first when looking at a potential solution. After legalities, we need to look at the business rules.

    My former employer, a school district, had a legal requirement to keep all records (student/staff/vendor/etc.) for seven years before they were eligible for purging. When I started with the district, everything long-term was printed and ultimately microfiched (destroying the original paper record), so someone needing old records would have to go to the vault, pull the microfiche and then print out what they needed. Outside of the legalities, this district made a decision early in its existence to keep records indefinitely, which proved useful several years ago when an older gentleman came in needing his high school diploma, which had been destroyed in a house fire.

    When computers came to the district, they started storing everything on tape but NEVER tested the restorability of the tapes until I started, when we almost couldn't pull the required information from the tape. I converted my side to CD and then to DVD storage as the sizes of the databases got too large to keep practically on CD. While writable CD/DVD media is not perfect, it should last for more years than anyone currently there will care about, at which point there may be newer technology that can read a failed CD/DVD and still recover the information.

    At this point, I'm not convinced there is a reliable long-term medium that records can be stored on. Microfiche fades over time, paper just falls apart if not stored properly, tapes demagnetize, CD/DVD media aren't rated for more than 100 years, and HDDs can fail due to crashes. Of course, if legalities require deletion of older data, I'm not sure what the best storage medium would be. I would still choose optical media, but you can't delete from a disc that has been closed.

    Definitely an interesting discussion...

  • I support the DB backend to many DMS systems. We utilize the concept above of images online, nearline and finally offline. With the cost of storage, it is feasible for many companies to keep everything online on a SAN or NAS. The bigger issue is restorability. Several of our customers ran into issues with their tapes and hardware after Katrina; however, the DVD and CD media in the jukes just needed to be wiped off and were good to go. They were able to bring in a new juke and database server (fortunately the DB backups were OK, as they were all recent and offsite). The image backups, though, were not as recent in some cases, and in others they were not offsite yet when the hurricane hit.

    Another issue that I run into is performance when the data is not on the online storage. Users don't care where it is or how it is stored, but they want the same response times regardless. Unfortunately, we are an instant-gratification people. That is the bigger issue we face with our space concerns: having the data as quickly available as always, as well as setting expectations for end users.

  • Data storage is a great concern in terms of software upgrades.

    We do have to keep data for many years due to the FDA (more than 40 years), Sarbanes-Oxley, etc. Also, public data, I assume, has to be kept for a long time: mortgages, home titles, title insurance, etc. Personally, I hope my personal pictures will be saved for future generations.

    We keep some data in the database and some outside the database in many types of files and documents: Word, Excel, PDF, JPG and much more.

    While it is feasible to upgrade the database once every three years, what do we do with the other files? Documents stored in the file system or on optical storage? Old CDs with pictures? By the way, CDs and DVDs have a limited shelf life, in case you don't know. The software that reads these documents changes, and some features become obsolete.

    Back to the databases: data in the database is read mostly by the application or by reporting tools. The databases are so over-normalized that it is very difficult for a person unfamiliar with the application's data dictionary to create a meaningful report outside the application. Application versions change, people familiar with the data structure retire, and if the database just sits there, it is possible nobody will be able to use its data in 10 years.

    All this presents a real possibility for new types of IT businesses specializing in bringing stored data up to modern tools.

    Regards, Yelena Varsha

  • I'll admit that I hadn't thought about retention when I read this article, so very good points about how long to keep things.

    I've used optical jukeboxes, but they were problematic from a reliability standpoint. Also, applications had to deal with (relatively) long wait times, a few minutes at times. It's good to see people still using them, and I hope they've grown greatly in capacity and reliability.

    Disk drives are always moving and consume lots of power, which becomes a concern for cost and reliability reasons. Moving to optical is a much lower-cost measure, much slower than disk but much faster than tape. Perhaps with the move to Blu-ray, and declining consumer interest, we'll see more high-capacity optical drives appearing soon.

  • I worked at a company that was doing a great deal of data warehousing. We worked on a project where "facts" about the data were stored as it was staged and then placed into a reporting database that was accessible through Microsoft SQL Server Reporting Services. Most of the data wasn't worth diving into and could be archived or even purged; however, this reporting allowed us to pull a particular file or data-staging effort to find out why the "fact" deviated from our standards.

  • Luckily, I work for a private company, so we're not dictated to by legislation on how long we can keep data. We do behavioral targeting and have a lot of large ISPs as customers, so we process a ton of information. The downside is, we're always having to purge information, which is obviously not something we'd like to do. More recently, the DBA team and I have developed a new reporting project that summarizes all of that data in a "GrandCentral" database, one that all of our backends summarize their data to. This allows us to get rid of really old data and gives us the ability to mark trends much more easily and quickly. These kinds of projects take a long time, depending on how your reporting system works... but it's so worth it in the end.

  • Frankly, with the cost of storage as low as it is, in my opinion it makes little sense not to keep everything online, unless you are talking about petabytes; but even then, chances are that if you have that much data you will have the infrastructure and budget to deal with it.

    The cost differential between online and nearline storage is so narrow that the management, maintenance and reduced functionality are just not worth it.

    Full online storage with a good disaster recovery plan, secured document (paper) storage (e.g. Iron Mountain), is the way to go.

    You will spend more in time, energy, staff costs, and ultimately hardware costs trying to come up with some nearline solution than with the online solution, and provide less functionality.

    Save your money and put it into your disaster recovery program. Give users online storage for anything that they need to access.

    Move true archival data (data that will never be accessed unless there is a compliance issue like an audit or lawsuit) to 3rd-party offline/nearline storage.

    Remember that you can also get cheap, highly secure nearline data storage from 3rd-party providers like Amazon and Google. As an example, we have Postini, our email filtering provider, archive all of our email offline for regulatory compliance.

    PS

    SSC Veteran: It sounds like you are still storing images in the file system. Store them in the database as BLOBs; it's cleaner and faster, allows for better management, and you can even partition the database IF you really don't have the online storage. For what it's worth, we have over 4 million images in the database and it is still less than 100GB.

    Forum Newbie: It does not matter if you are private; you still need to keep corporate financial and transactional data for 7 years (5 years for the IRS, 7 for the legal statute of limitations for fraud in the US).

