May 4, 2021 at 9:47 am
Dear community,
I want to ask you the following question:
I have a larger set of data. It consists of around 700 Million records (rows), each row has around 30 attributes (fields). It is mainly text, only a few fields are numerical.
The data is split into around 200 separate spreadsheets. The formats are spreadsheets xlsx/CSV.
I now want to analyse this data, by applying simple queries to this data .
Is there a system (user interface/ application framework) which I can use to apply our queries?
How can I “move in” and “unify” all our spreadsheets into this system at the beginning?
Unfortunately, I am only a beginner in data management. So I do not have much technical knowledge yet. If there is any solution which is for beginners, I really would appreciate it.
Thank you
May 4, 2021 at 12:22 pm
This might not be real simple. Microsoft lists a bunch of ways you can do it though. Probably the easiest is using the SQL Server Import/Export Wizard in Excel. However, to really make this happen in an efficient way, programmatically automating it would be best. In that case, I'd look to the BULK INSERT command as way to get it done. Both those are listed at the link.
However, to do both, you'll need to have a database to put it into. If you're not ready for SQL Server installs and all that entails, you may want to look to Azure SQL Database as a much easier alternative.
Best of luck on the project.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
May 4, 2021 at 4:15 pm
I would first determine what you need to do with the data and if this is a long term solution (ie you are going to need to do this import of data on a regular basis) or a one-off solution (ie once you merge the data, you will never need to merge the data again). You also need to determine what tool you are going to be using to review the data. Are you wanting to write TSQL to pull data out and review it that way? What about SSRS? Or are you looking to dump the data back into Excel in the end to review and report on?
Getting the data is only half the battle, but it is an important part of the battle because whatever tool you use for doing the analytics after will perform best if the data is in a single location. What I mean is if, for example, you wanted to pull the data into Tableau, that supports pulling data from Excel into it. But if you are pulling from 200 Excel files, it is not going to perform nicely and it will likely eat up a lot of memory. Plus, tableau is not a free reporting tool (note I do not work for Tableau or any of their parent companies, I just recently demoed the tool so it is fresh on my mind).
If it is long term, I would be looking at migrating the data to a database of some sort. Or, at the very least, automating the process of pulling data from the xlsx files and csv files into a single file.
If this is a one-off thing, I'd either manually merge the data (time consuming, but if it is a one off thing, it may be faster and cheaper than learning/buying a new tool) or look for some tools to merge spreadsheets into 1 with multiple workbooks or, if the data is similar enough, a single workbook in excel. Once all of the data is in one excel workbook, then you can manipulate it as you see fit.
If you have anyone on your team with scripting experience (VB Script or powershell for example), or coding experience (pretty much any .NET language like C#) then pulling the multiple XLSX files or multiple CSV files into 1 shouldn't be too hard to do. At the end of this process, your 200 files become 2 files. At this point, it is easy enough to open them both in Excel and copy-paste between the 2 to get 1 excel file to work with.
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there is negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
May 4, 2021 at 5:11 pm
The first problem is your files - since Excel has a limitation of 1 million rows, opening any of those files that exceed that number of rows in Excel just truncates the additional rows of data.
The next issue is how you want to 'query' this data and present the results. It cannot be done using Excel - because of the row limitation, so you will need something else. That something else is going to be decided by what you currently have available, if anything. If you have Office 365 - you may already have Power BI available, but if not there is a free desktop version available.
It may be possible to load all those CSV files directly into Power BI - but that is going to depend on the actual data and how much of it you actually need. If you have to combine the data or not - or if you can have separate Power BI projects with subsets of the data.
More analysis will be needed on the data...it may be that loading to a database and then setting up your reporting system (Power BI, SSRS, Tableau, other) to use the tables in the database will be the best option. But it all depends on the reporting requirements and whether the data needs to be combined or not.
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
May 5, 2021 at 11:24 am
Ooh, yeah, PowerBI. Another vote for exploring that space as a possible solution. In fact, it might work for you and not require a lot of data movement. Maybe.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply