
My Initial Attempts at Managing Unstructured Data – Part 2

Chris Farmer Data Insight, Managed IT Support Services

Continuing from my previous post, My Initial Attempts at Managing Unstructured Data – Part 1.

Email Archiving (2004)

To help with the growth of email, we put in a little tool by KVS, pulled out loads of email, centralised it, de-duplicated it and never looked back. Great tool.

Our first run took out over a million objects and brought our information stores down to reasonable sizes.

The backups worked in the time periods allowed and life was looking a little sweeter.

Automatic Growth Monitoring (2006)

So, back in part one, I was monitoring things manually, but then I found something to watch our unstructured data. It could show duplicated data on a particular server, the age of those files and how much space was used and lost, but I had loads of servers.

I found that I could install it on all of my file servers, but it would only report on them one at a time, so I didn’t know what my entire estate was up to without still doing some manual Excel work.

It also began to slow once my data sets got large, even though it was using a high-end SQL Server, and the UI was clunky but worked.

This software remained until the databases just got too large. I had to purge 90% of my historical data, so my growth metrics were shot.

Eventually the product was abandoned; too much work and admin was required.

Virtualization (2006)

Now this really is only a side note, but we added a newer HP SAN and virtualized about 95% of our hardware infrastructure. This was excellent news for manageability, but it meant that server sprawl occurred.

Suddenly the team and I had more unstructured data than before. This was causing a backup nightmare. But life was good, as our internal customers could have what they needed, when they needed it.

Outsourcing (late 2007 onwards)

So, outsourcing everything was the end game for the multinational and was announced late 2007, but this left the MD/CEO and FD/CFO aware of potentially big budget hits.

Pricing was given as guidelines from Central Corporate IT, which equated to a cost per MB.

This meant that the MD/CEO and FD/CFO wanted to know: ‘How much data do we have?’, ‘What are the growth rates?’, ‘What do you predict in growth?’, ‘How old is the data?’, ‘When was it last used?’ If only we (I) hadn’t junked that slow tool, or there had been something on the market capable of providing that information.
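In hindsight, even a crude script could have answered the age and last-used questions while we waited for a proper tool. Here’s a minimal sketch of what I mean, assuming Python and a made-up share path (the path, the age buckets and the five-year cut-off are mine for illustration, not anything we actually ran):

    import os
    import time
    from collections import Counter

    # Hypothetical share path; in practice one of these per file server.
    SHARE = r"\\fileserver01\data"

    NOW = time.time()
    YEAR = 365 * 24 * 3600

    size_by_age = Counter()   # bytes, grouped by years since last modification
    count_by_age = Counter()

    for root, _dirs, files in os.walk(SHARE):
        for name in files:
            path = os.path.join(root, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip locked or unreadable files
            # st.st_atime would answer 'when was it last used?', but many
            # servers have last-access updates switched off, so modified
            # time is the safer bet.
            age_years = int((NOW - st.st_mtime) // YEAR)
            bucket = f"{age_years}-{age_years + 1} years" if age_years < 5 else "5+ years"
            size_by_age[bucket] += st.st_size
            count_by_age[bucket] += 1

    for bucket in sorted(size_by_age):
        gb = size_by_age[bucket] / 1024 ** 3
        print(f"{bucket:>10}: {count_by_age[bucket]:>8} files, {gb:8.1f} GB")

Run that per server, drop the numbers into a spreadsheet, and you at least have ‘how much’, ‘how old’ and a baseline for growth.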

Storage Refresh (2011)

With the high costs associated with outsourcing, the process had stalled. However, our VM storage was close to capacity, IO was dropping through the floor and we needed something.

A request was made to the FD/CFO and a response came back: ‘Do you really need that much? If so, prove it.’

Well, I couldn’t. All the information I had was what I’d put in back in 2006 and what extra chassis and disk capacity I had added over the preceding years. As the company’s asset write-off was over 5 years, I doubled what I needed and added a bit.

I was shot down!

I revised my request, added only 30% and put all my faith in the de-duplication capabilities of the storage.

Has it worked? I don’t know. I left well before the time for a storage renewal request to be put in; all I do know is that it was at 70% capacity after only 2 years in service!
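Of course, with real growth metrics the sizing would have been a one-line projection: take current usage and compound it over the write-off period. A rough sketch with purely hypothetical figures (15 TB in use, 20% annual growth; numbers I’ve made up for illustration, not the ones from that SAN):

    # Hypothetical figures: 15 TB currently used, 20% measured annual growth,
    # projected over the company's 5-year asset write-off period.
    current_tb = 15.0
    annual_growth = 0.20
    writeoff_years = 5

    projected_tb = current_tb * (1 + annual_growth) ** writeoff_years
    print(f"Projected usage after {writeoff_years} years: {projected_tb:.1f} TB")
    # With these numbers that's about 37 TB, roughly 2.5x today's usage,
    # which is a long way past 'current plus 30%' even before de-duplication.

Even modest growth compounds well past a 30% buffer over five years.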

To Summarise

So, what could and would have helped me?

  • Something that could scan and report on all my unstructured and duplicated files across all my servers worldwide (there’s a rough sketch of what I mean just after this list).
  • Ideally taking into account any semi-structured data sources, like Exchange and SharePoint.
  • It would need to be fast!
  • I’d want to be able to schedule the scanning of data, if performance was critical on the server.
  • Easy-to-understand reports, especially how many duplicated files there are and how much space I’m losing to unstructured, duplicated data, with trending and something that the MD/CEO and FD/CFO can understand.
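For that first bullet, a bare-bones version really is just hashing files across the shares and totalling what the copies cost. This is only a sketch, with made-up share names and none of the scheduling or reporting a real product would need:

    import hashlib
    import os
    from collections import defaultdict

    # Hypothetical list of shares; one entry per file server in the estate.
    SHARES = [r"\\fileserver01\data", r"\\fileserver02\data"]

    def file_hash(path, chunk=1024 * 1024):
        """Return the SHA-256 of a file, read in chunks to cope with big files."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    copies = defaultdict(list)  # content hash -> list of (path, size)

    for share in SHARES:
        for root, _dirs, files in os.walk(share):
            for name in files:
                path = os.path.join(root, name)
                try:
                    size = os.path.getsize(path)
                    copies[file_hash(path)].append((path, size))
                except OSError:
                    continue  # locked or unreadable files are skipped

    duplicate_files = sum(len(paths) - 1 for paths in copies.values() if len(paths) > 1)
    wasted_bytes = sum(paths[0][1] * (len(paths) - 1)
                       for paths in copies.values() if len(paths) > 1)
    print(f"{duplicate_files} duplicate files, "
          f"{wasted_bytes / 1024 ** 3:.1f} GB lost to duplication")

A real product would filter by file size before hashing anything, run to a schedule and keep history for trending, but even this much would have told me what duplication was actually costing.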

There is loads more that I would have liked, report-wise, but also something policy-driven that could allow me to stub the data files on my servers and de-duplicate all the unstructured data at the same time, moving it either to some cheap-arse on-premise storage or even to the cloud, and that wasn’t expensive or didn’t need huge capital investment.
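By ‘stub’ I mean moving the file body off to the cheaper tier and leaving a small pointer behind so users still see the file where they left it. A crude sketch of the idea (real archiving products hook into the filesystem properly; this just uses a symlink, and both paths are hypothetical):

    import os
    import shutil

    # Hypothetical locations: a file on the expensive primary share and a cheap archive tier.
    PRIMARY_FILE = r"D:\shares\projects\2009-report.docx"
    ARCHIVE_ROOT = r"E:\cheap-archive"

    def stub_file(path, archive_root):
        """Move a file to the archive tier and leave a symlink 'stub' in its place."""
        target = os.path.join(archive_root, os.path.basename(path))
        shutil.move(path, target)   # the file body now lives on cheap storage
        # On Windows, creating symlinks needs the right privilege; commercial
        # tools use shortcuts or filter drivers rather than plain symlinks.
        os.symlink(target, path)    # users still see the file in its old location
        return target

    stub_file(PRIMARY_FILE, ARCHIVE_ROOT)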

Alerts when users dump massive amounts of data that you aren’t prepared for! (That’s a whole other story.) When the server is full and others cannot save, IT are the ones who get it in the neck.

So, unstructured data is still a problem. Growth is unpredictable and, in IT, you sometimes have to appreciate that your users just don’t have the time to spend sorting out older content and deleting it. Most users would even fear doing this, just in case they deleted the wrong thing.

So if I could do it again, what would I change?

I’d like to have had proper metrics at hand to show my MD/CEO and FD/CFO what I needed; with that kind of hard data I could have justified it.

I would have liked to have sorted out the archiving requirements, rather than just backing everything up and keeping it for ages.

Ideally, to keep my users happy and my backups happy, I just wanted something to stub the data and move it to lower-cost storage which was either replicated or sent to the cloud.

Find me on Twitter @novco007