Before making any decision at all, I looked at the data that would be stored.
Trick question. Whatever the business decides it wants to do right now will be thrown out the window next year, so be ready to take this with a grain of salt.
I firmly believe in “right tool for the right job”, and your manager may say that as well. But actually doing this is rare. Good intentions usually do not survive corporate pressure to produce a solution quickly and predictably, using existing technology. You have to be realistic and see what the organization can support. Does the team around you have NoSQL experience? Does the culture foster trial-and-error problem solving? Does it value experimentation? Can it tolerate a team going up against a learning curve? There will be a ton of inertia and cost pressure, so you may have to piggyback on an existing system, right tool be damned.
What will I have to integrate with? This is not a huge deal if you are going with a simple DB solution, but if you opt for something like Hadoop you will quickly face a menu of options: Hive, Pig, HBase. You can also put the data in the cloud, for example in S3. AWS is building tools to allow you to query S3 buckets, so that may also be an option. When making this decision, it is very important to consider how this information will be accessed.
This depends on the organizational culture and how “data savvy” they are. The question here is: now that we have access to the data, what exactly are we doing with it? Do we want to simply “COUNT” and “SUM” a few fields nightly? Do we want to do that hourly? Or in real time?
Do we want to add a few more fields and do the same thing as above, but this time with the sums and counts grouped by some more fields? In my experience, data and queries simply lead to more data and queries. But time and space are not infinite, so urge the teams to think as far ahead as possible and come up with just enough flexibility to address their needs. Again, if the data is not huge, an RDBMS would suffice.
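To make the “COUNT and SUM, then group by more fields” progression concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever RDBMS you already have. The `events` table, its columns, and the sample rows are all hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical sample rows: (day, region, product, amount).
ROWS = [
    ("2024-01-01", "east", "widget", 10.0),
    ("2024-01-01", "east", "gadget", 5.0),
    ("2024-01-01", "west", "widget", 7.5),
    ("2024-01-02", "east", "widget", 2.5),
]

# Columns we allow grouping on (assumed schema, not from any real system).
ALLOWED_GROUP_COLUMNS = {"day", "region", "product"}


def build_db():
    """Create an in-memory database with the hypothetical events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (day TEXT, region TEXT, product TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", ROWS)
    return conn


def rollup(conn, group_by):
    """Return (group columns..., row count, amount sum) per group."""
    if not group_by:
        raise ValueError("need at least one column to group by")
    # Column names cannot be bound as SQL parameters, so check them
    # against a fixed whitelist before building the query string.
    for col in group_by:
        if col not in ALLOWED_GROUP_COLUMNS:
            raise ValueError(f"unknown column: {col}")
    cols = ", ".join(group_by)
    sql = (
        f"SELECT {cols}, COUNT(*), SUM(amount) "
        f"FROM events GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()


conn = build_db()
by_day = rollup(conn, ["day"])                     # the original nightly ask
by_day_region = rollup(conn, ["day", "region"])    # next year's "one more field"
```

The second call is the same query with one more grouping column, which is usually how the requests grow. At this scale a plain RDBMS handles both without any new infrastructure.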
In the end, this decision comes down to what your organization and teams can tolerate, learning-curve-wise and maintenance-wise. For this reason, no matter how excited I am about a new technology, I try to take the long-term value approach. If the company has a mature data ecosystem and an enthusiastic operations team, I would happily go with a distributed solution like Hadoop. Next, based on the nature of the data, the choice will be between NoSQL and a traditional RDBMS. The rule of thumb that I use: if the data is coming into your system from disparate systems and the schema is likely to change, go with a NoSQL solution (Mongo/Cassandra). If the data is coming in from a source you control and can structure, go with an RDBMS solution. This is not a hard and fast rule, and obviously lots of other concerns come into play, but those are my guiding principles.
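Written out as code, the rule of thumb looks roughly like the sketch below. The `DataProfile` fields and the `suggest_store` function are hypothetical names invented for this illustration. The `data_is_huge` check and the fallback for mixed cases (for example, a source you control whose schema still changes) are my own assumptions, since the rule above does not cover them.

```python
from dataclasses import dataclass


@dataclass
class DataProfile:
    """Hypothetical summary of the questions asked in this post."""
    source_controlled: bool   # data comes from a source you control
    schema_stable: bool       # schema is unlikely to change
    data_is_huge: bool        # assumption: too big for a single RDBMS
    mature_ecosystem: bool    # mature data ecosystem and keen ops team


def suggest_store(profile: DataProfile) -> str:
    """Return a coarse suggestion; a starting point, not a verdict."""
    # A distributed platform only pays off if the organization can run it.
    if profile.data_is_huge and profile.mature_ecosystem:
        return "distributed (e.g. Hadoop)"
    # Disparate sources plus a shifting schema: lean toward NoSQL.
    if not profile.source_controlled and not profile.schema_stable:
        return "NoSQL (e.g. Mongo/Cassandra)"
    # Assumption: every other combination falls back to an RDBMS.
    return "RDBMS"


feed_from_partners = DataProfile(
    source_controlled=False,
    schema_stable=False,
    data_is_huge=False,
    mature_ecosystem=False,
)
internal_orders = DataProfile(
    source_controlled=True,
    schema_stable=True,
    data_is_huge=False,
    mature_ecosystem=False,
)

partner_choice = suggest_store(feed_from_partners)
orders_choice = suggest_store(internal_orders)
```

Note that the organizational check comes first: huge data without the ecosystem to support it still falls through to the simpler options, which matches the long-term value approach.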