Look at the data

Before making any decision at all, I looked at the data that would be stored.

  • It was going to be unstructured.
  • It would be sparse (AF).
  • It would have non-standard chars (emojis)
  • It would store money
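
To make those four traits concrete, here is a minimal Python sketch. The records, field names and values are all invented for illustration and are not the actual data I was looking at; the point is only what each trait forces you to check before picking a store.

```python
import json
from decimal import Decimal

# Hypothetical records: every field name and value here is made up.
records = [
    {"id": 1, "note": "Great service 👍", "amount": "19.99", "currency": "USD"},
    {"id": 2, "tags": ["vip"]},  # sparse: most fields are simply absent
    {"id": 3, "amount": "0.10", "currency": "USD", "extra": {"source": "import"}},
]

# Emoji survive a round trip only if every hop stays Unicode/UTF-8.
encoded = json.dumps(records, ensure_ascii=False).encode("utf-8")
decoded = json.loads(encoded.decode("utf-8"))
assert decoded[0]["note"] == "Great service 👍"

# Money: keep it as a string on the wire and parse into Decimal, not float.
assert 0.1 + 0.2 != 0.3  # binary floats cannot represent these amounts exactly
total = sum(Decimal(r["amount"]) for r in decoded if "amount" in r)

# Sparsity: count how many records actually carry each field.
field_counts = {}
for r in decoded:
    for key in r:
        field_counts[key] = field_counts.get(key, 0) + 1

print(total)
print(field_counts)
```

Whatever store you end up with has to pass the same checks: does it keep the emoji intact, does it have an exact decimal (or integer cents) type, and does it cope with records where most fields are missing.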

Asked what we would do with the data

Trick question. Whatever the business decides it wants to do right now will be thrown out of the window next year, so take the answers with a grain of salt.

  • Surface it to users (fast, of course)
  • Allow other systems to access the data via some API.
  • Perform a bunch of aggregations (Protip: Assume this is going to happen no matter what you hear)
  • Template-based reporting and some ad-hoc queries against the system (Protip: Assume more ad-hoc than not)

Expertise (Quick Aside)

I firmly believe in “right tool for the right job”, and your manager may say the same. But actually doing this is rare. Good intentions usually do not survive corporate pressure to produce a solution quickly, predictably, and with existing technology. You have to be realistic and see what the organization can support. Does the team around you have NoSQL experience? Does the culture foster trial-and-error problem solving? Does it value experimentation? Can it tolerate a team going up against a learning curve? There will be a ton of inertia and cost pressure, so you may have to piggyback on an existing system, right tool be damned.

Accessibility

What will I have to integrate with? This is not a huge deal if you are going with a simple DB solution, but if you opt for something like Hadoop you will quickly be facing a menu of options: Hive, Pig, HBase. You can also put the data in cloud storage such as S3. AWS is building tools that let you query S3 buckets, so that may also be an option. Whichever way you lean, the most important thing to consider when making this decision is how the information will be accessed.

Quick Summary:

  • RDBMS: probably the most standard way to go about this. May not scale well for big data.
  • Hadoop: will scale, but adds operational complexity and requires extra tools, since the data is not easily accessible out of the box.
  • S3 (cold storage): progress on ways to access and query this data is happening quickly, but for now I like to think of it as a backup or cold storage. With time this may change.

Analysis

This depends on the organizational culture and how “data savvy” it is. The question here is: now that we have access to the data, what exactly are we doing with it? Do we want to simply “COUNT” and “SUM” a few fields nightly? Do we want to do that hourly? Or in real time?

Do we want to add a few more fields and do the same thing as above, but this time with the sums and counts grouped by some more fields? In my experience, data and queries simply lead to more data and queries. But time and space are not infinite, so urge the teams to think as far forward as possible and come up with just enough flexibility to address their needs. Again, if the data is not huge, an RDBMS will suffice.
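
As a rough illustration of how “a few counts and sums” turns into “the same thing, but grouped by more fields”, here is a small sketch using Python's built-in sqlite3 module. The payments table, its columns and the summarize helper are hypothetical, made up for this example; amounts are stored as integer cents to keep the sums exact.

```python
import sqlite3

# In-memory database with a made-up table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (region TEXT, channel TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [
        ("EU", "web", 1999),
        ("EU", "web", 500),
        ("EU", "app", 250),
        ("US", "web", 1000),
    ],
)

ALLOWED_GROUPINGS = {"region", "channel"}


def summarize(conn, group_by):
    """Return (group columns..., COUNT, SUM) rows for the given grouping."""
    # Column names cannot be bound as parameters, so check them against a
    # fixed whitelist instead of interpolating arbitrary input.
    if not group_by or not set(group_by) <= ALLOWED_GROUPINGS:
        raise ValueError(f"unsupported grouping: {group_by}")
    cols = ", ".join(group_by)
    sql = (
        f"SELECT {cols}, COUNT(*), SUM(amount_cents) "
        f"FROM payments GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()


# Today's request: totals per region.
by_region = summarize(conn, ["region"])
# Next quarter's request: the same totals, split by one more field.
by_region_channel = summarize(conn, ["region", "channel"])

print(by_region)
print(by_region_channel)
```

The second request costs one extra argument here because the grouping was left open from the start. That is the “just enough flexibility” I mean: not a general query engine, just room for the next obvious question.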

In the end, this decision comes down to what your organization and teams can tolerate, both learning-curve wise and maintenance wise. For this reason, no matter how excited I am about a new technology, I try to take the long-term value approach. If the company has a mature data ecosystem and an enthusiastic operations team, I would happily go with a distributed solution like Hadoop. Next, based on the nature of the data, the choice will be between NoSQL and a traditional RDBMS. The rule of thumb that I use: if the data is coming into your system from disparate systems and the schema is likely to change, go with a NoSQL solution (Mongo/Cassandra). If the data is coming in from a source you control and can structure, go with an RDBMS solution. This is not a hard and fast rule, and obviously lots of other concerns come into play, but those are my guiding principles.
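
If it helps to see the rule of thumb written down, here is a deliberately tiny Python sketch of it. The function name and its two yes/no inputs are my own simplification for illustration; a real decision has many more inputs (team experience, operations support, data volume), which is exactly why the middle cases come back as a judgment call.

```python
def suggest_store(sources_are_disparate: bool, schema_likely_to_change: bool) -> str:
    """Encode the rule of thumb above; a starting point, not a verdict."""
    if sources_are_disparate and schema_likely_to_change:
        # Many upstream systems and a moving schema: favor schema flexibility.
        return "NoSQL (e.g. Mongo/Cassandra)"
    if not sources_are_disparate and not schema_likely_to_change:
        # One source you control and can structure: favor a relational store.
        return "RDBMS"
    # Mixed signals: the rule of thumb does not decide this on its own.
    return "judgment call"


print(suggest_store(sources_are_disparate=True, schema_likely_to_change=True))
print(suggest_store(sources_are_disparate=False, schema_likely_to_change=False))
```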