Which data scientist type are you?
I attended the recent data science fest event in London and could write about the talk topics, but that wouldn’t make an exciting blog: it wouldn’t be as detailed as listening in person. So I’ve opted to write about an observation instead; which data scientist type are you?
I don’t get to go to many data science focused events, but this one made me realise that it’s hard to find data scientists with all the skills those online diagrams say you should have.
The skill set
Whether you’re recruiting a data scientist or aspiring to be one, you’ve probably seen those Venn diagrams showing all the skills you need, or something of that ilk. And if you have all those skills then you’re, as Gianluca Campanella put it in his data science fest talk, a unicorn, and unicorns don’t exist.
Why are these elusive creatures so rare? Well, data science as a profession is relatively new, and whilst there are university courses which cater to it, current data scientists are (mostly) mathematics graduates. I graduated with an MSc in a mathematics-related field, but I don’t have any computer science in my background. Sure, I can pick it up, but it’s certainly no walk in the park!
What do I really need/want to know?
Let’s go into the computer science part specifically; alternative names for this include software engineering, infrastructure, databases and so on. It’s quite a broad term that isn’t restricted to programming. It involves getting to know Hadoop, the applications that run on it, its file formats and more. And the learning curve is steep: nothing Hadoop-related is simple, and that shows in the deployment numbers across the industry. But does a data scientist need to know about infrastructure like this, whether Hadoop is used or not? Or should they just focus on data and algorithms?
This ultimately comes down to the type of company you work for and its focus. I work for a company that sells an analytical platform capable of processing big data, so my job is to use data science to demonstrate what the platform can do. Recently I did a project on TfL buses, and I had to harvest the data myself, which took a lot longer and caused much more stress than I would’ve liked (the weirdness of bash didn’t help). So I need to know about Hadoop and all its complexities, including ETL and data preparation. But if you were a data scientist at a large retailer, there would already be data for you to analyse, taken care of by a data engineer.
Generally the questions are – “Is the infrastructure and data provided for me? Should it be? Is my focus better used elsewhere?”
If it is, then I really wouldn’t need to know anything other than mathematics, some programming and how to present results. What I really want to know is how to implement some clever algorithm in whatever tool I’m using…
At the event there was an emphasis from companies like ZPG, Dataiku and IBM on providing resources to ease the pain on data scientists so that they can focus on analytics, but that assumes data scientists have that specific purpose. At companies like Flock (drone insurance) and Deliveroo, the aim is to write and optimise the algorithms that drive their product, but what if the product is an analytical platform? Does a SAS or Kognitio data scientist have such a laser-focused aim? In my experience I’ve done some consulting, platform demonstration and acted as the guinea pig for a new BI tool, so it’s a no for me. Take the typical workflow:
Data ETL > Data preparation > Analytics > Showcase with BI tool
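The chain above can be sketched as a series of hand-offs, where each stage is a job someone has to own. A minimal sketch in Python, where the stage names come from the workflow but the functions themselves are hypothetical placeholders, not real tooling:

```python
# A toy sketch of the workflow as hand-offs between stages.
# Each function is a hypothetical placeholder for a real job.

def etl(raw):
    """Land the raw records somewhere queryable (often a data engineer's job)."""
    return [r.strip() for r in raw]

def prepare(landed):
    """Clean and reshape into analysis-ready form, dropping bad records."""
    return [int(r) for r in landed if r.isdigit()]

def analyse(clean):
    """The part most data scientists would rather focus on."""
    return sum(clean) / len(clean)

def showcase(result):
    """Hand the finding over to a BI tool or report."""
    return f"average: {result:.1f}"

# Messy input in, presentable finding out.
print(showcase(analyse(prepare(etl([" 10 ", "20", "bad", "30"])))))  # average: 20.0
```

The point of the sketch is only that each arrow in the chain is a boundary where responsibility can sit with a different role, or all with one person.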
Most would think data scientists only need to focus on the last two links in that chain, but there are others like myself who have to do it all, and I’m sure there are other combinations too. Doing less in this chain doesn’t make them any less of a data scientist, or me any more of one, but ultimately mathematicians and data enthusiasts prefer to focus on algorithms and the discoveries that bring value to the business they work for. Personally I know more about infrastructure and less about algorithms, but I’m still called a data scientist, and so is someone with the reverse.
But there’s no tool that does everything best in that workflow, and restricting data scientists to a specific one doesn’t make their lives any easier. You’d certainly not use a hammer to drill or measure, at least I hope not! Similarly, you wouldn’t use SQL for complex analytics, nor would you use R for heavy data wrangling. Picking the right tools is important for efficiency (and sanity!), but whether you get that freedom depends on the company and whether restrictions are in place for security, implementation and so on.
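To make the “right tool per stage” idea concrete, here is a toy sketch using Python’s built-in sqlite3 module: SQL does the set-based wrangling (filtering rows out of a table), then plain Python does the analytics step. The table and journey-time figures are invented for illustration, not real data:

```python
import sqlite3
import statistics

# Hypothetical journey records -- illustrative only, not real figures.
rows = [
    ("route_25", 48.0), ("route_25", 52.5), ("route_25", 50.1),
    ("route_73", 35.2), ("route_73", 33.9),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE journeys (route TEXT, minutes REAL)")
conn.executemany("INSERT INTO journeys VALUES (?, ?)", rows)

# SQL is the natural tool for the wrangling step: filter to one route...
wrangled = conn.execute(
    "SELECT minutes FROM journeys WHERE route = 'route_25'"
).fetchall()

# ...then hand off to Python for the analytics step.
durations = [m for (m,) in wrangled]
print(round(statistics.mean(durations), 2))  # 50.2
```

Neither tool is made to do the other’s job well, which is exactly the hammer-versus-drill point: each stage of the chain has tools that suit it.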
So what type are you, fellow data scientist? Do you only care about the data and models or do you like to delve into Hadoop and why you’re using it? Something in between? How much support in the ETL and data preparation stages do you have?
And do your tools make it easier for you? Simpler access to data in Hadoop? Less data wrangling? Chances are, no tool does it all (not even Kognitio!), but which do you prioritise, if you get a choice at all?