The journey to becoming a Data Scientist.
- 31 January, 2017
- Article - Teams and Culture
The role of the data scientist is one of the broadest in the software engineering field at the moment, and the expectations placed on these individuals are extremely high. However, there is growing controversy around the roles and responsibilities of a Data Scientist, and around the point at which you can fairly call yourself one.
To begin, we need to be cognisant that the industry itself, especially in South Africa, is in its infancy, so industry standards outlining the breadth of knowledge required on Big Data and Analytics technologies have yet to be defined. Fundamentally, though, the role of the Data Scientist is closely linked to that of Big Data developers and implementers of solutions that encompass large amounts of structured and unstructured data.
However, when hiring Data Scientists, we need to understand that there are only a handful of people with experience across the board. Organisations are mostly hiring people with pockets of knowledge, or graduates and juniors with the right passion. In reality, Data Scientists have many years of experience and a breadth and depth of knowledge that is rare.
At Entelect, we believe the journey to Data Scientist begins as either a Big Data Engineer or a Big Data Analyst, and there is an important distinction between the two roles.
The Engineers are traditionally from a Software Engineering background. They have spent some time in the trenches learning Linux, networking and the Hadoop ecosystem. I’m not limiting the Engineer role to the Apache technologies. Most vendors are releasing amazing solutions that fall within the Big Data field; however, these still leverage the initial architecture and concepts of Hadoop.
The Analysts typically come from a Maths and Statistics background. They have been tinkering with databases, ETL and ELT processes, and basic data mining models. These guys understand the difference between supervised and unsupervised learning, what a random forest is, and a vast array of other analytics techniques.
Both roles lay extremely strong foundations on the path to becoming a Data Scientist thanks to the problem-solving skills, both technical and logical, that they develop. At this point, these individuals are set to start moving up the ranks in Data Science. However, several learning imperatives remain before they become Data Scientists.
At Entelect, we have established a training path to take these individuals towards the end goal of being a Data Scientist.
Moving to junior or intermediate level
There are several decisions the engineers and analysts need to make. Initially, a lot of them are related to breadth versus depth of knowledge. I have always believed you should develop a vast breadth of knowledge across all relevant technology stacks and then focus on achieving depth of knowledge in relevant areas over time. If you are really committed, this means you end up with both breadth and depth of knowledge.
To tackle this, the data solutions side of Entelect created a roadmap that takes a Big Data Engineer or Analyst through a structured programme, providing breadth across both the engineer and analyst roles.
This is similar to the Entelect Graduate Bootcamp (https://www.youtube.com/watch?v=sFNj6fzOIAc), where graduates or juniors are given an overview of the mainstream technologies. This helps them develop a breadth of skill, and from there the graduates can move into specific areas of software.
For the data solutions training programme, we leverage a few MOOCs and a host of our internal training material, but most importantly we use an approach known as Structured Incubators.
The Structured Incubators provide real-world scenarios filled with mini challenges designed to expose everyone to all aspects of the Big Data environment. They cover everything from installing and configuring Linux clusters all the way through to running a first data mining model. Several technologies are used; however, the concepts learnt during these incubators are the key to success.
The most important key to a successful incubator is constant mentoring. At Entelect we have attracted several Big Data Engineers and Analysts, and we provide constant support to aspiring Data Scientists. Participants receive constant feedback through various mediums, and frequent show and tell sessions are conducted. Several ad hoc challenges are also given to the participants, ranging from taking 2 days out to enter a Kaggle competition all the way through to producing some real-world analytics on relevant subjects (http://source.technology/political-party-analysis).
Regardless of experience in the field, the initial training programme must be followed by all our engineers. Even if a person has experience in setting up clusters, they are still required to complete the entire incubation period to ensure that everyone is on the same minimum level.
We have seen great success with this approach: constantly pushing the guys out of their comfort zone, throwing them in the deep end and making sure we provide the support for them to float.
This initial period takes roughly 10 weeks to achieve a broad overview. At this stage the individual is still operating at a Big Data engineer or analyst level.
Moving to intermediate level
This field changes far too quickly for anyone to stand still and rest on their laurels for 6 months, so growth isn’t optional. We strive to help our engineers and analysts experience a variety of industry verticals and technologies to support their growth.
At this point, data engineers and analysts start finding their passion and gaining a depth of knowledge in a certain area. This is not a career direction, but more a career enhancement. The learning paths change slightly and more focus is given to the relevant area of interest. However, growth in the other areas never stops. If someone enjoys doing data analysis, we would focus the training roadmap more on the R, Machine Learning and Python aspects. However, there will still be investment in furthering their knowledge on the engineering side, and vice versa.
The roadmap for career growth is a long one. We estimate roughly two years of part-time focus to achieve solid, well-rounded knowledge to become a Junior Data Scientist. This would encompass understanding SQL, NoSQL, HQL, Linux, Windows, Spark and another 90 technologies relevant to data science. All of this needs to be coupled with real-world experience across various industry domains to develop the entire picture of implementations, use cases and critical success factors of big data projects.
So now you are a junior data scientist, what’s next?
At this point you should be able to take a blank slate, set up a big data environment, ingest data, execute useful analytics and provide a presentation layer for any organisation you are at.
The next step is to continue deepening your knowledge. The focus is then put on advanced statistics, graph theory, advanced data structures and streaming analytics, and finally we move into probabilistic modelling and deep learning.
Once you have proven competency, a solid track record and a vast breadth and depth of knowledge, you are considered a Data Scientist.
As the industry becomes more established, we expect to see this path to Data Scientist becoming more commonplace. At Entelect, we use the roadmap to ensure our Data Scientists are true craftsmen in their field and can consistently provide the highest quality service and solutions to their clients at all times.