How to set up a data team at your startup or any organisation

  1. Data Engineer (DE)
  2. Data Analyst (DA)
  3. Data Scientist (DS)

Subscribe to my newsletter here: https://distributeddataengines.substack.com/

First, understand why you are hiring. Ideally, a specialist data scientist should not be the first hire in a data team! The first data scientist should be that enthu-hacker who keeps talking about AI but works as a backend engineer.

Now, why a backend engineer? Because they should already have the DDL (the schema) of your data in their head. They know where the data lives, so they can quickly pull it in and set up a baseline for you.

A Data Engineer should be your first hire in the data team, quickly followed by a Data Analyst. The Data Engineer will set up the necessary pipelines and make the calls on table normalisation/denormalisation. They will also help you set up the database correctly.

The Data Engineer will help you choose between, let’s say, AWS Aurora and AWS RDS. They should tune things like autovacuum and optimise the queries behind your APIs, for example by thinking about indexes, CTEs, etc.
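To make the point about indexes concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Postgres (an assumption made only so the example runs anywhere), and the `orders` table, its columns and the index name are hypothetical. It compares the query plan before and after adding an index on the filtered column.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)

# Hypothetical table: SQLite in memory stands in for your real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (user_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

sql = "SELECT SUM(amount) FROM orders WHERE user_id = ?"

# Without an index the engine has to scan the whole table.
plan_before = query_plan(conn, sql, (42,))

# With an index on the filtered column it can jump to the matching rows.
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

On Postgres the equivalent check would be `EXPLAIN` on the real query. The exact plan text differs between engines and versions, so treat the output here as illustrative.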

Meanwhile, the Data Analyst can be hired and given the task of creating actionable reports/insights for your startup from the data you already have. The choice of reports is a combined effort of business and engineering. Don’t expect the DA to do this alone.

Priorities depend on your business requirements. Let’s say you want to analyse the marketing funnel: then a DA adds more value by creating reports than a DE. However, if you are receiving TBs of data/logs that you want to analyse, the DE is the priority.
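As a rough illustration of the kind of funnel report a DA might start with, here is a small sketch in plain Python. The stage names and the event log are made up for the example. In practice the events would come from your database or a BI tool.

```python
from collections import defaultdict

# Hypothetical funnel stages and event log (user, stage).
FUNNEL = ["visit", "signup", "purchase"]
events = [
    ("u1", "visit"), ("u2", "visit"), ("u3", "visit"), ("u4", "visit"),
    ("u1", "signup"), ("u2", "signup"), ("u3", "signup"),
    ("u1", "purchase"),
]

def funnel_report(events, stages):
    """Count users who reached each stage after passing all earlier ones."""
    users_by_stage = defaultdict(set)
    for user, stage in events:
        users_by_stage[stage].add(user)

    report = []
    reached = None
    for stage in stages:
        users = users_by_stage[stage]
        # Keep only users who also completed every previous stage.
        reached = users if reached is None else reached & users
        report.append((stage, len(reached)))
    return report

report = funnel_report(events, FUNNEL)

# Print each stage with its conversion rate relative to the previous stage.
previous_counts = [report[0][1]] + [count for _, count in report[:-1]]
for (stage, count), prev in zip(report, previous_counts):
    rate = count / prev if prev else 0.0
    print(f"{stage:<10} {count:>3} users  {rate:.0%} of previous stage")
```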

The skills to look for in a DE are usually the following:

  1. Spark/Flink (Python/Scala)
  2. Airflow
  3. Knowledge of AWS EMR etc. (depending on your cloud provider)
  4. OLAP: Redshift etc.
  5. Columnar file formats: Parquet, ORC
  6. Row-based file format: Avro
  7. OLTP: Postgres etc.
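The difference between row-based and columnar file formats in the list above can be sketched in a few lines. Row formats such as Avro keep each record together, which suits writing and reading whole records. Columnar formats such as Parquet and ORC keep each column's values together, so an analytical query that aggregates one column only needs to read that column. The snippet below is a toy in-memory illustration of the two layouts with made-up records. It does not read or write real Parquet or Avro files.

```python
# Row-oriented layout: one complete record after another (hypothetical data).
records = [
    {"user_id": 1, "country": "IN", "amount": 10.0},
    {"user_id": 2, "country": "US", "amount": 25.5},
    {"user_id": 3, "country": "IN", "amount": 4.5},
]

def to_columnar(rows):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

columns = to_columnar(records)

# Row layout: every record is touched even though only one field is needed.
total_from_rows = sum(record["amount"] for record in records)

# Columnar layout: only the "amount" column is read.
total_from_columns = sum(columns["amount"])

print(columns)
print(total_from_rows, total_from_columns)
```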

A DA should have:

  1. Good storytelling ability
  2. EDA skills
  3. Skill with a BI tool: Superset, Tableau, etc.

In case you are wondering how a DA will work without a DE: there are open-source tools that can help you set up a basic data pipeline, Airbyte for example.

By now your backend enthu-hacker will have been working super hard. Bring in a DS only when you have to do predictions! This is important because in today’s market, data scientists are costly. Don’t hire a DS to build generic reports that a DA can do.

A DS should have:

  1. Good programming understanding (it is a myth that a DS need not be a good programmer; a DS who codes badly creates a lot of technical debt).
  2. Classical ML knowledge: Decision Trees, Linear Regression, etc.
  3. Exposure to sklearn, pandas, numpy, etc.
  4. Statistics and Mathematics knowledge.

Usually, a DS with these skills will be sufficient for you. Most business problems can be tackled with classical ML and statistics.
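To give a feel for how far a classical baseline can go, here is a minimal sketch of simple linear regression fitted with ordinary least squares in plain Python. The ad spend and signup numbers are invented for the example. In practice a DS would more likely reach for sklearn, but the arithmetic is the same.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: weekly ad spend (in thousands) vs. new signups.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
signups = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(ad_spend, signups)

# Predict signups for a spend level that is not in the data.
predicted = slope * 6.0 + intercept
print(f"slope={slope:.2f} intercept={intercept:.2f} predicted={predicted:.2f}")
```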

If you are working with images or audio data, or language processing is needed, you will need to hire a niche DS.

A computer vision DS needs to have the following:

  1. opencv
  2. Deep Learning knowledge
  3. numpy, pytorch/tensorflow, etc.

Similarly for NLP (rasa, nltk, etc.).

Your team is ready! A good DS should be able to deploy their own models to production initially. They should know about quantization and pruning. Along with a DevOps engineer, they can set up an MLOps pipeline.
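For readers who have not met these two terms: pruning removes the weights that matter least, and quantization stores weights at lower precision, both to make a model smaller and cheaper to serve. Below is a toy sketch in plain Python, assuming magnitude-based pruning and symmetric int8 quantization on a made-up list of weights. Real frameworks such as pytorch and tensorflow ship their own tooling for this, so treat it only as an illustration of the idea.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

def quantize_int8(weights):
    """Symmetric linear quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the integers back to approximate float weights."""
    return [q * scale for q in quantized]

# Hypothetical weights of a tiny layer.
weights = [0.02, -1.27, 0.5, -0.01, 0.9, 0.03]

pruned = prune_by_magnitude(weights, sparsity=0.5)
quantized, scale = quantize_int8(pruned)
restored = dequantize(quantized, scale)

print("pruned:   ", pruned)
print("quantized:", quantized)
print("restored: ", restored)
```

The restored weights differ from the pruned ones by at most half a quantization step, which is the precision traded away for the smaller representation.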

Cool, you have all the key players in place. Now, consult them in order to expand the team for your startup/business! I hope this thread helped you.
Feel free to DM if you need any help in building data teams.

I love stuff about data. Data Engineering and Data Science | Computer Vision | Mathematics | Contact me — vaibhaw[dot]vipul[at]gmail[dot]com