This has got me by for the past 20 years when asked by various relatives and friends exactly what it is that I do. It does mean I have to “fix” working computers, install virus scanners, get printers working (throw it away) and fix iTunes for my mum on a regular basis, and generally I am considered an authority on anything that is slightly more technical than average.
However, in the last decade (and especially the last 3 years) the technical landscape has shifted dramatically (you can read my personal thoughts here: “Loyalty should not be career suicide”), with machine learning now accessible to anyone with a browser, so this answer no longer suffices at dinner parties. Out of interest, the drivers for this are things like access to more data (IoT, faster networks) and the abstraction of the AI tools used by Google et al into elastic cloud services and various others – that is a whole post in itself.
Everyone seems to have heard of Data Scientists. The role was even labelled “The Sexiest Job of the 21st Century” (google it – I don’t know who said it first). But when I say I am a Machine Learning Engineer I often draw blank looks. (I say “AI” Engineer to the non-technical to get a nod of recognition.)
So what exactly is the difference between a Data Scientist and a Data Engineer, and what is a Machine Learning Engineer? This too has been discussed to death; however, I read an article that summed it up perfectly. I am also currently working on a project (that shall remain nameless) that illustrates the points made in that article. Do yourself a favour and read this first, then come back here for a real example:
Some background: There’s a limit in statistics and maths that I hit fairly soon, and past that point I am happy to hand over to someone who specializes in it. I wish I had paid more attention at school during maths class and stopped having so much fun… try telling that to a teenager though!
I understand basic stats, I can train a Linear Regression model, I can tell Azure to run AutoML for me and I can tune a model’s hyperparameters using SparkML. I can build a pretty decent app end to end to identify hotdogs or not hotdogs on the edge. But I cannot tell you WHY these params worked better than those ones, WHY the Random Forest resulted in a higher accuracy, or what the best metrics are to evaluate the outcome of 100 training runs for a given model. Fortunately, the Data Scientist can, and he loves the complex maths! Finally… a job that is not boring!
But… not everyone with a PhD knows how to train their models in parallel using distributed code (I will try not to mention Databricks yet again ;). Most Data Scientists use Pandas/numpy and don’t necessarily know (or care, to be fair) about the potential limitations when it comes to training. Nor do they necessarily care about ordering a beefy 128 GB GPU machine to run their experiments overnight because it is taking 8 hours to train a model. Suggesting PySpark or Dask just gets an irritated look, as it detracts from valuable experimenting time. A request to deploy his model as an API driven by a Git commit, with automatic model drift monitoring, is met with a disgruntled snort…
I do, for example, appreciate the beauty of distributed compute and the wonderfully scalable architecture of Spark. I love a good API and data pipeline as much as the next Data Engineer and can spend hours refactoring code until it passes all the definitions of “Clean Code” (Consultant Tip: If you want to meet your budget and project plan then find the healthy balance between technical perfection and the real value that the code will generate. We are doing all of this for a reason, and it’s not to get the code onto 1 line). I love the concept of CI/CD and I adore simplicity, practicality and optimizing things like cloud services, code, processes and everyday life. Needless to say this does not always go down well with other humans; however, it’s a common trait in Data Engineers and Programmers.
So… now that we understand the personalities of the Data Scientist and Data Engineer let’s put them together, focus on their strengths and make an amazing team that can meet the business requirement as quickly as possible whilst consuming as little time and $$$ as possible.
Before I go further, obviously there are exceptions to the rule and lucky people (usually without kids) who are in fact able to bridge both roles… we’ll focus on the average here. Even if you can bridge both roles… should you? A healthy team is a diverse team.
I’ve seen projects where a Data Engineer is given a complex Machine Learning project and a couple of days to figure it out. Whilst it is possible, I believe this is not a good idea. Data Science and Machine Learning engineering ARE NOT the same thing. I have also seen projects where a Data Scientist is put on a project which involves Big Data (whatever that is) with no data engineering support. In both cases everyone wonders why they are taking so long to get any results.
The project I am on right now is a fantastic example of the article above. We have a Data Scientist (insert any number of PhDs here) and myself as the Machine Learning Engineer/Data Engineer (insert any number of Azure cloud certifications here). As a team we are approaching the problem according to our strengths and, of course, based on what we prefer to do, which is important if you want to retain your staff (did I mention that this “AI” stuff is in hot demand and everyone wants to do it but doesn’t know how?).
For example, early on in the project, training a single model (we have over 40) was taking over 10 hours.
One option would be to scale up and get a bigger VM, which is the hammer and nail approach. These beasts are not cheap, and a 10 hour training session can fail halfway through, which means the whole process has to be repeated. This was the selected approach to get us past that blocker and it is working. However, in parallel I am looking at the Data Scientist’s code, rewriting it from Pandas into PySpark (Note: there are 101 other ways to do this – I am just a Databricks fanboy) and building the system to log experiment results and deploy the models as containerized API microservices, with an Azure function to orchestrate the results asynchronously. Add a near real-time Power BI report and alerting to watch for model drift, plus an Azure function to trigger model retraining, and it’s a work of art! Damn I love my job.
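To make the “watch for model drift and trigger retraining” idea a bit more concrete, here is a minimal sketch in plain Python. It is not the code from the project: the Population Stability Index (PSI) is just one common way to measure drift and is my choice for illustration, the function names `population_stability_index` and `needs_retraining` are made up, and the 0.2 threshold is a rule of thumb that you should treat as an assumption and tune for your own data.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a baseline sample and a recent sample of one numeric value.

    Illustrative only: equal-width buckets are derived from the baseline range.
    """
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("baseline sample has no spread")
    width = (hi - lo) / bins

    def bucket_shares(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the end buckets.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bucket is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e = bucket_shares(expected)
    a = bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def needs_retraining(
    expected: Sequence[float], actual: Sequence[float], threshold: float = 0.2
) -> bool:
    """Hypothetical trigger: flag retraining when PSI exceeds the assumed threshold."""
    return population_stability_index(expected, actual) > threshold


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]       # scores seen at training time
    recent = [0.5 + i / 200 for i in range(100)]   # scores seen in production
    print(needs_retraining(baseline, recent))
```

In a setup like the one described above, something along these lines would run on a schedule against recent predictions, and a positive result would be what raises the alert or kicks off the retraining job.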
Together we make an awesome team as the whole is greater than the sum of the parts. Our roles overlap a lot and I am improving my understanding of stats and ranking better in Kaggle contests. He is learning new ways to improve his workflow and understanding more about data engineering.
To summarize: The best result comes from understanding the roles and challenges unique to a Machine Learning project and planning appropriately from a time and effort POV – anything is possible with enough time. Machine Learning projects share many aspects with standard application development projects and the approach is not too dissimilar. You wouldn’t ask an API engineer to do UX, would you? (Don’t get me started – this happens a lot!)
Just put the Data Scientists and the Data Engineers together in a room and let the magic happen….