Data Science for Dummies – Data Engineering with Titanic dataset + Databricks + Python (Tech Talk 3 of 9)

I put together a tech talk on Machine Learning and Databricks, which is the 3rd part of a 9-part Data Science for Dummies series: Data Engineering with Titanic dataset + Databricks + Python.

Data preparation and feature engineering highlighted the importance of domain knowledge, even with something as simple as a 10-column dataset! It also aptly demonstrated how much more time is spent on ingesting and prepping data for machine learning than on the actual modelling. I also get asked how important the maths and statistics are for getting started. There’s no doubt they are essential for this field; however, I personally enjoy the data engineering/DataOps role and am happy to hand over to a dedicated data scientist when it gets too hairy. It’s important for everyone involved to have an idea of the end-to-end workflow. With tools like AutoML I can focus on data engineering & architecture.
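To give a flavour of what domain knowledge looks like in feature engineering, here is a minimal sketch in plain Python. It is not the notebook from the talk: the three rows are just a small sample in the shape of the Kaggle Titanic training data, and the helper names (`extract_title`, `add_features`) and the derived columns (`Title`, `FamilySize`, `IsAlone`) are my own assumptions about the kind of features one might build.

```python
import re

# Illustrative rows in the shape of the Kaggle Titanic training data.
# Only three columns are kept; this is an assumption made for the sketch.
passengers = [
    {"Name": "Braund, Mr. Owen Harris", "SibSp": 1, "Parch": 0},
    {"Name": "Cumings, Mrs. John Bradley (Florence Briggs Thayer)", "SibSp": 1, "Parch": 0},
    {"Name": "Heikkinen, Miss. Laina", "SibSp": 0, "Parch": 0},
]

# Names follow "Surname, Title. Given names", so the title sits
# between the first comma and the following full stop.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")


def extract_title(name):
    """Return the honorific from a passenger name, or "Unknown" if none is found."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else "Unknown"


def add_features(row):
    """Return a copy of the row with hypothetical engineered features added."""
    enriched = dict(row)
    enriched["Title"] = extract_title(row["Name"])
    # SibSp = siblings/spouses aboard, Parch = parents/children aboard;
    # add 1 to count the passenger themselves.
    enriched["FamilySize"] = row["SibSp"] + row["Parch"] + 1
    enriched["IsAlone"] = enriched["FamilySize"] == 1
    return enriched


features = [add_features(p) for p in passengers]
for f in features:
    print(f["Title"], f["FamilySize"], f["IsAlone"])
```

Nothing here is specific to Databricks; the same idea would carry over to a Spark or pandas DataFrame. The point is that knowing what `SibSp` and `Parch` mean, or that a title is hiding inside `Name`, is what turns raw columns into useful features.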

I’ll be back for Part 2 where we’ll finish the feature engineering and then run the training data through a series of machine learning classifiers to determine which gives the best accuracy.

Slides can be found here (Note: PowerPoint animation is not working so well 😉)

Here’s the rest of the series:

  1. Data Science overview with Databricks
  2. Titanic survival prediction with Azure Machine Learning Studio + Kaggle
  3. Data Engineering with Titanic dataset + Databricks + Python
  4. Titanic with Databricks + Spark ML
  5. Titanic with Databricks + Azure Machine Learning Service
  6. Titanic with Databricks + MLS + AutoML
  7. Titanic with Databricks + MLflow
  8. Titanic with .NET Core + ML.NET
  9. Deployment, DevOps/MLOps and Productionisation
20190502 224959 • Data and AI Analytics
Rodney Joyce

