Different Types of Machine Learning

With some datasets, simple regression models may not be sufficient for our analysis. For example, a simple linear regression model wouldn’t be very useful for a dataset spread across many classes if we didn’t know which class each observation belonged to. This is where we need more complex machine-learning methods. Still, many machine-learning models use regression principles as a foundational layer when teaching a computer model to make decisions.

Depending on the type of data available and the kind of problem we want to solve, we’ll usually select one of two machine-learning types: supervised or unsupervised ML. There are a couple of other types of machine learning beyond these two as well.

Supervised machine learning

Supervised machine learning uses labeled datasets to train algorithms to classify or predict outcomes. Because supervised learning problems occur more frequently in the workplace, data professionals use this type most often.

Data professionals use supervised machine learning for prediction. Labeled data is data that has been tagged with a label that represents a specific metric, property or class identification.

To summarize, supervised machine-learning algorithms start from data that already contains the answers and use it to produce new answers, either by categorizing future data or by estimating its values.
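
As a rough sketch of that idea, the snippet below fits a scikit-learn classifier on a tiny, made-up labeled dataset and then predicts labels for new observations; the feature values and labels are invented purely for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Supervised learning sketch: the training data already contains the "answers"
# (labels), and the fitted model predicts labels for new observations.
X_train = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]   # two features per observation
y_train = [0, 0, 1, 1]                                       # known class labels

model = LogisticRegression()
model.fit(X_train, y_train)                       # learn the mapping from features to labels
print(model.predict([[1.5, 1.5], [8.5, 8.5]]))    # predict labels for unseen data
```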

Unsupervised machine learning

Unsupervised machine learning uses algorithms to analyze and cluster unlabeled datasets. In this type, data professionals ask the model to give them information without telling the model what the answer should be.
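
A minimal sketch of the same idea with scikit-learn’s KMeans on a few made-up, unlabeled points; the algorithm groups them without being told any answers.

```python
from sklearn.cluster import KMeans

# Unsupervised learning sketch: no labels are provided; the algorithm
# discovers groupings in the data on its own.
X = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 9.0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)     # cluster assignments discovered by the model
print(labels)
```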

Reinforcement learning

Reinforcement learning is often used in robotics and is based on rewarding or punishing a computer’s behaviors. Based on whether it received a reward or a punishment, the computer updates its policy, trying to maximize rewards and minimize penalties. This process repeats until a satisfactory policy is found.
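
The toy sketch below illustrates that reward-and-penalty loop with a simple Q-learning update on a hypothetical five-state “corridor”; the environment, rewards, and hyperparameters are all invented for illustration.

```python
import random

# Q-learning sketch: states 0..4, reaching state 4 gives a reward,
# every other step incurs a small penalty.
n_states, n_actions = 5, 2                       # actions: 0 = left, 1 = right
q_table = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2            # learning rate, discount, exploration rate

for episode in range(500):
    state = 0
    while state != 4:
        # Explore occasionally, otherwise exploit the current policy
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: q_table[state][a])
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else -0.01          # reward at the goal, penalty otherwise
        # Update the policy (Q-values) toward the reward signal
        best_next = max(q_table[next_state])
        q_table[state][action] += alpha * (reward + gamma * best_next - q_table[state][action])
        state = next_state
```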

Deep Learning

Deep learning models are made of layers of interconnected nodes. Each layer of nodes receives signals from the preceding layer. Nodes that are activated by the input they receive pass transformed signals on, either to another layer or to a final output.
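
A minimal sketch of one forward pass through two fully connected layers, using NumPy; the weights here are random placeholders rather than learned values.

```python
import numpy as np

# One forward pass through a tiny two-layer network.
rng = np.random.default_rng(0)
x = rng.random(4)                            # input signal with 4 features

w1, b1 = rng.random((8, 4)), np.zeros(8)     # first layer: 8 nodes, each connected to all 4 inputs
w2, b2 = rng.random((1, 8)), np.zeros(1)     # output layer: 1 node

hidden = np.maximum(0, w1 @ x + b1)          # nodes "activate" (ReLU) and pass transformed signals on
output = w2 @ hidden + b2                    # final output
print(output)
```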

Artificial Intelligence

Another term we often hear in connection with machine learning is artificial intelligence. Artificial intelligence includes all types of machine learning. Without getting lost in terminology, we’ll simply note that, for our purposes, machine learning and artificial intelligence refer to the same principle: training a computer to detect patterns in data without being explicitly programmed to do so.

Finally, there’s one aspect of machine learning and data science that every data professional should know: quality is more important than quantity. A small amount of diverse and representative data is often more valuable than a large amount of biased and unrepresentative data.

Categorical versus Continuous data types and models

As a data professional, knowing whether the features we input into a machine-learning algorithm are continuous or categorical is essential to choosing the correct model and the evaluation metric for that model. Recognizing whether data features are continuous is not the only factor to consider when deciding which machine-learning model to use, but it is a very helpful one.

Determine when features are continuous

Continuous features can take on an infinite and uncountable set of values. Supervised learning models that make predictions that are on a continuum are called regression algorithms. 

For example, weight is a continuous feature because it has an uncountable set of possible values. We might measure to two decimal places, like 15.76, 16.09, and 15.56, but the measurement is still continuous because a weight could take any of the infinitely many values between those measured points, like 15.762950.

Categorical features and classification models

As with continuous features, whether a particular model is appropriate for a problem like sorting observations by a characteristic is largely determined by the type of variable it must predict.

Categorical variables contain a finite number of groups or categories. For example, we might use a categorical variable to classify a vehicle type as car, motorbike, or bus. Discrete features have a countable number of values between any two values. For instance, the height of a tree is a continuous variable, but the number of trees in a park is a discrete variable.

Discrete variables are able to be counted and categorical variables are able to be grouped. For example, the paint color of a house is categorical, while the number of houses in a neighborhood painted lavender is discrete.

Another example: an algorithm for grouping cats and dogs based on images from a camera would use categorical data as part of a supervised machine-learning model.
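
A rough sketch of such a classification model, assuming the images have already been converted into numeric features; the feature values and labels below are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier

# Classification sketch: the target variable is categorical ("cat" or "dog").
X_train = [[0.2, 0.8], [0.3, 0.7], [0.9, 0.1], [0.8, 0.2]]   # e.g., features extracted from images
y_train = ["cat", "cat", "dog", "dog"]                       # categorical labels

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.predict([[0.25, 0.75]]))    # -> predicted category, e.g. "cat"
```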

Guide user interest with recommendation systems

A real-life example of machine learning is the recommendation system. Its main goal is to quantify how similar one thing is to another and use this information to suggest a closely related option.

Content-based filtering

Comparisons are made based on attributes of content itself. Let’s consider a music app: To make this comparison, there must be data about each song that’s a deconstruction of its attributes. In other words, everything that makes the song unique is identified and labeled, like the artist’s voice type, the rhythm or beat, or whether a certain instrument is featured.
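
A minimal sketch of that comparison, assuming each song has already been deconstructed into numeric attribute scores (the songs and scores below are made up); cosine similarity ranks the catalog against a liked song.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Content-based filtering sketch: each song is described by attribute scores
# (e.g., tempo, vocal type, instrumentation); values are invented for illustration.
songs = {
    "song_a": [0.9, 0.2, 0.7],
    "song_b": [0.8, 0.3, 0.6],
    "song_c": [0.1, 0.9, 0.2],
}
liked = "song_a"

features = np.array(list(songs.values()))
similarity = cosine_similarity([songs[liked]], features)[0]   # compare the liked song to every song

# Recommend the most similar song that isn't the one the user already liked
ranked = sorted(zip(songs, similarity), key=lambda pair: pair[1], reverse=True)
recommendation = next(name for name, _ in ranked if name != liked)
print(recommendation)    # -> "song_b"
```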

Content-based filtering is ineffective at making recommendations across content types, because different content types don’t share the same features. For instance, a book doesn’t have beats per minute, so the same streaming service won’t be able to use our song preferences to recommend a new novel. This limits the use cases.

Another drawback of this method of recommendation is that we probably like, say, videos about various topics, but the system will use our feedback to suggest only videos similar to the ones we liked.

Collaborative filtering

Comparisons are made based on who else liked the content. Collaborative filtering works regardless of what the items are.

There are drawbacks too. Let’s use movies as an example: There are hundreds of thousands of movies in existence, but most people have only viewed a small fraction of them. Each person’s movie data would have missing values for all the movies they haven’t experienced. A recommendation system would need to use advanced filtering techniques to manage all that empty space.
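
A small sketch of what that sparsity looks like in a hypothetical user-movie rating matrix; most cells are missing because each user has rated only a few movies.

```python
import pandas as pd

# Collaborative filtering sketch: a user-item matrix with made-up ratings.
ratings = pd.DataFrame(
    {
        "movie_a": [5, 4, None],
        "movie_b": [None, 5, 1],
        "movie_c": [1, None, 5],
    },
    index=["user_1", "user_2", "user_3"],
)
print(ratings)
print(ratings.isna().mean().mean())   # share of empty cells the system must handle
```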

This brings us to the issues with biased approaches.

Ethics in Machine Learning

Popularity bias is the phenomenon of more popular items being recommended too frequently.

Bias in machine learning is particularly deceptive because it stems from human bias. But because a computer makes the prediction, it’s easy for the result to seem objective. Often, the bias is unintentional.

An example: if all the people we used to generate the templates were, say, older than 30, perhaps the service didn’t work well on young adults; or maybe we used far more people from one end of the gender spectrum.

Build ethical models: Explainable predictions

After we’ve planned our process and analyzed the data, it’s time to construct the model and ask additional questions. For instance, is it important that the model’s predictions be explainable?

With some modeling methodologies, it may be difficult to know how their predictions were produced. This is sometimes known as a black-box model. Neural networks are widely known for being difficult to explain, and therefore they’re not appropriate for many applications where transparency is important.

Algorithms like random forest, AdaBoost, and XGBoost aren’t completely black-box, but they may require additional effort to explain and justify their predictions. At the other end of the spectrum, linear and logistic regression methods are highly explainable, as are decision trees.
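
As a rough illustration of that spectrum, the sketch below fits a logistic regression and a random forest on a built-in scikit-learn dataset and prints coefficients versus feature importances; the dataset choice is just for convenience, and in practice the features would usually be scaled first.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

logreg = LogisticRegression(max_iter=10_000).fit(X, y)   # high max_iter since data is unscaled
forest = RandomForestClassifier(random_state=0).fit(X, y)

# Coefficients map one-to-one onto features and directly describe each feature's effect
print(pd.Series(logreg.coef_[0], index=X.columns).sort_values().tail())

# Importances rank features but don't explain individual predictions on their own
print(pd.Series(forest.feature_importances_, index=X.columns).sort_values().tail())
```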


Preparing for ML models

PLAN for a machine learning project

First, we need to ensure that the machine-learning model we plan to construct meets the actual business needs. This may seem obvious, but given the potential complexity of the problem, multiple departments will likely be involved with the model’s output.

Example: for a dataset about houses in a certain area that includes house features and sale prices, we will need to create a supervised model that predicts a continuous outcome, in other words a regression model, to get the desired numerical result. For this example, only a continuous (regression) model can help us predict housing prices.
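
A minimal sketch of that kind of model, using a tiny, made-up housing DataFrame; the column names and prices are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hypothetical housing data: features plus a continuous sale price.
houses = pd.DataFrame(
    {
        "square_feet": [1200, 1500, 1800, 2200, 2600, 3000],
        "bedrooms": [2, 3, 3, 4, 4, 5],
        "sale_price": [200_000, 240_000, 275_000, 330_000, 380_000, 430_000],
    }
)

X = houses[["square_feet", "bedrooms"]]
y = houses["sale_price"]                 # continuous response variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.predict(X_test))             # predicted sale prices, in the same units as y
```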

The plan we create during this stage will be carried through the whole process, so it is important to make sure we’ve considered all the aspects and constraints of the project. That isn’t to say the plan must stay fixed; we can absolutely reassess as we progress. It is there to get us started heading in the right direction.

The first thing to do when forming our plan is to consider the end goal. What exactly are we trying to model, and what types of results from the model are needed? Something that can be determined immediately is what type of machine learning model we’ll need: 

  • Supervised vs unsupervised 
  • Regression or classification

ANALYZE data for a machine learning model

The main focus of the analyze phase is to develop a deeper understanding of the data, while keeping in mind what the model needs to eventually predict. For example, if we’re creating a supervised learning model, the first thing we’ll need to know is what our model is trying to predict. In other words, we’ll need to understand our response variable. The question here is: Regression or classification?
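
One quick, informal way to answer that question is to inspect the response variable itself; the DataFrame and column names below are hypothetical.

```python
import pandas as pd

# Check the response variable to help decide between regression and classification.
df = pd.DataFrame({"target": [199_000, 240_000, 275_000], "sqft": [1200, 1500, 1800]})

print(df["target"].dtype)       # numeric dtype with many distinct values -> regression
print(df["target"].nunique())   # a small set of repeated categories would suggest classification
```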

Often, as data professionals, we find our data isn’t structured exactly the way we need it to be. Data points with inconsistent units, or data that isn’t labeled correctly, might need to be changed.

After getting a solid understanding of what our response variables are and how they’re structured, the next step is exploring our predictor variables. Understanding the relationships that exist between variables in our dataset is essential to building a model that will produce valuable results.

The predictor variables might also not be in the format or style that we want. We’ll need to figure out how we want our data structured before building our model.
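
A small sketch of that kind of restructuring on a made-up DataFrame; the column names, unit conversion, and encoding are assumptions for illustration.

```python
import pandas as pd

# Restructure predictor variables before modeling.
df = pd.DataFrame({"lot size (sq ft)": [4300, 6100], "has_garage": ["yes", "no"]})

df = df.rename(columns={"lot size (sq ft)": "lot_size_sqft"})     # consistent column names
df["lot_size_sqm"] = df["lot_size_sqft"] * 0.092903               # convert units
df["has_garage"] = df["has_garage"].map({"yes": 1, "no": 0})      # encode a categorical predictor
print(df)
```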


Python for machine learning

When creating any Python script or program, development is almost always done inside an Integrated Development Environment, or IDE. An IDE is a piece of software that provides an interface to write, run, and test code.

In situations where it’s not necessary for a human to check the code while it’s running, data professionals generally prefer to use a Python script. Scripts are especially useful when the program incorporates several files. Scripts are also helpful when there are many errors in the program that require debugging, since scripts can take advantage of additional functionality that notebooks cannot.

However, Python scripts typically aren’t ideal for data science. Data analytics professionals, especially during EDA, need to use Python to interactively explore data sets and view the outputs of their code in near real-time. Often, these results are shared with colleagues and must be in a human-readable format, hence a Python notebook.

Different types of Python IDEs

Knowing whether we want to use a Python notebook or a Python script can help us visualize the overall workflow of our project, but starting a project with one type doesn’t mean we need to use it throughout the whole project. Data professionals often change their development environment in the middle of their workflow.

We can always switch if we realize that we need some functionality offered by a different IDE or file type. Jupyter Notebook is one of the most commonly used IDEs that support Python notebooks. However, it only offers support for Python notebooks, not Python scripts. Other IDEs, such as Spyder, will only support Python scripts. Some IDEs, such as Visual Studio Code, can support both Python notebooks and scripts.

Something else to consider when we’re selecting our IDE is the tools that are built into the software. Many of them are relatively simple and make development more efficient. Code completion is a very common feature.

Python packages

Generally, there are three types of Python packages we’ll use as data professionals. The first category is operational packages. They’re also the first packages we’ll normally use in the analytical process. Operational packages load, structure, and prepare a dataset for further analysis.

When creating a Python file for analysis, the first thing we have to do is read in our data. The pandas package is often the most useful for doing this, but the pandas read_csv function is only a tiny fraction of what’s included in the package. For efficient analysis and modeling we can use functions that are built into pandas, which makes it easier to complete tasks such as preliminary data inspection, cleaning data, and merging and joining DataFrames. Other operational packages, such as NumPy and SciPy, provide functions for advanced mathematical operations.
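
A minimal sketch of those operational steps; in practice the data would be read from a file with pd.read_csv, but here two tiny made-up DataFrames stand in for the loaded data.

```python
import pandas as pd

# In practice: sales = pd.read_csv("sales.csv"); the DataFrames below are placeholders.
sales = pd.DataFrame({"order_id": [1, 2, 3], "region_id": [10, 20, 10], "price": [9.5, None, 12.0]})
regions = pd.DataFrame({"region_id": [10, 20], "region": ["North", "South"]})

print(sales.head())                     # preliminary data inspection
sales.info()

sales = sales.dropna(subset=["price"])  # basic cleaning: drop rows with a missing price
merged = sales.merge(regions, on="region_id", how="left")   # join the two DataFrames
print(merged)
```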

The second category is data visualization packages. Matplotlib is usually the go-to library for basic visualizations in Python. Seaborn is another visualization package, focused on statistical visualization. Plotly is often used for presentations or publications, such as creating a data visualization for an interactive dashboard.
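
A short sketch using these packages on a tiny made-up dataset, with a quick matplotlib histogram and a seaborn box plot.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data for illustration.
df = pd.DataFrame({
    "total_bill": [16.5, 10.3, 21.0, 23.7, 24.6, 25.3, 8.8, 26.9],
    "day": ["Thu", "Thu", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
})

plt.hist(df["total_bill"], bins=5)               # quick exploratory plot with matplotlib
plt.xlabel("Total bill")
plt.show()

sns.boxplot(data=df, x="day", y="total_bill")    # statistical visualization with seaborn
plt.show()
```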

The final category of packages is for machine learning. scikit-learn is a machine learning library built on top of many of the packages we’ve already discussed.

Python Packages and Libraries

Operational Packages

  • NumPy
    • Allows for more mathematical operations in Python, provides functions for array-like objects, etc.
  • Pandas
    • Creation of data frames, analyzing data, cleaning data, manipulating data, performing efficient operations on large data sets.

Visualization Packages

  • Matplotlib
    • Easy-to-learn, difficult-to-master graphing library for Python. Great for quick, exploratory graphs.
  • Seaborn
    • Built on top of matplotlib, allows for easier customization of plots compared to matplotlib.
  • Plotly
    • Easy to create beautiful, presentation-quality plots and graphs. Lots of built-in functionality, and plots can have interactive elements.

Machine Learning Packages

  • scikit-learn
    • Provides functionality for a host of machine learning models and analytical tools.

Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources. 

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was the Google Advanced Data Analytics Professional Certificate program.