When I was starting out in this field, I used to think it would take me a very long time to build my first ML model, but it turns out that is not actually true.
In this guide, I will show you how to build a simple ML model from scratch without any advanced knowledge.
WHAT IS AN ML MODEL?
You can think of a machine learning model as a simple input and output function. You put some input in and it gives you an output, but before that, as a machine learning engineer (or whatever you consider yourself to be), you need to train the model so that it can predict results.
It’s just like teaching a child: in the beginning it makes mistakes, but gradually it improves and eventually comes out as a champ of the “ENGLISH ALPHABET”. Building a machine learning model goes through the same process. You feed data into the model, it learns from that data and makes predictions, and when the predictions are wrong it corrects itself. This increases the overall accuracy of the model, and in the end it becomes very good at predicting the output for a particular input.
Below is a simple workflow of a machine learning algorithm (Thanks to Jovian)
STEP 1
GET A DATASET
As described above, a machine learning model needs data to learn from, so the first thing you need is a dataset. There are plenty of datasets on the internet; if you are a beginner, I would suggest making an account on Kaggle and exploring some datasets there. For the sake of this tutorial, we are going to use a simple dataset. Don’t worry about what it contains for now, just try to get a basic idea.
STEP 2
IMPORTING SOME LIBRARIES
To work with the dataset you need some libraries, and Python has many popular ones for this. I would suggest you get familiar with these libraries first.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Here we have just imported some of the common libraries in Python.
PANDAS → To work with the dataset
NUMPY → To work with some numerical stuff
MATPLOTLIB → To plot the graphs of the data if required
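If you want a quick feel for what these three libraries do before touching the real dataset, here is a tiny warm-up sketch; the numbers in it are made up purely for illustration:
# A tiny made-up table, just to try the three libraries out
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.1, 3.9, 6.2, 8.1]})
print(df.describe())         # pandas: quick summary statistics of the table
arr = np.array(df["y"])      # numpy: turn a column into a numerical array
print(arr.mean())            # numerical stuff, e.g. the mean
plt.plot(df["x"], df["y"])   # matplotlib: a simple line plot of the data
plt.show()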
STORE THE LINK
This is the time to load the dataset I mentioned above and store it in a variable so that we can easily use it as many times as we like. Just copy the dataset’s URL and pass it to pd.read_csv as I did below.
data = pd.read_csv("https://raw.githubusercontent.com/dataprofessor/data/refs/heads/master/delaney_solubility_with_descriptors.csv")
Now it’s time to look at what our data looks like. You can do that with this command →
data.head()
This will show you what our dataset looks like; you can glance at the columns and the type of data they contain. If you pass a number as an argument to head, you will see that many rows from the top.
In my case it looks like this →
This is a very simple dataset with 5 columns in it. You can take a decent glance at it and if you love it you can make a frame out of that. We are not going to do data cleaning for this data as it contains no empty rows or any data which will harm our ML model.
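If you want to poke at the data a bit more yourself, you can pass a number to head to see more rows, and a couple of quick pandas calls will confirm that nothing is missing (this is just an optional sanity check):
data.head(10)                # shows the first 10 rows instead of the default 5
print(data.shape)            # number of rows and columns
print(data.isnull().sum())   # missing values per column; should be all zeros for this dataset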
Now that you have a basic idea of how our dataset looks it’s time to go to the next step
STEP 3
SPLITTING THE DATA
As mentioned in the short introduction above, you need to train your model on some data and also test it, so you can find out whether it fits your needs or whether you should switch to another model or make some other modifications.
In order to split the data you need to use the scikit-learn library. Don’t worry too much about what it is; a quick YouTube search will give you the basics.
So we are going to split our data into training and test sets, one to train our ML model and the other one to find out how well the model performs.
But before splitting the data, we need to tell the model which columns to take as inputs and which one to take as the target variable. The input columns are what the model uses to predict the target, and the target is what it compares its predictions against to improve.
x = data.drop('logS', axis=1)
y = data['logS']
Here, we are taking a group of columns (a DataFrame in pandas) as inputs and storing them in x. We first select the whole dataset and then drop the column we do not need as an input, in this case logS, because we are using it as the output column. If we kept it as an input, that would be like letting a child learn all the answers to the questions that are on the test.
Then we are taking y as a target variable.
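If you want to confirm that logS really was removed from the inputs, you can peek at both pieces (optional):
print(x.columns.tolist())   # the input columns: everything except logS
print(y.name)               # the target column: logS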
Now the time comes when we split the data →
from sklearn.model_selection import train_test_split
x_trains, x_test, y_trains, y_test = train_test_split(x, y, test_size=0.2, random_state=100)
As mentioned above, we are using the scikit-learn library: here we import a tool called train_test_split from its model_selection submodule.
When you apply this tool to the data, you pass in the input (x) and output (y) columns, as we have done above.
Along with that, you need to specify how much of your data to keep for testing; in our case test_size=0.2 means that 20% of the data goes into the test set.
Then there is random_state, which for now you can think of as a fixed number you fill in so that you get the same split every time you run this command.
Finally, we store the results in four variables: x_trains, x_test, y_trains, and y_test.
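A quick way to check that the 80/20 split actually happened is to print the shapes of the four pieces:
print(x_trains.shape, x_test.shape)   # roughly 80% and 20% of the rows
print(y_trains.shape, y_test.shape)   # the matching target values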
STEP 4
TRAINING THE MODEL
After all of this struggle, we have arrived at the point where we are going to train our ML model. In our case, we are going to use a linear regression model.
To briefly understand what a linear regression model is, you can think of it as a function where you put something in on one side and get an output out the other side. Mathematically, it is a linear equation with multiple inputs.
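Written out, that linear equation looks like this, where each w is a weight the model learns from the data and b is the intercept:
$$\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$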
So now let us import our ML model from scikit-learn.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_trains, y_trains)
Here we have imported the linear regression model from scikit-learn.
Then in the second line we are creating a linear regression object. So what does that do? Let us see what ChatGPT has to say about it →
Yes, you cheeky genius, when you write lr = LinearRegression(), you're creating an instance of the LinearRegression class from the sklearn library—basically, a linear regression object. It's like saying, "Yo, sklearn, give me a shiny new linear regression model to mess with!" That object holds all the tools and methods to fit your data, predict stuff, and even pull out some sexy stats like coefficients and intercepts.
So, now that you know what the second line does, let us find out what fit means.
Fitting a model to your data means training the model on your data: the model learns everything it can from the data, and that is what the third line does.
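Once fit has run, the trained object holds the numbers it has learned. If you are curious, you can print them (purely optional):
print(lr.coef_)        # one learned weight per input column
print(lr.intercept_)   # the learned constant term (intercept)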
CONGRATS YOU HAVE TRAINED YOUR FIRST ML MODEL
STEP 5
TESTING THE MODEL
Now that you have trained the model it is time to test how well it performs on the data you give to it.
For that, you need to understand two new things: the R2 score and the mean squared error (MSE).
- R2 Score: So what the hell is it?
$$R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$$
For now, you just need to understand what certain values mean. If your R2 score is 1, your model fits the data almost perfectly, but that can also be a sign of overfitting.
If your R2 score is 0, your model explains nothing and is basically useless.
If your R2 score is less than 0, your model’s predictions are worse than simply predicting the mean every time.
- As the name suggests, mean squared error is just the mean of the squared errors, and the lower it is, the better. But what is an error? The error is how far a prediction is from the actual value for a particular input, i.e. the difference between the actual value and the predicted value. For example, if the actual value is 5 and the model predicts 4.2, the error is 0.8.
Now the mathematical formula of mean squared error is simple →
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$
I will dive into the mathematical aspects of both of these metrics in detail in future articles.
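If you want to see both formulas in action before handing the work to scikit-learn, here is a tiny NumPy sketch; the actual and predicted values below are made up purely for illustration:
# Made-up actual and predicted values, purely for illustration
y_true = np.array([3.0, -1.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
errors = y_true - y_pred
mse = np.mean(errors ** 2)                      # mean of the squared errors
ss_res = np.sum(errors ** 2)                    # numerator of the R2 formula
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # denominator of the R2 formula
r2 = 1 - ss_res / ss_tot
print(mse, r2)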
To know how our model performs, we first have to get its predictions. We do that with .predict, which gives us a NumPy array (ndarray) of the model’s predictions. We call .predict on the inputs of both parts of our data, x_trains and x_test.
We do this because, now that the model has been fitted (trained), we want to see how it predicts on both the training and the test inputs, and we store those predictions as arrays in two variables.
The predictions for the training inputs go into y_lr_train_pr, and the predictions for the test inputs go into y_lr_test_pr.
y_lr_train_pr = lr.predict(x_trains)
y_lr_test_pr = lr.predict(x_test)
To check how our model has performed on both the training and the test data, we now do the evaluation.
from sklearn.metrics import mean_squared_error, r2_score
lr_train_mse = mean_squared_error(y_trains, y_lr_train_pr)
lr_train_r2 = r2_score(y_trains, y_lr_train_pr)
lr_test_mse = mean_squared_error(y_test, y_lr_test_pr)
lr_test_r2 = r2_score(y_test, y_lr_test_pr)
Here we import two tools, mean_squared_error and r2_score, from scikit-learn’s metrics submodule and use them to calculate the metrics we discussed earlier. I will go deeper into their mathematical aspects in future articles.
So, with the help of these tools, you can easily see how well or how badly your model performed.
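If you just want to glance at the four numbers before putting them into a table, a quick print will do:
print(lr_train_mse, lr_train_r2)   # training set: MSE and R2
print(lr_test_mse, lr_test_r2)     # test set: MSE and R2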
So a good model has these features →
An R2 score close to 1 (and well above 0)
A fairly low MSE
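One rough way to spot overfitting is to compare the training and test R2 scores; the 0.1 gap used below is just a rule of thumb I am assuming, not an official cutoff:
gap = lr_train_r2 - lr_test_r2   # how much better the model does on data it has already seen
if gap > 0.1:                    # assumed rule-of-thumb threshold, not an official cutoff
    print("Training R2 is much higher than test R2, the model may be overfitting")
else:
    print("Training and test R2 are close, the model generalizes reasonably well")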
Let us put our model evaluations into a simple table through the code below.
lr_results = pd.DataFrame(["Linear Regression", lr_train_mse, lr_train_r2, lr_test_mse, lr_test_r2]).transpose()
lr_results.columns = ['METHOD', 'TRAINING MSE', 'TRAINING R2', 'TEST MSE', 'TEST R2']
In the first line, we build a pandas DataFrame holding the model name and the four metrics, and we use .transpose() to turn what would otherwise be multiple rows under a single column into a single row spread across multiple columns.
The second line names the columns of the DataFrame by assigning to .columns.
Fig 1 below shows the result with transpose and Fig 2 without, to make the explanation clearer.
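If you want to see the difference without the figures, you can check the shapes yourself: without .transpose() the five values stack into a single column, and with it they spread across a single row.
raw = pd.DataFrame(["Linear Regression", lr_train_mse, lr_train_r2, lr_test_mse, lr_test_r2])
print(raw.shape)              # (5, 1): five rows under one column
print(raw.transpose().shape)  # (1, 5): one row spread across five columns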
Here is the table with our model evaluations →
You can see that our training and test MSE are nearly the same, and the R2 scores for the training and test data are also close to each other, which shows that we have made a decent model that generalizes well.
FINAL WORDS
Congrats on building your first ML model. I hope this guide helped.
If you are having some problems then I suggest learning about the libraries mentioned in this short course.
Thanks for reading. You can check out some of my other articles and share your thoughts below.