- This topic has 0 replies, 1 voice, and was last updated 1 year, 11 months ago by Idowu.
- February 5, 2020 at 8:36 pm #85226, by @idowu (Spectator)
A friend once told me: “I’ll finally have to abandon the model, since its accuracy just won’t improve.”
Imagine how painful it is when you put effort into training a model on a data set and all you get is a trained model with low accuracy.
Training a model is often erroneously believed to be the most important step in machine learning. In reality, data cleaning, data preprocessing and feature selection are the most important and time-consuming steps; together, they are referred to as data engineering.
Take a look at my previous article on some of the ways in which you can clean your data using Python.
Digressing a bit: data preprocessing is the process of converting your data into formats the machine can easily interpret.
Imagine you have unstructured data such as images. You can convert each of them into integers such as 0, 1 and 2, which a machine can process better and faster. Another example is a column of “yes” or “no” values, which can be preprocessed into 1 and 0 respectively.
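As a quick sketch of that second case, a “yes”/“no” column can be encoded with a simple mapping in pandas (the column name `subscribed` and its values are made up for illustration):

```python
import pandas as pd

# A hypothetical column of "yes"/"no" responses
df = pd.DataFrame({"subscribed": ["yes", "no", "yes", "no"]})

# Map each category to an integer the machine can process
df["subscribed"] = df["subscribed"].map({"yes": 1, "no": 0})

print(df["subscribed"].tolist())  # [1, 0, 1, 0]
```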
This article, however, will focus on feature selection.
What is a Feature as it Relates to Data?
Before we look further into the meaning of feature selection, let’s quickly cover what the word “feature” means as it relates to data management.
I like to describe a feature simply as a characteristic of each column in a structured data set: it’s the name given to a particular column of a table in a database. An example is listing out the price and cost of commodities. We could present such information in a table containing those two attributes (price and cost), each in a column.
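To make that concrete, here is a minimal sketch of such a table in pandas (the figures are invented), where each column name is one feature:

```python
import pandas as pd

# A small table of commodities; each column is one feature
commodities = pd.DataFrame({
    "price": [120, 85, 300],
    "cost": [100, 70, 260],
})

# The features of this data set are simply its column names
print(list(commodities.columns))  # ['price', 'cost']
```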
What is Feature Selection?
For a better understanding of the term feature selection, let’s see the meaning of some terms:
An output, also known as a dependent variable, is a variable or feature whose outcome depends on other variables or features in a data set. It’s usually referred to as the predicted or target variable.
Independent variables or Predictors:
Features other than the dependent variable are called independent variables or predictors; they are used to predict the dependent variable.
The output is usually on the Y-axis, while the predictors are usually on the X-axis.
Feature selection is therefore a process in machine learning that involves either automatic or manual selection of the most important features or attributes in a data set, those that have a significant relationship with the target output or dependent variable (the variable to be predicted).
It is an important step in machine learning that selects the best features for predicting an outcome, thereby increasing the accuracy of a model. In some instances, it’s aptly described as a filter.
Feature selection can be:
- Manual feature selection; or
- Automatic feature selection
Example of Manual Feature Selection
For instance, suppose your aim is to predict whether a wine will taste good or bad. You’ve been supplied with 3 bottles, and you decide to consider some possible predictors (those things that will define whether it’s good or bad), such as wine age, carbon content, expiry date, sugar content and bottle colour. You can also call them the attributes of the wine that will predict the outcome, good or bad.
You then suspect that some of these features may not really be relevant to predicting whether the wine is good or bad, so you narrow down the features to the most important ones, which you feel will do well at telling you about the state of the wine. Reasonably, you select the wine age and the expiry date, while considering the rest irrelevant, as including them would introduce what we call noise in machine learning. This scenario is a manual feature selection process.
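A minimal sketch of that manual step, assuming the wine attributes live in a pandas table (the values here are invented):

```python
import pandas as pd

# Hypothetical wine data with all five candidate predictors
wines = pd.DataFrame({
    "wine_age": [3, 7, 1],
    "carbon_content": [0.2, 0.5, 0.1],
    "expiry_date": [2021, 2025, 2020],
    "sugar_content": [4.1, 2.3, 5.0],
    "bottle_colour": ["green", "brown", "green"],
})

# Manual feature selection: keep only the features we judged relevant
selected = wines[["wine_age", "expiry_date"]]
print(list(selected.columns))  # ['wine_age', 'expiry_date']
```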
Example of an Automatic Feature Selection
However, you notice that as a human, you can be emotional and biased when selecting the best predictive features, so you decide to employ a machine to do the selection for you. To do that, you instruct the machine on what to do, how to do it and the approach to use: you feed it some algorithms (sets of instructions) which it applies to the data you provide, and it returns the best features to you. Now you can rest assured that the selected features are based on facts and figures and are free of bias, which is an important advantage of automatic feature selection over manual selection. No matter how perfect you feel you can be, it’s sometimes better to leave the job to a machine.
Importance of Feature Selection
Honestly, the term “feature selection” confused me at first, but ever since I grasped its meaning and relevance in machine learning, I’ve seen a turnaround in the accuracy of my models.
Feature selection is especially useful when:
- The aim is to increase the accuracy of your model
- You have relatively little data
- You aim at significantly reducing training time
- You intend to reduce noise, which confuses the machine through redundant features
We often jump to conclusions while leaving out critical information about a subject, and still expect miracles to happen. That does not work for a machine at all; machines are very unintelligent on their own and work strictly on a feed-in, feed-back basis.
A machine learning model devoid of good feature selection is like going shopping for shoes and getting so confused and worked up over which pair to pick, because of the many fancy shoes in the store, that you end up buying more than you should. That’s when your budget’s accuracy shrinks, as you’ve spent more than you really intended.
Data Feature Selection Techniques
Life has been made easy for programmers over the years, as a large pool of libraries and frameworks has been made available, reducing the time spent programming. We will therefore look through some feature selection techniques in the next few lines.
Please note that this article will only explain 3 of the most commonly used feature selection techniques and how you can apply them to real-world data. We will not be using an actual data set; the aim is only to explain the basic concepts behind each technique covered.
The 3 feature selection techniques that will be explained in this article include:
- Univariate Feature Selection
- Principal Component Analysis (PCA)
- Recursive Feature Elimination (RFE)
Univariate Feature Selection
This technique selects the best features using a statistical test of association known as the chi-square test. It looks across the data and tests differences and associations across features, returning the attributes that are most associated with, and best predict, our outcome.
This can be done with the SelectKBest class from sklearn.feature_selection, as shown below:

```python
# Load the following libraries
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from numpy import set_printoptions

# Load the dataset
df = pd.read_csv(r'example_data.csv')

# Slice the features into input and output
array = df.values
X = array[:, 0:8]
Y = array[:, 8]

# Apply univariate selection to pick the best 3 features, using k=3
univar = SelectKBest(score_func=chi2, k=3)
fit = univar.fit(X, Y)

# Set the print precision to 2
set_printoptions(precision=2)
print(fit.scores_)

selected_features = fit.transform(X)
print(selected_features)
```
The code above will output the features with the strongest association.
The Principal Component Analysis (PCA)
This is usually referred to as a data reduction technique. It applies linear algebra to compress the initially numerous features into the few components that capture most of the variance in the data. It can also be used to visualize the data and draw insights from it.

```python
# Import the PCA class alongside the libraries loaded before
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv(r'example_data.csv')

# Slice the features into input and output
array = df.values
X = array[:, 0:8]
Y = array[:, 8]

prc = PCA(n_components=4)
compress = prc.fit_transform(X)
print(prc.explained_variance_ratio_)
```
Setting n_components=4 means we’re keeping the best four components from our data; the output will show the variance explained by each of those four most important components.
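To see what that output looks like, here is a self-contained sketch on synthetic data (the random numbers stand in for a real data set):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 100 samples, 8 features
data = rng.normal(size=(100, 8))

pca = PCA(n_components=4)
pca.fit(data)

# Each entry is the fraction of total variance captured by one component,
# in decreasing order; the closer their sum is to 1, the less information
# was lost by keeping only 4 components
ratios = pca.explained_variance_ratio_
print(ratios)
print(ratios.sum())
```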
Recursive Feature Elimination (RFE)
This selection technique eliminates the weakest features and selects the best ones in a recursive manner. The code below can be used to implement it:

```python
# Load RFE from sklearn
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

df = pd.read_csv(r'example_data.csv')

# Slice the features into input and output
array = df.values
X = array[:, 0:8]
Y = array[:, 8]

model = LogisticRegression()
recurse = RFE(model, n_features_to_select=4)  # select the best 4 features
fit = recurse.fit(X, Y)

print("Number of features: %d" % fit.n_features_)
print("Selected features: %s" % fit.support_)
print("Feature ranking: %s" % fit.ranking_)
```
The feature ranking gives you the hierarchy of the features.
For example, if we have the following columns: [cost, price, age, gender, name, number_of_sales, date, income, expenditure] and the output ranking is [1, 1, 2, 4, 3, 5, 4, 1, 1], then the 4 best features are [cost, price, income, expenditure]. You can verify this by comparing the positions of the 1s with the column list. You can then decide to model on those features.
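That comparison can be sketched in a few lines, using the column list and ranking from the example above:

```python
# The column names and the ranking array RFE returned (from the example)
columns = ["cost", "price", "age", "gender", "name",
           "number_of_sales", "date", "income", "expenditure"]
ranking = [1, 1, 2, 4, 3, 5, 4, 1, 1]

# Features ranked 1 are the ones RFE selected
best = [col for col, rank in zip(columns, ranking) if rank == 1]
print(best)  # ['cost', 'price', 'income', 'expenditure']
```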
Feature selection is a very important step that should be done effectively when training a model; its importance cannot be overemphasized.