Why do we need Data Manipulation?
Real-world data is messy. By applying certain operations we make the data meaningful for a given requirement, and this process of transforming messy data into insightful information is called data manipulation. Various languages and tools support data manipulation (e.g. SQL, R, Excel, etc.). In this blog we will broadly discuss Pandas for data manipulation. In this section I will use the Titanic dataset for a broader understanding.
Seaborn will load the example dataset, which is hosted in an online repository.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset("titanic")
It is hectic to go through each and every row of a dataset, so for a cursory glance we look at the first/last five rows. …
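A quick sketch of that cursory glance (using a small stand-in DataFrame here rather than the full Titanic data, so the snippet is self-contained):

```python
import pandas as pd

# small stand-in for the titanic DataFrame loaded earlier via seaborn
df = pd.DataFrame({"survived": [0, 1, 1, 0, 1, 0, 1],
                   "age": [22, 38, 26, 35, 35, 54, 2]})

first_five = df.head()  # first 5 rows by default
last_five = df.tail()   # last 5 rows by default
print(first_five)
print(last_five)
```

Both `head()` and `tail()` accept an optional row count, e.g. `df.head(3)`.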
This dataset contains information about used cars listed on cardekho.com. It has 9 columns, each describing a specific feature: Car_Name gives the car company; Year is the year in which the car was purchased new; selling_price is the price at which the car is being sold (this will be the target label for later price prediction); km_driven is the number of kilometres the car has been driven; fuel is the fuel type of the car (CNG, Petrol, Diesel, etc.); seller_type tells whether the seller is an individual or a dealer; transmission tells whether the car is automatic or manual; and owner is the number of previous owners of the car. …
What is Feature Importance?
It assigns a score to each input feature based on its importance in predicting the output: the more a feature contributes to predicting the output, the higher its score. We can use it in both classification and regression problems. Suppose you have a basket of 10 fruits out of which you would like to pick the mango, lychee and orange; those fruits are the important ones for you, and feature importance works the same way in machine learning. In this blog we will understand various feature importance methods. Let's get started…
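As a concrete illustration, here is a minimal sketch using a random forest's built-in importance scores on synthetic data (the dataset and model choice are my own for illustration, not from this blog):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data: 5 features, of which only 2 actually carry signal
X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=2, n_redundant=0, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X, y)

# one importance score per input feature; the scores sum to 1
for i, score in enumerate(model.feature_importances_):
    print(f"feature {i}: {score:.3f}")
```

The two informative features should receive noticeably higher scores than the noise features.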
It is best suited to algorithms that do not natively support feature importance, because it calculates a relative importance score independently of the model used. It is one of the best techniques for feature selection. Let's …
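The model-agnostic method described here matches what scikit-learn calls permutation importance; assuming that is the technique intended, a minimal sketch looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4,
                           n_informative=2, n_redundant=0, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# shuffle each feature in turn and measure the drop in score;
# this works with any fitted model, hence "independent of the model used"
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```

A large drop in score when a feature is shuffled means the model relied on that feature.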
Predicting one class out of more than two classes is known as multi-class classification. Suppose your mother has given you the task of picking the mango from a basket holding a variety of fruits; indirectly, she has asked you to solve a multi-class classification problem.
But our main aim is to apply a binary classification approach to predict the result of a multi-class problem.
Some classification algorithms, such as LogisticRegression and SupportVectorClassifier, were not designed to solve multi-class classification problems directly. By applying a heuristic approach to these algorithms we can still solve multi-class problems. …
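One such heuristic is one-vs-rest: fit one binary classifier per class and pick the most confident one. A minimal sketch (using the iris dataset purely as an example of a 3-class problem):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes

# one binary LogisticRegression is trained per class ("is it this class or not?")
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))  # one fitted binary model per class
```

Scikit-learn also offers `OneVsOneClassifier`, which instead trains one binary model per pair of classes.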
The dataset I chose is the affairs dataset that comes with Statsmodels. It was derived from a 1974 survey of women by Redbook magazine, in which married women were asked about their participation in extramarital affairs. I decided to treat this as a classification problem by creating a new binary variable, affair (did the woman have at least one affair?), and trying to predict the classification for each woman. The variables present in the dataset for prediction are: rate_marriage (the woman's rating of her marriage), age (the woman's age), yrs_married (number of years married), children (no. …
What is correlation?
Correlation describes the mutual relationship between two or more features. Suppose you want to purchase a house; a property dealer shows you some houses, and you observe that the price increases with the size of the house. Here the size of the house is strongly correlated with the price.
Now suppose you are a sportsperson and a huge recession hits white-collar jobs. This recession won't affect your earnings, because a white-collar recession has nothing to do with your profession. In this case there is no correlation between the two features.
Here, in the chi-squared test, we decide whether a feature is correlated with the target variable or not using the p-value. …
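A minimal sketch of that test, using a hypothetical contingency table (the counts below are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical counts: rows = feature category, columns = target class
table = np.array([[30, 10],
                  [10, 30]])

chi2, p_value, dof, expected = chi2_contingency(table)

# a small p-value (commonly < 0.05) rejects independence,
# i.e. the feature and the target appear to be related
print(p_value)
```

With a balanced table such as `[[20, 20], [20, 20]]` the p-value would instead be large, indicating no relationship.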
Let's take a binary example (class 0 and class 1). K is the number of neighbouring points we consider when deciding which class our test point belongs to. K should not be an even number; it should be odd (1, 3, 5, 7, 9, …) so that the vote between the two classes can never end in a tie.
But the question is: how do we choose the best value of K for predicting where the test data belongs? Let's get started.
To find the best value of K we take a range of K values and calculate the mean error rate for each. Let's understand it using Python.
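The loop described above can be sketched as follows (synthetic data and the specific K range are my own choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# try odd values of K and record the mean error rate for each
ks = list(range(1, 20, 2))
error_rates = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    pred = knn.predict(X_test)
    error_rates.append(np.mean(pred != y_test))

best_k = ks[int(np.argmin(error_rates))]
print(best_k)  # the K with the lowest error on the held-out split
```

Plotting `error_rates` against `ks` (e.g. with matplotlib) makes the typical "elbow" in the error curve easy to spot.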
Let's first see what the accuracy score and confusion matrix look like for a randomly chosen K value. …
What is Pruning?
In general, pruning is the process of removing selected parts of a plant, such as buds, branches and roots. In a decision tree, pruning does the same job: it removes branches of the tree to overcome overfitting. This can be done in two ways, and we will discuss both techniques in detail. Let's get started…
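As a quick preview of one of the two techniques, here is a minimal sketch of post-pruning via scikit-learn's cost-complexity parameter `ccp_alpha` (the dataset and the alpha value are my own illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# an unpruned tree grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# post-pruning: a larger ccp_alpha removes more branches after growing
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)
```

The pruned tree has far fewer nodes, trading a little training accuracy for better generalisation.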
What is an Imbalanced Dataset?
It is most commonly found in medical datasets, fraud-detection datasets, etc. Suppose Apollo Hospital has made a dataset of people who came for a diabetes checkup; the dataset has a binary output, namely whether a person is diabetic or not.
Let's say that out of 1000 records, 100 people are diabetic and the rest are normal, so according to the output our dataset has been divided into two parts.
Diabetic = 100 and non-diabetic = 900. Here a large portion of the dataset leans towards one particular class (the negative class), which leads to an imbalanced dataset. …
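Checking for this imbalance in pandas is a one-liner; a sketch using hypothetical labels matching the counts above:

```python
import pandas as pd

# hypothetical Apollo-style records: 100 diabetic (1), 900 non-diabetic (0)
labels = pd.Series([1] * 100 + [0] * 900, name="diabetic")

counts = labels.value_counts()
print(counts.loc[0], counts.loc[1])       # counts per class
print(counts.loc[1] / len(labels))        # minority-class share
```

A minority share this low (10%) is a signal to consider resampling or class-weighting before training.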
What is Feature Engineering and why do we use it?
The success of every machine learning model depends on how you present the data, and a suitable presentation can be achieved with feature engineering. It gives a machine learning engineer the flexibility to understand data easily and to apply a less complex model yet achieve a better result. Even less-than-optimal parameters combined with well-engineered features will give you accurate results. In this section we will discuss encoding techniques for categorical variables. Let's discuss it in detail…
It is a technique used to convert a categorical variable into a numerical variable.
Why do we need to convert categorical variables into numerical variables? …
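As a taste of such a conversion, here is a minimal sketch of one common encoding (one-hot via `pd.get_dummies`; the `fuel` column is a hypothetical example echoing the car dataset above):

```python
import pandas as pd

# hypothetical categorical column
df = pd.DataFrame({"fuel": ["Petrol", "Diesel", "CNG", "Petrol"]})

# one-hot encoding: one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=["fuel"])
print(list(encoded.columns))
```

Most models cannot consume the string column directly, but they can consume the resulting numeric indicator columns.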