Map dropbox as network drive windows 10

With a Linux server you can set it up as a ssh server and use sshfs or directly use ssh or sftp, from terminal or with a third party client. 1. level 2. Manawqt. Op · 1m. You can set up linux in a VM…

Smartphone

独家优惠奖金 100% 高达 1 BTC + 180 免费旋转




Titanic Dataset with Logistic Regression

RMS Titanic was an Olympic-class transatlantic cruise ship owned by White Star Line. Manufactured at the Harland and Wolff (Belfast, Ireland) shipyards. On the night of April 15, 1912, it hit an iceberg on its first voyage and sank into the icy waters of the North Atlantic in about two hours and forty minutes. When its construction was completed in 1912, it was the largest steam passenger ship in the world. Its sinking resulted in the deaths of 1,514 people and went down in history as one of the greatest maritime disasters.

The great loss of life caused by the sinking of the Titanic was attributed to many reasons, but the fact that stood out over time was that the ship did not carry enough lifeboats for everyone. Although the full capacity of the Titanic was 3,547 people, the total capacity of the lifeboats on board was 1,178 people. Also, the number of men who died in total was very inproportionate because they prioritized women and children during the accident.

PassengerId is a continuous/numeric variable but acts like an index.

Let’s Examine Categorical Variables

Let’s Review Continuous/Numerical Variables

In Dataset

let’s analyze.

Age

Observation units between 0 and 1 year old

Observation units with float data type in the 0–1 age range express the age of a newborn baby.

Number of days since birthday / 365

Fare

The condition that the minimum value of the ticket price is zero

As a result
If the price paid for the ticket is zero

It cannot be ruled out that there are passengers with the same ticket number and all passengers embark from the same port.
It can be commented that, with the possibility that a certain number of promotional tickets were distributed according to the ticket grade, the age of the passengers included in the ticket class number two is not known, creating a false identity and ignoring the ship security.

Let’s Examine the Target/Dependent/Output Variable

Target Variable by Category Variables

Hedef degisken → by Survived

The number of missing observations in the Age variable is approximately one-fifth of the dataset. The number of missing observations in the Cabine variable is approximately four-fifth of the dataset. The number of missing observations in the imbarked variable is approximately two-thousandths of the dataset.

if

Cabin

Because Cabin, which is a categorical and Cardinal variable, has a large number of unique classes and the ratio of missing observation units to total observation units is high, it is a logical decision to remove/delete/drop Cabin Variable from the data set.

If a significant variable is obtained for the target variable after the Cabin variable is converted to a bool categorical variable, it should be deleted after the Feature Engineering phase is completed.

When the Cabin variable is converted to a binary variable, it becomes meaningful for the target variable, so it should be deleted after it is converted to a new variable in the Feature Engineering stage.

Age

It would be correct to fill the observation units of the variable Age, which is a continuous variable, with a statistical metric.

Standard deviation is 14 and Age is a high ratio since it has values between 0 and 80. Therefore, filling with median should be preferred instead of filling with mean.

Embarked

The ratio of missing observation units to the total observation units is two per thousand.

Outlier Observation Units

Name

The names of the passengers can give us information about their ethnic origin and financial situation.

Adjective

Labels that individuals have can also be an important factor in terms of survival.

Cabin

Ticket

There are many passengers with the same ticket tag. Tickets are cut after a single transaction and their tags are unique, so a ticket represents a group and is charged according to the number of people.

If the number of people in the group is in the range of [1–5], there is a positive correlation between the unique class and the target variable, but if the number of people is in the range of [6–7], there is a negative correlation between the unique class and the target variable.

The higher the number of people in the group, the higher the survival rate.

Ticket

Area

Age

Fare

Ticket numbers are unique, a ticket can belong to a group or to an individual.

Examining the existence of unique rare classes of Categorical Variables

If you are not carrying out a project in an industry (e.g. : health) where even small odds are of great value, it is necessary to delete rare classes.

After One Hot Encoding, a new variable is created for each class, but since the correlation between the variable created from rare classes and the target variable is at a very low level, residue occurs in the data set.

If the area under the curve increases (AUC →Area Under Curve), the accuracy score increases.

My LinkedIn Address :

My Medium Address :

My Github Address :

My Kaggle Address :

Add a comment

Related posts:

Advantages of Online Data Science Courses

Data Science has been hailed as the transformative trend that is set to re-wire the industries and re-invent the ways people do things. Products and applications are being developed in agriculture…

Downloading PubChem Bioassays made easy

The PubChem Bioassay Database is the largest open-source repository of bioactivity data. With over 290 million data points from well-described biological experiments, it is an invaluable source of…

When Ableism and Racism Collide

Before I could submit this article for publication, it appears Black Panther star Letitia Wright has deleted her Twitter and Instagram accounts following online backlash for a video she shared. I suppose that’s a frustratingly apt metaphor for what I’m about to discuss.