import pandas as pd
Introduction
This dataset contains 8,414 housing advertisements scraped from Zameen.com, all for properties located in Karachi. It is available on Kaggle, a popular platform for data science and machine learning enthusiasts, under the title Karachi, Pakistan Property Prices 2023.
Note that this dataset is only a fraction of the data available. It was scraped and compiled by Faiq Ali, who was then a student at the University of Malaya, and it is listed on his Kaggle page. It gives a snapshot of the real estate market of Karachi as of 2023. In this post we will wrangle the data and prepare it for machine learning.
file_path = "karachi-pakistan-property-prices-2023.csv"

def prep_karachi_data(file_path):
    return (pd.read_csv(file_path)
           )

df = prep_karachi_data(file_path=file_path)
print(df.info())
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8414 entries, 0 to 8413
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 8414 non-null object
1 price 8413 non-null object
2 date added 8047 non-null object
3 type 8413 non-null object
4 bedrooms 8414 non-null int64
5 bathrooms 8414 non-null int64
6 area 7322 non-null object
7 location 8047 non-null object
8 complete location 8413 non-null object
9 description 8413 non-null object
10 keywords 7551 non-null object
11 url 8414 non-null object
dtypes: int64(2), object(10)
memory usage: 788.9+ KB
None
| | title | price | date added | type | bedrooms | bathrooms | area | location | complete location | description | keywords | url |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 600 Yard Bungalow For Sale In DHA Phase 6 | 11.5 Crore | 14 hours ago | House | 5 | 6 | 600 Sq. Yd. | DHA Defence, Karachi, Sindh | DHA Phase 6, DHA Defence, Karachi, Sindh | Chance Deal 600 Yard Bungalow For Sale | Built in year: 1,Parking Spaces: 5,Flooring,Ot... | https://www.zameen.com/Property/d_h_a_dha_phas... |
1 | 3 BEDS LUXURY 125 SQ YARDS VILLA FOR SALE LOCA... | 1.45 Crore | 5 hours ago | House | 0 | 0 | NaN | Bahria Town Karachi, Karachi, Sindh | Bahria Town - Ali Block, Bahria Town - Precinc... | 3 BEDS LUXURY 125 SQ YARDS VILLA FOR SALE LOCA... | Bedrooms: 3,Bathrooms: 3,Kitchens: 2 | https://www.zameen.com/Property/bahria_town_pr... |
2 | 4 BEDS LUXURY SPORTS CITY VILLA FOR RENT BAHRI... | 2.12 Crore | 5 hours ago | House | 0 | 0 | NaN | Bahria Town Karachi, Karachi, Sindh | Bahria Sports City, Bahria Town Karachi, Karac... | 4 BEDS LUXURY SPORTS CITY VILLA FOR RENT BAHRI... | Bedrooms: 4,Bathrooms: 4,Kitchens: 2 | https://www.zameen.com/Property/bahria_town_ka... |
3 | 3 BEDS LUXURY 235 SQ YARDS VILLA FOR SALE LOCA... | 1.5 Crore | 5 hours ago | House | 0 | 0 | NaN | Bahria Town Karachi, Karachi, Sindh | Bahria Town - Precinct 31, Bahria Town Karachi... | 3 BEDS LUXURY 235 SQ YARDS VILLA FOR SALE LOCA... | Bedrooms: 3,Bathrooms: 3,Kitchens: 2 | https://www.zameen.com/Property/bahria_town_ka... |
4 | Buying A Flat In Clifton - Block 9? | 4 Crore | 6 hours ago | Flat | 3 | 3 | 200 Sq. Yd. | Clifton, Karachi, Sindh | Clifton - Block 9, Clifton, Karachi, Sindh | Apartment for sale | Flooring,Electricity Backup,Broadband Internet... | https://www.zameen.com/Property/clifton_clifto... |
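The info() output above shows that several columns, such as area and keywords, have fewer non-null entries than the 8,414 rows. As a rough sketch, run on a made-up three-row table rather than the real file, one way to count the gaps per column is:

```python
import pandas as pd

# Toy stand-in for the listings table (values are invented for illustration).
toy = pd.DataFrame({
    "title": ["House A", "Flat B", "House C"],
    "price": ["1.5 Crore", None, "70 Lakh"],
    "area": ["600 Sq. Yd.", None, None],
})

# Count missing values per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```

The same two calls, isna() followed by sum(), should work unchanged on the real df.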
The first thing we will do is rename the columns whose names contain spaces, replacing each space with an underscore.
def prep_karachi_data(file_path):
    return (pd.read_csv(file_path)
            .rename(columns=lambda x: x.replace(" ", "_"))  # no spaces in column names
           )

df = prep_karachi_data(file_path=file_path)
print(df.info())
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8414 entries, 0 to 8413
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 8414 non-null object
1 price 8413 non-null object
2 date_added 8047 non-null object
3 type 8413 non-null object
4 bedrooms 8414 non-null int64
5 bathrooms 8414 non-null int64
6 area 7322 non-null object
7 location 8047 non-null object
8 complete_location 8413 non-null object
9 description 8413 non-null object
10 keywords 7551 non-null object
11 url 8414 non-null object
dtypes: int64(2), object(10)
memory usage: 788.9+ KB
None
| | title | price | date_added | type | bedrooms | bathrooms | area | location | complete_location | description | keywords | url |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 600 Yard Bungalow For Sale In DHA Phase 6 | 11.5 Crore | 14 hours ago | House | 5 | 6 | 600 Sq. Yd. | DHA Defence, Karachi, Sindh | DHA Phase 6, DHA Defence, Karachi, Sindh | Chance Deal 600 Yard Bungalow For Sale | Built in year: 1,Parking Spaces: 5,Flooring,Ot... | https://www.zameen.com/Property/d_h_a_dha_phas... |
1 | 3 BEDS LUXURY 125 SQ YARDS VILLA FOR SALE LOCA... | 1.45 Crore | 5 hours ago | House | 0 | 0 | NaN | Bahria Town Karachi, Karachi, Sindh | Bahria Town - Ali Block, Bahria Town - Precinc... | 3 BEDS LUXURY 125 SQ YARDS VILLA FOR SALE LOCA... | Bedrooms: 3,Bathrooms: 3,Kitchens: 2 | https://www.zameen.com/Property/bahria_town_pr... |
2 | 4 BEDS LUXURY SPORTS CITY VILLA FOR RENT BAHRI... | 2.12 Crore | 5 hours ago | House | 0 | 0 | NaN | Bahria Town Karachi, Karachi, Sindh | Bahria Sports City, Bahria Town Karachi, Karac... | 4 BEDS LUXURY SPORTS CITY VILLA FOR RENT BAHRI... | Bedrooms: 4,Bathrooms: 4,Kitchens: 2 | https://www.zameen.com/Property/bahria_town_ka... |
3 | 3 BEDS LUXURY 235 SQ YARDS VILLA FOR SALE LOCA... | 1.5 Crore | 5 hours ago | House | 0 | 0 | NaN | Bahria Town Karachi, Karachi, Sindh | Bahria Town - Precinct 31, Bahria Town Karachi... | 3 BEDS LUXURY 235 SQ YARDS VILLA FOR SALE LOCA... | Bedrooms: 3,Bathrooms: 3,Kitchens: 2 | https://www.zameen.com/Property/bahria_town_ka... |
4 | Buying A Flat In Clifton - Block 9? | 4 Crore | 6 hours ago | Flat | 3 | 3 | 200 Sq. Yd. | Clifton, Karachi, Sindh | Clifton - Block 9, Clifton, Karachi, Sindh | Apartment for sale | Flooring,Electricity Backup,Broadband Internet... | https://www.zameen.com/Property/clifton_clifto... |
We will pull out the second column, price, for further investigation. We want it to be numeric, but we can see that it is stored as an object (string) column.
df['price']
0 11.5 Crore
1 1.45 Crore
2 2.12 Crore
3 1.5 Crore
4 4 Crore
...
8409 5 Crore
8410 1.2 Crore
8411 1.55 Crore
8412 70 Lakh
8413 1 Crore
Name: price, Length: 8414, dtype: object
The column mixes two units: Lakh and Crore.
“Lakh” and “Crore” are units of numerical value commonly used in the Indian subcontinent, including countries like India, Pakistan, Bangladesh, and Nepal. They are part of the South Asian numbering system and are widely used in these regions for financial transactions, population counts, and more.
- Lakh:
  - One lakh is equal to 100,000 (10^5).
  - For example, in international notation, 5 lakh would be written as 500,000.
- Crore:
  - One crore is equal to 10 million, or 100 lakh (10^7).
  - In international notation, 1 crore would be expressed as 10,000,000.
These terms provide a more convenient way to express large numbers, particularly in the context of financial transactions and population statistics in the Indian subcontinent. For instance, it’s more common to hear about a budget of 5 crore rupees rather than 50 million rupees.
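To make the arithmetic concrete, here is a small sketch of the conversion to plain rupees. The names UNIT_TO_RUPEES and to_rupees are hypothetical helpers introduced only for this illustration; they are not part of the dataset or of pandas.

```python
# Multipliers for the two units described above (hypothetical helper names).
UNIT_TO_RUPEES = {"Lakh": 100_000, "Crore": 10_000_000}

def to_rupees(amount, unit):
    """Convert an amount expressed in lakh or crore to plain rupees."""
    return amount * UNIT_TO_RUPEES[unit]

print(to_rupees(5, "Lakh"))    # 500000
print(to_rupees(1, "Crore"))   # 10000000
print(to_rupees(1, "Crore") / to_rupees(1, "Lakh"))  # 100.0
```

The last line confirms that one crore is 100 lakh, which is the factor we need whenever we want to express both units on a single scale.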
We will split that column into two, one for the figures and the other for the units. The values and the units are separated by a space.
df['price'].str.split(" ", expand=True)
| | 0 | 1 |
---|---|---|
0 | 11.5 | Crore |
1 | 1.45 | Crore |
2 | 2.12 | Crore |
3 | 1.5 | Crore |
4 | 4 | Crore |
... | ... | ... |
8409 | 5 | Crore |
8410 | 1.2 | Crore |
8411 | 1.55 | Crore |
8412 | 70 | Lakh |
8413 | 1 | Crore |
8414 rows × 2 columns
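Before relying on the split, it can help to confirm which units actually occur. The sketch below uses a few made-up prices in the same "number, space, unit" format; on the real data one would apply the same calls to df['price'].

```python
import pandas as pd

# Made-up prices in the same "<number> <unit>" format as the dataset.
prices = pd.Series(["11.5 Crore", "70 Lakh", "4 Crore", None], name="price")

# Column 0 holds the figure, column 1 holds the unit.
parts = prices.str.split(" ", expand=True)

# Tally the units; missing prices are ignored by default.
unit_counts = parts[1].value_counts()
print(unit_counts)

# Flag anything other than the two units we expect.
unexpected = set(unit_counts.index) - {"Lakh", "Crore"}
print(unexpected)
```

If unexpected comes back non-empty on the real column, the conversion step would need to handle those extra units as well.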
We will now add this step to the function we are gradually building. We will also convert the values in the new price_ column to float.
def prep_karachi_data(file_path):
    return (pd.read_csv(file_path)
            .rename(columns=lambda x: x.replace(" ", "_"))  # no spaces in column names
            .assign(price_=lambda x: x["price"].str.split(" ", expand=True)[0],
                    currency_name=lambda x: x["price"].str.split(" ", expand=True)[1])
            .astype({"price_": float})
           )

df = prep_karachi_data(file_path=file_path)
print(df.info())
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8414 entries, 0 to 8413
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 8414 non-null object
1 price 8413 non-null object
2 date_added 8047 non-null object
3 type 8413 non-null object
4 bedrooms 8414 non-null int64
5 bathrooms 8414 non-null int64
6 area 7322 non-null object
7 location 8047 non-null object
8 complete_location 8413 non-null object
9 description 8413 non-null object
10 keywords 7551 non-null object
11 url 8414 non-null object
12 price_ 8413 non-null float64
13 currency_name 8413 non-null object
dtypes: float64(1), int64(2), object(11)
memory usage: 920.4+ KB
None
| | title | price | date_added | type | bedrooms | bathrooms | area | location | complete_location | description | keywords | url | price_ | currency_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 600 Yard Bungalow For Sale In DHA Phase 6 | 11.5 Crore | 14 hours ago | House | 5 | 6 | 600 Sq. Yd. | DHA Defence, Karachi, Sindh | DHA Phase 6, DHA Defence, Karachi, Sindh | Chance Deal 600 Yard Bungalow For Sale | Built in year: 1,Parking Spaces: 5,Flooring,Ot... | https://www.zameen.com/Property/d_h_a_dha_phas... | 11.50 | Crore |
1 | 3 BEDS LUXURY 125 SQ YARDS VILLA FOR SALE LOCA... | 1.45 Crore | 5 hours ago | House | 0 | 0 | NaN | Bahria Town Karachi, Karachi, Sindh | Bahria Town - Ali Block, Bahria Town - Precinc... | 3 BEDS LUXURY 125 SQ YARDS VILLA FOR SALE LOCA... | Bedrooms: 3,Bathrooms: 3,Kitchens: 2 | https://www.zameen.com/Property/bahria_town_pr... | 1.45 | Crore |
2 | 4 BEDS LUXURY SPORTS CITY VILLA FOR RENT BAHRI... | 2.12 Crore | 5 hours ago | House | 0 | 0 | NaN | Bahria Town Karachi, Karachi, Sindh | Bahria Sports City, Bahria Town Karachi, Karac... | 4 BEDS LUXURY SPORTS CITY VILLA FOR RENT BAHRI... | Bedrooms: 4,Bathrooms: 4,Kitchens: 2 | https://www.zameen.com/Property/bahria_town_ka... | 2.12 | Crore |
3 | 3 BEDS LUXURY 235 SQ YARDS VILLA FOR SALE LOCA... | 1.5 Crore | 5 hours ago | House | 0 | 0 | NaN | Bahria Town Karachi, Karachi, Sindh | Bahria Town - Precinct 31, Bahria Town Karachi... | 3 BEDS LUXURY 235 SQ YARDS VILLA FOR SALE LOCA... | Bedrooms: 3,Bathrooms: 3,Kitchens: 2 | https://www.zameen.com/Property/bahria_town_ka... | 1.50 | Crore |
4 | Buying A Flat In Clifton - Block 9? | 4 Crore | 6 hours ago | Flat | 3 | 3 | 200 Sq. Yd. | Clifton, Karachi, Sindh | Clifton - Block 9, Clifton, Karachi, Sindh | Apartment for sale | Flooring,Electricity Backup,Broadband Internet... | https://www.zameen.com/Property/clifton_clifto... | 4.00 | Crore |
Next we will multiply every Crore value by 100 to convert it to Lakhs, so that all prices share one unit. Our code checks the currency_name column to see whether the value is Crore before making the conversion, then overwrites the price column with the result. After that we drop the two helper columns we just created. We will use the mask method for the conversion.
The mask method is used to replace values in a DataFrame or Series under certain conditions.
.assign(price=lambda x: x["price_"].mask(x["currency_name"] == "Crore", x["price_"] * 100))
- Using mask inside .assign:
  - The code is creating or modifying the ‘price’ column in the DataFrame.
  - x["price_"].mask(...): This applies the mask method on the price_ column of the DataFrame.
- Condition in mask:
  - The first argument of mask is a condition: x["currency_name"] == "Crore". This checks each row in the currency_name column to see if it equals “Crore”.
- Replacement in mask:
  - The second argument of mask is x["price_"] * 100. This is what the mask method will replace the original value with, but only where the condition is met (i.e., where ‘currency_name’ is “Crore”).
- How it works in our example:
  - For each row in the DataFrame, the code checks if the currency_name for that row is “Crore”.
  - If it is “Crore”, the corresponding value in the ‘price_’ column is multiplied by 100 and this new value is stored in the price column.
  - If it is not “Crore”, the price column takes the value from the ‘price_’ column unchanged.
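The behaviour described in the list above can be checked on a tiny example. The three rows below are made up, but the column names match the helper columns we built earlier.

```python
import pandas as pd

# Made-up rows that mimic the two helper columns built earlier.
toy = pd.DataFrame({
    "price_": [11.5, 70.0, 4.0],
    "currency_name": ["Crore", "Lakh", "Crore"],
})

# Where the unit is "Crore", swap in price_ * 100; elsewhere keep price_.
toy = toy.assign(
    price=lambda x: x["price_"].mask(x["currency_name"] == "Crore", x["price_"] * 100)
)
print(toy["price"].tolist())  # [1150.0, 70.0, 400.0]
```

The Lakh row passes through untouched, while both Crore rows are scaled by 100. The where method is the mirror image of mask: it keeps values where the condition is True and replaces the rest.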
We will now insert that code in the function we are building and also delete the new columns we created since we no longer need them.
def prep_karachi_data(file_path):
    return (pd.read_csv(file_path)
            .rename(columns=lambda x: x.replace(" ", "_"))  # no spaces in column names
            .assign(price_=lambda x: x["price"].str.split(" ", expand=True)[0],
                    currency_name=lambda x: x["price"].str.split(" ", expand=True)[1])
            .astype({"price_": float})
            .assign(price=lambda x: x["price_"].mask(x["currency_name"] == "Crore", x["price_"] * 100))  # price in lakh
            .drop(columns=['price_', 'currency_name'])
           )

df = prep_karachi_data(file_path=file_path)
print(df.info())
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8414 entries, 0 to 8413
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 8414 non-null object
1 price 8413 non-null float64
2 date_added 8047 non-null object
3 type 8413 non-null object
4 bedrooms 8414 non-null int64
5 bathrooms 8414 non-null int64
6 area 7322 non-null object
7 location 8047 non-null object
8 complete_location 8413 non-null object
9 description 8413 non-null object
10 keywords 7551 non-null object
11 url 8414 non-null object
dtypes: float64(1), int64(2), object(9)
memory usage: 788.9+ KB
None
| | title | price | date_added | type | bedrooms | bathrooms | area | location | complete_location | description | keywords | url |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 600 Yard Bungalow For Sale In DHA Phase 6 | 1150.0 | 14 hours ago | House | 5 | 6 | 600 Sq. Yd. | DHA Defence, Karachi, Sindh | DHA Phase 6, DHA Defence, Karachi, Sindh | Chance Deal 600 Yard Bungalow For Sale | Built in year: 1,Parking Spaces: 5,Flooring,Ot... | https://www.zameen.com/Property/d_h_a_dha_phas... |
1 | 3 BEDS LUXURY 125 SQ YARDS VILLA FOR SALE LOCA... | 145.0 | 5 hours ago | House | 0 | 0 | NaN | Bahria Town Karachi, Karachi, Sindh | Bahria Town - Ali Block, Bahria Town - Precinc... | 3 BEDS LUXURY 125 SQ YARDS VILLA FOR SALE LOCA... | Bedrooms: 3,Bathrooms: 3,Kitchens: 2 | https://www.zameen.com/Property/bahria_town_pr... |
2 | 4 BEDS LUXURY SPORTS CITY VILLA FOR RENT BAHRI... | 212.0 | 5 hours ago | House | 0 | 0 | NaN | Bahria Town Karachi, Karachi, Sindh | Bahria Sports City, Bahria Town Karachi, Karac... | 4 BEDS LUXURY SPORTS CITY VILLA FOR RENT BAHRI... | Bedrooms: 4,Bathrooms: 4,Kitchens: 2 | https://www.zameen.com/Property/bahria_town_ka... |
3 | 3 BEDS LUXURY 235 SQ YARDS VILLA FOR SALE LOCA... | 150.0 | 5 hours ago | House | 0 | 0 | NaN | Bahria Town Karachi, Karachi, Sindh | Bahria Town - Precinct 31, Bahria Town Karachi... | 3 BEDS LUXURY 235 SQ YARDS VILLA FOR SALE LOCA... | Bedrooms: 3,Bathrooms: 3,Kitchens: 2 | https://www.zameen.com/Property/bahria_town_ka... |
4 | Buying A Flat In Clifton - Block 9? | 400.0 | 6 hours ago | Flat | 3 | 3 | 200 Sq. Yd. | Clifton, Karachi, Sindh | Clifton - Block 9, Clifton, Karachi, Sindh | Apartment for sale | Flooring,Electricity Backup,Broadband Internet... | https://www.zameen.com/Property/clifton_clifto... |
We may have other columns to drop later on, and we would add them to the .drop call. But this is all we are going to do with the price column for now. We have used method chaining to clean our data. We have yet to drop the null values from the price column.
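As a preview of that remaining step, here is one possible sketch using dropna with the subset argument. It runs on a made-up table; whether this is the right treatment for the real data (rather than, say, imputing) is an assumption left for later.

```python
import pandas as pd

# Made-up cleaned prices (in lakh) with one missing value.
toy = pd.DataFrame({
    "title": ["House A", "Flat B", "House C"],
    "price": [1150.0, None, 70.0],
})

# Keep only rows where price is present; other columns may still hold NaN.
cleaned = toy.dropna(subset=["price"]).reset_index(drop=True)
print(len(toy), "->", len(cleaned))  # 3 -> 2
```

In the chained function this would amount to one more line, .dropna(subset=["price"]), placed after the .drop call.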
This will be all for today. Thanks for reading.