Skip to content

shanto173/SQL-2024_laptop_dataset

Repository files navigation

Laptop Data Cleaning and Exploratory Data Analysis Project useing SQL

This project involves cleaning a dataset of laptops, which includes steps such as backing up the table, removing unnecessary columns, handling missing values, removing duplicates, and transforming columns into more useful formats for analysis.

Before Cleaning data

Before Cleaning Data

After Cleaning data

Before Cleaning Data

Table of Contents for Data Cleaning

  1. Step 1: Creating Backup
  2. Step 2: Checking the Number of Rows and Columns
  3. Step 3: Checking Memory Usage
  4. Step 4: Dropping Non-important Columns
  5. Step 5: Dropping Rows with Null Values
  6. Step 6: Dropping Duplicates
  7. Step 7: Cleaning Columns
  8. Step 8: Handling Screen Resolution Information
  9. Step 9: Cleaning and Organizing Memory Information
  10. Step 10: Final Step: View the Cleaned Laptop Table

Table of Contents for exploratory data analysis.

  1. Exploratory-Data-Analysis
  2. Introduction
  3. Data-Preview
  4. Univariate-Analysis-of-Numerical-Columns
  5. Bivariate-Data-Analysis(Numerical Vs Numerical)
  6. Bivariate-Data-Analysis(Categorical Vs Numerical)
  7. Missing Value Treatment (Imputing Missing Price)
  8. Feature_Engineering
  9. Encoding(One Hot Encoding)

SQL Code and Explanation

Step 1: Creating Backup

The first step is to create a backup of the laptop table.

  1. Create the table structure for backup:
CREATE TABLE laptop_backup LIKE laptop;
  1. Insert all the values into the backup table:
INSERT INTO laptop_backup 
SELECT * FROM laptop; 

Step 2: Checking the Number of Rows and Columns.

To verify the number of rows in the laptop table:

SELECT COUNT(*) FROM laptop;

Step 3: Checking Memory Usage

Check how much memory the data occupies:

SELECT * FROM information_schema.tables 
WHERE TABLE_SCHEMA = 'data_cleaning_project' AND table_name = 'laptop';

Convert the memory from bytes to kilobytes:

SELECT data_length / 1024 FROM information_schema.tables 
WHERE TABLE_SCHEMA = 'data_cleaning_project' AND table_name = 'laptop'; 

Step 4: Dropping Non-important Columns

The column Unnamed: 0 is irrelevant, so I dropped it:

ALTER TABLE laptop DROP COLUMN `Unnamed: 0`;

Step 5: Dropping Rows with Null Values

Identify and delete rows where all relevant columns are null:

WITH ind AS (SELECT * FROM laptop 
WHERE company IS NULL AND TypeName IS NULL AND Inches IS NULL
AND ScreenResolution IS NULL AND Cpu IS NULL AND Ram IS NULL
AND Memory IS NULL AND Gpu IS NULL AND OpSys IS NULL
AND Weight IS NULL AND Price IS NULL)
DELETE FROM laptop WHERE `index` IN (SELECT `index` FROM ind);

Step 6: Dropping Duplicates

Remove duplicate rows based on the main features:

WITH duplicates AS (
    SELECT Company, TypeName, Inches, ScreenResolution, Cpu, Ram, Memory, Gpu, OpSys, Weight, Price, MIN(`index`) AS duplicate_index
    FROM laptop 
    GROUP BY Company, TypeName, Inches, ScreenResolution, Cpu, Ram, Memory, Gpu, OpSys, Weight, Price)
DELETE FROM laptop WHERE `index` NOT IN (SELECT duplicate_index FROM duplicates);

Step 7: Cleaning Columns

  1. Convert the Inches column from text to decimal format in order to reduce file size:
     ALTER TABLE laptop MODIFY COLUMN Inches DECIMAL(10,1);
  2. Remove the "GB" from the RAM column with an empty string and convert the RAM column to integer:
     UPDATE laptop t1
     SET Ram = (
     SELECT REPLACE(t2.Ram, 'GB', '')
     FROM (SELECT * FROM laptop) t2
     WHERE t2.`index` = t1.`index`
     );
     ALTER TABLE laptop MODIFY COLUMN Ram INTEGER;
  3. Fix the Weight column by replacing "kg" and converting it:
  UPDATE laptop t1
  SET Weight = REPLACE(Weight, 'kg', '') 
  WHERE `index` = t1.`index`;
  1. Round the Price column to the nearest integer and update the value and convert the double column to an integer:

      UPDATE laptop t1
      SET Price = (
      SELECT ROUND(Price) 
      FROM (SELECT * FROM laptop) t2 
      WHERE t1.`index` = t2.`index`
    );
    ALTER TABLE laptop MODIFY COLUMN Price INTEGER;
  2. Clean the OpSys column by categorizing the operating systems:

         UPDATE laptop 
         SET OpSys = CASE
             WHEN OpSys LIKE '%mac%' THEN 'macOS'
             WHEN OpSys LIKE '%windows%' THEN 'windows'
             WHEN OpSys LIKE '%linux%' THEN 'Linux'
             WHEN OpSys LIKE '%No OS%' THEN 'NA'
             ELSE 'others'
         END;
  3. Step 8: GPU Column Modification First, create 2 columns for storing specific GPU information Extract gpu_brand and gpu_name from the Gpu column and drop the old column:

             ALTER TABLE laptop
             ADD COLUMN gpu_brand VARCHAR(255) AFTER Gpu,
             ADD COLUMN gpu_name VARCHAR(255) AFTER OpSys;
             
             UPDATE laptop t1
             SET gpu_brand = (
                 SELECT SUBSTRING_INDEX(Gpu, ' ', 1) 
                 FROM (SELECT * FROM laptop) t2 
                 WHERE t1.`index` = t2.`index`
             );
             
             UPDATE laptop t1
             SET gpu_name = (
                 SELECT REPLACE(Gpu, gpu_brand, '') 
                 FROM (SELECT * FROM laptop) t2 
                 WHERE t1.`index` = t2.`index`
             );
             
             ALTER TABLE laptop DROP COLUMN Gpu;
      
  4. Create three new columns to store the CPU brand, name, and speed.

    ALTER TABLE laptop
    ADD COLUMN cpu_brand VARCHAR(255) AFTER Cpu,
    ADD COLUMN cpu_name VARCHAR(255) AFTER cpu_brand,
    ADD COLUMN cpu_speed DECIMAL(10,1) AFTER Cpu_name;

7.1 Extracting the CPU Brand

SELECT Cpu, SUBSTRING_INDEX(Cpu, ' ', 1) FROM laptop;

7.2 Updating the cpu_brand Column The extracted CPU brand values are inserted into the cpu_brand column.

 UPDATE laptop t1
    SET cpu_brand = (
    SELECT SUBSTRING_INDEX(Cpu, ' ', 1)
    FROM (SELECT * FROM laptop) t2
    WHERE t1.index = t2.index
    );

7.3 Extracting CPU Name The CPU name is extracted by removing the brand from the original Cpu column and stored in the cpu_name column.

        UPDATE laptop t1
        SET cpu_name = (
        SELECT REPLACE(Cpu, cpu_brand, '')
        FROM (SELECT * FROM laptop) t2
        WHERE t1.index = t2.index
        );

7.4 Extracting CPU Speed This step extracts the CPU speed from the cpu_name and stores it in the cpu_speed column.

            UPDATE laptop t1
            SET cpu_speed = (
                SELECT REPLACE(t3.speed, 'GHz', '')
                FROM (
                SELECT cpu_name, SUBSTRING_INDEX(cpu_name, ' ', -1) AS 'speed'
                FROM (SELECT * FROM laptop) t2
                WHERE t1.index = t2.index
            ) t3
        );

7.5 Cleaning Up the cpu_name Column The CPU speed is removed from the cpu_name column.

            UPDATE laptop t1
            SET cpu_name = (
                SELECT SUBSTRING_INDEX(cpu_name, cpu_speed, 1)
                FROM (SELECT * FROM laptop) t2
                WHERE t1.index = t2.index
            );

7.6 Dropping the Original Cpu Column Finally, the original Cpu column is dropped from the table.

            ALTER TABLE laptop DROP COLUMN Cpu;
            SELECT * FROM laptop;

step-8-handling-screen-resolution-information

Step 1: Creating Columns for Screen Resolution Data Three new columns are added to store height, width, and whether the laptop has a touchscreen.

ALTER TABLE laptop
ADD COLUMN height INTEGER AFTER ScreenResolution,
ADD COLUMN width INTEGER AFTER height,
ADD COLUMN touchscreen INTEGER AFTER width;

Step 2: Extracting Height and Width The height and width are extracted from the ScreenResolution column.

    UPDATE laptop t1
SET height = (
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(ScreenResolution, ' ', -1), 'x', 1)
    FROM (SELECT * FROM laptop) t2
    WHERE t1.index = t2.index
);

UPDATE laptop t1
SET width = (
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(ScreenResolution, ' ', -1), 'x', -1)
    FROM (SELECT * FROM laptop) t2
    WHERE t1.index = t2.index
);

Step 3: Identifying Touchscreen Laptops A new touchscreen column is updated based on whether the laptop has a touchscreen feature.

UPDATE laptop t1
SET touchscreen = (
    SELECT CASE
        WHEN ScreenResolution LIKE '%Touchscreen%' THEN 1
        ELSE 0
    END AS touch
    FROM (SELECT * FROM laptop) t2
    WHERE t1.index = t2.index
);

Step 4: Dropping the ScreenResolution Column The original ScreenResolution column is dropped after the data has been extracted.

ALTER TABLE laptop DROP COLUMN `ScreenResolution`;
SELECT * FROM laptop;
  1. Renaming CPU Names for Easier Understanding This section standardizes the CPU names in the cpu_name column for easier understanding and analysis.
UPDATE laptop t1
SET cpu_name = (
    CASE
        WHEN cpu_name LIKE '%Core i5%' THEN 'Core i5'
        WHEN cpu_name LIKE '%Core i7%' THEN 'Core i7'
        WHEN cpu_name LIKE '%A9%' THEN 'A9'
        -- Add other conditions as needed
        ELSE 'others'
    END
);
SELECT * FROM laptop;

step-9-cleaning-and-organizing-memory-information

Step 1: Creating New Columns for Memory Information Three new columns are created to store different types of memory information.

ALTER TABLE laptop
ADD COLUMN secondary_storages VARCHAR(255) AFTER primary_storage;

Step 2: Identifying Memory Types A temporary column is used to identify the type of memory.

    SELECT Memory,
CASE
    WHEN Memory LIKE '%SSD%HDD%' THEN 'hybrid'
    WHEN Memory LIKE '%SSD%' THEN 'SSD'
    WHEN Memory LIKE '%HDD%' THEN 'HDD'
    WHEN Memory LIKE '%Flash%' THEN 'Flash drive'
END AS new
FROM laptop;

Step 3: Updating the Primary and Secondary Storage The primary and secondary storage values are extracted and updated accordingly.

UPDATE laptop t1
SET primary_storage = (SUBSTRING_INDEX(Memory, ' ', 1));

UPDATE laptop t1
SET secondary_storages = (
    SELECT CASE
        WHEN t1.Memory = sec THEN NULL
        ELSE sec
    END AS new_sec
    FROM (
        SELECT Memory, SUBSTRING_INDEX(Memory, ' +', -1) sec
        FROM (SELECT * FROM laptop) t2
        WHERE t1.index = t2.index
    ) t3
);

Step 4: Further Cleaning of Storage Data The storage data is cleaned and standardized.

UPDATE laptop
SET secondary_storages = REPLACE(REPLACE(TRIM(REPLACE(secondary_storage, 'HDD', '')), 'SSD', ''), '1.0TB', '1TB');

UPDATE laptop t1
SET primary_storage = (
    CASE
        WHEN primary_storage = '1TB' THEN '1024GB'
        WHEN primary_storage = '2TB' THEN '2048GB'
        ELSE primary_storage
    END
);

Step 5: Converting Storage Columns to Integer The primary_storage and secondary_storages columns are converted to integer types for efficient data handling.

    ALTER TABLE laptop 
    MODIFY COLUMN primary_storage INTEGER;
    
    ALTER TABLE laptop
    MODIFY COLUMN secondary_storages INTEGER;

step-10-final-step-view-the-cleaned-laptop-table

SELECT * FROM laptop;

Exploratory Data Analysis

Introduction

In this analysis, I perform Exploratory Data Analysis (EDA) to gain insights into the dataset and prepare it for further modeling. The EDA process is divided into several key steps:

  1. Univariate Data Analysis:

    • I start by analyzing individual variables to understand their distributions and characteristics. This includes examining numerical columns and identifying basic statistical measures.
  2. Bivariate Data Analysis:

    • Then explore the relationships between pairs of variables. This involves:
    • Numerical-Numerical Analysis: Investigating the correlation and patterns between numerical variables.
    • Categorical-Categorical Analysis: Analyzing the interactions between categorical variables.
    • Categorical to Numerical Analysis: Examining how categorical variables impact numerical variables.
  3. Data Cleaning and Preparation:

    • Based on insights gained from the univariate and bivariate analyses, I address missing values by imputing them appropriately.
    • Outliers are identified and handled to ensure they do not skew the analysis.
  4. Feature Engineering:

    • New features are created to enhance the dataset. For example, I may calculate metrics like Pixel Per Inch (PPI) using existing columns such as height and width, which can serve as valuable indicators of laptop price.
  5. Categorical Data Encoding:

    • Finally, categorical data is transformed into a numerical format using one-hot encoding to prepare it for machine learning models.

This structured approach ensures a thorough understanding of the dataset and prepares it for accurate and effective analysis.

1-Data-Preview

This section provides an overview of the data in the laptop table, including a preview of the top and bottom rows, as well as a random sample.

Head

SELECT * FROM laptop ORDER BY `index` LIMIT 5;

Top 5 Rows

Retrieves the first 5 rows from the laptop table, ordered by the index column. Useful for quickly viewing the initial entries in the dataset.

Tail

SELECT * FROM laptop ORDER BY `index` DESC LIMIT 5;

Top 5 Rows

Retrieves the last 5 rows from the laptop table, ordered by the index column in descending order. for understanding the data, and quickly viewing the final entries in the dataset.

Sample

SELECT * FROM laptop ORDER BY RAND() LIMIT 5;

Results: Top 5 Rows

Retrieves 5 random rows from the laptop table for getting a random sample of data to examine various parts of the dataset.

2-Univariate-Analysis-of-Numerical-Columns

Numerical-Columns(Price) This section focuses on the price column, which is a numerical column of interest. It includes steps for calculating key statistical measures and identifying outliers.

Creating a New Index for Percentile Calculation

WITH temp AS (
  SELECT `index`, price, ROW_NUMBER() OVER(ORDER BY price) AS new_row
  FROM laptop
)
UPDATE laptop t1
JOIN temp t2 ON t1.index = t2.index
SET t1.new_index = t2.new_row;

Adds a new column new_index to the laptop table, which contains the row number ordered by price. This is used to facilitate percentile calculations because there are no the function that I can use for percentile.

Statistical Summary and Quantiles

    SELECT 
  COUNT(price) AS count,
  MIN(price) AS min_price,
  MAX(price) AS max_price,
  AVG(price) AS average_price,
  STD(price) AS std_dev_price,
  (SELECT price FROM laptop WHERE new_index IN (FLOOR((25*(SELECT COUNT(*) FROM laptop)+1)/100))) AS Q1,
  (SELECT price FROM laptop WHERE new_index IN (FLOOR((50*(SELECT COUNT(*) FROM laptop)+1)/100))) AS Median,
  (SELECT price FROM laptop WHERE new_index IN (FLOOR((75*(SELECT COUNT(*) FROM laptop)+1)/100))) AS Q3
FROM laptop;

Results: 8 number Summary

Provides a statistical summary of the price column, including count, minimum, maximum, average, and standard deviation. Additionally, calculates the 1st quartile (Q1), median, and 3rd quartile (Q3) using the new_index column to determine the appropriate percentile values.

observation: -- **Here count value is the same as the total number of rows so there are no null values are present**
             -- **The minimum laptop price is 9271. which may be an outlier because a laptop can't be that cheap**
             -- **The maximum laptop price is over 3 lakh which is quite expensive for a laptop**
             -- **avg price is 60k and the median is 52k which indicates that there are some outliers because of this the data is skewed**
             -- **standard deviation is 37k which is a lot, which means data is not that centered, data is quite scattered**
             -- **25 percentile value is 32k which indicates that 25 percent of laptop price price is less than 32k**
             -- **50 percentile value is 52.5k which indicates that 50 percent of the laptop price is less than 52.5k**
             -- **75 percentile value is 79.5k which indicates that 75 percent of the laptop price is less than 79.5k**

3. Missing Value Detection

This section identifies rows where the price column has missing values (i.e., null entries).

Finding Missing Price Values

SELECT * FROM laptop WHERE price IS NULL;

Results: Null values

Retrieves all rows from the laptop table where the price column is null. This helps in identifying any missing values that need to be imputed or handled before further analysis.

4. Outlier Detection

Outliers can significantly affect the results of the analysis, so this section focuses on identifying them based on the interquartile range (IQR) method.

SELECT * FROM laptop;

SELECT * 
FROM (
  SELECT *,
    (SELECT price FROM laptop WHERE new_index IN (FLOOR((25*(SELECT COUNT(*) FROM laptop)+1)/100))) AS Q1,
    (SELECT price FROM laptop WHERE new_index IN (FLOOR((75*(SELECT COUNT(*) FROM laptop)+1)/100))) AS Q3
  FROM laptop
) t
WHERE t.price < (Q1 - (1.5 * (t.Q3 - t.Q1))) OR 
      t.price > (Q1 + (1.5 * (t.Q3 - t.Q1)));

Results: Outliers Detection

Description: The first query retrieves all rows from the laptop table for a general overview. The second query calculates the 1st quartile (Q1) and 3rd quartile (Q3) and identifies any rows where the price is an outlier. Outliers are defined as values below Q1 - 1.5*(Q3 - Q1) or above Q1 + 1.5*(Q3 - Q1).

Results: Justifying outliers

--Out of 1244 rows there are 151 outliers according to IQR but they are not outliers, if we consider the specification.

5. Horizontal Histogram of Laptop Prices

Overview This SQL query generates a horizontal histogram of laptop prices by categorizing the prices into predefined ranges (or "buckets") and displaying the frequency of laptops in each range using asterisks (*). This visual representation helps to quickly understand the distribution of laptop prices.

SELECT t.bucket, 
       COUNT(price), 
       REPEAT('*', COUNT(price)/5) AS histogram
FROM (
  SELECT price,
    CASE 
      WHEN price BETWEEN 0 AND 25000 THEN '0-25k'
      WHEN price BETWEEN 25001 AND 50000 THEN '25k-50k'
      WHEN price BETWEEN 50001 AND 75000 THEN '50k-75k'
      WHEN price BETWEEN 75001 AND 100000 THEN '75k-100k'
      ELSE '>100k'
    END AS bucket
  FROM laptop
) t
GROUP BY t.bucket;

Results: Laptop Histograme Bucket Creation: The CASE statement groups the laptop prices into five predefined ranges Count of Laptops: The COUNT(price) function counts how many laptops fall into each price range (bucket). Horizontal Histogram.

The REPEAT('', COUNT(price)/5) function generates a string of asterisks () representing the count of laptops in each bucket. The division by 5 reduces the number of asterisks to make the output more concise and readable. Each asterisk represents approximately five laptops, but this can be adjusted by modifying the divisor in COUNT(price)/X.

Observation: - most of the laptop price range between 25k to 50k and 50k to 75k 
             - only 300 laptops price range between 75k to over 100k because of their brand name and specs their prices are quite high.

6.Data Analysis on weight column

The following SQL query provides a basic statistical summary of the Weight column, including the count, minimum, maximum, average, and standard deviation.

SELECT COUNT(Weight), 
       MIN(Weight), 
       MAX(Weight), 
       AVG(Weight), 
       STD(Weight)
FROM laptop;

Results: weight 5 Number Summary

observation:- By seeing the count value I can say there are no null values present 
            - By seeing the price and max price there are some outliers present in the data
            - By seeing Std, the std less I can say the weight data is mostly centered

6.1 Outlier Detection of weight

Weight of 0.0002 kg: This weight is considered an outlier since it is far below a reasonable laptop weight. The following query is used to identify such records

SELECT * FROM laptop 
WHERE Weight = 0.0002;

6.2 Weights Above 5 kg:

Laptops weighing more than 5 kg are likely outliers since most laptops are designed to be portable, with weights generally between 0.5 and 3 kg. This query finds such outliers:

SELECT * FROM laptop 
WHERE Weight > 5;

6.3 Outliers removal

Once outliers are detected, they can be removed from the dataset using the following DELETE queries.

DELETE FROM laptop WHERE `index` = 349;  -- Remove row with weight 0.0002 kg
DELETE FROM laptop WHERE `index` IN (326, 587);  -- Remove rows with weight > 5 kg

6.4Visualizing the Weight Distribution

After removing the outliers, I visualize the distribution of laptop weights by plotting a horizontal histogram using SQL. The histogram groups laptops into buckets based on their weight, with each asterisk (*) representing a certain number of laptops in each bucket.

SELECT bucket, 
       COUNT(bucket), 
       REPEAT('*', COUNT(bucket)/6) AS histogram
FROM (
  SELECT Weight,
    CASE
      WHEN Weight BETWEEN 0.5 AND 3 THEN '0.5kg-3kg'
      WHEN Weight BETWEEN 3.1 AND 5 THEN '3kg-5kg'
      ELSE '>5kg'
    END AS bucket
  FROM laptop
) t 
GROUP BY bucket;

Bucket Creation: The CASE statement groups the laptop weights into the following categories: 0.5kg-3kg: Lightweight laptops. 3kg-5kg: Heavier laptops.

5kg: Unusually heavy Laptops (if any remain after outlier removal).

Results: Weight_histograme

7.CPU Speed Analysis in SQL

The following SQL query provides a statistical overview of the CPU speeds in the dataset, including the total number of records, the slowest and fastest CPU speeds, and the average and standard deviation of the values.

SELECT COUNT(cpu_speed), 
       MIN(cpu_speed), 
       MAX(cpu_speed), 
       AVG(cpu_speed), 
       STD(cpu_speed)
FROM laptop;

Results: Cpu_5_Number_summary

observation: - AVG(cpu_speed): The average CPU speed of laptops is approximately 2.30 GHz, reflecting a typical mid-range performance level for laptops.
             - STD(cpu_speed): The standard deviation is 0.5049 GHz, indicating that most CPU speeds fall within half a GHz from the average.
             - Modern laptops typically have CPU speeds ranging from 1.5 GHz to 3.5 GHz or higher. A speed of 0.9 GHz seems unusually low, which might indicate an                      - outlier. It could represent very old or specialized low-power processors (like those in ultra-portable devices) 

1.Categorical Data Analysis (Company)

The SQL query below retrieves the number of laptops for each company in the dataset.
It helps identify which brands have the most and least representation in the dataset.

SQL Code:

SELECT company, COUNT(company) 
FROM laptop 
GROUP BY company;

Result:

Company_pie_chart

Insights: Top 3 Brands is 
            - Lenovo: 281 laptops
            - Dell: 280 laptops
            - HP: 260 laptops
            These brands dominate the dataset, indicating their strong presence in the laptop market.
          
          Smaller Representation:
            Asus, Acer, Apple, and Fujitsu have very limited representation.

2.Analyzing Touchscreen Feature

The SQL query below retrieves the count and percentage of laptops that are equipped with a touchscreen versus those that are not. The percentage of each category is calculated based on the total number of laptops in the dataset.

    SELECT 
  touchscreen,
  COUNT(touchscreen) AS touchscreen_count,
  COUNT(touchscreen) / (SELECT COUNT(*) FROM laptop) AS touchscreen_percentage
FROM laptop
GROUP BY touchscreen;

Result:

Touchscreen percentage

Insights: - Non-TouchScreen Laptops:
             - The majority of laptops in the dataset, approximately 85.41%, do not have a touchscreen.
          - TouchScreen Laptops:
             - About 14.59% of the laptops are equipped with a touchscreen, indicating that touchscreens are less common.

3. Analyzing CPU Brand

The SQL query below retrieves the count and percentage of laptops by CPU brand. It shows how common each CPU brand is within the dataset by calculating the percentage share of each.

SELECT 
  cpu_brand,
  COUNT(cpu_brand) AS cpu_brand_count,
  COUNT(cpu_brand) / (SELECT COUNT(*) FROM laptop) AS cpu_brand_percentage
FROM laptop
GROUP BY cpu_brand;

Result:

Cpu_brand_pie_char

Insights:
Intel dominates the dataset with 95.08% of laptops using Intel CPUs, making it by far the most common CPU brand.
AMD accounts for 4.83% of the laptops, making it the second most common but significantly less popular than Intel.
Samsung represents just 0.08% of the dataset, with only one laptop in the dataset using a Samsung CPU.

4. Analyzing CPU Brand

The SQL query below retrieves the count and percentage of laptops by CPU brand. It shows how common each CPU brand is within the dataset by calculating the percentage share of each.

SELECT 
  cpu_brand,
  COUNT(cpu_brand) AS cpu_brand_count,
  COUNT(cpu_brand) / (SELECT COUNT(*) FROM laptop) AS cpu_brand_percentage
FROM laptop
GROUP BY cpu_brand;

Result:

Cpu_brand_pie_char

Insights:
Intel dominates the dataset with 95.08% of laptops using Intel CPUs, making it by far the most common CPU brand.
AMD accounts for 4.83% of the laptops, making it the second most common but significantly less popular than Intel.
Samsung represents just 0.08% of the dataset, with only one laptop in the dataset using a Samsung CPU.

3-Bivariate-Data-Analysis(Numerical-Column)

3.1 Bivariate Analysis (Company vs. Touchscreen)

The SQL query below performs a bivariate analysis by grouping the laptops based on Company and Touchscreen status. It calculates the number of TouchScreen and Non-TouchScreen laptops for each company.

SELECT 
  Company,
  SUM(CASE WHEN touchscreen = 1 THEN 1 ELSE 0 END) AS 'TouchScreen_Yes',
  SUM(CASE WHEN touchscreen = 0 THEN 1 ELSE 0 END) AS 'TouchScreen_No'
FROM laptop
GROUP BY Company, touchscreen;

Result:

Touchscreen_yes_Vs_no

Dell has the highest number of TouchScreen laptops (61) compared to other brands, showing a strong presence of touchscreen technology in its models.
HP, Acer, Asus, and Lenovo also offer touchscreen models, but they predominantly manufacture Non-TouchScreen laptops.
Some companies like Apple, Chuwi, MSI, Huawei, and Vero don’t have any TouchScreen models in the dataset.

3.2 Bivariate Analysis (Company vs. CPU Brand)

The SQL query below calculates the number of laptops from each company that use Intel, AMD, and Samsung CPUs.

SELECT 
  company,
  SUM(CASE WHEN cpu_brand = 'Intel' THEN 1 ELSE 0 END) AS 'Intel',
  SUM(CASE WHEN cpu_brand = 'AMD' THEN 1 ELSE 0 END) AS 'AMD',
  SUM(CASE WHEN cpu_brand = 'Samsung' THEN 1 ELSE 0 END) AS 'Samsung'
FROM laptop
GROUP BY company;

Result:

Company_brand_useage

Insights:
Dell, HP, Lenovo, and Asus primarily use Intel CPUs in their laptops, with very few models using AMD.
HP has the highest number of AMD laptops (23), followed by Lenovo (16).
Only Samsung uses its own CPU in one of its laptops, while other companies exclusively use Intel or AMD.
Brands like Apple, MSI, and Chuwi solely use Intel CPUs.

3.3 Bivariate Analysis (Company vs. Average Price)(Category Vs numerical)

The SQL query below calculates the average laptop price for each Company.

SELECT 
  Company, 
  ROUND(AVG(Price)) AS Avg_Price
FROM laptop
GROUP BY Company;

Result:

Company_Avg_Price

Insights:
Razer has the highest average laptop price (178,283), followed by LG (111,835), indicating these brands focus on premium, high-end laptops.
Apple, Microsoft, and MSI also have high average prices, suggesting they cater to a higher-tier market.
Chuwi, Vero, and Mediacom have the lowest average prices, indicating these brands offer budget laptops.
HP, Acer, and Lenovo are positioned as mid-range brands, with average prices ranging between 33,796 and 58,377.

3.3 Bivariate Analysis (Company vs. Price Range)(Category Vs numerical)

The SQL query calculates the minimum and maximum laptop prices for each company.

SELECT 
  Company, 
  MIN(Price) AS Min_Price, 
  MAX(Price) AS Max_Price
FROM laptop
GROUP BY Company;

Result:

Company_Price_Range

   Insights:
1. Razer has the widest price range, with laptops ranging from 54,825 to 324,955, indicating a broad product line that includes both premium and ultra-premium models.
2. HP and Lenovo also show a wide price range, with HP ranging from 11,136 to 233,846 and Lenovo from 12,201 to 261,019.This suggests they offer both budget and high-end laptops.
3. Vero and Chuwi are focused on budget-friendly laptops, with their maximum prices not exceeding 23,923.
4. Apple and MSI tend to focus on mid-to-premium laptops, with Apple's prices ranging from 47,896 to 152,274, while MSI’s range is from 44,702 to 149,131.
5. Google and Huawei offer products in a narrower range, primarily targeting the higher end of the market.

3. Missing Value Treatment

3.1 Missing Value Treatment (delet all those missing rows)

DELETE FROM laptop
WHERE price IS NULL;

3.2 Missing Value Treatment(Imputing Missing Price by avg price)

The SQL query below updates the price column, filling any null values with the average price from the dataset.

UPDATE laptop
SET price = (SELECT AVG(price) FROM laptop)
WHERE price IS NULL;
Reasoning:
Imputation using mean is a common and simple technique to handle missing numerical data.
In this case, I chose to use the average price because it provides a reasonable estimate that minimizes the impact of missing values on        further analysis.

3.3 Missing Value Treatment (Imputing Missing Price by Company)

The SQL query below updates the price column by setting null values to the average price for each company, ensuring that the imputed values are more specific and relevant to the brand.

UPDATE laptop t1
SET price = (
    SELECT AVG(price) 
    FROM laptop t2 
    WHERE t1.company = t2.company
)
WHERE t1.price IS NULL;
Explanation:
The query calculates the average price for each company by grouping laptops by the company column.
If a laptop's price is missing, the query updates that row with the average price of the laptops from the same company.

3.4 Missing Value Treatment Based on Company and CPU Name

The fourth approach for treating missing values involves filling the price where it is NULL by calculating the average price for laptops that belong to the same Company and have the same cpu_name. This method ensures that we provide a meaningful and relevant estimate for missing prices based on similarities in brand and processor.

UPDATE laptop t1
SET price = (
    SELECT AVG(t2.price)
    FROM (SELECT * FROM laptop) t2
    WHERE t2.Company = t1.Company
    AND t2.cpu_name = t1.cpu_name
)
WHERE price IS NULL;
Explanation:
This method ensures that missing prices are filled with relevant averages based on both the laptop’s brand (Company) and its processor             (cpu_name). This provides more precision compared to using a simple overall average.

4 Feature Enginerring

After performing EDA now i have better knowledge about the data set but there are some column that are not usefull for analysis such as Height and width and inches, so with this three i can now create new column PPI which is much more helpfull for analysis.

PPI formula is :

PPI_formula

4.1 Adding a New Column for PPI Calculation

The first operation is to add a new column named ppi to the laptop table. This column will store the calculated PPI for each laptop.

ALTER TABLE laptop
ADD COLUMN ppi INTEGER;

4.1.2 Updating the PPI Column with Calculated Values

UPDATE laptop t1
SET ppi = (
  SELECT ROUND(SQRT((width * width) + (height * height)) / Inches)
  FROM (SELECT * FROM laptop) t2
  WHERE t1.index = t2.index
);

4.2 Updating the Screen Size Column with Categorized Values

There is another column name inches, because of there are so many value we can create a categories from that when inches <= 14 then 'Small_size' when inches >14.1 and Inches < 16 then 'medium' when inches >= 16 then 'large'

4.2.1 Adding a New Column for Screen Size Classification

We first create a new column, screen_size, in the laptop table. This column will hold the categorized values corresponding to the size of the screen.

ALTER TABLE laptop
ADD COLUMN screen_size VARCHAR(255) AFTER inches;

4.2.2 Updating the Screen Size Column with Categorized Values

WITH temp AS (
  SELECT inches,
    CASE 
      WHEN inches <= 14 THEN 'Small_size'
      WHEN inches > 14.1 AND inches < 16 THEN 'medium'
      WHEN inches >= 16 THEN 'large'
    END AS 'screen_size'
  FROM laptop
)
UPDATE laptop t1
JOIN temp t2 ON t1.inches = t2.inches
SET t1.screen_size = t2.screen_size;

This query classifies the inches values into categories and updates the screen_size column for each laptop in the dataset accordingly.

5 Encoding (One-Hot Encoding)

One-hot encoding is used to convert the gpu_brand categorical column into binary features.

-- One-Hot Encoding for GPU brand
SELECT gpu_brand,
    CASE WHEN gpu_brand = 'Inter' THEN 1 ELSE 0 END AS 'inter',
    CASE WHEN gpu_brand = 'AMD' THEN 1 ELSE 0 END AS 'amd', 
    CASE WHEN gpu_brand = 'nvidia' THEN 1 ELSE 0 END AS 'nvidia',
    CASE WHEN gpu_brand = 'arm' THEN 1 ELSE 0 END AS 'arm'
FROM laptop;

One_hot_encoding

About

No description or website provided.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published