Skip to content

FabianPedreros/SQL_data_exploration_Covid19

Repository files navigation

SQL_data_exploration_Covid19

Data exploration with SQL to a Covid 19 dataset

Here you can find the SQL file

Objective

Make some basic data exploration on SQL to understand a given data set.

Data set and guide

image

Data preparation and load

  1. We created two different data sets from the original data, one that containts the deaths data and the other containts the vaccination data.

image

  1. We loaded the data set on Microsoft SQL Server in two different tables.

image

SQL queries

Query all the CovidDeaths data order by location and then for date.

SELECT * FROM  Covid19..CovidDeaths
ORDER BY 3, 4;

Imagen1

Query the columns we are interested

SELECT location, date, total_cases, new_cases, total_deaths, population
FROM Covid19..CovidDeaths
WHERE total_cases IS NOT NULL
ORDER BY 1, 2;

image

Calculating the deaths_percentage at Colombia, we have to the 2022-03-02 a total of 6.067.023 cases with a deaths percentage of 2.29%

SELECT location, date, total_cases, total_deaths, (total_deaths/total_cases)*100 AS deaths_percentage
FROM Covid19..CovidDeaths
WHERE total_cases IS NOT NULL AND location LIKE 'Colombia'
ORDER BY 1, 2

image

Colombia is at Rank 47 of countries with more deaths percentage.

SELECT location, date, total_cases, total_deaths, (total_deaths/total_cases)*100 AS deaths_percentage
FROM Covid19..CovidDeaths
WHERE total_cases IS NOT NULL AND date LIKE '2022-03-02'
ORDER BY 5 DESC

image

image

Almost the 11.8% of the population in Colombia have had Covid

SELECT location, date, total_cases, population, (total_cases/population)*100 AS cases_percentage
FROM Covid19..CovidDeaths
WHERE location LIKE 'Colombia'
ORDER BY 1, 2;

image

The top five countries by cases percentage for the total population are countries with low population.

SELECT TOP 20 location, date, total_cases, population, (total_cases/population)*100 AS cases_percentage
FROM Covid19..CovidDeaths
WHERE date LIKE '2022-03-02'
ORDER BY 5 DESC;

image

Group by for looking the cases percentages

SELECT location, population, MAX(total_cases) as HighestInfectionCount, MAX(total_cases/population)*100 as PercentagePopulationInfected
FROM Covid19..CovidDeaths GROUP BY location, population
ORDER BY PercentagePopulationInfected DESC

image

Showing the countries with highest death count by population

SELECT location, population, MAX(total_deaths) as HighestDeathsCount, MAX(total_deaths/population)*100 as PercentagePopulationDeath
FROM Covid19..CovidDeaths GROUP BY location, population
ORDER BY PercentagePopulationDeath DESC

Here we can see that the top countries per percentage deaths by population are south European countries, and the highest rate is for Peru.

image

Quering the data we can see an error, the total_deaths atrribute has been define as Varchar, so we can use CAST function to used it as an integer.

SELECT location, MAX(CAST(total_deaths AS INTEGER)) as DeathsCount
FROM Covid19..CovidDeaths GROUP BY location
ORDER BY DeathsCount DESC

image

Querying the data we can see an error, we have aggrupations for continents, region and income. So we have to filter the data. All this aggrupations does not have a value in continent.

SELECT location, MAX(CAST(total_deaths AS INTEGER)) as DeathsCount
FROM Covid19..CovidDeaths 
WHERE continent IS NOT NULL
GROUP BY location
ORDER BY DeathsCount DESC

So the top three countries by deaths are USA, Brazil and India.

image

Breaking Down the deaths count by aggrupations, we can see the total deaths on the world and Europe, as continent, has the highest quantity.

SELECT location, MAX(CAST(total_deaths AS INTEGER)) as DeathsCount
FROM Covid19..CovidDeaths 
WHERE continent IS NULL
GROUP BY location
ORDER BY DeathsCount DESC

image

Now we want to know the total cases by date

SELECT date, SUM(new_cases) as Total_cases
FROM Covid19..CovidDeaths
WHERE continent IS NOT NULL
GROUP BY date
ORDER BY date

image

Now we calculate the total deaths percentage by date

SELECT date, SUM(new_cases) as Total_cases, SUM(CAST(new_deaths AS int)) as Total_deaths, 100*SUM(CAST(new_deaths AS int))/SUM(new_cases) AS TotalDeathsPercentage
FROM Covid19..CovidDeaths
WHERE continent IS NOT NULL
GROUP BY date
ORDER BY date

image

The total deaths percentage is 1.35% to the date.

SELECT SUM(new_cases) as Total_cases, SUM(CAST(new_deaths AS int)) as Total_deaths, 100*SUM(CAST(new_deaths AS int))/SUM(new_cases) AS TotalDeathsPercentage
FROM Covid19..CovidDeaths
WHERE continent IS NOT NULL

image

Join the two tables to see the new vaccinations per day

SELECT DEA.date, DEA.continent, DEA.location,DEA.population, DEA.total_deaths, VAC.new_vaccinations FROM
Covid19..CovidDeaths AS DEA JOIN Covid19..CovidVaccinations AS VAC
ON DEA.date = VAC.date AND
DEA.location = VAC.location
WHERE DEA.continent IS NOT NULL
ORDER BY DEA.location, DEA.date

image

Looking at the cumulative quantity new vaccinations per day in Colombia

SELECT DEA.date, DEA.continent, DEA.location,DEA.population, DEA.total_deaths, VAC.new_vaccinations, 
SUM(CONVERT(bigint, VAC.new_vaccinations)) OVER (PARTITION BY DEA.location ORDER BY DEA.location, DEA.date) AS CumulativeVaccinations
FROM Covid19..CovidDeaths AS DEA JOIN Covid19..CovidVaccinations AS VAC
ON DEA.date = VAC.date AND
DEA.location = VAC.location
WHERE DEA.continent IS NOT NULL AND DEA.location LIKE 'Colombia'
ORDER BY DEA.location, DEA.date

image

Calculation of the accumulated percentage of vaccinations

WITH PopvsVac (date, continent, location, population, total_deaths, vaccinations, CumulativeVaccinations)
AS
(
SELECT DEA.date, DEA.continent, DEA.location,DEA.population, DEA.total_deaths, VAC.new_vaccinations, 
SUM(CONVERT(bigint, VAC.new_vaccinations)) OVER (PARTITION BY DEA.location ORDER BY DEA.location, DEA.date) AS CumulativeVaccinations
FROM Covid19..CovidDeaths AS DEA JOIN Covid19..CovidVaccinations AS VAC
ON DEA.date = VAC.date AND
DEA.location = VAC.location
WHERE DEA.continent IS NOT NULL AND DEA.location LIKE 'Colombia'
)
SELECT *, (CumulativeVaccinations/population)*100 AS CumulativeVaccinationsPercentage
FROM PopvsVac

image

Calculation of the accumulated percentage of vaccinations with a temp table

DROP TABLE IF exists #PercentPopulationVaccinated
CREATE TABLE #PercentPopulationVaccinated

(
Date date,
Continent NVARCHAR(255),
Location NVARCHAR(255),
Population NUMERIC,
New_vaccinations NUMERIC,
CumulativeVaccinations BIGINT
)

INSERT INTO #PercentPopulationVaccinated

SELECT DEA.date, DEA.continent, DEA.location, DEA.population, VAC.new_vaccinations, 
SUM(CONVERT(bigint, VAC.new_vaccinations)) OVER (PARTITION BY DEA.location ORDER BY DEA.location, DEA.date) AS CumulativeVaccinations
FROM Covid19..CovidDeaths AS DEA JOIN Covid19..CovidVaccinations AS VAC
ON DEA.date = VAC.date AND
DEA.location = VAC.location
WHERE DEA.continent IS NOT NULL AND DEA.location LIKE 'Colombia'

SELECT *, (CumulativeVaccinations/Population)*100 AS CumulativeVaccinationsPercentage
FROM #PercentPopulationVaccinated;

image

Creating a view to be used in data visualization

DROP VIEW IF EXISTS PercentPopulationVaccinated

CREATE VIEW PercentPopulationVaccinated AS

    SELECT DEA.date, DEA.continent, DEA.location, DEA.population, VAC.new_vaccinations, 
    SUM(CONVERT(bigint, VAC.new_vaccinations)) OVER (PARTITION BY DEA.location ORDER BY DEA.location, DEA.date) AS CumulativeVaccinations
    FROM Covid19..CovidDeaths AS DEA JOIN Covid19..CovidVaccinations AS VAC
    ON DEA.date = VAC.date AND
    DEA.location = VAC.location
    WHERE DEA.continent IS NOT NULL AND DEA.location LIKE 'Colombia'

image

Views creation for data analysis!

DROP VIEW IF EXISTS COVID_DB
CREATE VIEW dbo.COVID_DB AS
    SELECT DEA.continent, DEA.location, DEA.date, DEA.population, DEA.total_cases, DEA.new_cases, DEA.total_deaths, DEA.new_deaths,
    VAC.people_vaccinated, VAC.new_vaccinations, VAC.people_fully_vaccinated
    FROM Covid19..CovidDeaths AS DEA JOIN Covid19..CovidVaccinations AS VAC
    ON DEA.date = VAC.date AND
    DEA.location = VAC.location

Querying COVID_DB view

SELECT * FROM COVID_DB
ORDER BY location, date

image

-- View with only countries data, cleaning aggrupations

DROP VIEW IF EXISTS COVID_DB_COUNTRIES

CREATE VIEW COVID_DB_COUNTRIES AS
    SELECT * FROM COVID_DB
    WHERE continent IS NOT NULL

Querying COVID_DB_COUNTRIES view

SELECT * FROM COVID_DB_COUNTRIES
ORDER BY location, date

image

View with data for income aggrupations

DROP VIEW IF EXISTS COVID_DB_INCOME

CREATE VIEW COVID_DB_INCOME AS
    SELECT * FROM COVID_DB
    WHERE continent IS NULL and location LIKE '%income'

Querying COVID_DB_INCOME view

SELECT * FROM COVID_DB_INCOME
ORDER BY location, date

image

View with data for continents

DROP VIEW IF EXISTS COVID_DB_CONTINENTS

CREATE VIEW COVID_DB_CONTINENTS AS
    SELECT * FROM COVID_DB
    WHERE continent IS NULL and location IN ('Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America')

Querying COVID_DB_CONTINENTS view

SELECT * FROM COVID_DB_CONTINENTS
ORDER BY location, date

image

View with data for all the world

DROP VIEW IF EXISTS COVID_DB_WORLD

CREATE VIEW COVID_DB_WORLD AS
    SELECT * FROM COVID_DB
    WHERE continent IS NULL and location LIKE 'WORLD'

Querying COVID_DB_WORLD view

SELECT * FROM COVID_DB_WORLD
ORDER BY location, date

image

Some analysis by country

Top five countries by people fully vaccinated

SELECT TOP 5 location, MAX(CAST(people_fully_vaccinated AS INT)) AS Fully_vaccinated
FROM COVID_DB_COUNTRIES
GROUP BY location
ORDER BY Fully_vaccinated DESC

image

Top five countries by people fully vaccinated over population

SELECT TOP 5 location, MAX(CAST(people_fully_vaccinated AS INT)/population)*100 AS Percentage_Fully_vaccinated
FROM COVID_DB_COUNTRIES
GROUP BY location
ORDER BY Percentage_Fully_vaccinated DESC

image

Top five countries by less people fully vaccinated NOT NULL

SELECT TOP 5 location, MAX(CAST(people_fully_vaccinated AS INT)/population)*100 AS Percentage_Fully_vaccinated
FROM COVID_DB_COUNTRIES
GROUP BY location
ORDER BY Percentage_Fully_vaccinated DESC

image

Top five countries by less people fully vaccinated over population NOT NULL

SELECT TOP 5 location, MAX(CAST(people_fully_vaccinated AS INT)/population)*100 AS Percentage_Fully_vaccinated
FROM COVID_DB_COUNTRIES
GROUP BY location
HAVING MAX(CAST(people_fully_vaccinated AS INT)/population)*100 > 0
ORDER BY Percentage_Fully_vaccinated

image

Some analysis by income

SELECT location as income, SUM(CAST (new_cases AS BIGINT)) AS cases, SUM(CAST(new_deaths AS BIGINT)) AS deaths, SUM(CAST(new_vaccinations AS BIGINT)) as vaccinations
FROM COVID_DB_INCOME
GROUP BY location
ORDER BY cases DESC

image

About

Data exploration with SQL to a Covid 19 dataset

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published