Week 15: Geographical data

Objective

Understand geographical data

Libraries:

  • geopandas
  • geopy
  • folium
  • plotly
  • matplotlib

Geographical System

Following are the major steps and considerations when dealing with geographical data:

  1. Geocode: turn geographical names into longitude and latitude coordinates. For example, you cannot plot the name "Hong Kong" on a map directly, but you can plot the point (114.141, 22.362). (You can use geojson.io to quickly look up the coordinates.)
  2. Projection: even once you have the geo coordinates, they still cannot be plotted on the screen directly. We need a translation from geo coordinates to screen coordinates. For example, if we want to put HK in the center of a 640px by 480px 2D map, we need to establish a mapping like (114.141, 22.362) --> (320px, 240px). This process is called projection. Actual projections are more complex than that. Here's a demo of different methods of projection.
    • Scatter plot/ bubble plot -- simply project the point coordinates
    • Choropleth -- one needs to project a geometry
  3. Base layer: maps are usually organised into layers. Besides putting the data points we are interested in onto the map, we also show some geographical information, like constituency boundaries, streets and contours. This is the benefit of a map -- it puts new data points onto a canvas that people are already familiar with. This kind of information usually comes with the "base layer", whereas the elements plotted above live in "data layers". Common choices for the base layer include Google Maps, OpenStreetMap and Mapbox.
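The screen mapping described in step 2 can be sketched with a plain linear (equirectangular) mapping. This is only an illustration under simplifying assumptions, not what mapping libraries actually do; the function name `project_to_screen` and the `degrees_per_pixel` zoom parameter are made up for this example.

```python
def project_to_screen(lon, lat, center_lon, center_lat,
                      width=640, height=480, degrees_per_pixel=0.01):
    """Map (lon, lat) to pixel (x, y) so that the center point lands mid-screen."""
    x = width / 2 + (lon - center_lon) / degrees_per_pixel
    # Screen y grows downwards while latitude grows northwards, hence the minus sign.
    y = height / 2 - (lat - center_lat) / degrees_per_pixel
    return x, y


hk = (114.141, 22.362)
print(project_to_screen(*hk, center_lon=hk[0], center_lat=hk[1]))  # (320.0, 240.0)
```

A point east of the center gets a larger x, and a point north of the center gets a smaller y.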

Geocoding: turn string address data into geo coordinates

Geocoding is usually done via a web service. Such services are costly to run, so free ones are rare nowadays. geopy wraps many geocoding services for you to choose from. Nominatim is a frequently used free service. You just need to specify your user agent (any string would work) and control the request rate.

from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent='specify_your_app_name_here')
location = geolocator.geocode('The address that you want to geocode')
location.point
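Controlling the request rate can be as simple as pausing between calls. The sketch below is one possible way to do that; `geocode_all`, `nominatim_geocode` and the one-second default delay are assumptions made for this example, not part of geopy.

```python
import time


def geocode_all(addresses, geocode, min_delay_seconds=1.0):
    """Geocode addresses one by one, pausing between requests."""
    results = {}
    for i, address in enumerate(addresses):
        if i > 0:
            time.sleep(min_delay_seconds)  # stay polite to the free service
        location = geocode(address)
        # Geocoders return None when nothing matches the address.
        results[address] = (location.latitude, location.longitude) if location else None
    return results


def nominatim_geocode():
    """Return Nominatim's geocode function (imported lazily; needs geopy)."""
    from geopy.geocoders import Nominatim
    return Nominatim(user_agent='specify_your_app_name_here').geocode


# Usage (performs real web requests):
# coords = geocode_all(['The address that you want to geocode'], nominatim_geocode())
```

geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that wraps a geocode function in a similar way.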

Google Maps once provided a free API, which was discontinued in July 2018. Now you must apply for a Google API key before using this service. The first few hundred requests every day are free. Subsequent requests are charged at US$5 per 1000 requests. You can check out the details in the billing plan. The core code is as follows:

from geopy.geocoders import GoogleV3
geolocator = GoogleV3(api_key='Your API Key from Google API')
location = geolocator.geocode('The address that you want to geocode')
location.point

One can refer to this notebook for a real and complete working example of geocoding. It is part of the HK Sichuan food growth map visualization.

Geographical Reference Systems (GRS)

A Point-Of-Interest (POI) is denoted by a two-dimensional coordinate (a two-element tuple). The location of a POI is always relative to some reference. In order for systems to communicate with each other and accurately refer to the same location on earth, a Geographical Reference System (GRS) needs to be specified. A GRS is like the protocol between GIS systems. A GRS specifies the following:

  • The projection method
  • Center of the map
  • Scaling factor of the map

For example, the geocoding results from the section above are pairs of (longitude, latitude) values, which refer to the "WGS1984 CRS" (EPSG:4326), i.e. longitude in range [-180, 180] and latitude in range [-90, 90]. If you check out the district council boundary file from Hong Kong's Census and Statistics Department, you will find that the coordinates are very large numbers. That is because Hong Kong conventionally used the "Hong Kong 1980 CRS" (EPSG:2326) in official government files. If you put those files into visualisation tools like mapshaper, there is no problem displaying them individually. However, when you use those files in modern mapping libraries, the plotted geographical elements may not be at the location you expect.

Most modern mapping libraries and GeoJSON files use the WGS1984 CRS. Usually this step is hidden from ordinary users. However, if you encounter older files, you may need to handle the CRS conversion yourself. Here is a practical case of CRS conversion.
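As a rough sketch of such a conversion, the snippet below turns a single Hong Kong 1980 Grid coordinate into WGS1984 longitude and latitude. It assumes the pyproj package is installed (it is a dependency of geopandas); the helper name `hk1980_to_wgs84` and the sample coordinate, which is the grid's false origin, are chosen for illustration only.

```python
from pyproj import Transformer

# always_xy=True keeps the (easting, northing) -> (longitude, latitude) axis order.
to_wgs84 = Transformer.from_crs("EPSG:2326", "EPSG:4326", always_xy=True)


def hk1980_to_wgs84(easting, northing):
    """Convert one HK1980 Grid point (in metres) to (longitude, latitude) in degrees."""
    lon, lat = to_wgs84.transform(easting, northing)
    return lon, lat


# The false origin of the HK1980 Grid, which lies in Kowloon.
print(hk1980_to_wgs84(836694.05, 819069.80))
```

With geopandas, calling `to_crs("EPSG:4326")` on a GeoDataFrame applies the same kind of conversion to the whole geometry column.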

Projection system

We live on a spherical surface but the computer screen is rectangular. The process to convert shapes from the former to the latter is called "projection". Here is a demo of different methods of projection.

Mercator projection

Mercator projection is the most widely used projection. The most familiar world map is drawn with it.

Image from Jason Davies's map projection explorer

Here is an excellent video to show you how our conventional world map can be misleading.

Image from "How the World Map Looks Wildly Different Than You Think"

The key takeaway is that the farther a region is from the equator, the larger the distortion.
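The growth of the distortion can be made concrete with the standard spherical Mercator formulas. The following is a minimal sketch that assumes a perfectly spherical earth; the function names `mercator` and `scale_factor` are chosen for this example.

```python
import math


def mercator(lon_deg, lat_deg, radius=1.0):
    """Spherical Mercator: project degrees to planar (x, y) in units of `radius`."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    x = radius * lon
    y = radius * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y


def scale_factor(lat_deg):
    """How much lengths are stretched at a given latitude (1.0 at the equator)."""
    return 1 / math.cos(math.radians(lat_deg))


for lat in (0, 22.362, 60, 80):
    print(lat, round(scale_factor(lat), 2))
```

At 60 degrees latitude lengths are already stretched by a factor of about 2, so areas appear about 4 times larger than they would at the equator.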

File Formats

GeoSeries and GeoDataFrame

These are subclasses (extensions) of pandas's Series and DataFrame. In essence, a GeoDataFrame adds one geometry column to the ordinary pandas table. Technically, there can be more than one geometry column, but one of them needs to be active in order to enable geometric operations. The geopandas module implements the basic GIS operations above, like geocoding, converting CRS and calculating projections. It also supports geometric predicates like intersects and contains. Those operations are vectorised, i.e. applied to a whole column at once, for convenience. There is also a counterpart of the DataFrame join for GeoDataFrame -- sjoin, a.k.a. "spatial join", which matches rows based on those geometric predicates.

GeoJSON

GeoJSON is a lightweight file format to store geographical data. It is based on JSON and can be easily loaded and processed by many programming languages. Read more about the file format specification on http://geojson.org/ and try to draw GeoJSON files on http://geojson.io/.
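Because GeoJSON is plain JSON, Python's built-in json module is enough to write and read it. The sketch below builds a one-point FeatureCollection; the "name" property is an arbitrary example. Note that GeoJSON stores coordinates in [longitude, latitude] order.

```python
import json

# A single point feature; coordinates are [longitude, latitude].
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [114.141, 22.362]},
    "properties": {"name": "Hong Kong"},
}
collection = {"type": "FeatureCollection", "features": [feature]}

# Serialise to text (this is what a .geojson file contains) and read it back.
text = json.dumps(collection, indent=2)
parsed = json.loads(text)
lon, lat = parsed["features"][0]["geometry"]["coordinates"]
print(lon, lat)
```

The resulting text can be pasted into http://geojson.io/ to see the point on a map.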

TopoJSON

The GeoJSON format can result in very large files, which can be prohibitive for a widely deployed web service. TopoJSON can significantly reduce the file size. It is based on the following key ideas:

  • Reduce redundant/ shared arcs between geometries to save space
  • Use fixed-precision delta-encoding for integer coordinates
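The second idea can be illustrated in a few lines. This is a simplified sketch of quantization followed by delta-encoding, not the actual TopoJSON implementation; the function names and the sample `scale` and `translate` values are assumptions for this example.

```python
def quantize(points, scale, translate):
    """Snap float coordinates onto an integer grid (fixed precision)."""
    return [(round((x - translate[0]) / scale[0]),
             round((y - translate[1]) / scale[1])) for x, y in points]


def delta_encode(int_points):
    """Store the first point as-is and every later point as a difference."""
    encoded, prev_x, prev_y = [], 0, 0
    for x, y in int_points:
        encoded.append((x - prev_x, y - prev_y))
        prev_x, prev_y = x, y
    return encoded


def delta_decode(deltas):
    """Rebuild the integer points by accumulating the differences."""
    decoded, x, y = [], 0, 0
    for dx, dy in deltas:
        x, y = x + dx, y + dy
        decoded.append((x, y))
    return decoded


arc = [(114.141, 22.362), (114.142, 22.362), (114.143, 22.363)]
grid = quantize(arc, scale=(0.001, 0.001), translate=(114.0, 22.0))
deltas = delta_encode(grid)
print(deltas)  # [(141, 362), (1, 0), (1, 1)]
```

Neighbouring points on a boundary are close together, so most deltas are small integers that take far fewer characters than full floating-point coordinates.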

KML

KML is an early format intended for web-based mapping services. It is supported by Google and still works as the main (or only) format in some Google services. Since it is an XML-based file format, it usually produces larger files than JSON-based formats (GeoJSON/TopoJSON). Find more on wiki and try to visualise KML via Google Fusion Map.

Mapping

"Mapping" refers to the process of visualizing data on maps, a.k.a data visualization on maps. So the key issue of mapping is to determine which visual element is used to present the data. That leads to three major types of maps: visualise by point, by line and by area.

Map types

This section discusses some common map types.

Plotting points: (Point of Interest; POI)

Plotting lines:

Plotting areas:

Map components

  • Feature
  • Layer
    • Layer
    • Background Layer
  • Auxiliary
    • Tooltip
    • Highlight
    • Toolbox

When a map is used to show correlation

A map can be very effective when the data value is correlated with geolocation. It helps one to identify patterns and discover anomalies.

When a map hides key information

A map can often hide key information, because a larger area on the map does not mean the data there is more important. In the above example, Europe, especially Spain, needs to be highlighted because those are the major destinations of Qingtian migrants. However, because of its larger area on the map, Brazil catches one's eye more easily.

One technical way to solve this problem is to plot a cartogram. However, a bar chart comes in handy most of the time.

Case studies

This section includes some selected map visualization cases made in Python. There are many other tools that can help you make maps, most notably QGIS, D3 and Carto. We leave pointers in the "other tools" section for readers' reference.

Shanghai rental sources map visualization

  • Use geopandas for geocoding, visualization.
  • Use GeoDataFrame.sjoin() to correlate points into polygons (administrative areas).
  • Use GeoDataFrame.groupby().count() to count the POIs in each area.
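The join-then-count steps above can be sketched on toy data as follows. This is not the code from the Shanghai case; the two square "districts", the three points and all column names are invented for illustration, and the `predicate` argument assumes a reasonably recent geopandas version.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two made-up square districts sitting side by side.
districts = gpd.GeoDataFrame(
    {"district": ["West", "East"]},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
              Polygon([(1, 0), (2, 0), (2, 1), (1, 1)])],
    crs="EPSG:4326",
)

# Three made-up points of interest, e.g. rental listings.
listings = gpd.GeoDataFrame(
    {"listing_id": [1, 2, 3]},
    geometry=[Point(0.2, 0.5), Point(0.7, 0.1), Point(1.5, 0.5)],
    crs="EPSG:4326",
)

# Spatial join: attach to each point the district polygon it falls within.
joined = gpd.sjoin(listings, districts, how="inner", predicate="within")

# Count the points in each district.
counts = joined.groupby("district")["listing_id"].count()
print(counts)
```

Both tables must use the same CRS before the join, which is why the toy data declares EPSG:4326 on each.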

More details can be found on this article.

Air crash map using plotly

The following is an example of plotting an interactive map with plotly. It's a report about the air crashes around the world in the past 70 years.

For the visualization of the map, the key data we need is the longitude and latitude of each city, organized into the start point and the end point of each path. Then, with the help of plotly, we can get an interactive map.

Plotly interactive map

The tools and process:

  • Get the data from the open data websites of Socrata, World Bank, and NGIA.
  • Use pandas to curate and restructure all the data sources
  • Use plotly to visualize the map

You can refer here for the whole story. The dataset and codes can be found here.

Openrice Sichuan Food using folium

Following is an animated map showing how Sichuan restaurants rolled out in Hong Kong.

The tools and process:

  • Use requests, selenium, and beautifulsoup to collect data
  • Use geopy to perform geocoding, i.e. turn address into geo-location
  • Use folium (built on leaflet.js) to visualise circles on map
  • Use selenium to take screenshots
  • Use ImageMagick and gifsicle to combine screenshots into a gif

Code repo: https://github.com/hupili/openrice-data-blog-201811

Global data journalist distribution and contribution map using plotly

The following is an example of a scatter plot on a map, showing how data journalists are distributed around the world.

As with the air crash map above, the key data for this map is the longitude and latitude of each city. In addition, we need another dimension to assign the color of each point on the map. In this case, the depth of color represents a journalist's overall GitHub contribution since 2008.

The tools and process:

  • Get the geographical data and other information of a journalist from the csv files
  • Use aggregate() to accumulate each journalist's GitHub contribution data since 2008
  • Use plotly to visualize the map

The dataset and codes can be found here.

American journalist job market map using plotly

This is an example of a choropleth map, from a report about the employment market for journalists in the U.S. In this map, the key data is the number of open positions in each state of the U.S. In addition, this map identifies the states by their abbreviations (VA, NY...).

The tools and process:

  • Get the data from 0 jobs.csv
  • Use pandas to manipulate the location data from 0 jobs.csv into US-States.csv to code the states of the U.S into their abbreviations
  • Use list.count() to categorise the jobs into different states
  • Use plotly to visualize the map

The dataset and codes can be found here.

Tourist footprints of domestic tourist cities

This is an interactive map of the footprints of tourists from the top 5 domestic tourist cities. As the dots and arrows show, most of these cities are not only popular destinations but also popular origins. Therefore, they connect with each other and form a network. The footprints are concentrated almost entirely in Eastern China.

Tourists bookmarks.png

The tools and process:

  • Get the data from new_final.csv
  • Use pyecharts and GeoLines to visualize the maps.

The dataset and codes can be found here.

Number of journalists killed in different countries

The following map is an interactive diagram (SVG) of the number of journalists killed worldwide. The deeper the color of a country, the more news workers were killed in that country and the more dangerous it is for reporters. However, the version embedded in this page is not interactive. You can click here to try out the interaction effect.

Journalists killed worldwide

The tools and process:

  • Get data from CPJ
  • Use pygal.maps.world to plot the map

The dataset and codes can be found here.

England and Ireland pubs using matplotlib

A map of Britain and Ireland pubs.

Britain & Ireland mapping pubs

The tools and process:

  • Get data from OpenStreetMap, as provided by osm-x-tractor.
  • Draw geo scatter plot via matplotlib

Data and codes can be found here: England and Ireland seen from pub locations.

Hong Kong property price bubble chart using folium

Visualizing various property prices in Hong Kong.

Property prices in hongkong.png

The tools and process:

  • Extract the dataset from the Midland Realty Property Price Chart with the help of a pdf to csv converter
  • Use overpy for geocoding
  • Use pandas to combine the property names, prices, and coordinates into one large dataframe for mapping
  • Draw a map with folium

Data and codes can be found here: Visualising HK property prices.

United States unemployment rate 2012 choropleth using folium

An example of a choropleth map made using the folium library. This example comes directly from the library's documentation; you can find more examples here.

Choropleth map with folium

The tools and process:

  • First get a shape file in GeoJSON format and a data frame that gives the value of each zone in your case
  • Plot choropleth using folium.

Data and codes can be found here: United States unemployment rate 2012 choropleth map.

Bonus: Other GIS and mapping tools

QGIS

https://www.qgis.org/en/site/

QGIS is written mainly in C++ and ships with Python bindings (PyQGIS). It provides a nice GUI so people without a coding background can also use this tool. It integrates very well with Python. One can first try to process the geographical data via the QGIS GUI. Once the prototyping is done, one can automate the workflow and take it to a massive scale using some glue code written in Python.

One major advantage of QGIS is being FOSS.

ArcGIS

https://www.arcgis.com/index.html

It is a high quality commercial GIS system.

Carto

https://carto.com/

A very easy-to-use online tool. However, the free version limits the number of POIs to 10,000. Consider your data scale before using this tool.

D3

D3 is a widely used data visualization library in JavaScript. It provides convenient tools for users to handle geo projections and turn GeoJSON data into SVG path elements. The tool is highly flexible and favoured by many web designers.

Here is a case made by D3

MapShaper

http://mapshaper.org/

MapShaper can help one preview map files and convert between different formats. It is also available as a command line tool. See it in action in the geohk project.

Google Fusion Table

Being in the Google toolchain is a major advantage. However, this online tool requires KML format to plot a map. The current de facto standard GeoJSON is not supported as of this writing.

Exercises

Distances among cities

  1. Calculate the "straight line" distance on earth surface from several source cities to Hong Kong. The source cities: New York, Vancouver, Stockholm, Buenos Aires, Perth. For each source city, print one line containing the name of the city and distance.
  2. You can find "Great-circle distance" formula here.
  3. Use a list and a for loop to handle multiple cities.
  4. Use functions to increase reusability.
  5. Modules you need: math, you may need to use trigonometric functions.

NOTE: Our objective for the whole course is to get you onboard with a new tool -- Python. You should use the tool but not be constrained by it. When you get stuck with a challenge, try to solve it your own way first, combining non-Python methods, and then iterate towards a better solution. For example, one key question for this exercise is how to get the geo-locations of the cities in terms of longitude and latitude. Only with those coordinates can you apply the great-circle distance formula. You can start by searching Google, Google Maps or Wikipedia. After you have a basic version, try to think of automatic ways, in case a real challenge involves so many cities that manual searching becomes infeasible.

Extended exercise of geo distance

There is a package called geopy. It can automatically look up geo-locations, in terms of longitude and latitude, based on location names. Furthermore, it can directly compute the distance between two geo-locations, so you do not have to write the formula yourself.