Quick Geographies Explainer

If you’re used to dealing mostly with tabular data, then working with geographic data can seem confusing, complicated, and frustrating. That’s ok! We’re going to try to give you a head start here and save you some frustration.

First: nothing is square. Nothing.

To help our delicate human brains, we’ve divided the Earth up into seemingly square latitude and longitude cells and projected it onto flat-looking maps. But, of course, the Earth isn’t actually square: it’s a sphere. And it’s not even a perfectly round sphere; it’s an oblate spheroid, bulging slightly along the equator. If you get deep enough into the details, you learn that your whole life was a lie and there isn’t even any such thing as a definitive “sea level”.

So, the first stage of grief to work through is this: nothing is square. Nothing. Not even if it has square-looking lat/lon coordinates.

GIS-style polygons help us deal sanely with our spherical home

Luckily, the industry has converged on a relatively ubiquitous standard for working with shapes on Earth. In Snowflake (and in Postgres via PostGIS, and in an increasing number of other databases) there’s a native data type for holding geographic shapes called “geography”, along with native functions for calculating things on top of geographies: the distance between two points, the intersection of two polygons, whether a point lies within a polygon, and so on.
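To make that concrete, here’s a minimal taste you can run in a Snowflake worksheet (the two points are arbitrary, chosen just for illustration; TO_GEOGRAPHY parses WKT, and ST_DISTANCE returns meters):

select st_distance(
    to_geography('POINT(-98.125 -45.236)'),  -- longitude latitude
    to_geography('POINT(-96.875 -46.178)')
) as distance_in_meters;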

We use the geography data type in the geo_bounds column

The main CMIP6_PLACES table has a column called geo_bounds that holds the polygon on the surface of the Earth that the row describes. If you display it in what’s known as “WKT” format…

select st_aswkt(geo_bounds) from CMIP6_PLACES limit 1;

…then you get something that looks like this:

POLYGON((-98.125 -45.236,-98.125 -46.178,-96.875 -46.178,-96.875 -45.236,-98.125 -45.236))

That is the Well Known Text (WKT) format of a geographic polygon. The numbers inside are a series of longitude/latitude pairs (note the order: longitude first, then latitude) that describe the vertices of the polygon. In this case, there are four corners of what is, nominally, a “square” (the fifth pair just repeats the first to close the ring). But, of course, because that square is wrapped around the sphere of the Earth, it’s not exactly square; it’s more a vaguely-bent polygon.
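The longitude-first ordering trips people up constantly, so here’s a quick sanity check you can run yourself, using one corner of the polygon above (ST_X and ST_Y pull the longitude and latitude back out of a point):

select st_x(to_geography('POINT(-98.125 -45.236)')) as longitude,
       st_y(to_geography('POINT(-98.125 -45.236)')) as latitude;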

The geography data type gives us a lot of native capability, such as…

Use built-in functions to calculate geographic “joins”

If you’ve got a list of, say, factories and you want to know what the predicted temperature for them will be in the year 2060, then you’ll need to use the ST_CONTAINS() function to figure out which row of data corresponds to the 100km square where your factory is, like so:

ST_CONTAINS(cmip.geo_bounds, my_factories.lat_lon_point)

That function returns true if my_factories.lat_lon_point lies within the cmip.geo_bounds polygon.
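If your my_factories table stores plain latitude and longitude numbers rather than a ready-made geography column, you can build the point on the fly. Here’s a sketch assuming columns named factory_name, lon, and lat (note that ST_MAKEPOINT, like WKT, takes longitude first):

select f.factory_name, cmip.place_id
from my_factories f
inner join CMIP6_PLACES cmip
  on ST_CONTAINS(cmip.geo_bounds, ST_MAKEPOINT(f.lon, f.lat));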

Also use built-in functions to calculate distance

If you want to say something like “give me the average temperature for the area in a 1,500km radius around my factory” then you’d use the ST_DISTANCE() function like so:

ST_DISTANCE(cmip.geo_bounds, my_factories.lat_lon_point)

That function gives you the spherical distance, in meters, between your factory and the closest point on the cmip.geo_bounds polygon. You can also use the ST_DWITHIN() function, which returns true if one geography is within a certain distance (also in meters) of the other (good for putting in join statements).
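Putting it together, the “1,500km radius around my factory” question might look something like this. It’s a sketch, assuming a factory_name column on my_factories and reusing the tables and columns from elsewhere in this doc; 1,500km is 1,500,000 meters:

select f.factory_name, avg(cmip_v.temp_f) as avg_temp_f
from my_factories f
inner join CMIP6_PLACES cmip_p
  on ST_DWITHIN(cmip_p.geo_bounds, f.lat_lon_point, 1500000)
inner join CMIP6_PROJECTIONS_BY_TIME_AND_PLACE cmip_v
  on cmip_v.place_id = cmip_p.place_id
where cmip_v.year = 2060
group by f.factory_name;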

Geographic functions are expensive, so take some shortcuts

Unfortunately, geographic functions tend to be expensive, because Snowflake can’t index them quite the same way it indexes other “normal” columns. For that reason, we’ve also included seemingly-redundant columns like “center_lat” and “hemisphere”. If you know that all the locations you care about are in a particular part of the world (say, the United States, or the Southern Hemisphere), then you can give Snowflake a massive head start, and allow it to prune records better, by also specifying a comfortably-oversized “neighborhood” in terms of those two columns. For example, you might have a where clause like this if you know all your locations are in the Southern Hemisphere:

select cmip_v.temp_f
from CMIP6_PROJECTIONS_BY_TIME_AND_PLACE cmip_v
inner join CMIP6_PLACES cmip_p
  on cmip_p.place_id = cmip_v.place_id
inner join my_factories
  on ST_CONTAINS(cmip_p.geo_bounds, my_factories.lat_lon_point)
where cmip_v.hemisphere = 's'

This allows Snowflake to prune literally half the records using hemisphere = ‘s’ before it starts scanning the rest and applying the ST_CONTAINS() function.
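The same trick works with center_lat. If, say, all your locations are in the contiguous United States, a comfortably-oversized latitude band prunes most of the globe before ST_CONTAINS() ever runs. A sketch (the 20–55 band is a deliberately generous guess, and we’re assuming center_lat lives alongside hemisphere):

select cmip_v.temp_f
from CMIP6_PROJECTIONS_BY_TIME_AND_PLACE cmip_v
inner join CMIP6_PLACES cmip_p
  on cmip_p.place_id = cmip_v.place_id
inner join my_factories
  on ST_CONTAINS(cmip_p.geo_bounds, my_factories.lat_lon_point)
where cmip_v.center_lat between 20 and 55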

Ok, enough about Snowflake’s handling of geographies. Now let’s talk about some quirks and cautions in the data itself…

Why don't all the sources have the same polygons?

If you’ve played with the data at all, you’ve probably noticed that each source_id positions its polygons in slightly different places. In other words, the 1x1 squares in source_id = “CESM2” don’t exactly overlap with the 1x1 squares in source_id = “MPI-ESM1-2-HR”, even though they are basically the same size. This is annoying, but it’s a reality resulting from the fact that each supercomputer simulation is run with slightly different parameters. Unfortunately, it’s not scientifically valid for us to simply nudge one set of polygons into line with the other, so you’ll either have to average across sources, or pick one source_id to work with and stick with it. We recommend “MPI-ESM1-2-HR”.

Can I take averages across polygons?

Yes and no.

You can’t take a straight average across polygons. The reason is that not all polygons represent the same surface area of the Earth, so if you take a simple average across them, the smaller polygons will have more influence over the result than they deserve. Instead, you have to take a weighted average, where the weight is the surface area of each polygon.

Luckily for you, we’ve calculated the surface area of each polygon ourselves and stored it in another column in the same row. So, for example, to take a global average of all temperatures in February of 2030 for one source_id, you’d do this:

select sum(cell_area_m_sq * temp_f) / sum(cell_area_m_sq)
from CMIP6_PROJECTIONS_BY_TIME_AND_PLACE
where year = 2030
  and month = 2
  and source_id = 'MPI-ESM1-2-HR'
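And if you’d rather average across sources (as suggested earlier) than commit to one, a reasonable sketch is to compute the same area-weighted average per source_id and then compare or combine the results:

select source_id,
       sum(cell_area_m_sq * temp_f) / sum(cell_area_m_sq) as weighted_avg_temp_f
from CMIP6_PROJECTIONS_BY_TIME_AND_PLACE
where year = 2030
  and month = 2
group by source_id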

I want a smaller grid than 100km!

At the moment, the finest resolution this data comes in is 100km squares. There are regional models being developed that have finer resolution for parts of the globe, and when those are freely available, we’ll load them, too.

Taking a coarse model and producing finer-grained estimates from it is called “downscaling”. It’s hard. Graduate students write entire dissertations about how best to do it. But we’re working on it.