Vector data sources (split from “Most widely used projections”)

General discussion of map projections.
Milo
Posts: 271
Joined: Fri Jan 22, 2021 11:11 am

Re: Most widely used projections

Post by Milo »

daan wrote: Thu Aug 31, 2023 10:38 am While more rigorous, that’s probably a lot less interpretable for most mapmakers.
People are never going to learn to understand something if they're not exposed to it. But it's not a difficult concept.
daan wrote: Thu Aug 31, 2023 10:38 am The SQL tables are part of the Shapefile spec and found in the files having the .dbf extension.
Hmm, they're not opening in SQLite3. I do see the website mentioning SQLite, so should I be getting SQLite2? Isn't that 19 years old?

In any case, a quick string search of ne_10m_minor_islands.dbf shows that all of its entries are simply named "Minor island".
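For what it's worth, the .dbf member of a Shapefile is a dBASE-style table rather than an SQLite database, which would explain why SQLite3 refuses to open it. Below is a minimal sketch of reading such a table with only the Python standard library. The function name `read_dbf` is made up for illustration, the Latin-1 decoding is an assumption (a file may use another encoding), and only the common dBASE III layout is handled.

```python
import struct

def read_dbf(data: bytes):
    """Parse a dBASE III-style .dbf table held in memory.

    Returns (field_names, rows); each row is a dict of stripped strings.
    """
    # Fixed 32-byte header: record count at offset 4, then header and record sizes.
    n_records, header_len, record_len = struct.unpack_from("<IHH", data, 4)

    # Field descriptors are 32 bytes each and end at a 0x0D terminator byte.
    fields = []
    pos = 32
    while data[pos] != 0x0D:
        name = data[pos:pos + 11].split(b"\x00", 1)[0].decode("ascii")
        length = data[pos + 16]          # field width in bytes
        fields.append((name, length))
        pos += 32

    rows = []
    for i in range(n_records):
        start = header_len + i * record_len
        record = data[start:start + record_len]
        if record[:1] == b"*":           # leading flag byte: '*' marks a deleted record
            continue
        offset = 1
        row = {}
        for name, length in fields:
            # Assumption: Latin-1 text; adjust if the file uses another encoding.
            row[name] = record[offset:offset + length].decode("latin-1").strip()
            offset += length
        rows.append(row)
    return [name for name, _ in fields], rows
```

Something like `read_dbf(open("ne_10m_minor_islands.dbf", "rb").read())` would then give the column names and rows, which is a little more structured than a raw string search.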
daan wrote: Thu Aug 31, 2023 10:38 am I don’t think that comment in the Lands link is accurate. The Shapefile spec does not even permit mixed geometry types, so if a Shapefile has some polygons in it, then all of what it holds must be polygons. I can confirm that all of Eurasia/Africa is a single polygon, as is North and South America, in the coastline data set.
Might be a difference between the Land and Coastline sets. I've seen someone commenting that they're almost the same, except that one uses polygons and one uses polylines. So long as the polylines are closed like polygons would be, I guess there's not much difference. Theoretically it might be harder to tell which side is "inside", but I could just compute it as whichever side is smaller, since no continent on Earth occupies more than 50% of the surface. (And if one did, I'd be more interested in the seas.)
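The "whichever side is smaller" idea can be sketched as follows. This is only an illustration under stated assumptions: the Earth is treated as a unit sphere rather than an ellipsoid, the edges are great-circle arcs, consecutive vertices are distinct, and the function name `smaller_side_fraction` is hypothetical. It uses the fact that the area to the left of a closed ring on a unit sphere equals 2π minus the sum of the turning angles at its vertices.

```python
import math

def _unit(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to a unit vector."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def smaller_side_fraction(ring):
    """Fraction of the sphere on the smaller side of a closed ring of (lon, lat) degrees."""
    if len(ring) > 1 and ring[0] == ring[-1]:
        ring = ring[:-1]                 # drop a repeated closing vertex
    pts = [_unit(lon, lat) for lon, lat in ring]
    n = len(pts)
    turning = 0.0
    for i in range(n):
        prev, here, nxt = pts[i - 1], pts[i], pts[(i + 1) % n]
        n_in = _cross(prev, here)        # normal of the incoming great circle
        n_out = _cross(here, nxt)        # normal of the outgoing great circle
        # Signed turning angle at this vertex; left turns are positive.
        turning += math.atan2(_dot(here, _cross(n_in, n_out)), _dot(n_in, n_out))
    full = 4.0 * math.pi
    left_area = (2.0 * math.pi - turning) % full
    # Whichever side is smaller is taken to be the "inside".
    return min(left_area, full - left_area) / full
```

Because the smaller side is taken at the end, the winding direction of the ring does not matter. For very small islands the subtraction from 2π loses precision, so this is better suited to checking orientation than to measuring area.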
daan wrote: Thu Aug 31, 2023 10:38 am For example, there is a minor_islands_coastline data set, and a minor_islands_label_points dataset that presumably pairs with the coastline data set.
Ah, found it. I was looking at the Physical Labels set, which combines labels with polygons. Apparently there's a completely separate "label points" set under Physical Building Blocks. "Points" implies that it would take some work to link them to polygons, but so long as the points always lie inside the polygons this should be doable. And what does that set mean by "seams"?
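Linking label points to polygons, as described above, comes down to a point-in-polygon test. Here is a minimal sketch using even-odd ray casting on plain (x, y) coordinates. The names `point_in_ring` and `match_labels` are hypothetical, and the planar treatment is an assumption: it ignores holes, the poles, and rings that cross the 180th meridian.

```python
def point_in_ring(x, y, ring):
    """Even-odd ray-casting test for a point against a ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's y value can be crossed by a
        # ray cast from the point towards +x.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def match_labels(label_points, polygons):
    """Map each label name to the key of the first polygon containing its point."""
    matches = {}
    for name, (x, y) in label_points.items():
        for key, ring in polygons.items():
            if point_in_ring(x, y, ring):
                matches[name] = key
                break
    return matches
```

Labels whose point falls in no polygon are simply left out of the result, which makes them easy to spot.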

Also I just checked "ne_10m_minor_islands_label_points.dbf" and it still just labels all of them as "Minor island", same as the polygon-based "ne_10m_minor_islands.dbf" that I already looked at.

In fact, even "ne_10m_land_ocean_label_points.*" doesn't actually include any labels? (A case-insensitive string search for "Australia" turns up no matches. Only "ne_10m_geography_regions_polys.dbf" has that.) What is the point of these files? Is it just to tell map-makers what's a good place to print a label in the "center" of the region it refers to? I don't need that.
daan
Site Admin
Posts: 977
Joined: Sat Mar 28, 2009 11:17 pm

Re: Vector data sources (split from “Most widely used projections”)

Post by daan »

I browsed some of these tables in a database viewer. The minor islands are a bust: they are not named. Whether or not a feature is named depends on the data set. ne_10m_populated_places, for example, gives names for all the rows that I looked at. That data set has a lot of columns to provide the context of the jurisdiction, along with their labels. You would do SQL queries to join the point locations for labels with the labels, as well as to select the administrative level and so forth. The minor islands repetitive label that you found is just the feature class.
Milo wrote: Might be a difference between the Land and Coastline sets. I've seen someone commenting that they're almost the same, except that one uses polygons and one uses polylines.
That matches my observation. Most (all?) of the data sets that have polygon representations also have corresponding linestring data sets.

There are a lot of coverages with a lot of overlap; you choose the subset of coverages whose structure and levels of detail meet your needs. It takes experience to figure out what that would be to make a real map in your own workflow. The data library is intended to be used to create cartographic products (as opposed to research or experimentation), so perhaps it doesn’t meet your needs.

— daan

Post by Milo »

I suppose the minor island names aren't that essential; I could always look them up manually later when (if) I'm done with my calculations. I mostly wanted them included to confirm that certain minor islands are, in fact, represented in the dataset.

The real problem here is that the datasets come with so little documentation explaining what they actually contain. It's not even a matter of "this doesn't suit my purposes", but rather "it would take a tedious and in-depth examination to even check if this suits my purposes or not". And I'm reluctant to do the work of writing code to process the data format when I'm not sure it's even what I want, and the data I'm actually looking for might turn out to be in an entirely different format, creating a bit of a chicken-and-egg problem.

Post by daan »

Sounds like you want this.

— daan

Post by Milo »

I don't want a broken website, no.

Right now I'm looking at the GSHHG data. Some points:
  • It claims to be "high-resolution" in its title, but completely fails to mention how high, in any metric (even the stupid ones). However, its main source, WVS, claims that "WVS Plus specifications for data layers at 1:250000 scale require that 90% of shoreline features be within 500-meter circular error of their true geographic location."... although it's also unclear whether GSHHG is actually based on that one or one of the less accurate WVS versions.
  • Format is pretty simple and straightforward, easy to parse even without dedicated GIS software. However, it includes no name data whatsoever. Still, I can infer what most of its islands appear to be by comparing their size and coordinates against other sources.
  • Afro-Eurasia and America are split at the Suez and Panama canals, but at least the documentation warns about this and it's easy to adjust for.
  • Seems to be pretty comprehensive, with a total of 180174 listed landmasses after Afro-Eurasia and America are merged. I've found apparent matches for all four islets making up Ducie (covered separately), but not for Maher Island, which is presumably subsumed into the Antarctic ice shelf. As for Motu Nui, see below.
  • It conveniently comes with built-in area data for all its polygons. However, the values are up to 10% off from the ones in Wikipedia. For example:


                  |      GSHHG      |  Wikipedia  | Error
    Borneo        | 732586.663 km^2 | 748168 km^2 | 2.1%
    Madagascar    | 590729.739 km^2 | 587041 km^2 | 0.6%
    Baffin Island | 480654.953 km^2 | 507451 km^2 | 5.3%
    Sumatra       | 428899.509 km^2 | 443065 km^2 | 3.2%
    Presumably this is due to inaccuracies in the polygons, but it's jarring that the error is so proportionally large even for really large islands. The scant documentation does promise that the areas are calculated on the WGS-84 ellipsoid (as of version 2.2.0).

    Really tiny islands can be even worse. The island I'm guessing is Motu Nui is listed as 0.117892315 km^2, whereas Wikipedia says only 0.039 km^2. I don't think I'm mistaking it for one of the larger islands in the area, since the Easter Island mainland is a considerably larger 169.371677 / 163.6 km^2 (according to GSHHG and Wikipedia, respectively), and there aren't supposed to be any islands of intermediate size in the area. Possibly this is actually Motu Nui, Motu Iti, and Motu Kau Kau combined as if they're one island.
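The Error column in the GSHHG table above is consistent with a plain relative difference measured against the Wikipedia figure. A short sketch of that arithmetic follows; treating Wikipedia as the reference value is an assumption made only for the comparison, and the names are illustrative.

```python
# (GSHHG area, Wikipedia area) in km^2, copied from the table above.
areas_km2 = {
    "Borneo": (732586.663, 748168.0),
    "Madagascar": (590729.739, 587041.0),
    "Baffin Island": (480654.953, 507451.0),
    "Sumatra": (428899.509, 443065.0),
}

def percent_error(measured, reference):
    """Absolute difference as a percentage of the reference value."""
    return abs(measured - reference) / reference * 100.0

errors = {
    name: round(percent_error(gshhg, wikipedia), 1)
    for name, (gshhg, wikipedia) in areas_km2.items()
}
# errors -> {'Borneo': 2.1, 'Madagascar': 0.6, 'Baffin Island': 5.3, 'Sumatra': 3.2}
```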
Last edited by Milo on Fri Sep 01, 2023 4:27 am, edited 1 time in total.
PeteD
Posts: 251
Joined: Mon Mar 08, 2021 9:59 am

Re: Vector data sources (split from “Most widely used projections”)

Post by PeteD »

Milo wrote: Fri Sep 01, 2023 3:13 am WVS Plus specifications for data layers at 1:250000 scale require that 90% of shoreline features be within 500-meter circular error of their true geographic location.
So features on a map printed at that scale can be up to 2 mm away from where they're supposed to be? That's a lot further than I would have expected.
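The 2 mm figure follows directly from dividing the ground tolerance by the scale denominator. A one-function sketch of that conversion (the function name is illustrative):

```python
def map_error_mm(ground_error_m, scale_denominator):
    """Size of a ground distance on paper, in millimetres, at 1:scale_denominator."""
    return ground_error_m * 1000.0 / scale_denominator

# 500 m circular error on a 1:250000 map.
wvs_error_mm = map_error_mm(500.0, 250_000)   # 2.0
```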

Post by daan »

Milo wrote: Fri Sep 01, 2023 3:13 am I don't want a broken website, no.
Works for me.

— daan

Post by Milo »

I managed to find a Global Islands / Global Shoreline Vector (apparently a description, rather than an actual name?) that boasts 30-meter resolution and a total of 340691 islands including an explicit layer for ones smaller than 0.0036 km^2. That... should be enough for my purposes.

Now what the heck is an .mpk file and how do I open it?

It appears to be a 7-Zip archive, but the files inside it are equally opaque. A lot of them have "gdb" in their name, which probably stands for Geo Database or something (and not GNU Debugger like I'm used to).

And why does it contain two separate copies of its gigabyte-large data, differing only in the tiny .mxd header file at their root?

I can probably open it with this. Doesn't explain what the .mpk or .mxd files are.

Ah, found it. (Here, too.) Well, sorta. Still doesn't clarify what the info in the .mxd file is and whether I actually need it, but I think probably not. Looks like it's largely relevant to drawing the data as an actual graphic map, which I'm not doing.

Okay. I can probably work with this. Let's see...

Yup, I'm in!

The areas this time are:


              |    Planar area     |  Wikipedia  | Error
Borneo        | 723154.066521 km^2 | 748168 km^2 | 3.34%
Madagascar    | 592521.410312 km^2 | 587041 km^2 | 0.93%
Baffin Island | 507204.948981 km^2 | 507451 km^2 | 0.05%
Sumatra       | 428134.156904 km^2 | 443065 km^2 | 3.37%
Maybe it's Wikipedia that's wrong.

Maher Island and Motu Nui are definitely in this time, with area values that look right. Pandora might be... I'm finding too many islands named Pandora and it would take some more work to figure out which is the relevant one. Oh yeah, this database has name data as well.

Though confusingly, this dataset appears to give two different area values for each island. The other one is called by the database field name "Area_Geode", which is presumably short for "geodesic", and not, say, mineral geodes (there seems to be a limit of 10 characters in field names). Why doesn't this thing come with documentation?

...Okay, found something. Hidden deep inside "a00000004.gdbtable" is the following junk code that I would be surprised if there's any way to retrieve from within GDAL:


<Process ToolSource="c:\arcgis\pro\Resources\ArcToolbox\toolboxes\Data Management Tools.tbx\CalculateGeometryAttributes" Date="20200323" Time="164400">CalculateGeometryAttributes FinalMerged_GlobalIslands_Clean "IslandArea_km2 AREA;IslandCoastline_km PERIMETER_LENGTH;Area_Geodesic_km2 AREA_GEODESIC;Coastline_Geodesic_km PERIMETER_LENGTH_GEODESIC" Kilometers "Square kilometers" #</Process>
Apparently "AREA_GEODESIC" is an actual data type in ArcGIS.
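The process record quoted above pairs each output field with an ArcGIS geometry property, and that pairing can be pulled out mechanically. The following is a rough sketch using only the standard library. The function name `geometry_attributes` is hypothetical, and the assumption that the field list is always the third argument comes from this one record, not from any documentation.

```python
import shlex
import xml.etree.ElementTree as ET

# The record exactly as quoted above.
PROCESS_XML = r'''<Process ToolSource="c:\arcgis\pro\Resources\ArcToolbox\toolboxes\Data Management Tools.tbx\CalculateGeometryAttributes" Date="20200323" Time="164400">CalculateGeometryAttributes FinalMerged_GlobalIslands_Clean "IslandArea_km2 AREA;IslandCoastline_km PERIMETER_LENGTH;Area_Geodesic_km2 AREA_GEODESIC;Coastline_Geodesic_km PERIMETER_LENGTH_GEODESIC" Kilometers "Square kilometers" #</Process>'''

def geometry_attributes(process_xml):
    """Return {field_name: geometry_property} from a CalculateGeometryAttributes record."""
    command = ET.fromstring(process_xml).text
    # shlex keeps each double-quoted argument together as a single token.
    tokens = shlex.split(command)
    # Assumption: tokens are tool name, table name, then the "field PROPERTY;..." list.
    pairs = tokens[2].split(";")
    return dict(pair.split(" ", 1) for pair in pairs)

attributes = geometry_attributes(PROCESS_XML)
```

Here `attributes["Area_Geodesic_km2"]` comes out as `"AREA_GEODESIC"`, which matches the reading that the second area column is the geodesic one.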

Here's the documentation. And... what. WHAT. Why would you even include non-geodesic measurements!?

The planar and geodesic areas are almost the same because the data is nominally in the Mollweide projection, which is, of course, equal-area (at least when you account for ellipsoidal flattening before projecting). Which makes it slightly worrying that they're not exactly the same. But the data also includes planar coastline lengths and why would you include that. Not that I need coastline lengths for my current application, but still. If I ever do, I'll need to be careful to query "Coast_Geod" and not "IslandCoas".

Okay, so using PROPER areas this time:


              |   Geodesic area    |  Wikipedia  | Error
Borneo        | 718332.882620 km^2 | 748168 km^2 | 3.99%
Madagascar    | 589438.117224 km^2 | 587041 km^2 | 0.41%
Baffin Island | 509659.742371 km^2 | 507451 km^2 | 0.44%
Sumatra       | 425283.080335 km^2 | 443065 km^2 | 4.01%
It's not like I was expecting them to match. If they did, I'd have noticed sooner.

I seriously hope that the actual polygon coordinates are not stored in the Mollweide projection rather than something sensible.

Terrible data formats aside, I really think this is the dataset I'm looking for. Now to see if I can get the other half of my scheme to work...

EDIT: It looks like mainland Antarctica is missing from the dataset, even though Antarctic islands are present. (Or at least, Siple Island and Maher Island are included. Some other Antarctic islands seem to be missing.)

UPDATE: Okay, so it turns out that yes, the internal coordinates are in the Mollweide projection.

Why would you take a perfectly good dataset and then do this to it. Why.
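Getting longitude and latitude back out of Mollweide coordinates is at least a closed-form calculation. Below is a sketch of the spherical inverse, assuming a central meridian of 0° and a sphere whose default radius is roughly the authalic radius of WGS 84. Both are assumptions rather than anything read from the dataset, which may well use an ellipsoidal variant, and the function name `mollweide_inverse` is made up; in practice a projection library would normally do this.

```python
import math

def mollweide_inverse(x, y, radius=6_371_007.2, lon0_deg=0.0):
    """Spherical inverse Mollweide: projected coordinates -> (lon, lat) in degrees."""
    # Auxiliary angle theta from the y coordinate (clamped against rounding).
    s = y / (radius * math.sqrt(2.0))
    theta = math.asin(max(-1.0, min(1.0, s)))
    # Latitude from the auxiliary angle.
    sin_lat = (2.0 * theta + math.sin(2.0 * theta)) / math.pi
    lat = math.asin(max(-1.0, min(1.0, sin_lat)))
    # Longitude; at the poles it is undefined, so fall back to the central meridian.
    cos_theta = math.cos(theta)
    lon = math.radians(lon0_deg)
    if cos_theta > 1e-12:
        lon += math.pi * x / (2.0 * radius * math.sqrt(2.0) * cos_theta)
    return math.degrees(lon), math.degrees(lat)
```

On a unit sphere, for example, the point (√2, 0) lies on the equator a quarter of the way around from the central meridian, so it should map back to longitude 90°, latitude 0°.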

Also, the dataset appears to do the "splitting at the 180th meridian" thing. The dataset recognizes six continents: Australia, Africa, South America, North America, Eurasia, and Chukchi Peninsula. (They aren't actually named in the files - despite having names for minor islands, they didn't bother to name the continents for some reason - but I identified most of them by matching their areas, and Chukchi by deciphering its coordinates.)

Now that I'm looking, I can tell that Vanua Levu is likewise split, with the data explicitly having entries for "Vanua Levu east of dateline" (area 5784.13 km^2) and "Vanua Levu west of dateline" (area 281.57 km^2). That is confusing, since the two parts add up to 6065.70 km^2, while according to Wikipedia the whole island is only 5587.1 km^2. It isn't even using "dateline" correctly, since the international date line isn't simply the 180th meridian all the way through, and it definitely runs east of Vanua Levu.

I'm not certain what's up with other islands that are supposed to be on the 180th meridian. Taveuni only gets named in the dataset once.

ANOTHER EDIT: So it turns out this dataset thinks Delmarva is an island, instead of a peninsula. Among other issues.

Not that I care greatly about Delmarva specifically, but how do you mistake a peninsula for an island and then claim to have 30 meter accuracy? (According to Wikipedia, the narrowest point of the isthmus connecting Delmarva to the mainland is 19 kilometers.)