Always good to see geospatial analysis, especially with the toolchain available in Python, these days!
One of the downsides of forming polygons from line segments in this way is that it implicitly assumes every line has a vertex at each intersection with another line.
For example, something like this would fail ("o" represents a vertex):
o       o
 \     /
  \   /
   \ /
    X
   / \
  /   \
 o     o

(The two lines cross at X, but neither has a vertex there.)
You can handle this in a couple of ways:
1) Intersect the lines beforehand, ensuring that you wind up with a vertex at every intersection. (In shapely's case, you'd call `shapely.ops.cascaded_union(lines)`, renamed `unary_union` in shapely 2.x, before applying `polygonize`.)
2) Cut a bounding polygon with the lines, rather than forming polygons from individual line segments.
The second method has the advantage of ensuring that everything inside of the bounding polygon is split up into smaller, block-level polygons. Usually this is what you want, though for something like city blocks, you'd need to filter out non-urban areas (e.g. water).
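Both approaches can be sketched with shapely in a few lines. This is a minimal illustration, not the original poster's code: it assumes shapely 2.x (where `cascaded_union` has been replaced by `unary_union`), and the "#"-shaped test lines are arbitrary.

```python
from shapely.geometry import LineString, box
from shapely.ops import polygonize, unary_union

# A "#" pattern: four lines that cross, but no line has a vertex
# at any of the four crossing points.
lines = [
    LineString([(0, 1), (3, 1)]),
    LineString([(0, 2), (3, 2)]),
    LineString([(1, 0), (1, 3)]),
    LineString([(2, 0), (2, 3)]),
]

# polygonize on the raw lines finds no closed rings at all.
assert list(polygonize(lines)) == []

# Approach 1: node the lines first. unary_union inserts a vertex at
# every crossing, so polygonize can now trace the central square.
blocks = list(polygonize(unary_union(lines)))
print(len(blocks))     # 1
print(blocks[0].area)  # 1.0

# Approach 2: cut a bounding polygon with the lines, by noding its
# boundary together with them before polygonizing. Every point inside
# the box now lands in exactly one block-level cell.
bounding = box(0, 0, 3, 3)
cells = list(polygonize(unary_union(lines + [bounding.boundary])))
print(len(cells))      # 9
```

Note how approach 2 also returns the partial cells along the edges of the bounding box, which is why you'd typically filter the results afterward (e.g. dropping water or non-urban cells).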
Explicit intersections are typical for OSM. Crossings without an intersection (or layer information) are flagged by various QA tools (which is a euphemism for an error when you don't have a prescriptive data model).
What about bridges and tunnels? It seems like an ideal (or idealistic) solution is to clean up the source data to include intersections when there are intersections so they don't have to be inferred. Is that impractical or would it explode the size of the dataset?
Bridges and tunnels are still represented as linear features in most road datasets. In good ones, there's an attribute to mark it as a bridge or tunnel, and you can choose to use them or not.
Including intersections wouldn't significantly change the size of the dataset, but that type of clean-up is best done as a pre-analysis step. There's no good reason to do it in the "raw" data.
Furthermore, the result depends on the projection (and datum, etc.) you compute it in. The intersection of two lines in lat/long space is not at the same point as the intersection of the same lines under a different datum, and neither matches the intersection computed in a projected coordinate system.
Ideally, you'd only include the intersection if you've actually measured it at that location. It's best to leave the observations as observations and try not to alter them too much.
At any rate, by the time you're analyzing the data, you've already made the type of assumptions about cartesian vs. spherical vs. real space that I'm referring to above. Therefore, it's usually a good idea to defer cleanup that's specific to a particular operation (this one is) until you're doing that operation.
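The projection dependence is easy to demonstrate: intersect the same pair of segments once in raw lon/lat and once in spherical-Mercator coordinates. This is a toy sketch with hand-rolled Mercator formulas (to avoid a pyproj dependency) and arbitrary endpoints; the point is only that the two answers disagree.

```python
import math
from shapely.geometry import LineString

R = 6378137.0  # spherical-Mercator earth radius, metres

def to_mercator(lon, lat):
    return (R * math.radians(lon),
            R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)))

def from_mercator(x, y):
    return (math.degrees(x / R),
            math.degrees(2 * math.atan(math.exp(y / R)) - math.pi / 2))

a = LineString([(0, 0), (10, 10)])   # lon/lat, degrees
b = LineString([(0, 10), (10, 0)])

# Intersection treating lon/lat as a flat plane: (5, 5).
p_geo = a.intersection(b)

# The same segments intersected in Mercator, converted back to lon/lat.
a_m = LineString([to_mercator(*c) for c in a.coords])
b_m = LineString([to_mercator(*c) for c in b.coords])
p_merc = from_mercator(*a_m.intersection(b_m).coords[0])

# The latitudes disagree by roughly 0.02 degrees (a couple of km) here,
# purely because the straight-line interpolation happened in a
# different coordinate space.
print(p_geo.y, p_merc[1])
```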
The US Census Bureau has a TIGER database of street information. They use the "block face" as a primitive. A block face is one side of a street bounded by an intersection or the end of a street. This works even for streets that dead end. Not all block faces define the outlines of blocks. Addresses are associated with block faces.
The original post was trying to do it for Riga, Latvia, so they can't use US TIGER files.