Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom Output Columns #7

Open
amoslai5128 opened this issue Nov 14, 2024 · 3 comments
Open

Custom Output Columns #7

amoslai5128 opened this issue Nov 14, 2024 · 3 comments
Labels
question Further information is requested

Comments

@amoslai5128
Copy link

amoslai5128 commented Nov 14, 2024

Thank you for giving us this amazing map matching tool, it works so cool.

However, I'm figuring out how to custom the headers & columns in a output result csv, since the current output headers are like:
id,aborted,duration,track,prepared,match,track_length,prepared_length,match_length

And, in some cases I need to match the geo points from GPS records with their timestamp in the result, therefore, here is what I want to output:
id, aborted, duration, ..., *point_timestamp*, *speed*
* custom column header

For your reference, here's the input dataset:

License, Data Timestamp, Location Timestamp, Lat, Long, Speed Direction, Mileage, ACC Status, Location
Car0001, 2024-03-17 05:44:39, 2024-03-17 05:44:25, 22.34051,114.192995, 0.0, N,29719.2, Y, ABC Street
Car0001, 2024-03-17 05:44:45, 2024-03-17 05:44:25, 22.34051,114.192995, 0.0, N,29719.2, Y, ABC Street
Car3000, 2024-03-17 05:44:39, 2024-03-17 05:44:28, 22.34051,114.192995, 0.0, N,29719.2, Y, EFG Street
Car3000, 2024-03-17 05:44:45, 2024-03-17 05:44:28, 22.34051,114.192995, 23.0, NW12°,29719.3, Y, EFG Street
@addy90
Copy link
Collaborator

addy90 commented Nov 14, 2024

Hi @amoslai5128,

thank you for your nice words about my work and your question!

So, I want to understand your issue first. You are trying to match the points 1-by-1 so that for every original point you get the final matched point in the output, without loosing the original columns, right? Technically, this is not possible, because a match, which is a concatenation of road network edges, can contain more or less points than the original track, for example, because curves contain more points and straight lines contain less points than the original track. So how would you output exactly one output point for each input point in such a situation?

Here is an example of what I mean:
match
In the left panel is the original track with all points in red. In the middle panel is the track after the trajectory simplification algorithm has been applied in orange. In the right panel is the final match in green.

As you can see, in the curve section in the middle of the track, the match contains much more points than the original track. So for one input point, multiple output points would exist. Furthermore, in the left part of the track on the straight line, the original track contained in one part more points than the match, but the prepared track contained less points.

Let me elaborate on this: Currently, the output (the match) is given as a linestring, and because a 1-to-1 mapping is not feasible, the time and speed information is omitted. It would not make much sense any more in the match, as the points are completely different to the input points, the correlation would not be correct any more.
Furthermore, the trajectory simplification algorithm (which is on by default) already omits the time and speed information, since it is purely geometric. For details, please read section 2.1 in my article: https://doi.org/10.1111/tgis.13107.
The matching phase currently also only uses geometric metrics, it also does not use any time information.
The advantage of this approach, to omit any time information, is that any geometric (even hand-drawn) line can be matched with this approach. The disadvantage of omitting the time information is, of course, that it could help to improve the results by incorporating speed metrics. This was a design choice for this approach. We went for more general applicability.

Now for the first good part: The trajectory simplification can be completely disabled by applying --simplify-track off --median-merge off (maybe also --filter-duplicates off if you have duplicate points). However this slows down the matching performance (both in terms of speed and quality). For details, please refer to section 3.3 in my article: https://doi.org/10.1111/tgis.13107.

With the trajectory simplification disabled (or by implementing a trajectory simplification algorithm that does not omit the time information), it would technically be possible to retain the time information throughout the process. However, the first issue with the example image above cannot be solved this way. Outputting the result as a list of points would still not correlate 1-to-1 with the input points and it would be totally questionable where to put which timestamp and speed information. It would be possible to interpolate the values, but this would also be an additional assumption (the match itself already is an assumption, we cannot know if it is correct to the ground truth).

Nevertheless, there exists information about to which point of the match each input point was mapped to. This result can be obtained by applying --candidates candidates.csv. It gives a file that contains for each input point the candidates. When the column is_result is true, the point is the one from the match. With this file, you can see for each input point each output point 1-to-1. But the complete match linestring still contains more or less points as the input track.

Here is an example of the is_result = 1 column:
candidate
In blue there is for each input point of the prepared original track the segment that points to the output point in the match. As you can see, the left input point points exactly on an explicit output point of an edge. The right input point, however, points directly on an edge but without an explicit point on the edge. There was a temporary candidate point that was removed in the output with the --join-merges option. This is an additional example why a 1-to-1 point-to-point mapping is not really practical.

When you disable the trajectory simplification completely, you would in fact get for each input point of the original track exactly one output point of the match in the candidates.csv file, although with worse matching quality, as already explained. Moreover, as we have seen, these points are not necessarily contained in the match. But you get for each point a "mapping" to where that point was put in the match. However, keep in mind that the match (in green) generally has a different amount of points as the original track.

With that result, you could theoretically join the results from the candidates.csv file with your input file and append the original timestamp and speed column. Since no points were removed during the matching, you can simply concat the tables (when the input table is sorted by id and timestamp (because the timestamp is used as sorting criteria when importing points for matching) and the candidates table is sorted by id, part_index, set_index, candidate_index).

The result of this can look like this:
mapping
As you can see, the trajectory simplification was turned off and this way you get a mapping from input to output. Although some of the points are not contained in the match linestring explicitly. The red points are the original input points, the green points are the "pointing_coordinate" points from the code below.

The code that was used to combine the data is here:

# !/usr/bin/env python3
# -*- coding: utf-8 -*-

import numpy as np
import pandas as pd
import geopandas as gpd
import shapely as shp

input_file = "points.csv"
output_file = "merged.csv"
candidates_file = "candidates.csv"

input = pd.read_csv(input_file, parse_dates=["time"])
candidates = pd.read_csv(candidates_file)
results = candidates[candidates["is_result"] == 1].copy()

input.sort_values(by=["id", "time"], inplace=True)
results.sort_values(by=["id", "part_index", "set_index", "candidate_index"], inplace=True)

results.reset_index(drop=True, inplace=True)

assert len(input) == len(results)

projections = gpd.GeoDataFrame(results.drop(columns=["projection"]),
                               geometry=gpd.GeoSeries.from_wkt(results["projection"]))
projections["pointing_coordinate"] = projections['geometry'].apply(
    lambda x: shp.geometry.Point(x.coords[1]) if (len(x.coords) > 1) else np.nan)

input["pointing_coordinate"] = projections["pointing_coordinate"]

input.to_csv(output_file, index=False)

You get as output a file where each input point has the output point from the match.
You need the --simplify-track off --median-merge off --filter-duplicates off --candidates candidates.csv settings applied.
You might also need to change the parse_dates and sort columns in the code above to the columns you use, e.g., License and Location Timestamp.

With trajectory simplification enabled, this becomes quite impossible, because the median-merge algorithm may introduce new points that were not in the original input track, but this only happens in small noisy situations. However, a 1-to-1 mapping then is practically impossible. You can try to re-map the prepared track points by distance to the original points, but this might not work well when the track goes through the same position multiple times. You can of course also omit these points and make a join with only the points that were also present in the original tracks (the x.coords[0] contains the starting point from the blue segments, which should lie close to the original point), but this needs additional care because of floating point inaccuracies between the original input point and the output point in the candidates.csv.

Please use the new version v1.0.5 when you want to try out with disabled trajectory simplification. I noticed a bug while preparing the above example and fixed it in advance of this reply.

So TL;DR: We cannot output the original timestamps and speed (or any other columns) from the original track input in the match output because the input and output are too different in amount and position of points. There are, however, ways with certain sacrifices in accuracy and speed, that allow to combine the input and output, as I have shown above. As this is a very specific and individual use-case, I don't really know currently how this could be implemented directly in this software in a way that it works generally due to all the explanations that I have given. But there are workarounds with individual code, as you see. The outputs of course can be later combined in individual tools. I hope you see it the same way as I do. If not, please let me know!

@addy90 addy90 added the question Further information is requested label Nov 14, 2024
@amoslai5128
Copy link
Author

amoslai5128 commented Nov 15, 2024

I appreciate your detailed and patient repsonse and as a newcomer to this area, it takes a bit more time for me to study the concepts. Meanwhile, here's a picture to better describing to my first question.

I was trying to find out the data from the orginal geo-point (red dot) to matched geo-point (green dot).
Section 1 (1)

With the trajectory simplification disabled (or by implementing a trajectory simplification algorithm that does not omit the time information), it would technically be possible to retain the time information throughout the process. However, the first issue with the example image above cannot be solved this way. Outputting the result as a list of points would still not correlate 1-to-1 with the input points and it would be totally questionable where to put which timestamp and speed information. It would be possible to interpolate the values, but this would also be an additional assumption (the match itself already is an assumption, we cannot know if it is correct to the ground truth).

Section 2
Q1:
The reason why I want to extract the speed and timestamp information is because I want to countermeasure the speed by its time & distance for a car drove across a road. I'm thinking is there a way to let a green point carries out the certain information from a nearest red point (if any, <= 0~10M). Spatial left join or something better?

Q2:
Also, there's a road network file inputted at the beginning, then how can I find out the road information (like route ID) in the result?

Thank you!

@addy90
Copy link
Collaborator

addy90 commented Nov 15, 2024

Hi, thank you for precising your questions!

I hope that I understand them better now and will try to response to them. First some more comments to your statements:

Concerning your first picture, you get the information from the original geo point to the matched geo point from the --candidates <filename> setting, which I already explained in my previous post. You can add --help to the application for more information or consult the help.txt file.
The candidates file contains in the projection column a segment (as WKT LineString) that points from the original geoposition to the matched geo position. These are the blue arrows from my previous example pictures.

Concerning "dropped original points", there are two ways currently how an original point is "dropped". The first way is via the multiple trajectory simplification algorithms (filter-duplicates, simplify-track, and median-merge). This actually drops points and the median-merge may introduce new ones, as explained in my article in section 2.1: https://doi.org/10.1111/tgis.13107.
The second way is the candidate adoption feature, please refer to section 2.3 in my article: https://doi.org/10.1111/tgis.13107. This does not "drop" the points from the track, but collapses multiple points into the same match point, therefore virtually dropping unnecessary points.

See for example this old image from this repository (shows a track from the Kubicka et. al dataset:
example-2
You can see that, the red original track was simplified (not shown directly in the image) and from the simplified track the orange projections from original point to match point are shown (they were blue in my previous post, here they are orange). Moreover you can see that some points collapse into the same match points, this is the candidate adoption feature's doing.

So "dropping original points" is done in two ways, one explicit way in the trajectory simplification and one stochastic implicit way in the actual matching optimization algorithm backed by the candidate adoption feature.

You can turn off every of these features individually, not only trajectory simplification, but also candidate adoption! Then no "dopping" occurs at all. However, deactivating both features reduces the matching quality (and depending on the options also the speed) drastically , as can also be seen in my article in the benchmark results in section 3.3: https://doi.org/10.1111/tgis.13107.

So the --candidates <filename> export gives other projections from the original point to the match point (for the rows where is_result == 1) depending on your settings. Note to myself: Maybe an additional option to export only the result candidates would be good, the candidates file contains all candidates, not only the ones used in the final match.

One additional note: The points that you named "generated" in this software are in fact referenced from the road network edges. The "improved" points are computed points on these edges (might also reuse an existing point if they are spatially very close), they are contained in the projection column in the candidates.csv file, but they are often removed in the final match (the --join-merges option), because each "generated" point is slightly off due to floating point and algorithmic computation inaccuracies. Removing these very slightly off "join" points between routes between two track points improves the result quality. For making sure that the match is most accurate towards the underlying road network, all points that can be reused are reused in an exact copy. However, the first and last point usually are such generated points, as tracks often start and end in the middle of a road where no point in the edge exists up to that moment.

Concerning your questions:

Q1: I think the only valid way would be the one that I explained in my previous post. The python code gives out as result the input file but with an added column of the green match points (pointing_coordinates). This should fulfil your needs as you have both the original red points, the new green match points, and the time and speed information all aligned and correlated in one file. It does work correctly only when you disable all trajectory simplification, as explained. To further elaborate: As soon as you use the trajectory simplification algorithms, the time and speed information is dropped in a way that it cannot be restored correctly anymore, as explaind in my previous post. I recommend that you follow my example from my previous post and combine the file that you can generate with the --candidates <filename> option (i.e., candidates.csv) with your original data with a script like the one that I posted. Anything you do with "nearest spatial join" would not correctly join, see the picture from above, the projections here are not to the nearest road position but sometimes point to a position further away. This map matching software uses a stochastic process to determine the best position (concerning the defined metrics) from a range of positions (called candidates). Please refer to section 1.2 from my article for the basic understanding of what I mean and also to section 2.2 for further understanding of what I mean: https://doi.org/10.1111/tgis.13107. So a "nearest spatial join" will probably not work from the matched geo position to the original position. However, what I already said in my previous post: You can try to use the projection column from the candidates.csv file; the first coordinate is spatially close to the original position (the difference are mere floating point inaccuracies, because the track is imported as text from a csv file and converted into a double binary in that process) for doing a spatial join with the original data, even with simplified trajectories. A simple nearest spatial join still might not work correctly in all situations, because when you have duplicate points with the same spatial coordinate in the original data, but different timestamps, you don't know with which of these points to join with. And the median-merge algorithm may introduce new points (but within a defined range that you can use as limit in your spatial join, that you can configure yourself, see --median-merge-distance-tolerance (and also disable --adaptive-median-merge to be safe, as this can raise the distance tolerance in certain situations). But with a more sophisticated algorithm, or a heuristic approach (e.g., choose the first point of the duplicates) you might find a solution that is well enough. To be honest, these thoughts give me some ideas how a general solution could look like that I could implement in this software. However, since the use-case is very specific, I currently cannot work on this particularly.

Q2: You can use the edge_ids column that is not on by default with the --columns <arg> setting. Please also prepare your network with the --simplify osm-aware setting in this case. The second setting makes sure that the ids of the road network edges remain intact during import. The default setting merges edges without retaining the correct ids, because the resulting network is smaller and faster to match on, but when you need the original edge ids, this does not work of course. With the osm-aware setting, the edge ids and also tags are retained, which slows down the matching speed slightly, but not the quality. With the extra colum edge_ids you get then exported in the matches.csv result file the list of ids of the original edges from the road network that you can then further use. The concatenated geometries then are close to the exported match, although in the beginning and ending, there might be differences, when a match starts and ends within an edge. The whole edge (by id) contains more information than the actual match in such cases. In case you want your matches to fit the edges completely, use the --export-edges <arg> option so that the list of edge_ids and the match linestring are of equal geometry information. However, this may lengthen the matches in cases where a track started or ended somewhere in the middle of a road, as then the complete road is returned instead of only the part of the road that was actually driven. Furthermore, there is no "route id" as such, as a route consists of the list of edges (and / or partial edges when --export-edges is off, which is the default). With the list of the edge ids, you can, however, join the original route information to the result back. The edges information can also be obtained via the --export-edges-csv-file <arg> option that you can specify during preparation, then you have a csv file that contains the edge ids and tags that you can merge with the edge ids from the result.

I hope that I could help you further!

Your concepts are very interesting and I see all of your ideas working in practice, although in some cases sacrifices in speed and / or quality are needed. The general use-case of this software is to find a most accurate match for a given track, so use-cases like yours, which are very interesting, however, need more deep diving and some extra work. However, I had ongoing research in mind, which is the reason why all these special options (and more) exist, that I explained here.

Please let me know if I can be of further help. I hope you understand that passing the time / speed information within this software directly to the result is currently out of scope of what I can do, because of the challenges with the trajectory simplification algorithms, that were designed with only spatial but currently not temporal map matching in mind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants