Avoid more duplicate spots #93

Open
tillwenke opened this issue Oct 11, 2024 · 11 comments

@tillwenke
Collaborator

Suggest reviewing an existing spot when a new spot is added within, e.g., 100 m of an existing one.

Maybe we need a clever data structure to quickly get all spots close to a new spot.

@bopjesvla
Owner

> Maybe we need a clever data structure to quickly get all spots close to a new spot.

classic cs student. maybe once we have 1 million spots.

we could create a dedicated review page that quickly jumps from one possible duplicate spot to another

@tillwenke
Collaborator Author

tillwenke commented Oct 12, 2024

I think a dedicated page does not fit my requirements.
I was thinking of a user adding a spot: if they try to add it close to an existing spot, we intervene and ask whether they would like to review the existing nearby spot instead.
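
A minimal sketch of such a proximity check, assuming spots are available as dicts with hypothetical keys `id`, `lat` and `lon` (not the actual Hitchmap schema). It scans all spots by brute force with the haversine formula, which should be fine well below a million spots; the 100 m default is the example threshold from this issue.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby_spots(new_lat, new_lon, spots, radius_m=100):
    """Return (distance_m, spot) pairs within radius_m, nearest first.

    `spots` is an iterable of dicts with the hypothetical keys 'id', 'lat', 'lon'.
    """
    hits = []
    for spot in spots:
        d = haversine_m(new_lat, new_lon, spot["lat"], spot["lon"])
        if d <= radius_m:
            hits.append((d, spot))
    hits.sort(key=lambda pair: pair[0])
    return hits
```

If `nearby_spots(...)` returns a non-empty list, the UI could offer the first entry for review instead of creating a new spot.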

@tillwenke
Collaborator Author

Somehow some users are not aware that reviewing instead of adding a spot is an option.

@bopjesvla
Owner

I think I already added this before, but it's tough to explain without a lot of text. This should make deduplication unnecessary though: #46

@bopjesvla
Owner

You had already clustered the points, correct? I think we can use that clustering to merge points on the front-end. If you have a script that outputs (lat, lon, cluster_id) for every point, that should be easy
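
A minimal sketch of a script producing that (lat, lon, cluster_id) output. The clustering method actually used for the existing data is not described in this thread, so this stand-in uses plain single-linkage clustering with a union-find and a hypothetical 50 m threshold. It compares all pairs, so it is only meant for modest point counts.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6_371_000 * math.asin(math.sqrt(a))


def cluster_points(points, max_dist_m=50):
    """Single-linkage clustering: points within max_dist_m end up in one cluster.

    `points` is a list of (lat, lon) tuples.
    Returns a list of (lat, lon, cluster_id) in input order.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union every pair of points that are close enough (O(n^2) comparisons)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_m(*points[i], *points[j]) <= max_dist_m:
                parent[find(i)] = find(j)

    # Renumber the roots to consecutive ids in order of first appearance
    ids = {}
    result = []
    for i, (lat, lon) in enumerate(points):
        cluster_id = ids.setdefault(find(i), len(ids))
        result.append((lat, lon, cluster_id))
    return result
```

The front-end could then group markers by `cluster_id` and show one merged spot per cluster.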

@tillwenke
Collaborator Author

I came here after reporting some duplicates (as one can see at https://hitchmap.com/dashboard.html). I saw a lot of recent duplicates as well and fear that new ones will appear while we are still cleaning up.

Tough to solve with #46, as e.g. spots on opposite sides of a road can be quite close.

I'd like to avoid text as well. How about this at the end of the process:

  • (if there is a nearby spot) "We will add your review to this spot. Are you OK with that?"
  • if not (and there is another nearby spot): "Do you want to select another spot to add your review to?" -> select
  • if not: OK, we'll keep your new spot

I could live without the 2nd option.
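
The flow in the list above can be sketched as a small decision function. The function name, its parameters and the returned tuples are hypothetical and only illustrate the order of the questions; they are not taken from the Hitchmap code.

```python
def resolve_new_review(nearby_spot_ids, accept_nearest, chosen_spot_id=None):
    """Decide where a review goes.

    nearby_spot_ids: ids of existing spots near the submitted location, nearest first.
    accept_nearest:  the user's answer to "We will add your review to this spot".
    chosen_spot_id:  optional pick from the remaining nearby spots (2nd question).

    Returns ("existing", spot_id) or ("new", None).
    """
    if not nearby_spot_ids:
        # Nothing nearby: no question is asked, the new spot is created
        return ("new", None)
    if accept_nearest:
        return ("existing", nearby_spot_ids[0])
    if chosen_spot_id in nearby_spot_ids[1:]:
        return ("existing", chosen_spot_id)
    # The user declined every suggestion, so the new spot is kept
    return ("new", None)
```

Dropping the 2nd option amounts to never passing `chosen_spot_id`.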

@bopjesvla
Owner

> tough to solve with #46 as e.g. spots on opposite sides of a road can be quite close.

I encountered this too, but it can be solved. I asked ChatGPT for a solution a few days ago; this is what it came up with:

import requests
import pandas as pd
from scipy.spatial import KDTree
from shapely.geometry import Point, LineString
from shapely.ops import nearest_points

# Function to query OSM Overpass API to get the nearest road and its geometry
def get_nearest_road_geometry(lat, lon):
    # Overpass API query to find the nearest highway and get its geometry
    overpass_url = "http://overpass-api.de/api/interpreter"
    overpass_query = f"""
    [out:json];
    way(around:50,{lat},{lon})["highway"];
    (._;>;);
    out body geom;
    """
    response = requests.get(overpass_url, params={'data': overpass_query})
    data = response.json()

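    # Extract road geometry (as a list of coordinates forming the polyline).
    # The query also returns the ways' member nodes, which carry no 'geometry',
    # so pick the first element that is a way. Note that this is the first way
    # within 50 m, not necessarily the nearest one.
    ways = [el for el in data.get('elements', [])
            if el.get('type') == 'way' and 'geometry' in el]
    if ways:
        road_element = ways[0]
        # Return the road ID and the LineString geometry of the road
        road_id = road_element['id']
        road_geom = LineString([(pt['lon'], pt['lat']) for pt in road_element['geometry']])
        return road_id, road_geom
    return None, None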

# Function to check if two points are on the same side of the road
def are_points_on_same_side(point1, point2, road_geom):
    # Calculate nearest points on the road for both points
    nearest_p1 = nearest_points(point1, road_geom)[1]
    nearest_p2 = nearest_points(point2, road_geom)[1]
    
    # Compare the distances from each point to the road.
    # NOTE: shapely's distance() is unsigned (always >= 0), so this product is
    # positive for any two points that are off the road. As written, the check
    # cannot tell the two sides of the road apart.
    distance1 = point1.distance(nearest_p1)
    distance2 = point2.distance(nearest_p2)
    
    return (distance1 * distance2) > 0

# Function to query OSM for service areas
def get_service_area(lat, lon):
    overpass_url = "http://overpass-api.de/api/interpreter"
    overpass_query = f"""
    [out:json];
    (node(around:50,{lat},{lon})["amenity"~"parking|fuel|service_area"]["highway"~"service|rest_area"];
    way(around:50,{lat},{lon})["amenity"~"parking|fuel|service_area"]["highway"~"service|rest_area"];
    relation(around:50,{lat},{lon})["amenity"~"parking|fuel|service_area"]["highway"~"service|rest_area"];
    );
    out body;
    """
    response = requests.get(overpass_url, params={'data': overpass_query})
    data = response.json()

    # Extract the service area ID (or other identifying information)
    if 'elements' in data and len(data['elements']) > 0:
        # Return the ID of the first matching service area
        return data['elements'][0]['id']
    return None

# Sample DataFrame with coordinates
df = pd.DataFrame({
    'x': [52.5200, 52.5201, 52.5202],  # latitudes
    'y': [13.4050, 13.4051, 13.4052]   # longitudes
})

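# KDTree for efficient neighbor search.
# KDTree measures distance in the units of its input, so convert degrees to
# approximate metres first (equirectangular approximation, fine at this scale).
import math
m_per_deg_lat = 111_320  # approximate metres per degree of latitude
m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(df['x'].mean()))
coords = pd.DataFrame({
    'north_m': df['x'] * m_per_deg_lat,
    'east_m': df['y'] * m_per_deg_lon
}).values
tree = KDTree(coords)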

# Define distance threshold
distance_threshold = 50  # 50 meters

# Find nearby points
neighbors = tree.query_ball_point(coords, distance_threshold)

# Initialize lists to store road IDs, geometries, and service areas
df['road_id'] = None
df['road_geom'] = None
df['service_area'] = None

# Query road segment and geometry for each point
for idx, row in df.iterrows():
    lat, lon = row['x'], row['y']
    
    # Query nearest road
    road_id, road_geom = get_nearest_road_geometry(lat, lon)
    df.at[idx, 'road_id'] = road_id
    df.at[idx, 'road_geom'] = road_geom
    
    # Query service area
    service_area_id = get_service_area(lat, lon)
    df.at[idx, 'service_area'] = service_area_id

# Check for each pair of nearby points
same_side_or_service_pairs = []
for i, nearby in enumerate(neighbors):
    for j in nearby:
        if i != j:
            road_id_i = df.loc[i, 'road_id']
            road_id_j = df.loc[j, 'road_id']
            service_area_i = df.loc[i, 'service_area']
            service_area_j = df.loc[j, 'service_area']
            
            # Check if they are on the same road and the same side
            if road_id_i == road_id_j:
                point1 = Point(df.loc[i, 'y'], df.loc[i, 'x'])  # (lon, lat)
                point2 = Point(df.loc[j, 'y'], df.loc[j, 'x'])  # (lon, lat)
                road_geom = df.loc[i, 'road_geom']
                
                if road_geom and are_points_on_same_side(point1, point2, road_geom):
                    same_side_or_service_pairs.append((i, j))

            # Check if both points are in the same service area
            elif service_area_i and service_area_j and service_area_i == service_area_j:
                same_side_or_service_pairs.append((i, j))

print("Pairs of nearby points on the same road side or service area:", same_side_or_service_pairs)

Dunno if it works, but something like it probably will work. Even if we make the occasional mistake, as long as clicking the spot shows the data where it was reported, all's good.

@bopjesvla
Owner

bopjesvla commented Oct 16, 2024

Note: to check if two points are on the same side I'd probably draw a line between the points and see if it intersects with the road, don't know if signed distances really exist
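
A minimal sketch of that line-intersection check, written with the standard library only so it can be run without extra dependencies. It treats (lon, lat) as planar coordinates, which is an assumption that should hold at the scale of a few tens of metres. The function names are hypothetical; with shapely, the same test could be written as `LineString([p1, p2]).intersects(road_geom)`.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1 left turn, -1 right turn, 0 collinear."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (cross > 0) - (cross < 0)


def segments_cross(p1, p2, q1, q2):
    """True if segment p1-p2 properly crosses segment q1-q2.

    Touching endpoints and collinear overlaps are not counted as crossings.
    """
    return (_orient(p1, p2, q1) * _orient(p1, p2, q2) < 0
            and _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0)


def on_same_side(point1, point2, road):
    """True if the straight line between the two points crosses no road segment.

    `road` is a polyline given as a list of (x, y) vertices.
    """
    return not any(
        segments_cross(point1, point2, a, b) for a, b in zip(road, road[1:])
    )
```

One caveat: if the connecting line passes beyond the end of the fetched road geometry, the points are reported as being on the same side even when they are not, so the road polyline should extend past both points.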

@bopjesvla
Owner

Yeah signed distances definitely appear to be a hallucination, other than that I think it's very close

@tillwenke
Collaborator Author

I did similar things around here https://github.com/Hitchwiki/hitchmap-data/tree/main/cleaning

In addition, we should come up with a way to educate users so that they do not pollute the map further.

@bopjesvla
Owner

It's not polluting if we can handle it :)

I'm all for people logging exactly where they stood as long as it doesn't mess up the map
