
3 group surveys to fix floating point issue #4

Merged
merged 7 commits into main from 3-group-surveys-to-fix-floating-point-issue
Oct 27, 2025

Conversation

adriahso
Member

@adriahso adriahso commented Oct 25, 2025

Issue number

Closes #3

Description

Updated the lasConverter to group and crop surveys to level-2 H3 cells. This prevents the floating-point precision issue that occurs when dealing with large areas.
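For context, here is a minimal sketch of the grouping idea (assuming the h3-py v4 API and shapely; the helper names are illustrative, not the actual functions in lasConverter.py):

```python
import h3
from shapely.geometry import Polygon

def level2_cells(level1_cells):
    """Expand each level-1 H3 cell into its level-2 children (h3-py v4 API)."""
    cells = []
    for c1 in level1_cells:
        cells.extend(h3.cell_to_children(c1, 2))
    return cells

def cell_polygon(cell):
    """Build a shapely Polygon in (lng, lat) order from an H3 cell boundary."""
    boundary = h3.cell_to_boundary(cell)  # returns (lat, lng) pairs
    return Polygon([(lng, lat) for lat, lng in boundary])

# Each survey footprint is then matched against these much smaller polygons and
# its points are cropped per cell, so every output LAS covers a small area.
```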

Testing steps

  • Try converting all surveys by following the instructions in the README. The metadata.csv and level1_h3_cells.csv files are uploaded to our Google Drive.
  • Also run the resulting LAS files through the PotreeConverter and open the output in MolloyExplorer to check that the conversion is correct and that the extra attributes are kept.

Summary by CodeRabbit

  • New Features

    • Processes surveys into per-cell outputs and organizes final files into ./h3_cells for easier per-cell analysis and delivery.
    • Improved processing workflow with more robust handling of coordinate reference systems and per-survey validation.
  • Documentation

    • Reworked setup into a "Prepare Files and Folders" section and simplified run instructions; script runs with a consolidated flow and saves outputs to ./h3_cells.
  • Chores

    • Updated .gitignore to ignore several project directories (surveys, converted_surveys, temp_cells, metadata, h3_cells).
    • Updated dependencies to include h3, pdal, and shapely.

@adriahso adriahso self-assigned this Oct 25, 2025
@adriahso adriahso linked an issue Oct 25, 2025 that may be closed by this pull request
@coderabbit

coderabbit bot commented Oct 25, 2025

Walkthrough

Refactors LAS conversion to group surveys by H3 cells, adds public functions (prepare_cells, CSV_2_LAS, group_by_h3_cell), enhances CRS and H3 handling with per-survey reprojection and PDAL cropping, updates README and requirements, and converts .gitignore to directory-level entries.

Changes

Repository Ignore Rules (.gitignore)
Replaced path-specific ignores (e.g., /surveys/*, /output_las/*) with directory-name entries (surveys, converted_surveys, temp_cells, metadata, h3_cells); retained .DS_Store, .venv; removed an explicit comment line.

Dependencies (requirements.txt)
Added h3, pdal, and shapely; removed pathlib; net effect for pyproj unchanged.

Documentation (README.MD)
Reworked setup and usage: added "Prepare Files and Folders" section, replaced prior activation instructions with explicit venv creation and install steps, simplified invocation to python lasConverter.py, and documented processed LAS outputs saved to ./h3_cells.

Conversion Pipeline (lasConverter.py)
Introduced API functions prepare_cells(metadata: pd.DataFrame, level1_cells: List[str]) -> Tuple[DefaultDict[str, List[str]], Dict[str, Polygon]], CSV_2_LAS(metadata: pd.DataFrame, chunk_size_bytes: str="64MB") -> None, and group_by_h3_cell(cell_to_surveys, cell_polygons). Added WKT loading and reprojection to EPSG:4978, level-2 H3 generation and cell polygons, per-survey CRS handling with fallback to EPSG:25832, out-of-core CSV→LAS chunking and merging, PDAL-based per-cell cropping, validation/logging, temp-file cleanup, and a new __main__ orchestration.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as Main (__main__)
    participant Prep as prepare_cells()
    participant CSV as CSV_2_LAS()
    participant Group as group_by_h3_cell()

    CLI->>Prep: load metadata + level1_h3_cells
    note right of Prep #d9f0ff: generate level‑2 H3 cells\ncompute cell_polygons\nmap surveys → cells
    Prep-->>CLI: cell_to_surveys, cell_polygons

    CLI->>CSV: pass metadata
    note right of CSV #f7fbf0: per-survey CSV → LAS\nper-row CRS handling + fallback\nchunked writes → merge
    CSV-->>CLI: per-survey merged LAS files

    CLI->>Group: pass cell_to_surveys, cell_polygons
    rect rgb(255,249,230)
      note right of Group #fff9e6: iterate level‑2 H3 cells\ncrop merged survey LAS with PDAL\nwrite per-cell LAS outputs (`./h3_cells`)
    end
    Group-->>CLI: per-cell LAS outputs / logs
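Read as code, the orchestration in the diagram corresponds roughly to the sketch below (the CSV paths and the cell column name are assumptions based on the PR description, not confirmed file locations):

```python
import pandas as pd

from lasConverter import CSV_2_LAS, group_by_h3_cell, prepare_cells

if __name__ == "__main__":
    # Assumed inputs: metadata.csv and level1_h3_cells.csv from the shared Drive,
    # placed in the local metadata/ folder (which .gitignore now excludes).
    metadata = pd.read_csv("metadata/metadata.csv")
    level1_cells = pd.read_csv("metadata/level1_h3_cells.csv")["cell"].tolist()

    # 1. Derive level-2 H3 cells, build their polygons, and map surveys to cells.
    cell_to_surveys, cell_polygons = prepare_cells(metadata, level1_cells)

    # 2. Convert each survey CSV to LAS in chunks and merge per survey (EPSG:4978).
    CSV_2_LAS(metadata)

    # 3. Crop the merged survey LAS files per level-2 cell into ./h3_cells.
    group_by_h3_cell(cell_to_surveys, cell_polygons)
```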

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Areas requiring extra attention:
    • lasConverter.py: CRS reprojection (WKT → EPSG:4978), per-row transformations, and fallback logic.
    • H3 generation: level-1 → level-2 cell derivation and polygon reprojection accuracy.
    • PDAL integration and per-cell crop pipelines, including error handling and temp-file lifecycle.
    • README vs implementation: ensure documented invocation and output directories match code behavior.
    • Requirements: compatibility and version constraints for pdal, h3, and shapely.

Poem

🐰 Hopping through hex cells, tidy and bright,
Surveys sorted by moon and light.
Chunks reprojected, cropped with care,
Little LAS files nestle there.
A rabbit cheers — precise and fair!

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.

Title Check ✅ Passed: The PR title "3 group surveys to fix floating point issue" directly corresponds to the main objective of the pull request, which is to address issue #3 by grouping surveys into H3 cells to prevent floating point precision problems. The changes across lasConverter.py (introducing prepare_cells, updated CSV_2_LAS, and group_by_h3_cell functions), README.md (updated workflow documentation), and .gitignore (new directories for h3_cells and metadata) all support this core objective. While the phrasing uses "3" (likely referencing issue #3) and could be more concise, the title is clear enough that a teammate reviewing the history would understand the primary change is about grouping surveys to fix floating point issues.

Linked Issues Check ✅ Passed: The code changes comprehensively address all objectives from issue #3. The implementation introduces prepare_cells to map surveys to level-2 H3 cells, updates CSV_2_LAS to handle per-survey processing with per-survey CRS handling, and adds group_by_h3_cell to crop LAS files to specific H3 cells using PDAL. These changes directly fulfill the requirements to group surveys into smaller chunks via H3 cell boundaries, which prevents floating point precision errors when working with large areas. The supporting documentation updates in README.md reflect the new workflow, and requirements.txt additions (h3, pdal, shapely) provide the necessary dependencies for this functionality.

Out of Scope Changes Check ✅ Passed: All changes in the pull request are directly related to implementing the H3 cell grouping feature described in issue #3. The .gitignore modifications add entries for the new output directories (h3_cells, metadata) created by the feature, README.md updates document the modified workflow, requirements.txt additions provide the necessary dependencies (h3, pdal, shapely), and lasConverter.py contains the core implementation. No changes appear to introduce unrelated functionality or address separate concerns outside the scope of grouping surveys by H3 cells to fix floating point precision issues.

Docstring Coverage ✅ Passed: Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.



@coderabbit coderabbit bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (12)
requirements.txt (1)

5-8: PDAL installability and version pinning

pdal often requires system/conda packages; plain pip install pdal may fail on CI or Windows. Also none of the new deps are pinned, which can introduce breaking changes.

  • Document/automate installation (e.g., conda-forge: pdal, python-pdal, gdal, proj, geos).
  • Consider pinning: pyproj, shapely, h3, pdal to versions you’ve validated locally/CI.
README.MD (2)

61-68: Call out PDAL/GDAL/PROJ install path

Mention that PDAL (and GDAL/PROJ) may require conda/OS packages; provide a conda-forge snippet. This prevents install blockers.


3-9: Clarify CRS assumptions

You state conversion “using EPSG:4978.” Please add:

  • What CRS metadata.geom is in (currently assumed 4258 in code).
  • What CRS the input CSV coordinates are in (metadata.epsg).
  • That cropping is done in a consistent CRS (see code fixes below).
lasConverter.py (8)

226-229: Don’t rely on symlinks/globs; pass explicit file list

Symlinks fail on Windows without admin privileges and add I/O churn. PDAL accepts filenames (array), so build input_files directly and drop the temp dir and glob.

Apply:

-        # PDAL filename glob pattern
-        input_pattern = str(temp_folder / "*.las")
+        # Use explicit filenames list for PDAL
+        input_pattern = None  # not used; we pass `filenames` above

Also remove the temp folder creation and symlink block above.
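One portable way to pass explicit inputs, if a single readers.las stage cannot take a list in your PDAL build, is one reader stage per file followed by filters.merge (a sketch; input_files, the extra_dims string, and the per-cell output path mirror names used elsewhere in this PR):

```python
import json
import pdal

def build_crop_pipeline(input_files, cell_poly_wkt, out_path,
                        extra_dims="Accepted=uint8,TVU=float32,THU=float32"):
    """Read explicit LAS files, merge them, crop to the cell polygon,
    and write one LAS per H3 cell, preserving the extra dimensions."""
    stages = [{"type": "readers.las", "filename": f, "extra_dims": extra_dims}
              for f in input_files]
    stages += [
        {"type": "filters.merge"},
        {"type": "filters.crop", "polygon": cell_poly_wkt},
        {"type": "writers.las", "filename": out_path, "extra_dims": extra_dims},
    ]
    return pdal.Pipeline(json.dumps({"pipeline": stages}))

# Usage: build_crop_pipeline(files, cell_poly.wkt, str(H3_CELLS_PATH / f"{c2}.las")).execute()
```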


255-259: Catch a narrower exception and prefer non-streaming execution

Use p.execute() unless you’ve validated the pipeline is streamable; catch RuntimeError (or PDAL’s specific exception) instead of a blind Exception.

Apply:

-        try:
-            p.execute_streaming(chunk_size=500000)
+        try:
+            p.execute()
             print(f"✅ Written H3 cell {c2}.")
-        except Exception as e:
+        except RuntimeError as e:
             print(f"⚠️ Pipeline failed for cell {c2}: {e}")
-        finally:
-            shutil.rmtree(temp_folder)
+        # temp folder removal no longer needed if symlinks are removed

82-99: Document/verify EPSG fallback and dtype

Defaulting missing epsg to 25832 may be wrong for surveys outside UTM32. Consider:

  • Using the survey centroid to pick the ETRS89/UTM zone (sketched below), or
  • Defaulting to 4326 and explicitly reprojecting via PDAL later.
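A minimal sketch of the centroid-based option, assuming the survey footprint is available in EPSG:4326 and lies where ETRS89/UTM zones are defined:

```python
def etrs89_utm_epsg(geom_4326):
    """Pick an ETRS89 / UTM EPSG code from a footprint centroid in EPSG:4326.
    ETRS89 / UTM zone N corresponds to EPSG:25800 + N (e.g. zone 32 -> 25832)."""
    lon = geom_4326.centroid.x
    zone = int((lon + 180) // 6) + 1
    return 25800 + zone

# Possible fallback (geom_4326 column as proposed in the CRS comments above):
# metadata["epsg"] = metadata["epsg"].fillna(metadata["geom_4326"].map(etrs89_utm_epsg))
```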

142-161: Consider setting a consistent CRS and scale/offset in LAS headers

Optionally set header.add_crs("EPSG:4978") and consistent scales (e.g., 0.001) across all chunks per survey. If you customize offsets, ensure they are identical across chunks or rescale during merge.
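A sketch of consistent per-survey header settings with laspy 2.x (add_crs takes a pyproj CRS; the 0.001 scale and shared offsets follow the suggestion above rather than values taken from the PR):

```python
import laspy
import numpy as np
from pyproj import CRS

def make_survey_header(offsets):
    """One header configuration reused for every chunk of a survey,
    so chunks can later be merged without rescaling points."""
    header = laspy.LasHeader(point_format=6, version="1.4")
    header.scales = np.array([0.001, 0.001, 0.001])   # 1 mm resolution
    header.offsets = np.asarray(offsets, dtype=float)  # identical across chunks
    header.add_crs(CRS.from_epsg(4978))                # embed geocentric CRS
    return header
```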


204-218: Use constants for paths instead of hardcoding

Replace Path("converted_surveys") with CONVERTED_PATH for consistency.

Apply:

-            src = Path("converted_surveys") / f"{survey_name}.las"
+            src = CONVERTED_PATH / f"{survey_name}.las"

22-33: Type-hint consistency

prepare_cells returns DefaultDict[str, List[str]] (FIDs appended as strings), while group_by_h3_cell expects List[int]. Align types (prefer List[str]) to avoid runtime .item() issues.


52-67: Performance: O(cells×surveys) intersection check

This nested loop will be slow for large extents. Consider using a STRtree over survey geoms and querying with each cell polygon’s envelope before precise intersects.


47-47: Nit: typo

“Transfrom” → “Transform”.

.gitignore (1)

1-5: Confirm intentional exclusion of input/metadata

Ignoring metadata and surveys prevents tracking sample inputs. If examples are needed, add a metadata.sample/ pattern and reference it in README.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bedebd3 and 4d72a31.

📒 Files selected for processing (4)
  • .gitignore (1 hunks)
  • README.MD (1 hunks)
  • lasConverter.py (1 hunks)
  • requirements.txt (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
README.MD

16-16: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.1)
lasConverter.py

257-257: Do not catch blind exception: Exception

(BLE001)


@coderabbit coderabbit bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
lasConverter.py (2)

31-33: Unify CRS to 4326 for overlap tests; stop transforming H3/survey polygons to 4978

Cropping/overlap in ECEF (4978) with 2D polygons is geometrically wrong and mixes datums. Keep both H3 polygons and survey geoms in 4326 for overlap; only point clouds should be 4978. This also simplifies PDAL cropping later.

Apply:

-        cell_polygons: Mapping of H3 cell IDs to their shapely Polygons (EPSG:4978).
+        cell_polygons: Mapping of H3 cell IDs to their shapely Polygons (EPSG:4326).
@@
-    # Transformer for h3 cell polygons
-    transformer_4326_to_4978 = Transformer.from_crs("EPSG:4326", "EPSG:4978", always_xy=True)
+    # H3 polygons remain in EPSG:4326 for overlap/cropping.
@@
-            poly = Polygon([(lng, lat) for lat, lng in coords])
-            cell_polygons[c2] = transform(transformer_4326_to_4978.transform, poly)
+            poly = Polygon([(lng, lat) for lat, lng in coords])  # EPSG:4326
+            cell_polygons[c2] = poly
@@
-    # Transform survey polygons from EPSG:4258 to EPSG:4978
+    # Transform survey polygons from EPSG:4258 to EPSG:4326
     metadata["geom"] = metadata["geom"].apply(wkt.loads)
-    transform_4258_to_4978 = Transformer.from_crs("EPSG:4258", "EPSG:4978", always_xy=True)
-    metadata["geom_4978"] = metadata["geom"].apply(lambda g: transform(transform_4258_to_4978.transform, g))
+    transform_4258_to_4326 = Transformer.from_crs("EPSG:4258", "EPSG:4326", always_xy=True)
+    metadata["geom_4326"] = metadata["geom"].apply(lambda g: transform(transform_4258_to_4326.transform, g))
@@
-            geom = survey["geom_4978"]
+            geom = survey["geom_4326"]

Also applies to: 35-50, 56-66


234-251: Crop in 4326; add reprojection stages; use filenames; drop extra_dims in PDAL

Perform crop in 4326, reproject back to 4978. Use explicit filenames to avoid glob quirks and remove unnecessary extra_dims.

-        # PDAL JSON pipeline
+        # PDAL JSON pipeline (crop in EPSG:4326, then reproject back)
         pdal_pipeline = {
             "pipeline": [
-                {
-                    "type": "readers.las",
-                    "filename": input_pattern,
-                    "extra_dims": "Accepted=uint8,TVU=float32,THU=float32"
-                },
-                {
-                    "type": "filters.crop",
-                    "polygon": cell_poly.wkt
-                },
-                {
-                    "type": "writers.las",
-                    "filename": str(H3_CELLS_PATH / f"{c2}.las"),
-                    "extra_dims": "Accepted=uint8,TVU=float32,THU=float32"
-                }
+                {"type": "readers.las", "filenames": input_files},
+                {"type": "filters.reprojection", "in_srs": "EPSG:4978", "out_srs": "EPSG:4326"},
+                {"type": "filters.crop", "polygon": cell_poly.wkt},  # cell_poly must be EPSG:4326
+                {"type": "filters.reprojection", "in_srs": "EPSG:4326", "out_srs": "EPSG:4978"},
+                {"type": "writers.las", "filename": str(H3_CELLS_PATH / f"{c2}.las")}
             ]
         }
🧹 Nitpick comments (6)
lasConverter.py (5)

192-195: Docstring: reflect 4326 polygons when cropping in 4326

Update to avoid confusion with the corrected CRS flow.

-        cell_polygons: Mapping of H3 cell IDs to their shapely Polygons (EPSG:4978).
+        cell_polygons: Mapping of H3 cell IDs to their shapely Polygons (EPSG:4326).

229-231: Remove unused glob after switching to filenames

input_pattern becomes dead code when using filenames. Clean it up.

-        # PDAL filename glob pattern
-        input_pattern = str(temp_folder / "*.las")
+        # Using explicit filenames list for PDAL; no glob needed

215-221: Symlink fallback for Windows/non‑privileged environments

Symlink creation often fails on Windows without Developer Mode/admin. Fall back to copy.

-            dst.symlink_to(src.resolve())
+            try:
+                dst.symlink_to(src.resolve())
+            except OSError:
+                # Fallback on platforms without symlink privileges
+                shutil.copy2(src, dst)

143-151: Embed CRS in LAS output (optional but helpful for downstream tools)

Add EPSG:4978 to LAS headers so consumers don’t rely on external config.

-                header = laspy.LasHeader(point_format=6, version="1.4")
+                header = laspy.LasHeader(point_format=6, version="1.4")
+                # Embed CRS in header for clarity (LAS 1.4)
+                # Requires: from pyproj import CRS (import at top)
+                header.add_crs(CRS.from_epsg(4978))

Add import:

 import laspy
+from pyproj import CRS

140-141: Verify point_source_id fits 16‑bit; optionally preserve full FID

LAS point_source_id is uint16. If your survey index/FIDs can exceed 65535, remap or store the full value in an extra dimension (e.g., SourceFID).

Optional addition:

-                # Set source_id
-                ids = np.full(len(df), source_id, dtype=np.uint16)
+                # Set source_id (LAS spec: uint16); ensure source_id <= 65535
+                ids = np.full(len(df), source_id % 65536, dtype=np.uint16)
@@
-                header.add_extra_dim(laspy.ExtraBytesParams(name="Accepted", type=np.uint8))
+                header.add_extra_dim(laspy.ExtraBytesParams(name="Accepted", type=np.uint8))
+                header.add_extra_dim(laspy.ExtraBytesParams(name="SourceFID", type=np.uint32))
@@
-                las["Accepted"] = accepted
+                las["Accepted"] = accepted
+                las["SourceFID"] = np.full(len(df), int(row["fid"]), dtype=np.uint32)

Also applies to: 145-159

README.MD (1)

3-9: Clarify CRS flow: crop in 4326, reproject to 4978 for LAS

Align docs with implementation fix: overlap/cropping in EPSG:4326; CSV→LAS in 4978.

-`lasConverter.py` converts survey CSV files into LAS files using EPSG:4978 and groups/crops them by H3 cells.  
+`lasConverter.py` converts survey CSV files into LAS files (EPSG:4978). H3 overlap and cropping are performed in EPSG:4326, then results are reprojected back to 4978.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4d72a31 and fc1d707.

📒 Files selected for processing (2)
  • README.MD (1 hunks)
  • lasConverter.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.1)
lasConverter.py

260-260: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (2)
lasConverter.py (1)

167-171: Nice defensive guard on empty chunk set

Prevents IndexError during merge when a survey yields no chunks. LGTM.

README.MD (1)

16-28: Good: fenced code block language set

The code block now specifies text, satisfying MD040.


@coderabbit coderabbit bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
lasConverter.py (1)

47-50: Hardcoded EPSG:4258 assumption for survey geometries is risky.

The code assumes all survey geom WKT strings are in EPSG:4258 without validation. If geometries are in a different CRS, spatial intersections will silently produce incorrect results.

Consider adding a geom_epsg column to metadata or at minimum document this assumption prominently:

+    # ASSUMES: All geometries in metadata["geom"] are in EPSG:4258
+    # TODO: Make source CRS configurable via metadata["geom_epsg"]
     metadata["geom"] = metadata["geom"].apply(wkt.loads)
     transform_4258_to_4978 = Transformer.from_crs("EPSG:4258", "EPSG:4978", always_xy=True)
     metadata["geom_4978"] = metadata["geom"].apply(lambda g: transform(transform_4258_to_4978.transform, g))
🧹 Nitpick comments (2)
lasConverter.py (2)

54-65: Nested loop may be slow for large datasets.

The O(cells × surveys) iteration checks every survey against every cell. For hundreds of cells and surveys, this could become slow.

Consider using an R-tree spatial index (via shapely.strtree.STRtree) to speed up intersection queries:

from shapely.strtree import STRtree

# Build spatial index for cells
cell_ids = list(cell_polygons.keys())
cell_geoms = list(cell_polygons.values())
tree = STRtree(cell_geoms)

# Query for each survey
for _, survey in metadata.iterrows():
    geom = survey["geom_4978"]
    if geom.is_empty:
        continue
    
    # Query index for potentially intersecting cells
    for idx in tree.query(geom):
        c2 = cell_ids[idx]
        cell_poly = cell_geoms[idx]
        if cell_poly.intersects(geom) or (isinstance(geom, MultiPolygon) and any(cell_poly.intersects(sg) for sg in geom.geoms)):
            cell_to_surveys[c2].append(survey["fid"])

83-83: Document the default EPSG:25832 choice.

EPSG:25832 (ETRS89 / UTM zone 32N) is silently used as the default when epsg is missing. This should be documented for clarity.

Add a comment explaining the choice:

-    # Set EPSG:25832 as default when crs is missing
+    # Default to EPSG:25832 (ETRS89 / UTM zone 32N, common for Norwegian surveys)
     metadata["epsg"] = metadata["epsg"].fillna(25832).astype(int)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc1d707 and 4393a81.

📒 Files selected for processing (1)
  • lasConverter.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-26T12:12:49.402Z
Learnt from: adriahso
PR: TDT4290-group-4/lasConverter#4
File: lasConverter.py:238-250
Timestamp: 2025-10-26T12:12:49.402Z
Learning: In the lasConverter project, PDAL pipelines require explicit `extra_dims` declarations in readers.las and writers.las stages to preserve custom extra bytes (Accepted, TVU, THU) when cropping LAS files, even though theoretically LAS should carry extra bytes automatically. This was empirically tested and confirmed necessary.

Applied to files:

  • lasConverter.py
🔇 Additional comments (1)
lasConverter.py (1)

219-219: Symlink creation may fail on Windows.

Path.symlink_to() requires administrator privileges or Developer Mode on Windows, which could cause failures in production environments.

Consider using shutil.copy2() for cross-platform compatibility if symlinks are problematic:

-            dst.symlink_to(src.resolve())
+            shutil.copy2(src, dst)  # More portable than symlinks

Or add error handling:

-            dst.symlink_to(src.resolve())
+            try:
+                dst.symlink_to(src.resolve())
+            except (OSError, NotImplementedError):
+                # Fallback to copy on Windows or when symlinks unavailable
+                shutil.copy2(src, dst)

Verify whether the deployment environment supports symlinks.

Member

@gautegf gautegf left a comment


Is there a reason why the metadata files can't be included in the repo?

Also saw some exception-handling suggestions from the rabbit. Not sure how necessary they are, but it could be nice to display error messages to help with debugging when the converter is used in the future.

Member

@gautegf gautegf left a comment


LGTM

@adriahso
Member Author

Is there a reason why the metadata files can't be included in the repo?

Also saw some exception-handling suggestions from the rabbit. Not sure how necessary they are, but it could be nice to display error messages to help with debugging when the converter is used in the future.

I didn't include the metadata files in the repo in case the survey locations themselves are confidential, so I want to clarify that with the customer before including them. There is already quite a bit of exception handling, but I can add more later if time allows.

Member

@kleinc kleinc left a comment


LGTM!

@adriahso adriahso merged commit 673827f into main Oct 27, 2025
1 check passed
@adriahso adriahso deleted the 3-group-surveys-to-fix-floating-point-issue branch October 27, 2025 08:45