Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

!!!! erroneous table mutation when adding a large column !!!! #86

Closed
alex-s-gardner opened this issue Nov 13, 2024 · 10 comments · Fixed by yeesian/ArchGDAL.jl#442
Closed

Comments

@alex-s-gardner
Copy link

alex-s-gardner commented Nov 13, 2024

This one caught me off guard. Large tables seem to be unsafe when manipulating this example geo parquet file (using GeoDataFrames v0.3.10 with Julia v"1.11.1"):

https://drive.google.com/file/d/1FJUbk_Smj3VoMhGeR790AtEEaEwZPiFY/view?usp=sharing

In this case adding a new column with 'vector length = 100' does not modify existing columns

using GeoDataFrames
vector_length = 100
test = GeoDataFrames.read("test.parquet")
foo1 = copy(test.RiverIDTrace[1])

test[!, :new_column] = [zeros(vector_length ) for r in 1:nrow(test)]#
foo2 = copy(test.RiverIDTrace[1])
println(foo1 == foo2)
true

adding a large vector, 'vector length = 1000000', DOES MODIFY existing columns

using GeoDataFrames
vector_length = 1000000
test = GeoDataFrames.read("test.parquet")
foo1 = copy(test.RiverIDTrace[1])

test[!, :new_column] = [zeros(vector_length ) for r in 1:nrow(test)]#
foo2 = copy(test.RiverIDTrace[1])
println(foo1 == foo2)
false

adding deepcopy fixes the problem in this instance but after more testing deepcopy does not work in all cases

using GeoDataFrames
vector_length = 1000000
test = deepcopy(GeoDataFrames.read("test.parquet"))
foo1 = copy(test.RiverIDTrace[1])

test[!, :new_column] = [zeros(vector_length ) for r in 1:nrow(test)]#
foo2 = copy(test.RiverIDTrace[1])
println(foo1 == foo2)
true
@evetion
Copy link
Owner

evetion commented Nov 13, 2024

That's odd. Can you test a similar thing with just DataFrames? Since your adding of the new column is a DataFrame specific thing, GeoDataFrames doesn't play a role there anymore.

@evetion
Copy link
Owner

evetion commented Nov 13, 2024

The shared file has been deleted, so I don't know what RiverIDTrace is? Is it also a column of vectors? Can you do an elementwise comparison, and check why foo1 and foo2 are not equal anymore?

Worst case, you generate a lot of memory pressure with your vector of vectors, and something is garbage collected. Then again, you also say deepcopy doesn't work, so something else is happening (or deepcopy on DataFrames is not correctly implemented).

@asinghvi17
Copy link
Contributor

asinghvi17 commented Nov 13, 2024

RiverIDTrace is a column of Vector{Int}, yes. The values inside were randomly overwritten with zeros, seemingly no correlation to row number. (I don't currently have access to the file but saw the bug being reproduced)

@alex-s-gardner
Copy link
Author

@evetion @asinghvi17 please see updated path to file. I am working on a DataFrames only replication of the issue but have not succeeded yet. I will keep working at it.

@asinghvi17
Copy link
Contributor

I tried a similar thing in pure DataFrames, and it does not pose an issue there:

MWE
using DataFrames
df = DataFrame();
int_data = [rand(Int, rand(1:1_000)) for i in 1:215_000]
df.col1 = deepcopy(int_data)
df.col2 = [zeros(1000000) for i in 1:size(df, 1)]
all(df.col1 .== int_data) # true

https://github.com/yeesian/ArchGDAL.jl/blob/a322ce6eb8a811b6ec053608c95c385464214d92/src/ogr/feature.jl#L345 looks to be where int arrays are moved from GDAL to Julia.

It looks like this is an unsafe_wrap, but without own = true (which defaults to false). I'm now going to see if setting own = true changes anything here. Maybe the GDAL dataset scoping also contributes to this, but I don't imagine so...

@evetion
Copy link
Owner

evetion commented Nov 13, 2024

yeesian/ArchGDAL.jl@a322ce6/src/ogr/feature.jl#L345 looks to be where int arrays are moved from GDAL to Julia.

It looks like this is an unsafe_wrap, but without own = true (which defaults to false). I'm now going to see if setting own = true changes anything here. Maybe the GDAL dataset scoping also contributes to this, but I don't imagine so...

Good catch! That's the culprit, and kudos for @alex-s-gardner for actually spotting it in real life (sorry for that). But you shouldn't own=true, as:

the field value. This list is internal, and should not be modified, or freed. Its lifetime may be very brief. If *pnCount is zero on return the returned pointer may be NULL or non-NULL. (https://gdal.org/en/latest/doxygen/classOGRFeature.html)

So the fix would be to at least copy the unsafe_wrap.

@asinghvi17
Copy link
Contributor

asinghvi17 commented Nov 13, 2024

Ah, I missed that we were wrapping the pointer returned directly. Yeah in that case copy seems like the way to go.

Just for my satisfaction, and to document this, copy does make the array robust to any underlying mutation of the parent pointer's memory.

julia> A = rand(10)
10-element Vector{Float64}:
 0.8487556697809062
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

julia> Ap = pointer(A)
Ptr{Float64} @0x00000001ccca4f70

julia> Au = unsafe_wrap(Vector{Float64}, Ap, size(A))
10-element Vector{Float64}:
 0.8487556697809062
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

julia> Auc = copy(Au)
10-element Vector{Float64}:
 0.8487556697809062
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

julia> Auc[1] = 1
1

julia> A
10-element Vector{Float64}:
 0.8487556697809062
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

julia> Au
10-element Vector{Float64}:
 0.8487556697809062
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

julia> Auc
10-element Vector{Float64}:
 1.0
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682


julia> Au[1] = 1
1

julia> Au
10-element Vector{Float64}:
 1.0
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

julia> A
10-element Vector{Float64}:
 1.0
 0.4530489648028254
 0.7915015228101486
 0.5249905434671536
 0.011043884362292533
 0.8092927336663542
 0.2807079717859139
 0.5462812200563412
 0.7293837731721518
 0.8515677666121682

@alex-s-gardner
Copy link
Author

So just for my own edification, why did deepcopy not prevent this issue?

@asinghvi17
Copy link
Contributor

It could be that GDAL overwrote the memory before / during the deepcopy, so what deepcopy saw was already incorrect.

@alex-s-gardner
Copy link
Author

fixed with yeesian/ArchGDAL.jl#442

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants