dss-extensions/dss_capi

Memory leak

keegit opened this issue · 4 comments

keegit commented

I actually think this is because I am not handling memory propery in my use of the API versus there actually being a bug in it, but either way I could use some help.

We're currently using the C API within a genetic algorithm - so there are thousands of API calls happening over the course of the algorithm. Using valgrind, we've identified the C API to be the source of the memory leak. I've included the valgrind output as an attached file.

My question is - do I need to be releasing memory for all API calls? or just a few. I notice in the output that leaks occur in functions I don't necessarily call.

Any help is greatly appreciated.

edit: valgrind output isn't uploading, so I'm posting it here:

==34137== Memcheck, a memory error detector
==34137== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==34137== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==34137== Command: target/debug/main
==34137== Parent PID: 300
==34137== 
==34137== Warning: invalid file descriptor -1 in syscall close()
==34137== Warning: invalid file descriptor -1 in syscall close()
==34137== Thread 13 tokio-runtime-w:
==34137== Invalid read of size 8
==34137==    at 0x4BED696: DSSGLOBALS_$$_SETACTIVEBUS$TDSSCONTEXT$ANSISTRING$$LONGINT (DSSGlobals.pas:374)
==34137==    by 0x6F80FB7: ???
==34137==    by 0x6F8103F: ???
==34137==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
==34137== 
==34137== 
==34137== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==34137==  Access not within mapped region at address 0xA0
==34137==    at 0x4BED696: DSSGLOBALS_$$_SETACTIVEBUS$TDSSCONTEXT$ANSISTRING$$LONGINT (DSSGlobals.pas:374)
==34137==    by 0x6F80FB7: ???
==34137==    by 0x6F8103F: ???
==34137==  If you believe this happened as a result of a stack
==34137==  overflow in your program's main thread (unlikely but
==34137==  possible), you can try to increase the size of the
==34137==  main thread stack using the --main-stacksize= flag.
==34137==  The main thread stack size used in this run was 8388608.


// redacted due to size


==34137== 112,740 bytes in 22,548 blocks are definitely lost in loss record 2,448 of 2,449
==34137==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==34137==    by 0x9141F4: alloc (alloc.rs:98)
==34137==    by 0x9141F4: alloc_impl (alloc.rs:181)
==34137==    by 0x9141F4: allocate (alloc.rs:241)
==34137==    by 0x9141F4: allocate_in<u8, alloc::alloc::Global> (raw_vec.rs:184)
==34137==    by 0x9141F4: with_capacity_in<u8, alloc::alloc::Global> (raw_vec.rs:130)
==34137==    by 0x9141F4: with_capacity_in<u8, alloc::alloc::Global> (mod.rs:672)
==34137==    by 0x9141F4: with_capacity<u8> (mod.rs:481)
==34137==    by 0x9141F4: spec_new_impl_bytes (c_str.rs:287)
==34137==    by 0x9141F4: <&str as alloc::ffi::c_str::CString::new::SpecNewImpl>::spec_new_impl (c_str.rs:306)
==34137==    by 0x61233A: alloc::ffi::c_str::CString::new (c_str.rs:316)
==34137==    by 0x6138A3: dss_rs::circuit::set_active_bus (circuit.rs:224)
==34137==    by 0x2EEF57: genetic_cvr::GeneticCvr::compute_ckt_voltage_delta (lib.rs:236)
==34137==    by 0x2F3B4F: <genetic_cvr::GeneticCvr as genevo::genetic::FitnessFunction<alloc::vec::Vec<i32>,genetic_cvr::score::Score>>::fitness_of (lib.rs:548)
==34137==    by 0x2F531C: <genevo::reinsertion::elitist::ElitistReinserter<G,F,E> as genevo::operator::ReinsertionOp<G,F>>::combine (elitist.rs:154)
==34137==    by 0x30206F: <genevo::ga::GeneticAlgorithm<G,F,E,S,C,M,R> as genevo::algorithm::Algorithm>::next::{{closure}} (mod.rs:208)
==34137==    by 0x304CB7: genevo::statistic::TimedFn<F,U>::run (mod.rs:97)
==34137==    by 0x301C90: <genevo::ga::GeneticAlgorithm<G,F,E,S,C,M,R> as genevo::algorithm::Algorithm>::next (mod.rs:207)
==34137==    by 0x2FD6E4: genevo::simulation::simulator::Simulator<A,T>::process_one_iteration (simulator.rs:162)
==34137==    by 0x2FD017: <genevo::simulation::simulator::Simulator<A,T> as genevo::simulation::Simulation<A>>::step (simulator.rs:242)
==34137== 
==34137== 1,262,240 bytes in 22,540 blocks are definitely lost in loss record 2,449 of 2,449
==34137==    at 0x4837B65: calloc (vg_replace_malloc.c:752)
==34137==    by 0x4979D01: CMEM_$$_CALLOCMEM$QWORD$$POINTER (in /usr/local/lib/libdss_capi.so)
==34137==    by 0x7799E67: ???
==34137==    by 0x496B1AA: SYSTEM_$$_ALLOCMEM$QWORD$$POINTER (in /usr/local/lib/libdss_capi.so)
==34137==    by 0x1FFEFFDBFF: ???
==34137==    by 0x4B67626: CAPI_UTILS_$$_DSS_CREATEARRAY_PDOUBLE$PDOUBLE$PAPISIZE$LONGINT$$PDOUBLEARRAY0 (CAPI_Utils.pas:373)
==34137==    by 0x2: ???
==34137==    by 0x4B67890: CAPI_UTILS_$$_DSS_RECREATEARRAY_PDOUBLE$PDOUBLE$PAPISIZE$LONGINT$$PDOUBLEARRAY0 (CAPI_Utils.pas:483)
==34137==    by 0x4B671DB: CAPI_UTILS_$$_INVALIDCIRCUIT$TDSSCONTEXT$$BOOLEAN (CAPI_Utils.pas:246)
==34137==    by 0x7799E67: ???
==34137==    by 0x1FFEFFDC00: ???
==34137==    by 0x53CC717: ???
==34137== 
==34137== LEAK SUMMARY:
==34137==    definitely lost: 1,679,836 bytes in 81,157 blocks
==34137==    indirectly lost: 8,447 bytes in 126 blocks
==34137==      possibly lost: 133,344 bytes in 2,200 blocks
==34137==    still reachable: 924,303 bytes in 3,735 blocks
==34137==                       of which reachable via heuristic:
==34137==                         length64           : 361,199 bytes in 1,462 blocks
==34137==         suppressed: 0 bytes in 0 blocks
==34137== Reachable blocks (those to which a pointer was found) are not shown.
==34137== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==34137== 
==34137== For counts of detected and suppressed errors, rerun with: -v
==34137== ERROR SUMMARY: 1411 errors from 1410 contexts (suppressed: 0 from 0)

PMeira commented

There were some tiny leak when the state was invalid and some for empty data. These were fixed in 41295d0, which is included in v0.14.0b1 (check the pre-releases).

I'd need to investigate something more specific another time (not for a few days at least). I also can't make any sense of those traces, maybe try linking with the debug version (libdss_capid.so) for better traces.

Valgrind has some false-positives with Free Pascal, but that's a fixed set of potential leaks that doesn't grow, related to the FPC runtime initialization and things like that, AFAICT.

If you're not using the *_GR functions, the array buffers can be reused and freed once (with the appropriate API functions, not directly). Functions from the classic API that return single strings keep a copy of the last string in the engine too, so you don't need to free those -- just copy them before another API call that returns a string since it could overwrite it (Free Pascal strings are refcounted, so it depends).

For the *_GR alternatives, there are internal buffers (that grow as required) for each basic type (notably arrays of float64, int32 and int8; deprecated for strings).

All buffers are specific to each DSS context, there are no global buffers at process level anymore. You could use (ctx_)DSS_ResetStringBuffer and (ctx_)DSS_DisposeGRData to dispose the internal buffers, but those don't leak and should be disposed automatically when the context itself is disposed (in case of the prime/default instance, that happens when the library is finalized.

For the new APIs, some functions return strings that need to be manually disposed. Most try to expose internal state directly and don't require disposal.

For nearly all functions, any pointer provided by the user must be handled by the user, e.g. if you provide a string to ctx_Circuit_SetActiveElement, it doesn't take ownership of the string (same is valid for almost all arrays).

keegit commented

So linking to the debug version shows that there are only two places where the memory leaks occur:

  • text_set_command()
    • this is likely due to my mishandling of the string passed into the function, but I'm still investigating.
  • solution_solve()
    • I'm going to update the version I'm using to the one you linked. Hopefully that eliminates solve() as the source.
keegit commented

Linking with the debug version helped gain some more information. There were other API calls I was making that were showing up in the valgrind output but I seem to have fixed those. At the moment, it seems like the error is happening in both solution_solve() and text_set_command(). I can't seem to upload the valgrind output as a file so I've emailed it as an attachment to you @PMeira. I've also shared the code for each function, but I'm not sure I'm doing anything wrong.

pub fn text_set_command(value: &str) -> Result<()> {
    unsafe {
        let c_str = CString::new(value)?;
        let raw_c_str = c_str.into_raw();
        dss_c::Text_Set_Command(raw_c_str);
        let c_str = CString::from_raw(raw_c_str);
        drop(c_str);
    }
    Ok(())
}

pub fn solution_solve() {
    unsafe {
        dss_c::Solution_Solve();
        dss_c::DSS_DisposeGRData();
    }
}

Edit: upon further testing it seems like RegControls_Set_MaxTapChange() is also leaking:

==44919== 49 bytes in 1 blocks are possibly lost in loss record 658 of 2,396
==44919==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==44919==    by 0x498CCFC: CMEM_$$_CGETMEM$QWORD$$POINTER (in /usr/local/lib/libdss_capid.so)
==44919==    by 0x4AFC725: REGCONTROL$_$TREGCONTROLOBJ_$__$$_RECALCELEMENTDATA (RegControl.pas:623)
==44919==    by 0x4A0C5B4: CKTELEMENTCLASS$_$TCKTELEMENTCLASS_$__$$_ENDEDIT$POINTER$LONGINT$$BOOLEAN (CktElementClass.pas:123)
==44919==    by 0x4C9781F: DSSOBJECTHELPER$_$TDSSOBJECTHELPER_$__$$_ENDEDIT$LONGINT (DSSObjectHelper.pas:3234)
==44919==    by 0x4C96FE1: DSSOBJECTHELPER$_$TDSSOBJECTHELPER_$__$$_SETINTEGER$LONGINT$LONGINT$TDSSPROPERTYSETTERFLAGS$$BOOLEAN (DSSObjectHelper.pas:3085)
==44919==    by 0x4BCBB3A: CAPI_REGCONTROLS_$$_SET_PARAMETER$TDSSCONTEXT$LONGINT$LONGINT (CAPI_RegControls.pas:130)
==44919==    by 0x4BCC974: CAPI_REGCONTROLS_$$_REGCONTROLS_SET_MAXTAPCHANGE$LONGINT (CAPI_RegControls.pas:428)
==44919==    by 0x617A8A: dss_rs::reg_controls::set_max_tap_change (reg_controls.rs:29)
==44919==    by 0x2E0BDD: genetic_cvr::GeneticCvr::set_regulator (lib.rs:145)
==44919==    by 0x2E16E4: genetic_cvr::GeneticCvr::exercise_ckt (lib.rs:206)
==44919==    by 0x2E65F8: <genetic_cvr::GeneticCvr as genevo::genetic::FitnessFunction<alloc::vec::Vec<i32>,genetic_cvr::score::Score>>::fitness_of (lib.rs:546)
==44919==    by 0x2F37A8: <genevo::reinsertion::elitist::ElitistReinserter<G,F,E> as genevo::operator::ReinsertionOp<G,F>>::combine (elitist.rs:154)
==44919==    by 0x2F859F: <genevo::ga::GeneticAlgorithm<G,F,E,S,C,M,R> as genevo::algorithm::Algorithm>::next::{{closure}} (mod.rs:208)
==44919==    by 0x2F9497: genevo::statistic::TimedFn<F,U>::run (mod.rs:97)
==44919==    by 0x2F819C: <genevo::ga::GeneticAlgorithm<G,F,E,S,C,M,R> as genevo::algorithm::Algorithm>::next (mod.rs:207)
==44919==    by 0x2F2454: genevo::simulation::simulator::Simulator<A,T>::process_one_iteration (simulator.rs:162)
==44919==    by 0x2F1D80: <genevo::simulation::simulator::Simulator<A,T> as genevo::simulation::Simulation<A>>::step (simulator.rs:242)
==44919==    by 0x2E3811: genetic_cvr::GeneticCvr::run_algorithm (lib.rs:343)
==44919==    by 0x2E24E1: genetic_cvr::GeneticCvr::run (lib.rs:276)
==44919==    by 0x2008C3: <main::MyService as algorithm_interfaces::algorithm_service_server::AlgorithmService>::run_algorithm::{{closure}} (main.rs:22)
==44919==    by 0x21E282: <core::pin::Pin<P> as core::future::future::Future>::poll (future.rs:124)
==44919==    by 0x2641DD: <<algorithm_interfaces::algorithm_service_server::AlgorithmServiceServer<T> as tower_service::Service<http::request::Request<B>>>::call::RunAlgorithmSvc<T> as tonic::server::service::UnaryService<algorithm_interfaces::Request>>::call::{{closure}} (algorithm.service.v1.rs:234)
==44919==    by 0x21E282: <core::pin::Pin<P> as core::future::future::Future>::poll (future.rs:124)
==44919==    by 0x27137C: tonic::server::grpc::Grpc<T>::unary::{{closure}} (grpc.rs:252)
==44919==    by 0x263B67: <algorithm_interfaces::algorithm_service_server::AlgorithmServiceServer<T> as tower_service::Service<http::request::Request<B>>>::call::{{closure}} (algorithm.service.v1.rs:257)
==44919==    by 0x21E482: <core::pin::Pin<P> as core::future::future::Future>::poll (future.rs:124)
==44919==    by 0x21DB99: <F as futures_core::future::TryFuture>::try_poll (future.rs:82)
==44919==    by 0x281930: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll (into_future.rs:34)
==44919==    by 0x24A69C: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll (map.rs:55)
==44919==    by 0x24CF20: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll (lib.rs:91)
==44919==    by 0x1CD910: <futures_util::future::try_future::MapOk<Fut,F> as core::future::future::Future>::poll (lib.rs:91)
==44919==    by 0x1EA520: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll (macros.rs:38)
==44919==    by 0x1EA9E9: <F as futures_core::future::TryFuture>::try_poll (future.rs:82)
==44919==    by 0x281890: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll (into_future.rs:34)
==44919==    by 0x24ADAC: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll (map.rs:55)
==44919==    by 0x24CED0: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll (lib.rs:91)
==44919==    by 0x1CD870: <futures_util::future::try_future::MapOk<Fut,F> as core::future::future::Future>::poll (lib.rs:91)
==44919==    by 0x1EA3E0: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll (macros.rs:38)
==44919==    by 0x688A03: <core::pin::Pin<P> as core::future::future::Future>::poll (future.rs:124)

Per the full valgrind log and the snippets here, I think there was probably a memory corruption somewhere.

I modified the current ieee13.rs example to include and run this function instead:

fn test_regcontrols(dss: &IDSS) -> Result<(), DSSError> {
    let solution = &dss.ActiveCircuit.Solution;
    let reg_controls = &dss.ActiveCircuit.RegControls;
    use altdss::classic::{SolveModes, ControlModes};

    run_ieee13(&dss)?;

    // Adjust the loads and solution mode so we get tap changes
    solution.Set_Mode(SolveModes::Daily)?;
    solution.Set_ControlMode(ControlModes::Time)?;
    solution.Set_Number(1)?;
    solution.Set_LoadMult(1.3)?;
    dss.Command("BatchEdit Load..* Daily=Default".to_string())?;

    const NUM: i32 = 100000;
    let mut total_changes = 0;
    for i in 0..NUM {
        let mut idx = reg_controls.First()?;
        let val = 1 + (i % 2);
        loop {
            if idx == 0 { 
                break; 
            }
            // set it to 1 or 2 depending on the current index
            reg_controls.Set_MaxTapChange(val)?;
            total_changes += 1;
            idx = reg_controls.Next()?;
        }
        solution.Solve()?;
    }
    println!("Total number of Set_MaxTapChange calls: {total_changes}");
    Ok(())
}

Since there is a Solve() call, it takes a while to run through valgrind, but seems clean:

Total number of Set_MaxTapChange calls: 300000
<snip>
==29056== LEAK SUMMARY:
==29056==    definitely lost: 0 bytes in 0 blocks
==29056==    indirectly lost: 0 bytes in 0 blocks
==29056==      possibly lost: 0 bytes in 0 blocks
==29056==    still reachable: 74,366 bytes in 5 blocks
==29056==         suppressed: 0 bytes in 0 blocks
==29056== 
==29056== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Let's close this for now. It anything remains we can reopen later.