Thursday, March 5, 2015

How to debug segfaults and exceptions using gdb?

This article is intended for advanced users and developers. Before going into the details of troubleshooting with gdb, I will cover some basics first.


When a program segfaults on Linux, the kernel writes a core dump file, provided core dumps are enabled (see the ulimit commands below).

If you are not sure whether your program segfaulted, you can run

>dmesg | tail

If it did segfault, dmesg will mention the program in its last few lines.


If no core file is generated, you need to tell the system to generate one.

>ulimit -c  -- show the current core file size limit; 0 means no core file is written.

>ulimit -c unlimited  -- if no core file is generated, run this command to remove the limit.


If this is done correctly and your program does crash, you can use gdb to load the crash or core dump file to see what's causing the crash.

gdb -c crashdumpfile program.exe

//use --args to start program.exe under gdb with parameters (this runs the program instead of loading a core file)

 [ gdb --args program.exe args1 args2 ... ]



Some handy programs to know:

>ldd programname  //list the shared libraries (*.so) the program depends on.

>objdump -x programname   //dump all headers of the binary (objdump needs at least one option, e.g. -x, -d or -s)

>objdump -p programname | grep NEEDED  //list the library dependencies recorded in the binary

>sudo pldd 1234  //list the shared libraries loaded by the running process with id 1234

>pmap 1234 -p //list the memory mappings of process 1234, including the full paths of the mapped files


I will cover troubleshooting with these programs in more detail in another article.


Moving on to the case where the program does not segfault.

The program may mysteriously exit or return a non-zero exit code.

Now you want to know what is causing the problem.

You may not know where to start, and going line by line to find the issue is sometimes nearly impossible. The program can be so large and take so long to run that it is very frustrating to run it under gdb more than once to reproduce or troubleshoot the issue.

The first thing to consider is whether any exceptions are being thrown in the code. Normally an unhandled exception would cause a crash, but since the program does not crash, the exception is being handled somewhere and a non-zero exit code is returned.

The next question is: how do I know which exception to catch or which function to break on?

You don't need to.

You can tell gdb to break at all exceptions being thrown and at the callers throwing the exceptions.

Below is a list of useful gdb commands, followed by an example of finding where an exception is thrown. The commands related to exceptions are catch catch and catch throw.


Debug using gdb

----------------------------------

gdb -c crashdumpfile program.exe

//use --args to start program.exe under gdb with parameters (this runs the program instead of loading a core file)

 [ gdb --args program.exe args1 args2 ... ]


>b programname_methodname -- break at a method name

>bt -- backtrace; works in release builds as well.

>p variablename -- print the value of a variable

>n -- next

>s -- step into

>list - -- print the lines just before the lines last printed

>list linenum -- print lines centered around line linenum

>b __raise_exception  -- break where the exception is thrown, just before an exception handler is called. Note that catch catch does not always give you this information. __raise_exception is the name used by older GNU C++ runtimes; with current libstdc++ the throw goes through __cxa_throw, as in the backtrace further down.

>info break -- list the current breakpoints, watchpoints and catchpoints

>delete //delete all breakpoints

>disable //disable all breakpoints

>enable breakpoint#  //enable the given breakpoint number

>continue

>continue



When you run the program under gdb, the output can also show the paths of the libraries the program depends on.


(gdb) break TestClass::testFunc(int)

Breakpoint 1 at 0x80485b2: file cpptest.cpp, line 16.

(gdb) break test.c:19

Breakpoint 2 at 0x80483f8: file test.c, line 19

Show the next statement that will be executed.

(gdb) where

#0  mystrcpy (copyto=0x259fc6c "*", copyfrom=0x259fddc "ABC") at printch.cpp:27

#1  0x4010c8 in main (argc=3, argv=0x25b0cb8) at printch.cpp:40

The statement at line 27 of the function mystrcpy is the next statement and the function mystrcpy was called by main.


Execute the rest of the current function; that is, step out of the function.

(gdb) finish

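(gdb) info variables -- show all global and static variables

(gdb) info locals -- local variables of the current stack frame

(gdb) info args  -- arguments of the current stack frame


(gdb) show verbose -- show whether verbose mode is on

(gdb) set verbose on -- turn verbose mode on

(gdb) show logging -- show the current logging settings

(gdb) set logging file filename.txt -- choose the file that gdb logs to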


>catch catch  -- stop whenever an exception is caught. This will not tell you where the exception is thrown; it shows you the catch line.


Example of a stack trace:


#0  0x000000359a0bbc40 in __cxa_begin_catch () from /usr/lib64/libstdc++.so.6

#1  0x0000000000407424 in main (argc=8, arguments=0x7fffffffcc38)

    at server/Metadata/XUDML/TestXUDMLParser/TestXUDMLParser.cpp:101


Here line 101 is just a catch(...) statement. It may tell you what the exception is but it won't tell you where the exception is thrown.


>catch throw -- stop whenever an exception is thrown; this tells you where the exception is thrown

Example: here the exception is thrown in the DOMParentNode::removeChild method. Voilà!

(gdb) bt

#0  0x000000359a0bccb0 in __cxa_throw () from /usr/lib64/libstdc++.so.6

#1  0x00007fffea279824 in obixercesc_2_8::DOMParentNode::removeChild (

    this=0x7fff3551ff80, oldChild=0x7fff3551ff68) at DOMParentNode.cpp:282

#2  0x00007fffea263c3b in obixercesc_2_8::DOMElementImpl::removeChild (

    this=0x7fff3551ff68, oldChild=0x7fff3551ff68) at DOMElementImpl.cpp:534

#3  0x00007fffecc5cee4 in XmlElement::RemoveChild (this=0x7fffffff8320,

    child=...) at server/Metadata/XUDML/Src/SMFineGrainedUtil.cpp:947

#4  0x00007fffecc6cea5 in LogicalTableSourceOperations::UpdateParentOnDeletion

    (this=0xa7fe20, element=..., delXmlElements=..., residueDeleteSet=...)

    at server/Metadata/XUDML/Src/SMFineGrainedXMLHandler.cpp:1158

#5  0x00007fffecc6a5e3 in ElementTypeOperations::ApplyDeleteCommand (

    this=0xa7fe20, element=..., deleteSet=..., delXmlElements=...,

    residueDeleteSet=...)

    at server/Metadata/XUDML/Src/SMFineGrainedXMLHandler.cpp:865

#6  0x00007fffecc6a695 in ElementTypeOperations::ApplyDeleteCommand (

    this=0xa7fe20, element=..., deleteSet=..., delXmlElements=...,

    residueDeleteSet=...)

    at server/Metadata/XUDML/Src/SMFineGrainedXMLHandler.cpp:872

#7  0x00007fffecc6a695 in ElementTypeOperations::ApplyDeleteCommand (

    this=0xa7fe20, element=..., deleteSet=..., delXmlElements=...,

    residueDeleteSet=...)

    at server/Metadata/XUDML/Src/SMFineGrainedXMLHandler.cpp:872

#8  0x00007fffecc6a695 in ElementTypeOperations::ApplyDeleteCommand (

    this=0xa7fe20, element=..., deleteSet=..., delXmlElements=...,

    residueDeleteSet=...)

    at server/Metadata/XUDML/Src/SMFineGrainedXMLHandler.cpp:872

#9  0x00007fffecc71935 in FineGrainedXML::XmlSection::ApplyDelete (

    this=0x7fffffffba78, deleteSet=..., residueDeleteSet=...)

    at server/Metadata/XUDML/Src/SMFineGrainedXMLHandler.cpp:1818

#10 0x00007fffec79793d in TransformXml (pGateway=0x7fff9cbe9008, fgxml=...)

    at server/Metadata/XUDML/Src/SMXUDMLParser.cpp:520

#11 0x00007fffec797bea in GeneratePatchedXML (pGateway=0x7fff9cbe9008,

    fgXmlfileOrText=..., bIsFile=true, pObjManager=0xa7e060, errorMsg=...,

    bIsPureFGXml=@0x7fffffffc09f, bIsTransformation=true, mergedXML=...)

    at server/Metadata/XUDML/Src/SMXUDMLParser.cpp:543

#12 0x00007fffec7988f8 in XUDMLParser::ExecuteTransactionalXMLforXUDML (

    pGateway=0x7fff9cbe9008, inputs=..., inputPasswords=..., errorMsg=...,

    warningMsg=..., createdObjIds=..., modifiedObjIds=..., delIds=...,



Useful commands when troubleshooting a multithreaded program:


(gdb) info threads //list all the threads in the program

  5 Thread 0x7fffe0988700 (LWP 9729)  0x0000003b5180c6ad in pthread_getspecific () from /lib64/libpthread.so.0

* 4 Thread 0x7fffe0a89700 (LWP 9728)  0x00007ffff456b869 in _SASSTL::_Rb_global<bool>::_M_decrement (_M_node=0x7fffe12f02b0) at thirdparty/include/stlport/stl/_tree.c:269

  3 Thread 0x7fffe2b66700 (LWP 9727)  0x0000003b5180b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  2 Thread 0x7fffe2c67700 (LWP 9726)  0x0000003b5180b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  1 Thread 0x7fffe9c7b720 (LWP 9716)  0x0000003b5180b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

//* is the current thread


(gdb) thread apply all bt 5   //print the innermost 5 frames of the stack trace of every thread.

Thread 5 (Thread 0x7fffe0988700 (LWP 9729)):

#0  0x0000003b5180c6ad in pthread_getspecific () from /lib64/libpthread.so.0

#1  0x00007ffff3f4a23c in samem_details_800::Manager::Allocate (this=0x7ffff417f140, Bytes=16, pFile=0x7ffff5c0e8c0 "void _SASSTL::vector<_Tp, _Alloc>::_M_insert_overflow(_Tp*, const _Tp&, const _SASSTL::__false_type&, size_t, bool) [with _Tp = Ref<ALevelToLevel, NonConstPointer<ALevelToLevel> >, _Alloc = _SASSTL::a"..., nLine=130) at manager.cpp:1202

#2  0x00007ffff45b4a09 in NQNodeAlloc::allocate (__n=<value optimized out>, pFile=<value optimized out>, nLine=<value optimized out>) at thirdpartysource/STLport-4.5/src/nqnodealloc.cpp:37

#3  0x00007ffff5b996b1 in allocate (this=<value optimized out>, parentLinks=...) at thirdparty/include/stlport/stl/_alloc.h:372

#4  _M_insert_overflow (this=<value optimized out>, parentLinks=...) at thirdparty/include/stlport/stl/_vector.h:130


Thread 4 (Thread 0x7fffe0a89700 (LWP 9728)):

#0  0x00007ffff456b869 in _SASSTL::_Rb_global<bool>::_M_decrement (_M_node=0x7fffe12f02b0) at thirdparty/include/stlport/stl/_tree.c:269

#1  0x00007ffff5607b8d in operator-- (this=0x7fffe12dd090, __v=...) at thirdparty/include/stlport/stl/_tree.h:185

#2  _SASSTL::_Rb_tree<RefCnt<RelInstance, RefCountable>, RefCnt<RelInstance, RefCountable>, _SASSTL::_Identity<RefCnt<RelInstance, RefCountable> >, LessRelInstance, _SASSTL::allocator<RefCnt<RelInstance, RefCountable> > >::insert_unique (this=0x7fffe12dd090, __v=...) at thirdparty/include/stlport/stl/_tree.c:410

#3  0x00007ffff5607026 in insert (this=0x7fffe12dd088, pSharedInstance=<value optimized out>) at thirdparty/include/stlport/stl/_set.h:137

#4  Relation::SetRelInstance (this=0x7fffe12dd088, pSharedInstance=<value optimized out>) at server/Metadata/Networker/Src/SMRelation.cpp:334


Thread 3 (Thread 0x7fffe2b66700 (LWP 9727)):

#0  0x0000003b5180b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x00007ffff4af71b9 in NQConditionWait::Wait (this=0x7ffff4dd84c8, milliseconds=1000) at server/Utility/Generic/NQThreads/SUGConditionWait.cpp:273

#2  0x00007ffff4b70aff in PeriodicTasksExecutor::schedulingThreadMain (this=0x7ffff4dd83c0) at server/Utility/Generic/Src/PeriodicTasksExecutor.cpp:123

#3  0x00007ffff4b207c9 in NQExecutionState::ExecuteSystemMain (this=0x7fffe9a3d538) at server/Utility/Generic/NQThreads/SUGExecutionState.cpp:91

#4  0x00007ffff4b5c3bc in NQThreadJobBase::ExecuteSystemMain (this=0x7fffe9a3d538) at server/Utility/Generic/NQThreads/SUGThreadJob.cpp:179


Thread 2 (Thread 0x7fffe2c67700 (LWP 9726)):

#0  0x0000003b5180b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x00007ffff4af72be in NQConditionWait::Wait (this=0x7fffe96d9688) at server/Utility/Generic/NQThreads/SUGConditionWait.cpp:156

#2  0x00007ffff4b432f8 in NQSemaphore::Acquire (this=0x7fffe96d9580) at server/Utility/Generic/NQThreads/SUGSemaphore.cpp:100

#3  0x00007ffff4af8cb9 in NQConditionWaitLIFO::Wait (this=0x7fffe9a131f0, signalSemaphore=..., startTime=...) at server/Utility/Generic/NQThreads/SUGConditionWaitLIFO.cpp:128

#4  0x00007ffff4af8de9 in NQConditionWaitLIFO::Wait (this=0x7fffe9a131f0) at server/Utility/Generic/NQThreads/SUGConditionWaitLIFO.cpp:104


Thread 1 (Thread 0x7fffe9c7b720 (LWP 9716)):

#0  0x0000003b5180b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x00007ffff4af71b9 in NQConditionWait::Wait (this=0x7fffffff9d68, milliseconds=200) at server/Utility/Generic/NQThreads/SUGConditionWait.cpp:273

#2  0x00007ffff4b43157 in NQSemaphore::Acquire (this=0x7fffffff9c60, milliseconds=<value optimized out>) at server/Utility/Generic/NQThreads/SUGSemaphore.cpp:130

#3  0x00007ffff4b1f5e8 in Acquire (this=0x7fffe96db808, stateMask=<value optimized out>, milliseconds=200, actualState=@0x7fffffff9f8c) at server/include/Utility/Generic/SUGExecutionState.h:160

#4  NQExecutionState::WaitForStateNoLock (this=0x7fffe96db808, stateMask=<value optimized out>, milliseconds=200, actualState=@0x7fffffff9f8c) at server/Utility/Generic/NQThreads/SUGExecutionState.cpp:228


Some other useful commands

(gdb) frame

(gdb) list *$pc

frame prints the current frame together with its source line, and list *$pc lists the source code around the current execution line.



Checking the function I want to debug in gdb:

(gdb) info functions libafd_traj_quality
All functions matching regular expression "libafd_traj_quality":

File ../libAFD/libafd.cpp:
Datum libafd_traj_quality(FunctionCallInfo);
const Pg_finfo_record *pg_finfo_libafd_traj_quality();
Datum libafd_traj_quality(FunctionCallInfo);
const Pg_finfo_record *pg_finfo_libafd_traj_quality();
(gdb) 

Trying to set the breakpoint gives an error:

(gdb) b libafd_traj_quality
Cannot access memory at address 0x26c01f
(gdb)

How can I set a breakpoint on my function?

Additional info:

# cat /proc/22648/maps  | grep libafd
2b3b2556e000-2b3b25a82000 r-xp 00000000 08:01 5379086                    /usr/lib/postgresql/9.1/lib/libafd.so
2b3b25a82000-2b3b25c82000 ---p 00514000 08:01 5379086                    /usr/lib/postgresql/9.1/lib/libafd.so
2b3b25c82000-2b3b25c88000 r--p 00514000 08:01 5379086                    /usr/lib/postgresql/9.1/lib/libafd.so
2b3b25c88000-2b3b25c9a000 rw-p 0051a000 08:01 5379086                    /usr/lib/postgresql/9.1/lib/libafd.so

//This only returns something if libafd.so is compiled with symbols
# nm -as /usr/lib/postgresql/9.1/lib/libafd.so | grep libafd_traj_quality
00000000003d4788 r _ZZ28pg_finfo_libafd_traj_qualityE8my_finfo
000000000026c01f T libafd_traj_quality
000000000026c012 T pg_finfo_libafd_traj_quality

Real remote debugging example

Setup on the target node (10.140):
gdbserver --multi :2345 //gdbserver listens on port 2345
The target machine also runs the app we are trying to debug.
1. Build a debug version of the app (with debug symbols) and launch gdb with it on the host machine to connect to the target.
The following assumes this Python script builds a debug version of the app and starts gdb with it.
$ ./build/gdb_coredump.py app
gdb> target extended-remote 10.140:2345
gdb> attach app_processid // use the process id of the app on the target machine. The program will pause here.
gdb> bt
gdb> b app_main.cpp:42 //break at this line
gdb> b .... //set any other breakpoints you need before continuing
(gdb) c
Continuing.

Thread 1 "app" hit Breakpoint 1, main (argc=<optimized out>, argv=0x7fd6924fd8)
at _app/src/app_main.cpp:42
42 sleep(10);
(gdb) list
37 StdMessageUi msg_handler; // Standard Message handling including a time field
38 PrometheusUiApplication app(argc, argv);
39 volatile bool bFakeDelay = true; //need to be volatile so the compiler does not optimize it out.
40 while (bFakeDelay)
41 {
42 sleep(10);
43 }
44 QCoreApplication::setApplicationName("Prometheus UI");
45 QCoreApplication::setApplicationVersion("0.1");
46
(gdb) p bFakeDelay
$1 = true
gdb> set var bFakeDelay=false //to break out of the loop; set var avoids clashes with gdb's own set subcommands
gdb> continue


##### Example of printing something at a breakpoint
(gdb) b app_main.cpp:42
Breakpoint 4 at 0x43ff98: file app_main.cpp, line 42.
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>printf "breaking at sleep(10) bFakeDelay=%d\n", bFakeDelay
>cont
>end
(gdb) c //you may have to wait up to 10 seconds to see it break
Continuing.
//do something to trigger the crash and pause gdb
[New Thread 4285.4622]
[New Thread 4285.4625]
[New Thread 4285.4626]
[New Thread 4285.4629]
[New Thread 4285.4630]
[New Thread 4285.4631]
[New Thread 4285.4632]
[New Thread 4285.4662]
[New Thread 4285.4663]
[New Thread 4285.4664]
[New Thread 4285.4665]

Thread 1 "prometheus_app" received signal SIGSEGV, Segmentation fault.
0x0000007f78340a80 in tcache_init () at malloc.c:3129
3129 malloc.c: No such file or directory.
(gdb) bt
....
#12 ISI::AlertUiProcessingManager::trackSessionAlertMessage (this=this@entry=0x17a7e4f0, id=id@entry=ID::Alert::CHECK_ENERGY_CORD_CONNECTION) at _prometheus/lib/ui/alerts/src/alert_ui_processing_manager.cpp:48
#13 0x00000000004f22a0 in ISI::AlertUiProcessingManager::updateAlertContext (this=this@entry=0x17a7e4f0, id=id@entry=ID::Alert::CHECK_ENERGY_CORD_CONNECTION, newContextVal=...) at _prometheus/lib/ui/alerts/src/alert_ui_processing_manager.cpp:75
#14 0x00000000004f0230 in ISI::AlertUiDataValueDelegate::changed (this=0x17a91f10) at _prometheus/lib/ui/alerts/src/alert_ui_data_value_delegate.cpp:23
#15 0x00000000007363c4 in DmUiDataValue::syncValue (this=0x177c9110) at framework/lib/ui/datamodel/src/dm_ui_datamodel.cpp:440
#16 0x0000000000736098 in DmUiDataValue::syncValue (this=0x17679b78) at framework/lib/ui/datamodel/src/dm_ui_datamodel.cpp:456
#17 0x0000000000516d2c in BaseUiScreen::dataManagerSync (this=this@entry=0x176edbc0) at _prometheus/lib/ui/base_domain/src/base_ui_screen.cpp:46
#18 0x00000000004d9060 in PrometheusUiMainScreen::paint (this=0x176edbc0, painter=0x1710d1b0) at _prometheus/lib/ui/prometheus_standalone/src/prometheus_ui_mainscreen.cpp:78
#19 0x0000007f7923f0e0 in ?? ()
#20 0x0000000016916e40 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further
(gdb) list alert_ui_processing_manager.cpp:48
43 {
44 alertMessage += "\n" + thirdMessage;
45 }
46 if(m_sessionAlerts.size() == SESSION_ALERTS_TRACK_SIZE )
47 {
48 m_sessionAlerts.erase(m_sessionAlerts.end());//pop_back();
49 }
50 m_sessionAlerts.push_front(std::pair<string,string>(alertMessage, string(buffer)));
51 std::stringstream ss;
52 ss << "trackSessionAlert called: " << alertMessage << " " << buffer << " " << m_sessionAlerts.size();
(gdb)

Thread 1 "prometheus_app" hit Breakpoint 4, main (argc=<optimized out>, argv=0x7fe6ab5d78)
at _prometheus/app/ui/prometheus_app/src/prometheus_ui_app_main.cpp:42
42 sleep(10);
breaking at sleep(10) bFakeDelay=value has been optimized out
(gdb)
