Changes between Initial Version and Version 1 of CodaHOWTO/TroubleShooting


Timestamp:
Feb 8, 2007, 2:18:16 PM (11 years ago)
Author:
jaharkes
= Troubleshooting =

The Coda file system is still under development, and there are certainly bugs that can crash both clients and servers. However, many of the problems users observe stem from semantic differences between Coda and well-known network file systems such as NFS or SMB.

This section points out several logs to look at when identifying the cause of a problem. Even if the source of the problem cannot be found, the information gathered from Coda's logging mechanisms will make it easier for people on the Coda mailing list <codalist@coda.cs.cmu.edu> to help solve it.

Some of the more common problems are illustrated in detail. Toward the end of this section, some of the more involved debugging techniques are addressed; these should help developers isolate problems more easily.

The final part describes how to solve some problems on Windows 95, limited to the Coda-related ones.

== Basic Troubleshooting == #BasicTroubleshooting

Most problems can be solved, or at least recognized, by using the information logged by the clients and servers. The first step in finding out where a problem stems from is running `tail -f` on the log files.

Note that when Coda clients and servers crash they do not ''dump core''; instead they go to sleep so that developers can attach a debugger. As a result, a crashed client or server still shows up in the `ps auxwww` output, and only the combination of a lack of file service and error messages in the log files indicates that something is really wrong.

Since release 5.3.4, servers actually exit when they crash. Create a file `/vice/srv/ZOMBIFY` to force a server to go into an endless sleep again.

=== Client debugging output ===

 * `codacon` is a program which connects to `venus` and provides the user with run-time information. It is the first source of information, but it cannot be used to look back into the history. It is therefore advisable to always have a `codacon` running in a dedicated xterm.
{{{
$ xterm -e codacon
}}}
 * `/usr/coda/etc/console` is a log file which contains mostly error and warning messages, and is the place to look for errors that might have occurred. Failed assertions in the code are logged here.
 * `/usr/coda/etc/venus.log` contains more in-depth information about the running system, which can help to find out what the client is or was doing.
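
The `tail -f` approach for the client logs can be sketched as follows. This is a minimal sketch, not a Coda tool: the function names `existing_logs` and `watch_client_logs` are hypothetical, and the candidate paths are simply the ones this page mentions, so they may differ on your installation.

```shell
# existing_logs: print each of the given paths that is a readable regular file.
# (Hypothetical helper for this sketch, not part of Coda.)
existing_logs() {
    for f in "$@"; do
        if [ -f "$f" ] && [ -r "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# watch_client_logs: follow whichever client logs are present.
# The candidate paths are the locations named on this page.
watch_client_logs() {
    logs=$(existing_logs /usr/coda/etc/console \
                         /usr/coda/etc/venus.log \
                         /usr/coda/venus.cache/venus.log)
    if [ -z "$logs" ]; then
        echo "no client logs found" >&2
        return 1
    fi
    # Word splitting is intended: one path per line, none containing spaces.
    tail -f $logs
}
```

Running `watch_client_logs` in a spare terminal next to `codacon` gives both the live view and the logged history.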

=== Server logs ===

 * `cmon` is an ncurses program that can be run on a client to gather and display statistics from a group of servers. A server that has gone down does not respond to the statistics requests, which makes this a simple way to monitor server availability.
{{{
$ xterm -e cmon server1 server2 server3 ...
}}}
 * `/vice/srv/SrvLog` and `/vice/srv/SrvErr` are the server log files.

Other log files that can be helpful in discovering problems are:

 * `/vice/auth2/AuthLog`
 * `/vice/srv/portmaplog`
 * `/vice/srv/UpdateClntLog`
 * `/vice/srv/UpdateLog`
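
As a rough illustration of searching these logs, the sketch below greps a set of log files for messages that this page associates with known problems. The function name `scan_coda_logs` is hypothetical, the pattern list is assembled only from strings quoted on this page, and it assumes the log lines contain those strings literally.

```shell
# scan_coda_logs: print file:line:text for every line in the given log files
# that contains one of the messages quoted on this page.
# Returns non-zero when nothing matches (the usual grep convention).
# The pattern list is an assumption of this sketch, not a complete list of
# Coda error messages.
scan_coda_logs() {
    # The extra /dev/null argument makes grep print the file name even when
    # only one log file is given.
    grep -n \
        -e 'AllocViaWrapAround' \
        -e 'RPC2_CommInit' \
        -e 'RPC2_DUPLICATESERVER' \
        -e 'Cannot get rootvolume name' \
        "$@" /dev/null
}
```

For example, `scan_coda_logs /vice/srv/SrvLog /vice/srv/SrvErr` on a server, or the same call with the client log paths on a client.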

== Client Problems == #ClientProblems

 '''Client does not connect to testserver.coda.cs.cmu.edu'''::
  When you have set up your client for the first time and it cannot connect to the testserver at CMU, there are a few possible reasons. You might be running an old release of Coda; check the Coda web site to see what the latest release is.

  Another common reason is that your site is behind a firewall that blocks UDP traffic, or allows only outgoing UDP traffic. Either try Coda on a machine outside the firewall, or set up your own server.

  The third possibility is that the testserver is down for maintenance or upgrades. That does not happen often, but you can use `cmon` to check whether it is up and how long it has been running.
{{{
cmon testserver.coda.cs.cmu.edu
}}}

 '''Venus comes up but prints cannot find !RootVolume'''::
  Any of the reasons in the previous item could be the cause. It is also possible that your `/etc/services` file is incomplete. It needs the following entries:
{{{
# IANA allocated Coda filesystem port numbers
rpc2portmap     369/tcp
rpc2portmap     369/udp    # Coda portmapper
codaauth2       370/tcp
codaauth2       370/udp    # Coda authentication server
venus           2430/tcp   # codacon port
venus           2430/udp   # Venus callback/wbc interface
venus-se        2431/tcp   # tcp side effects
venus-se        2431/udp   # udp sftp side effect
codasrv         2432/tcp   # not used
codasrv         2432/udp   # server port
codasrv-se      2433/tcp   # tcp side effects
codasrv-se      2433/udp   # udp sftp side effect
}}}
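
Checking a services file against the table above can be sketched as follows. The function name `missing_coda_services` is hypothetical; the names and port numbers are copied from the entries listed above, and the sketch only looks at uncommented lines that start with the service name.

```shell
# missing_coda_services: print every "name port/proto" pair from the table
# above that is absent from the given services file (default /etc/services).
# Prints nothing when all twelve entries are present.
missing_coda_services() {
    services_file=${1:-/etc/services}
    for entry in \
        "rpc2portmap 369/tcp" "rpc2portmap 369/udp" \
        "codaauth2 370/tcp"   "codaauth2 370/udp" \
        "venus 2430/tcp"      "venus 2430/udp" \
        "venus-se 2431/tcp"   "venus-se 2431/udp" \
        "codasrv 2432/tcp"    "codasrv 2432/udp" \
        "codasrv-se 2433/tcp" "codasrv-se 2433/udp"
    do
        name=${entry% *}    # text before the space
        port=${entry#* }    # text after the space
        # Match: name at line start, whitespace, port/proto, then whitespace,
        # a comment character, or end of line.
        if ! grep -E -q "^${name}[[:space:]]+${port}([[:space:]#]|\$)" "$services_file"; then
            printf '%s\n' "$entry"
        fi
    done
}
```

Calling `missing_coda_services` with no argument checks `/etc/services`; any line it prints is an entry that still needs to be added.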

 '''Trying to access a file returns Connection timed out (ETIMEDOUT)'''::
  The main reason for Connection timed out errors is that the volume containing the file is disconnected from the servers. However, the error can also occur when the client is in write-disconnected mode and there is an attempt to read a file which is open for writing. See the sections "Volume is fully disconnected" and "Volume is write-disconnected" for more information.

 '''Commands do not return, except by using ^C'''::
  When commands hang, it is likely that venus has crashed. Check `/usr/coda/etc/console` and `/usr/coda/venus.cache/venus.log`.

 '''Venus fails when restarted'''::
  If venus complains (in `venus.log`) about not being able to open `/dev/cfs0`, it is because `/coda` is still mounted.
{{{
# umount /coda
}}}
  Another reason for a failed restart is that another copy of venus is still around, so the new venus is unable to open its network socket. In this case there will be a message in `venus.log` stating that RPC2_CommInit has failed.
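
Both restart obstacles can be checked before starting venus again. The following is a sketch under stated assumptions: the function names are hypothetical, the default mount table `/proc/mounts` is Linux-specific, and `pidof` must be available on the system.

```shell
# coda_still_mounted: succeed if /coda appears as a mount point in the given
# mount table (default /proc/mounts, which exists on Linux only).
coda_still_mounted() {
    mount_table=${1:-/proc/mounts}
    # Field 2 of each mount table line is the mount point.
    awk '$2 == "/coda" { found = 1 } END { exit !found }' "$mount_table"
}

# venus_restart_check: report the two conditions described above.
venus_restart_check() {
    if coda_still_mounted; then
        echo "/coda is still mounted; run: umount /coda"
    fi
    if pidof venus >/dev/null 2>&1; then
        echo "another venus process is still running"
    fi
}
```

If `venus_restart_check` prints nothing, neither of the two conditions applies and `venus.log` is the next place to look.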

 '''Venus doesn't start'''::
  One reason is that you do not have the correct kernel module. This can be tested by inserting the module by hand and then listing the loaded modules; `coda` should show up in that listing. Otherwise reinstall (or recompile) the module.
{{{
# depmod -a
# insmod coda.o
# lsmod
Module                  Size  Used by
coda                   50488   2
}}}
  If the kernel module loads without errors, check `venus.log`. A message stating `Cannot get rootvolume name` indicates either a misconfigured server or that the codasrv and codasrv-se ports are not defined in `/etc/services`. See the entries listed above.

 '''I'm disconnected and Venus doesn't start'''::
  Put the hostnames of your servers in `/etc/hosts`.

 '''I cannot get tokens while disconnected'''::
  Take a vacation until we release a version of Coda which uses its telepathic abilities to contact the auth2 server. We will add this feature.

 '''Hoard doesn't work'''::
  Make sure you have version 5.0 of Coda or later. Before you can hoard you must make sure that:
  * You started Venus with the flag `-primaryuser` ''youruid''
  * You have tokens

== Server Problems == #ServerProblems

 '''The server crashed and prints messages about !AllocViaWrapAround'''::
  This happens when a resolution log is full. The `SrvLog` file will usually show which volume is affected; note down its volume id. You may need to consult `/vice/vol/VRList` on the SCM to find it. Kill the dead (zombied) server and restart it. As soon as it is up, run:
{{{
volutil setlogparms volumeid reson 4 logsize 16384
filcon clear -s "this server"
}}}
  Unless you do "huge" things, 16k will be plenty.

 '''Server doesn't start due to salvaging problems'''::
  If this happens you have several options. If the server crashed during salvaging, it will not come up by simply trying again; you must either repair the damaged volume or not attach that volume.

  Not attaching the volume is done as follows. Find the volume id of the damaged volume in the `SrvLog`. Create a file named `/vice/vol/skipsalvage` with the lines:
{{{
1
0xdd000123
}}}
  Here 1 indicates that a single volume is to be skipped, and 0xdd000123 is the volume id of the replica that should not be attached. If this volume is a replicated volume, take all replicas offline, since otherwise the clients will get very confused.

  You can also try to repair the volume with `norton`. Norton is invoked as:
{{{
norton LOG DATA DATA-SIZE
}}}
  The values for these parameters can be found in `/etc/coda/server.conf`.

  The norton manual pages give details about norton's operation, and there is online guidance available which is possibly more helpful.

  ''NOTES''
   1. Corruption is often replicated. If you find that a server has crashed and does not want to salvage a volume, your other replicas may suffer the same fate, and the risk is that you may have to go back to tape (you do make tapes, right?). Therefore first copy out good data from the available replicas, and only then attend to repairing them or skipping them in salvage.
   2. Very often you have to take both a volume and its most recent clone (generated during backup) offline, since corruption in a volume is inherited by the clone.
   3. If you find that a replica of a volume is corrupt, do not attempt to merely replace that replica. We have found that this corrupts the volume databases. It is better to make a new replicated volume and copy the data over from the healthy replicas (keep the server with the bad replica down).
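
Writing the skipsalvage file described above can be sketched as a small helper. The name `write_skipsalvage` is hypothetical, and extending the format to several volumes (a count on the first line, then one id per line) is an assumption generalized from the single-volume example; verify it against your server version before relying on it.

```shell
# write_skipsalvage: write a skipsalvage file listing the given volume ids.
# Usage: write_skipsalvage FILE VOLID...
# Assumption: line 1 holds the number of ids that follow, one id per line,
# as in the single-volume example on this page.
write_skipsalvage() {
    if [ "$#" -lt 2 ]; then
        echo "usage: write_skipsalvage FILE VOLID..." >&2
        return 1
    fi
    out=$1
    shift
    {
        printf '%s\n' "$#"    # number of volumes to skip
        printf '%s\n' "$@"    # one volume id per line
    } > "$out"
}
```

For the example on this page the call would be `write_skipsalvage /vice/vol/skipsalvage 0xdd000123`.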

 '''How to restore a backup from tape'''::
  On Tuesday I lost my email folder: the whole volume moose:braam.life was corrupted on server moose and would not salvage. Here is how I got it back.

  First I tried mounting moose.braam.life.0.backup, but this was corrupted too.

  On the SCM, in `/vice/vol/VRList`, I found the replicated volume number f0000427 and the volume number ce000011 (fictitious) for the volume.

  I logged in as root on bison, our backup controller. I read the backup log for Tuesday morning in /vice/backuplogs/backuplog.DATE and saw that the incremental dump for August 31st had been fine. At the end of that log, I saw the name f0000427.ce000011 listed as dumped under /backup (a mere symlink), with /backup2 as the spool directory holding the actual file. The backup log very nearly shows how to position the tape and invoke restore:
{{{
cd /backup2
mt -f /dev/nst0 rewind
restore -b 500 -f /dev/nst0 -s 3 -i
}}}
  This invokes the restore command. The -s 3 option varies according to which /backup[123] volume the backup is restored from. Typing help at the restore prompt showed me how to add and then extract the file I wanted. It took a little while before the file was back. From the restore prompt do:
{{{
restore> cd 31Aug1998
restore> add viotti.coda.cs.cmu.edu-f0000427.ce000011
restore> extract
Specify volume #: 1
}}}
  In /vice/db/dumplist I saw that the last full backup had been on Friday, August 28th. I went to the machine room and inserted that tape (recent tapes were stacked on top of bison). This time f0000427.ce000011 was a 200MB file (the last full dump) in /backup3. I extracted the file as above.

  Then I merged the two dumps:
{{{
merge /restore/peter.mail /backup2/28Aug1998/f0000427.ce000011 \
    /backup3/31Aug1998/f0000427.ce000011
}}}
  This took a minute or two to create /restore/peter.mail. Now all that was needed was to upload that to a volume:
{{{
volutil -h moose restore /restore/peter.mail /vicepa vio:braam.mail.restored
}}}
  Back on the SCM, I updated the volume databases:
{{{
bldvldb.sh viotti
}}}
  Now I could mount the restored volume:
{{{
cfs mkm restored-mail vio:braam.mail.restored
}}}
  and copy it into a read-write volume using cpio or tar.

 '''createvol_rep reports RPC2_NOBINDING'''::
  When createvol_rep reports RPC2_NOBINDING while you are trying to create volumes, it is an indication that the server is not (yet) accepting connections.

  It is useful to look at `/vice/srv/SrvLog`. The server performs the equivalent of fsck on startup, which might take some time. Only once the server has logged "Fileserver Started" in `SrvLog` does it start accepting incoming connections.

  Another possible reason is that an old server is still around, blocking the new server from accessing the network ports.
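
Waiting for the "Fileserver Started" message before running createvol_rep can be sketched like this. The function names and the polling defaults (60 attempts, 5 seconds apart) are choices made for this sketch, not Coda conventions. Note that if `SrvLog` still contains lines from an earlier server run, an old message will match as well.

```shell
# fileserver_started: succeed if the given SrvLog contains the
# "Fileserver Started" message quoted on this page.
fileserver_started() {
    grep -q 'Fileserver Started' "${1:-/vice/srv/SrvLog}"
}

# wait_for_fileserver: poll the log until the message appears or the
# attempts run out.
# Usage: wait_for_fileserver [LOGFILE [TRIES [DELAY_SECONDS]]]
wait_for_fileserver() {
    log=${1:-/vice/srv/SrvLog}
    tries=${2:-60}
    delay=${3:-5}
    while [ "$tries" -gt 0 ]; do
        if fileserver_started "$log"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
```

A typical use would be `wait_for_fileserver && createvol_rep ...`, with the createvol_rep arguments left as they are at your site.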

 '''RPC2_DUPLICATESERVER in the rpc2portmap/auth2 logs'''::
  Some process already has the UDP port open that rpc2portmap or auth2 is trying to obtain. In most cases this is an already running copy of rpc2portmap or auth2. Kill all running copies of the program in question and restart it.

 '''Server crashed shortly after updating files in /vice/db'''::
  Servers can crash when they are given inconsistent or bad data files. Check whether updateclnt and updatesrv are both running on the SCM and on the machine that has crashed. You can kill and restart them. Then restart codasrv and it should come up.

 '''Users cannot authenticate or created volumes are not mountable'''::
  Check whether auth2, updateclnt, and updatesrv are running on all file servers. Also check their log files for possible errors.

== Disconnections == #Disconnections

Most common problems are related to the semantic differences that arise from "involuntary" disconnections. This section therefore gives some background on why volumes become disconnected or write-disconnected, and on how to get them to reconnect.

=== Volume is fully disconnected ===

There are several reasons why a Coda client may have disconnected some or all volumes from an accessible server.

 ''Pending reintegration''::
  When modifications have been made to the volume in disconnected mode, the client will not reconnect the volume until all changes have been reintegrated. Reintegration will not occur without proper user authentication tokens. Furthermore, reintegration is suspended as long as there are objects in conflict.

  The most important thing here is to have a `codacon` process running, since it gives up-to-date information on what venus is doing. Venus informs the user about missing Coda authentication tokens with the message `Reintegration: pending tokens for user <uid>`. In this case the user should authenticate using the `clog` command.

  Conflicts, which require the use of the repair tool, are reported with the message `local object <pathname> inconsistent`. Otherwise `codacon` should show messages about backfetches and about how many modifications were successfully reintegrated.

 ''Access permissions''::
  The client may also disconnect when a server reports an error for an operation that the client considers valid. One cause is authentication failure: check your tokens using `ctokens` and, if necessary, obtain new tokens using `clog`. Another cause is an inconsistency between the data cached on the client and the actual data stored on the server; this will reveal itself as an inconsistent object during subsequent reintegration.

 ''Lost connections''::
  Sometimes the client does not receive a prompt reply from an accessible server and marks the server as dead. This will of course disconnect the volume if the last server is lost. Once every five minutes the client automatically verifies connectivity with all known servers, and can thus recover from lost connections. The user can also trigger this check by executing the `cfs checkservers` command.

  If `cfs checkservers` reports that servers are unreachable, it is worth checking with `cmon` whether the server is responding at all, since it may have crashed. When a server that was considered unreachable is successfully contacted after `cfs checkservers`, reintegration starts automatically (provided the user has tokens and there are no inconsistencies).

=== Volume is write-disconnected ===

The terms write-disconnected operation and weakly connected mode are used interchangeably to describe this volume state; they are effectively the same. This is the special situation where a client observes weak connectivity with a server and therefore forces the associated volumes into weakly connected mode. Weakly connected volumes postpone writing to the server, which significantly reduces the time spent waiting on a slow network connection. Read operations are still serviced by the local cache and the servers, as in fully connected mode, which is why this mode is also called write-disconnected operation.

The write operations effectively form a continuous reintegration (trickle-reintegration) in the background. This mode therefore requires users to be authenticated and increases the chance of file conflicts. The following are the reasons for write-disconnected operation.

 ''Weak network connectivity''::
  Venus uses bandwidth estimates made by the rpc2 communication layer to judge the quality of the network connection with the servers. As soon as the connectivity to one of the servers drops below the weakly connected threshold (currently 50 KB/s), venus forces all volumes associated with that server into weakly connected mode. The `cfs wr` command can be used to force the volumes back into fully connected mode and immediately reintegrate all changes.

  To avoid switching to weakly connected mode, use `cfs strong`, which makes venus ignore the bandwidth estimates. `cfs adaptive` makes venus go back to interpreting the bandwidth estimates.

  If the user was not authenticated, or conflicts were created during write-disconnected operation, the user must first obtain proper authentication tokens or repair any inconsistent objects before the volume becomes fully connected again. Here again `codacon` is an invaluable tool for gaining insight into the client's behaviour.

 ''User requested write-disconnect mode''::
  Users can ask venus to force volumes into write-disconnected mode, exchanging high consistency for significantly improved performance. The -age and -time flags on the `cfs wd` command line give some control over the speed at which venus performs the trickle-reintegration. By default, only mutations to the file system older than 15 minutes are reintegrated. To reintegrate more quickly, you could use `cfs wd -age 5`, which attempts to reintegrate all mutations older than 5 seconds.

 ''Pending reintegration''::
  When a volume is write-disconnected, it will stay write-disconnected until a user properly authenticates using `clog`.

== Advanced Troubleshooting == #AdvancedTroubleshooting

=== rpc2tcpdump ===

rpc2tcpdump is a modified version of tcpdump that decodes rpc2 protocol headers. This makes it a very useful tool for analyzing why programs fail to work.

All traffic between venus and the Coda servers can be viewed using the following command.
{{{
# tcpdump -s120 -Trpc2 port venus or port venus-se
}}}
To identify problems with `clog`, for instance to see which server it is trying to get tokens from, use:
{{{
# tcpdump -s120 -Trpc2 port codaauth2
}}}

=== Debugging with gdb ===

To make it possible to debug programs that use RVM, most Coda-related applications go into an endless sleep when something goes really wrong. They print their process id in the log (for instance `venus.log` or !SrvLog), and a user can attach a debugger to the crashed, but still running, program.
{{{
# gdb /usr/sbin/venus `pidof venus`
}}}
This makes it possible to get a stack backtrace (`where`), go to a specific stack frame (`frame <x>`), or view the contents of variables (`print <varname>`). If the Coda sources are installed in the same place the binaries were originally built from, the surrounding code fragment can be viewed from within the debugger using the `list` command.

When using !RedHat Linux rpms, you can put the sources in the right place by installing the Coda source rpm file.
{{{
# rpm -i coda-x.x.x.src.rpm
# rpm -bp /usr/src/redhat/SPECS/coda.spec
}}}
On other platforms, look at the paths reported in the backtrace and unpack the source tarball in the matching location.
{{{
(gdb) where
#0  CommInit () at /usr/local/src/coda-4.6.5/coda-src/venus/comm.cc:175
#1  0x80fa8c3 in main (argc=1, argv=0xbffffda4) at /usr/local/src/coda-4.6.5/coda-src/venus/venus.cc:168
(gdb) quit
# cd /usr/local/src
# tar -xvzf coda-4.6.5.tgz
}}}

== Troubleshooting on Windows 95 == #WindowsTrouble

=== Common problems ===

 '''Unable to shut down Windows 95'''::
  Check the DOS window settings of Venus and Relay. The check box under Properties -> Misc -> Termination must be unticked.

 '''I cannot reboot Windows 95 and I think it is due to the VXDs loaded for Coda'''::
  Boot your system in DOS mode by pressing F8 at boot time. Change to the windows directory and type `edit system.ini`. In the section [386Enh] you will find the entries
{{{
device=c:\usr\coda\bin\mmap.vxd
device=c:\usr\coda\bin\mcstub.vxd
}}}
  Comment them out by putting a ; in front of each line. Then try to restart Windows.

 '''How can I find out why venus.exe crashed'''::
  See the client troubleshooting sections above. After a crash it might not be possible to restart Venus if the Coda drive is still mounted. In this case try to unmount it by typing
{{{
unmount <drive>:
}}}
  If that does not work, reboot the machine.

 '''How can I find out more about what has happened'''::
  Look in the file `c:\vxd.log`. The file system driver codadev.vxd prints information about all requests and answers to this file. The information is only stored if the debug level has been turned on. The debug level is specified in the registry:
{{{
HKLM/System/CurrentControlSet/Services/VxD/Codadev/Debuglevel
}}}
  Set the debug level higher than 0 to receive messages in the debug file.

 '''I unplug my running machine from the network and the Explorer blocks'''::
  Venus switches to disconnected mode after a short timeout. After that it should work fine. If it doesn't, check whether you have "network connections" set up in the Explorer (e.g. a Samba drive). Such network connections block your system when no network is available.

=== Restrictions ===

 * Most command line tools that talk to Venus through the ioctl interface of the Coda kernel module seem to work, even when they print error messages.
 * Handling large files (in particular executables) does not work well in a low-bandwidth scenario.
 * cfs.exe and hoard.exe use absolute pathnames so far.
 * Long filenames are not yet supported under the DOS environment. You can access files, but you need to use the long filenames.