Welcome to the MacNN Forums.

Scotttheking · 12:35 PM

I just logged into my fast box, and it's been sitting there all night with these:

[16:26:32] + Attempting to get work packet
[16:26:32] - Connecting to assignment server
[16:26:33] - Successful: assigned to (171.64.122.119).
[16:26:33] + News From Folding@Home: Welcome to Folding@Home
[16:26:33] Loaded queue successfully.
[16:26:33] - Couldn't send HTTP request to server
[16:26:33] + Could not connect to Work Server
[16:26:33] - Error: Getwork #5 failed, and no other work to do. Waiting before retry

What is going on, and how do I fix it? Restarting the client did nothing. It's got a 20 point protein waiting to be sent in.
This one is running a Windows client.

Same with another Linux client, it's stuck.

Same with another Windows client.

I'm guessing all my boxes are stuck.

WTH is going on???
My whole farm is sitting here, idle.

Edit: Doing more digging, I find this.
http://folding.stanford.edu/serverstat.html

And yep, the server I'm trying to connect to is overloaded. This is absurd.

Shaktai · 02:11 PM

Looks like Stanford is having some kind of major problem. They are on Pacific time, so they probably became aware of the problem an hour or two ago and are having to get staff in to work on it.

UPDATE 09:39 Pacific time: The website, stats and server status are all back up. Looks like the work servers may be too, but can't confirm it until I have something more to send or receive. This caught the Stanford folks sound asleep in their beds most likely. Looks like they scrambled to get right on it though.

Scotttheking · Jan 18, 2003, 01:43 PM

That's only what, 9 hours too late?

They need to fix their setup so this stuff doesn't happen.
I don't have time to monitor my farm and switch stuff to different projects when they go down.

reader50 · Jan 18, 2003, 03:23 PM

The "Server Status" indicators in the quickstats scan that page. It assumes that the project is online for each platform if any server can accept units, and any can assign them. This basically assumes the client will check all servers and assign / upload as needed.

Looks like the client is not that smart. I'll have to rework the status code to trip on any server problems that could affect a client.
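
A rough sketch of the two status rules described above, in Python. The `ServerStatus` fields, the function names, and the sample server labels are all made up for illustration; the real quickstats code is not shown in this thread, and the "strict" rule is only one possible reading of "trip on any server problems".

```python
from dataclasses import dataclass


@dataclass
class ServerStatus:
    name: str         # hypothetical label, not a real F@H server name
    can_assign: bool  # server is currently handing out work units
    can_accept: bool  # server is currently accepting finished units


def online_lenient(servers):
    """Current rule: online if ANY server assigns and ANY server accepts."""
    return any(s.can_assign for s in servers) and any(s.can_accept for s in servers)


def online_strict(servers):
    """Assumed reworked rule: trip as soon as any one server has a problem."""
    return all(s.can_assign and s.can_accept for s in servers)


servers = [
    ServerStatus("server-a", can_assign=True, can_accept=True),
    ServerStatus("server-b", can_assign=False, can_accept=False),
]
print(online_lenient(servers), online_strict(servers))
```

With one healthy server and one dead one, the lenient rule still reports the project as online, which matches what the indicators showed during the outage; the strict rule would have flagged it.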

Scotttheking · Jan 18, 2003, 05:00 PM

Looks to be back now.
That sure was absurd.

Shaktai · Jan 18, 2003, 07:47 PM

The official word on the problem from Vijay Pande is:

Around 2-3am, there was apparently a major problem at Stanford's main computer centers, causing net outages throughout campus.

It looks like F@H is (back) up and mostly ok, just really overloaded due to everyone trying to return data. The clients have built in backoffs, so it should resolve itself, as long as everyone just waits (rather than pounding the servers manually -- which will prolong this for everyone).

Guha & I are on it.

So it wasn't just their servers but a network outage that created the problems. Blame it on the full moon I guess.
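
For anyone wondering what "built in backoffs" means in practice, here is a minimal sketch of exponential backoff with optional jitter. The base delay, growth factor, and cap are assumptions picked for illustration; the thread does not say what intervals the actual client uses.

```python
import random


def backoff_delays(base=60.0, factor=2.0, cap=3600.0, attempts=6, jitter=0.0, rng=None):
    """Return the wait (in seconds) before each retry.

    The wait grows by `factor` after every failed attempt and never exceeds
    `cap`. `jitter` is a fraction of the current wait that is randomly added
    or subtracted so a crowd of clients does not retry at the same instant.
    All default values here are assumed, not taken from the real client.
    """
    rng = rng or random.Random()
    delays = []
    delay = base
    for _ in range(attempts):
        spread = delay * jitter
        delays.append(min(cap, delay + rng.uniform(-spread, spread)))
        delay = min(cap, delay * factor)
    return delays


print(backoff_delays())
```

The point of the growing wait is the one made in the quote: if every client waits longer after each failure, the load on the servers drops on its own, while retrying by hand resets that and keeps them swamped.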

I was lucky, my boxes had mostly all downloaded new proteins right before the outage, and so just kept crunching through it. Only had one out of six that was affected, and then probably didn't lose more than an hour or two of crunch time on that one, and it is back up working on another 19.4 pointer that will finish tomorrow (it is the Celeron, so it takes a while).

Darn Scott, you seem to have the worst luck with this project. Go weeks without a major glitch, you join in and the project gets hit hard just in time to take out most of your farm.

What a bummer!

Shaktai · Jan 18, 2003, 07:52 PM

Originally posted by reader50:
The "Server Status" indicators in the quickstats scan that page. It assumes that the project is online for each platform if any server can accept units, and any can assign them. This basically assumes the client will check all servers and assign / upload as needed.

Looks like the client is not that smart. I'll have to rework the status code to trip on any server problems that could affect a client.

From what I could tell, it couldn't reassign other servers to issue work, because the entire system was down or overloaded. Normally though, if it can't reach one server, then it will try another. If a receiving server goes down, it will hold finished work in the queue until it can be sent.
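
The normal behavior described above (try the next server, and hold finished work in the queue if nobody takes it) could look roughly like this. This is only a sketch of the idea, not the client's actual code: `try_send`, the server labels, and the unit names are hypothetical stand-ins.

```python
from collections import deque


def send_with_failover(unit, servers, try_send):
    """Try each server in order; return the one that took the unit, or None."""
    for server in servers:
        if try_send(server, unit):
            return server
    return None


def flush_queue(queue, servers, try_send):
    """Attempt to send every queued unit once; unsent units stay queued."""
    sent = []
    for _ in range(len(queue)):
        unit = queue.popleft()
        server = send_with_failover(unit, servers, try_send)
        if server is None:
            queue.append(unit)  # hold the finished work for the next retry
        else:
            sent.append((unit, server))
    return sent


# Hypothetical example: server "a" is down, server "b" is accepting work.
queue = deque(["wu1", "wu2"])
print(flush_queue(queue, ["a", "b"], lambda server, unit: server == "b"))
```

When every server refuses, as during this outage, nothing is lost: the units simply stay in the queue until a later pass succeeds.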

Scotttheking · Jan 18, 2003, 08:54 PM

Originally posted by Shaktai:
From what I could tell, it couldn't reassign other servers to issue work, because the entire system was down or overloaded. Normally though, if it can't reach one server, then it will try another. If a receiving server goes down, it will hold finished work in the queue until it can be sent.

That's what it's supposed to do.
That's not what it did. It just kept sending me to the same server for hours.
Finally it sent me to new servers.

I can do the big 20 pointers in under a day, so a server outage hurts me bad. This is the 2nd time there's been a major outage for me while doing f@h.

On a good note, I might be able to get another athlon. I hope.

krove · Jan 18, 2003, 11:51 PM

Yeah, I had a big one that took a couple days on an old iMac that couldn't be submitted. I checked the server status on their website and all looked well at the time (this was about 2 weeks ago). I checked out the protein that had completed and apparently the project was gone or no longer up. I was within the time limits, so I'm not sure what happens when they finish distributing a particular project... Kinda sucks to have crunching time wasted like that...

I also found that I had inadvertently set up a client incorrectly on a machine at work, putting "k" as the team number instead of 16. Lost 15 points there...

reader50 · Jan 19, 2003, 01:34 AM

I believe you still get stats credit for work submitted after the time limit. You do on other projects anyway. It's just not as useful to the project, because they may have reissued the unit to someone else. It would become a redundancy check instead of original data.

Shaktai · Jan 19, 2003, 04:16 AM

Yes, you should get credit. There have been a few instances where a receiving server went down prematurely, but even if the project ends, keep letting it try to send. If it doesn't go in a couple of days, go to the folding forum for help. The Pande Group really tries to help out, and they have improved their systems dramatically.

Today's snafu had nothing to do with their systems malfunctioning, but was rather the result of a campus-wide network problem. Unfortunately, it created a lot of problems for them and all of us.

Scotttheking · Jan 19, 2003, 12:02 PM

Originally posted by Shaktai:
Darn Scott, you seem to have the worst luck

You have no idea