WoW-GPS Dev Blog

WoW-GPS 2.0 - Data Progress

So...EU realms screwed everything up.

As I mentioned in the comments of the last progress post, I had finished scaling up all the US realms and was starting to work on adding EU realms this past week. It wasn't too bad at first - sure, the extra frequency was quite noticeable, and the EU batches were quickly surpassing the US ones in terms of volume, even though I was still fetching fewer realms overall.

As I scaled the EU realms up though, things started to go south. The EU queue was frequently getting killed mid-process and at times it was even spilling over to the US realms. The US queue wasn't a big deal, as the low volume meant it was relatively easy to get caught back up, but by the end of day one with EU and US realms both at ~150 each, I was sitting with a huge backlog of unprocessed EU data.

So, I scaled the EU realms back down again, split the queue into 2 separate workers and manually ran the backlog in between queues. Before long, I was back to normal, sitting at 100 EU realms, and things seemed to be going smoothly again. The next morning, when things looked relatively normal, I jumped back up to 150 and left for work. Well...I came home to an even bigger backlog than the day before. I had also started working on a DB/data checking/recovery process at the same time, and noticed that when I tried to run that script, not only was it getting killed, but it was also making the queue-breaking issues worse.

After a solid day of going through server logs and consulting S.O./Google/etc., it seemed to me that either Apache and/or my FastCGI setup was throttling my requests. Initially, when I started setting everything up, I figured it would be easiest to have the Heroku workers make individual POST requests for each realm on every pass. So, when copying the raw API files, it would send 1 request for each realm that had new data. Then, when processing the queues, it would send 1 request for every single realm and parse through the data realm by realm. I figured it would be "easier" on the webserver to generate a bunch of smaller requests that result in shorter, less obtrusive PHP processes being run. Apparently, my shared host isn't set up that way - instead, I seem to be working with ~5 static PHP FastCGI processes that remain alive constantly. On top of that, they apparently have request quantity limits that I'm unable to increase at all. Fun, right? And that's just for data retrieval and storage - never mind however many additional POST requests I'd need to pull data OUT of the server once the front-facing site goes live.

So, I've recoded the queue requests to parse through batches of 10 realms per request. This means that if there are 45 EU realms waiting to be parsed, instead of sending 45 individual requests, it will now send 5. This seems to have helped significantly. I've left the API fetch script at 1 realm at a time, because I'd have to rewrite both ends of the function (data server AND Heroku script) in order to still be able to handle individual file errors (I still get lots of 404 errors from Blizzard, etc.), but it's nice to know that option is available, should I require it. There's also the option to increase the queue processing from batches of 10 to something higher, but I want to keep individual script times within a reasonable range (i.e. no 5-minute scripts trying to parse 50 realms at a time).
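In rough terms, the batching change looks something like this on the worker side (just an illustrative sketch - the endpoint URL, payload shape, and helper name are placeholders, not the actual WoW-GPS code):

```php
<?php
// Rough illustration of the batching change: one POST per group of 10 realms
// instead of one POST per realm. The endpoint URL, payload shape, and helper
// name are placeholders, not the actual WoW-GPS code.

const BATCH_SIZE     = 10;
const QUEUE_ENDPOINT = 'https://data-server.example/process-queue.php';

function processQueue(array $queuedRealms): void
{
    // 45 queued realms => 5 requests instead of 45.
    foreach (array_chunk($queuedRealms, BATCH_SIZE) as $batch) {
        $ch = curl_init(QUEUE_ENDPOINT);
        curl_setopt_array($ch, [
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => http_build_query(['realms' => $batch]),
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 120, // keep each run well under the "5 min script" range
        ]);

        if (curl_exec($ch) === false) {
            // Log and keep going so one failed batch doesn't kill the whole pass.
            error_log('Queue batch failed: ' . curl_error($ch));
        }
        curl_close($ch);
    }
}

processQueue(['aegwynn-eu', 'antonidas-eu', 'blackhand-eu' /* ... */]);
```

The nice part of structuring it this way is that the batch size becomes the only knob to turn if 10 ever needs to become 20 (or 2).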

When all is said and done, we only ended up losing 2 EU DBs out of that whole ordeal, and they had only been running for a couple of days at most, so it really wasn't a significant loss. The ordeal also forced me to restructure the scripts so that they're more scalable (I can scale batches all the way down to 1 or 2 realms each, if things change, or up higher as necessary), and I've been able to test some of the limits/boundaries of the Data server - which is very useful knowledge moving forward. I also now have a functional DB verification script that I can run to let me know if there are any corrupt DBs. Not a bad few days, if you ask me.
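For what it's worth, the verification pass boils down to something like this (a simplified sketch: the per-realm connection strings and the table name are placeholder examples, not the real schema):

```php
<?php
// Simplified sketch of the DB verification pass: try a cheap read against each
// realm DB and flag anything that errors out. The DSNs and table name are
// placeholders, not the real schema.

$realmDsns = [
    'us-aegwynn' => 'sqlite:/data/us-aegwynn.sqlite',
    'eu-aegwynn' => 'sqlite:/data/eu-aegwynn.sqlite',
    // ...one entry per realm...
];

$corrupt = [];

foreach ($realmDsns as $realm => $dsn) {
    try {
        $pdo = new PDO($dsn, null, null, [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
        // If the DB is corrupt or unreadable, this throws and the realm gets flagged.
        $pdo->query('SELECT COUNT(*) FROM auction_data')->fetchColumn();
    } catch (PDOException $e) {
        $corrupt[] = "{$realm} ({$e->getMessage()})";
    }
}

echo $corrupt
    ? 'Corrupt or unreadable realm DBs:' . PHP_EOL . implode(PHP_EOL, $corrupt) . PHP_EOL
    : 'All realm DBs passed the check.' . PHP_EOL;
```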

As it stands, we're currently sitting at 100% capacity on US realms and ~80-85% for EU. If things go smoothly for the rest of the day, I'm going to push EU up to 100% and keep a close eye on things for the rest of the weekend.

Comments

  1. Kathroman
    Well, that didn't take long...

    The EU API script started bombing partway through this afternoon, so I went ahead and rewrote those scripts to accommodate batches of 20. I've also gone ahead and scaled EU realms up to 100% capacity, so now it's just a matter of making sure everything keeps working.

    At least I now have the ability to easily increase those batch sizes wherever appropriate, so hopefully no more code rewriting anyway...
  2. Kathroman
    UPDATE: Everything seems to be doing well the past 2 days. I also had some downtime at work this morning and built myself a little dashboard that scans each realm for errors and has some built-in recovery/maintenance triggers. I'll be able to use this to monitor the data server for any errors instead of having to tail the Heroku logs.

    By the end of next weekend, we should have ~2 weeks of data for every US and EU server, so I can hopefully shift the Data processes into maintenance mode and focus 100% of my attention back on the front-end and application development.
  3. Saliira
    I love these blog posts even though I don't have any feedback or analysis to contribute. Keep them coming for the interested reader. :-)
  4. Kathroman
    Quote Originally Posted by Saliira
    I love these blog posts even though I don't have any feedback or analysis to contribute. Keep them coming for the interested reader. :-)
    Good to hear! I personally think it will be really interesting in a year or two to look back on these early development landmarks and see how far things have come...
  5. Kathroman
    Happy to report that things have been running smoothly since the weekend *knock on wood*

    I've been running the new dashboard a couple of times each day and haven't had to intervene once so far. Ready to move on to application details (fun stuff).
  6. Moogle_
    +1 to Saliira's post. Also, I might not understand every part of the posts, but at least you got me to use Google excessively every time I open one of these.
  7. Kathroman
    Quote Originally Posted by Moogle_
    +1 to Saliira's post. Also, I might not understand every part of the posts, but at least you got me to use Google excessively every time I open one of these.
    Nice. Good to hear they're not completely over people's heads - I do still try to keep them accessible.
  8. Kathroman
    Well, apparently I jinxed it. I've been getting 500 server errors for the past few hours when fetching/copying the raw files. I'm not 100% certain yet whether it's coming from Blizz or from the Data server. Unfortunately, the Data server doesn't have permanent error logs, and I have to actually go in every week and re-enable them. I did add a bit more error reporting to the script, though, so that's always a good thing. I also see that Blizzard has been having some issues lately with a power outage at one of their data centres, so perhaps this is related. I might have also gotten myself IP blocked by either Blizz, or the data server, or both. Usually you'd get a more specific error message in that case, though, so who knows.

    The good news is that if it is Blizzard throttling/choking me off (which doesn't make a LOT of sense, since they're currently allowing limitless API requests), my setup is super flexible: I can simply shut down the Heroku workers, create new ones, redeploy, run the Realm DB population script, queue the workers, and we're back in business on a brand new IP address.

    Hopefully we'll know more in a little bit...
  9. Kathroman
    *sigh* - looks like it was just one or 2 realms that were bombing the whole script. Kicking myself for not putting that error trapping in from the start, but at least it's there now and we're back on track.
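    Roughly speaking, the trap is just a per-realm try/catch, so a single 404 or bad file gets logged and skipped instead of aborting the whole pass - something like this sketch (the fetch helper, URL, and realm list are illustrative placeholders, not the actual code):

```php
<?php
// Sketch of per-realm error trapping: a single failing realm (404 from the API,
// malformed file, etc.) gets logged and skipped instead of bombing the script.
// The fetch helper, URL, and realm list are placeholders for illustration.

function fetchRealmAuctionFile(string $realm): string
{
    $url  = "https://api.example/auction-data/{$realm}.json";
    $json = @file_get_contents($url);
    if ($json === false) {
        throw new RuntimeException("Failed to fetch data for {$realm}");
    }
    return $json;
}

$realms = ['aegwynn-eu', 'antonidas-eu', 'blackhand-eu']; // placeholder list
$failed = [];

foreach ($realms as $realm) {
    try {
        $raw = fetchRealmAuctionFile($realm);
        // ...store/queue $raw for processing...
    } catch (Throwable $e) {
        // Record the failure and move on rather than letting it kill the run.
        $failed[] = $realm;
        error_log("Skipping {$realm}: " . $e->getMessage());
    }
}

if ($failed) {
    error_log('Realms skipped this pass: ' . implode(', ', $failed));
}
```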