
WoW-GPS Dev Blog

API First

I had a thought the other day - potentially a big one.

First, some obligatory backstory: from just about Day 1, WoW-GPS has been using wowuction as its primary data source. Before I even started developing the first module (Saronite Shuffle), I had made arrangements with Sasa, and he even set up a private API channel to pull individual realm data from. Over time, certain limitations (although reasonable) demonstrated to me the need for WoW-GPS to collect and utilize its own data, directly from the Blizzard API. Part of the reason was that the method Sasa had worked out for me limited the number of items I could fetch to about 150 per request, and in building an MoP shuffler module (never finished, FYI), I found myself either needing to split the request into 2 calls and increase processing time, or drop items from the module altogether (like perfect uncommon cuts, etc.) and diminish the user experience. There were also a number of occasions where I found myself needing either more or less information than what the wowuction API presented, unfortunately without the ability to further customize my requests.

To compound the issue, there was the glaring fact that I was completely reliant upon someone else's service to provide CRITICAL functionality to my own. If wowuction were ever no longer maintained, or if there were a major service disruption, WoW-GPS would come to a screeching halt (aside from users manually inputting ALL their prices, of course). I know some might say that the same can be said of the Blizz API data, but the major difference is that Blizz API issues affect EVERYONE, so it wouldn't be an isolated problem for WoW-GPS users. All in all, it's just a more comfortable feeling knowing you're in control of your own fate, so to speak.

So, I've been thinking about data solutions for quite some time, probably months. Last year, when @Sapu94 and the TSM crew started revamping the desktop app and making plans for their own data storage process, it was proposed that I share resources with them, but again, I'd prefer not to piggyback on another service where I can avoid it, and especially not for critical processes like AH data. Those discussions got me thinking, though, and I started looking into a number of different options on how to approach it. As I see it, there are 2 components for a service such as TUJ, wowuction or TSM when it comes to handling the API data: processing and storage. The real challenge here is that doing things "properly" isn't cheap. A commercial-grade server capable of handling both components efficiently is going to be extremely pricey, and the bottom line is always a factor (sometimes it's THE factor) when developing a self-funded, essentially "hobbyist" project like this. So, I started looking at some resources I already had at my disposal, as well as some other, more financially-accessible ones.

One option that has always intrigued me, mostly due to my exposure to Rails, is Heroku, specifically the free portion of their service. Essentially, Heroku offers apps either a free web process (which can handle user-based requests, ie. through a browser) or a free worker process (which can run background jobs, server-side). I have no need for the web process, since I already have plenty of space on a lightly used web server, but the free worker process is something that I really want to investigate for the Blizzard API interaction. Fetching data for every single US and EU realm every hour is a pretty resource-heavy process, so outsourcing it to Heroku would be a huge relief. The problem here, however, is that the free Heroku environment only comes with a relatively small DB, and their upgrade plans really only make sense for a legitimately profitable, commercial project - which WoW-GPS is not. So, this would mean that the storage aspect would have to occur somewhere else, which isn't necessarily a huge issue, since Heroku is able to connect to an external DB, presumably without issue. The real problem is where to house a 100-200GB database without breaking the bank. I do also have access to an unlimited (theoretically) DB via a shared hosting provider that I use with some of my freelance clients, but that "theoretically" is a bit of a pain point. They've given me access to an unlimited number of 100MB databases, so I can essentially store lots of data, as long as I don't store it all in the same place. If this sounds like a problem, that's because it is. In theory, I could try to see if each individual realm DB could fit into 100MB, but 1) that's a logistical nightmare, 2) it might mean separating certain aspects of the realm data into smaller chunks, which would be inefficient, and 3) it poses a potential scaling problem, should the source data continue to evolve. So, I filed that one under "doable, but ugly" and continued to search.
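To make the worker idea concrete, here's a minimal sketch of the kind of hourly loop I have in mind, assuming the current (2014-era) Blizzard community auction API. The realm list is a placeholder, and the hand-off function is just a stub - more on that part below:

import time
import requests

# Placeholder list - the real thing would cover every US and EU realm.
REALMS = ["emerald-dream", "stormrage"]
# 2014-era Blizzard community API endpoint for auction dump metadata.
STATUS_URL = "http://us.battle.net/api/wow/auction/data/{}"

last_seen = {}  # realm -> lastModified of the newest dump already handled

def handle_new_dump(realm, dump_url):
    print(realm, dump_url)  # stub: hand the dump URL off for processing

def check_realms():
    for realm in REALMS:
        status = requests.get(STATUS_URL.format(realm)).json()
        dump = status["files"][0]  # descriptor for the latest snapshot
        if last_seen.get(realm) != dump["lastModified"]:
            last_seen[realm] = dump["lastModified"]
            handle_new_dump(realm, dump["url"])

while True:
    check_realms()
    time.sleep(3600)  # Blizzard only refreshes the dumps roughly hourly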

That's when I had an idea, or the first of many, to be precise. While my shared host might be cleverly capping the "unlimited" DB resources, they don't seem to pay as much attention to the actual hard drive itself. I mean, I've got nearly 2.5 years of Emerald Dream AH data in CSV files there that just keep building. It's currently sitting at approx. 50GB with no sign of trouble from the web host, so I figured this could be a viable route.

My first thought was that I could use a Heroku script to get the files, parse through them, and then store the parsed data files on my web server. As I was mulling over ways to accomplish this, the idea came to me to simply have the Heroku script send the URL of the raw data files to the server and store the raw JSON files directly there. I realized that in order to be able to do this, I'd need some sort of processing script on the server to actually copy the files over, and the prospect of having an individual script for every realm was obviously unacceptable, so naturally the idea of having ALL the data (raw and parsed) stored right in the filesystem, instead of in a DB, started creeping in. If I were already storing the raw data files and running a script to process them, I figured it would end up making things more accessible down the road to build on top of that framework. It would probably mean the main site would take a performance hit compared to using a DB (I haven't ever benchmarked it, though), but seeing as I'd have to jump through so many hoops in order to get a cheap DB solution running anyway, I decided that was a wash and turned my focus to the added benefits of filesystem storage.
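To illustrate the server side of that hand-off, here's a tiny receiving script that takes the dump URL from the Heroku worker, saves the raw JSON, and adds it to a processing queue. Realistically my shared host would run this as PHP; the Flask version below is purely a sketch, and all the names (/ingest, raw/, queue.txt) are made up:

import os
import requests
from flask import Flask, request

app = Flask(__name__)
RAW_DIR = "raw"           # hypothetical layout: raw/<region>-<realm>-<timestamp>.json
QUEUE_FILE = "queue.txt"  # dead-simple file-based processing queue

@app.route("/ingest", methods=["POST"])
def ingest():
    dump_url = request.form["url"]
    region = request.form["region"]
    realm = request.form["realm"]
    ts = request.form["ts"]
    os.makedirs(RAW_DIR, exist_ok=True)
    path = os.path.join(RAW_DIR, "{}-{}-{}.json".format(region, realm, ts))
    with open(path, "wb") as f:
        f.write(requests.get(dump_url).content)  # copy the raw dump onto the server
    with open(QUEUE_FILE, "a") as q:
        q.write(path + "\n")                      # queue it up for the parser
    return "ok"

if __name__ == "__main__":
    app.run()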

One of the biggest advantages that I think this setup offers is API functionality. I don't even know if WoW-GPS will support an external API right away, but it's something I've definitely thought about. If I did ever choose to open things up via an API, then having all my data already stored at URLs like API_PATH/Region/Realm/Faction/Item/ItemID.json or API_PATH/Region/Realm/Faction/SellerName.json definitely makes things more accessible. I'm not sure yet if there are other implications, such as file locking or read/write issues, but considering the mess that managing 500+ individual DBs would be, I'm willing to take a chance.
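This is also what makes the filesystem-as-API idea so simple: a lookup is just a path, with no query layer in between. A quick sketch (the api root and "items" folder name are placeholders):

import json
import os

API_PATH = "api"  # document root of the data tree

def item_file(region, realm, faction, item_id):
    # Mirrors the URL scheme: API_PATH/Region/Realm/Faction/items/ItemID.json
    return os.path.join(API_PATH, region, realm, faction, "items", "{}.json".format(item_id))

def load_item(region, realm, faction, item_id):
    with open(item_file(region, realm, faction, item_id)) as f:
        return json.load(f)

# e.g. load_item("us", "emerald-dream", "alliance", 58719)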

For me, the coolest part of all this, however, would be in building WoW-GPS 2.0 from an "API-first" perspective. With several worker processes running on Heroku apps, fetching data from Blizzard and parsing through queues on the server, I'll have to start off by writing API functions on the web server that will be able to handle and execute external requests. It will definitely be an interesting way to approach the project, and one that I'm hoping will eventually lead to highly accessible data for end-users at some point down the road - the sooner, the better.

I'm curious to hear if anyone has any insight or feedback about this API-first undertaking. Think it could work? Excited about the possibility of more easily-accessible data exports? Think I'm crazy?

Comments

  1. snowstorm
    If a small performance hit isn't a huge issue and you think you have theoretically lots of storage space, how would a sqlite3 db work out for that? It wouldn't be nearly as optimized as a dedicated DB server, but it would at least be easily queryable in theory. (Though, I've never used one over 1-2 gigs myself, so no idea how badly the performance degrades when it gets larger...)

    JSON exports for end users would kick ass. I wanted to do some realm data processing myself, but currently I was just going to do it locally, pulling from the TSM app's data crunching. I am not yet familiar with Blizz's API and the proper averaging math to do it myself. In time.
  2. Kathroman
    Quote Originally Posted by snowstorm
    Unfortunately, I don't think that's an option, certainly not a viable one - with the sheer size of what's being stored.

    On the other hand, I'm not actually all that concerned about query time. The setup I have in mind would mean you'd be accessing everything you need in one shot anyway.

    I do agree about the exports - it's something that I feel adds a lot of value for people looking to customize their own solutions, especially with how specific I'm expecting the datasets to become.
  3. Kathroman
    To provide a more detailed response to @Sapu94's conversation on twitter: https://twitter.com/Sapu94/status/462425149065555968

    Here's how I envision things working:

    - worker script copies raw files to server and adds to a processing queue.
    - processing worker script goes through each dump file 1 by 1, parsing each auction 1 at a time.
    - each auction will be parsed into smaller files: itemID.json, sellerName.json, auctionID.json (and whatever else ends up making sense)
    - smaller files will either then be added to another processing queue to have calculations, etc. performed, or they will be performed on demand, per user request, but with substantially smaller sets of data.

    In essence, I'm only ever reading the raw data into memory (in bulk) once, and then it's just left to sit on the server (for end-user access, if they desire) until it expires (1 week, 2 weeks, etc.)
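    As a rough sketch of that fan-out step (the key names and output layout here are placeholders - the real Blizzard dumps nest auctions by faction, so treat the exact structure as illustrative):

    import json
    import os

    def process_dump(path, out_root):
        # Read one raw dump and append each auction ID to the matching index files.
        with open(path) as f:
            dump = json.load(f)
        for auc in dump["auctions"]:  # placeholder key; real dumps nest by faction
            append_id(os.path.join(out_root, "items", "{}.json".format(auc["item"])), auc["auc"])
            append_id(os.path.join(out_root, "sellers", "{}.json".format(auc["owner"])), auc["auc"])

    def append_id(index_path, auction_id):
        os.makedirs(os.path.dirname(index_path), exist_ok=True)
        ids = []
        if os.path.exists(index_path):
            with open(index_path) as f:
                ids = json.load(f)
        if auction_id not in ids:
            ids.append(auction_id)  # each auction is only ever stored once
            with open(index_path, "w") as f:
                json.dump(ids, f)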
  4. Ord
    That is an interesting dilemma. As a database guy working internally only (i.e. data analytics), I've never had the problem of not having an active, nearly unlimited database (or several) at my disposal for number-crunching, which got me scratching my head thinking of alternatives (that don't include a re-purposed PC or two in your garage doing the number-crunching for you).

    As far as what you're proposing, however, I still seem to be lacking a full understanding of how you intend to proceed. I think, similar to what @Sapu94 was trying to say, that every hour or so you will be pulling new AH data from Blizz's API, but if you want to create your own "market value", you will need to process that new data with all your existing, unexpired data as well, every single pull. You'd need to "re-calculate" the market value considering the new pull plus all the data you have for the past 1-2+ weeks, every hour. I can see your idea working for things that don't need to be analyzed, like character professions/level and so on that are only ever "realtime", but for market value, I don't see a way around data-processing all of your data every hour to aggregate your new pull into your calculated market values.

    At least, that's how I understand and possibly don't quite see the real solution that you seem to have figured out.

    Also, I totally understand wanting to be independent of other services to provide a sort of 'bare-metal' approach. Perhaps instead of relying solely on one service, you could provide options to link to any one of them (à la a dropdown in the user preferences for preferred price source), or even your own custom aggregation from multiple sources. Average the realm market values from WoWuction, TUJ and TSM all at once. If one is down, the other two are still available to generate a value.
  5. Kathroman
    Quote Originally Posted by Ord
    My thinking here is that I'd like to keep things as flexible as possible, since so much of the end-user functionality hasn't been fleshed out. So, I don't really know what I NEED the data to be until I've fully worked out how it's going to end up being USED. As such, nothing's really been decided - merely floating ideas around.

    For the traditional MV model, I have 2 options, really. First would be to calculate MV for every item as the scans are parsed (ie. "hourly"). Second, though, would be to store the auction data for each item, and then calculate MV on-demand. Consider the following data structure:

    .../API/region/realm/faction/items/58719.json
    {
        "item": "Get Rich Ore",
        "auctions": ["18976258", "18976259", "18976260", ...]
    }

    In order to calculate MV on-demand, I'd just need to grab the auction data from that "auctions" array and parse through it, rather than going through all of the raw files.

    I'll try not to get TOO philosophical here, but I'm also considering moving away from the traditional MV calculation as well. One of the primary reasons is data efficiency - in order to do traditional MV, I'd need to store EVERY auction from EVERY scan, whereas what I've proposed would only store each auction ONCE, and then update the time remaining as that changes. Traditional MV would also mean that I'd need to pull timestamp data so that auctions could be associated with their scan times. If I move away from it, I can use a simpler calculation, and a data structure like this might make sense:

    .../API/region/realm/faction/items/58719.json
    {
        "item": "Get Rich Ore",
        "auctions": [
            { "id": "18976258", "price": "54999", "qty": "5" },
            { "id": "18976259", "price": "54998", "qty": "10" },
            { "id": "18976260", "price": "54997", "qty": "5" },
            ...
        ]
    }

    With this, I wouldn't need to read each auction file for MV, since the essential data would be right there in the index file.
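    For illustration, calculating a simple MV straight from that index file might look like the sketch below - a quantity-weighted mean, assuming "price" is per-unit. (The actual calculation is exactly what hasn't been decided yet.)

    import json

    def market_value(index_path):
        # One file read per item - no per-auction files, no raw dumps.
        with open(index_path) as f:
            item = json.load(f)
        total_copper = 0
        total_qty = 0
        for auc in item["auctions"]:
            total_copper += int(auc["price"]) * int(auc["qty"])
            total_qty += int(auc["qty"])
        return total_copper / total_qty if total_qty else None

    # e.g. market_value("API/us/emerald-dream/alliance/items/58719.json")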

    Now, as for the philosophical reasoning: the current MV calculation essentially takes a snapshot of the AH at every scan interval, and the historical data compares each snapshot against the others. The main assumption here is that every single snapshot is equally valuable, but is this really the case? Is the MV of an item at 4am really as important as the MV of an item at 8pm? It might be, but given some of the potential obstacles in calculating that, I'm willing to explore other options.

    Hope that makes some sense. Remember that nothing's been finalized, nor will it be for some time. I'm trying to explore some different options to find something that will work the best.

    As for a combination of other services' data: I don't think it would work, really. None of them are really accessible on the scale that I expect I'd need them to be. Wowuction has been the best, since Sasa has opened up an exclusive API to handle larger requests, but all the other services are really only built to handle large sets of data (ie. full scans) at infrequent intervals.
  6. Sapu94
    Quote Originally Posted by Kathroman
    As for a combination of other services' data: I don't think it would work, really. None of them are really accessible on the scale that I expect I'd need them to be. Wowuction has been the best, since Sasa has opened up an exclusive API to handle larger requests, but all the other services are really only built to handle large sets of data (ie. full scans) at infrequent intervals.
    We (TSM) will be offering "live" data access. This is exactly what we are doing with the TSM beta app. We update our global prices every 3 hours (don't see any use in doing it much more frequently, but the processing power needed for calculating global prices is negligible due to how our database is organized), and realm-specific prices every time we get new data from Blizzard. Our latency for realm data is only a couple minutes on average from when Blizzard posts a new snapshot to when we make it available for the TSM app to download (we go through all the realms and check for updates every 5 minutes, and it takes ~1 minute to get through our processing pipeline).

    In fact, I've already set up a third-party-accessible API for a couple things, including fetching global data. This will be fleshed out more in the future.
  7. Kathroman
    Quote Originally Posted by Sapu94
    While interesting, this is only as useful as its limitations allow - same as the rest.
  8. Sapu94
    Quote Originally Posted by Kathroman
    I guess I'm mainly confused about why you are so intent on making an API for others to use, when you yourself don't trust APIs made by other people...
  9. Kathroman
    Quote Originally Posted by Sapu94
    1) I'm building an API to build an API. Whether or not others use it is almost irrelevant. It's a valuable skill, and in this day and age, a very powerful experience. The value is in the creation, not the usage.

    2) I never said I don't trust APIs made by other people - I said they have limitations. TUJ has an export available, but it's limited to a small number of requests per day and it's cumbersome, in that the data comes entirely in bulk. It serves a purpose, just not mine, not for this project. Wowuction has a more specific API, which is nice, but it has limitations in the number of items I can retrieve. This has already proven problematic for past modules that I've developed and, based on plans for 2.0, will only continue to prove problematic. It's workable, but it's certainly not a long-term solution. With your API, you've only mentioned that global data will be available, which again serves a purpose, just not mine. Of course, this is subject to change, but I can't very well make plans based on something that may or may not happen. Also, you've gone to great lengths to enlighten me about the trouble you guys have gone through in order to ensure the data is getting ONTO your server, so why wouldn't it be reasonable to expect obstacles getting data OFF of it as well? Limits such as those that TUJ and Wowuction have employed to ensure their services aren't crippled by excessive use are both reasonable and necessary, so even though I have no details about the TSM API, I think it's reasonable to expect usage restrictions will eventually be put in place. Again, these may be limitations that would still allow WoW-GPS to function, but they also may not, and I have to plan accordingly.

    At the end of the day, WoW-GPS isn't going to be built for any other reason than because I enjoy building things (now, if someone commissioned me a large sum of money to develop it, I'd probably consider that as well). So, I'm not really looking to build an API for users (although I expect people will probably get something out of it), I'm building it for myself. I hope that makes sense.
  10. Ord
    Quote Originally Posted by Kathroman
    Now, as for the philosophical reasoning: the current MV calculation essentially takes a snapshot of the AH at every scan interval, and the historical data compares each snapshot against the others. The main assumption here is that every single snapshot is equally valuable, but is this really the case? Is the MV of an item at 4am really as important as the MV of an item at 8pm? It might be, but given some of the potential obstacles in calculating that, I'm willing to explore other options.
    AuctionDB uses an algorithm to remove outliers and then calculates an "average of the previous 14 days' market value plus today's" which is "heavily weighted" resulting in the data "having a 'half life' of a little over 2 days." (source)

    Wowuction's market price "denotes the 15.87th centile (value at standard deviation of -1) of that data it describes" (source)

    TUJ has a mean price in its API that is just that, a "running average", but the market value is "calculated" from the raw scans and "uses more data points [and is] more accurate [than a basic mean]." (source is a bit old, but the best I could find for TUJ)
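    To make "heavily weighted" concrete, here's a rough illustration (emphatically not TSM's actual formula) of exponential smoothing tuned so that old data loses half its weight in a little over 2 days:

    # alpha chosen so a day's influence decays by half in ~2.2 days
    alpha = 1 - 0.5 ** (1 / 2.2)  # roughly 0.27

    def update_market_value(prev_mv, todays_mv):
        return alpha * todays_mv + (1 - alpha) * prev_mv

    mv = 100.0
    for day_price in [100, 100, 100, 50, 50, 50, 50]:
        mv = update_market_value(mv, day_price)
        # ~2 days after the drop to 50, mv has closed about half the gap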

    What I'm getting at here is that nobody does a basic mean where each "single snapshot is equally valuable." There is calculation, math, statistics and processing involved: the "Market Value(s)" that we are familiar with today are calculated with formulas that have been refined and perfected over years by communities who rely on their accuracy for maximum profitability. Even for single items, processing a weighted or statistical average (or whatever method you'd want to experiment with) could be a CPU/RAM/HDD intensive task, as @Sapu94 was hinting with his tweet that "That's still a lot more processing than you might think..." I'm in no way undermining your intentions; it's just that, from what I understand, it might seem like you're trying to 'reinvent the wheel', and I'd hate to see you put a lot of resources into what's already had hundreds or thousands of hours of tested work from the community(ies) in general.

    Perhaps one solution could be a collaboration agreement with @Sapu94, @Erorus and/or Sasa for specific, authenticated access to their services that could allow you to bypass the limitations of their public APIs that would otherwise make GPS 2.0 not effective. I don't know, but your number one dilemma of lack of storage/database size could be a non-issue if you were able to leverage the existing services that do have the storage and processing power to accomplish what you are considering doing. Also consider that these sources are familiar and trusted by users, and even if you have a great solution, some users might want to see TSM/TUJ/Wowuction prices in GPS 2.0 because those are the prices they see in-game and base their operations off of.

    At the end of the day, it'll take real-world tests. Maybe you do have the right idea and are on to something but if not, and it does end up being as troublesome as @Sapu94 hinted at, then using other services resources for your pricing might allow you to move on to the bigger and better things you've been envisioning.

    Not trying to be aggressive about it, just constructive criticism is all; I just want what's best for GPS 2.0.
    Updated May 7th, 2014 at 03:14 AM by Ord
  11. Sapu94
    I agree with @Ord. When I read your blog posts, I think some of the things you've described in your WoW-GPS "wishlist" posts are extremely exciting ideas, but in this post you seem to be focusing a lot of effort on solving problems which have already been solved, and which do not add much benefit to the overall goals you described in those previous "wishlist" posts.

    So, why did we (ie. TSM) re-invent the wheel? Because we (and TSM users) want TSM_AuctionDB to have min buyout data, and that requires more frequent updates than WoWuction / TUJ provide (TSM_WoWuction doesn't even have a min buyout stat for this very reason). It seems like the only "benefit" you're getting out of building your own data collection scripts is the fun / experience of doing so, rather than something that will directly contribute towards the end-goals of WoW-GPS which you previously described. While that experience is worth something, if it's the only benefit you're getting out of all this work, then as somebody who's gone through it myself (and is continuing to fight it - we've had more downtime in the last couple days than uptime due to issues with Blizzard's APIs as well as our server), I'd argue that it's not worth the headaches.

    As far as limitations, our plan (and the beta app does this currently) is to have the TSM app download realm+global data from our server. There are currently ~10k people running the non-beta version of the app. If our new server can't handle an extra load equivalent to say 100 users from a third-party website, then we're in trouble. But as you said, I (and you) have no idea what the requirements of WoW-GPS will actually be and how that will compare.

    Quote Originally Posted by Kathroman
    I'm curious to hear if anyone has any insight or feedback about this API-first undertaking. Think it could work? Excited about the possibility of more easily-accessible data exports? Think I'm crazy?
    Yes it could work. I don't see the data exports as being a selling point (for lack of a better way to describe it than "selling point"). I'm the crazy one telling you not to be like me.
  12. Kathroman
    @Ord - no worries. I definitely appreciate the feedback (I did ask for it, after all).

    First, I'll explain that I'm not "reinventing the wheel", I'm reiterating it. Second - part of the reason why it's necessary is the same as it's been for everyone else who has done the same. Each of those 3 examples has taken a different approach to how they interpret the data, and they all have their place. Each resource offers both immediate and historic context to their data. Here's a quick overview:

    TUJ
    -Immediate context is applied directly through MV, which is a calculated representation of what's happening with only the most recent scan.
    -Historic context is applied through the use of chart-based comparisons. Here, the comparison is made on a strictly 1:1 level. All points are considered equal insofar as they establish "trend".

    Wowuction
    -Immediate context is also applied directly through MV - again, it's current scan only.
    -Historic/additional context is applied through supplementary metrics such as % Daily Change or Demand, and also through static charts.

    TSM
    -Immediate context is here applied through min buyout.
    -Historical context is uniquely baked right into MV, using the weighted 14 day average. While you're correct in pointing out that it's not a straight 1:1 comparison, part of what I'm saying is that the only factor it considers is recency. It's just not enough context, IMO. If you consider my previous example, and let's say someone is viewing data at 3AM - are you telling me that the 2AM data is MORE valuable than the data from 8PM on the previous day? In some markets, it might be, but in others, it might not.

    As far as the system resources are concerned, I'm really just trying to keep an open mind, because nothing's been field tested yet. At the end of the day, it might be a non-issue, perhaps my available resources will prove capable of handling anything and everything I can dream up. I really can't say because I've never tested them on such a large scale, but I'm trying to be prepared in case they don't, and I'm also trying to think outside the box a little bit in order to have multiple options to work with when that time comes.

    This much I DO know: I don't currently have access to the DB resources required to store Current Data for every WoW Realm, let alone historic. I also know that I'm going to need access to this data because it isn't readily available in a format/context I'm satisfied with anywhere else. To me, this means a non-DB route, and from there, I'm trying to sort out different ways to accomplish this effectively.
  13. Kathroman
    Quote Originally Posted by Sapu94
    The problem, though, is that almost none of those goals on the wishlist are viable without data behind them. I think where you're coming from is that "the data is already out there", but from my perspective, this is only partially true. I've been using the wowuction data for over a year now; I'm familiar with (and have already explained) its limitations. With TUJ, the data is there, but it's not in a format that's viable for what I need. It's bulk data, and it's restricted by time - this isn't something that will translate into multiple, simultaneous WoW-GPS users accessing specific data from different realms, on-demand. I also don't expect @Erorus to change that. TUJ is now being "maintained" instead of actively developed, because it doesn't really NEED anything else, until it does (like Battle Pets, or BMAH, etc.). It's just not a viable option.

    With the TSM data, what you're talking about seems rather similar to the TUJ data (or, I guess also the daily, global wowuction data - I was using something else for WoW-GPS) but without some of the time restrictions. My issue is that the time limits were only PART of the problem. For the functionality that I need, I'll also need far more customized data, as well. Essentially, I'll need my users to be able to (simultaneously) get data from ANY number of individual items, on ANY realm, at ANY time. If there's a resource out there that offers that level of flexibility, I'd be more than willing to consider it, but I don't think there is, so this is why I'm moving in this direction.

    Perhaps I misspoke before, or at least didn't explain myself clearly enough: I believe these Data solutions I'm trying to work out are critical for the future of the WoW-GPS project. I'm also trying to find a way to make them work using limited/cost-effective resources. For me, the API aspect becomes a "2 birds with one stone" situation, and the fact that it would present itself as a valuable, personal experience serves to tip the scales.

    So, to summarize:

    1) I need data.
    2) Building an API to get the data seems viable.
    3) I recognize that building an API to get the data might not be THE most effective method, although it also might be, that's yet to be determined.
    4) Because it's viable, I'd just like to do it, regardless of whether it's the #1 option or not. If it ends up not being viable, this is obviously subject to change.
    Updated May 7th, 2014 at 11:27 AM by Kathroman
  14. Erorus
    Some quick notes:

    What you want your API to do still seems vague to me. Pin down exactly what you want it to return (and what may come later), then you can worry about how you're going to store and return it.

    But it's certainly okay to think API before you think user interface. Plenty of projects do that, and it would help you to condense what your tool does and keep your presentation layer separate from your data layer.

    Parsing json is expensive, hell, parsing strings is expensive. Storing stuff as json/csv then reading it again to perform calculations on it is inefficient. While we're at it, don't count on the filesystem when what you need is a database. Use a proper DB. If you're going to parse battle.net, split and join data, write new json, and never touch the files again, ok fine. But if you later have to read that json frequently, there goes any efficiency.

    If you have json with calculated values (historical averages, for example), then reading+updating+writing all the time will get tiresome. Basically, the upkeep of calculated fields is annoying, I even struggled with it in the database on TUJ in the early days. I've learned to push as much calculation to request time as possible and do minimal pre-calculating at the time of parsing.

    On a similar note, when pre-calculating stuff, consider whether those results will be read before they expire and need to be updated. If you're storing calculations all day but only 10% of it gets read, perhaps it's better to only calculate it when requested.
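    A rough sketch of that request-time approach (the names and TTL are illustrative): nothing is pre-calculated, a stat is computed the first time somebody asks for it, and the result is cached with an expiry, so the values nobody reads cost nothing:

    import json
    import os
    import time

    CACHE_TTL = 3600  # seconds; tune this to the scan interval

    def cached_stat(item_path, cache_path, compute):
        # Serve the cached value while it's still fresh...
        if os.path.exists(cache_path) and time.time() - os.path.getmtime(cache_path) < CACHE_TTL:
            with open(cache_path) as f:
                return json.load(f)
        value = compute(item_path)  # ...otherwise calculate on demand
        with open(cache_path, "w") as f:
            json.dump(value, f)     # and cache it for subsequent requests
        return value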

    TUJ's database is about 90GB now, and that includes all auctions for the US and English-speaking EU for the past 2 weeks, as well as daily average prices for every item on every US and EU realm every day since May 2012. In json that'd be OMG huge, and I would never calculate the daily averages quick enough.

    Ultimately you'll need to balance 4 things:
    1) Time to process new data
    2) Space to store the data
    3) Space to store cached/calculated values
    4) Time to service a request
    1 must be low or you will fall behind, 2 and 3 should be as low as possible for efficiency, and to minimize 4, you'll increase 1 and 3.
  15. Kathroman
    Quote Originally Posted by Erorus
    I appreciate the tips.

    1) Yes, it's definitely vague, but I think that's necessary at this point. There's still quite a bit of framework to implement before I can even properly determine what's possible and what isn't, so I'm not sure it makes much sense to go too far if some of the technical hurdles can't be cleared.
    2) I'd absolutely prefer a proper DB. If anyone out there has approx. 200GB of DB space on a server they're not using, I'd love to chat
    3) That's exactly what I had in mind - parse the raw Blizz JSON, write new files and never touch them again. They're far too cumbersome to rely on for any on-demand information, you're right about that.
    4) I'm definitely leaning towards the request-driven calculations. There's just too much data to justify having it all pre-calculated. I think 10% would be a generous estimate, considering EVERY item on EVERY realm.
  16. Kathroman
    One thing I did forget to point out - having JSON-based data storage also lends itself to the possibility of pulling the files directly into the script and running the calculations in the users' browsers, reducing the server load even further.

    Again, I do recognize it has its own set of limitations, but I still think it's worth noting, either way.
  17. Sapu94
    Quote Originally Posted by Kathroman
    2) I'd absolutely prefer a proper DB. If anyone out there has approx. 200GB of DB space on a server they're not using, I'd love to chat
    I know of a VPS service where you can get 200GB of hard drive space (SSD-cached even) for ~$20 a month. We used this for a while for our new back-end stuff until we realized that there wasn't nearly enough processing power available in a VPS/shared setup for what we wanted to do, so we moved to a dedicated box.
  18. Kathroman
    Quote Originally Posted by Sapu94
    That's a decent price - I'll definitely keep that in mind.

    Unfortunately, until the project is generating at least $20/mo on its own, I can't really justify the extra expense. If things don't work out with the existing servers, though, then I'll definitely have some tough decisions to make

    I'll come bug you later if I end up needing the name of the VPS, though.
  19. Erorus
    Keep an eye on lowendbox.com
  20. Kathroman
    Quote Originally Posted by Erorus
    Bookmarked!

    Much appreciated