League of Legends data scraping the hard and tedious way for fun (maknee.github.io)
158 points by maknee 75 days ago | 38 comments



I worked on something like this back in 2016; I'm not sure how much things have changed since then. I used dynamic binary instrumentation to deal with the field encryption. Basically, manually map the executable into executable memory on Linux (as if it were a shared library). Begin execution at the packet switch, but before executing a block of code, disassemble it until a conditional branch, and modify it according to some heuristics to remove the at-rest encryption. The original block of code wasn't executed in place, since the modified version might not fit into the original block size, so new blocks were mmap'd for this. malloc/free were hooked and replaced with wrappers over glibc's malloc/free, but with bookkeeping so that the memory could be freed after execution of the packet switch. atexit was just replaced with a no-op.

That all just dealt with the encryption, but there were also randomized packet IDs and field orders. Those problems were dealt with using manually written heuristics based on the packet IDs that were actually interesting. Packet handlers with references to text strings (even hashed ones), etc. were a gold mine here because they made static detection of packet IDs simple. If there was no text string, many of the offsets could be auto-detected just by parsing a replay and running small snippets to determine which offsets actually "made sense" for the field that was being searched for. For example, if there was a gold gain packet, the amount of gold gained shouldn't be outside an expected range, or else the offset likely doesn't correspond to that field.

Once all of the high-volume code blocks had been instrumented, replays could be parsed in 2-3 seconds (along with generating the desired data aggregations). This is all from memory, so there could be a minor mistake or two.
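The offset auto-detection idea above can be sketched roughly as follows. This is only an illustration under assumptions that are not in the comment: the field is taken to be a little-endian float32, the payloads are already decrypted, and the function name, sample bytes, and gold range are all made up.

```python
import struct

def plausible_offsets(payloads, lo, hi, fmt="<f"):
    """Return every offset where ALL sample payloads decode to a value in [lo, hi].

    Assumption: the field is a fixed-size scalar (float32 by default) at the
    same offset in every payload of this packet type.
    """
    size = struct.calcsize(fmt)
    shortest = min(len(p) for p in payloads)
    hits = []
    for off in range(shortest - size + 1):
        values = [struct.unpack_from(fmt, p, off)[0] for p in payloads]
        # NaN fails both comparisons, so garbage floats are rejected too.
        if all(lo <= v <= hi for v in values):
            hits.append(off)
    return hits

# Hypothetical "gold gain" payloads: 4 junk bytes, a float32 amount, 2 junk bytes.
samples = [
    bytes([0xFF, 0xEE, 0xDD, 0xCC]) + struct.pack("<f", gold) + bytes([0xAB, 0xCD])
    for gold in (21.0, 300.0, 450.0)
]
print(plausible_offsets(samples, 1.0, 1000.0))  # prints [4]
```

More sample packets and a tighter expected range shrink the candidate list; with only a few samples, several offsets may survive and need a second heuristic.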


I've always heard that "security through obscurity" is discouraged because, well, there's no stopping someone from digging in and figuring it out. However in this case it seems somewhat successful in that the author was not able to decrypt the packets directly.

The article says that "while it might seem feasible to reimplement these functions in Python without running the client, several factors make this approach impractical" and then lists some reasons like the lookup tables changing, chunk layouts getting shuffled, etc.

Is that all it takes to thwart decrypting the packets? Even though, presumably, you have access to all those lookup tables and chunk layouts somewhere in the client? Is it just too much effort to piece together how it works? I'd be curious to hear more specifics on how exactly Riot was able to make reverse engineering this so impractical.

Great article!


>I've always heard that "security through obscurity" is discouraged because, well, there's no stopping someone from digging in and figuring it out.

This should tell you enough about the person.

Obscurity very often increases security. The question is - by how much?

It is fine to add obscurity AS ANOTHER layer.

It is the same story as the "open source software is safer"

No. If you are open source & you have significant community then it is, otherwise closed source is harder to attack.


There are hundreds or thousands of generated one-way decryption schemas for fields. However, it's not impossible to generate the decryption in another language with some effort.

Example:

A packet could be decrypted like this (the actual decryption takes more steps than this)

  field1 = LOOKUPTABLE(XOR(ADD_CONST(...
  field2 = ADD_CONST(XOR(LOOKUPTABLE(...
  field3 = ADD_CONST(XOR(SUB_CONST(LOOKUPTABLE(...
  ...

We observe that each field's decryption is composed of ADD_CONST, XOR, SUB_CONST, and LOOKUPTABLE operations, and that the lookup tables in the client are ~256 bytes long.

We could extract these operations and generate a really long script in Python.
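A minimal sketch of what one entry in such a generated script might look like. Only the operation names come from the example above; the constants, the table contents, the byte-wise masking, and the operation order are invented for illustration and do not reflect the real client.

```python
import random
from functools import reduce

MASK = 0xFF  # assumption: every operation works on a single byte

def add_const(c):
    return lambda b: (b + c) & MASK

def sub_const(c):
    return lambda b: (b - c) & MASK

def xor(c):
    return lambda b: b ^ c

def lookup(table):
    return lambda b: table[b]

def chain(*ops):
    """Apply ops left to right: chain(f, g)(b) == g(f(b))."""
    return lambda b: reduce(lambda acc, op: op(acc), ops, b)

# Made-up 256-entry substitution table (a permutation of 0..255) and its inverse.
rng = random.Random(0)
TABLE = list(range(256))
rng.shuffle(TABLE)
INVERSE = [0] * 256
for i, v in enumerate(TABLE):
    INVERSE[v] = i

# field1 = LOOKUPTABLE(XOR(ADD_CONST(byte))): innermost operation runs first.
decrypt_field1 = chain(add_const(0x13), xor(0x5A), lookup(TABLE))
# Matching encryption, used here only to check the round trip.
encrypt_field1 = chain(lookup(INVERSE), xor(0x5A), sub_const(0x13))

def decrypt_bytes(data, decrypt):
    """Decrypt a byte string one byte at a time with the given field schema."""
    return bytes(decrypt(b) for b in data)
```

A generated script would contain one `chain(...)` line per field, with the constants and tables dumped from the client for that specific patch.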

Why didn't I approach it this way?

1) It's really fragile. League is an actively updated game and the decryption mechanism may change in the future. If the decryption adds another operation like MUL_CONST or DIV_CONST, I would need to account for that on my end. This is unlike the reverse engineering efforts for dead games/servers where the packets do not change.

2) I don't need to know how the decryption mechanism works. Building a game server would require understanding the packet decryption, but I only need to observe game state.

As for understanding how it works, I have not put in enough time/effort to give an answer. :)


> League is an actively updated game and the decryption mechanism may change in the future

Wouldn’t this render all previous replay files unusable? Is that likely to happen?


You can't watch a replay from the league client that doesn't match the same version.


They could also just support the past methods.


what a blast to run into this subject on hn ..

apologies, english is not my native ..

dug into LoL more than a decade ago with a few mates to back an api/bot/site, parsing the keyframe and chunk formats within a week of spectator launch. we used it to automate timer callouts for jungle camps through fog of war (the observer delay was less than the respawn time) and to implement so-called « auto-shoutcasting » for matches, when we were maybe 11 years old :)

there are a number of difficulties these days (ive not played in years but work in the industry and do not touch these due to legal risk particularly REing competitor code)

from kernel anticheat being a requirement and packman before that - this article was written during covid so predates vanguard but contains packman -

legality - RE is « forbidden » and bannable, so you do not want to lose an account in which you have spent tens of thousands of hours or more; breaking authentication and DRM flows [DRM, auth handshake, protections] is illegal in USA -

entire obfuscation format and flow changes with every patch; you have to repeat the work every hotfix + patch (and it isn't just a new xor key) - the re implementations would probably need to be redone every week or two - annoying - this is likely one of the most tedious bits


> legality - RE is « forbidden » and bannable, so you do not want to lose an account in which you have spent tens of thousands of hours or more; breaking authentication and DRM flows [DRM, auth handshake, protections] is illegal in USA

You are confused. Reverse engineering is perfectly legal in USA, but that of course is irrelevant when it comes to losing your account (which isn't yours to begin with).


I would have assumed that the changes make it too impractical to maintain.

Semi-related but the game Vindictus/Mabinogi Heroes (a Source engine MMO) changed the game archive format multiple times (and probably continues to change it every so often) because people would eventually reverse-engineer the format, dump the files, then use them in Garry's Mod or the like.


This is really something cool, and it is exactly what I was looking for. To give a context, I worked on some data science-inspired studies [1] about LoL, and the future research direction is to provide a formal modeling for the games and analyze them through it. While I had a little success by getting aggregated data from websites such as uol.gg, the granularity is not fine enough to do very interesting analysis.

[1] https://doi.org/10.1016/j.ipm.2023.103516


The World of Warships community has gone through similar steps, but the encryption is much more straightforward. Some of the packets are pickled Python, some are just binary blobs, so there are some undocumented packets but for the most part people have done a decent job of figuring it out and building tooling around it such as the minimap renderer: https://github.com/WoWs-Builder-Team/minimap_renderer
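When a stream mixes pickled payloads with raw binary blobs, one cautious way to tell them apart is to attempt an unpickle that refuses to import anything and fall back to raw bytes on failure. This is a generic sketch and not how the World of Warships tooling is known to work; `NoGlobalsUnpickler` and `decode_payload` are hypothetical names. The restricted-unpickler pattern itself, overriding `find_class`, is documented in the Python `pickle` module.

```python
import io
import pickle

class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import any class or function."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")

def decode_payload(data: bytes):
    """Return ('pickle', obj) for a plain-data pickle, else ('blob', data)."""
    try:
        obj = NoGlobalsUnpickler(io.BytesIO(data)).load()
        return ("pickle", obj)
    except Exception:
        # Not a pickle, truncated, or it tried to reference a global.
        return ("blob", data)

print(decode_payload(pickle.dumps({"hp": 100}))[0])  # prints pickle
print(decode_payload(b"\x00\x01\x02\x03")[0])        # prints blob
```

This is a heuristic: a short binary blob can by coincidence be a valid pickle, so real tooling would more likely key off the packet type than off the bytes alone.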

There’s an odd unspoken and somewhat understood agreement between the developer (Wargaming) and community though: the community actively reverse engineers the game to document the packets and WG kind of looks the other way (except when they recently threatened me with a perma ban :) — they even use the tooling the community creates for official tournaments.

In this article the author mentions Riot partnering with external companies to provide more rich data set and analytics. Do they use these tools/data sets for tournaments as well? Is it known at all how these partnerships are structured?


Glad to see another community working on similar things!

I do not know how RIOT partners with external companies, so I do not know any analysis tools or datasets besides what is publicly available :(

At least, RIOT offers special endpoints/overlays for companies ~ [1][2].

[1] https://blitz.gg/overlays/lol [2] https://www.overwolf.com/browse-by-game/league-of-legends


> League of Legends runs on a custom game engine developed in 2009.

Developed by Sergey Titov (same engine that powers Big Rigs).


Big Rigs: Over the Road Racing?


Yes, Angry Video Game Nerd made a very funny video about it. The other game I know of that runs on the same engine is WarZ.


I'm not very well versed in RE, but I know that competitive games like this spend a lot of effort in preventing you from attaching debuggers, hooking and decompilation.

Bypassing this is not mentioned at all in the article. Is this because they're trivial to bypass for experienced people, or because they want to hide their method from the dev?


I did something similar with a friend for some time for another game.

As it went, our data was used to prove things to the developer that they would have loved to keep hush-hush, which led to a cat-and-mouse game with the data and their open and... not so open APIs. In the end, we stopped playing the game and stopped our efforts at it. Fun times.


Getting data by directly processing the packets instead of using the (buggy, slow) replay system is a great idea. There's a lot of interesting data in the middle of LoL gamestate that is missing in summary overviews that only consider the final state of the game.


One of the cool things about dota is that opendota and stratz provide a lot of data because steam is relatively open.

it is how i wrote a blog post on generating builds for heroes before dota plus even had the feature!


Where/how are images like this made? They're cool. Technical and communicative, but with a relaxed and casual look and feel.

https://maknee.github.io/assets/images/posts/2024-11-02/leag...


my guess would be a tool like excalidraw or tldraw


I remember doing this 10+ years ago now for a site called probuilds. I left lol shortly after this. Cool to see that the packets haven’t changed much. (Based on my memory)

Shortly after I released this for TSM, Riot came out with the API.


I've been working on something similar [1], but I took a different approach: I statically extract all decryption stubs using an IDA script I wrote, then emulate them using Unicorn. I'm also interested in your implementation details. Do you have your code on GitHub or somewhere else?

[1] https://github.com/m0w0kuma/ROFL


That's pretty cool! It's quite similar to my tool in many ways. Parsing the file, setting up the packet context and using unicorn :)

The repo isn't on github. I might release it later, but I would want it to be in a better shape if I were to.


A tip:

  @media (prefers-color-scheme: dark) {
    img[src*="svg"], img[src*="png"] {
      filter: invert(1) hue-rotate(180deg);
    }
  }


The diagrams are not visible in dark mode.


  document.querySelectorAll('img').forEach(img => img.style.background = 'white');
As a quick hack for anyone else that has the problem (paste into your browser console).


Oops, I didn't realize that the images are not visible in dark mode. I'll fix it. Thanks for pointing that out!


I see comments like this a lot actually and I'm curious, if the client is manipulating the intended style and layout of the site, do you really think it's the responsibility of the website owner ?

Otherwise I'm confused why you mention it.


This isn't the case of a browser plugin modifying the styles. The blog framework or whatever detects what your browser/system preference is and respects it. So if you've got your browser/os set to "dark mode" the page renders in "dark mode". Except the author used transparent images with dark lines, so they are invisible.

I think it's fair enough to complain about.


The site automatically displays in dark mode if the browser says it’s using dark mode.

So this isn’t something the user is doing to manipulate the style and layout: their browser is saying “hey, fyi, this user’s local system biases to dark mode” and the site is choosing to respond by styling in a way that breaks diagram visibility.


In this case yes because the website itself has a dark-mode toggle in the top right corner, and in its dark mode, the images are not visible.


Ahhh I missed that! That's completely fair then


The blog has a toggle for darkmode and some of their images are black text with a transparent background. When darkmode is toggled, the text is effectively invisible, so in this case it seems to be an oversight of the blog.


This site has a theme picker to toggle between light and dark modes.


Really cool project! I am not sure if this is only me, but your dark theme is hiding the illustrations fyi.


GTFO hackernews, we only play Dota2 here.



