To compare every comment on Reddit to every other comment in Reddit’s entire history would require an index.
You think that in Reddit’s 20-year history no one has thought of indexing comments for data science workloads? A cursory glance at their engineering blog indicates they already perform much more computationally demanding tasks on comment data for content filtering.
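To make the "index" part concrete, here is a minimal sketch of how an inverted index avoids comparing every comment against every other one. It is purely illustrative and says nothing about what Reddit actually runs: the `ShingleIndex` class, the 3-word shingles, and the 0.5 Jaccard threshold are all assumptions picked for the example.

```python
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles in a comment (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


class ShingleIndex:
    """Hypothetical inverted index: shingle -> ids of comments containing it."""

    def __init__(self, k=3):
        self.k = k
        self.postings = defaultdict(set)
        self.sizes = {}  # comment id -> number of distinct shingles

    def add(self, comment_id, text):
        sh = shingles(text, self.k)
        self.sizes[comment_id] = len(sh)
        for s in sh:
            self.postings[s].add(comment_id)

    def similar(self, text, threshold=0.5):
        """Return (comment_id, jaccard) pairs at or above the threshold."""
        sh = shingles(text, self.k)
        overlap = defaultdict(int)
        # Only comments sharing at least one shingle are ever touched.
        for s in sh:
            for cid in self.postings.get(s, ()):
                overlap[cid] += 1
        results = []
        for cid, shared in overlap.items():
            union = len(sh) + self.sizes[cid] - shared
            score = shared / union
            if score >= threshold:
                results.append((cid, score))
        return sorted(results, key=lambda r: r[1], reverse=True)


index = ShingleIndex()
index.add("a", "this is the best thing I have seen all week")
index.add("b", "completely unrelated words about database replication lag")
print(index.similar("this is the best thing I have seen all day"))
```

The lookup cost depends on how many comments share a shingle with the query, not on the total number of comments stored, which is the whole point of having an index.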
you need to duplicate all of that data in a separate database and keep it in sync with your main database without affecting performance too much
Analytics workflows are never run on the production database, always on read replicas, which are updated asynchronously from the transaction logs so that they don’t affect production read/write performance.
Programmers just do what they’re told. If the managers don’t care about something, the programmers won’t work on it.
Reddit’s entire monetization strategy is collecting user data and selling it to advertisers. It’s incredibly naive to think that they don’t have a vested interest in identifying organic engagement.
You think that in Reddit’s 20-year history no one has thought of indexing comments for data science workloads?
I’m sure they have, but an index doesn’t have anything to do with the Python library you mentioned.
Analytics workflows are never run on the production database, always on read replicas
Sure, either that or aggregating live streams of data, but either way it doesn’t have anything to do with Elasticsearch.
It’s still totally possible to sync things to Elasticsearch in a way that won’t affect performance on the production servers. I’m just saying it’s not entirely trivial, especially at the scale Reddit operates at, and there’s a cost for the extra servers and storage to consider as well.
It’s hard for us to say if that math works out.
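As a rough illustration of what "syncing without touching production" can look like, here is a minimal sketch of a consumer that tails an append-only change log in batches and applies it to a search index. Everything in it is hypothetical: `Change`, `SearchIndexStub`, and `LogSyncer` are made-up names, the list stands in for a real transaction log or stream, and the stub stands in for a real search cluster rather than any actual Elasticsearch client call.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Change:
    offset: int           # position in the change log
    comment_id: str
    body: Optional[str]   # None means the comment was deleted


@dataclass
class SearchIndexStub:
    """Stand-in for a search cluster; a real one would accept bulk requests."""
    docs: dict = field(default_factory=dict)

    def bulk_apply(self, changes):
        for c in changes:
            if c.body is None:
                self.docs.pop(c.comment_id, None)
            else:
                self.docs[c.comment_id] = c.body


class LogSyncer:
    """Reads the change log, never the production tables themselves."""

    def __init__(self, log, index, batch_size=2):
        self.log = log              # append-only list written by the primary
        self.index = index
        self.batch_size = batch_size
        self.checkpoint = 0         # number of log entries already applied

    def sync_once(self):
        """Apply at most one batch; return how many changes were applied."""
        batch = self.log[self.checkpoint:self.checkpoint + self.batch_size]
        if batch:
            self.index.bulk_apply(batch)
            self.checkpoint += len(batch)
        return len(batch)

    def lag(self):
        """How many log entries the search index is behind."""
        return len(self.log) - self.checkpoint


log = [
    Change(0, "c1", "hello"),
    Change(1, "c2", "world"),
    Change(2, "c1", None),
]
syncer = LogSyncer(log, SearchIndexStub())
while syncer.sync_once():
    pass
print(syncer.index.docs, syncer.lag())
```

The non-trivial parts at scale are exactly the ones this sketch skips: persisting the checkpoint, handling retries and reordering, and paying for the second copy of the data.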
It’s incredibly naive to think that they don’t have a vested interest in identifying organic engagement
You would think, but you could say the same about Facebook and I know from experience that they don’t give a fuck about bots. If anything they actually like the bots because it looks like they have more users.