System/database design for comments/replies and upvotes at scale

0 votes
2 answers
7154 views
postgresql database-design high-availability hierarchy relations
Recently I started exploring the question "How do you design a DB schema for storing structures similar to Instagram/Facebook/Reddit comments?".

After extensive research, I found a bunch of different answers on SO, SE, Medium articles, etc. However, all of them were pretty basic and always pointed to the Closure table pattern, which I used once back in the day.

I implemented a comments/replies system only once, a few years ago, using PostgreSQL. The product is no longer in production, so I don't know how my solution would scale in a data-intensive environment.

Therefore, I decided to ask a specific question with specific requirements and constraints, so I could probably get a hint from someone who had this experience in production!

Here we go with two different tasks:

**Task 1**

*Requirements:*
 - When I open a post I see only the first level of comments. Specifically, the 50 most-liked comments, ordered ascending by the number of likes.
 - For every top-level comment: if a comment has only one reply - display this reply too.
 - For every top-level comment: if a comment has multiple replies, display only the one which is the most liked.
 - When the user clicks on "more replies": Display replies in descending order by their created_datetime.
 - The max depth is 2: only top-level comments and replies to them can exist. Replies to lower-level comments (depth == 0) should always be displayed near their parents. The only thing that distinguishes them is a mention of the user you reply to, like @ on Instagram.
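
To make the Task 1 requirements concrete, here is a minimal sketch of the two reads they imply. It is only an illustration under several assumptions that are not part of the question: the table `comment`, its columns, the denormalized `like_count` column, and the helper names `top_comments` / `more_replies` are all hypothetical, and SQLite (via Python's `sqlite3`) stands in for PostgreSQL. Because depth is capped at 2, the sketch uses a plain `parent_id` instead of a closure table. It picks the 50 most-liked comments and leaves the final display order to the caller.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comment (
    id         INTEGER PRIMARY KEY,
    post_id    INTEGER NOT NULL,
    parent_id  INTEGER REFERENCES comment(id),  -- NULL for top-level comments
    like_count INTEGER NOT NULL DEFAULT 0,      -- denormalized counter (assumption)
    created_at TEXT NOT NULL
);
-- Hypothetical indexes to support the two reads below.
CREATE INDEX comment_top_idx   ON comment (post_id, parent_id, like_count);
CREATE INDEX comment_reply_idx ON comment (parent_id, created_at);
""")

def top_comments(conn, post_id, limit=50):
    """Most-liked top-level comments, each with its most-liked reply (or None).

    A comment with exactly one reply gets that reply, which covers both
    reply-related requirements with a single subquery.
    """
    return conn.execute("""
        SELECT c.id,
               c.like_count,
               (SELECT r.id
                  FROM comment AS r
                 WHERE r.parent_id = c.id
                 ORDER BY r.like_count DESC, r.id
                 LIMIT 1) AS top_reply_id
          FROM comment AS c
         WHERE c.post_id = ? AND c.parent_id IS NULL
         ORDER BY c.like_count DESC, c.id
         LIMIT ?
    """, (post_id, limit)).fetchall()

def more_replies(conn, parent_id):
    """Reply ids for the "more replies" click, newest first."""
    rows = conn.execute("""
        SELECT id FROM comment
         WHERE parent_id = ?
         ORDER BY created_at DESC, id DESC
    """, (parent_id,))
    return [row[0] for row in rows]

# Tiny made-up data set: two top-level comments and three replies.
conn.executemany(
    "INSERT INTO comment (id, post_id, parent_id, like_count, created_at) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, None, 10, "2022-09-10T10:00"),
        (2, 1, None, 25, "2022-09-10T10:05"),
        (3, 1, 1,    3,  "2022-09-10T10:10"),
        (4, 1, 1,    7,  "2022-09-10T10:20"),
        (5, 1, 2,    0,  "2022-09-10T10:30"),
    ],
)
print(top_comments(conn, 1))  # [(2, 25, 5), (1, 10, 4)]
print(more_replies(conn, 1))  # [4, 3]
```

Whether the per-comment subquery stays cheap on very large posts is exactly the open question here; the sketch only pins down what the queries would have to return.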

*Questions (only related to the design of a relational database with a Closure table):*
 - What problems did you face in **production** with it, and how did you fix them? What would you recommend to people who are just starting with this: what should they spend their time on at the beginning to prevent a cascade of mess in the future?
 - Is there a better pattern for this purpose with an RDBMS nowadays?


Let's imagine the system grows. We are not talking about thousands of requests, but about hundreds of thousands of comments and replies to them. E.g. some celebrity posts a message and all the fans start replying, having conversations, etc. This results in a lot of rows in both the comment and closure tables. Our queries that group by the number of likes start getting much slower on some posts, causing long-running transactions, which cause a ton of mess and probably even downtime.
Again, that's what I imagine could happen if we just use a closure table. But what really happens? I'm curious to hear stories from people who had problems with it in really data-intensive applications.
E.g. we can shard the table somehow, right? Or for really big posts we could cache a lot of stuff, right?
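
One simple form of the "cache a lot of stuff" idea is to keep a like counter on the comment row instead of grouping over a likes table at read time. The sketch below is only an assumption about how that could look: the tables `comment` and `comment_like`, the `like_count` column, and the `like` helper are hypothetical, and SQLite stands in for PostgreSQL (where `INSERT ... ON CONFLICT DO NOTHING` would play the role of `INSERT OR IGNORE`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comment (
    id         INTEGER PRIMARY KEY,
    like_count INTEGER NOT NULL DEFAULT 0   -- cached counter (assumption)
);
CREATE TABLE comment_like (
    comment_id INTEGER NOT NULL REFERENCES comment(id),
    user_id    INTEGER NOT NULL,
    PRIMARY KEY (comment_id, user_id)       -- one like per user per comment
);
""")

def like(conn, comment_id, user_id):
    """Record a like and bump the cached counter in one transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO comment_like (comment_id, user_id) "
            "VALUES (?, ?)",
            (comment_id, user_id),
        )
        if cur.rowcount == 1:  # only count the like if it was new
            conn.execute(
                "UPDATE comment SET like_count = like_count + 1 WHERE id = ?",
                (comment_id,),
            )

def like_count(conn, comment_id):
    """Read the cached counter; no GROUP BY over comment_like is needed."""
    return conn.execute(
        "SELECT like_count FROM comment WHERE id = ?", (comment_id,)
    ).fetchone()[0]

conn.execute("INSERT INTO comment (id) VALUES (1)")
like(conn, 1, 100)
like(conn, 1, 100)  # duplicate like from the same user is ignored
like(conn, 1, 101)
print(like_count(conn, 1))  # 2
```

The trade-off is that every like now updates the comment row, so a very popular comment becomes a write hotspot; whether that is acceptable in production is part of what this question is asking.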

**Task 2**
 - The main difference from the first task: when I open a post I see the 50 most-liked comments, but with all their children. This means I fetch the whole tree for each of these 50 comments. Depth is not limited.
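
For reference, this is roughly what the Task 2 read looks like with a closure table. It is a sketch under assumptions, not a recommendation: the tables `comment` and `comment_closure`, their columns, and the helpers `add_comment` / `top_trees` are hypothetical names, `like_count` is assumed to be a denormalized counter, and SQLite stands in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comment (
    id         INTEGER PRIMARY KEY,
    post_id    INTEGER NOT NULL,
    like_count INTEGER NOT NULL DEFAULT 0
);
-- One row per (ancestor, descendant) pair, including the self pair at depth 0.
CREATE TABLE comment_closure (
    ancestor_id   INTEGER NOT NULL REFERENCES comment(id),
    descendant_id INTEGER NOT NULL REFERENCES comment(id),
    depth         INTEGER NOT NULL,
    PRIMARY KEY (ancestor_id, descendant_id)
);
CREATE INDEX closure_desc_idx ON comment_closure (descendant_id, depth);
""")

def add_comment(conn, comment_id, post_id, parent_id=None, like_count=0):
    """Insert a comment plus its closure rows (self row + one per ancestor)."""
    with conn:
        conn.execute(
            "INSERT INTO comment (id, post_id, like_count) VALUES (?, ?, ?)",
            (comment_id, post_id, like_count),
        )
        conn.execute(
            "INSERT INTO comment_closure VALUES (?, ?, 0)",
            (comment_id, comment_id),
        )
        if parent_id is not None:
            # Every ancestor of the parent (and the parent itself) becomes
            # an ancestor of the new comment, one level deeper.
            conn.execute("""
                INSERT INTO comment_closure (ancestor_id, descendant_id, depth)
                SELECT ancestor_id, ?, depth + 1
                  FROM comment_closure
                 WHERE descendant_id = ?
            """, (comment_id, parent_id))

def top_trees(conn, post_id, limit=50):
    """(root_id, descendant_id, depth) rows for the most-liked root comments."""
    return conn.execute("""
        SELECT t.id, cl.descendant_id, cl.depth
          FROM (SELECT c.id, c.like_count
                  FROM comment AS c
                 WHERE c.post_id = ?
                   AND NOT EXISTS (SELECT 1 FROM comment_closure AS x
                                    WHERE x.descendant_id = c.id
                                      AND x.depth > 0)  -- roots only
                 ORDER BY c.like_count DESC, c.id
                 LIMIT ?) AS t
          JOIN comment_closure AS cl ON cl.ancestor_id = t.id
         ORDER BY t.like_count DESC, t.id, cl.depth, cl.descendant_id
    """, (post_id, limit)).fetchall()

add_comment(conn, 1, post_id=1, like_count=5)
add_comment(conn, 2, post_id=1, like_count=9)
add_comment(conn, 3, post_id=1, parent_id=1)
add_comment(conn, 4, post_id=1, parent_id=3)  # reply to a reply
add_comment(conn, 5, post_id=1, parent_id=2)
print(top_trees(conn, 1))
# [(2, 2, 0), (2, 5, 1), (1, 1, 0), (1, 3, 1), (1, 4, 2)]
```

Note that a comment at depth d adds d + 1 closure rows, so deep threads multiply the row count, and rebuilding the actual parent/child shape still needs each node's depth-1 ancestor row. That growth is the cost the questions below are concerned with.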

*Questions (only related to the design of a relational database with a Closure table):*
- Should we simplify the logic and become less ambitious, going with business requirements similar to the ones in Task 1 (no infinite depth, so comment trees can only grow in width)? I assume that otherwise it is almost impossible to scale such business logic when there are millions or billions of comments.
- If the answer to the first question is no, how does the magic happen then? (I don't believe such product requirements could scale while the infrastructure stays profitable; costs would grow exponentially, IMO.)



**General questions to be answered first:**
 - Is a relational database still a good fit for such a problem nowadays? I don't know much about graph databases, but wouldn't it be optimal to store such hierarchical data there? Probably I just need to explore graph databases more deeply, so please feel free to link related articles. I haven't managed to find good ones in a week, so I would definitely need help finding the right materials :)


**To sum up:**
I understand that my questions may seem pretty vague, but they are also quite complex and require the knowledge of someone who has had this experience. Meanwhile, I am also quite opinionated on some topics (like growth of costs/sharding/caching), and that's why it is even more difficult for me to form an opinion. I want to gather more thoughts, not only mine.

In case you think an extensive answer would take too much of your time - please give me just short answers like yes or no and just link all the resources you think could really help me to build my opinion on this topic. 
Sharing your real production experience of working with such systems would be really helpful and appreciated! 

Thanks!
                        
Asked by IDobrodushniy (1 rep)
Sep 10, 2022, 05:06 PM
Last activity: Mar 28, 2024, 08:24 AM