Building a Serverless Reddit Wordcloud Generator

Since last year, I've been working to familiarize myself with AWS cloud services in a data engineering context. As a capstone project, I built a Reddit wordcloud generator and recently completed its core functionality. It takes all the comments from a given Reddit post and processes them into a wordcloud image. The wordcloud provides a quick, intuitive way to explore and analyze the most prominent words in a body of text, useful for gaining insight into its main themes, topics, and keywords.

Project Insights

Serverless Architecture

The project adopts a serverless architecture using AWS Lambda functions, AWS Glue, S3, Athena, and EventBridge.

Reddit API Integration

The project integrates with the Reddit API to retrieve comments from a Reddit thread. This involves making calls to two different endpoints (`/comments` and `/api/morechildren`) to gather post metadata and comments.
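The two calls can be sketched as follows. The base URL and endpoint paths match Reddit's API, but the helper names and the sample response shape are illustrative, not the project's exact code.

```python
# Sketch of the two Reddit API calls used to fetch a thread's comments.
# Endpoint paths are real Reddit API routes; helper names are hypothetical.

OAUTH_BASE = "https://oauth.reddit.com"

def comments_url(post_id: str) -> str:
    # First call: post metadata plus the initial comment tree.
    return f"{OAUTH_BASE}/comments/{post_id}"

def morechildren_params(link_id: str, children: list[str]) -> dict:
    # Second call: /api/morechildren resolves comment IDs that the first
    # response collapsed into a "more" stub.
    return {
        "link_id": f"t3_{link_id}",
        "children": ",".join(children),
        "api_type": "json",
    }

def extract_more_ids(comment_listing: dict) -> list[str]:
    # Walk a comment listing and collect the IDs hidden behind "more" stubs.
    ids = []
    for child in comment_listing["data"]["children"]:
        if child["kind"] == "more":
            ids.extend(child["data"]["children"])
    return ids
```

In practice each request would also carry an OAuth bearer token and a descriptive `User-Agent` header, as Reddit's API rules require.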

Parallel Processing and Scalability

The downloading of comments is split into two Lambda functions (`reddit_get_comments` and `reddit_more_comments`), allowing for parallel processing and potentially faster retrieval of large datasets. This approach can help in handling varying loads and improving overall system scalability.

Asynchronous Processing

Asynchronous processing is used for handling additional comments. The project employs SQS (Simple Queue Service) to queue messages for additional comments, and the `reddit_more_comments` Lambda function processes these messages. This asynchronous approach helps decouple components and ensures efficient resource utilization.
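The fan-out step can be sketched as a pure function that splits outstanding comment IDs into SQS message bodies. The batch size of 100 and the message shape are assumptions for illustration; each body would then be sent with boto3's `sqs.send_message`.

```python
import json

BATCH_SIZE = 100  # illustrative per-message batch size

def build_sqs_messages(link_id: str, more_ids: list[str]) -> list[str]:
    # Split the outstanding "more" comment IDs into SQS message bodies so
    # that parallel reddit_more_comments invocations can resolve them.
    messages = []
    for i in range(0, len(more_ids), BATCH_SIZE):
        batch = more_ids[i:i + BATCH_SIZE]
        messages.append(json.dumps({"link_id": link_id, "children": batch}))
    return messages
```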

Token Caching for API Calls

The project uses the `serverless-aws-api-token-cache` library to manage and cache OAuth tokens for making authenticated calls to the Reddit API. This reduces the frequency of token retrieval, since a cached token is reused across invocations instead of requesting a new one each time.
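The idea behind token caching can be illustrated with a minimal in-memory cache. This is a generic sketch of the pattern, not the library's actual API: reuse a token until shortly before it expires, then fetch a fresh one.

```python
import time

class TokenCache:
    # Minimal sketch of token caching: reuse an OAuth token until shortly
    # before expiry instead of requesting a new one on every API call.
    def __init__(self, fetch_token, margin_seconds: int = 60):
        self._fetch = fetch_token      # callable returning (token, ttl_seconds)
        self._margin = margin_seconds  # refresh a bit before actual expiry
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        if self._token is None or time.time() >= self._expires_at - self._margin:
            token, ttl = self._fetch()
            self._token = token
            self._expires_at = time.time() + ttl
        return self._token
```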

ETL Workflow with AWS Glue

The Extract, Transform, Load (ETL) workflow is implemented using AWS Glue. The workflow involves transforming raw JSON data into tabular format (`reddit_comments` Glue job) and further processing to generate word frequencies (`reddit_comments_wordcloud` Glue job).

Data Cleansing and Analysis

The `reddit_comments_wordcloud` Glue job performs data cleansing by removing punctuation, whitespace, URLs, callouts, emotes, etc. It also filters for word length and removes stop words. These steps are crucial for generating meaningful word frequencies for the wordcloud.
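The cleansing steps above can be sketched with a few regular expressions. The specific patterns, the stop-word list, and the minimum word length are illustrative choices, not the job's exact rules.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "is", "it", "to", "of"}  # illustrative subset
MIN_LENGTH = 3  # assumed minimum word length

def word_frequencies(comments: list[str]) -> Counter:
    # Mirrors the cleansing described above: strip URLs, callouts, and
    # punctuation, then filter short words and stop words before counting.
    counts = Counter()
    for text in comments:
        text = text.lower()
        text = re.sub(r"https?://\S+", " ", text)      # URLs
        text = re.sub(r"/?u/\w+|/?r/\w+", " ", text)   # user/subreddit callouts
        text = re.sub(r"[^a-z\s]", " ", text)          # punctuation, digits, emotes
        for word in text.split():
            if len(word) >= MIN_LENGTH and word not in STOP_WORDS:
                counts[word] += 1
    return counts
```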

Event-Driven Architecture

The project uses AWS EventBridge to drive this coordination. Events, such as the successful completion of the `reddit_comments_wordcloud` Glue job, trigger subsequent Lambda functions for generating wordcloud images.

Athena for Querying Data

Athena is used to query the word frequency table for generating the wordcloud image. This serverless query service allows for ad-hoc querying of data stored in S3 without the need for a dedicated database.
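The Athena query itself can be sketched as a simple SQL builder. The database, table, and column names here are assumptions for illustration; the resulting string would be passed to boto3's `athena.start_query_execution`, with a `ResultConfiguration` pointing at an S3 output location.

```python
def top_words_query(database: str, table: str, limit: int = 200) -> str:
    # Illustrative Athena SQL for fetching the most frequent words from the
    # word frequency table; identifiers are hypothetical.
    return (
        f'SELECT word, frequency FROM "{database}"."{table}" '
        f"ORDER BY frequency DESC LIMIT {limit}"
    )
```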

Image Generation and Comparison

The `reddit_output_word_frequency_csv` Lambda function queries Athena for word frequencies, and the `reddit_generatewordcloud` Lambda function generates an image only when the Athena output differs from the existing wordcloud images stored in S3. This comparison minimizes unnecessary image generation and storage.

Monitoring and Completion Handling

The project uses EventBridge rules to monitor the completion of various stages, such as download completion, Glue job success, or PutObject events in S3. This enables seamless coordination between different components and ensures that subsequent processes are triggered appropriately.
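As an example, the rule for the Glue job's success could use an event pattern like the one below (shown as a Python dict, with the field names following AWS's Glue Job State Change event schema). The tiny matcher covers only the exact-value-list subset of EventBridge pattern semantics used here.

```python
# Illustrative EventBridge event pattern matching successful completion of
# the reddit_comments_wordcloud Glue job.
GLUE_SUCCESS_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {
        "jobName": ["reddit_comments_wordcloud"],
        "state": ["SUCCEEDED"],
    },
}

def matches(pattern: dict, event: dict) -> bool:
    # Minimal matcher: every pattern key must be present in the event, and
    # leaf values must appear in the pattern's list of allowed values.
    for key, expected in pattern.items():
        value = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(value, dict) or not matches(expected, value):
                return False
        elif value not in expected:
            return False
    return True
```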

Infrastructure as Code with CloudFormation

The project leverages AWS CloudFormation and SAM templates for defining and deploying infrastructure as code. CloudFormation templates are used to specify the configuration of Lambda functions, Glue workflows, EventBridge rules, and other AWS resources. This ensures consistent and repeatable deployments, making it easier to manage the entire infrastructure lifecycle.