Social Profiles Real-Time Platform
Build high-precision search algorithms to find and retrieve information about individuals on the Web. The search involves scanning social networks and other open public sources. The acquired information gives insight into personal profiles and is used to update the customer’s data with information from independent sources. The solution should provide consistent performance and scale automatically as traffic to the application increases.
The customer needs a solution for scalable data retrieval and storage for publicly accessible social profiles with the following requirements:
- Ability to store a high volume of social profiles with a variable number of attributes
- Ability to keep track of profile modifications in the form of independent documents
- Ability to continuously (24/7) process incoming data streams without degrading system read performance
- Ability to update documents to perform data correction
The system had to be designed to handle billions of profiles, with multiple new documents for each user profile added on a monthly basis.
Historical profile values are used to perform multiple calculations required by the business. The solution had to be relatively inexpensive to support while remaining highly available and scalable enough to handle a high volume of varied workload types.
Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. It’s a fully managed, multi-region, multi-master, durable database with built-in security, backup and restore, and in-memory caching for internet-scale applications. It also integrates with other AWS services.
Based on these requirements and its extensive AWS experience, Pi5 proposed Amazon DynamoDB, a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability, together with Amazon ECS, a highly scalable, high-performance container orchestration service.
To work with the storage, we developed Java-based modules that encapsulate the post-aggregation business logic. These modules were deployed to the ECS cluster and exposed an API endpoint. The API was built with Spring Boot and deployed as a Docker container. It provides two main methods to query data:
- Raw query: return all documents for a social profile by primary partition key.
- Merge query: retrieve all documents for a social profile by primary partition key, merge them into a single document using attribute-specific logic (take the latest value or merge all values into an array), and return the result.
In both methods, the API lets the caller specify a maximum query depth to avoid processing a large amount of data; for example, a caller can request only documents that are less than one year old.
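The merge query and its depth limit can be illustrated with a minimal Java sketch. It is not the production module: the class `ProfileMerger`, the `ProfileDocument` type, the `collectAll` rule set and the `capturedAt` timestamp are all hypothetical names introduced here, and the sketch assumes the documents for one profile have already been fetched by partition key.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Illustrative sketch only; all names here are assumptions, not the real API. */
final class ProfileMerger {

    /** One stored document: a capture time plus a variable set of attributes. */
    static final class ProfileDocument {
        private final Instant capturedAt;
        private final Map<String, Object> attributes;

        ProfileDocument(Instant capturedAt, Map<String, Object> attributes) {
            this.capturedAt = capturedAt;
            this.attributes = attributes;
        }

        Instant capturedAt() { return capturedAt; }
        Map<String, Object> attributes() { return attributes; }
    }

    /**
     * Merges the documents of one profile into a single attribute map.
     * Attributes named in collectAll accumulate every distinct value into a list;
     * every other attribute keeps the value from the most recent document.
     * Documents older than maxAge (relative to now) are ignored.
     */
    static Map<String, Object> merge(List<ProfileDocument> documents,
                                     Set<String> collectAll,
                                     Duration maxAge,
                                     Instant now) {
        Instant cutoff = now.minus(maxAge);

        // Apply the depth limit, then order oldest-first so later documents win.
        List<ProfileDocument> inRange = documents.stream()
                .filter(d -> !d.capturedAt().isBefore(cutoff))
                .sorted(Comparator.comparing(ProfileDocument::capturedAt))
                .collect(Collectors.toList());

        Map<String, Object> merged = new LinkedHashMap<>();
        for (ProfileDocument doc : inRange) {
            for (Map.Entry<String, Object> e : doc.attributes().entrySet()) {
                if (collectAll.contains(e.getKey())) {
                    @SuppressWarnings("unchecked")
                    List<Object> values = (List<Object>) merged.computeIfAbsent(
                            e.getKey(), k -> new ArrayList<Object>());
                    if (!values.contains(e.getValue())) {
                        values.add(e.getValue());
                    }
                } else {
                    merged.put(e.getKey(), e.getValue()); // latest value wins
                }
            }
        }
        return merged;
    }
}
```

A caller asking for a one-year depth would pass `Duration.ofDays(365)`. In a real deployment the age filter would more likely be pushed into the database query than applied in memory, which this sketch does not attempt to show.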
To run a highly available and scalable infrastructure, we used Amazon ECS. On a quarterly basis, all data was exported from DynamoDB for in-depth analysis: we used AWS Data Pipeline to load the data into S3, where it was analyzed with Hive/Spark on an EMR cluster. During batch updates for data corrections, we first shuffled the data before writing in order to distribute the write workload evenly.
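The shuffling step can be sketched as follows. Input that arrives sorted by key tends to send consecutive writes to the same partitions, while a random order spreads them out. The class `ShuffledBatcher` and its method are hypothetical names for illustration; the batch size cap of 25 reflects the `BatchWriteItem` limit on requests per call, and the actual write call is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; names are assumptions, not the original code. */
final class ShuffledBatcher {

    /** BatchWriteItem accepts at most 25 put or delete requests per call. */
    static final int MAX_BATCH_SIZE = 25;

    /**
     * Returns the items in random order, split into batches of at most batchSize.
     * The input list is copied, so the caller's list is left untouched.
     */
    static <T> List<List<T>> shuffledBatches(List<T> items, int batchSize, Random random) {
        if (batchSize < 1 || batchSize > MAX_BATCH_SIZE) {
            throw new IllegalArgumentException(
                    "batchSize must be between 1 and " + MAX_BATCH_SIZE);
        }

        // Randomize the order so that neighbouring keys do not land in the same batch.
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, random);

        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < shuffled.size(); start += batchSize) {
            int end = Math.min(start + batchSize, shuffled.size());
            batches.add(new ArrayList<>(shuffled.subList(start, end)));
        }
        return batches;
    }
}
```

Each returned batch would then be handed to whatever write client is in use. Passing a seeded `Random` makes a run reproducible, which can help when a failed bulk update has to be retried.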
The general system architecture is shown below
- DynamoDB was just as scalable as we had hoped with predictable performance
- We ran into the 400 KB size limit per item and designed a document-merging strategy at the data access API level
- The CloudWatch-based auto-scaling feature enables cost-effective throughput management without manual operations
- Shuffling data is required for bulk uploads and updates
pi5.cloud is a global technology consulting company at the forefront of cloud computing. Through collaboration with Amazon Web Services, we help customers embrace a broad spectrum of innovative solutions. From a migration strategy to operational excellence, cloud-native development, and immersive transformation, pi5.cloud is a full spectrum integrator.