If you need a large repository of structured web data for training machine learning models, Webz.io is a good choice. It turns the web into a machine-readable data source, drawing on the open, deep and dark web. The service offers a Grab-and-Go API for integration, Ready-to-Consume Repositories for immediate access to current and historical data, and High-Resolution Structured Data in JSON or XML. It covers many data types and languages, and its tiered pricing suits different uses and projects.
Another option is Hugging Face, an open-source, collaborative machine learning platform. It hosts more than 400,000 models for different tasks and 150,000 applications and demos, and gives access to more than 100,000 public datasets. You can host models, datasets and applications for free, and enterprise features include optimized compute options, SSO and private dataset management. Pricing ranges from a free tier to relatively inexpensive paid compute and inference endpoints.
If your focus is AI applications, Appen provides high-quality, diverse data through an end-to-end platform covering text, images, audio, video and geospatial data. Its customizable workflows and built-in quality control make it a good choice for gathering, curating and fine-tuning data. The platform is used by major brands and can be deployed as SaaS or on-premises, which suits AI projects that need more control over their data.
Finally, Import.io focuses on web scraping. It has a simple interface and an AI-powered crawler that can scrape millions of pages and produce billions of data points. It is well suited to large-scale scraping, particularly in e-commerce and retail, where businesses need to understand market trends and customer behavior. Its tiered pricing accommodates a range of business needs.