You have heard about all the amazing new tools for data analytics, business intelligence, and machine learning. But one very important aspect of any data strategy is often overlooked and neglected: data logistics — that is, how to integrate and move your existing data, break down data silos, and make those lofty big data dreams actually happen. Several approaches are possible, so let's examine which one fits your needs.
1. Cron
This popular tool is actually not a good choice in the realm of big data. Cron is a simple scheduling solution that everybody knows, and it works well for straightforward tasks. But data pipelines are far too complex and run into too many operational issues when scaled up. Therefore, we would not recommend this one.
Pros:
- Widely available
Cons:
- No native monitoring / overview
- No native re-schedule / replay functionality when errors occur
- No coding framework: wildly varying scripts
- Hard to control execution environment / installed libraries
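To make the points above concrete, here is what a cron-based pipeline typically boils down to — a hypothetical crontab entry (the script path and log file are placeholders) that runs an extract script every night. Retries, alerting, and monitoring all have to be bolted on by hand; redirecting output to a log file is as far as plain cron gets you.

```shell
# Hypothetical crontab entry: run an extract script every night at 02:30.
# If the script fails, cron will not retry or alert anyone -- at best the
# error ends up in a log file that somebody has to remember to check.
30 2 * * * /opt/pipelines/extract_orders.sh >> /var/log/extract_orders.log 2>&1
```

Multiply this by dozens of interdependent scripts, each written in its own style, and the operational problems listed above start to compound.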
2. Cloud-Specific (e.g. Azure Data Factory)
Large-scale cloud providers like Amazon, Google, and Microsoft have built their own solutions for data integration. These vary between code and no-code approaches, and between purely proprietary and adapted open-source software. An example would be Azure Data Factory. Such tools might be a good fit if you are already fully invested in a specific cloud or software vendor and don't mind lock-in.
Pros:
- Fully optimized for a specific cloud
- Developed by large-scale companies with lots of resources
- Usually good integration with cloud-specific tools
Cons:
- Lock-in effects: it's hard to change your cloud after you have built on top of native solutions
- Modified open-source: open-source tools like Airflow sometimes get integrated into cloud-specific systems (e.g., Google Cloud Composer), but they are usually modified in ways that make your code non-portable
- In some cases, vendor-specific coding languages / syntax
3. No-Code (e.g. Fivetran)
There are many companies offering proprietary solutions with no-code data pipelines. An example would be Fivetran. They are great if you use standardized, well-known APIs like Google Analytics or Shopify and don't have advanced customization requirements.
Pros:
- No or few developers required
- Systematic framework
Cons:
- Hard to customize: you may run into limitations
- Impossible to extend: everything you need has to be supported by the platform
4. Code-Based (e.g. Apache Airflow)
Code-based data integration tools give you the ability to build anything you need within a systematic framework. This means full customization options while still providing monitoring, re-scheduling, and pre-built connectors for common data sources. A system like this is best for companies with more advanced requirements and environments. The most popular data workflow and data engineering system is Apache Airflow, originally released by Airbnb. For the growing field of data engineering, it has been the go-to tool for some time now.
Pros:
- Python-based tasks: any available library can be used
- Pre-built connectors for popular databases, APIs, ...
- Full customization ability
- Large community (and large companies behind it)
- Systematic framework for code structure
- Monitoring and re-scheduling
Cons:
- Hard to operate when self-hosted, with many moving parts
- Hard to scale / no natively integrated scaling solution
- Needs developers / meant for data engineering
5. Fully Managed Open-Source (e.g. Airlaunch)
Using a tool like Apache Airflow as a managed service has one big advantage: all the complexity of operating and scaling the system goes away, which is often the main reason data strategies fail. Too much emphasis on operational issues and not enough on delivering results drains resources and budgets. Companies like ours (Airlaunch) solve this by offering unmodified open-source tools combined with a fully managed, enterprise-ready service. We have also added extra features, such as a solution for running tasks on edge nodes in hybrid cloud environments (IoT, manufacturing, ...).
Pros:
- All the advantages of Apache Airflow
- No operations issues
- No scaling issues
- No security worries
- No lock-in / cloud-agnostic solution
- Much lower operating costs
- Ability to connect on-premises systems with the cloud
Cons:
- Still needs developers (we can help you with that as well :))
Let us help you!
Book a free initial assessment with us to discuss your data integration needs. We look forward to achieving your data analytics goals with you!