Data Extract
The extract module is responsible for getting data from the source. It contains the reader classes and their related components.
File Reader
The FileReader class reads files from the source and returns a Spark DataFrame. There are two ways to read:
One dataset
To read a single dataset, follow this example:
from pyiris.ingestion.extract import FileReader
from pyiris.infrastructure import Spark
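# Spark helper provided by pyiris
pyiris_spark = Spark()
# Describe where the dataset lives and in which format it is stored
file_reader_config = FileReader(table_id='atos_inseguros', mount_name='consumezone', country='Brazil', path='Seguranca/AtosInseguros', format='parquet')
# consume() reads the files and returns a Spark DataFrame
dataset = file_reader_config.consume(spark=pyiris_spark)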
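The five keyword arguments in the example can also be kept in a plain dictionary and checked before a reader is built. The following is a minimal sketch, not part of pyiris: the helper name check_reader_params is hypothetical, and treating all five arguments as required is an assumption based only on the example above.

```python
# Hypothetical helper, not part of pyiris.
# Assumption: the five keyword arguments used in the example are all required.
EXPECTED_KEYS = ("table_id", "mount_name", "country", "path", "format")


def check_reader_params(params):
    """Return params unchanged, or raise if an expected key is missing or empty."""
    missing = [key for key in EXPECTED_KEYS if not params.get(key)]
    if missing:
        raise ValueError("missing reader parameters: " + ", ".join(missing))
    return params


reader_params = check_reader_params({
    "table_id": "atos_inseguros",
    "mount_name": "consumezone",
    "country": "Brazil",
    "path": "Seguranca/AtosInseguros",
    "format": "parquet",
})

# With pyiris installed, the checked dictionary can be unpacked into the reader:
# file_reader_config = FileReader(**reader_params)
```

A check like this reports a forgotten argument before any Spark work starts.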
Reader example usage - brewdat
When you read or write with brewdat as the mount_name in the development environment, you are connected to the Iris Blob (analyticsplatformblob), using the container “brewdat”. In the production environment, you are connected to the Brewdat Blob (brewdatblobextsagbdev), using the container “iris-plz”.
from pyiris.ingestion.extract import FileReader
from pyiris.infrastructure import Spark
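# Spark helper provided by pyiris
pyiris_spark = Spark()
# Same reader as before, but pointing at the brewdat mount
file_reader_config = FileReader(table_id='atos_inseguros', mount_name='brewdat', country='Brazil', path='Seguranca/AtosInseguros', format='parquet')
# consume() reads the files and returns a Spark DataFrame
dataset = file_reader_config.consume(spark=pyiris_spark)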
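The environment-dependent behaviour of the brewdat mount can be summarised as a small lookup table. This is only an illustrative sketch that restates the mapping described above. The names BREWDAT_TARGETS and brewdat_target and the environment keys "development" and "production" are hypothetical. How pyiris itself detects the environment is not covered here.

```python
# Hypothetical lookup, not part of pyiris: it only restates the documented mapping.
BREWDAT_TARGETS = {
    "development": {"storage_account": "analyticsplatformblob", "container": "brewdat"},
    "production": {"storage_account": "brewdatblobextsagbdev", "container": "iris-plz"},
}


def brewdat_target(environment):
    """Return the storage account and container used when mount_name='brewdat'."""
    try:
        return BREWDAT_TARGETS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None


target = brewdat_target("development")
print(target["storage_account"], target["container"])
```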
More than one dataset
To read more than one dataset, use the class pyiris.ingestion.extract.extract_service.ExtractService. It takes a list of readers and a SQL query. In the example below, the query refers to each dataset by its reader's table_id:
from pyiris.ingestion.extract import FileReader, ExtractService
from pyiris.infrastructure import Spark
pyiris_spark = Spark()
readers = [
    FileReader(table_id='atos_inseguros', mount_name='consumezone', country='Brazil', path='Seguranca/AtosInseguros', format='parquet'),
    FileReader(table_id='condicao_insegura', mount_name='consumezone', country='Brazil', path='Seguranca/CondicaoInsegura', format='parquet')
]
query = """
    SELECT
        *
    FROM
        atos_inseguros
    INNER JOIN
        condicao_insegura
    ON
        atos_inseguros.ID == condicao_insegura.ID
"""
extract_service = ExtractService(readers=readers, query=query)
dataset = extract_service.handler(spark=pyiris_spark)
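The query in the example refers to the two datasets by the same names as the readers' table_id values. How ExtractService registers those names internally is not shown here. The sketch below only illustrates the idea of joining named datasets with one query, using Python's built-in sqlite3 module as a stand-in for Spark. The sample rows and the columns other than ID are invented for illustration.

```python
import sqlite3

# Stand-in for Spark: two tiny in-memory tables named after the table_id values.
# Columns other than ID and all row values are invented for this sketch.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE atos_inseguros (ID INTEGER, ato TEXT)")
connection.execute("CREATE TABLE condicao_insegura (ID INTEGER, condicao TEXT)")
connection.executemany("INSERT INTO atos_inseguros VALUES (?, ?)", [(1, "a"), (2, "b")])
connection.executemany("INSERT INTO condicao_insegura VALUES (?, ?)", [(2, "x"), (3, "y")])

# Same shape as the query passed to ExtractService in the example.
query = """
    SELECT *
    FROM atos_inseguros
    INNER JOIN condicao_insegura
    ON atos_inseguros.ID == condicao_insegura.ID
"""

rows = connection.execute(query).fetchall()
connection.close()
print(rows)  # [(2, 'b', 2, 'x')]: only ID 2 exists in both tables
```

Because the query uses SELECT * on a join, the ID column appears once per table in the result. Selecting columns explicitly avoids the duplicate.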
For more information, see the docstrings in the Pyiris modules.