Data Transform¶
The transform module is responsible for applying transformations to the data.
SQL transformation¶
This module applies a SQL expression to the dataframe and writes the result to the target column. Example:
from pyiris.ingestion.transform import SqlTransformation
sql_transformation = SqlTransformation(name='divide',
                                       description='Unit price division',
                                       to_column="unit_price",
                                       sql_expression="price/quantity")
transformed_dataframe = sql_transformation.transform(dataframe=extracted_dataset)
Hash transformation¶
This module hashes the values of the given input columns. Example:
from pyiris.ingestion.transform import HashTransformation
hash_transformation = HashTransformation(name='Hash CPF',
                                         description='Hash CPF to be according to LGPD',
                                         from_columns=["cpf"])
transformed_dataframe = hash_transformation.transform(dataframe=extracted_dataset)
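To illustrate what hashing a column means in practice, here is a minimal pure-Python sketch. It is not the pyiris implementation: this page does not state which hash algorithm HashTransformation uses or whether it overwrites the source column, so SHA-256 and the in-place overwrite are assumptions, and `hash_value` and the sample row are hypothetical.

```python
import hashlib


def hash_value(value: str) -> str:
    """Return the SHA-256 hex digest of a string (algorithm chosen for illustration only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical sample data: one row with a CPF and a non-sensitive field.
rows = [{"cpf": "123.456.789-09", "name": "Ana"}]

# Overwrite the sensitive column with its digest and keep other fields untouched.
hashed_rows = [{**row, "cpf": hash_value(row["cpf"])} for row in rows]
```

The same input always yields the same digest, so hashed columns can still be used for joins and grouping without exposing the original value.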
Spark Transformation¶
This module lets users define one or more Spark transformations to apply to the dataframe.
They can either define their own UDFs or use the pyspark.sql.functions module. Example:
import pyspark.sql.functions as f
from pyiris.ingestion.transform import SparkTransformation
spark_transformation = SparkTransformation(name="circular_transformations",
                                           description="Circular calculations on salary",
                                           from_column="salary",
                                           functions=[f.cos, f.sin, f.tan])
transformed_dataframe = spark_transformation.transform(dataframe=extracted_dataframe)
Users can also define aggregate calculations on the desired column using one or more transformation
windows, built with the pyiris.ingestion.transform.transform_window.TransformWindow module,
as shown below:
import pyspark.sql.functions as f
from pyiris.ingestion.transform import SparkTransformation
from pyiris.ingestion.transform.transform_window import TransformWindow
range_window = TransformWindow.build_with_range(
    window_name="range_window",
    partition_by="department",
    order_by="user_id",
    upper_bound=4,
    lower_bound=0
)
spark_transformation = SparkTransformation(name="window_transformations",
                                           description="Window calculations on salary",
                                           from_column="salary",
                                           functions=[f.sum, f.min, f.max, f.count, f.avg],
                                           windows=[range_window])
transformed_dataframe = spark_transformation.transform(dataframe=extracted_dataframe)
Make sure you check the documentation for the TransformWindow module, so you’ll know exactly how to properly
define your transformation window.
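To make the frame of the range window above concrete, here is a minimal pure-Python sketch. It assumes that build_with_range follows the semantics of Spark's Window.rangeBetween, where a row's frame contains the rows of the same partition whose ordering value lies between the current value plus lower_bound and the current value plus upper_bound. That mapping is an assumption, not something this page confirms. The helper `range_window_sum`, the `salary_sum` output name, and the sample rows are hypothetical.

```python
from collections import defaultdict


def range_window_sum(rows, partition_key, order_key, value_key, lower_bound, upper_bound):
    """Sum value_key over a range frame, computed independently for each row."""
    # Group rows by partition so frames never cross partition boundaries.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    result = []
    for row in rows:
        low = row[order_key] + lower_bound
        high = row[order_key] + upper_bound
        # The frame is defined by ordering VALUES, not by row positions.
        total = sum(
            other[value_key]
            for other in partitions[row[partition_key]]
            if low <= other[order_key] <= high
        )
        result.append({**row, "salary_sum": total})
    return result


sample = [
    {"department": "a", "user_id": 1, "salary": 100},
    {"department": "a", "user_id": 3, "salary": 200},
    {"department": "a", "user_id": 7, "salary": 400},
    {"department": "b", "user_id": 2, "salary": 50},
]
windowed = range_window_sum(sample, "department", "user_id", "salary", 0, 4)
```

With lower_bound=0 and upper_bound=4, the row with user_id 1 covers user_id values 1 to 5, so its sum includes the row with user_id 3 but not the one with user_id 7.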
Custom transformation¶
This module provides tools for customizing the dataframe with the main custom methods. Example:
from pyiris.ingestion.transform.transformations.custom.custom import divide
from pyiris.ingestion.transform.transformations.custom_transformation import CustomTransformation
custom_transformation = CustomTransformation(name='middle_price',
                                             description='Dividing two fictitious columns (price/quantity) to generate column middle_price',
                                             method=divide,
                                             to_column='middle_price',
                                             column1='price',
                                             column2='quantity')
transformed_dataframe = custom_transformation.transform(dataframe=extracted_dataset)
Custom transformation - snakecase_column_names¶
This method renames all columns of a given dataframe to snake case.
The transformations applied are:
replacing accented letters and special characters with their plain equivalents (e.g. ‘á’, ‘à’, ‘ã’ or ‘â’ becomes ‘a’);
replacing uppercase letters with an underscore and the lowercase letter;
removing duplicated underscores;
removing leading and trailing underscores;
removing all characters that are not allowed (a-z0-9_).
Example code¶
from pyiris.ingestion.transform.transformations.custom.custom import snakecase_column_names
from pyiris.ingestion.transform.transformations.custom_transformation import CustomTransformation
custom_transformation = CustomTransformation(name="snakecase_column_names",
                                             description="rename columns to snake case",
                                             method=snakecase_column_names)
transformed_dataframe = custom_transformation.transform(dataframe=extracted_dataset)
Example outputs¶
already_snake_case_column_name → already_snake_case_column_name
notSNAKECaseColumnNameOne → not_snake_case_column_name_one
NÕTSnákêCãsèColùmnNãmêTWÕ → not_snake_case_column_name_two
Transform Service¶
The pyiris.ingestion.transform.TransformService class runs one or more transformations in sequence. Example:
from pyiris.ingestion.transform import TransformService, HashTransformation, SqlTransformation
transform_service = TransformService(
    transformations=[
        SqlTransformation(name='divide',
                          description='Getting middle price',
                          to_column="middle_price",
                          sql_expression="price/quantity"),
        HashTransformation(name='Hash CPF',
                           description='Hash CPF to be according to LGPD',
                           from_columns=["seller_cpf"])
    ]
)
transformed_dataframe = transform_service.handler(dataframe=extracted_dataset)
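The idea of running transformations in sequence can be sketched without Spark. The example below is a simplified illustration, not the TransformService implementation: it assumes the steps run in list order, each receiving the output of the previous one, and every name in it (`run_in_sequence`, `add_middle_price`, `mask_seller_cpf`) is hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def add_middle_price(rows: List[Row]) -> List[Row]:
    """Add a middle_price field computed as price / quantity."""
    return [{**row, "middle_price": row["price"] / row["quantity"]} for row in rows]


def mask_seller_cpf(rows: List[Row]) -> List[Row]:
    """Replace seller_cpf with a placeholder (stands in for a real hash step)."""
    return [{**row, "seller_cpf": "***"} for row in rows]


def run_in_sequence(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Apply each step to the output of the previous one, in list order."""
    for step in steps:
        rows = step(rows)
    return rows


rows = [{"price": 10.0, "quantity": 4, "seller_cpf": "123.456.789-09"}]
result = run_in_sequence(rows, [add_middle_price, mask_seller_cpf])
```

Because each step returns new rows instead of mutating its input, the steps can be reordered or reused independently, which is the main appeal of composing transformations as a list.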
For more information on how to use each module, see the code docstrings in the Pyiris modules section on the left panel.